Preface


We are honored and happy to be able to make available this Springer Handbook of
Computational Intelligence, a large and comprehensive account of the state of
the art of the research discipline, complemented with some historical remarks, main
challenges, and perspectives of the future. To follow a predominant tradition, we have
divided this Springer Handbook into parts that correspond to main fields that are meant 
to constitute the area of computational intelligence, that is, fuzzy sets theory and fuzzy 
logic, rough sets, evolutionary computation, neural networks, hybrid approaches and 
systems, all of them complemented with a thorough coverage of some foundational 
issues, methodologies, tools, and techniques. 

We hope that the handbook will serve as an indispensable and useful source of 
information for all readers interested in both the theory and various applications of 
computational intelligence. The format of the Springer Handbook as a convenient
single-volume publication should help potential readers find a proper tool
or technique for solving their problems simply by browsing through clearly
composed and well-indexed contents. The authors of the particular chapters, who are
the best-known specialists in their respective fields worldwide, are the best assurance
that the handbook will serve as an excellent and timely reference.

On behalf of the entire computational intelligence community, we wish to express
sincere thanks, first of all, to the Part Editors responsible for the scope, authors, and
composition of the particular parts for their great job in arranging the most appropriate
topics and their coverage, and in identifying expert authors. Second, we wish to thank all the
authors for their great contributions in the sense of clarity, comprehensiveness, novelty, 
vision, and — above all — understanding of the real needs of readers of diverse interests. 

All these efforts would not have ended in success without the publisher’s total and
multifaceted dedication and support. We wish to thank very much Dr. Werner Skolaut,
Ms. Constanze Ober, and their collaborators from Springer, Heidelberg, and le-tex
publishing GmbH, Leipzig, respectively, for their extremely effective and efficient handling
of this huge and difficult project.


September 2014 
Janusz Kacprzyk Warsaw 
Witold Pedrycz Edmonton 
This Springer Handbook of Computational Intelligence
is the result of a broad project that we
launched to respond to an urgent
need of a wide scientific and scholarly community
for a comprehensive reference source on Computational
Intelligence, a field of science that has
for some decades enjoyed growing popularity
in terms of theory and methodology as
well as numerous applications. As is always the
case in such situations, once an
area has reached maturity and there is, to some
extent, a consensus in the community as to which
paradigms, tools, and techniques may be useful
and promising, the time comes when some


The first and most important question that can be posed 
by many people, notably those who work in more tra- 
ditional and relatively well-defined fields of science, 
is what Computational Intelligence is. There are many 
definitions that try to capture the very essence of that 
field, emphasize different aspects, and — by necessity — 
somehow reflect the individual research interests, pref- 
erences, prospective application areas, etc. 

However, it seems that in recent years there has
been a growing consensus as to what basically
Computational Intelligence is. Let us start with a cita- 
tion coming from the Constitution of the IEEE (Institute 
of Electrical and Electronics Engineers) CIS (Computa- 
tional Intelligence Society) — Article I, Section 5: 


The Field of Interest of the Society is the theory, 
design, application, and development of biolog- 
ically and linguistically motivated computational 
paradigms emphasizing neural networks, connec- 
tionist systems, genetic algorithms, evolutionary 
programming, fuzzy systems, and hybrid intelli- 
gent systems in which these paradigms are con- 
tained. 


state-of-the-art exposition, exemplified by this
Springer Handbook, would be welcome. We think 
that this is the right moment. 


It seems that this is very much to the point, and we
have basically followed this general philosophy in the 
composition of the Springer Handbook. 

Let us first extend a little bit the above essential
and comprehensive description of what computational
intelligence is interested in, what it deals with, which
tools and techniques it uses, etc. Computational In- 
telligence is a broad and diverse collection of nature 
inspired computational methodologies and approaches, 
and tools and techniques that are meant to be used 
to model and solve complex real-world problems in 
various areas of science and technology in which the 
traditional approaches based on strict and well-defined 
tools and techniques, exemplified by hard mathemat- 
ical modeling, optimization, control theory, stochastic 
analyses, etc., are either not feasible or not efficient. 
Of course, the term nature inspired should be understood in
a broader sense of being biologically inspired, socially 
inspired, etc. 

Those complex problems that are of interest to computational
intelligence are often what may be called
ill-posed, which may make their exact solution, using
the traditional hard tools and techniques, impossible.
However, we all know that such problems are quite effectively
and efficiently solved in real life by human
beings or, more generally, by living species. One
can easily come to a conclusion that one should de- 
velop new tools, maybe less precise and not so well 
mathematically founded, that would provide a solution, 
maybe not optimal but good enough. This is exactly 
what Computational Intelligence is meant to provide. 
Briefly speaking, it is most often considered that 
computational intelligence includes as its main com- 
ponents fuzzy logic, (artificial) neural networks, and 


1.1 Details of the Contents 


The afore-mentioned view of the very essence of com- 
putational intelligence has been followed by us when 
dividing the handbook into parts, which correspond to 
the particular fields that constitute the area of Compu- 
tational Intelligence, and also in the selection of field 
editors, and then their selection of proper authors. We 
have been very fortunate to be able to attract as the field 
editors and authors of chapters the best people in the 
respective fields. 


1.1.1 Part A Foundations 


For obvious reasons, we start the Springer Handbook
of Computational Intelligence with Part A, Foundations,
which delivers a constructive survey of some
carefully chosen topics that are of importance for virtually
all ensuing parts of the handbook. This part,
edited by Professors Radko Mesiar and Bernard De
Baets, involves foundational works on multivalued logics,
possibility theory, aggregation functions, measure-based
integrals, the essence of extensions of fuzzy sets,
F-transforms, mathematical programming, and games
under imprecision and fuzziness. It is easy to see that
the contributions cover topics that are of profound
relevance.


1.1.2 Part B Fuzzy Logic 


Part B, Fuzzy Logic, edited by Professors Luis 
Magdalena and Enrique Herrera-Viedma, attempts to 
present the most relevant elements and issues related 
to a vast area of fuzzy logic. First of all, a comprehen- 
sive account of foundations of fuzzy sets theory has 
been provided, emphasizing both theoretical and ap- 
plication oriented aspects. This has been followed by 


evolutionary computation. Of course, these main elements
should be properly understood. For instance, one
should understand that fuzzy logic is to be complemented
by rough set theory or multivalued logic, neural
networks should be understood more generally as including
all kinds of connectionist systems and also learning systems,
and evolutionary computation should be viewed
as the area including swarm intelligence, artificial immune
systems, bacterial algorithms, etc. One can also
add in this context many other approaches like the
Dempster-Shafer theory, chaos theory, etc.


a state-of-the-art presentation of the concept, properties, 
and applications of fuzzy relations, including a brief 
historical perspective and future challenges. A similar 
account of the past, present, and future of an extremely 
important concept of fuzzy implications has then been 
authoritatively covered. 

Then, various issues related to fuzzy systems mod- 
eling have been presented; notably the concept and 
properties of fuzzy-rule-based systems which are the 
core of fuzzy modeling, and the problem of inter- 
pretability of fuzzy systems. Fuzzy clustering, which 
is one of the most widely used tools and techniques, 
encountered in virtually all problems related to data 
analysis, modeling, control, etc., is then exposed, with 
focus on the past, present, and future challenges. Then, 
issues related to Zadeh’s seminal idea of comput- 
ing with words have been thoroughly studied from 
many perspectives, in particular a logical and algebraic 
one. 

Since fuzzy (logic) control is undoubtedly the 
most vigorously reported industrial application of fuzzy 
logic, this subject has been presented in detail, both for 
the conventional fuzzy sets and their extensions, especially
type-2 and interval type-2 fuzzy sets. Applications
of fuzzy logic in autonomous robotics have been pre- 
sented. 

An account of fundamental issues and solutions 
related to the use of fuzzy logic in database and infor- 
mation management has then been given. 


1.1.3 Part C Rough Sets 


Part C, Rough Sets, edited by Professors Roman Słowiński
and Yiyu Yao, starts with a comprehensive,
rigorous, yet readable presentation of foundations of
rough sets, followed by a similar exposition focused
on the use of rough sets for decision making, aiding,
and support. Then, rule induction is considered
as a tool for modeling, decision making, and data 
analysis. 

A number of important extensions of the basic con- 
cept of a rough set have then been presented, including 
the concept of a probabilistic rough set and a general- 
ized rough set, along with a lucid exposition of their 
properties and possible applications. 

A crucial problem of a fuzzy-rough hybridization is 
then discussed, followed by a more general exposition 
of rough systems. 


1.1.4 Part D Neural Networks 


Part D, Neural Networks, edited by Professors Cesare 
Alippi and Marios Polycarpou, starts with a general pre- 
sentation of artificial neural network models, followed 
by presentations of some more specific types exemplified
by deep and modular neural networks.

Much attention in this part is devoted to machine 
learning, starting from a very general overview of 
the area and main tools and techniques employed, 
theoretical methods in machine learning, probabilistic 
modeling in machine learning, kernel methods in ma- 
chine learning, etc. 

An important problem area called neurodynamics 
is the subject of a next state-of-the-art survey, followed 
by a review of basic aspects, models, and challenges of 
computational neuroscience considered basically from 
the point of view of biophysical modeling of neural sys- 
tems. Cognitive architectures, notably for agent-based 
systems, and a related problem of computational mod- 
els of cognitive and motor control have then been 
presented in much detail. 

Advanced issues involved in the so-called embod- 
ied intelligence, and neuroengineering, and neuromor- 
phic engineering, emerging as promising paradigms for 
modeling and problem solving, have then been dealt 
with. 

Evolving connectionist systems, which constitute
novel, very promising architectures for the modeling
of various processes and systems, have been surveyed.
A survey of real-world applications of machine
learning completes this important part.


1.1.5 Part E Evolutionary Computation 


Part E, Evolutionary Computation, edited by Professors
Frank Neumann, Carsten Witt, Peter Merz,
Carlos A. Coello Coello, Oliver Schütze, Thomas Bartz-Beielstein,
Jörn Mehnen, and Günther Raidl, concerns
the third fundamental element of what is traditionally
considered to be the core of Computational
Intelligence.

First, comprehensive surveys of genetic algorithms,
genetic programming, evolution strategies, and parallel
evolutionary algorithms are presented, which are readable
and constructive so that a large audience might find
them useful and, to some extent, ready to use. Some
more general topics like the estimation of distribution 
algorithms, indicator-based selection, etc., are also dis- 
cussed. 

An important problem, from a theoretical and prac- 
tical point of view, of learning classifier systems is 
presented in depth. 

Multiobjective evolutionary algorithms, which constitute
one of the most important groups, both from the
theoretical and applied points of view, are discussed in 
detail, followed by an account of parallel multiobjec- 
tive evolutionary algorithms, and then a more general 
analysis of many multiobjective problems. 

Considerable attention has also been paid to a pre- 
sentation of hybrid evolutionary algorithms, such as 
memetic algorithms, which have emerged as a very 
promising tool for solving many real-world problems in 
a multitude of areas of science and technology. More- 
over, parallel evolutionary combinatorial optimization 
has been presented. 

Search operators, which are crucial in all kinds of
evolutionary algorithms, have been carefully analyzed.
This is followed by a thorough treatment of
various issues involved in stochastic local search
algorithms.

An interesting survey of various technological and 
industrial applications in mechanical engineering and 
design has been provided. Then, an account of the use 
of evolutionary combinatorial optimization in bioinfor- 
matics is given. 

An analysis of a synergistic integration of meta- 
heuristics, notably evolutionary computation, and con- 
straint satisfaction, constraint programming, graph col- 
oring, tree decomposition, and similar relevant prob- 
lems completes the part. 


1.1.6 Part F Swarm Intelligence 


Part F, Swarm Intelligence, edited by Professors Christian
Blum and Roderich Gross, starts with a concise
yet comprehensive introduction to swarm intelligence
in optimization and robotics, two fields of
science in which this type of metaheuristic has been
considered, and demonstrated to be powerful and
useful.

Then, a preference-based multiobjective particle 
swarm optimization model is covered as a good tool 
for airfoil design. An ant colony optimization model
for the minimum-weight rooted arborescence problem,
which may be a good model for many diverse problems
in computer science, decision analysis, data analysis,
etc., is discussed.

An intelligent swarm of Markovian agents is 
the topic of a thorough analysis, which highlights 
the power and universality of this model. More- 
over, a probabilistic modeling of swarm systems is 
surveyed. 

Then some explicitly nature inspired algorithms 
based on how some species behave are presented, 
notably, a honey bee social foraging algorithm for 
resource allocation. Collective behavior modes and re- 
configurability are discussed in swarm robotics, com- 
plemented by an exposition of problems and so- 
lutions related to the collective manipulation and 
construction. 


1.1.7 Part G Hybrid Systems 


Part G, Hybrid Systems, edited by Professors Oscar 
Castillo and Patricia Melin, starts with papers on var- 
ious types of controllers developed with the aid of 
computational intelligence tools and techniques that are 
employed in a highly synergistic way. 

First, an interesting and visionary study of a robust
evolving cloud-based controller, which combines new
conceptual ideas with a novel computing architecture, is
provided. Then evolving embedded fuzzy controllers as
well as the bio-inspired optimization of type-2 fuzzy 
controllers are surveyed. New hybrid modeling and so- 
lution tools are then presented; notably multiobjective 
genetic fuzzy systems. The use of modular neural net- 
works and type-2 fuzzy logic is shown to be effective 
and efficient in various pattern recognition problems. 

A novel idea of using chemical algorithms for the 
optimization of interval type-2 and type-1 fuzzy con- 
trollers for autonomous mobile robots is shown. 

Finally, the implementation of bio-inspired optimization
methods on graphics processing units is presented
and its efficiency is emphasized.


1.2 Conclusions and Acknowledgments 


To summarize, the coverage of topics in the particular 
parts has certainly provided a comprehensive, rigorous 
yet readable state-of-the-art survey of main research 
directions, developments and challenges in Computa- 
tional Intelligence. 


Both more advanced readers, who may look for
details, and novice readers, who may look for some
more general and readable introduction that could be
employed in their later works, will certainly find this
handbook useful.


2. Many-Valued and Fuzzy Logics


Siegfried Gottwald 


In this chapter, we consider particular classes 
of infinite-valued propositional logics which are 
strongly related to t-norms as conjunction con- 
nectives and to the real unit interval as a set of 
their truth degrees, and which have their impli- 
cation connectives determined via an adjointness 
condition. 

Such systems have in the last 10 years been of
considerable interest, and the topic of important
results. They generalize well-known systems of
infinite-valued logic, and form a link to areas as
different as, e.g., linear logic and fuzzy set theory.

We survey the most important ones of these 
systems, always explaining suitable algebraic se- 
mantics and adequate formal calculi, but also 
mentioning complexity issues. 

Finally, we mention a type of extension which
allows for graded notions of provability and
entailment.

Classical two-valued logic is characterized by two basic
principles. The principle of extensionality states that
the truth value of any compound sentence depends only
on the truth values of the components. The principle
of bivalence, also known as tertium non datur, states
that any sentence is either true or false; nothing else
is possible. Intuitively, a sentence is understood here
as a formulation which has a truth value, i.e., which
is true or false. In everyday language this excludes
formulations like questions, requests, and commands.
Nevertheless, this explanation sounds like a kind of circular
formulation. So formally one first fixes a certain
formalized language, and then lays down formal criteria
of what should count as a well-formed formula
of this language, and particularly what should count as
a sentence.

The principle of bivalence hence excludes self-contradictory
formulations which are true as well as
false, and it also excludes so-called truth value gaps,
i.e., formulations which are neither true nor false.

Based on the understanding that a sentence is a formulation
which has a truth value, many-valued logic
generalizes the understanding of what a truth value is
and hence allows for more values than only the two classical
values true and false. To indicate this generalization,
we speak here of truth degrees in the case of many-valued
logics. And this allows the additional convention
to have the name truth value reserved for the values true
and false in their standard understanding in the sense of
classical logic.

There are many possibilities to choose particular 
sets of truth degrees. Thus there are quite different 
systems of many-valued logic. However, each partic- 


2.1 Basic Many-Valued Logics 


If one looks systematically for many-valued logics 
which have been designed for quite different applica- 
tions, one finds four main types of systems: 


The Łukasiewicz logics $L_\kappa$ as explained in [2.1];
The Gödel logics $G_\kappa$ from [2.2];
The product logic $\Pi$ studied in [2.3];
The Post logics $P_m$ for $2 \le m \in \mathbb{N}$ from [2.4].


The first two types of many-valued logics each offer
a uniformly defined family of systems which differ
in their sets of truth degrees and comprise finitely valued
logics for each one of the truth degree sets
$W_n = \{0, \frac{1}{n-1}, \ldots, 1\}$, $n \ge 2$, together with an
infinite-valued system with truth degree set $W_\infty = [0, 1]$.
Common reference to the finite-valued and the infinite-valued
cases is formally indicated by choosing
$\kappa \in \{n \in \mathbb{N} \mid n \ge 2\} \cup \{\infty\}$ as an index.

For the fourth type an infinite-valued version is 
lacking. 

In their original presentations, these logics look 
rather different, regarding their propositional parts. For 
the first-order extensions, however, there is a unique 
strategy: one adds a universal and an existential quan- 
tifier such that quantified formulas get, respectively, as 
their truth degrees the infimum and the supremum of all 
the particular cases in the range of the quantifiers. 

As a reference for these and also other many-valued
logics in general, the reader may consult [2.5].


ular one of these systems L has a fixed set of truth 
degrees $W^{L}$. Furthermore, each such system has its
set $D^{L} \subseteq W^{L}$ of designated truth degrees: formulas
of the corresponding formalized language are logically
valid iff they always have a designated truth
degree.

Instead of the principle of bivalence each system $L$
of many-valued logic satisfies a principle of multivalence
in the sense that any sentence has to have exactly one
truth degree out of $W^{L}$. And the principle of extensionality
now states that the truth degree of any compound
sentence depends only on the truth degrees of the components.

Fuzzy logics are particular infinite-valued logics 
which have, at least in their most simple forms, the real 
unit interval [0, 1] as their truth degree sets, and which 
have the degree 1 as their only designated truth degree. 


Our primary interest here is in the infinite-valued
versions of these logics. These ones have the closest
connections to the fuzzy logics discussed later on.
Therefore, we further on write simply $L$ instead of $L_\infty$,
and $G$ instead of $G_\infty$.

For simplicity of notation, later on we often will use 
the same symbol for a connective and its truth degree 
function. It shall always become clear from the context 
what actually is meant. 


2.1.1 The Gödel Logics 


The simplest ones of these logics are the Gödel logics
$G_\kappa$ which have a conjunction $\wedge$ and a disjunction $\vee$ defined
by the minimum and the maximum, respectively,
of the truth degrees of the constituents

$$u \wedge v = \min\{u, v\}, \qquad u \vee v = \max\{u, v\}. \tag{2.1}$$

These Gödel logics have also a negation $\sim$ and an
implication $\to_G$ defined by the truth degree functions

$$\sim u = \begin{cases} 1, & \text{if } u = 0; \\ 0, & \text{if } u > 0, \end{cases} \qquad u \to_G v = \begin{cases} 1, & \text{if } u \le v; \\ v, & \text{if } u > v. \end{cases} \tag{2.2}$$
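The truth degree functions (2.1) and (2.2) are simple enough to try out directly. The following is a minimal sketch, not taken from the chapter: the function names (`truth_degrees`, `g_and`, `g_or`, `g_neg`, `g_imp`) are illustrative choices, and exact fractions are used to avoid rounding. It checks on the five-element set $W_5$ that the connectives stay inside the truth degree set and that the implication is tied to the conjunction by the adjointness condition mentioned at the start of the chapter.

```python
from fractions import Fraction
from itertools import product


def truth_degrees(n):
    """Return W_n = {0, 1/(n-1), ..., 1} as exact fractions (n >= 2)."""
    return [Fraction(k, n - 1) for k in range(n)]


def g_and(u, v):
    """Goedel conjunction: minimum of the truth degrees."""
    return min(u, v)


def g_or(u, v):
    """Goedel disjunction: maximum of the truth degrees."""
    return max(u, v)


def g_neg(u):
    """Goedel negation: 1 for u = 0, otherwise 0."""
    return Fraction(1) if u == 0 else Fraction(0)


def g_imp(u, v):
    """Goedel implication: 1 if u <= v, otherwise v."""
    return Fraction(1) if u <= v else v


W5 = truth_degrees(5)
pairs = list(product(W5, repeat=2))

# Every binary connective maps W_5 x W_5 back into W_5.
closed = all(op(u, v) in W5 for op in (g_and, g_or, g_imp) for u, v in pairs)

# Adjointness: min(w, u) <= v holds exactly when w <= (u ->_G v).
adjoint = all(
    (g_and(w, u) <= v) == (w <= g_imp(u, v))
    for w in W5
    for u, v in pairs
)
```

Both `closed` and `adjoint` evaluate to `True` here; the same checks can be repeated for other values of `n`.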


The systems differ in their truth degree sets: for each
$2 \le \kappa \le \infty$ the truth degree set of $G_\kappa$ is $W_\kappa$.

As shown by Dummett [2.6], the infinite-valued
propositional Gödel logic $G$ has an adequate axiomatization
which is provided by an adequate axiomatization
of the intuitionistic propositional logic enriched with
the additional axiom schema

$$(\varphi \to_G \psi) \vee (\psi \to_G \varphi). \tag{2.3}$$


Later on, in Sect. 2.5, we will recognize another 
axiomatization because G is a particular t-norm-based 
logic. 
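As a brief worked check, added here for illustration, one can read off from (2.1) and (2.2) why every instance of the schema (2.3) takes the designated degree 1 in the standard semantics. For truth degrees $u, v \in [0, 1]$ of $\varphi$ and $\psi$:

```latex
\begin{align*}
u \le v &\;\Longrightarrow\; u \to_G v = 1
        \;\Longrightarrow\; \max\{u \to_G v,\; v \to_G u\} = 1,\\
u > v   &\;\Longrightarrow\; v \to_G u = 1
        \;\Longrightarrow\; \max\{u \to_G v,\; v \to_G u\} = 1.
\end{align*}
```

Since the truth degrees are linearly ordered, one of the two cases always applies. By contrast, the law of excluded middle is not valid: for $u = \tfrac{1}{2}$ one gets $u \vee \sim u = \max\{\tfrac{1}{2}, 0\} = \tfrac{1}{2}$.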


2.1.2 The Łukasiewicz Logics


The Łukasiewicz logics $L_\kappa$, again with $2 \le \kappa \le \infty$, have
originally been designed in [2.1] with only two primitive
connectives, an implication $\to_L$ and a negation $\neg$
characterized by the truth degree functions

$$\neg u = 1 - u, \qquad u \to_L v = \min\{1, 1 - u + v\}. \tag{2.4}$$

The systems differ in their truth degree sets: for each
$2 \le \kappa \le \infty$ the truth degree set of $L_\kappa$ is $W_\kappa$.

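However, it is possible to define further connectives
from these primitive ones. With

$$\varphi \,\&\, \psi =_{\mathrm{df}} \neg(\varphi \to_L \neg\psi), \qquad \varphi \veebar \psi =_{\mathrm{df}} \neg\varphi \to_L \psi \tag{2.5}$$

one gets a (strong) conjunction and a (strong) disjunction
with truth degree functions

$$u \,\&\, v = \max\{u + v - 1, 0\}, \qquad u \veebar v = \min\{u + v, 1\}, \tag{2.6}$$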


usually called the Łukasiewicz (arithmetical) conjunction
and the Łukasiewicz (arithmetical) disjunction. It
should be mentioned that these connectives are linked
together via a De Morgan’s law using the standard negation
of this system

$$\neg(u \,\&\, v) = \neg u \veebar \neg v. \tag{2.7}$$

With the additional definitions

$$\varphi \wedge \psi =_{\mathrm{df}} \varphi \,\&\, (\varphi \to_L \psi), \qquad \varphi \vee \psi =_{\mathrm{df}} (\varphi \to_L \psi) \to_L \psi \tag{2.8}$$

one gets another (weak) conjunction $\wedge$ with truth degree
function min, and a further (weak) disjunction $\vee$
with max as truth degree function, i.e., one has the conjunction
and the disjunction of the Gödel logics also
available.
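The derived connectives can be checked mechanically against their stated truth degree functions. The sketch below is illustrative only (all function names are hypothetical): it builds the strong and weak connectives from the two primitives exactly as in the definitions above, and then compares them on the finite grid $W_{11}$ with the closed forms $\max\{u+v-1, 0\}$, $\min\{u+v, 1\}$, min, and max, and with De Morgan's law.

```python
from fractions import Fraction
from itertools import product

ONE = Fraction(1)
ZERO = Fraction(0)


def l_neg(u):
    """Lukasiewicz negation: 1 - u."""
    return ONE - u


def l_imp(u, v):
    """Lukasiewicz implication: min{1, 1 - u + v}."""
    return min(ONE, ONE - u + v)


# Derived connectives, built only from the two primitives.
def strong_and(u, v):
    return l_neg(l_imp(u, l_neg(v)))


def strong_or(u, v):
    return l_imp(l_neg(u), v)


def weak_and(u, v):
    return strong_and(u, l_imp(u, v))


def weak_or(u, v):
    return l_imp(l_imp(u, v), v)


W11 = [Fraction(k, 10) for k in range(11)]
pairs = list(product(W11, repeat=2))

closed_forms_ok = all(
    strong_and(u, v) == max(u + v - 1, ZERO)
    and strong_or(u, v) == min(u + v, ONE)
    for u, v in pairs
)
de_morgan_ok = all(
    l_neg(strong_and(u, v)) == strong_or(l_neg(u), l_neg(v)) for u, v in pairs
)
weak_ok = all(
    weak_and(u, v) == min(u, v) and weak_or(u, v) == max(u, v) for u, v in pairs
)
```

All three flags come out `True` on this grid. This is a finite sanity check rather than a proof, although the case distinction $u \le v$ versus $u > v$ gives the general argument in a few lines.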

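The infinite-valued propositional Łukasiewicz logic
$L$, with implication and negation as primitive connectives,
has an adequate axiomatization consisting of the
axiom schemata:

($L_\infty$1) $\varphi \to_L (\psi \to_L \varphi)$,
($L_\infty$2) $(\varphi \to_L \psi) \to_L ((\psi \to_L \chi) \to_L (\varphi \to_L \chi))$,
($L_\infty$3) $(\neg\psi \to_L \neg\varphi) \to_L (\varphi \to_L \psi)$,
($L_\infty$4) $((\varphi \to_L \psi) \to_L \psi) \to_L ((\psi \to_L \varphi) \to_L \varphi)$

together with the rule of detachment as the only inference
rule.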
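Soundness of this axiomatization can be illustrated numerically. The following sketch assumes the four schemata in their usual form (self-implication weakening, transitivity, contraposition, and commutativity of the weak disjunction) and uses hypothetical names throughout. It evaluates each schema over all triples from $W_7$ and confirms that the value is always 1, and that detachment cannot lead from degree 1 to a smaller degree. It is a spot check on a finite grid, not a proof.

```python
from fractions import Fraction
from itertools import product

ONE = Fraction(1)


def neg(u):
    """Lukasiewicz negation."""
    return ONE - u


def imp(u, v):
    """Lukasiewicz implication."""
    return min(ONE, ONE - u + v)


# The four axiom schemata as truth degree functions of p, q, r.
AXIOMS = {
    "L1": lambda p, q, r: imp(p, imp(q, p)),
    "L2": lambda p, q, r: imp(imp(p, q), imp(imp(q, r), imp(p, r))),
    "L3": lambda p, q, r: imp(imp(neg(q), neg(p)), imp(p, q)),
    "L4": lambda p, q, r: imp(imp(imp(p, q), q), imp(imp(q, p), p)),
}

W7 = [Fraction(k, 6) for k in range(7)]

always_one = {
    name: all(axiom(p, q, r) == ONE for p, q, r in product(W7, repeat=3))
    for name, axiom in AXIOMS.items()
}

# Detachment: if u and (u -> v) both have degree 1, then so does v.
detachment_sound = all(
    v == ONE
    for u, v in product(W7, repeat=2)
    if u == ONE and imp(u, v) == ONE
)

# A classical tautology that is not valid here: p or not-p at p = 1/2.
half = Fraction(1, 2)
excluded_middle_at_half = max(half, neg(half))
```

Every entry of `always_one` is `True`, while `excluded_middle_at_half` stays below 1, which shows that the weak excluded middle is not a tautology of these systems.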

Later on, in Sect. 2.5, we will recognize another 
axiomatization because L is a particular t-norm-based 
logic. 


2.1.3 The Product Logic 


The product logic $\Pi$, in detail explained in [2.3], has the
real unit interval as truth degree set, has a fundamental
conjunction $\odot$ with the ordinary product of reals as its
truth degree function, and has an implication $\to_\Pi$ with
the truth degree function

$$u \to_\Pi v = \begin{cases} 1, & \text{if } u \le v; \\ \dfrac{v}{u}, & \text{if } u > v. \end{cases} \tag{2.9}$$

Additionally, it has a truth degree constant $0$ to denote
the truth degree zero.

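In this context, a negation and a further conjunction
are defined as

$$\sim\varphi =_{\mathrm{df}} \varphi \to_\Pi 0, \qquad \varphi \wedge \psi =_{\mathrm{df}} \varphi \odot (\varphi \to_\Pi \psi). \tag{2.10}$$

Routine calculations show that both connectives coincide
with the corresponding ones of the infinite-valued
Gödel logic $G$. And also the disjunction $\vee$ of this Gödel
logic becomes available, now via the definition

$$\varphi \vee \psi =_{\mathrm{df}} ((\varphi \to_\Pi \psi) \to_\Pi \psi) \wedge ((\psi \to_\Pi \varphi) \to_\Pi \varphi). \tag{2.11}$$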
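The routine calculations referred to above can be spot-checked with exact arithmetic. In the sketch below (function names are hypothetical), the product implication is taken to be 1 for $u \le v$ and $v/u$ otherwise, and the negation, the further conjunction, and the disjunction are built from it and from the product conjunction as in the definitions above. The results are then compared with the Gödel negation, min, and max on a grid of rational points in $[0, 1]$.

```python
from fractions import Fraction
from itertools import product

ZERO = Fraction(0)
ONE = Fraction(1)


def p_and(u, v):
    """Fundamental product conjunction."""
    return u * v


def p_imp(u, v):
    """Product implication: 1 if u <= v, otherwise v / u."""
    return ONE if u <= v else v / u


def p_neg(u):
    """Defined negation: u -> 0."""
    return p_imp(u, ZERO)


def weak_and(u, v):
    """Defined conjunction: u (.) (u -> v)."""
    return p_and(u, p_imp(u, v))


def weak_or(u, v):
    """Defined disjunction: ((u -> v) -> v) and ((v -> u) -> u)."""
    return weak_and(p_imp(p_imp(u, v), v), p_imp(p_imp(v, u), u))


grid = [Fraction(k, 12) for k in range(13)]
pairs = list(product(grid, repeat=2))

neg_is_goedel = all(p_neg(u) == (ONE if u == 0 else ZERO) for u in grid)
and_is_min = all(weak_and(u, v) == min(u, v) for u, v in pairs)
or_is_max = all(weak_or(u, v) == max(u, v) for u, v in pairs)
```

The three flags are `True` on this grid. Using `Fraction` keeps the quotients $v/u$ exact, so the comparisons are not affected by floating-point rounding.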


There is, however, no natural way to combine with
this (infinite-valued) product logic a whole family of
finite-valued systems by simply restricting the set of
truth degrees to some $W_n$ as in the previous two cases:
besides $W_2$ no such set is closed under the ordinary
product, and for $W_2$ the product coincides, e.g., with
the minimum operation.
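The closure argument is easy to confirm by enumeration. This small sketch (with hypothetical helper names) tests, for each $n$ from 2 to 20, whether the set $W_n = \{0, \frac{1}{n-1}, \ldots, 1\}$ is closed under the ordinary product, and checks that on the two-element set the product agrees with the minimum.

```python
from fractions import Fraction


def truth_degrees(n):
    """Return W_n = {0, 1/(n-1), ..., 1} as exact fractions (n >= 2)."""
    return [Fraction(k, n - 1) for k in range(n)]


def closed_under_product(n):
    """True if u * v stays in W_n for all u, v in W_n."""
    degrees = set(truth_degrees(n))
    return all(u * v in degrees for u in degrees for v in degrees)


closed_ns = [n for n in range(2, 21) if closed_under_product(n)]

W2 = truth_degrees(2)
product_is_min_on_W2 = all(u * v == min(u, v) for u in W2 for v in W2)
```

Only `n = 2` survives in `closed_ns`. For any larger `n`, the square of the smallest positive degree, $1/(n-1)^2$, lies strictly between 0 and $1/(n-1)$ and therefore falls outside $W_n$.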

Later on, in Sect. 2.4, we will recognize an adequate
axiomatization because also the product logic $\Pi$
is a particular t-norm-based logic. Contrary to the previous
cases of $G$ and $L$, however, no axiomatization
essentially different from this later one is known.


2.1.4 The Post Logics 


The Post system $P_m$ for $m \ge 2$ has truth degree set $W_m$.
These propositional systems have been originally formulated
uniformly in negation and disjunction as basic
connectives with the following truth degree functions

$$\neg_P u = \begin{cases} 1, & \text{for } u = 0, \\ u - \dfrac{1}{m-1}, & \text{for } u \ne 0, \end{cases} \qquad u \vee v = \max\{u, v\}.$$


Contrary to the previous systems, the definition of nega- 
tion here does not seem to be given in a uniform way 
independent of the number of truth degrees. However, 
it is always just a cyclic permutation of all the truth de- 
grees (in their natural order). 
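The cyclic behavior of the Post negation can be made visible by iterating it. The sketch below assumes the negation maps 0 to 1 and every other degree $u$ to $u - \frac{1}{m-1}$, i.e., one step down in the natural order of $W_m$; the names `post_neg` and `orbit` are illustrative.

```python
from fractions import Fraction


def truth_degrees(m):
    """Return W_m = {0, 1/(m-1), ..., 1} as exact fractions (m >= 2)."""
    return [Fraction(k, m - 1) for k in range(m)]


def post_neg(u, m):
    """Post negation on W_m: 0 goes to 1, any other u goes one step down."""
    return Fraction(1) if u == 0 else u - Fraction(1, m - 1)


def orbit(start, m):
    """Degrees visited by iterating the negation until it returns to start."""
    visited = [start]
    u = post_neg(start, m)
    while u != start:
        visited.append(u)
        u = post_neg(u, m)
    return visited


cycle_5 = orbit(Fraction(1), 5)

# For each m, a single cycle starting at 1 visits every truth degree once.
single_cycle = all(
    sorted(orbit(Fraction(1), m)) == truth_degrees(m) for m in range(2, 10)
)
```

For `m = 5` the orbit runs through 1, 3/4, 1/2, 1/4, 0 and then returns to 1, so the negation is one cycle through all five degrees. The definition depends on `m` only through the step size $\frac{1}{m-1}$.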

For the sets of designated truth degrees, a canonical
choice does not exist; already Post [2.4] has discussed
the possibility that truth degrees different from 1
may be chosen as designated ones. Nevertheless,
$D^{P} = \{1\}$ is a kind of standard choice.

The set of basic connectives of each one of the Post
systems $P_m$ is functionally complete, i.e., allows one to
represent every possible truth degree function (over $W_m$).
Therefore, each one of the Post systems $P_m$, with
$D^{P} = \{1\}$ as the set of designated truth degrees, covers its
corresponding Łukasiewicz system with the same set of
truth degrees, in the sense that the set of $L_m$-tautologies
is a subset of the set of $P_m$-tautologies, and that this
set of $P_m$-tautologies does not contain any formula $\varphi$
whose Łukasiewicz negation $\neg\varphi$ is $L_m$-satisfiable. And
the same holds true for the corresponding $m$-valued
Gödel system $G_m$.

If one enriches all the finitely many-valued (propositional)
Łukasiewicz systems $L_m$ with truth degree
constants for all their truth degrees, then these enriched
systems $L_m^*$ become functionally complete. And this
means that the extended $m$-valued Łukasiewicz systems
$L_m^*$ and the $m$-valued Post logics become interdefinable
(for each fixed number $m$ of truth degrees). Hence there
is in principle no essential difference between both
types of (finitely valued) systems: all that can be expressed
in the Post world can also be expressed in the
(extended) Łukasiewicz world, and vice versa.

We omit a discussion of adequate axiomatizations because
these Post logics will not be of particular interest
later on in this chapter. The interested reader might consult
[2.5].


2.1.5 Algebraic Semantics 


All these previously discussed many-valued logics have 
been introduced by their standard semantics. 

Besides these standard semantics, all these many- 
valued logics have also algebraic semantics determined 
by suitable classes K of truth degree structures. The 
situation is similar here to the case of classical logic: 
the logically valid formulas in classical logic are also 
just all those formulas which are valid in all Boolean 
algebras. 

Of course, these structures have the same signature
as the language $\mathcal{L}$ of the corresponding logic, and they
have to have, in the case that one discusses the corresponding
first-order logics, suprema and infima for
all those subsets which may appear as value sets of formulas.
Particularly, hence, they have to be (partially)
ordered, or at least preordered.

For each formula $\varphi$ of the language $\mathcal{L}$ of the corresponding
logic, for each such (generalized truth degree)
structure $A$, and for each evaluation $e$ which maps the
set of atomic formulas of $\mathcal{L}$ into the carrier of $A$, one
has to define a value $\mathrm{Val}(\varphi, e)$, and finally one has to
define what it means that such a formula $\varphi$ is valid in
$A$. Then a formula $\varphi$ is logically valid w.r.t. this class
$K$ iff $\varphi$ is valid in all structures from $K$.
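For a finite propositional structure, the notions of evaluation, value, and validity can be sketched in a few lines. Everything in the following example is an illustrative assumption rather than the chapter's formal definition: formulas are encoded as nested tuples, a structure is given by its carrier and a dictionary of operations, and a formula counts as valid in the structure when its value is designated under every evaluation of its atoms. The three-element Gödel structure on $W_3$ with designated degree 1 serves as the example.

```python
from fractions import Fraction
from itertools import product


def val(formula, e, ops):
    """Value of a formula under evaluation e (a dict from atoms to degrees).

    Atoms are strings; compound formulas are tuples (connective, arg, ...).
    """
    if isinstance(formula, str):
        return e[formula]
    connective, *args = formula
    return ops[connective](*(val(arg, e, ops) for arg in args))


def atoms(formula):
    """Set of atomic formulas occurring in a formula."""
    if isinstance(formula, str):
        return {formula}
    return set().union(*(atoms(arg) for arg in formula[1:]))


def is_valid(formula, carrier, ops, designated):
    """True if the value is designated under every evaluation of the atoms."""
    names = sorted(atoms(formula))
    return all(
        val(formula, dict(zip(names, degrees)), ops) in designated
        for degrees in product(carrier, repeat=len(names))
    )


# Example structure: Goedel operations on W_3 = {0, 1/2, 1}.
W3 = [Fraction(0), Fraction(1, 2), Fraction(1)]
goedel_ops = {
    "and": min,
    "or": max,
    "imp": lambda u, v: Fraction(1) if u <= v else v,
    "neg": lambda u: Fraction(1) if u == 0 else Fraction(0),
}
designated = {Fraction(1)}

prelinearity = ("or", ("imp", "p", "q"), ("imp", "q", "p"))
excluded_middle = ("or", "p", ("neg", "p"))

prelinearity_valid = is_valid(prelinearity, W3, goedel_ops, designated)
excluded_middle_valid = is_valid(excluded_middle, W3, goedel_ops, designated)
```

With these definitions `prelinearity_valid` is `True` and `excluded_middle_valid` is `False`. Logical validity with respect to a class of structures would repeat the `is_valid` check for each structure in the class, which is feasible by enumeration only when the class and its carriers are finite.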


Gödel and Łukasiewicz Logics
It is remarkable that for both these types of many- 
valued logics corresponding algebraic semantics have 
mainly been developed for the infinite-valued systems, 
and have been considered in the context of complete- 
ness proofs. 

For the infinite-valued Gödel logic $G$ such a class of
structures is, according to the completeness proof given
by Dummett [2.6], the class of all Heyting algebras, i.e.,
of all relatively pseudo-complemented lattices, which
satisfy the prelinearity condition

$$(u \rightsquigarrow v) \sqcup (v \rightsquigarrow u) = 1. \tag{2.12}$$

Here $\sqcup$ is the lattice join and $\rightsquigarrow$ the relative pseudo-complement.

For the infinite-valued Łukasiewicz logic $L$ the corresponding
class of structures is the class of all MV-algebras,
first introduced again within a completeness
proof by Chang [2.7], and more recently extensively
studied in [2.8].

It is interesting to recognize that all these struc- 
tures — prelinear Heyting algebras, MV-algebras, and 
product algebras — are Abelian lattice-ordered semi- 
groups with an additional residuation operation. 

For the finite-valued logics from both families, separately
developed algebraic semantics have not yet attracted
considerable interest.


Product Logic 
The product logic, as introduced in [2.3], was from the 
very beginning designed as a logic which had, in par- 
allel, a standard semantics — provided by the real unit 
interval and by a product-based conjunction as a funda- 
mental connective — as well as an algebraic semantics, 
formed by the class of all product algebras — introduced 
in [2.3] again within a completeness proof. 

We shall not explain more details here because this 
whole approach proved to become paradigmatic for 
the development of t-norm-based infinite-valued logics, 
a topic which shall be discussed later on, starting with 
Sect. 2.3. 


Post Logics 
Contrary to the situation for the Łukasiewicz and the
Gödel systems, for the Post systems in their original
form there exist only very few syntactically oriented
studies toward constituting or investigating logical
calculi for these systems. Instead, for the Post systems one
was mainly interested in the corresponding algebraic
structures, which were suitable to form an algebraic
semantics, and investigated such structures earlier, and in
more detail, than similar structures for the Łukasiewicz
and the Gödel systems. Rosenbloom in a paper [2.9] of
1942 was the first one to do this. His algebraic structures
shall here be called P-algebras for short, but are not
considered in detail: the interested reader might, e.g.,
consult [2.5].

One of the main reasons for the difficulty and complexity
of the defining conditions of P-algebras is the fact that the
Post systems as well as the P-algebras have only two primitive
notions, their connectives resp. their basic operations, but
have maximal expressive power in the sense of being
functionally complete. That this choice of the primitive
notions really is the main obstacle toward a simplification
became clear when Epstein [2.10] in 1960 changed these basic
operations and found a much simpler class of definitionally
equivalent algebras, now called Post algebras.

What is not covered by these basic considerations are possible
infinite-valued generalizations of these logical calculi, or
of these Post algebras. Approaches toward this problem started,
e.g., with papers on generalizations of the notion of Post
algebras like [2.11-13]. The most influential paper, however,
which also discussed the corresponding logical systems, was the
paper [2.14] of Rasiowa in which Post algebras of the order
$\omega + 1$ and the corresponding systems of infinitely
many-valued (first-order) logic have been introduced. The
algebraic theory of these Post algebras of the order
$\omega + 1$ is partly given in [2.14].

Another such infinitely many-valued generalization of the
standard Post systems is discussed, e.g., in [2.15, 16]: Post
algebras of the order $\omega + \omega^*$.

The Post algebras of finite or infinite order and the systems
of many-valued logic related with them seem to be of particular
importance for investigations in computer science which rely on
many-valued logic as a toolbox, because these Post systems are
functionally complete and well suited to study the
representability of truth degree functions on the basis of some
predetermined set of basic truth degree functions, as
determined, e.g., by available electronic components,
cf. [2.17] for a still good introduction.


2.2 Fuzzy Sets


A fuzzy set $A$ is usually a fuzzy subset of a given set $X$
and characterized by its membership function
$\mu_A : X \to [0, 1]$. The set $X$ is often called the universe
of discourse. This notation derives from [2.18]. So these
fuzzy sets are (possibly) first-level objects of a cumulative
hierarchy, with the elements of $X$ as urelements. But the
usual applications do not need higher level fuzzy sets. And
also for our discussion of the background logic such higher
level fuzzy sets do not matter.


2.2.1 Set Algebra for Fuzzy Sets 


Mathematically, it is customary to identify such a fuzzy
subset of $X$ with its membership function. Accordingly,
$\mu_A(a)$ and $A(a)$ both are used to denote the membership
degree of the object $a \in X$ w.r.t. the fuzzy set $A$.
For any binary operation $*$ between membership degrees the
pointwise approach means to define from it a binary operation
$\circledast$ for fuzzy sets such that the fuzzy set
$A \circledast B$ is characterized by


$$(A \circledast B)(x) = A(x) * B(x) \quad \text{for all } x \in X. \tag{2.13}$$


Hence, Zadeh's standard intersection $A \cap B$ and union
$A \cup B$ are characterized by


$$\begin{aligned}
(A \cap B)(x) &= \min\{A(x), B(x)\}, \\
(A \cup B)(x) &= \max\{A(x), B(x)\}.
\end{aligned} \tag{2.14}$$


Additionally, again following the first proposal 
from [2.18], one usually also defines the complement 
$C_X A$ of a fuzzy set $A$ by the condition


$$C_X A(x) = 1 - A(x) \quad \text{for all } x \in X. \tag{2.15}$$


However, in [2.18] also other versions of such binary 
operations had been mentioned: an algebraic product 
AB and an algebraic sum A + B defined through 


$$\begin{aligned}
AB(x) &= A(x) \cdot B(x), \\
(A + B)(x) &= \min\{A(x) + B(x), 1\}
\end{aligned} \tag{2.16}$$


as well as an absolute difference $|A - B|$ defined by

$$|A - B|(x) = |A(x) - B(x)|. \tag{2.17}$$


It is interesting to notice, and shall be explained in
more detail in Sect. 2.2.2, that the operations (2.16) can
be seen as generalized kinds of intersection and union
operations, respectively. However, these operations are
not idempotent: one has in general $AA \ne A$ as well as
$A + A \ne A$.
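
The pointwise definitions (2.13) to (2.17) translate almost literally into code. The following is a minimal sketch, assuming a finite universe and fuzzy sets stored as dictionaries; the names `FuzzySet`, `pointwise`, and the sample sets `A`, `B` are illustrative choices, not notation from the text.

```python
from typing import Callable, Dict

# Assumption: a fuzzy set over a finite universe X is a dict x -> degree in [0, 1].
FuzzySet = Dict[str, float]


def pointwise(op: Callable[[float, float], float]):
    """Lift a binary operation on membership degrees to fuzzy sets, as in (2.13)."""
    def lifted(a: FuzzySet, b: FuzzySet) -> FuzzySet:
        assert a.keys() == b.keys(), "both fuzzy sets must live on the same universe"
        return {x: op(a[x], b[x]) for x in a}
    return lifted


intersection = pointwise(min)                                # (2.14)
union = pointwise(max)                                       # (2.14)
algebraic_product = pointwise(lambda u, v: u * v)            # (2.16)
algebraic_sum = pointwise(lambda u, v: min(u + v, 1.0))      # (2.16)
absolute_difference = pointwise(lambda u, v: abs(u - v))     # (2.17)


def complement(a: FuzzySet) -> FuzzySet:
    """Complement according to (2.15)."""
    return {x: 1.0 - a[x] for x in a}


# Illustrative data; the degrees are chosen to be exactly representable as floats.
A = {"x1": 0.0, "x2": 0.5, "x3": 1.0}
B = {"x1": 0.25, "x2": 0.75, "x3": 0.5}
```

With these sample sets, `intersection(A, A)` and `union(A, A)` return `A` again, whereas `algebraic_product(A, A)` and `algebraic_sum(A, A)` change the degree at `x2`, which is the failure of idempotency mentioned above.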


2.2.2 Fuzzy Sets and Many-Valued Logic 


It is well known that there is a strong parallelism 
between the standard set algebraic operations of in- 
tersection, union, and complementation, and classical 
logic, namely, the operations of conjunction, disjunc- 
tion, and negation, determined by their truth value 
functions et, vel, non, respectively. So one usually
defines, e.g., the intersection $M \cap N$ of sets $M, N$ by the
condition


$$x \in M \cap N \iff x \in M \wedge x \in N, \tag{2.18}$$


with $\wedge$ for the conjunction operation of classical logic
here.


In more abstract terms, the idea here is that these 
operations are defined in such a way that the power set 
algebra $P(M)$ of any set $M$ is (isomorphic to) the direct
product $W^M = \prod_{x \in M} W$ of the Boolean algebra
$W = \langle \{1, 0\}, \mathrm{et}, \mathrm{vel}, \mathrm{non} \rangle$
of truth values of classical logic.

A similar relationship can be recognized between 
the set algebra of fuzzy sets and suitable many-valued 
logics. It is simply necessary to consider the set of 
membership degrees for the fuzzy sets as set of truth 
degrees for a corresponding many-valued logic. So one 
can consider the operations (2.14) as intersection and
union related to the Gödel logic G, or also related to
the Łukasiewicz logic L. Similarly, the complementation
(2.15) is related to the negation operation of
the Łukasiewicz logics, and the algebraic operations
in (2.16) are an intersection operation with respect to
product logic, and a union operation with respect to
(the strong disjunction of) Łukasiewicz logic. Even the
operation (2.17) can be defined via Łukasiewicz logic:
one gets immediately via the corresponding truth degree
functions


$$|u - v| = \neg\big((u \to_L v) \mathbin{\&} (v \to_L u)\big). \tag{2.19}$$
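
Identity (2.19) can be checked mechanically on a grid of truth degrees. The sketch below assumes the usual truth degree functions of the Łukasiewicz connectives (strong conjunction $\max\{u+v-1, 0\}$, implication $\min\{1, 1-u+v\}$, negation $1-u$); the function names and the grid are illustrative, and exact rationals are used to avoid rounding effects.

```python
from fractions import Fraction
from itertools import product


def conj_L(u, v):
    """Lukasiewicz strong conjunction."""
    return max(u + v - 1, Fraction(0))


def impl_L(u, v):
    """Lukasiewicz implication."""
    return min(Fraction(1), 1 - u + v)


def neg_L(u):
    """Lukasiewicz negation."""
    return 1 - u


# Grid {0, 1/10, ..., 1}; a finite check, not a proof.
grid = [Fraction(i, 10) for i in range(11)]
identity_holds = all(
    abs(u - v) == neg_L(conj_L(impl_L(u, v), impl_L(v, u)))
    for u, v in product(grid, repeat=2)
)
```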


In more abstract terms, again, the set algebraic operations
with respect to a particular [0, 1]-valued logic $\mathcal{L}$
should be defined in such a way that the class
$\mathbb{F}(X) = [0, 1]^X$ of fuzzy subsets of a universe of
discourse $X$ is (isomorphic to) the direct product
$W^X = \prod_{x \in X} W$ of the truth degree algebra
$W = \langle [0, 1], \ldots \rangle$ of this particular
[0, 1]-valued logic.


2.2.3 t-Norms and t-Conorms 


For the previously mentioned nonidempotent intersec- 
tion operation, i.e., the algebraic product from (2.16), 
and for further similar possibilities the mathematically 
oriented part of the fuzzy community reached, mainly 
in the first half of the 1980s, a consensus that such 
generalized intersection operations should be defined 
via (2.13) from a triangular norm *. Such triangular 
norms — t-norms for short — had first been considered in 
the context of probabilistic metric spaces to get a suit- 
able version of a triangle inequality, cf. e.g. [2.19], and 
found since independent interest in different contexts, 
cf. [2.20,21]. They are isotonic, associative, and com- 
mutative binary operations in the unit interval which 
have 1 as their neutral element. This means that they 
make the unit interval an ordered monoid. 

The class of all t-norms is, however, very large
and not yet really well understood. So it is natural
to restrict attention to suitable subclasses, e.g., to the
continuous t-norms or to the left-continuous ones. (For
a t-norm $T$ left-continuity means that all the unary
functions $T_a$ with $T_a(x) = T(a, x)$ for each $a \in [0, 1]$
are left-continuous. For continuity the conditions (i) that
$T$ is continuous as a binary function and (ii) that all
the $T_a$ are continuous coincide [2.5, 20].) Standard
examples for continuous t-norms are the min-operation
in [0, 1], also called Gödel t-norm $T_G$, the arithmetic
product in [0, 1], also called product t-norm $T_P$, and
the Łukasiewicz t-norm $T_L : (u, v) \mapsto \max\{u + v - 1, 0\}$,
which is the truth degree function of the strong
conjunction in the Łukasiewicz many-valued systems. And
a standard example for a left-continuous t-norm which
is not continuous is the nilpotent minimum $T_{NM}$
defined as


$$T_{NM}(u, v) = \begin{cases} \min\{u, v\}, & \text{if } u + v > 1, \\ 0 & \text{otherwise.} \end{cases} \tag{2.20}$$


These examples for continuous t-norms are even
characteristic in the sense that each continuous t-norm
is an ordinal sum of isomorphic versions of $T_L$, $T_P$, $T_G$,
cf. [2.20] and also [2.5].
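
The four t-norms named above, and the defining conditions of a t-norm, can be written down as a small sketch. The function names and the grid are illustrative assumptions; the check runs over finitely many rational points only, so it is a sanity test rather than a proof.

```python
from fractions import Fraction
from itertools import product

ZERO, ONE = Fraction(0), Fraction(1)


def t_godel(u, v):
    return min(u, v)


def t_product(u, v):
    return u * v


def t_lukasiewicz(u, v):
    return max(u + v - 1, ZERO)


def t_nilpotent_min(u, v):
    """Nilpotent minimum according to (2.20)."""
    return min(u, v) if u + v > 1 else ZERO


def is_t_norm_on(grid, t):
    """Check commutativity, neutral element 1, associativity, isotonicity on a grid."""
    pairs_ok = all(
        t(u, v) == t(v, u) and t(u, ONE) == u
        for u, v in product(grid, repeat=2)
    )
    triples_ok = all(
        t(t(u, v), w) == t(u, t(v, w))          # associativity
        and (v > w or t(u, v) <= t(u, w))       # isotonicity in the second argument
        for u, v, w in product(grid, repeat=3)
    )
    return pairs_ok and triples_ok


GRID = [Fraction(i, 8) for i in range(9)]
```

For instance, `is_t_norm_on(GRID, t_nilpotent_min)` succeeds, while the arithmetic mean `lambda u, v: (u + v) / 2` is rejected because 1 is not its neutral element.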

To explain what is meant by an isomorphic version
of some t-norm, one has to start from an order automorphism
$f$ of the unit interval, i.e., from a continuous 1-1
onto map $f : [0, 1] \to [0, 1]$ with $f(0) = 0$ and $f(1) = 1$.
If now $T$ is a t-norm and $T^* : [0, 1]^2 \to [0, 1]$ is defined by


$$T^*(x, y) = f^{-1}\big(T(f(x), f(y))\big), \tag{2.21}$$


which equivalently means


$$f(T^*(x, y)) = T(f(x), f(y)), \tag{2.22}$$


then $T^*$ is again a t-norm and called an isomorphic version
of $T$, and $T$, $T^*$ are isomorphic t-norms.

Parallel with t-norms one often also considers t-conorms:
these are isotonic, associative, and commutative binary
operations in the unit interval which have 0 as their neutral
element. For the set algebra of fuzzy sets they define
(possibly) nonidempotent unions, and for the background logics
they constitute (possibly) nonidempotent disjunctions.

There is a natural 1-1 duality between t-norms and
t-conorms. By


$$1 - S(u, v) = T(1 - u, 1 - v) \tag{2.23}$$


one determines a t-conorm $S$ for any t-norm $T$, and conversely
determines a t-norm $T$ for any t-conorm $S$. This relationship
connects, e.g., the truth degree function
$(u, v) \mapsto \max\{u + v - 1, 0\}$ of the Łukasiewicz strong
conjunction with that of the corresponding strong disjunction
$(u, v) \mapsto \min\{u + v, 1\}$.

Obviously (2.23) constitutes, for the background logic,
a de Morgan connection between suitably chosen conjunctions
and disjunctions, as long as the function $u \mapsto 1 - u$
acts as the truth degree function of a negation. And indeed
this is the truth degree function of the negation of the
Łukasiewicz systems, which was already used in the definition
(2.15) of the complement of a fuzzy set.

Summing up, one has for the background logic idempotent weak
connectives for conjunction and disjunction, determined by the
minimum and the maximum operation in [0, 1]. Furthermore, one
is interested to have (possibly) nonidempotent strong
connectives for conjunction and disjunction, determined by
a t-norm and a t-conorm, usually one the dual of the other
according to (2.23).


2.3 t-Norm-Based Logics


From the point of view of many-valued logic, a t-norm
is a suitable candidate for a truth degree function of
some generalized conjunction connective. Accepting
this, one is essentially concerned with systems of
many-valued logic with infinite truth degree set [0, 1]. And
additionally one prefers to consider such systems which
have the truth degree 1 as the only designated truth degree.
(This means, e.g., that a formula of the language
of such a system counts as logically valid just in case
it always assumes this designated truth degree 1. This
notion, as well as the other notions from many-valued
logic, are explained in detail, e.g., in [2.5].)
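
Before turning to the connectives, the two constructions (2.21) and (2.23) on t-norms can be sketched in code. The helper names `isomorphic_version` and `dual_conorm`, and the choice of $f(x) = x^2$ as order automorphism, are assumptions made only for this illustration.

```python
import math
from fractions import Fraction


def t_lukasiewicz(u, v):
    """Lukasiewicz t-norm; works for floats and for exact fractions."""
    return max(u + v - 1, 0)


def isomorphic_version(t, f, f_inv):
    """T*(x, y) = f^{-1}(T(f(x), f(y))) as in (2.21)."""
    return lambda x, y: f_inv(t(f(x), f(y)))


def dual_conorm(t):
    """S(u, v) = 1 - T(1 - u, 1 - v), i.e., (2.23) solved for S."""
    return lambda u, v: 1 - t(1 - u, 1 - v)


# Hypothetical automorphism f(x) = x^2 with inverse sqrt, evaluated in floats.
t_star = isomorphic_version(t_lukasiewicz, lambda x: x * x, math.sqrt)

# Dual t-conorm of the Lukasiewicz t-norm: the bounded sum min{u + v, 1}.
s_lukasiewicz = dual_conorm(t_lukasiewicz)
```

Here `t_star(1.0, 0.5)` returns `0.5`, so 1 remains the neutral element, and `s_lukasiewicz(Fraction(1, 4), Fraction(1, 2))` returns `Fraction(3, 4)`, in agreement with the strong disjunction $\min\{u + v, 1\}$.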


2.3.1 Basic Ideas 


Such a system of many-valued logic is called t-norm
based (on some particular t-norm $T$) iff all the other
connectives of it have associated truth degree functions
which are defined from this t-norm $T$, using possibly
some truth degree constants. Usually one considers
together with the conjunction connective & with the truth
degree function $T$ an implication connective $\to_T$ with the
truth degree function $I_T$ characterized by

$$I_T(u, v) =_{\mathrm{def}} \sup\{z \mid T(u, z) \le v\}, \tag{2.24}$$

the so-called R-implication connected with $T$, and
a standard negation connective $-_T$ with truth degree
function $n_T$, given as

$$n_T(u) =_{\mathrm{def}} I_T(u, 0). \tag{2.25}$$

As shall be explained in Sect. 2.3.2, the definition
(2.24) determines a reasonable implication function just
in the case that the t-norm $T$ is left continuous. Here
reasonable essentially means that $\to_T$ satisfies a
suitable version of the rule of detachment.

In more technical terms it means that for left
continuous t-norms $T$ condition (2.24) defines a residuation
operator $I_T$, previously sometimes also called
$\varphi$-operator, cf. [2.22]. And it means also, under this
assumption of left continuity of $T$, that condition (2.24) is
equivalent to the adjointness condition

$$T(u, w) \le v \iff w \le I_T(u, v), \tag{2.26}$$

i.e., that the operations $T$ and $I_T$ form an adjoint pair.
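
For the three standard continuous t-norms the residuum (2.24) has a well-known closed form, and the adjointness condition (2.26) can be tested on a finite grid. The sketch below is only an illustration under these assumptions: the closed forms are standard facts rather than derived here, `residuum_on_grid` replaces the supremum over [0, 1] by a maximum over grid points, and all names are hypothetical.

```python
from fractions import Fraction
from itertools import product

ZERO, ONE = Fraction(0), Fraction(1)
GRID = [Fraction(i, 8) for i in range(9)]


def t_lukasiewicz(u, v):
    return max(u + v - 1, ZERO)


def t_godel(u, v):
    return min(u, v)


def t_product(u, v):
    return u * v


# Closed forms of the R-implication I_T (standard, assumed known).
def i_lukasiewicz(u, v):
    return min(ONE, 1 - u + v)


def i_godel(u, v):
    return ONE if u <= v else v


def i_product(u, v):
    return ONE if u <= v else v / u


def residuum_on_grid(t, grid):
    """Grid analogue of (2.24): the largest grid point z with T(u, z) <= v."""
    return lambda u, v: max(z for z in grid if t(u, z) <= v)


def adjoint_on(grid, t, i):
    """Check the adjointness condition (2.26) for all u, v, w from the grid."""
    return all(
        (t(u, w) <= v) == (w <= i(u, v))
        for u, v, w in product(grid, repeat=3)
    )
```

The negation (2.25) is then obtained as `i(u, ZERO)`: for the Łukasiewicz case it yields $1 - u$, for the Gödel case it yields 1 at $u = 0$ and 0 elsewhere.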

Forced by these results one usually restricts, in this 
logical context, the considerations to left continuous — 
or even to continuous — t-norms. 

But together with this restriction of the t-norms, 
a generalization of the possible truth degree sets some- 
times is useful: one may accept each subset of the unit 
interval [0, 1] as a truth degree set which is closed under 
the particular t-norm T and its residuum. 

The restriction to continuous t-norms enables even
the definition of the operations max and min, which
make [0, 1] into a (linearly) ordered lattice. On the one
hand, one has from straightforward calculations that
always


$$\min\{u, v\} = T(u, I_T(u, v)), \tag{2.27}$$


and on the other hand one gets always [2.23, 24]


$$\max\{u, v\} = \min\{I_T(I_T(u, v), v),\; I_T(I_T(v, u), u)\}. \tag{2.28}$$


It is a routine matter to check that the infinite-valued
Gödel logic G, the infinite-valued Łukasiewicz logic L,
and also the product logic $\Pi$ all are t-norm-based logics
in the present sense.

The systems of fuzzy logic we discuss here are
also sometimes called R-fuzzy logics, stressing the
fact that our implication connectives $\to_T$ have as truth
degree functions $I_T$ the residuation operations,
characterized by (2.24) or (2.26). Besides these R-fuzzy
logics one occasionally, e.g., in [2.25, 26], discusses
so-called S-fuzzy logics which are also based on some
t-norm, but additionally take the Łukasiewicz negation
$n_L(u) = 1 - u$ or also some other negation function,
sometimes together with a further t-conorm, as a basic
connective.

These S-fuzzy logics define their implication con- 
nective like material implication might be defined in 
classical logic. However, these logics lose, in general, 
the rule of detachment as a sound rule of inference if 
they have the degree 1 as the only designated truth de- 
gree — or they allow all positive reals from (0, 1] as 
designated truth degrees. 

For a complete development of such t-norm-based
logics one needs adequate axiomatizations. This seems
to be, however, a difficult goal, essentially because
of its dependence on the particular choice of the
t-norms which determine these logics. Therefore, the first
successful approaches intended to axiomatize common
parts of a whole class of such logics. This will be
discussed later in Sect. 2.4.


2.3.2 Left and Full Continuity of t-Norms 


As had been mentioned in the previous section, the 
adjointness condition is an algebraic equivalent of the 
analytical notion of left continuity. This will be proved 
here. 


Definition 2.1 

A t-norm $T$ is left continuous (continuous) iff all the
unary functions $T_a : x \mapsto T(x, a)$ for $a \in [0, 1]$ are
left continuous (continuous).


This definition of continuity for t-norms via their 
unary parametrizations coincides with the usual defini- 
tion of continuity for a binary function, cf. [2.20]. 


Proposition 2.1 
A t-norm $T$ is left continuous iff $T$ and its R-implication
$I_T$ form an adjoint pair.


Proofs are given, e.g., in [2.5, 20, 22].

It is interesting, and important later on, to also notice
that the continuity of a t-norm has an algebraic
equivalent.


Proposition 2.2 
A t-norm $T$ is continuous iff $T$ and $I_T$ satisfy the
equation


$$T(a, I_T(a, b)) = \min\{a, b\}. \tag{2.29}$$


Proof: Assume first that $T$ is continuous. Then one
has for $a \le b \in [0, 1]$ immediately
$T(a, I_T(a, b)) = T(a, 1) = a = \min\{a, b\}$. And one has for
$b < a$


$$\begin{aligned}
T(a, I_T(a, b)) &= T(a, \max\{z \mid T(a, z) \le b\}) \\
&= \max\{T(a, z) \mid T(a, z) \le b\} \le b
\end{aligned} \tag{2.30}$$


already by the left continuity of $T$. Continuity of $T$
furthermore gives from $0 = T(a, 0) \le b < a = T(a, 1)$ the
existence of some $c \in [0, 1]$ with $b = T(a, c)$, and thus
$T(a, I_T(a, b)) = b = \min\{a, b\}$ by (2.30).

Assume conversely (2.29). Then the adjointness 
condition forces T to be left continuous. Hence for the 
continuity of T one has to show that T is also right con- 
tinuous. 

Suppose that this is not the case. Then there exist
$a, b \in (0, 1]$, and also a decreasing sequence $(x_i)_{i \ge 0}$
with $\lim_{i \to \infty} x_i = b$ such that
$T(a, b) \ne \inf_i T(a, x_i)$, i.e., such that
$T(a, b) < \inf_i T(a, x_i)$. Consider now some $d$ with
$T(a, b) < d < \inf_i T(a, x_i) \le a$. Then there does
not exist some $c \in [0, 1]$ with $d = T(a, c)$, because
otherwise one would have $d = T(a, c) > T(a, b)$, hence
$c > b$ and thus $\inf_i T(a, x_i) \le T(a, c) = d$ from the fact
that $b = \lim_{i \to \infty} x_i$ and there thus exists some
integer $k$ with $x_k \le c$. But (2.29) would give, because of
$d \le a$, that $d = \min\{a, d\} = T(a, I_T(a, d))$. This means
that the lack of right continuity for $T$ contradicts
condition (2.29). □
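
Proposition 2.2 can be illustrated by comparing a continuous t-norm with the nilpotent minimum (2.20), which is only left continuous. The sketch assumes the commonly used closed form $\max\{1 - u, v\}$ (for $u > v$) of the residuum of the nilpotent minimum; function names and the grid are illustrative, and the grid check is not a proof.

```python
from fractions import Fraction
from itertools import product

ZERO, ONE = Fraction(0), Fraction(1)
GRID = [Fraction(i, 8) for i in range(9)]


def t_product(u, v):
    return u * v


def i_product(u, v):
    return ONE if u <= v else v / u


def t_nm(u, v):
    """Nilpotent minimum (2.20)."""
    return min(u, v) if u + v > 1 else ZERO


def i_nm(u, v):
    """Residuum of the nilpotent minimum (closed form, assumed known)."""
    return ONE if u <= v else max(1 - u, v)


def adjoint_on(grid, t, i):
    """Adjointness (2.26) on the grid."""
    return all(
        (t(u, w) <= v) == (w <= i(u, v))
        for u, v, w in product(grid, repeat=3)
    )


def violations_of_2_29(grid, t, i):
    """All grid pairs (a, b) with T(a, I_T(a, b)) != min{a, b}."""
    return [
        (a, b) for a, b in product(grid, repeat=2)
        if t(a, i(a, b)) != min(a, b)
    ]
```

Both pairs pass the adjointness test, as expected for left continuous t-norms. Equation (2.29) has no violations for the product t-norm, while the nilpotent minimum violates it, e.g., at $a = 1/2$, $b = 1/4$, where $T(a, I_T(a, b)) = T(1/2, 1/2) = 0$.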


2.3.3 Extracting an Algebraic Framework 


For the problem of adequate axiomatization of (classes 
of) t-norm-based systems of many-valued logic there is 
an important difference to the standard approach toward 
semantically based systems of many-valued logic: here 
there is no single, standard semantical matrix for the 
general approach. 

The most appropriate way out of this situation 
seems to be: to find some suitable class(es) of algebraic 
structures which can be used to characterize these log- 
ical systems, and which preferably should be algebraic 
varieties, i. e., equationally definable. 


From an algebraic point of view, the following con- 
ditions seem to be structurally important for t-norms: 


• $\langle [0, 1], T, 1 \rangle$ is a commutative semigroup with
a neutral element, i.e., a commutative monoid,

• $\le$ is a (lattice) ordering in [0, 1] which has 0 as
universal lower bound and 1 as universal upper bound,

• both structures fit together: $T$ is nondecreasing
w.r.t. this lattice ordering.


Thus it seems reasonable to consider commutative 
lattice-ordered monoids as the truth degree structures 
for the t-norm-based systems. 

In general, however, commutative lattice-ordered
monoids may have different elements as the universal
upper bound of the lattice and as the neutral element of
the monoid. This is not the case for the t-norm-based
systems: they make [0, 1] into an integral commutative
lattice-ordered monoid as truth degree structure,
namely, one in which the universal upper bound of the
lattice ordering and the neutral element of the monoidal
structure coincide.

Furthermore, one also likes to have the t-norm T 
combined with another operation, its R-implication op- 
erator, which forms together with T an adjoint pair: i. e., 
the commutative lattice-ordered monoid formed by the 
truth degree structure has also to be a residuated one. 

Summing up, hence, we are going to consider residuated
lattices, i.e., algebraic structures
$\langle L, \cap, \cup, *, \rightsquigarrow, 0, 1 \rangle$ such that
$L$ is a lattice under $\cap, \cup$ with the universal lower
bound 0 and the universal upper bound 1, and a commutative
lattice-ordered monoid under $*$ with neutral element 1, and
such that the operations $*$ and $\rightsquigarrow$ form an
adjoint pair, i.e., satisfy

$$x * z \le y \iff z \le (x \rightsquigarrow y). \tag{2.31}$$

In this framework one additionally introduces, following
the understanding of the negation connective given
in (2.25), a further operation $-$ by


$$-x =_{\mathrm{def}} x \rightsquigarrow 0. \tag{2.32}$$


Definition 2.2 

A lattice-ordered monoid $\langle L, *, 1, \le \rangle$ is divisible
iff for all $a, b \in L$ with $a \le b$ there exists some
$c \in L$ with $a = b * c$.


For linearly ordered residuated lattices, one has another
nice and useful characterization of divisibility.

Proposition 2.3

A linearly ordered residuated lattice
$\langle L, \cap, \cup, *, \rightsquigarrow, 0, 1 \rangle$ is divisible,
i.e., corresponds to a divisible lattice-ordered monoid
$\langle L, *, 1, \le \rangle$, iff one has
$a \cap b = a * (a \rightsquigarrow b)$ for all $a, b \in L$.
(Of course, $\le$ here is the lattice ordering of the lattice
$\langle L, \cap, \cup \rangle$.)


Proof: We first show that one has in each residuated
lattice

$$a * (a \rightsquigarrow b) = b \iff \exists x\,(a * x = b) \tag{2.33}$$

for all $a, b \in L$. Of course, in the case
$a * (a \rightsquigarrow b) = b$ there exists an $x$ such that
$a * x = b$. So suppose $a * c = b$ for some $c \in L$. If one
then would have $a * (a \rightsquigarrow b) \ne b$, this would
mean $a * (a \rightsquigarrow b) < b = a * c$ because one always
has $a * (a \rightsquigarrow b) \le b$ by the adjointness
condition, and this hence would mean
$c \not\le a \rightsquigarrow b$ (because otherwise
$c \le a \rightsquigarrow b$ and hence
$b = a * c \le a * (a \rightsquigarrow b)$ would be the case) and
therefore also $a * c = c * a \not\le b$ by the adjointness
condition, a contradiction. Thus (2.33) is established.

Supposing now the divisibility of
$\langle L, \cap, \cup, *, \rightsquigarrow, 0, 1 \rangle$, then one
has for all $b \le a \in L$ from the existence of an $x$ such
that $b = a * x$ immediately
$a * (a \rightsquigarrow b) = b = a \cap b$. Otherwise one has
$a \le b$ by the linearity of the ordering and hence
$a \rightsquigarrow b = 1$ from the adjointness property, thus
$a * (a \rightsquigarrow b) = a * 1 = a = a \cap b$.

Assume on the other hand that one always has
$a \cap b = a * (a \rightsquigarrow b)$. Then for all
$a \le b \in L$ one gets from $a = a \cap b = b \cap a$ the
equation $a = b * (b \rightsquigarrow a)$, and hence there is
an $x$ such that $a = b * x$. □

Using this result, we can restate Proposition 2.2 in 
the following way. 


Corollary 2.1 
A t-algebra $[0, 1]_T = \langle [0, 1], \min, \max, T, I_T, 0, 1 \rangle$
is divisible iff the t-norm $T$ is continuous.
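
Definition 2.2 and the criterion of Proposition 2.3 can be compared directly on a finite chain. The sketch below is an illustration only: it uses the five-element carrier $\{0, 1/4, \ldots, 1\}$ as a stand-in for [0, 1], which is closed under the Łukasiewicz operations and under the nilpotent minimum with its residuum; all function names are hypothetical, and the closed form of the residuum of the nilpotent minimum is assumed.

```python
from fractions import Fraction
from itertools import product

ZERO, ONE = Fraction(0), Fraction(1)


def chain(n):
    """Carrier {0, 1/n, ..., 1} of a finite linearly ordered example."""
    return [Fraction(i, n) for i in range(n + 1)]


def t_lukasiewicz(u, v):
    return max(u + v - 1, ZERO)


def i_lukasiewicz(u, v):
    return min(ONE, 1 - u + v)


def t_nm(u, v):
    return min(u, v) if u + v > 1 else ZERO


def i_nm(u, v):
    return ONE if u <= v else max(1 - u, v)


def is_divisible(carrier, times):
    """Definition 2.2: for all a <= b there is some c with a = b * c."""
    return all(
        any(times(b, c) == a for c in carrier)
        for a, b in product(carrier, repeat=2) if a <= b
    )


def meets_are_products(carrier, times, residuum):
    """Criterion of Proposition 2.3; on a chain the meet is the minimum."""
    return all(
        min(a, b) == times(a, residuum(a, b))
        for a, b in product(carrier, repeat=2)
    )


C4 = chain(4)
```

On `C4` both tests succeed for the Łukasiewicz operations and both fail for the nilpotent minimum, as the equivalence in Proposition 2.3 predicts.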


A further restriction is suitable w.r.t. the class of
residuated lattices because each t-algebra $[0, 1]_T$ is
linearly ordered, and thus makes in particular the wff
$(\varphi \to \psi) \vee (\psi \to \varphi)$ valid. Following
Hájek [2.23, 24], one calls BL-algebras those divisible
residuated lattices which also satisfy the prelinearity
condition (2.12).


Definition 2.3

A structure $\mathbf{L} = \langle L, \vee, \wedge, *, \rightsquigarrow, 0, 1 \rangle$
is a BL-algebra iff:

i) $\langle L, \vee, \wedge, 0, 1 \rangle$ is a bounded lattice with
lattice ordering $\le$,

ii) $\langle L, *, 1, \le \rangle$ is a lattice-ordered Abelian monoid,

iii) the operations $*$ and $\rightsquigarrow$ satisfy the
adjointness condition

$$x * y \le z \iff x \le y \rightsquigarrow z, \tag{2.34}$$

iv) the prelinearity condition (2.12) is satisfied,

v) the divisibility condition is satisfied, i.e., one has
always


$$x * (x \rightsquigarrow y) = x \wedge y. \tag{2.35}$$


It is interesting to notice that the prelinearity
condition (2.12) can equivalently be characterized in another
form, which will become important later on.


Proposition 2.4
In residuated lattices the following conditions are equivalent:


(i) $(x \rightsquigarrow y) \cup (y \rightsquigarrow x) = 1$,
(ii) $((x \rightsquigarrow y) \rightsquigarrow z) * ((y \rightsquigarrow x) \rightsquigarrow z) \le z$.


The proof is by routine calculations, cf., e.g., [2.5, 23].


2.4 Particular Fuzzy Logics


Now we shall discuss the core systems of t-norm-based
logics. Of course, it would be preferable to be able
to axiomatize each single t-norm-based logic directly.
However, currently there is no way to do so. Hence other
approaches have been developed. The core idea is first
to develop systems which cover large parts which are
common to all those t-norm-based logics.


The first successful approach came from Hájek, who
presented in 1998 in the seminal monograph [2.23] the
logic BL of all continuous t-norms, i.e., the common
part of all the t-norm-based logics which are determined
by a continuous t-norm. Inspired by this work, a short
time later Esteva and Godo [2.27] introduced in 2001 the
logic MTL of all left-continuous t-norms.

These logics are characterized by algebraic semantics:
BL by the class of all t-algebras with a continuous
t-norm, and MTL by the class of all t-algebras with a
left-continuous t-norm. All those t-algebras are particular
cases of residuated lattices.

It should be noticed, however, that already in 1996 
Höhle [2.28] introduced the monoidal logic ML char- 
acterized by the class of all residuated lattices as their 
algebraic semantics. 

And it should also be mentioned that, in the case of 
logics which are determined by an algebraic semantics, 
the problem of their adequate axiomatization becomes 
particularly well manageable, if the algebraic semantics 
is given as a variety of algebraic structures, i.e., as an 
equationally definable class of algebraic structures. 


2.4.1 The Logic BL 
of All Continuous t-Norms 


The class of t-algebras (with a continuous t-norm or 
not) is not a variety: it is not closed under direct prod- 
ucts because each t-algebra is linearly ordered. Hence 
one may expect that it would be helpful for the devel- 
opment of a logic of continuous t-norms to extend the 
class of all divisible t-norm algebras in a moderate way 
to get a variety. 
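
The obstacle named above, that a direct product of linearly ordered algebras is in general no longer linearly ordered, can be seen on a small example. The sketch assumes a three-element Gödel chain as a finite stand-in for a t-algebra and forms its square with componentwise operations; all names are illustrative. It also shows that the prelinearity condition (2.12) survives the product, which is why it is a natural replacement for linearity.

```python
from fractions import Fraction
from itertools import product

ONE = Fraction(1)
CHAIN = [Fraction(i, 2) for i in range(3)]      # {0, 1/2, 1}
SQUARE = list(product(CHAIN, repeat=2))         # carrier of the direct product
TOP = (ONE, ONE)


def i_godel(u, v):
    """Residuum of the minimum t-norm."""
    return ONE if u <= v else v


def leq(x, y):
    """Componentwise ordering of the product."""
    return all(a <= b for a, b in zip(x, y))


def impl(x, y):
    """Componentwise residuum of the product."""
    return tuple(i_godel(a, b) for a, b in zip(x, y))


def join(x, y):
    """Componentwise lattice join of the product."""
    return tuple(max(a, b) for a, b in zip(x, y))


is_linear = all(leq(x, y) or leq(y, x) for x, y in product(SQUARE, repeat=2))
is_prelinear = all(
    join(impl(x, y), impl(y, x)) == TOP for x, y in product(SQUARE, repeat=2)
)
```

Here `is_linear` is false, because, e.g., $(0, 1)$ and $(1, 0)$ are incomparable, while `is_prelinear` is true.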

And indeed this idea works: it was developed by 
Hájek and in detail explained in [2.23]. 

The core point is that one considers instead of the
divisible t-algebras $[0, 1]_T$, which are linearly ordered
integral monoids, lattice-ordered integral monoids
which satisfy the condition (2.35), which have an
additional residuation operation connected with the
semigroup operation via an adjointness condition (2.26), and
which also satisfy the prelinearity condition


$$(x \rightsquigarrow y) \vee (y \rightsquigarrow x) = 1, \tag{2.36}$$


or equivalently


$$((x \rightsquigarrow y) \rightsquigarrow z) \rightsquigarrow (((y \rightsquigarrow x) \rightsquigarrow z) \rightsquigarrow z) = 1. \tag{2.37}$$


The axiomatization of Hájek [2.23] for the basic t-norm
logic BL (in [2.5] denoted BTL), i.e., for the class
of all well-formed formulas which are valid in all
BL-algebras, is given in a language $\mathcal{L}_T$ which has as
basic vocabulary the connectives $\to$, & and the truth degree
constant 0, taken in each BL-algebra
$\langle L, \cap, \cup, *, \rightsquigarrow, 0, 1 \rangle$ as the
operations $\rightsquigarrow$, $*$ and the element 0.


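This t-norm-based logic BL has the following axiom
schemata:


(Ax$_{BL}$1) $(\varphi \to \psi) \to ((\psi \to \chi) \to (\varphi \to \chi))$,
(Ax$_{BL}$2) $\varphi \,\&\, \psi \to \varphi$,
(Ax$_{BL}$3) $\varphi \,\&\, \psi \to \psi \,\&\, \varphi$,
(Ax$_{BL}$4) $(\varphi \to (\psi \to \chi)) \to (\varphi \,\&\, \psi \to \chi)$,
(Ax$_{BL}$5) $(\varphi \,\&\, \psi \to \chi) \to (\varphi \to (\psi \to \chi))$,
(Ax$_{BL}$6) $\varphi \,\&\, (\varphi \to \psi) \to \psi \,\&\, (\psi \to \varphi)$,
(Ax$_{BL}$7) $((\varphi \to \psi) \to \chi) \to (((\psi \to \varphi) \to \chi) \to \chi)$,
(Ax$_{BL}$8) $0 \to \varphi$,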


and has as its (only) inference rule the rule of detach- 
ment. 
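
As a plausibility check of soundness (not a proof), the eight axiom schemata can be evaluated in the three standard t-algebras over a grid of truth degrees: each schema should always take the designated value 1. The sketch assumes the well-known closed forms of the residua; the dictionary `STANDARD`, the helper names, and the grid are illustrative.

```python
from fractions import Fraction
from itertools import product

ZERO, ONE = Fraction(0), Fraction(1)
GRID = [Fraction(i, 6) for i in range(7)]

# Pairs (t-norm, residuum) for the three standard continuous t-norms.
STANDARD = {
    "Lukasiewicz": (lambda u, v: max(u + v - 1, ZERO),
                    lambda u, v: min(ONE, 1 - u + v)),
    "Goedel": (min, lambda u, v: ONE if u <= v else v),
    "product": (lambda u, v: u * v,
                lambda u, v: ONE if u <= v else v / u),
}


def bl_axioms(t, i):
    """Truth degree functions of (AxBL1) to (AxBL8) in the variables p, q, r."""
    return [
        lambda p, q, r: i(i(p, q), i(i(q, r), i(p, r))),
        lambda p, q, r: i(t(p, q), p),
        lambda p, q, r: i(t(p, q), t(q, p)),
        lambda p, q, r: i(i(p, i(q, r)), i(t(p, q), r)),
        lambda p, q, r: i(i(t(p, q), r), i(p, i(q, r))),
        lambda p, q, r: i(t(p, i(p, q)), t(q, i(q, p))),
        lambda p, q, r: i(i(i(p, q), r), i(i(i(q, p), r), r)),
        lambda p, q, r: i(ZERO, p),
    ]


def failing_axioms(t, i, grid):
    """1-based numbers of the axioms that take a value below 1 on the grid."""
    return [
        k for k, axiom in enumerate(bl_axioms(t, i), start=1)
        if any(axiom(p, q, r) != ONE for p, q, r in product(grid, repeat=3))
    ]
```

For each entry of `STANDARD`, `failing_axioms` returns the empty list. Since the rule of detachment preserves the value 1 in every residuated structure, this is consistent with the soundness half of the completeness results.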

Starting from the primitive connectives $\to$, &, and
the truth degree constant 0, the language $\mathcal{L}_T$ of BL is
extended by definitions of further connectives


$$\varphi \wedge \psi =_{\mathrm{def}} \varphi \,\&\, (\varphi \to \psi), \tag{2.38}$$

$$\varphi \vee \psi =_{\mathrm{def}} ((\varphi \to \psi) \to \psi) \wedge ((\psi \to \varphi) \to \varphi), \tag{2.39}$$

$$\neg\varphi =_{\mathrm{def}} \varphi \to 0, \tag{2.40}$$


where $\varphi$, $\psi$ are formulas of the language of that system.
Calculations (in BL-algebras) show that the additional
connectives $\wedge$, $\vee$ just have the lattice operations
$\cap$, $\cup$ as their truth degree functions.
The system BL is an implicative logic in the sense
of Rasiowa [2.29]. So one gets a general soundness and
completeness result.


Theorem 2.1 General Completeness 
A formula $\varphi$ of the language $\mathcal{L}_T$ is derivable within
the axiomatic system BL iff $\varphi$ is valid in all BL-algebras.


However, it is shown in [2.23] that already the 
class of all BL-chains, i. e., of all linearly ordered BL- 
algebras, provides an adequate algebraic semantics. 


Theorem 2.2 General Chain Completeness 
A formula $\varphi$ of $\mathcal{L}_T$ is derivable within the axiomatic
system BL iff $\varphi$ is valid in all BL-chains.


But even more is provable and leads back to the 
starting point of the whole approach: the theorems of BL 
are just those formulas which hold true w.r.t. all divis- 
ible t-algebras. This was, extending preliminary results 
from [2.24], finally proved in [2.30]. 


Theorem 2.3 Standard Completeness 
The class of all formulas which are provable in the
system BL coincides with the class of all formulas which
are logically valid in all t-algebras with a continuous
t-norm.


The main steps in the proof are to show (i) that each 
BL-algebra is a subdirect product of subdirectly irre- 
ducible BL-chains, i. e., of linearly ordered BL-algebras 
which are not subdirect products of other BL-chains, 
and (ii) that each subdirectly irreducible BL-chain can 
be embedded into the ordinal sum of some BL-chains 
which are either trivial one-element BL-chains, or lin- 
early ordered MV-algebras, or linearly ordered product 
algebras, such that (iii) each such ordinal summand 
is locally embeddable into a t-norm-based residuated
lattice with a continuous t-norm, cf. [2.24,30] and 
again [2.5]. 

This is a lot more algebraic machinery than is
necessary for the proof of the General Completeness
Theorem 2.1 and thus offers a further indication that the
extension of the class of divisible t-algebras to the class 
of BL-algebras made the development of the intended 
logical system easier. But even more can be seen from 
this proof: the class of BL-algebras is the smallest va- 
riety which contains all the divisible t-algebras, i. e., all 
the t-algebras determined by a continuous t-norm. And 
the algebraic reason for this is that each variety may be 
generated from its subdirectly irreducible elements, cf. 
again [2.31, 32]. 

Yet another generalization of Theorem 2.1 deserves 
to be mentioned. To state it, let us call schematic exten- 
sion of BL every extension which consists in an addition 
of axiom schemata to the axiom schemata of BL. And let 
us denote such an extension by BL + C. And call
BL(C)-algebra each BL-algebra $\mathbf{A}$ which makes all
formulas of C $\mathbf{A}$-valid, i.e., which is a model of C.

Then one can prove, as done in [2.23], an even more 
general completeness result. 


Theorem 2.4 Strong General Completeness 
For each set C of axiom schemata and any formula $\varphi$ of
$\mathcal{L}_T$ the following are equivalent:


i) $\varphi$ is derivable within BL + C;
ii) $\varphi$ is valid in all BL(C)-algebras;
iii) $\varphi$ is valid in all BL(C)-chains.


For the standard semantics this result holds true 
only in a restricted form: one has to restrict the con- 
sideration to finite sets C of axiom schemata, i.e., to 
finite theories. For the Łukasiewicz logic L, which is
the extension of BL by the schema $\neg\neg\varphi \to \varphi$ of
double negation, this has already been shown in [2.23]. And for
arbitrary continuous t-norms this follows from results
of Haniková [2.33, 34].


Theorem 2.5 Strong Standard Completeness 
For each finite set C of axiom schemata and any formula
$\varphi$ of $\mathcal{L}_T$ the following are equivalent:


i) $\varphi$ is derivable within BL + C;
ii) $\varphi$ is valid in all t-algebras which are models of C.


2.4.2 The Logic MTL 
of All Left Continuous t-Norms 


The guess of Esteva and Godo [2.27] has been that one 
should arrive at the logic of left continuous t-norms 
if one starts from the logic of continuous t-norms and 
deletes the continuity condition, i.e., the divisibility 
condition (2.35). 

The algebraic approach needs only a small modi- 
fication: in the definition of the BL-algebras one has 
simply to delete the divisibility condition. The resulting 
algebraic structures have been called MTL-algebras. 
They again form a variety. 

Following this idea, one has to modify the previous
axiom system in a suitable way. And one has to delete
the definition (2.38) of the connective $\wedge$, because this
definition (together with suitable axioms) essentially
codes the divisibility condition. The definition (2.39) of
the connective $\vee$ remains unchanged.

As a result one now considers a new system MTL 
of mathematical fuzzy logic, characterized semantically 
by the class of all MTL-algebras. It is connected with 
the axiom system: 


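(Ax$_{\mathrm{MTL}}$1) $(\varphi \to \psi) \to ((\psi \to \chi) \to (\varphi \to \chi))$,
(Ax$_{\mathrm{MTL}}$2) $\varphi \,\&\, \psi \to \varphi$,
(Ax$_{\mathrm{MTL}}$3) $\varphi \,\&\, \psi \to \psi \,\&\, \varphi$,
(Ax$_{\mathrm{MTL}}$4) $(\varphi \to (\psi \to \chi)) \to (\varphi \,\&\, \psi \to \chi)$,
(Ax$_{\mathrm{MTL}}$5) $(\varphi \,\&\, \psi \to \chi) \to (\varphi \to (\psi \to \chi))$,
(Ax$_{\mathrm{MTL}}$6) $\varphi \wedge \psi \to \varphi$,
(Ax$_{\mathrm{MTL}}$7) $\varphi \wedge \psi \to \psi \wedge \varphi$,
(Ax$_{\mathrm{MTL}}$8) $\varphi \,\&\, (\varphi \to \psi) \to \varphi \wedge \psi$,
(Ax$_{\mathrm{MTL}}$9) $0 \to \varphi$,
(Ax$_{\mathrm{MTL}}$10) $((\varphi \to \psi) \to \chi) \to (((\psi \to \varphi) \to \chi) \to \chi)$,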


together with the rule of detachment (w.r.t. the implication
connective $\to$) as (the only) inference rule.

Again, the system MTL is an implicative logic in the
sense of Rasiowa [2.29], which yields a general soundness
and completeness result as for the previous system BL.
Proofs of these results were given in [2.27].
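
The soundness half of this result can be illustrated numerically. The following Python sketch evaluates the ten MTL axiom schemata, in their usual formulation, in the standard algebra of one particular left continuous but not continuous t-norm, the nilpotent minimum. The choice of this t-norm, the finite grid of truth degrees, and all function names are assumptions made for illustration only; a grid check of this kind illustrates soundness and proves nothing.

```python
from fractions import Fraction
from itertools import product

def t_nm(x, y):
    """Nilpotent minimum: a left continuous t-norm that is not continuous."""
    return min(x, y) if x + y > 1 else Fraction(0)

def imp(x, y):
    """Residuum of the nilpotent minimum."""
    return Fraction(1) if x <= y else max(1 - x, y)

# Truth functions of the ten axiom schemata; min plays the role of the
# lattice conjunction, t_nm the role of the strong conjunction &.
AXIOMS = {
    1: lambda p, q, r: imp(imp(p, q), imp(imp(q, r), imp(p, r))),
    2: lambda p, q, r: imp(t_nm(p, q), p),
    3: lambda p, q, r: imp(t_nm(p, q), t_nm(q, p)),
    4: lambda p, q, r: imp(imp(p, imp(q, r)), imp(t_nm(p, q), r)),
    5: lambda p, q, r: imp(imp(t_nm(p, q), r), imp(p, imp(q, r))),
    6: lambda p, q, r: imp(min(p, q), p),
    7: lambda p, q, r: imp(min(p, q), min(q, p)),
    8: lambda p, q, r: imp(t_nm(p, imp(p, q)), min(p, q)),
    9: lambda p, q, r: imp(Fraction(0), p),
    10: lambda p, q, r: imp(imp(imp(p, q), r), imp(imp(imp(q, p), r), r)),
}

# Exact rational grid, so that no rounding errors can occur.
GRID = [Fraction(k, 8) for k in range(9)]

def failing_axioms():
    """Numbers of the axioms that take a value below 1 somewhere on the grid."""
    return [n for n, ax in AXIOMS.items()
            if any(ax(p, q, r) != 1 for p, q, r in product(GRID, repeat=3))]

print(failing_axioms())
```

An empty list means that no axiom was refuted on the grid; exact rationals are used so that equality with 1 is tested without floating-point tolerance.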


Theorem 2.6 General Completeness
A formula $\varphi$ of the language $\mathcal{L}_T$ is derivable within the
system MTL iff $\varphi$ is valid in all MTL-algebras.


Furthermore it is shown, in [2.27], that again al- 
ready the class of all MTL-chains provides an adequate 
algebraic semantics. 


Theorem 2.7 General Chain Completeness
A formula $\varphi$ of $\mathcal{L}_T$ is derivable within the axiomatic
system MTL iff $\varphi$ is valid in all MTL-chains.


And again, similarly to the BL case, even more
is provable: the system MTL characterizes exactly those
formulas which hold true w.r.t. all those t-norm-based
logics which are determined by a left continuous
t-norm, cf. [2.35].


Theorem 2.8 Standard Completeness 

The class of formulas which are provable in the sys- 
tem MTL coincides with the class of formulas which are 
logically valid in all t-algebras with a left continuous 
t-norm. 


This result again means, as the similar one for the
logic of continuous t-norms, that the variety of all
MTL-algebras is the smallest variety which contains all
t-algebras with a left continuous t-norm.

Also for MTL an extended completeness theorem
similar to Theorem 2.4 holds true. (The notions MTL + C
and MTL(C)-algebra are used analogously to the BL
case.)


Theorem 2.9 Strong General Completeness
For each set C of axiom schemata and any formula $\varphi$ of
$\mathcal{L}_T$ the following are equivalent:


i) $\varphi$ is derivable within the system MTL + C;
ii) $\varphi$ is valid in all MTL(C)-algebras;
iii) $\varphi$ is valid in all MTL(C)-chains.


For much more information on completeness mat- 
ters for different systems of fuzzy logic the reader may 
consult [2.36]. 


2.4.3 Extensions of MTL 


Because the BL-algebras are exactly the divisible
MTL-algebras, one gets another adequate axiomatization
of the basic t-norm logic BL.


Proposition 2.5


$$\mathrm{BL} = \mathrm{MTL} + \{\varphi \wedge \psi \to \varphi \,\&\, (\varphi \to \psi)\}.$$


Proof: Routine calculations in MTL-algebras give
$x * (x \rightarrowtail y) \le x$ and $x * (x \rightarrowtail y) \le y$, and hence the
inequality $x * (x \rightarrowtail y) \le x \cap y$. In those MTL-algebras
which are models of


$$\varphi \wedge \psi \to \varphi \,\&\, (\varphi \to \psi) \tag{2.41}$$


also the converse inequality holds true, hence even
$x * (x \rightarrowtail y) = x \cap y$. Thus the class of models of (2.41) is
the class of all BL-algebras. So the result follows from
the Completeness Theorem 2.1. □
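
The role of divisibility in this proof can be made concrete with a small computation. The Python sketch below tests the equation x * (x ⇒ y) = min(x, y) on a rational grid for the three basic continuous t-norms and for the nilpotent minimum, a left continuous t-norm that is not continuous. The grid, the dictionary `ALGEBRAS`, and the helper name `divisibility_counterexample` are illustrative assumptions, not part of the source.

```python
from fractions import Fraction
from itertools import product

ONE, ZERO = Fraction(1), Fraction(0)

# Each entry: (t-norm, residuum). All names are illustrative.
ALGEBRAS = {
    "Lukasiewicz": (lambda x, y: max(ZERO, x + y - 1),
                    lambda x, y: min(ONE, 1 - x + y)),
    "product": (lambda x, y: x * y,
                lambda x, y: ONE if x <= y else y / x),
    "Goedel": (lambda x, y: min(x, y),
               lambda x, y: ONE if x <= y else y),
    "nilpotent minimum": (lambda x, y: min(x, y) if x + y > 1 else ZERO,
                          lambda x, y: ONE if x <= y else max(1 - x, y)),
}

GRID = [Fraction(k, 10) for k in range(11)]

def divisibility_counterexample(t, r):
    """Return the first grid pair (x, y) with t(x, r(x, y)) != min(x, y), or None."""
    for x, y in product(GRID, repeat=2):
        # The inequality "<=" holds in every MTL-algebra.
        assert t(x, r(x, y)) <= min(x, y)
        if t(x, r(x, y)) != min(x, y):
            return (x, y)
    return None

for name, (t, r) in ALGEBRAS.items():
    print(name, divisibility_counterexample(t, r))
```

For the three continuous t-norms no counterexample is found, while the nilpotent minimum violates the equation, e.g., at x = 1/5, y = 1/10, which is why its algebra is an MTL-algebra but not a BL-algebra.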


Proposition 2.6


$$\text{Ł} = \mathrm{BL} + \{\neg\neg\varphi \to \varphi\}.$$


Proof: BL-algebras which also satisfy the equation
$(x \rightarrowtail 0) \rightarrowtail 0 = x$ can be shown to be MV-algebras,
cf. [2.5, 37]. And each MV-algebra is also a BL-algebra.
Hence BL + $\{\neg\neg\varphi \to \varphi\}$ is characterized by
the class of all MV-algebras, so it is Ł according to
Sect. 2.4.1. (There is also a syntactic proof available,
given in [2.23].) □


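Proposition 2.7


$$\Pi = \mathrm{BL} + \{\varphi \wedge \neg\varphi \to 0,\ \ \neg\neg\chi \to ((\varphi \,\&\, \chi \to \psi \,\&\, \chi) \to (\varphi \to \psi))\}.$$


Proof: This is essentially the original characterization
of the product logic Π as given in [2.3]. □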


Proposition 2.8


$$\mathrm{G} = \mathrm{BL} + \{\varphi \to \varphi \,\&\, \varphi\}.$$


Proof: The prelinear Heyting algebras are just those
BL-algebras for which the semigroup operation $*$
coincides with the lattice meet: $* = \cap$, cf. [2.5, 38], and
each such Heyting algebra is also a BL-algebra. So the result
follows again via Sect. 2.4.1. □

Similar remarks apply to further extensions of MTL. 


2.4.4 Logics of Particular t-Norms 


It is easy to recognize that two isomorphic (left
continuous) t-norms $T_1$, $T_2$ determine the same t-norm-based
logic: any order automorphism of [0, 1] which transforms
$T_1$ into $T_2$ according to (2.21) is an isomorphism
between the t-algebras $[0, 1]_{T_1}$ and $[0, 1]_{T_2}$.

A continuous t-norm T is called archimedean iff 
T(x, x) < x holds true for all 0 < x < 1. And a t-norm 
T has zero divisors iff there exist 0 < a,b such that 
T(a, b) = 0. 

It is well known that each continuous archimedean
t-norm with zero divisors is isomorphic to the
Łukasiewicz t-norm $T_{\text{Ł}}$, cf. [2.5, 20].
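
The two defining properties can be tested numerically for the basic t-norms. The following Python sketch checks the archimedean condition T(x, x) < x and searches for zero divisors on a finite rational grid; the grid and the function names `looks_archimedean` and `zero_divisors` are assumptions for illustration, and a finite grid can only suggest, not establish, the archimedean property.

```python
from fractions import Fraction

def t_luk(x, y):
    """Lukasiewicz t-norm."""
    return max(Fraction(0), x + y - 1)

def t_prod(x, y):
    """Product t-norm."""
    return x * y

def t_min(x, y):
    """Goedel (minimum) t-norm."""
    return min(x, y)

GRID = [Fraction(k, 20) for k in range(21)]
INNER = [x for x in GRID if 0 < x < 1]   # grid points strictly between 0 and 1

def looks_archimedean(t):
    """Check T(x, x) < x at all inner grid points."""
    return all(t(x, x) < x for x in INNER)

def zero_divisors(t):
    """All pairs of positive inner grid points whose t-norm value is 0."""
    return [(a, b) for a in INNER for b in INNER if t(a, b) == 0]

for name, t in [("Lukasiewicz", t_luk), ("product", t_prod), ("Goedel", t_min)]:
    print(name, looks_archimedean(t), bool(zero_divisors(t)))
```

On this grid the Łukasiewicz t-norm passes both tests, the product t-norm passes the archimedean test without having zero divisors, and the minimum t-norm fails the archimedean test because T(x, x) = x.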


Proposition 2.9

Each t-norm-based logic which is determined by
a continuous archimedean t-norm with zero divisors
has the same axiomatizations as the infinite-valued
Łukasiewicz logic Ł.


Furthermore, a continuous t-norm T is called strict
iff it is strictly monotone, i.e., satisfies for all $z \ne 0$


$$x < y \iff T(x, z) < T(y, z).$$


Again it is well known that each strict continuous
t-norm is isomorphic to the product t-norm $T_P$, cf. [2.5,
20].


Proposition 2.10

Each t-norm-based logic which is determined by a strict
continuous t-norm has the same axiomatizations as the
infinite-valued product logic Π.


But there is a general solution of the axiomatization
problem of those t-norm-based logics which are
determined by a continuous t-norm.

In [2.39], Esteva et al. study the variety $\mathbb{BL}$ of all
BL-algebras. They prove that each of its subvarieties
which is generated by a single t-algebra over [0, 1],
with T a continuous t-norm, is finitely axiomatizable.
Additionally, they provide an algorithm to determine these
finitely many axioms.


So the following main result is reached: 


Theorem 2.10

Each t-norm-based fuzzy logic $\mathcal{L}_T$ determined by a
continuous t-norm T is a finite axiomatic extension of the
basic fuzzy logic BL.


For left continuous t-norms a similar result is lacking.


2.4.5 Extensions to First-Order Logics 


The extensions of these propositional logics to
first-order ones follow the standard lines of approach: one
has to start from a first-order language $\mathcal{L}$ (with the two
standard quantifiers $\forall$, $\exists$) and a suitable residuated
lattice A, and has to define A-interpretations $\mathcal{M}$ by fixing
a nonempty domain $M = |\mathcal{M}|$ and by assigning to each
predicate symbol of $\mathcal{L}$ an A-valued relation in M (of
suitable arity) and to each constant an element from (the
support of) A.

Usually one supposes that the first-order language
$\mathcal{L}$ has only predicate symbols and no function symbols.
The insertion of function symbols proves to be a delicate
matter, essentially because it is not completely
clear what the basic properties of the identity predicate
should be. The core problem is whether such an identity
relation should be a crisp one or should be really
graded. The paper [2.40] also surveys these problems
of the identity relation.

The satisfaction relation is defined in the standard
way. The quantifiers $\forall$ and $\exists$ are interpreted as taking
the infimum or supremum, respectively, of all the values
of the relevant instances.
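
For a finite domain the infimum and supremum reduce to minimum and maximum, which gives a minimal Python sketch of this quantifier semantics. The domain, the fuzzy predicate `small`, and the helper names `forall` and `exists` are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

# Hypothetical fuzzy predicate "n is small" on a finite domain.
DOMAIN = [1, 2, 3, 4]
small = {n: Fraction(1, n) for n in DOMAIN}

def forall(values):
    """Truth degree of a universally quantified formula: the infimum of the instances."""
    return min(values)

def exists(values):
    """Truth degree of an existentially quantified formula: the supremum of the instances."""
    return max(values)

print(forall(small.values()))   # degree of "all n are small"
print(exists(small.values()))   # degree of "some n is small"

# Over an infinite domain the infimum need not be attained: the degrees 1/n
# (n = 1, 2, ...) have infimum 0, yet every single instance has a positive degree.
partial_minima = [min(Fraction(1, n) for n in range(1, N + 1)) for N in (10, 100, 1000)]
print(partial_minima)
```

The last lines hint at why infinite domains need care: the partial minima shrink towards 0 without any instance ever taking the value 0.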

To be sure that this approach works well one
has either to suppose that the underlying lattices of
the interpretations are complete lattices, or at least
that all the necessary infima and suprema do exist in
these lattices. Interpretations over lattices which
satisfy this last mentioned condition are called safe by
Hájek [2.23].

For the logic BL of continuous t-norms, Hájek [2.23]
added the axioms:


($\forall$1) $(\forall x)\varphi(x) \to \varphi(t)$, where t is substitutable for x
in $\varphi$,

($\exists$1) $\varphi(t) \to (\exists x)\varphi(x)$, where t is substitutable for x
in $\varphi$,

($\forall$2) $(\forall x)(\chi \to \varphi) \to (\chi \to (\forall x)\varphi)$, where x is not free
in $\chi$,

($\exists$2) $(\forall x)(\varphi \to \chi) \to ((\exists x)\varphi \to \chi)$, where x is not free
in $\chi$,

($\forall$3) $(\forall x)(\chi \vee \varphi) \to \chi \vee (\forall x)\varphi$, where x is not free in $\chi$


and the rule of generalization to the propositional
calculus, yielding the system BL$\forall$.

Then he was able to prove the following complete- 
ness theorem. 


Theorem 2.11 General Chain Completeness
A first-order formula $\varphi$ is BL$\forall$-provable iff it is valid in
all safe interpretations over BL-chains.


This result can be extended to a lot of other
first-order fuzzy logics, e.g., to MTL$\forall$.

We will not discuss further completeness results
here but refer to the extended survey [2.40]. But it
should be mentioned that, as suprema are not always
maxima and infima not always minima, the truth degree
of an existentially/universally quantified formula
may not be the maximum/minimum of the truth degrees
of the instances. It is, however, interesting to have
conditions which characterize models in which the truth
degree of each existentially/universally quantified
formula is witnessed as the truth degree of an instance.
Cintula and Hájek [2.41] study this problem.

The topic of first-order fuzzy logics with identity 
deserves some attention. The core problem is, as in any 
many-valued logic, whether the identity symbol should 
be interpreted by the standard, i. e., two-valued identity 
relation, or whether one should allow for graded iden- 
tity relations inside the interpretations. 

Direct translations of the identity axioms of classical
first-order logic into, e.g., the language of the
Łukasiewicz systems force that the interpretation of the
identity symbol has to be the standard identity relation,
cf. [2.42]. Similarly, for a wide class of first-order fuzzy
logics the addition of the axioms:


(Id1) $x \approx y \vee \neg\, x \approx y$,

(Id2) $x \approx x$,

(Id3) $x \approx y \to (\varphi(x, z) \to \varphi(y, z))$, for y substitutable for
x in $\varphi$,


forces that the identity symbol $\approx$ can only be understood
as meaning standard identity. A general completeness
theorem like Theorem 2.11 remains valid in this
case too, cf. [2.40].

For the case of the Łukasiewicz logics, however,
a slight modification of the standard identity axioms,
particularly of the Leibniz schema, as given in [2.43],
allows for graded identity relations, cf. also [2.5]. For
fuzzy logics in general, similarity relations, i.e., graded
equivalence relations, offer such an approach [2.23,
44]. For the restricted case of Horn formulas an approach
is offered by Bělohlávek and Vychodil [2.45,
46]. They consider a first-order language with function
symbols and the identity symbol $\approx$ as the only predicate
symbol. Their models for sets of Horn formulas
therefore have to be algebraic structures with graded
identity relations. However, the aim of these authors
is not to develop an identity logic; they are mainly
interested in using the approach to characterize classes
of algebraic structures with graded identity relations
and to find fuzzified versions of results from universal
algebra.

These authors even consider fuzzy sets of Horn formulas,
i.e., they work in a Pavelka-style fuzzy logic as
explained later in Sect. 2.6.1. But because this type of
approach can be mirrored in standard fuzzy logics with
sufficiently many truth degree constants (Sect. 2.6.1),
this approach is already discussed here.


2.5 Some Generalizations


The standard approach toward t-norm-based logics, as
explained in Sects. 2.4.1 and 2.4.2, has been modified
in various ways. The main background ideas are the
extension or the modification of the expressive power of
these logical systems.


2.5.1 Adding a Projection Operator


A first, quite fundamental addition to the standard
vocabulary of the languages of t-norm-based systems was
proposed in [2.47]: a unary propositional operator $\Delta$,
also known as Baaz' Delta, which has for t-algebras the
semantics


$$\Delta(x) = \begin{cases} 1 & \text{for } x = 1, \\ 0 & \text{for } x \ne 1. \end{cases} \tag{2.42}$$


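This unary connective can be added to the systems BL
and MTL via the additional axioms


($\Delta$1) $\Delta\varphi \vee \neg\Delta\varphi$,
($\Delta$2) $\Delta(\varphi \vee \psi) \to (\Delta\varphi \vee \Delta\psi)$,
($\Delta$3) $\Delta\varphi \to \varphi$,
($\Delta$4) $\Delta\varphi \to \Delta\Delta\varphi$,
($\Delta$5) $\Delta(\varphi \to \psi) \to (\Delta\varphi \to \Delta\psi)$.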


This addition leaves all the essential theoretical results,
like correctness and completeness theorems, valid, of
course w.r.t. suitably expanded algebraic structures.
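
As a plausibility check of the semantics of Baaz' Delta, the following Python sketch evaluates the five Delta axioms, in their usual formulation, in the standard Łukasiewicz algebra over a finite rational grid. Choosing the Łukasiewicz residuum, the grid, and the names `delta`, `imp`, `neg`, and `DELTA_AXIOMS` are assumptions for illustration; any other t-algebra could be substituted.

```python
from fractions import Fraction
from itertools import product

ONE, ZERO = Fraction(1), Fraction(0)

def imp(x, y):
    """Lukasiewicz residuum (an illustrative choice of t-algebra)."""
    return min(ONE, 1 - x + y)

def neg(x):
    """Negation defined as x -> 0."""
    return imp(x, ZERO)

def delta(x):
    """Baaz' Delta: 1 at the top element, 0 everywhere else."""
    return ONE if x == 1 else ZERO

# Truth functions of the Delta axioms; max is the lattice disjunction.
DELTA_AXIOMS = {
    1: lambda p, q: max(delta(p), neg(delta(p))),
    2: lambda p, q: imp(delta(max(p, q)), max(delta(p), delta(q))),
    3: lambda p, q: imp(delta(p), p),
    4: lambda p, q: imp(delta(p), delta(delta(p))),
    5: lambda p, q: imp(delta(imp(p, q)), imp(delta(p), delta(q))),
}

GRID = [Fraction(k, 10) for k in range(11)]
failing = [n for n, ax in DELTA_AXIOMS.items()
           if any(ax(p, q) != 1 for p, q in product(GRID, repeat=2))]
print(failing)
```

An empty list indicates that none of the axioms takes a value below 1 on the grid, in line with their intended validity.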


2.5.2 Adding an Idempotent Negation 


A second stream of papers discusses the addition of
an idempotent negation, i.e., a negation which satisfies
the double negation law, for those cases where
the standard negation of the t-norm-based system is
not idempotent. This is, e.g., the case for the product
logic which, as explained at the end of Sect. 2.1.3,
has the Gödel negation (2.2) as its standard negation.
By the way, it should be noticed that (routine
calculations show that) this nonidempotent Gödel negation
is the standard negation of all those t-norm algebras
with a t-norm $\otimes$ which does not have zero
divisors. A very general approach is given in [2.48],
and a more particular axiomatization problem is discussed
in [2.49].


2.5.3 Logics with Additional Strong Conjunctions


A third stream of papers, partly related to the previously
mentioned one, is devoted to the problem of
a unified treatment of different, usually two, t-norms
and their related connectives within one logical system.
Here the focus is on the join of the systems
based on the Łukasiewicz t-norm and on the product
t-norm. The great advantage of this unification is that
the Łukasiewicz t-norm essentially allows one to treat
addition, as may be seen from the truth degree function
(2.6) of the Łukasiewicz (arithmetical) disjunction, and
that the product t-norm adds the treatment of the usual
product. This means that elementary arithmetic
(in the unit interval) can be discussed in this combined
system. This combined system has been considered in
two strongly related forms, denoted by ŁΠ and ŁΠ½.
The distinction between both systems is that ŁΠ has
both t-norms $\&$ and $\odot$ and their related (residual)
implications and negations among their basic connectives,
and that ŁΠ½ adds a truth degree constant for the truth
degree $\tfrac{1}{2}$. These two systems are discussed in detail
in [2.50-54].


2.5.4 Logics Without a Truth Degree Constant


A fourth stream of papers intends to weaken the systems
BL and MTL in such a way that one deletes the
explicit reference to the truth degree constant 0 and
considers the falsity-free fragments of the previous systems.
From the algebraic point of view their characteristic
structures become the hoops, which in general are defined
as algebraic structures $\mathbf{H} = (H, *, \Rightarrow, 1)$ such that
$(H, *, 1)$ is a commutative monoid and that the further
binary operation $\Rightarrow$ satisfies the equations


$$x \Rightarrow x = 1,$$
$$x * (x \Rightarrow y) = y * (y \Rightarrow x),$$
$$(x * y) \Rightarrow z = x \Rightarrow (y \Rightarrow z).$$


The definition


$$x \sqsubseteq y \iff x \Rightarrow y = 1$$


provides an ordering $\sqsubseteq$ with the universal upper bound
1 which makes $(H, *, 1)$ an ordered monoid, and which
has the additional property that the operations $*$, $\Rightarrow$
become an adjoint pair w.r.t. this ordering.
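
A concrete hoop without a least element is obtained, for instance, from the product t-norm restricted to the half-open interval (0, 1]. The Python sketch below tests the three hoop equations and the adjointness property on a finite sample of that interval; the sample `H` and the names `mul`, `imp`, `leq`, and `hoop_equations_hold` are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

ONE = Fraction(1)

def mul(x, y):
    """Monoid operation *: here the usual product on (0, 1]."""
    return x * y

def imp(x, y):
    """Residuum => of the product on (0, 1]; x is never 0 here."""
    return ONE if x <= y else y / x

def leq(x, y):
    """Induced ordering: x is below y iff x => y = 1."""
    return imp(x, y) == 1

H = [Fraction(k, 12) for k in range(1, 13)]   # finite sample of (0, 1]

def hoop_equations_hold():
    for x, y, z in product(H, repeat=3):
        if imp(x, x) != 1:
            return False
        if mul(x, imp(x, y)) != mul(y, imp(y, x)):
            return False
        if imp(mul(x, y), z) != imp(x, imp(y, z)):
            return False
        # Adjointness of * and => w.r.t. the induced ordering.
        if leq(mul(x, z), y) != leq(z, imp(x, y)):
            return False
    return True

print(hoop_equations_hold())
```

The sample is not closed under the operations, which is harmless because both operations are defined on all of (0, 1]; exact rationals avoid rounding issues in the equality tests.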

In particular, hoops with the additional property


$$(x \Rightarrow y) \Rightarrow z \sqsubseteq ((y \Rightarrow x) \Rightarrow z) \Rightarrow z$$


can in a natural way be generated from t-algebras with
continuous t-norms, as has been shown in [2.55]. So one
has a kind of competing generalization of t-algebras.
And for this kind of algebraic semantics, one can find
adequate axiomatizations for the corresponding hoop
logics quite similar to the approaches of Sects. 2.4.1
and 2.4.2. The details have been developed in [2.56].


2.5.5 Logics with a Noncommutative Strong Conjunction


And a fifth stream discusses the generalization of the
algebraic semantics from the case of commutative
lattice-ordered monoids with residuation to the case of
noncommutative lattice-ordered semigroups. In this context,
one tries to define noncommutative BL-algebras or
noncommutative MTL-algebras, and similarly defines
noncommutative t-norms, also called pseudo-t-norms.
And these considerations become combined with the
design of an adequate axiomatization, with similar results
as in Sects. 2.4.1 and 2.4.2. Important papers on
this topic are [2.57-63].

And finally it should be mentioned that Hájek [2.64]
even gives a common generalization of all of these
generalized fuzzy logics, thus giving up divisibility, the
falsity constant, and commutativity. The corresponding
algebras are called fleas (or flea algebras), and the logic
is the flea logic FIL. There are examples of fleas on (0, 1]
satisfying neither divisibility nor commutativity, and
having no least element.


2.6 Extensions with Graded Notions of Inference 


The systems of t-norm-based logics discussed up to
now have been designed to formalize the logical
background for fuzzy sets, and they have degrees of truth for
their formulas. But they all have crisp notions of
consequence, i.e., of entailment and of provability.

Having in mind that fuzzy logics, also in their form 
as formalized logical systems, should be a (mathemat- 
ical) tool for approximate reasoning makes it desirable 
that they should be able to deal with graded inferences 
too. This means inferences which start from fuzzy sets 
of formulas, and offer consequence hulls which again 
are fuzzy sets of formulas. 


2.6.1 Pavelka-Style Approaches 


This problem was first treated by Pavelka [2.65-67].
The basic monograph elaborating this approach
is [2.44]. Accordingly, such approaches are sometimes
called Pavelka-style, but with emphasis on the syntactic
side of the matter they have also been called
approaches with evaluated syntax. Here we will call them
GI-approaches.

Such an approach with graded inferences has to deal
with fuzzy sets $\Sigma^{\sim}$ of formulas, i.e., besides formulas
$\varphi$ also with their membership degrees $\Sigma^{\sim}(\varphi)$ in $\Sigma^{\sim}$.
And these membership degrees are just the truth degrees.
We may assume that these degrees again form
a residuated lattice $\mathbf{L} = (L, \cap, \cup, *, \rightarrowtail, 0, 1)$. Thus
we (slightly) generalize the standard notion of fuzzy
set (with membership degrees from the real unit
interval). Therefore, the appropriate language has the
same logical connectives as in the previous
considerations.

A GI-approach is an easy matter as long as the
entailment relationship is considered. An evaluation e is
a model of a fuzzy set $\Sigma^{\sim}$ of formulas iff


$$\Sigma^{\sim}(\varphi) \le e(\varphi) \tag{2.43}$$


holds for each formula $\varphi$. This immediately yields that
the semantic consequence hull of $\Sigma^{\sim}$ should be
characterized by the membership degrees


$$C^{\mathrm{sem}}(\Sigma^{\sim})(\psi) = \bigwedge \{ e(\psi) \mid e \text{ model of } \Sigma^{\sim} \} \tag{2.44}$$


for each formula $\psi$.

For a syntactic characterization of this entailment
relation, it is necessary to have some calculus $\mathbb{K}$ which
treats formulas of the language together with truth
degrees. So the language of this calculus has to extend
the language of the basic logical system by having
also symbols for the truth degrees. We indicate
these symbols by overlined letters like $\overline{a}$, $\overline{c}$, and realize
the common treatment of formulas and truth degrees
by considering evaluated formulas, i.e., ordered pairs
$(\overline{a}, \varphi)$ consisting of a truth degree symbol and a formula.
This transforms each fuzzy set $\Sigma^{\sim}$ of formulas
into a (usual) set of evaluated formulas, again denoted
by $\Sigma^{\sim}$.

So $\mathbb{K}$ has to allow one to derive evaluated formulas
from sets of evaluated formulas, using suitable axioms
and rules of inference. These axioms are usually only
formulas $\varphi$ which, however, are used in the derivations
as the corresponding evaluated formulas $(\overline{1}, \varphi)$.
The rules of inference have to deal with evaluated
formulas.

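Each $\mathbb{K}$-derivation of an evaluated formula $(\overline{a}, \varphi)$
counts as a derivation of $\varphi$ to the degree $a \in L$. The
provability degree of $\varphi$ from $\Sigma^{\sim}$ in $\mathbb{K}$ is the supremum
over all these degrees. The syntactic consequence hull
of $\Sigma^{\sim}$ is the fuzzy set $C^{\mathrm{syn}}_{\mathbb{K}}(\Sigma^{\sim})$ of formulas characterized
by the membership function


$$C^{\mathrm{syn}}_{\mathbb{K}}(\Sigma^{\sim})(\psi) = \bigvee \{ a \in L \mid \mathbb{K} \text{ derives } (\overline{a}, \psi) \text{ out of } \Sigma^{\sim} \} \tag{2.45}$$


for each formula $\psi$.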

Despite the fact that $\mathbb{K}$ is a standard calculus for
evaluated formulas, this is, for infinite truth degree
structures, an infinitary notion of provability for usual
formulas.

For the infinite-valued Łukasiewicz logic Ł, this
machinery works particularly well because it needs in an
essential way the continuity of the residuation operation.
The corresponding calculus $\mathbb{K}_{\text{Ł}}$ has as axioms any
axiom system of the infinite-valued Łukasiewicz logic
Ł which provides together with the rule of detachment
an adequate axiomatization of Ł, but $\mathbb{K}_{\text{Ł}}$ replaces this
standard rule of detachment by the generalized form


$$\frac{(\overline{a}, \varphi) \qquad (\overline{c}, \varphi \to \psi)}{(\overline{a * c}, \psi)} \tag{2.46}$$


for evaluated formulas.

The soundness result then says that the $\mathbb{K}_{\text{Ł}}$-provability
of an evaluated formula $(\overline{a}, \varphi)$ means that
$a \le e(\varphi)$ holds for every valuation e. And this just
means that the formula $\overline{a} \to \varphi$ is valid; however, as
a formula of an extended propositional language which
has all the truth degree constants among its vocabulary.
Of course, for this extended language the evaluations e
have to satisfy $e(\overline{a}) = a$ for each $a \in [0, 1]$.

The soundness and completeness results for $\mathbb{K}_{\text{Ł}}$ say
that a strong completeness theorem holds true, giving


$$C^{\mathrm{sem}}(\Sigma^{\sim})(\psi) = C^{\mathrm{syn}}_{\mathbb{K}_{\text{Ł}}}(\Sigma^{\sim})(\psi) \tag{2.47}$$


for each formula $\psi$ and each fuzzy set $\Sigma^{\sim}$ of formulas.
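
The interplay of graded detachment and graded entailment can be traced on a tiny example. The Python sketch below applies the generalized detachment rule once to two evaluated premises and compares the resulting degree with a brute-force semantic degree computed over a finite Łukasiewicz chain. The encoding of evaluated formulas as tuples, the five-element chain, and the names `detach`, `premise_p`, and `premise_pq` are assumptions for illustration; a single rule application yields only a lower bound for the provability degree, and the finite chain is a stand-in for the real unit interval.

```python
from fractions import Fraction
from itertools import product

def t_luk(a, c):
    """Lukasiewicz t-norm, used to combine the degrees of the premises."""
    return max(Fraction(0), a + c - 1)

def imp(x, y):
    """Lukasiewicz residuum, the truth function of the implication."""
    return min(Fraction(1), 1 - x + y)

def detach(ev_phi, ev_imp):
    """Generalized detachment: from (a, phi) and (c, phi -> psi) infer (a * c, psi)."""
    (a, phi), (c, (antecedent, consequent)) = ev_phi, ev_imp
    assert antecedent == phi, "antecedent must match the first premise"
    return (t_luk(a, c), consequent)

# Evaluated premises; an implication is encoded as (antecedent, consequent).
premise_p = (Fraction(1, 2), "p")
premise_pq = (Fraction(3, 4), ("p", "q"))
derived = detach(premise_p, premise_pq)
print(derived)

# Brute-force semantic check on the chain {0, 1/4, 1/2, 3/4, 1}:
# the least value of q among all models of the two premises.
CHAIN = [Fraction(k, 4) for k in range(5)]
models = [(p, q) for p, q in product(CHAIN, repeat=2)
          if p >= premise_p[0] and imp(p, q) >= premise_pq[0]]
semantic_degree = min(q for _, q in models)
print(semantic_degree)
```

In this example the degree obtained by one application of the rule coincides with the semantic degree on the chain, which is the kind of agreement that the completeness statement expresses in general.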

If one takes the previously mentioned turn and extends
the standard language of propositional Ł by truth
degree constants for all degrees $a \in [0, 1]$, and if one
reads each evaluated formula $(\overline{a}, \varphi)$ as the formula
$\overline{a} \to \varphi$, then a slight modification $\mathbb{K}_{\text{Ł}}^{+}$ of the former calculus
$\mathbb{K}_{\text{Ł}}$ again provides an adequate axiomatization: one has
to add the bookkeeping axioms


$$(\overline{a} \,\&\, \overline{c}) \equiv \overline{a * c},$$


$$(\overline{a} \to \overline{c}) \equiv \overline{a \rightarrowtail c},$$


as explained, e.g., in [2.44]. And if one is interested in
having evaluated formulas together with the extension of
the language by truth degree constants, one has also to
add the logical constant introduction rule


$$\frac{(\overline{a}, \varphi)}{\overline{a} \to \varphi}.$$


However, even a stronger result is available which
refers only to a notion of derivability over a countable
language. The completeness result (2.47), for $\mathbb{K}_{\text{Ł}}^{+}$
instead of $\mathbb{K}_{\text{Ł}}$, becomes already provable if one adds truth
degree constants only for all the rationals in [0, 1], as
was shown in [2.23]. And this extension of Ł, known
as Rational Pavelka Logic, is even a conservative one,
cf. [2.68], i.e., $\mathbb{K}_{\text{Ł}}^{+}$ proves only such constant-free
formulas of the language with rational constants which
are already provable in the standard infinite-valued
Łukasiewicz logic Ł.

So the GI-approach with graded notions of provability
and entailment can suitably be mirrored inside
standard fuzzy logics with sufficiently many truth degree
constants.

For more details the reader may also consult, 
e.g., [2.23, 44, 69]. 


2.6.2 A Lattice Theoretic Approach 


For completeness, we also mention a much more abstract
approach toward fuzzy logics with graded notions
of entailment than the one explained previously for the
t-norm-based fuzzy logics.

The background for this generalization by Gerla,
explained in detail in [2.70], is that (already) in systems of
classical logic the syntactic as well as the semantic
consequence relations, i.e., the provability as well as the
entailment relations, are closure operators within the set
of formulas. This is a fundamental observation made by
Tarski [2.71] already in 1930. And the same holds true
for the Pavelka-style extensions of Sect. 2.6.1 and the
operators $C^{\mathrm{sem}}$ and $C^{\mathrm{syn}}$ introduced in (2.44) and (2.45),
respectively: they are generalized closure operators.

The context chosen in [2.70] is that of L-fuzzy sets,
with $\mathbf{L} = (L, \le)$ an arbitrary complete lattice. A closure
operator in $\mathbf{L}$ is a mapping $J : L \to L$ satisfying for
arbitrary $x, y \in L$ the well-known conditions


$$x \le J(x) \quad \text{(increasingness)},$$
$$x \le y \Rightarrow J(x) \le J(y) \quad \text{(isotonicity)},$$
$$J(J(x)) = J(x) \quad \text{(idempotency)}.$$


And a closure system in $\mathbf{L}$ is a subclass $C \subseteq L$ which is
closed under arbitrary lattice meets.
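
The three conditions and their link to closure systems can be checked mechanically on a small complete lattice. The Python sketch below uses the power set of a four-element set, ordered by inclusion, together with a hypothetical operator `J` that closes a set upward from its least element; the lattice, the operator, and the helper names are assumptions for illustration.

```python
from itertools import combinations

UNIVERSE = frozenset(range(4))
# The lattice: all subsets of UNIVERSE, ordered by inclusion.
SUBSETS = [frozenset(c) for n in range(5) for c in combinations(sorted(UNIVERSE), n)]

def J(x):
    """Hypothetical closure: close X under the rule 'k in X implies k + 1 in X'."""
    return frozenset(m for m in UNIVERSE if x and m >= min(x))

def is_closure_operator(op):
    increasing = all(x <= op(x) for x in SUBSETS)
    isotone = all(op(x) <= op(y) for x in SUBSETS for y in SUBSETS if x <= y)
    idempotent = all(op(op(x)) == op(x) for x in SUBSETS)
    return increasing and isotone and idempotent

# The fixed points of a closure operator form a closure system:
# they are closed under meets, which here are intersections.
closed_sets = {x for x in SUBSETS if J(x) == x}
meet_closed = all((a & b) in closed_sets for a in closed_sets for b in closed_sets)

print(is_closure_operator(J), meet_closed, len(closed_sets))
```

The fixed points of `J` are the empty set and the four upper segments of the universe, and the check confirms that they are closed under intersections, with the whole universe playing the role of the empty meet.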

For fuzzy logic such closure operators and closure
systems are considered in the lattice $\mathcal{F}_L(F)$ of all fuzzy
subsets of the set F of formulas of some suitable
formalized language.

An abstract fuzzy deduction system now is an ordered
pair $\mathcal{D} = (\mathcal{F}_L(F), D)$ determined by a closure
operator D in the lattice $\mathcal{F}_L(F)$. And the fuzzy theories T
of such an abstract fuzzy deduction system, also called
D-theories, are the fixed points of D: $T = D(T)$, i.e.,
the deductively closed fuzzy sets of formulas.

A rather abstract setting is also chosen for the semantics
of such an abstract fuzzy deduction system: an
abstract fuzzy semantics $\mathcal{M}$ is nothing but a class of
elements of the lattice $\mathcal{F}_L(F)$, i.e., a class of fuzzy sets
of formulas. These fuzzy sets of formulas are called
models. The only restriction is that the universal set
over F, i.e., the fuzzy subset of F which has always
membership degree one, is not allowed as a model.
The background idea here is that, for each standard
interpretation $\mathfrak{A}$ (in the sense of many-valued logic,
including an evaluation of the individual variables) for
the formulas of F, a model M is determined as the
fuzzy set which has for each formula $\varphi \in F$ the truth
degree of $\varphi$ in $\mathfrak{A}$ as membership degree. Accordingly,
the satisfaction relation $\models_{\mathcal{M}}$ coincides with inclusion:
for models $M \in \mathcal{M}$ and fuzzy sets $\Sigma$ of formulas one
has


$$M \models_{\mathcal{M}} \Sigma \iff \Sigma \subseteq M. \tag{2.48}$$


In this setting, one has a semantic and a syntactic
consequence operator, both being closure operators,
i.e., one has for each fuzzy set $\Sigma$ of formulas from F
a semantic as well as a syntactic consequence hull,
given by


$$C^{\mathrm{sem}}(\Sigma) = \bigcap \{ M \in \mathcal{M} \mid M \models_{\mathcal{M}} \Sigma \}, \qquad C^{\mathrm{syn}}(\Sigma) = D(\Sigma). \tag{2.49}$$


Similar to the classical case one has $C^{\mathrm{sem}}(M) = M$ for
each model $M \in \mathcal{M}$, i.e., each such model provides
a $C^{\mathrm{sem}}$-theory.

However, a general completeness theorem is not
available. What one needs instead, in the search for a
completeness result, are specifications which restrict the
full generality of this approach, and lead mainly back
to situations which have been discussed in the previous
sections.


2.7 Some Complexity Results


Each (left-continuous) t-norm T determines four important
sets of formulas:


• 1TAUT(T): The set of all 1-tautologies.

• posTAUT(T): The set of all positive tautologies.

• 1SAT(T): The set of all 1-satisfiable formulas.

• posSAT(T): The set of all positively satisfiable formulas.


Here a 1-tautology is a formula valid in $[0, 1]_T$, i.e.,
having for each evaluation of propositional variables by
elements of [0, 1] the value 1 in $[0, 1]_T$. And a positive
tautology is a formula which has for each evaluation
a positive value in $[0, 1]_T$. Similarly, 1-satisfiability
means to have the $[0, 1]_T$-value 1 for some evaluation,
and positive satisfiability means to have for some
evaluation a positive $[0, 1]_T$-value.

In the same way, one defines analogous sets corresponding
to sets of t-norms; in particular, with BL referring
to the set of all continuous t-norms, one defines


$$1\mathrm{TAUT}(\mathrm{BL}) = \bigcap \{ 1\mathrm{TAUT}(T) \mid T \text{ a continuous t-norm} \},$$
$$\mathrm{posTAUT}(\mathrm{BL}) = \bigcap \{ \mathrm{posTAUT}(T) \mid T \text{ a continuous t-norm} \},$$


and similarly for the satisfiability cases.


There are interesting results on the computational
complexity of these sets. It was already shown in [2.23]
that if the t-norm T is $T_{\text{Ł}}$, $T_G$, or $T_P$, then
1TAUT(T) and posTAUT(T) are co-NP-complete, and
1SAT(T) and posSAT(T) are NP-complete. This result
was partly strengthened in [2.34], yielding that 1TAUT(T)
is co-NP-complete for each continuous t-norm T.

The corresponding results have been proved 
in [2.72, 73] for the logic BL of continuous t-norms. 


Theorem 2.12 
1TAUT(BL) and posTAUT(BL) are co-NP-complete, and 
1SAT(BL) as well as posSAT(BL) are NP-complete. 


Furthermore, there are several results on equality or
inequality among the sets involved [2.23, 73]. So one
has, e.g.,


1SAT(G) = posSAT(G) = 1SAT(Π) = posSAT(Π),

but also

1SAT(Ł) ≠ posSAT(Ł)

and

posSAT(BL) = posSAT(Ł).
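
The difference between 1-satisfiability and positive satisfiability in Ł can be seen on the single formula p ∧ ¬p, with ∧ read as the lattice meet. The Python sketch below searches a rational grid for the largest value of this formula under the standard negations of the three basic t-norms; the choice of formula, the grid, and the names `NEGATIONS` and `best_value` are assumptions for illustration, and a grid search is in general only a heuristic, although for this formula the values are easily confirmed by hand.

```python
from fractions import Fraction

ONE, ZERO = Fraction(1), Fraction(0)

# Standard negations (x -> 0) of the three basic continuous t-norms.
NEGATIONS = {
    "Lukasiewicz": lambda x: 1 - x,
    "Goedel": lambda x: ONE if x == 0 else ZERO,
    "product": lambda x: ONE if x == 0 else ZERO,
}

GRID = [Fraction(k, 100) for k in range(101)]

def best_value(neg):
    """Largest value of 'p and not p' over the grid, with 'and' the lattice meet."""
    return max(min(x, neg(x)) for x in GRID)

for name, neg in NEGATIONS.items():
    v = best_value(neg)
    print(name, "max value:", v,
          "| positively satisfiable:", v > 0,
          "| 1-satisfiable:", v == 1)
```

Under the Łukasiewicz negation the formula reaches the value 1/2 but never 1, so it is positively satisfiable without being 1-satisfiable; under the Gödel negation, which is also the standard negation of the product logic, its value is always 0.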


For 1-tautologicity, the papers [2.74-76] contain
interesting results relating the property of a formula
being a 1-tautology of one of the logics Ł, G, Π to
the property of being a 1-tautology of finitely many
finite-valued logics of estimated complexity, and similar
results for 1TAUT(BL). For example, $\varphi$ is in 1TAUT(Ł)
if and only if it is a 1-tautology of the finitely valued
Łukasiewicz logic Ł$_m$ for m being $2^{\#(\varphi)}$, where $\#(\varphi)$ is
the number of occurrences of variables in $\varphi$.

Recall that for predicate logics $L_{\mathcal{K}}\forall$, the general
models are safe interpretations over any linearly
ordered $L_{\mathcal{K}}$-algebra, and the standard models (for
t-norm-based logics) are interpretations over any
t-algebra which is also an $L_{\mathcal{K}}$-algebra.


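Definition 2.4
Let $\varphi$ be a closed formula of the language of $L_{\mathcal{K}}\forall$:


1. $\varphi$ is a general $L_{\mathcal{K}}\forall$-tautology if $\varphi$ is valid in each
safe interpretation over any linearly ordered $L_{\mathcal{K}}$-algebra;
genTAUT($L_{\mathcal{K}}\forall$) is the set of all general
$L_{\mathcal{K}}\forall$-tautologies.

2. $\varphi$ is a standard $L_{\mathcal{K}}\forall$-tautology if $\varphi$ is valid in
each safe interpretation over any standard $L_{\mathcal{K}}$-algebra;
stTAUT($L_{\mathcal{K}}\forall$) is the set of all standard
$L_{\mathcal{K}}\forall$-tautologies.

3. $\varphi$ is $L_{\mathcal{K}}\forall$-satisfiable if $\varphi$ is valid in some safe
interpretation over some linearly ordered $L_{\mathcal{K}}$-algebra;
genSAT($L_{\mathcal{K}}\forall$) is the set of all $L_{\mathcal{K}}\forall$-satisfiable
sentences.

4. $\varphi$ is standardly $L_{\mathcal{K}}\forall$-satisfiable if $\varphi$ is valid
in some interpretation over some standard $L_{\mathcal{K}}$-algebra;
stSAT($L_{\mathcal{K}}\forall$) is the set of all standardly
$L_{\mathcal{K}}\forall$-satisfiable formulas.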


It was already shown in [2.23] that if L is the
logic BL or one of its specifications Ł, G, Π, then
genTAUT(L$\forall$) is $\Sigma_1$-complete and genSAT(L$\forall$) is
$\Pi_1$-complete. And this result has been extended in [2.77] to
any t-norm-based logic $L_T\forall$ for a continuous t-norm T.

For standard semantics the situation is different.
Already Ragaz [2.78, 79] proved that the set stTAUT(Ł$\forall$) of
standard tautologies of the infinite-valued Łukasiewicz
logic is $\Pi_2$-complete. Generalizing this, Hájek [2.23]
also showed that stTAUT(G$\forall$) = genTAUT(G$\forall$) and that
therefore the set stTAUT(G$\forall$) of standard tautologies of
the infinite-valued Gödel logic is $\Sigma_1$-complete.

These results have been considerably extended by 
Montagna [2.80] yielding the following facts. 


Theorem 2.13


1. For each set $\mathcal{K}$ of continuous t-norms containing
a t-norm different from $T_G$, the set stTAUT($L_{\mathcal{K}}\forall$) is
$\Pi_2$-hard.

2. If $\mathcal{K}$ is a nonempty set of continuous t-norms
containing a t-norm which has, in its ordinal sum
representation, a product component or a nonextremal
(this means being neither first nor last summand)
Łukasiewicz component, then stTAUT($L_{\mathcal{K}}\forall$) is not
arithmetical.


The arithmetical complexity of the set stTAUT($L_T\forall$)
remains undetermined if T is, e.g., one of the t-norms
Ł ⊕ Ł, G ⊕ Ł, Ł ⊕ G, or Ł ⊕ G ⊕ Ł, with ⊕ denoting the
ordinal sum operation.

For standard satisfiability Hájek [2.23] proved that
the sets stSAT(G$\forall$) and stSAT(Ł$\forall$) are $\Pi_1$-complete. He
also proved, in [2.81], that the set stSAT(Π$\forall$) is not
arithmetical, and gave in [2.82] also the following result.


Theorem 2.14
If $\otimes$ is a continuous t-norm whose first ordinal sum
component is G or Ł, then stSAT($L_{\otimes}\forall$) is $\Pi_1$-complete.


The reason is that one has, under these assumptions,

stSAT($L_{\otimes}\forall$) = stSAT(G$\forall$),

as well as

stSAT($L_{\otimes}\forall$) = stSAT(Ł$\forall$).


Montagna [2.83] added, for the product logic Π, the
following more general result.


Theorem 2.15

If $\mathcal{K}$ is a nonempty set of continuous t-norms containing
$T_P$, or a t-norm whose first ordinal summand is $T_P$,
then stSAT($L_{\mathcal{K}}\forall$) is not arithmetical.


The complexity of stSAT(L;V) for continuous t- 
norms T which do not have a first component in their 
ordinal sum representation is an open problem. 

More complexity results are surveyed in [2.77] and
more recently in [2.84, 85].

2.8 Concluding Remarks


The reader who is interested in further results, or in
more details, might consult the recent Handbook of
Mathematical Fuzzy Logic [2.86]. This Handbook surveys
the whole field of mathematical fuzzy logics and
offers the most up-to-date account of the state of the art
in this field. For the wider topic of many-valued logics,
[2.5] is still the best reference.

There is one approach, however, which is not discussed
here and only briefly mentioned in [2.86]:
a version of Church-style type theory based on suitable
mathematical fuzzy logics, called fuzzy type theory, and

3. Possibility Theory and Its Applications:


Didier Dubois, Henri Prade


This chapter provides an overview of possibility 
theory, emphasizing its historical roots and its re- 
cent developments. Possibility theory lies at the 
crossroads between fuzzy sets, probability, and 
nonmonotonic reasoning. Possibility theory can be 
cast either in an ordinal or in a numerical setting. 
Qualitative possibility theory is closely related to 
belief revision theory, and commonsense reason- 
ing with exception-tainted knowledge in artificial 
intelligence. Possibilistic logic provides a rich rep- 
resentation setting, which enables the handling 
of lower bounds of possibility theory measures, 
while remaining close to classical logic. Quali- 
tative possibility theory has been axiomatically 
justified in a decision-theoretic framework in the 
style of Savage, thus providing a foundation for 
qualitative decision theory. Quantitative possibil- 
ity theory is the simplest framework for statistical 
reasoning with imprecise probabilities. As such, it 
has close connections with random set theory and 
confidence intervals, and can provide a tool for 
uncertainty propagation with limited statistical or 
subjective information.

Possibility theory is an uncertainty theory devoted to
the handling of incomplete information. To a large extent,
it is comparable to probability theory because
it is based on set functions. It differs from the latter
by the use of a pair of dual set functions (possibility
and necessity measures) instead of only one. Besides,
it is not additive and makes sense on ordinal structures.
The name Theory of Possibility was coined by
Zadeh [3.1], who was inspired by a paper by Gaines and 
Kohout [3.2]. In Zadeh’s view, possibility distributions 
were meant to provide a graded semantics to natural 
language statements; on this basis, possibility degrees 
can be attached to other statements, as well as dual ne- 
cessity degrees expressing graded certainty. However, 
possibility and necessity measures can also be the ba- 
sis of a full-fledged representation of partial belief that 
parallels probability, without compulsory reference to 
linguistic information [3.3, 4]. It can be seen either as 
a coarse, nonnumerical version of probability theory, or 
a framework for reasoning with extreme probabilities, 
or yet a simple approach to reasoning with imprecise 
probabilities [3.5]. 

Besides, possibility distributions can also be in- 
terpreted as representations of preference, thus stand- 
ing for a counterpart to a utility function. In this 
case, possibility degrees estimate degrees of feasibil- 
ity of alternative choices, while necessity measures 
can represent priorities [3.6]. The possibility theory 
framework is also bipolar [3.7] because distributions 
may either restrict the possible states of the world 
(negative information pointing out the impossible), or 
model sets of actually observed possibilities (posi- 
tive information pointing out the possible). Negative 
information refers to pieces of knowledge that are 
supposedly correct and act as constraints. Possibility
and necessity measures rely on negative information.
Positive information refers to reports of actually observed
states, or to sets of preferred choices. They
induce two other set functions: guaranteed possibility
measures and their duals, which are decreasing w.r.t. set
inclusion [3.8].

After reviewing pioneering contributions to possibility
theory, we recall its basic concepts, namely the
four set functions at work in possibility theory. Then we
present the two main directions along which possibility
theory has developed: the qualitative and quantitative
settings. Both approaches share the same basic maxitivity
axiom. They differ when it comes to conditioning,
and to independence notions. We point out the connections
with a coarse numerical integer-valued approach
to belief representation, proposed by Spohn [3.9], now
known as ranking theory [3.10].

In each setting, we discuss current and prospective
lines of research. In the qualitative approach, we review
the connections between possibility theory and modal
logic, possibilistic logic and its applications to nonmonotonic
reasoning, logic programming and the like,
possibilistic counterparts of Bayesian belief networks,
the framework of soft constraints and the possibilistic
approach to qualitative decision theory, and more recent
investigations in formal concept analysis and learning.
On the quantitative side, we review quantitative possibilistic
networks, the connections between possibility
theory, belief functions and imprecise probabilities, the
connections with non-Bayesian statistics, and the application
of quantitative possibility to risk analysis.


3.1 Historical Background


Zadeh was not the first scientist to speak about formalising
notions of possibility. The modalities possible
and necessary have been used in philosophy
at least since the Middle Ages in Europe, based
on Aristotle's and Theophrastus' works [3.11]. More
recently these notions became the building blocks
of modal logics that emerged at the beginning of
the 20th century from the works of C.I. Lewis
(see Cresswell [3.12]). In this approach, possibility
and necessity are all-or-nothing notions, and handled
at the syntactic level. More recently, and independently
from Zadeh's view, the notion of possibility,
as opposed to probability, was central in the
works of one economist, and in those of two philosophers.


3.1.1 G.L.S. Shackle 


A graded notion of possibility was introduced as a full-fledged
approach to uncertainty and decision in 1940-1970
by the English economist Shackle [3.13], who
called degree of potential surprise of an event its degree
of impossibility, that is, retrospectively, the degree
of necessity of the opposite event. Shackle's notion of
possibility is basically epistemic: it is a character of the
chooser's particular state of knowledge in his present.
Impossibility is understood as disbelief. Potential surprise
is valued on a disbelief scale, namely a positive
interval of the form $[0, y^*]$, where $y^*$ denotes the
absolute rejection of the event to which it is assigned.
In case everything is possible, all mutually exclusive
hypotheses have zero surprise. At least one elementary
hypothesis must carry zero potential surprise. The
degree of surprise of an event, a set of elementary
hypotheses, is the degree of surprise of its least surprising
realization. Shackle also introduces a notion of conditional
possibility, whereby the degree of surprise of
a conjunction of two events A and B is equal to the maximum
of the degree of surprise of A, and of the degree of
surprise of B, should A prove true. The disbelief notion
introduced later by Spohn [3.9, 10] employs the same
type of convention as potential surprise, but uses the set
of natural integers as a disbelief scale; his conditioning
rule uses the subtraction of natural integers.


3.1.2 D. Lewis 


In his 1973 book [3.14], the philosopher David Lewis
considers a graded notion of possibility in the form of
a relation between possible worlds he calls comparative
possibility. He connects this concept of possibility
to a notion of similarity between possible worlds. This
asymmetric notion of similarity is also comparative,
and is meant to express statements of the form: a world
$j$ is at least as similar to world $i$ as world $k$ is. Comparative
similarity of $j$ and $k$ with respect to $i$ is interpreted
as the comparative possibility of $j$ with respect to $k$
viewed from world $i$. Such relations are assumed to be
complete pre-orderings and are instrumental in defining
the truth conditions of counterfactual statements (of the
form If I were rich, I would buy a big boat). Comparative
possibility relations $\geq_\Pi$ obey the key axiom: for
all events $A$, $B$, $C$,

$$A \geq_\Pi B \ \text{implies}\ C \cup A \geq_\Pi C \cup B .$$

This axiom was later independently proposed by the
first author [3.15] in an attempt to derive a possibilistic
counterpart to comparative probabilities. Independently,
the connection between numerical possibility
degrees and similarity was investigated by Sudkamp [3.16].


3.1.3 L.J. Cohen 


A framework very similar to the one of Shackle was 
proposed by the philosopher Cohen [3.17] who con- 
sidered the problem of legal reasoning. He introduced 
so-called Baconian probabilities understood as degrees 
of provability. The idea is that it is hard to prove someone
guilty in a court of law by means of purely statistical
arguments. The basic feature of degrees of provability is
that a hypothesis and its negation cannot both be prov- 
able together to any extent (the contrary being a case 
for inconsistency). Such degrees of provability coincide 
with what is known as necessity measures. 


3.1.4 L.A. Zadeh 


In his seminal paper [3.1], Zadeh proposed an inter- 
pretation of membership functions of fuzzy sets as 
possibility distributions encoding flexible constraints 
induced by natural language statements. Zadeh tenta- 
tively articulated the relationship between possibility 
and probability, noticing that what is probable must 
preliminarily be possible. However, the view of pos- 
sibility degrees developed in his paper refers to the 
idea of graded feasibility (degrees of ease, as in the 
example of how many eggs can Hans eat for his break- 
fast) rather than to the epistemic notion of plausibility 
laid bare by Shackle. Nevertheless, the key axiom of 
maxitivity for possibility measures is highlighted. In 
the two subsequent articles [3.18, 19], Zadeh acknowl- 
edged the connection between possibility theory, belief 
functions and upper/lower probabilities, and proposed 
their extensions to fuzzy events and fuzzy information 
granules. 


3.2 Basic Notions of Possibility Theory 


The basic building blocks of possibility theory orig- 
inate in Zadeh’s paper [3.1] and were first ex- 
tensively described in the authors’ book [3.20], 
then further on in [3.3,21]. More recent accounts 
are in [3.4,5]. In this section, possibility theory 
is envisaged as a stand-alone theory of uncer- 
tainty. 


3.2.1 Possibility Distributions 


Let $S$ be a set of states of affairs (or descriptions
thereof), or states for short. This set can be the domain
of an attribute (numerical or categorical), the Cartesian
product of attribute domains, the set of interpretations
of a propositional language, etc. A possibility distribution
is a mapping $\pi$ from $S$ to a totally ordered scale $L$,
with top denoted by 1 and bottom by 0. In the finite
case $L = \{1 = \lambda_1 > \cdots > \lambda_n > \lambda_{n+1} = 0\}$.
The possibility scale can be the unit interval as suggested by Zadeh,
or generally any finite chain, or even the set of nonnegative
integers. It is often assumed that $L$ is equipped with
an order-reversing map denoted by $\lambda \in L \mapsto 1 - \lambda$.

The function $\pi$ represents the state of knowledge of
an agent (about the actual state of affairs), also called
an epistemic state, distinguishing what is plausible from
what is less plausible, what is the normal course of
things from what is not, what is surprising from what
is expected. It represents a flexible restriction on what
is the actual state with the following conventions (similar
to probability, but opposite to Shackle's potential
surprise scale; if $L = \mathbb{N}$, the conventions are opposite:
0 means possible and $\infty$ means impossible):


• $\pi(s) = 0$ means that state $s$ is rejected as impossible;

• $\pi(s) = 1$ means that state $s$ is totally possible (= plausible).


The larger $\pi(s)$, the more possible, i.e., plausible
the state $s$ is. Formally, the mapping $\pi$ is the membership
function of a fuzzy set [3.1], where membership
grades are interpreted in terms of plausibility. If the universe
$S$ is exhaustive, at least one of the elements of $S$
should be the actual world, so that $\exists s, \pi(s) = 1$
(normalization). This condition expresses the consistency of
the epistemic state described by $\pi$.

Distinct values may simultaneously have a degree
of possibility equal to 1. In the Boolean case, $\pi$ is just
the characteristic function of a subset $E \subseteq S$ of mutually
exclusive states (a disjunctive set [3.22]), ruling out all
those states considered as impossible. Possibility theory
is thus a (fuzzy) set-based representation of incomplete
information.
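
The conventions above can be illustrated on a finite state space. The following is only a minimal sketch, assuming a distribution is stored as a Python dictionary from states to degrees in [0, 1]; the names `is_normalized` and `from_disjunctive_set` and the weather example are hypothetical and not part of the chapter.

```python
from typing import Dict, Hashable, Iterable

PossDist = Dict[Hashable, float]  # state -> possibility degree in [0, 1]


def is_normalized(pi: PossDist) -> bool:
    """Consistency check: at least one state is fully possible (degree 1)."""
    return any(degree == 1.0 for degree in pi.values())


def from_disjunctive_set(states: Iterable[Hashable], E: set) -> PossDist:
    """Boolean case: characteristic function of the set E of non-excluded states."""
    return {s: 1.0 if s in E else 0.0 for s in states}


# Hypothetical epistemic state about tomorrow's weather.
pi = {"sunny": 1.0, "cloudy": 1.0, "rain": 0.4, "snow": 0.0}
crisp = from_disjunctive_set(pi, {"sunny", "cloudy"})
```

Here two states share degree 1, which is allowed, and `crisp` is the all-or-nothing distribution that only rules states out.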


3.2.2 Specificity 


A possibility distribution $\pi$ is said to be at least as specific
as another $\pi'$ if and only if for each state of affairs
$s$: $\pi(s) \le \pi'(s)$ [3.23]. Then, $\pi$ is at least as restrictive
and informative as $\pi'$, since it rules out at least as many
states with at least as much strength. In the possibilistic
framework, extreme forms of partial knowledge can be
captured, namely:


• Complete knowledge: for some $s_0$, $\pi(s_0) = 1$ and
$\pi(s) = 0$, $\forall s \ne s_0$ (only $s_0$ is possible)


• Complete ignorance: $\pi(s) = 1$, $\forall s \in S$ (all states are
possible).


Possibility theory is driven by the principle of min- 
imal specificity. It states that any hypothesis not known 
to be impossible cannot be ruled out. It is a minimal 
commitment, cautious information principle. Basically, 
we must always try to maximize possibility degrees, 
taking constraints into account. 

A piece of information of the form x is $F$,
where $F$ is a fuzzy set restricting the values of the ill-known
quantity $x$, is represented by the inequality
$\pi \le \mu_F$, where $\mu_F$ is the membership function of $F$.
The minimal specificity principle enforces the possibility
distribution $\pi = \mu_F$, if no other piece of knowledge
is available. Generally there may be impossible values
of $x$ due to other piece(s) of information. Thus,
given several pieces of knowledge of the form x is $F_i$,
for $i = 1, \ldots, n$, each of them translates into the constraint
$\pi \le \mu_{F_i}$; hence, several constraints lead to the
inequality $\pi \le \min_{i=1}^{n} \mu_{F_i}$ and, on behalf of the
minimal specificity principle, to the possibility distribution

$$\pi = \min_{i=1}^{n} \pi_i ,$$

where $\pi_i$ is induced by the information item x is $F_i$.
This justifies the use of the minimum operation for combining
information items. It is noticeable that this way
of combining pieces of information fully agrees with
classical logic, since a classical logic base is equivalent
to the logical conjunction of the logical formulas that
belong to the base, and its set of models is obtained by
intersecting the sets of models of its formulas. Indeed, in
propositional logic, asserting a proposition $\phi$ amounts
to declaring that any interpretation (state) that makes $\phi$
false is impossible, as being incompatible with the state
of knowledge.
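
As an illustration of the specificity ordering and of the minimum-based combination, here is a small sketch on a finite set of states. The function names and the membership values are assumptions made for the example, not taken from the chapter.

```python
def at_least_as_specific(pi1, pi2):
    """True if pi1(s) <= pi2(s) for every state s."""
    return all(pi1[s] <= pi2[s] for s in pi1)


def combine_min(*constraints):
    """Least specific distribution pi such that pi <= mu for every constraint mu."""
    states = constraints[0].keys()
    return {s: min(mu[s] for mu in constraints) for s in states}


# Hypothetical membership functions over three candidate ages.
mu_young = {20: 1.0, 30: 1.0, 40: 0.25}
mu_around_30 = {20: 0.5, 30: 1.0, 40: 0.5}

pi = combine_min(mu_young, mu_around_30)
```

The combined distribution is at least as specific as each constraint, and any distribution with larger degrees would violate one of them.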


3.2.3 Possibility and Necessity Functions 


Given a simple query of the form does event $A$ occur?
(is the corresponding proposition $\phi$ true?) where $A$ is
a subset of states, the response to the query can be
obtained by computing degrees of possibility and necessity,
respectively (if the possibility scale $L = [0, 1]$)

$$\Pi(A) = \sup_{s \in A} \pi(s) ; \qquad N(A) = \inf_{s \notin A} 1 - \pi(s) .$$

$\Pi(A)$ evaluates to what extent $A$ is consistent with
$\pi$, while $N(A)$ evaluates to what extent $A$ is certainly
implied by $\pi$. The possibility-necessity duality is expressed
by $N(A) = 1 - \Pi(A^c)$, where $A^c$ is the complement
of $A$. Generally, $\Pi(S) = N(S) = 1$ and $\Pi(\emptyset) =
N(\emptyset) = 0$ (since $\pi$ is normalized to 1). In the Boolean
case, the possibility distribution comes down to the disjunctive
(epistemic) set $E \subseteq S$ [3.3, 24]:


• $\Pi(A) = 1$ if $A \cap E \ne \emptyset$, and 0 otherwise: function $\Pi$
checks whether proposition $A$ is logically consistent
with the available information or not.

• $N(A) = 1$ if $E \subseteq A$, and 0 otherwise: function $N$
checks whether proposition $A$ is logically entailed
by the available information or not.


More generally, possibility and necessity measures
represent degrees of plausibility and belief, respectively,
in agreement with other uncertainty theories (see
Sect. 3.4). Possibility measures satisfy the basic maxitivity
property $\Pi(A \cup B) = \max(\Pi(A), \Pi(B))$. Necessity
measures satisfy an axiom dual to that of possibility
measures, namely $N(A \cap B) = \min(N(A), N(B))$.
On infinite spaces, these axioms must hold for infinite
families of sets. As a consequence of the normalization
of $\pi$, $\min(N(A), N(A^c)) = 0$ and $\max(\Pi(A), \Pi(A^c)) = 1$,
or equivalently $\Pi(A) = 1$ whenever $N(A) > 0$, which totally
fits the intuition behind this formalism, namely that something
somewhat certain should be fully possible, i.e., consistent
with the available information.
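
On a finite state space the two set functions reduce to a maximum and a minimum. The sketch below assumes the scale [0, 1] and a dictionary-based distribution; the function names and the three-state example are hypothetical.

```python
def possibility(pi, A):
    """Pi(A): largest degree of possibility among the states of A (0 if A is empty)."""
    return max((pi[s] for s in A), default=0.0)


def necessity(pi, A):
    """N(A): smallest value of 1 - pi(s) over states outside A (1 if A covers S)."""
    return min((1.0 - pi[s] for s in pi if s not in A), default=1.0)


pi = {"a": 1.0, "b": 0.5, "c": 0.25}   # normalized: pi("a") = 1
A = {"a", "b"}
A_complement = set(pi) - A
```

With these values, $A$ is fully possible and somewhat certain, while its complement has necessity 0, in line with $\min(N(A), N(A^c)) = 0$.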


3.2.4 Certainty Qualification 


Human knowledge is often expressed in a declarative
way using statements to which belief degrees are
attached. Certainty-qualified pieces of uncertain information
of the form $A$ is certain to degree $\alpha$ can then
be modeled by the constraint $N(A) \ge \alpha$. It represents
a family of possible epistemic states $\pi$ that obey this
constraint. The least specific possibility distribution
among them exists and is defined by [3.3]

$$\pi_{(A,\alpha)}(s) = \begin{cases} 1 & \text{if } s \in A, \\ 1 - \alpha & \text{otherwise.} \end{cases} \tag{3.1}$$

If $\alpha = 1$, we get the characteristic function of $A$. If $\alpha = 0$,
we get total ignorance. This possibility distribution
is a key building block to construct possibility distributions
from several pieces of uncertain knowledge.
Indeed, e.g., in the finite case, any possibility distribution
can be viewed as a collection of nested certainty-qualified
statements. Let $E_i = \{s : \pi(s) \ge \lambda_i \in L\}$ be the
$\lambda_i$-cut of $\pi$. Then it is easy to check that
$\pi(s) = \min_{i : s \notin E_i} 1 - N(E_i)$ (with the convention
$\min_\emptyset = 1$).

We can also consider possibility-qualified statements
of the form $\Pi(A) \ge \beta$; however, the least specific
epistemic state compatible with this constraint
expresses total ignorance.
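
The decomposition of a distribution into nested certainty-qualified statements can be checked numerically. This is a sketch under the assumption of a finite state space and the scale [0, 1]; `certainty_qualified` and `rebuild_from_cuts` are illustrative names, not the authors' notation.

```python
def necessity(pi, A):
    """N(A) = min over states outside A of 1 - pi(s); 1 if no state lies outside A."""
    return min((1.0 - pi[s] for s in pi if s not in A), default=1.0)


def certainty_qualified(states, A, alpha):
    """Least specific distribution such that N(A) >= alpha (Eq. 3.1)."""
    return {s: 1.0 if s in A else 1.0 - alpha for s in states}


def rebuild_from_cuts(pi):
    """Recombine, by minimum, the statements (E_i, N(E_i)) built from the cuts of pi."""
    levels = sorted({v for v in pi.values() if v > 0.0}, reverse=True)
    rebuilt = {s: 1.0 for s in pi}
    for lam in levels:
        cut = {s for s in pi if pi[s] >= lam}           # the lambda-cut E_i
        piece = certainty_qualified(pi, cut, necessity(pi, cut))
        rebuilt = {s: min(rebuilt[s], piece[s]) for s in pi}
    return rebuilt


pi = {"a": 1.0, "b": 0.5, "c": 0.25, "d": 0.0}
```

For this example the three cuts carry necessity degrees 0.5, 0.75 and 1, and their minimum-based combination gives back the original distribution.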


3.2.5 Joint Possibility Distributions 


Possibility distributions over Cartesian products of attribute
domains $S_1 \times \cdots \times S_m$ are called joint possibility
distributions $\pi(s_1, \ldots, s_m)$. The projection $\pi_k^{\downarrow}$ of the
joint possibility distribution $\pi$ onto $S_k$ is defined as

$$\pi_k^{\downarrow}(s_k) = \Pi(S_1 \times \cdots \times S_{k-1} \times \{s_k\} \times S_{k+1} \times \cdots \times S_m) = \sup_{s_i \in S_i,\, i \ne k} \pi(s_1, \ldots, s_m) .$$

Clearly, $\pi(s_1, \ldots, s_m) \le \min_{k=1}^{m} \pi_k^{\downarrow}(s_k)$, that is, a joint
possibility distribution is at least as specific as the
Cartesian product of its projections. When the equality
holds, $\pi(s_1, \ldots, s_m)$ is called separable.
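
A finite illustration of projection and separability follows. It is a sketch assuming that the joint distribution is given as a dictionary keyed by tuples covering the whole Cartesian product; the names `project` and `is_separable` and the numerical values are hypothetical.

```python
def project(joint, k):
    """Projection onto the k-th domain: maximum over all other coordinates."""
    proj = {}
    for states, degree in joint.items():
        proj[states[k]] = max(proj.get(states[k], 0.0), degree)
    return proj


def is_separable(joint):
    """True if the joint equals the minimum of its projections everywhere."""
    m = len(next(iter(joint)))
    projections = [project(joint, k) for k in range(m)]
    return all(
        degree == min(projections[k][states[k]] for k in range(m))
        for states, degree in joint.items()
    )


joint = {("x1", "y1"): 1.0, ("x1", "y2"): 0.5,
         ("x2", "y1"): 0.5, ("x2", "y2"): 0.25}
separable = dict(joint)
separable[("x2", "y2")] = 0.5   # now equal to min of the two projections
```

The first distribution is strictly more specific than the product of its projections at ("x2", "y2"), so it is not separable; the second one is.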


3.2.6 Conditioning 


Notions of conditioning exist in possibility theory. Conditional
possibility can be defined similarly to probability
theory using a Bayesian-like equation of the
form [3.3]

$$\Pi(B \cap A) = \Pi(B \mid A) \star \Pi(A) ,$$

where $\Pi(A) > 0$ and $\star$ is a t-norm (a nondecreasing
Abelian semigroup operation on the unit interval
having identity 1 and absorbing element 0 [3.25]);
moreover $N(B \mid A) = 1 - \Pi(B^c \mid A)$. The above equation
makes little sense for necessity measures, as it
becomes trivial when $N(A) = 0$, that is under lack
of certainty, while in the above definition, the equation
becomes problematic only if $\Pi(A) = 0$, which
is natural as then $A$ is considered impossible. If operation
$\star$ is the minimum, the equation $\Pi(B \cap A) =
\min(\Pi(B \mid A), \Pi(A))$ fails to characterize $\Pi(B \mid A)$,
and we must resort to the minimal specificity principle
to come up with the qualitative conditioning rule [3.3]

$$\Pi(B \mid A) = \begin{cases} 1 & \text{if } \Pi(B \cap A) = \Pi(A) > 0, \\ \Pi(B \cap A) & \text{otherwise.} \end{cases} \tag{3.2}$$

It is clear that $N(B \mid A) > 0$ if and only if $\Pi(B \cap A) >
\Pi(B^c \cap A)$. Moreover, if $\Pi(B \mid A) > \Pi(B)$ then
$\Pi(B \mid A) = 1$, which points out the limited expressiveness
of this qualitative notion (no gradual positive
reinforcement of possibility). However, it is possible to
have that $N(B) > 0$, $N(B^c \mid A_1) > 0$, $N(B \mid A_1 \cap A_2) > 0$
(i.e., oscillating beliefs). Extensive works on conditional
possibility, especially qualitative, handling the
case $\Pi(A) = 0$, have been recently carried out by Coletti
and Vantaggi [3.26, 27] in the spirit of De Finetti's
approach to subjective probabilities defined in terms of
conditional measures and allowing for conditioning on
impossible events.

In the numerical setting, due to the need of preserving
for $\Pi(B \mid A)$ continuity properties of $\Pi$, we must
choose $\star$ = product, so that

$$\Pi(B \mid A) = \frac{\Pi(B \cap A)}{\Pi(A)} ,$$

which makes possibilistic and probabilistic conditionings
very similar [3.28] (now, gradual positive reinforcement
of possibility is allowed). But there is yet
another definition of numerical possibilistic conditioning,
not based on the above equation, as seen later in this
chapter.
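
The difference between the qualitative (minimum-based) rule and the numerical (product-based) rule can be seen on a small example. The sketch below works on finite state spaces with events given as Python sets; `cond_min` and `cond_product` are names chosen for this illustration only.

```python
def possibility(pi, A):
    """Pi(A) on a finite state space (0 for the empty event)."""
    return max((pi[s] for s in A), default=0.0)


def cond_min(pi, B, A):
    """Qualitative conditioning: 1 if Pi(B and A) = Pi(A) > 0, else Pi(B and A)."""
    p_a = possibility(pi, A)
    if p_a == 0.0:
        raise ValueError("conditioning event is impossible")
    p_ab = possibility(pi, A & B)
    return 1.0 if p_ab == p_a else p_ab


def cond_product(pi, B, A):
    """Numerical conditioning: Pi(B and A) / Pi(A)."""
    p_a = possibility(pi, A)
    if p_a == 0.0:
        raise ValueError("conditioning event is impossible")
    return possibility(pi, A & B) / p_a


pi = {"a": 0.5, "b": 0.25, "c": 1.0}
A = {"a", "b"}
```

Conditioning on $A$ raises state "a" to 1 under both rules. State "b" keeps degree 0.25 under the minimum-based rule but is rescaled to 0.5 under the product-based rule, which is the gradual reinforcement mentioned in the text.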


3.2.7 Independence 


There are also several variants of possibilistic indepen- 
dence between events. Let us mention here the two 
basic approaches: 


• Unrelatedness: $\Pi(A \cap B) = \min(\Pi(A), \Pi(B))$.
When it does not hold, it indicates an epistemic
form of mutual exclusion between $A$ and $B$. It is
symmetric but sensitive to negation. When it holds
for all pairs made of $A$, $B$ and their complements,
it is an epistemic version of logical independence
related to separability.

• Causal independence: $\Pi(B \mid A) = \Pi(B)$. This notion
is different from the former one and stronger.
It is a form of directed epistemic independence
whereby learning $A$ does not affect the plausibility
of $B$. It is neither symmetric nor insensitive to negation:
for instance, it is not equivalent to $N(B \mid A) = N(B)$.


Generally, independence in possibility theory is
neither symmetric, nor insensitive to negation. For
Boolean variables, independence between events is
not equivalent to independence between variables. But
since the possibility scale can be qualitative or quantitative,
and there are several forms of conditioning,
there are also various possible forms of independence.
For studies of various notions and their properties
see [3.29-32]. More discussions and references appear
in [3.4].
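
The lack of symmetry of causal independence can be observed on a four-state example. This is a sketch assuming the minimum-based conditioning rule (3.2); the function names and the chosen degrees are hypothetical.

```python
def possibility(pi, A):
    return max((pi[s] for s in A), default=0.0)


def cond_min(pi, B, A):
    """Minimum-based conditional possibility of B given A (requires Pi(A) > 0)."""
    p_a, p_ab = possibility(pi, A), possibility(pi, A & B)
    if p_a == 0.0:
        raise ValueError("conditioning event is impossible")
    return 1.0 if p_ab == p_a else p_ab


def unrelated(pi, A, B):
    """Unrelatedness: Pi(A and B) = min(Pi(A), Pi(B))."""
    return possibility(pi, A & B) == min(possibility(pi, A), possibility(pi, B))


def causally_independent(pi, B, A):
    """B is independent of A: learning A leaves the possibility of B unchanged."""
    return cond_min(pi, B, A) == possibility(pi, B)


# States are pairs (truth value of a, truth value of b).
pi = {(True, True): 0.25, (True, False): 0.5,
      (False, True): 0.25, (False, False): 1.0}
A = {s for s in pi if s[0]}
B = {s for s in pi if s[1]}
```

In this example $A$ and $B$ are unrelated and $B$ is causally independent of $A$, yet learning $B$ raises the possibility of $A$ from 0.5 to 1, so $A$ is not causally independent of $B$.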


3.2.8 Fuzzy Interval Analysis 


An important example of a possibility distribution is
a fuzzy interval [3.3, 20]. A fuzzy interval is a fuzzy
set of reals whose membership function is unimodal
and upper semicontinuous. Its $\alpha$-cuts are closed intervals.
The calculus of fuzzy intervals is an extension
of interval arithmetic based on a possibilistic counterpart
of the computation of random variables. To compute
the addition of two fuzzy intervals $A$ and $B$ one has
to compute the membership function of $A \oplus B$ as the
degree of possibility $\mu_{A \oplus B}(z) = \Pi(\{(x, y) : x + y = z\})$,
based on the possibility distribution $\min(\mu_A(x), \mu_B(y))$.
There is a large literature on possibilistic interval analysis;
see [3.33] for a survey of 20th-century references.
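
The sup-min computation of $\mu_{A \oplus B}$ can be sketched on a discretized example. Fuzzy intervals are defined on the reals; restricting them to a few integer points, as done here, is a simplifying assumption made only to keep the illustration exact, and `fuzzy_add` is a hypothetical name.

```python
def fuzzy_add(mu_a, mu_b):
    """mu_{A+B}(z) = max over x + y = z of min(mu_A(x), mu_B(y))."""
    result = {}
    for x, degree_a in mu_a.items():
        for y, degree_b in mu_b.items():
            z = x + y
            joint = min(degree_a, degree_b)      # joint possibility of (x, y)
            result[z] = max(result.get(z, 0.0), joint)
    return result


# Hypothetical discretized fuzzy quantities "about 2" and "about 5".
about_2 = {1: 0.5, 2: 1.0, 3: 0.5}
about_5 = {4: 0.5, 5: 1.0, 6: 0.5}

total = fuzzy_add(about_2, about_5)
```

The result peaks at 7 and its support is the sum of the two supports, as interval arithmetic on the cuts would give.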


3.2.9 Guaranteed Possibility 


Possibility distributions originally represent negative
information in the sense that their role is essentially
to rule out impossible states. More recently [3.34, 35],
another type of possibility distribution has been considered
where the information has a positive nature,
namely it points out actually possible states, such as
observed cases, examples of solutions, etc. Positively-flavored
possibility distributions will be denoted by $\delta$
and serve as evidential support functions. The conventions
for interpreting them contrast with usual possibility
distributions:


• $\delta(s) = 1$ means that state $s$ is actually possible because
of a high evidential support (for instance, $s$ is
a case that has been actually observed);

• $\delta(s) = 0$ means that state $s$ has not been observed
(yet: potential impossibility).


Note that $\pi(s) = 1$ indicates potential possibility,
while $\delta(s) = 1$ conveys more information. In contrast,
$\delta(s) = 0$ expresses ignorance.

A measure of guaranteed possibility can be defined,
that differs from functions $\Pi$ and $N$ [3.34, 35]

$$\Delta(A) = \inf_{s \in A} \delta(s) .$$

It estimates to what extent all states in $A$ are actually
possible according to evidence. $\Delta(A)$ can be used as
a degree of evidential support for $A$. Of course, this
function possesses a conjugate $\nabla$ such that $\nabla(A) =
1 - \Delta(A^c) = \sup_{s \notin A} 1 - \delta(s)$. Function $\nabla(A)$ evaluates
the degree of potential necessity of $A$, as it is 1
only if some state $s$ outside $A$ is potentially impossible.

Uncertain statements of the form $A$ is possible to degree
$\beta$ often mean that any realization of $A$ is possible
to degree $\beta$ (e.g., it is possible that the museum is open
this afternoon). They can then be modeled by a constraint
of the form $\Delta(A) \ge \beta$. It corresponds to the idea
of observed evidence.

This type of information is better exploited by assuming
an informational principle opposite to the one
of minimal specificity, namely, any situation not yet observed
is tentatively considered as impossible. This is
similar to the closed-world assumption. The most specific
distribution $\delta_{(A,\beta)}$ in agreement with $\Delta(A) \ge \beta$ is

$$\delta_{(A,\beta)}(s) = \begin{cases} \beta & \text{if } s \in A, \\ 0 & \text{otherwise.} \end{cases}$$

Note that while possibility distributions induced from
certainty-qualified pieces of knowledge combine conjunctively,
by discarding possible states, evidential
support distributions induced by possibility-qualified
pieces of evidence combine disjunctively, by accumulating
possible states. Given several pieces of knowledge
of the form x is $F_i$ is possible, for $i = 1, \ldots, n$,
each of them translates into the constraint $\delta \ge \mu_{F_i}$;
hence, several constraints lead to the inequality
$\delta \ge \max_{i=1}^{n} \mu_{F_i}$ and, on behalf of another minimal
commitment principle based on maximal specificity, we get the
possibility distribution

$$\delta = \max_{i=1}^{n} \delta_i ,$$

where $\delta_i$ is induced by the information item x is $F_i$ is
possible. This justifies the use of the maximum operation
for combining evidential support functions. Acquiring
pieces of possibility-qualified evidence leads to updating
$\delta_{(A,\beta)}$ into some wider distribution $\delta \ge \delta_{(A,\beta)}$. Any
possibility distribution can be represented as a collection
of nested possibility-qualified statements of the
form $(E_i, \Delta(E_i))$, with $E_i = \{s : \delta(s) \ge \lambda_i\}$, since
$\delta(s) = \max_{i : s \in E_i} \Delta(E_i)$, dually to the case of
certainty-qualified statements.
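
The guaranteed possibility function, its conjugate and the disjunctive combination of possibility-qualified statements can be sketched as follows on a finite state space. The function names and the opening-days example are assumptions for illustration; the value 1 returned for the empty event is the usual convention for an infimum over the empty set.

```python
def guaranteed_possibility(delta, A):
    """Delta(A): smallest evidential support among the states of A (1 if A is empty)."""
    return min((delta[s] for s in A), default=1.0)


def potential_necessity(delta, A):
    """Conjugate function: 1 - Delta(complement of A)."""
    return 1.0 - guaranteed_possibility(delta, set(delta) - set(A))


def possibility_qualified(states, A, beta):
    """Most specific delta such that Delta(A) >= beta."""
    return {s: beta if s in A else 0.0 for s in states}


def combine_max(*dists):
    """Disjunctive accumulation of evidential support."""
    return {s: max(d[s] for d in dists) for s in dists[0]}


# Hypothetical reports about days on which a museum was seen open.
days = ["mon", "tue", "wed", "thu"]
d1 = possibility_qualified(days, {"mon", "tue"}, 0.75)
d2 = possibility_qualified(days, {"tue", "wed"}, 0.5)
delta = combine_max(d1, d2)
```

Adding the second report widens the distribution: Wednesday gains support, while Thursday, never observed, stays at 0.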


3.2.10 Bipolar Possibility Theory 


A bipolar representation of information using pairs
$(\delta, \pi)$ may provide a natural interpretation of interval-valued
fuzzy sets [3.8]. Although positive and negative
information are represented in separate and different
ways via $\delta$ and $\pi$ functions, respectively, there is a coherence
condition that should hold between positive
and negative information. Indeed, observed information
should not be impossible. Likewise, in terms of
preferences, solutions that are preferred to some extent
should not be unfeasible. This leads to enforce the
coherence constraint $\delta \le \pi$ between the two representations.

This condition should be maintained when new information
arrives and is combined with the previous
one. This does not go for free since degrees $\delta(s)$ tend
to increase while degrees $\pi(s)$ tend to decrease due
to the disjunctive and conjunctive processes that, respectively,
govern their combination. Maintaining this
coherence requires a revision process that works as follows.
If the current information state is represented by
the pair $(\delta, \pi)$, receiving a new positive (resp. negative)
piece of information represented by $\delta^{\mathrm{new}}$ (resp.
$\pi^{\mathrm{new}}$) to be enforced, leads to revising $(\delta, \pi)$ into
$(\max(\delta, \delta^{\mathrm{new}}), \pi^{\mathrm{rev}})$ (resp. into
$(\delta^{\mathrm{rev}}, \min(\pi, \pi^{\mathrm{new}}))$), using, respectively,

$$\pi^{\mathrm{rev}} = \max(\pi, \delta^{\mathrm{new}}) ; \tag{3.3}$$
$$\delta^{\mathrm{rev}} = \min(\pi^{\mathrm{new}}, \delta) . \tag{3.4}$$


It is important to note that when both positive and negative
pieces of information are collected, there are two
options:


• Either priority is given to positive information over
negative information: it means that (past) positive
information cannot be ruled out by (future) negative
information. This may be found natural when very
reliable observations (represented by $\delta$) contradict
tentative knowledge (represented by $\pi$). Then revising
$(\delta, \pi)$ by $(\delta^{\mathrm{new}}, \pi^{\mathrm{new}})$ yields the new pair

$$(\delta^{\mathrm{rev}}, \pi^{\mathrm{rev}}) = (\max(\delta, \delta^{\mathrm{new}}),\ \max(\min(\pi, \pi^{\mathrm{new}}), \max(\delta, \delta^{\mathrm{new}})))$$

• Priority is given to negative information over positive
information. It makes sense when handling
preferences. Indeed, then, positive information may
be viewed as wishes, while negative information
reflects constraints. Then, revising $(\delta, \pi)$ by
$(\delta^{\mathrm{new}}, \pi^{\mathrm{new}})$ would yield the new pair

$$(\delta^{\mathrm{rev}}, \pi^{\mathrm{rev}}) = (\min(\min(\pi, \pi^{\mathrm{new}}), \max(\delta, \delta^{\mathrm{new}})),\ \min(\pi, \pi^{\mathrm{new}})) .$$
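
The two priority-based revision rules listed above can be sketched pointwise on a finite state space. The function names and the numerical example are hypothetical; the inputs are chosen so that both the current pair and the new pair are coherent and the minimum of the two negative distributions stays normalized.

```python
def revise_positive_priority(delta, pi, delta_new, pi_new):
    """Positive information wins: accumulated evidence is never ruled out."""
    d = {s: max(delta[s], delta_new[s]) for s in pi}
    p = {s: max(min(pi[s], pi_new[s]), d[s]) for s in pi}
    return d, p


def revise_negative_priority(delta, pi, delta_new, pi_new):
    """Negative information wins: constraints cap the accumulated evidence."""
    p = {s: min(pi[s], pi_new[s]) for s in pi}
    d = {s: min(p[s], max(delta[s], delta_new[s])) for s in pi}
    return d, p


def is_coherent(delta, pi):
    """Coherence condition: delta <= pi in every state."""
    return all(delta[s] <= pi[s] for s in pi)


delta = {"s1": 0.5, "s2": 0.0, "s3": 0.0, "s4": 0.0}
pi = {"s1": 1.0, "s2": 0.25, "s3": 0.5, "s4": 1.0}
delta_new = {"s1": 0.0, "s2": 0.75, "s3": 0.0, "s4": 0.0}
pi_new = {"s1": 0.25, "s2": 1.0, "s3": 1.0, "s4": 1.0}

pos = revise_positive_priority(delta, pi, delta_new, pi_new)
neg = revise_negative_priority(delta, pi, delta_new, pi_new)
```

In state "s2" the new observation conflicts with the old constraint: the first rule raises $\pi$ to 0.75, whereas the second lowers $\delta$ to 0.25. Both results satisfy $\delta \le \pi$.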


It can be checked that the two latter revision rules 
generalize the two previous ones. With both revision 
options, it can be checked that if 6 < m and 6" < 17" 


3.3 Qualitative Possibility Theory 


This section is restricted to the case of a finite state 
space S, typically S is the set of interpretations of a for- 
mal propositional language £ based on a finite set of 
Boolean attributes V. The usual connectives A (con- 
junction), V (disjunction), and — (negation) are used. 
The possibility scale is then taken as a finite chain, or 
the unit interval understood as an ordinal scale, or even 
just a complete preordering of states. At the other end, 
one may use the set of natural integers (viewed as an im- 
possibility scale) equipped with addition, which comes 
down to a countable subset of the unit interval, equipped 
with the product t-norm, instrumental for conditioning. 
However, the qualitative nature of the latter setting is 
questionable, even if authors using it do not consider it 
as genuinely quantitative. 


3.3.1 Possibility Theory and Modal Logic 


In this section, the possibility scale is Boolean (L = 
{0, 1}) and a possibility distribution reduces to a sub- 
set of states EF, for instance the models of a set of 
formulas K representing the beliefs of an agent in 
propositional logic. The presence of a proposition p 
in K can be modeled by N([p]) = 1, or I7([=p]) =0 
where [p] is the set of interpretations of p; more gen- 
erally the degrees of possibility and necessity can be 
defined by [3.36]: 


@ N((p]) = I ([p]) = 1 if and only if K }¥ p (the agent 
believes p) 

e N((=p]) = M([>p]) = 0 if and only if K =} —p (the 
agent believes =p) 

@ N([p]) = 0 and J7([p]) = 1 if and only if K {£ p and 
K |- -p (the agent is unsure about p) 


hold, revising $(\delta, \pi)$ by $(\delta', \pi')$ yields a new
coherent pair. This revision process should not be confused
with another one pertaining only to the negative part of
the information, namely computing $\min(\pi, \pi')$, which may
yield a possibility distribution that is not normalized, in
the case of inconsistency. If such an inconsistency takes
place, it should be resolved (by some appropriate
renormalization) before one of the two above bipolar revision
mechanisms can be applied.


However, in propositional logic, it cannot be syntactically
expressed that $N([p]) = 0$ nor $\Pi([p]) = 1$. To do so, a modal
language is needed [3.12] that prefixes propositions with
modalities such as necessary ($\Box$) and possible ($\Diamond$).
Then $\Box p$ encodes $N([p]) = 1$ (instead of $p \in K$ in
classical logic), and $\Diamond p$ encodes $\Pi([p]) = 1$. Only
a very simple modal language $\mathcal{L}_\Box$ is needed that
encapsulates the propositional language $\mathcal{L}$. Atoms of
this logic are of the form $\Box p$, where p is any propositional
formula. Well-formed formulas in this logic are obtained by
applying standard conjunction and negation to these atoms


$$\mathcal{L}_\Box = \Box p,\ p \in \mathcal{L} \mid \neg\phi \mid \phi \wedge \psi .$$


The well-known conjugateness between possibility and
necessity reads: $\Diamond p = \neg\Box\neg p$. Maxitivity and
minitivity axioms of possibility and necessity measures,
respectively, read $\Diamond(p \vee q) = \Diamond p \vee \Diamond q$
and $\Box(p \wedge q) = \Box p \wedge \Box q$ and are well known to
hold in regular modal logics, and the consistency of the epistemic
state is ensured by axiom D: $\Box p \rightarrow \Diamond p$. This
is the minimal epistemic logic (MEL) [3.37] needed to account for
possibility theory. It corresponds to a small fragment of the logic
KD without modality nesting and without objective formulas
($\mathcal{L}_\Box \cap \mathcal{L} = \emptyset$). Models of such
modal formulas are epistemic states: for instance, E is a model of
$\Box p$ means that $E \subseteq [p]$ [3.37, 38]. This logic is
sound and complete with respect to this semantics, and enables
propositions whose truth status is explicitly unknown
to be reasoned about.
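
The semantics just described can be illustrated by a minimal Python sketch. It assumes a toy language with two atoms, a and b, and represents an epistemic state E as a set of interpretations; the names `models`, `box`, and `diamond` are chosen for this example only and do not come from [3.37].

```python
from itertools import product

ATOMS = ("a", "b")  # hypothetical atoms, chosen only for this example

# Every interpretation of the propositional language, as a dict atom -> bool.
INTERPRETATIONS = [
    dict(zip(ATOMS, bits)) for bits in product([False, True], repeat=len(ATOMS))
]


def models(prop):
    """Return [p] as the set of indices of interpretations satisfying prop."""
    return frozenset(i for i, w in enumerate(INTERPRETATIONS) if prop(w))


def box(E, prop):
    """E is a model of 'box p' iff E is a subset of [p], i.e. N([p]) = 1."""
    return E <= models(prop)


def diamond(E, prop):
    """'diamond p' is 'not box not p': some state of E satisfies p."""
    return not box(E, lambda w: not prop(w))


a = lambda w: w["a"]
b = lambda w: w["b"]
not_b = lambda w: not w["b"]

E = models(a)  # epistemic state: all the agent knows is that a holds

believes_a = box(E, a)                                 # the agent believes a
unsure_about_b = diamond(E, b) and diamond(E, not_b)   # b is explicitly unknown
```

In this sketch axiom D holds only because E is nonempty: for an empty E, `box` would succeed for every formula while `diamond` would fail.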


3.3.2 Comparative Possibility 


A plausibility ordering is a complete preorder of states
denoted by $\geq_\pi$, which induces a well-ordered partition
$\{E_1, \ldots, E_k\}$ of S. It is the comparative counterpart of
a possibility distribution $\pi$, i.e., $s \geq_\pi s'$ if and only
if $\pi(s) \geq \pi(s')$. Indeed it is more natural to expect that
an agent will supply ordinal rather than numerical information
about his beliefs. By convention, $E_1$ contains the most normal
states of fact, $E_k$ the least plausible, or most surprising ones.
Denoting by max(A) any most plausible state $s_0 \in A$, ordinal
counterparts of possibility and necessity measures [3.15] are then
defined as follows: $\{s\} \geq_\Pi \emptyset$ for all $s \in S$ and


$$A \geq_\Pi B \text{ if and only if } \max(A) \geq_\pi \max(B)$$
$$A \geq_N B \text{ if and only if } \max(B^c) \geq_\pi \max(A^c) .$$


Possibility relations $\geq_\Pi$ were proposed by Lewis [3.14]
and they satisfy his characteristic property


$$A \geq_\Pi B \text{ implies } C \cup A \geq_\Pi C \cup B ,$$


while necessity relations can also be defined as $A \geq_N B$
if and only if $B^c \geq_\Pi A^c$, and they satisfy a similar
axiom


$$A \geq_N B \text{ implies } C \cap A \geq_N C \cap B .$$


The latter coincides with epistemic entrenchment relations
in the sense of belief revision theory [3.39]
(provided that $A >_\Pi \emptyset$, if $A \neq \emptyset$).
Conditioning a possibility relation $\geq_\Pi$ by a nonimpossible
event $C >_\Pi \emptyset$ means deriving a relation $\geq_\Pi^C$
such that


$$A \geq_\Pi^C B \text{ if and only if } A \cap C \geq_\Pi B \cap C .$$


These results show that possibility theory is implicitly 
at work in the principal axiomatic approach to belief 
revision [3.40], and that conditional possibility obeys its 
main postulates [3.41]. The notion of independence for 
comparative possibility theory was studied by Dubois 
et al. [3.31], for independence between events, and Ben 
Amor et al. [3.32] between variables. 
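
As an illustration, the relations of this section can be computed directly from a well-ordered partition. The following Python sketch assumes a three-state space and encodes the partition by a hypothetical `level` map (0 for the most normal class $E_1$); the function names are illustrative only.

```python
INF = float("inf")


def make_relations(level, states):
    """Build >=_Pi, >=_N and conditioning from a well-ordered partition.

    level[s] is the index of the class containing s (0 = most normal).
    """
    def best(A):
        # Level of a most plausible state of A; the empty set sits below all.
        return min((level[s] for s in A), default=INF)

    def poss_geq(A, B):
        # A >=_Pi B iff max(A) >=_pi max(B)
        return best(A) <= best(B)

    def nec_geq(A, B):
        # A >=_N B iff B^c >=_Pi A^c
        return poss_geq(states - B, states - A)

    def cond_poss_geq(A, B, C):
        # A >=_Pi^C B iff (A and C) >=_Pi (B and C)
        return poss_geq(A & C, B & C)

    return poss_geq, nec_geq, cond_poss_geq


states = frozenset({"s1", "s2", "s3"})
level = {"s1": 0, "s2": 1, "s3": 2}   # E_1 = {s1}, E_2 = {s2}, E_3 = {s3}
poss_geq, nec_geq, cond_poss_geq = make_relations(level, states)
```

With these data, {s2} is not at least as possible as {s1}, but it becomes so after conditioning on C = {s2, s3}, since the intersection of {s1} with C is empty.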


3.3.3 Possibility Theory 
and Nonmonotonic Inference 


Suppose S is equipped with a plausibility ordering. The
main idea behind qualitative possibility theory is that
the state of the world is always believed to be as normal
as possible, neglecting less normal states. $A \geq_\Pi B$
really means that there is a normal state where A holds
that is at least as normal as any normal state where B
holds. The dual case $A \geq_N B$ is intuitively understood
as A is at least as certain as B, in the sense that there
are states where B fails to hold that are at least as normal
as the most normal state where A does not hold. In
particular, the events accepted as true are those which
are true in all the most plausible states, namely the ones
such that $A >_N \emptyset$. These assumptions lead us to
interpret the plausible inference $A \mathrel{|\!\sim} B$ of a
proposition B from another A, under a state of knowledge
$\geq_\Pi$, as follows: B should be true in all the most normal
states where A is true, which means $B >_\Pi^A B^c$ in terms of
ordinal conditioning, that is, $A \cap B$ is more plausible than
$A \cap B^c$. $A \mathrel{|\!\sim} B$ also means that the agent
considers B as an accepted belief in the context A.

This kind of inference is nonmonotonic in the sense
that $A \mathrel{|\!\sim} B$ does not always imply
$A \cap C \mathrel{|\!\sim} B$ for any additional information C.
This is similar to the fact that a conditional probability
$P(B \mid A \cap C)$ may be low even if $P(B \mid A)$ is high. The
properties of the consequence relation $\mathrel{|\!\sim}$ are now
well understood, and are precisely the ones laid bare by Lehmann
and Magidor [3.42] for their so-called rational inference.
Monotonicity is only partially restored: $A \mathrel{|\!\sim} B$
implies $A \cap C \mathrel{|\!\sim} B$ provided that
$A \mathrel{|\!\sim} C^c$ does not hold (i.e., that states where
A is true do not typically violate C). This property is
called rational monotony, and, along with some more
standard ones (like closure under conjunction), characterizes
default possibilistic inference $\mathrel{|\!\sim}$. In fact, the
set $\{B : A \mathrel{|\!\sim} B\}$ of accepted beliefs in the
context A is deductively closed, which corresponds to the idea that
the agent reasons with accepted beliefs in each context as
if they were true, until some event occurs that modifies
this context. This closure property is enough to justify
a possibilistic approach [3.43] and adding the rational
monotonicity property ensures the existence of a single
possibility relation generating the consequence relation
$\mathrel{|\!\sim}$ [3.44].
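
The nonmonotonic behavior of this inference can be checked on a small example. The Python sketch below assumes the classical birds-and-penguins scenario with a hand-written plausibility ranking; the worlds, the `level` function, and the predicate names are assumptions made for illustration and are not taken from the cited works.

```python
from itertools import product

# Worlds are triples (bird, penguin, flies); names are illustrative only.
WORLDS = list(product([False, True], repeat=3))
INF = float("inf")


def level(w):
    """Hypothetical plausibility ranking: 0 = most normal state."""
    bird, penguin, flies = w
    if penguin and (flies or not bird):
        return 2   # flying penguins, or penguins that are not birds
    if bird and not flies:
        return 1   # non-flying birds are exceptional
    return 0


def best(event):
    """Level of the most normal world where the event holds."""
    return min((level(w) for w in WORLDS if event(w)), default=INF)


def infers(A, B):
    """A |~ B iff (A and B) is strictly more plausible than (A and not B)."""
    return best(lambda w: A(w) and B(w)) < best(lambda w: A(w) and not B(w))


bird = lambda w: w[0]
penguin = lambda w: w[1]
flies = lambda w: w[2]
not_flies = lambda w: not w[2]
not_penguin = lambda w: not w[1]
bird_and_penguin = lambda w: w[0] and w[1]
bird_not_penguin = lambda w: w[0] and not w[1]
```

Here `infers(bird, flies)` holds while `infers(bird_and_penguin, flies)` does not. Rational monotony is not violated, because `infers(bird, not_penguin)` holds, so the proviso of the property fails for C = penguin.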

Plausibility orderings can be generated by a set of
if-then rules tainted with unspecified exceptions. This
set forms a knowledge base supplied by an agent. Each
rule if A then B is modeled by a constraint of the form
$A \cap B >_\Pi A \cap B^c$ on possibility relations. There exists
a single minimally specific element in the set of possibility
relations satisfying all constraints induced by
rules (unless the latter are inconsistent). It corresponds
to the most compact plausibility ranking of states induced
by the rules [3.44]. This ranking can be computed
by an algorithm originally proposed by Pearl [3.45].
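
A ranking of this kind can be sketched in Python as follows. The sketch assumes that the algorithm in question is the tolerance-based stratification usually associated with Pearl's System Z; this identification, the rule encoding as (antecedent, consequent) pairs of predicates, and all names are assumptions of the example. Ranks are degrees of abnormality, so 0 denotes the most plausible states.

```python
from itertools import product

# Worlds are triples (bird, penguin, flies); purely illustrative.
WORLDS = list(product([False, True], repeat=3))

# Default rules "if A then B", given as (antecedent, consequent) predicates.
RULES = [
    (lambda w: w[0], lambda w: w[2]),        # birds fly
    (lambda w: w[1], lambda w: w[0]),        # penguins are birds
    (lambda w: w[1], lambda w: not w[2]),    # penguins do not fly
]


def z_ranking(rules, worlds):
    """Stratify rules by tolerance, then rank worlds by violated rules."""
    z = {}
    remaining = set(range(len(rules)))
    layer = 0
    while remaining:
        # A rule is tolerated if some world verifies it (A and B hold)
        # without violating any of the rules still to be stratified.
        tolerated = {
            i for i in remaining
            if any(
                rules[i][0](w) and rules[i][1](w)
                and all(not rules[j][0](w) or rules[j][1](w) for j in remaining)
                for w in worlds
            )
        }
        if not tolerated:
            raise ValueError("inconsistent rule base")
        for i in tolerated:
            z[i] = layer
        remaining -= tolerated
        layer += 1

    def rank(w):
        # 0 if no rule is violated, else 1 + highest layer of a violated rule.
        violated = [z[i] for i, (a, b) in enumerate(rules) if a(w) and not b(w)]
        return 1 + max(violated) if violated else 0

    return z, rank


z, rank = z_ranking(RULES, WORLDS)
```

On this rule base the generic rule about birds falls in layer 0 and the two penguin rules in layer 1, so a non-flying penguin that is a bird gets rank 1 while a flying penguin gets rank 2.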

Qualitative possibility theory has been studied from
the point of view of cognitive psychology. Experimental
results [3.46] suggest that there are situations where
people reason about uncertainty using the rules of
possibility theory, rather than with those of probability
theory, namely people jump to plausible conclusions
based on assuming the current world is normal.


3.3.4 Possibilistic Logic


Qualitative possibility relations can be represented by
(and only by) possibility measures ranging on any totally
ordered set L (especially a finite one) [3.15]. This
absolute representation on an ordinal scale is slightly
more expressive than the purely relational one. For
instance, one can express that a proposition is fully
plausible ($\Pi(A) = 1$), while using a possibility relation,
one can only say that it is among the most plausible
ones. When the finite set S is large and generated
by a propositional language, qualitative possibility
distributions can be efficiently encoded in possibilistic
logic [3.47-49].

A possibilistic logic base K is a set of pairs $(p_i, \alpha_i)$,
where $p_i$ is an expression in classical (propositional or
first-order) logic and $\alpha_i > 0$ is an element of the value
scale L. This pair encodes the constraint $N(p_i) \geq \alpha_i$
where $N(p_i)$ is the degree of necessity of the set of models
of $p_i$. Each prioritized formula $(p_i, \alpha_i)$ has a fuzzy
set of models (via certainty qualification described in
Sect. 3.2) and the fuzzy intersection of the fuzzy sets
of models of all prioritized formulas in K yields the
associated plausibility ordering on S encoded by a possibility
distribution $\pi_K$. Namely, an interpretation s is
all the less possible as it falsifies formulas with higher
weights, i.e.,


$$\pi_K(s) = 1 \text{ if } s \models p_i,\ \forall (p_i, \alpha_i) \in K , \tag{3.5}$$
$$\pi_K(s) = 1 - \max\{\alpha_i : (p_i, \alpha_i) \in K,\ s \not\models p_i\} \text{ otherwise.} \tag{3.6}$$


This distribution is obtained by applying the minimal
specificity principle, since it is the largest one that
satisfies the constraints $N(p_i) \geq \alpha_i$. If the classical
logic base $\{p_i : (p_i, \alpha_i) \in K\}$ is inconsistent,
$\pi_K$ is not normalized, and a level of inconsistency equal to
$inc(K) = 1 - \max_s \pi_K(s)$ can be attached to the base K.
However, the set of formulas
$\{p_i : (p_i, \alpha_i) \in K,\ \alpha_i > inc(K)\}$ is always
consistent.
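
Equations (3.5) and (3.6) translate directly into code. The following Python sketch assumes a two-atom language and a small, deliberately inconsistent base K whose formulas and weights are invented for the example; exact fractions are used so that the weights behave as elements of a finite scale.

```python
from fractions import Fraction
from itertools import product

ATOMS = ("a", "b")  # hypothetical atoms
WORLDS = [dict(zip(ATOMS, bits)) for bits in product([False, True], repeat=len(ATOMS))]

# Illustrative base K: pairs (formula as a predicate, weight alpha > 0).
K = [
    (lambda s: s["a"], Fraction(4, 5)),                   # (a, 0.8)
    (lambda s: (not s["a"]) or s["b"], Fraction(3, 5)),   # (not a or b, 0.6)
    (lambda s: not s["b"], Fraction(3, 10)),              # (not b, 0.3)
]


def pi_K(K, s):
    """Possibility of interpretation s, following (3.5) and (3.6)."""
    falsified = [alpha for p, alpha in K if not p(s)]
    return 1 - max(falsified) if falsified else Fraction(1)


def inc(K):
    """Inconsistency level: 1 minus the height of pi_K."""
    return 1 - max(pi_K(K, s) for s in WORLDS)


def strict_cut_is_consistent(K):
    """Check that the formulas with weight above inc(K) share a model."""
    level = inc(K)
    return any(all(p(s) for p, alpha in K if alpha > level) for s in WORLDS)
```

For this base the most possible interpretation makes a and b true and has possibility 7/10, so inc(K) = 3/10, and the formulas weighted above that level are jointly satisfiable.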

Syntactic deduction from a set of prioritized clauses
is achieved by refutation using an extension of the standard
resolution rule, whereby $(p \vee q, \min(\alpha, \beta))$ can
be derived from $(p \vee r, \alpha)$ and $(q \vee \neg r, \beta)$.
This rule, which evaluates the validity of an inferred proposition
by the validity of the weakest premise, goes back to
Theophrastus, a disciple of Aristotle. Another way of
presenting inference in possibilistic logic relies on the
fact that $K \vdash (p, \alpha)$ if and only if
$K_\alpha = \{p_i : (p_i, \alpha_i) \in K,\ \alpha_i \geq \alpha\} \vdash p$
in the sense of classical logic. In particular,
$inc(K) = \max\{\alpha : K_\alpha \vdash \bot\}$. Inference in
possibilistic logic can use this extended resolution rule and
proceeds by refutation since $K \vdash (p, \alpha)$ if and only if
$inc(\{(\neg p, 1)\} \cup K) \geq \alpha$. Computational inference
methods in possibilistic logic are surveyed in [3.50].
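
The two characterizations of inference given above, by level cuts and by refutation, can be compared on a toy base. The Python sketch below checks classical entailment by enumerating interpretations rather than by resolution, which is a simplification assumed for the example; the base K and all function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

ATOMS = ("a", "b")  # hypothetical atoms
WORLDS = [dict(zip(ATOMS, bits)) for bits in product([False, True], repeat=len(ATOMS))]

K = [
    (lambda s: s["a"], Fraction(4, 5)),
    (lambda s: (not s["a"]) or s["b"], Fraction(3, 5)),
    (lambda s: not s["b"], Fraction(3, 10)),
]


def inc(K):
    """Inconsistency level of K computed from the distribution pi_K."""
    def pi_K(s):
        falsified = [alpha for p, alpha in K if not p(s)]
        return 1 - max(falsified) if falsified else Fraction(1)
    return 1 - max(pi_K(s) for s in WORLDS)


def entails_by_cut(K, p, alpha):
    """K |- (p, alpha) iff the alpha-cut of K classically entails p."""
    cut = [q for q, weight in K if weight >= alpha]
    return all(p(s) for s in WORLDS if all(q(s) for q in cut))


def entails_by_refutation(K, p, alpha):
    """K |- (p, alpha) iff inc({(not p, 1)} union K) >= alpha."""
    refuted = [(lambda s: not p(s), Fraction(1))] + list(K)
    return inc(refuted) >= alpha


b = lambda s: s["b"]
```

On this base b is entailed at level 3/5 but not at level 4/5 under both tests. Entailment at level 3/10 holds trivially, because that level does not exceed inc(K).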

Possibilistic logic is an inconsistency-tolerant extension
of propositional logic that provides a natural
semantic setting for mechanizing nonmonotonic reasoning
[3.51], with a computational complexity close
to that of propositional logic. Namely, once a possibility
distribution on models is generated by a set of if-then
rules $p_i \rightarrow q_i$ (as explained in Sect. 3.3.3 and
modeled here using qualitative conditioning as
$N(q_i \mid p_i) > 0$), weights $\alpha_i = N(\neg p_i \vee q_i)$
can be computed, and the corresponding possibilistic base
built [3.51]. See [3.52] for an efficient method involving
compilation.

Variants of possibilistic logic have been proposed
in later works. A partially ordered extension of possibilistic
logic has been proposed, whose semantic
counterpart consists of partially ordered models [3.53].
Another approach for handling partial orderings between
weights is to encode formulas with partially
constrained weights in a possibilistic-like many-sorted
propositional logic [3.54]. Namely, a formula $(p, \alpha)$ is
rewritten as a classical two-sorted clause $p \vee ab_\alpha$,
where $ab_\alpha$ means the situation is $\alpha$-abnormal, and
thus the clause expresses that p is true or the situation is
abnormal, while more generally $(p, \min(\alpha, \beta))$ is
rewritten as the clause $p \vee ab_\alpha \vee ab_\beta$. Then a
known constraint between unknown weights such as
$\alpha \geq \beta$ is translated into a clause
$\neg ab_\alpha \vee ab_\beta$. In this way, a possibilistic
logic base, where only partial information about the relative
ordering between the weights is available under
the form of constraints, can be handled as a set of classical
logic formulas that involve symbolic weights.

An efficient inference process has been proposed
using the notion of forgetting variables. This approach
provides a technique for compiling a standard possibilistic
knowledge base in order to process inference
in polynomial time [3.55]. Let us also mention
quasi-possibilistic logic [3.56], an extension of possibilistic
logic based on the so-called quasi-classical logic,
a paraconsistent logic whose inference mechanism is
close to classical inference (except that it is not allowed
to infer $p \vee q$ from p). This approach copes with
inconsistency between formulas having the same weight.
Other types of possibilistic logic can also handle constraints
of the form $\Pi(\phi) \geq \alpha$, or $\Delta(\phi) \geq \alpha$ [3.49].

There is a major difference between possibilistic
logic and weighted many-valued logics [3.57]. Namely,
in the latter, a weight $\tau \in L$ attached to a (many-valued,
thus nonclassical) formula p acts as a truth-value
threshold, and $(p, \tau)$ in a fuzzy knowledge base expresses
the Boolean requirement that the truth value of
p should be at least equal to $\tau$ for $(p, \tau)$ to be valid. So
in such fuzzy logics, while truth of p is many-valued,
the validity of a weighted formula is two-valued. On the
contrary, in possibilistic logic, truth is two-valued (since
p is Boolean), but the validity of a possibilistic formula
$(p, \alpha)$ is many-valued. In particular, it is possible
to cast possibilistic logic inside a many-valued logic.
The idea is to consider many-valued atomic sentences
$\phi$ of the form $(p, \alpha)$, where p is a formula in classical
logic. Then, one can define well-formed formulas
such as $\phi \vee \psi$, $\phi \wedge \psi$, or yet
$\phi \rightarrow \psi$, where the external connectives linking
$\phi$ and $\psi$ are those of the chosen
many-valued logic. From this point of view, possibilistic
logic can be viewed as a fragment of a many-valued
logic that uses only one external connective: conjunction
interpreted as minimum. This approach involving
a Boolean algebra embedded in a nonclassical one has
been proposed by Boldrin and Sossai [3.58] with a view
to augment possibilistic logic with fusion modes cast at
the object level. It is also possible to replace classical
logic by a many-valued logic inside possibilistic logic.
For instance, possibilistic logic has been extended to
Gödel many-valued logic [3.59]. A similar technique
has been used by Hájek et al. to extend possibilistic
logic to a many-valued modal setting [3.60].

Lehmke [3.61] has cast fuzzy logics and possibilistic
logic inside the same framework, considering weighted
many-valued formulas of the form $(p, \theta)$, where p is
a many-valued formula with truth set T, and $\theta$ is a label
defined as a monotone mapping from the truth set
T to a validity set L (a set of possibility degrees). T
and L are supposed to be complete lattices, and the set
of labels has properties that make it a fuzzy extension
of a filter. Labels encompass fuzzy truth-values in the
sense of Zadeh [3.62], such as very true, more or less
true that express uncertainty about (many-valued) truth
in a graded way.

Rather than expressing statements such as "it is half-true
that John is tall", which presupposes a state of
complete knowledge about John's height, one may be
interested in handling states of incomplete knowledge,
namely assertions of the form "all we know is that John
is tall". One way to do it is to introduce fuzzy constants
in a possibilistic first-order logic. Dubois, Prade, and
Sandri [3.63] have noticed that an imprecise restriction
on the scope of an existential quantifier can be
handled in the following way. From the two premises
$\forall x \in A, \neg p(x, y) \vee q(x, y)$, and
$\exists x \in B, p(x, a)$, where a is a constant, we can conclude
that $\exists x \in B, q(x, a)$ provided that $B \subseteq A$. Thus,
letting p(B, a) stand for $\exists x \in B, p(x, a)$, one can write


$$\forall x \in A,\ \neg p(x, y) \vee q(x, y),\ p(B, a) \vdash q(B, a)$$


if $B \subseteq A$, B being an imprecise constant. Letting A and
B be fuzzy sets, the following pattern can be validated
in possibilistic logic


$$(\neg p(x, y) \vee q(x, y), \min(\mu_A(x), \alpha)),\ (p(B, a), \beta) \vdash (q(B, a), \min(N_B(A), \alpha, \beta)) ,$$


where $N_B(A) = \inf_t \max(\mu_A(t), 1 - \mu_B(t))$ is the necessity
measure of the fuzzy event A based on fuzzy
information B. Note that A, which appears in the weight
slot of the first possibilistic formula, plays the role of
a fuzzy predicate, since the formula expresses that the
more x is A, the more certain (up to level $\alpha$) it is that if
p is true for (x, y), q is true for them as well.

Alsinet and Godo [3.64, 65] have applied possibilistic
logic to logic programming that allows for
fuzzy constants [3.65, 66]. They have developed programming
environments based on possibility theory. In
particular, the above inference pattern can be strengthened,
replacing B by its cut $B_\beta$ in the expression of
$N_B(A)$ and extended to a sound resolution rule. They
have further developed possibilistic logic programming
with similarity reasoning [3.67] and more recently
argumentation [3.68, 69].

Lastly, in order to improve the knowledge repre- 
sentation power of the answer-set programming (ASP) 
paradigm, the stable model semantics has been ex- 
tended by taking into account a certainty level, ex- 
pressed in terms of necessity measure, on each rule 
of a normal logic program. It leads to the definition 
of a possibilistic stable model for weighted answer- 
set programming [3.70]. Bauters et al. [3.71] introduce 
a characterization of answer sets of classical and pos- 
sibilistic ASP programs in terms of possibilistic logic 
where an ASP program specifies a set of constraints on 
possibility distributions. 


3.3.5 Ranking Function Theory 


A theory that parallels possibility theory to a large extent
and that has been designed for handling issues
in belief revision, nonmonotonic reasoning and causation,
just like qualitative possibility theory, is the one of
ranking functions by Spohn [3.9, 10, 72]. The main difference
is that it is not really a qualitative theory as it
uses the set of integers including $\infty$ (denoted by
$\mathbb{N}^+$) as a value scale. Hence, it is more expressive than
qualitative possibility theory, but it is applied to the same
problems.

Formally [3.10], a ranking function is a mapping
$\kappa : 2^S \rightarrow \mathbb{N}^+$ such that:


• $\kappa(\{s\}) = 0$ for some $s \in S$;
• $\kappa(A) = \min_{s \in A} \kappa(\{s\})$;
• $\kappa(\emptyset) = \infty$.


It is immediate to verify that the set function
$\Pi(A) = 2^{-\kappa(A)}$ is a possibility measure. So a ranking
function is an integer-valued measure of impossibility
(disbelief). The function $\beta(A) = \kappa(A^c)$ is an
integer-valued necessity measure used by Spohn for measuring
belief, and it is clear that the rescaled necessity measure
is $N(A) = 1 - 2^{-\beta(A)}$. Interestingly, ranking functions
also bear close connection to probability theory [3.72],
viewing $\kappa(A)$ as the exponent of an infinitesimal
probability, of the form $P(A) = \epsilon^{\kappa(A)}$. Indeed the
order of magnitude of $P(A \cup B)$ is then
$\epsilon^{\min(\kappa(A), \kappa(B))}$. Integers
also come up naturally if we consider Hamming distances
between models in the Boolean logic context, if
for instance, the degree of possibility of an interpretation
is a function of its Hamming distance to the closest
model of a classical knowledge base.
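
The correspondence between ranking functions and possibility measures is easy to verify numerically. The Python sketch below assumes a three-state space with invented ranks; `kappa_states` and the function names are illustrative only.

```python
INF = float("inf")

# Hypothetical ranks of singletons; at least one state must have rank 0.
kappa_states = {"s1": 0, "s2": 1, "s3": 3}
S = frozenset(kappa_states)


def kappa(A):
    """kappa(A) is the minimal rank in A, and infinity for the empty set."""
    return min((kappa_states[s] for s in A), default=INF)


def possibility(A):
    """Pi(A) = 2 ** (-kappa(A)), so Pi of the empty set is 0."""
    return 2.0 ** (-kappa(A))


def necessity(A):
    """N(A) = 1 - Pi(A^c) = 1 - 2 ** (-beta(A)), with beta(A) = kappa(A^c)."""
    return 1.0 - possibility(S - A)
```

Because kappa takes the minimum over a union, `possibility` is maxitive: the possibility of A union B equals the larger of the two possibilities.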

Spohn [3.9] also introduces conditioning concepts, 
especially: 


• The so-called A-part of $\kappa$, which is a conditioning
operation by event A defined by
$\kappa(B \mid A) = \kappa(B \cap A) - \kappa(A)$;

• The (A, n)-conditionalization of $\kappa$,
$\kappa(\cdot \mid (A \rightarrow n))$, which is a revision operation
by an uncertain input enforcing $\kappa'(A^c) = n$, and defined by


$$\kappa(s \mid (A \rightarrow n)) = \begin{cases} \kappa(s \mid A) & \text{if } s \in A \\ n + \kappa(s \mid A^c) & \text{otherwise.} \end{cases} \tag{3.7}$$


This operation makes A more believed than $A^c$ by n
steps, namely,


$$\kappa(A \mid (A \rightarrow n)) = 0 ; \quad \kappa(A^c \mid (A \rightarrow n)) = n .$$


It is easy to see that the conditioning of ranking
functions comes down to the product-based conditioning
of numerical possibility measures, and to the
infinitesimal counterpart of usual Bayesian conditioning
of probabilities. The other conditioning rule can
be obtained by means of Jeffrey's rule of conditioning
[3.73] $P(B \mid (A, \alpha)) = \alpha P(B \mid A) + (1 - \alpha) P(B \mid A^c)$
by a constraint of the form $P(A) = \alpha$. Both qualitative
and quantitative counterparts of this revision rule
in possibility theory have been studied in detail [3.74, 
75]. In fact, ranking function theory is formally en- 
compassed by numerical possibility theory. Moreover, 
there is no fusion rule in Spohn theory, while fusion is 
one of the main applications of possibility theory (see 
Sect. 3.5). 
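
Both conditioning operations on ranking functions can be sketched in a few lines of Python. The example assumes that A and its complement are nonempty and contain at least one state of finite rank; the state names, ranks, and function names are hypothetical.

```python
def kappa_of(kappa, A):
    """Rank of a nonempty event: the minimal rank of its states."""
    return min(kappa[s] for s in A)


def a_part(kappa, A):
    """A-part of kappa: kappa(s | A) = kappa(s) - kappa(A) for s in A."""
    base = kappa_of(kappa, A)
    return {s: kappa[s] - base for s in A}


def conditionalize(kappa, A, n):
    """(A, n)-conditionalization, following (3.7)."""
    complement = set(kappa) - set(A)
    on_A = a_part(kappa, A)
    on_complement = a_part(kappa, complement)
    return {s: on_A[s] if s in A else n + on_complement[s] for s in kappa}


kappa = {"s1": 0, "s2": 1, "s3": 3}   # illustrative ranking function
A = {"s2", "s3"}
revised = conditionalize(kappa, A, 2)
```

After revision the rank of A is 0 and the rank of its complement is 2, as the operation requires. Read through the transformation $\Pi = 2^{-\kappa}$, the A-part divides possibility degrees by $\Pi(A)$, which is the product-based conditioning mentioned in the text.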


3.3.6 Possibilistic Belief Networks 


Another compact representation of qualitative possibility
distributions is the possibilistic directed graph,
which uses the same conventions as Bayesian nets, but
relies on conditional possibility [3.76]. The qualitative
approach is based on a symmetric notion of qualitative
independence $\Pi(B \cap A) = \min(\Pi(A), \Pi(B))$ that
is weaker than the causal-like condition
$\Pi(B \mid A) = \Pi(B)$ [3.31]. Like joint probability
distributions, joint possibility distributions can be decomposed
into a conjunction of conditional possibility distributions
(using minimum or product) in a way similar to Bayes
nets [3.76]. A joint possibility distribution associated
with variables $X_1, \ldots, X_n$ is decomposed by the chain rule as


$$\pi(X_1, \ldots, X_n) = \min(\pi(X_n \mid X_1, \ldots, X_{n-1}), \ldots, \pi(X_2 \mid X_1), \pi(X_1)) .$$


Such a decomposition can be simplified by assuming
conditional independence relations between variables,
as reflected by the structure of the graph. The form of
independence between variables at work here is conditional
noninteractivity: Two variables X and Y are independent
in the context Z, if for each instance (x, y, z) of
(X, Y, Z) we have: $\pi(x, y \mid z) = \min(\pi(x \mid z), \pi(y \mid z))$.
Ben Amor and Benferhat [3.77] investigate the 
properties of qualitative independence that enable lo- 
cal inferences to be performed in possibilistic nets. 
Uncertainty propagation algorithms suitable for possi- 
bilistic graphical structures have been studied in [3.78]. 
It is also possible to propagate uncertainty in nondi- 
rected decompositions of joint possibility measures 
as done quite early by Borgelt et al. [3.79]. Counterparts
of product-based numerical possibilistic nets
using ranking functions exist as well [3.10]. Quali- 
tative possibilistic counterparts of decision trees and 
influence diagrams for decision trees have been re- 
cently investigated [3.80,81]. Compilation techniques 
for inference in possibilistic networks have been de- 
vised [3.82]. Finally, the study of possibilistic networks 
from the standpoint of causal reasoning has been investigated,
using the concept of intervention, that comes
down to enforcing the values of some variables so as to 
lay bare their influence on other ones [3.83, 84]. 
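
The min-based chain rule of this section can be illustrated on the smallest possible network, with a single arc from $X_1$ to $X_2$. In the Python sketch below the two binary variables and all possibility degrees are invented for the example.

```python
# Prior possibility distribution on X1 (normalized: some value has degree 1).
pi_x1 = {0: 1.0, 1: 0.4}

# Conditional possibility pi(x2 | x1), keyed by (x2, x1); for each value of
# x1 at least one value of x2 has degree 1.
pi_x2_given_x1 = {(0, 0): 1.0, (1, 0): 0.3, (0, 1): 0.6, (1, 1): 1.0}


def joint(x1, x2):
    """Min-based chain rule: pi(x1, x2) = min(pi(x2 | x1), pi(x1))."""
    return min(pi_x2_given_x1[(x2, x1)], pi_x1[x1])


def marginal_x2(x2):
    """Marginalization uses max instead of the sum of probability theory."""
    return max(joint(x1, x2) for x1 in pi_x1)
```

The joint distribution obtained this way is normalized, and the marginal on $X_2$ gives degree 1 to the value 0 and degree 0.4 to the value 1.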


3.3.7 Fuzzy Rule-Based and Case-Based 
Approximate Reasoning 


A typology of fuzzy rules has been devised in the set- 
ting of possibility theory, distinguishing rules whose 
purpose is to propagate uncertainty through reasoning 
steps, from rules whose main purpose is similarity- 
based interpolation [3.85], depending on the choice 
of a many-valued implication connective that models 
a rule. The bipolar view of information based on $(\delta, \pi)$
pairs sheds new light on the debate between conjunctive 
and implicative representation of rules [3.86]. Repre- 
senting a rule as a material implication focuses on 
counterexamples to rules, while using a conjunction 
between antecedent and consequent points out exam- 
ples of the rule and highlights its positive content. 
Traditionally in fuzzy control and modeling, the lat- 
ter representation is adopted, while the former is the 
logical tradition. Introducing fuzzy implicative rules in 
modeling accounts for constraints or landmark points 
the model should comply with (as opposed to observed 
data) [3.87]. The bipolar view of rules in terms of ex- 
amples and counterexamples may turn out to be very 
useful when extracting fuzzy rules from data [3.88]. 

Fuzzy rules have been applied to case-based reasoning
(CBR). In general, CBR relies on the following
implicit principle: similar situations may lead to similar
outcomes. Thus, a similarity relation S between problem
descriptions or situations, and a similarity measure
T between outcomes are needed. This implicit CBR
principle can be expressed in the framework of fuzzy
rules as: "the more similar (in the sense of S) are the
attribute values describing two situations, the more possible
the similarity (in the sense of T) of the values
of the corresponding outcome attributes." Given a situation
$s_0$ associated to an unknown outcome $t_0$ and
a current case (s, t), this principle enables us to conclude
on the possibility of $t_0$ being equal to a value
similar to t [3.89]. This acknowledges the fact that, often
in practice, a database may contain cases that are
rather similar with respect to the problem description
attributes, but which may be distinct with respect to
outcome attribute(s). This emphasizes that case-based
reasoning can only lead to cautious conclusions.

The possibility rule "the more similar s and $s_0$, the
more possible t and $t_0$ are similar" is modeled in terms
of a guaranteed possibility measure [3.90]. This leads
to enforce the inequality
$\Delta_0(T(t, \cdot)) \geq \mu_S(s, s_0)$, which
expresses that the guaranteed possibility that $t_0$ belongs
to a high degree to the fuzzy set of values that are
T-similar to t is lower bounded by the S-similarity of s
and $s_0$. Then the fuzzy set F of possible values $t'$ for $t_0$
with respect to case (s, t) is given by


$$F_{t_0}(t') = \min(\mu_T(t, t'), \mu_S(s, s_0)) ,$$


since the maximally specific distribution such that
$\Delta(A) \geq \alpha$ is $\delta = \min(\mu_A, \alpha)$. What is
obtained is the fuzzy set $T(t, \cdot)$ of values $t'$ that are
T-similar to t, whose possibility level is truncated at the global
degree $\mu_S(s, s_0)$ of similarity of s and $s_0$. The max-based
aggregation of the various contributions obtained from the
comparison with each case (s, t) in the memory M of
cases acknowledges the fact that each new comparison
may suggest new possible values for $t_0$ and agrees with
the positive nature of the information in the repository
of cases. Thus, we obtain the following fuzzy set $E_{s_0}$ of
the possible values $t'$ for $t_0$


$$E_{s_0}(t') = \max_{(s,t) \in M} \min(\mu_S(s, s_0), \mu_T(t, t')) .$$


This latter expression can be put in parallel with the 
evaluation of a flexible query [3.91]. This approach has 
been generalized to imprecisely or fuzzily described sit- 
uations, and has been related to other approaches to 
instance-based prediction [3.92, 93]. 
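
The max-min aggregation over the memory of cases can be sketched as follows in Python. The numeric situations and outcomes, the triangular similarity function, and its widths are all assumptions made for the example; the cited works do not prescribe them.

```python
def similarity(x, y, width):
    """Hypothetical triangular similarity: 1 at x == y, 0 beyond width."""
    return max(0.0, 1.0 - abs(x - y) / width)


def mu_S(s, s0):
    """Similarity between situations (assumed width 2)."""
    return similarity(s, s0, 2.0)


def mu_T(t, t_prime):
    """Similarity between outcomes (assumed width 10)."""
    return similarity(t, t_prime, 10.0)


# Memory M of cases (situation, outcome); values are illustrative.
MEMORY = [(1.0, 10.0), (2.0, 20.0), (5.0, 50.0)]


def possible_outcome(t_prime, s0, memory=MEMORY):
    """E_{s0}(t') = max over cases of min(mu_S(s, s0), mu_T(t, t'))."""
    return max(min(mu_S(s, s0), mu_T(t, t_prime)) for s, t in memory)
```

For the new situation $s_0 = 2$ the outcome 20 is fully possible, since an identical situation with that outcome is stored, while the outcome 50 receives degree 0 because the only case supporting it describes a dissimilar situation.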


3.3.8 Preference Representation 


Possibility theory also offers a framework for prefer- 
ence modeling in constraint-directed reasoning. Both 
prioritized and soft constraints can be captured by pos- 
sibility distributions expressing degrees of feasibility 
rather than plausibility [3.6]. Possibility theory offers 
a natural setting for fuzzy optimization whose aim is to 
balance the levels of satisfaction of multiple fuzzy con- 
straints (instead of minimizing an overall cost) [3.94]. 
In such problems, some possibility distributions repre- 
sent soft constraints on decision variables, other ones 
can represent incomplete knowledge about uncontrol- 
lable state variables. Qualitative decision criteria are 
particularly adapted to the handling of uncertainty in 
this setting. Possibility distributions can also model 
ill-known constraint coefficients in linear and nonlin- 
ear programming, thus leading to variants of chance- 
constrained programming [3.95]. 

Optimal solutions of fuzzy constraint-based prob- 
lems maximize the satisfaction of the most violated 
constraint, which does not ensure the Pareto dominance
of all such solutions. More demanding optimality notions
have been defined, by canceling equally satisfied
constraints (the so-called discrimin ordering) or using 
a leximin criterion [3.94, 96, 97]. 

Besides, the possibilistic logic setting provides 
a compact representation framework for preferences, 
where possibilistic logic formulas represent priori- 
tized constraints on Boolean domains. This approach 
has been compared to qualitative conditional prefer- 
ence networks (CP nets), based on a systematic ceteris 
paribus assumption (preferential independence between 
decision variables). CP nets induce partial orders of so- 
lutions rather than complete preorders, as possibilistic 
logic does [3.98]. Possibilistic networks can also model 
preference on the values of variables, conditional to the 
value of other ones, and offer an alternative to condi- 
tional preference networks [3.98]. 

Bipolar possibility theory has been applied to preference
problems where one can distinguish between
imperative constraints (modeled by propositions with
a degree of necessity), and nonimperative wishes (modeled
by propositions with a degree of guaranteed possibility)
[3.99]. Another kind of bipolar approach to
qualitative multifactorial evaluation based on possibility
theory arises when comparing objects in terms of their
pros and cons, where the decision maker focuses on the
most important assets or defects. Such qualitative
multifactorial bipolar decision criteria have been defined,
axiomatized [3.100], and empirically tested [3.101]. They
are qualitative counterparts of cumulative prospect theory
criteria of Kahneman and Tversky [3.102].

Two issues in preference modeling based on possi- 
bility theory in a logic format are as follows: 


• Preference statements of the form $\Pi(p) > \Pi(q)$
provide an incomplete description of a preference
relation. One question is then how to complete this
description by default. The principle of minimal
specificity then means that a solution not explicitly
rejected is satisfactory by default. The dual maximal
specificity principle says that a solution not supported
is rejected by default. It is not always clear
which principle is the most natural.

• A statement according to which it is better to satisfy
a formula p than a formula q can in fact be
interpreted in several ways. For instance, it may
mean that the best solution satisfying p is better than
the best solution satisfying q, which reads
$\Pi(p) > \Pi(q)$ and can be encoded in possibilistic logic
under minimal specificity assumption; a stronger
statement is that the worst solution satisfying p is
better than the best solution satisfying q, which reads
$\Delta(p) > \Pi(q)$. Other possibilities are
$\Delta(p) > \Delta(q)$, and $\Pi(p) > \Delta(q)$. This question
is studied in some detail by Kaci [3.103].


3.3.9 Decision-Theoretic Foundations 


Zadeh [3.1] hinted that since our intuition concerning 
the behavior of possibilities is not very reliable, our un- 
derstanding of them 


would be enhanced by the development of an ax- 
iomatic approach to the definition of subjective 
possibilities in the spirit of axiomatic approaches 
to the definition of subjective probabilities. 


Decision-theoretic justifications of qualitative possibil- 
ity were devised, in the style of Von Neumann and 
Morgenstern, and Savage [3.104] more than 15 years 
ago [3.105, 106]. 

On top of the set of states, assume there is a set 
X of consequences of decisions. A decision, or act, is 
modeled as a mapping f from S to X assigning to each 
state S its consequence f(s). The axiomatic approach 
consists in proposing properties of a preference relation 
> between acts so that a representation of this relation 
by means of a preference functional W(f) is ensured, 
that is, act f is as good as act g (denoted by f > g) if 
and only if W(f) > W(g). W(f) depends on the agent’s 
knowledge about the state of affairs, here supposed to 
be a possibility distribution z on S, and the agent’s goal, 
modeled by a utility function u on X. Both the utility 
function and the possibility distribution map to the same 
finite chain L. A pessimistic criterion W; (f) is of the 
form 


Wee (f) = min max(n(sx(s)),u(f(s))), 


where n is the order-reversing map of L. $n(\pi(s))$ is the
degree of certainty that the state is not s (hence the
degree of surprise of observing s), and u(f(s)) is the utility of
choosing act f in state s. $W_\pi^-(f)$ is all the higher as all
states are either very surprising or have high utility. This
criterion is actually a prioritized extension of the Wald
maximin criterion. The latter is recovered if $\pi(s) = 1$
(top of L) $\forall s \in S$. According to the pessimistic criterion,
acts are chosen according to their worst consequences,
restricted to the most plausible states $S^* = \{s, \pi(s) > n(W_\pi^-(f))\}$.
The optimistic counterpart of this criterion
is


$$W_\pi^+(f) = \max_{s \in S} \min\bigl(\pi(s),\, u(f(s))\bigr) .$$

$W_\pi^+(f)$ is all the higher as there is a very plausible state
with high utility. The optimistic criterion was first pro- 
posed by Yager [3.107] and the pessimistic criterion by 
Whalen [3.108]. See Dubois et al. [3.109] for the res- 
olution of decision problems under uncertainty using 
the above criterion, and cast in the possibilistic logic 
framework. Such criteria can be refined by the classical 
expected utility criterion [3.110]. 

These optimistic and pessimistic possibilistic crite- 
ria are particular cases of a more general criterion based 
on the Sugeno integral [3.111] specialized to possibility 
and necessity of fuzzy events [3.1, 20] 


$$S_{\gamma,u}(f) = \max_{\lambda \in L} \min\bigl(\lambda,\, \gamma(F_\lambda)\bigr) ,$$


where $F_\lambda = \{s \in S, u(f(s)) \geq \lambda\}$, and $\gamma$ is a monotonic set
function that reflects the decision-maker's attitude in
the face of uncertainty: $\gamma(A)$ is the degree of confidence in
event A. If $\gamma = \Pi$, then $S_{\Pi,u}(f) = W_\pi^+(f)$. Similarly, if
$\gamma = N$, then $S_{N,u}(f) = W_\pi^-(f)$.
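
The two criteria and the Sugeno integral can be checked on a small finite example. The following is a minimal sketch, not taken from the chapter: it assumes the chain L is encoded as the integers 0 to 4 with order-reversing map n(x) = 4 - x, and the state names, act, and utility values are invented for illustration.

```python
# Finite chain L = {0, 1, 2, 3, 4}; TOP is its greatest element (assumed encoding).
TOP = 4

def n(level):
    """Order-reversing map of L."""
    return TOP - level

def pessimistic(pi, u, f):
    """W^-(f) = min over s of max(n(pi(s)), u(f(s)))."""
    return min(max(n(pi[s]), u[f[s]]) for s in pi)

def optimistic(pi, u, f):
    """W^+(f) = max over s of min(pi(s), u(f(s)))."""
    return max(min(pi[s], u[f[s]]) for s in pi)

def possibility(pi, event):
    """Pi(A) = max of pi over A (bottom of L for the empty event)."""
    return max((pi[s] for s in event), default=0)

def necessity(pi, event):
    """N(A) = n(Pi(complement of A))."""
    return n(possibility(pi, set(pi) - set(event)))

def sugeno(gamma, pi, u, f):
    """S(f) = max over lambda in L of min(lambda, gamma(F_lambda))."""
    return max(
        min(lam, gamma(pi, {s for s in pi if u[f[s]] >= lam}))
        for lam in range(TOP + 1)
    )

# Hypothetical data: three states, three consequences, one act.
pi = {"s1": 4, "s2": 2, "s3": 1}
u = {"a": 4, "b": 1, "c": 3}
f = {"s1": "c", "s2": "b", "s3": "a"}

w_minus = pessimistic(pi, u, f)
w_plus = optimistic(pi, u, f)
```

On this data the Sugeno integral with respect to the necessity measure coincides with `w_minus`, and with respect to the possibility measure with `w_plus`; setting every possibility degree to `TOP` turns `pessimistic` into the plain maximin rule.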

For any acts f, g, and any event A, let fAg denote an
act consisting of choosing f if A occurs and g if its
complement occurs. Let $f \wedge g$ (resp. $f \vee g$) be the act whose
results yield the worst (resp. best) consequence of the
two acts in each state. Constant acts are those whose
consequence is fixed regardless of the state. A result 
in [3.112, 113] provides an act-driven axiomatization of 
these criteria, and enforces possibility theory as a ra- 
tional representation of uncertainty for a finite state 
space S: 


Theorem 3.1
Suppose the preference relation $\succeq$ on acts obeys the
following properties:


1. $(X^S, \succeq)$ is a complete preorder.
2. There are two acts such that $f \succ g$.
3. $\forall A$, $\forall g$ and $h$ constant, $\forall f$, $g \succeq h$ implies $gAf \succeq hAf$.
4. If f is constant, $f \succ h$ and $g \succ h$ imply $f \wedge g \succ h$.
5. If f is constant, $h \succ f$ and $h \succ g$ imply $h \succ f \vee g$.


Then there exists a finite chain L, an L-valued
monotonic set function $\gamma$ on S and an L-valued utility
function u, such that $\succeq$ is representable by a Sugeno
integral of u(f) with respect to $\gamma$. Moreover, $\gamma$ is a
necessity (resp. possibility) measure as soon as property
(4) (resp. (5)) holds for all acts. The preference
functional is then $W_\pi^-(f)$ (resp. $W_\pi^+(f)$).


Axioms (4 and 5) contradict expected utility theory.
They become reasonable if the value scale is finite,
decisions are one-shot (no compensation) and provided
that there is a big step between any level in the
qualitative value scale and the adjacent ones. In other words,
the preference pattern $f \succ h$ always means that f is
significantly preferred to h, to the point of considering
the value of h negligible in front of the value of
f. The above result provides decision-theoretic foundations
of possibility theory, whose axioms can thus be
tested from observing the choice behavior of agents.
See [3.114] for another approach to comparative possibility
relations, more closely relying on Savage axioms,
but giving up any comparability between utility and
plausibility levels. The drawback of these and other
qualitative decision criteria is their lack of discrimination
power [3.115]. To overcome it, refinements of
possibilistic criteria were recently proposed, based on
lexicographic schemes. These refined criteria turn out
to be representable by a classical (but big-stepped) expected utility
criterion [3.110], and Sugeno integral can be refined by
a Choquet integral [3.116]. For extension of this
qualitative decision-making framework to multiple-stage
decision, see [3.117].


3.4 Quantitative Possibility Theory


The phrase quantitative possibility refers to the case
when possibility degrees range in the unit interval, and
are considered in connection with belief function and
imprecise probability theory. Quantitative possibility
theory is the natural setting for a reconciliation
between probability and fuzzy sets. In that case, a precise
articulation between possibility and probability
theories is useful to provide an interpretation to possibility
and necessity degrees. Several such interpretations can
be consistently devised: a degree of possibility can
be viewed as an upper probability bound [3.118], and
a possibility distribution can be viewed as a likelihood
function [3.119]. A possibility measure is also a special
case of a Shafer plausibility function [3.120]. Following
a very different approach, possibility theory can
account for probability distributions with extreme values,
infinitesimal [3.72] or having big steps [3.121]. There
are finally close connections between possibility theory
and idempotent analysis [3.122]. The theory of large
deviations in probability theory [3.123] also handles set
functions that look like possibility measures [3.124].
Here we focus on the role of possibility theory in the
theory of imprecise probability.


3.4.1 Possibility as Upper Probability 


Let $\pi$ be a possibility distribution where $\pi(s) \in [0, 1]$.
Let $\mathcal{P}(\pi)$ be the set of probability measures P such
that $P \leq \Pi$, i.e., $\forall A \subseteq S$, $P(A) \leq \Pi(A)$. Then the
possibility measure $\Pi$ coincides with the upper probability
function $P^*$ such that $P^*(A) = \sup\{P(A), P \in \mathcal{P}(\pi)\}$
while the necessity measure N is the lower probability
function $P_*$ such that $P_*(A) = \inf\{P(A), P \in \mathcal{P}(\pi)\}$;
see [3.118, 125] for details. P and $\pi$ are said to be
consistent if $P \in \mathcal{P}(\pi)$. The connection between possibility
measures and imprecise probabilistic reasoning is es- 
pecially promising for the efficient representation of 
nonparametric families of probability functions, and it 
makes sense even in the scope of modeling linguistic 
information [3.126]. 

A possibility measure can be computed from
nested confidence subsets $\{A_1, A_2, \ldots, A_m\}$ where
$A_i \subset A_{i+1}$, $i = 1, \ldots, m-1$. Each confidence subset $A_i$ is
attached a positive confidence level $\lambda_i$ interpreted as
a lower bound of $P(A_i)$, hence a necessity degree. It is
viewed as a certainty qualified statement that generates
a possibility distribution $\pi_i$ according to Sect. 3.2. The
corresponding possibility distribution is


$$\pi(s) = \min_{i=1,\ldots,m} \pi_i(s) = \begin{cases} 1 & \text{if } s \in A_1 , \\ 1 - \lambda_j & \text{if } j = \max\{i : s \notin A_i\} \geq 1 . \end{cases}$$


The information modeled by $\pi$ can also be viewed
as a nested random set $\{(A_i, \nu_i), i = 1, \ldots, m\}$, where
$\nu_i = \lambda_i - \lambda_{i-1}$. This framework allows for imprecision
(reflected by the size of the $A_i$'s) and uncertainty (the
$\nu_i$'s). And $\nu_i$ is the probability that the agent only knows
that $A_i$ contains the actual state (it is not $P(A_i)$). The
random set view of possibility theory is well adapted
to the idea of imprecise statistical data, as developed
in [3.127, 128]. Namely, given a bunch of imprecise
(not necessarily nested) observations (called focal sets),
$\pi$ supplies an approximate representation of the data, as
$\pi(s) = \sum_{i : s \in A_i} \nu_i$.
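
The agreement between the certainty-qualified reading and the random-set reading can be verified numerically. This is a small sketch under stated assumptions: the four states, the three nested sets, and the confidence levels are made up, the last set is taken to be the whole state space with confidence 1, and $\lambda_0 = 0$.

```python
from fractions import Fraction as F

# Hypothetical nested confidence sets A_1 ⊂ A_2 ⊂ A_3 with increasing levels.
# Assumption: the last set is the whole state space and its level is 1.
states = ["a", "b", "c", "d"]
nested = [{"a"}, {"a", "b"}, {"a", "b", "c", "d"}]
levels = [F(1, 2), F(4, 5), F(1)]

def pi_from_certainty(s):
    """min over i of pi_i(s), where pi_i(s) = 1 if s in A_i, else 1 - lambda_i."""
    return min(F(1) if s in A else 1 - lam for A, lam in zip(nested, levels))

# Masses of the nested random set: nu_i = lambda_i - lambda_{i-1}, with lambda_0 = 0.
masses = [lam - prev for lam, prev in zip(levels, [F(0)] + levels[:-1])]

def pi_from_random_set(s):
    """pi(s) = sum of nu_i over the focal sets A_i that contain s."""
    return sum((nu for A, nu in zip(nested, masses) if s in A), F(0))

pi = {s: pi_from_certainty(s) for s in states}
```

Both functions return the same degrees here (1, 1/2, 1/5, 1/5), which is what the identity $\nu_i = \lambda_i - \lambda_{i-1}$ predicts when the levels increase with the sets.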

In the continuous case, a fuzzy interval M can be
viewed as a nested set of $\alpha$-cuts, which are intervals
$M_\alpha = \{x : \mu_M(x) \geq \alpha\}$, $\forall \alpha > 0$. In that case,
note that the degree of necessity is $N(M_\alpha) = 1 - \alpha$,
and the corresponding probability set is
$\mathcal{P}(\mu_M) = \{P : P(M_\alpha) \geq 1 - \alpha, \forall \alpha > 0\}$. Representing uncertainty by
the family of pairs $\{(M_\alpha, 1 - \alpha) : \forall \alpha > 0\}$ is very
similar to the basic approach of info-gap theory [3.129].

The set $\mathcal{P}(\pi)$ contains many probability distributions,
arguably too many. Neumaier [3.130] has recently
proposed a related framework, in a different
terminology, for representing smaller subsets of probability
measures using two possibility distributions instead
of one. He basically uses a pair of distributions
$(\delta, \pi)$ (in the sense of Sect. 3.2), which he
calls a cloud, where $\delta$ is a guaranteed possibility distribution
(in our terminology) such that $\pi \geq \delta$. A cloud
models the (generally nonempty) set $\mathcal{P}(\pi) \cap \mathcal{P}(1 - \delta)$,
viewing $1 - \delta$ as a standard possibility distribution. The
precise connections between possibility distributions,
clouds and other simple representations of numerical
uncertainty are studied in [3.131].


3.4.2 Conditioning 


There are two kinds of conditioning that can be en- 
visaged upon the arrival of new information E. The 
first method presupposes that the new information
alters the possibility distribution $\pi$ by declaring all states
outside E impossible. The conditional measure $\pi(\cdot \mid E)$
is such that $\Pi(B \mid E) \cdot \Pi(E) = \Pi(B \cap E)$. This is
formally Dempster's rule of conditioning of belief functions,
specialized to possibility measures. The conditional
possibility distribution representing the weighted set of
confidence intervals is


$$\pi(s \mid E) = \begin{cases} \dfrac{\pi(s)}{\Pi(E)} , & \text{if } s \in E \\ 0 & \text{otherwise} . \end{cases}$$


De Baets et al. [3.28] provide a mathematical justifi- 
cation of this notion in an infinite setting, as opposed 
to the min-based conditioning of qualitative possibil- 
ity theory. Indeed, the maxitivity axiom extended to 
the infinite setting is not preserved by the min-based 
conditioning. The product-based conditioning leads to 
a notion of independence of the form $\Pi(B \cap E) = \Pi(B) \cdot \Pi(E)$
whose properties are very similar to the
ones of probabilistic independence [3.30]. 
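
As an illustration of product-based conditioning, here is a minimal sketch on a finite state space. The distribution and the event are invented; the function names are not from the chapter.

```python
from fractions import Fraction as F

def possibility(pi, event):
    """Pi(A) = max of pi over A (0 for the empty event)."""
    return max((pi[s] for s in event), default=F(0))

def condition_product(pi, E):
    """pi(s | E) = pi(s) / Pi(E) if s in E, else 0. Requires Pi(E) > 0."""
    p_E = possibility(pi, E)
    if p_E == 0:
        raise ValueError("cannot condition on an event of possibility 0")
    return {s: (pi[s] / p_E if s in E else F(0)) for s in pi}

# Hypothetical distribution and conditioning event.
pi = {"a": F(1), "b": F(3, 5), "c": F(3, 10), "d": F(1, 10)}
E = {"b", "c"}
post = condition_product(pi, E)

# Check Pi(B | E) * Pi(E) = Pi(B ∩ E) for one event B.
B = {"c", "d"}
lhs = possibility(post, B) * possibility(pi, E)
rhs = possibility(pi, B & E)
```

The conditional distribution is renormalized (its maximum is 1 on E), and `lhs` equals `rhs`, as required by the defining equation.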

Another form of conditioning [3.132, 133], more 
in line with the Bayesian tradition, considers that the 
possibility distribution $\pi$ encodes imprecise statistical
information, and event E only reflects a feature of
the current situation, not of the state in general. Then
the value $\Pi(B \,\|\, E) = \sup\{P(B \mid E), P(E) > 0, P \leq \Pi\}$
is the result of performing a sensitivity analysis of the
usual conditional probability over $\mathcal{P}(\pi)$ [3.134].
Interestingly, the resulting set function is again a possibility
measure, with distribution


$$\pi(s \,\|\, E) = \begin{cases} \max\left(\pi(s), \dfrac{\pi(s)}{\pi(s) + N(E)}\right) , & \text{if } s \in E \\ 0 & \text{otherwise} . \end{cases}$$


It is generally less specific than $\pi$ on E, as clear
from the above expression, and becomes noninformative
when N(E) = 0 (i.e., if there is no information
about E). This is because $\pi(\cdot \,\|\, E)$ is obtained from
the focusing of the generic information $\pi$ over the
reference class E. On the contrary, $\pi(\cdot \mid E)$ operates
a revision process on $\pi$ due to additional knowledge
asserting that states outside E are impossible. See De
Cooman [3.133] for a detailed study of this form of
conditioning.
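
The focusing rule can be tried on a finite example. The sketch below assumes the closed form $\max(\pi(s), \pi(s)/(\pi(s) + N(E)))$ on E; the distribution and events are invented, and states of possibility 0 are kept at 0 to avoid a division by zero when N(E) = 0.

```python
from fractions import Fraction as F

def possibility(pi, event):
    """Pi(A) = max of pi over A (0 for the empty event)."""
    return max((pi[s] for s in event), default=F(0))

def necessity(pi, event):
    """N(A) = 1 - Pi(complement of A)."""
    return 1 - possibility(pi, set(pi) - set(event))

def condition_focusing(pi, E):
    """pi(s || E) = max(pi(s), pi(s) / (pi(s) + N(E))) on E, 0 elsewhere."""
    n_E = necessity(pi, E)
    out = {}
    for s, v in pi.items():
        if s in E and v > 0:
            out[s] = max(v, v / (v + n_E))
        else:
            out[s] = F(0)
    return out

# Hypothetical distribution.
pi = {"a": F(1), "b": F(1, 2), "c": F(1, 5), "d": F(7, 10)}

focused = condition_focusing(pi, {"a", "b", "c"})   # N(E) = 3/10
vacuous = condition_focusing(pi, {"b", "c"})        # N(E) = 0
```

With N(E) = 3/10 the degrees of b and c rise from 1/2 and 1/5 to 5/8 and 2/5, so the result is less specific than $\pi$ on E; with N(E) = 0 every possible state of E gets degree 1.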


3.4.3 Probability-Possibility Transformations


The problem of transforming a possibility distribution 
into a probability distribution and conversely is mean- 
ingful in the scope of uncertainty combination with 
heterogeneous sources (some supplying statistical data,
others linguistic data, for instance). It is useful to cast all
pieces of information in the same framework. The basic
requirement is to respect the consistency principle
$\Pi \geq P$. The problem is then either to pick a probability
measure in $\mathcal{P}(\pi)$, or to construct a possibility measure
dominating P.

There are two basic approaches to possibility/ 
probability transformations, which both respect a form 
of probability-possibility consistency. One, due to
Klir [3.135, 136] is based on a principle of information 
invariance, the other [3.137] is based on optimizing in- 
formation content. Klir assumes that possibilistic and 
probabilistic information measures are commensurate. 
Namely, the choice between possibility and probabil- 
ity is then a mere matter of translation between lan- 
guages neither of which is weaker or stronger than 
the other (quoting Klir and Parviz [3.138]). It sug- 
gests that entropy and imprecision capture the same 
facet of uncertainty, albeit in different guises. The 
other approach, recalled here, considers that going from
possibility to probability increases the precision
of the considered representation (as we go from
a family of nested sets to a random element), while
going the other way around means a loss of specificity.


From Possibility to Probability 
The most basic example of transformation from possi- 
bility to probability is the Laplace principle of insuf- 
ficient reason claiming that what is equally possible 
should be considered as equally probable. A general- 
ized Laplacean indifference principle is then adopted 
in the general case of a possibility distribution $\pi$: the
weights $\nu_i$ bearing on the sets $A_i$ from the nested family
of level cuts of $\pi$ are uniformly distributed on
the elements of these cuts $A_i$. Let $P_i$ be the uniform
probability measure on $A_i$. The resulting probability
measure is $P = \sum_{i=1,\ldots,m} \nu_i \cdot P_i$. This transformation,
already proposed in 1982 [3.139], comes down to selecting
the center of gravity of the set $\mathcal{P}(\pi)$ of probability
distributions dominated by $\pi$. This transformation also
coincides with Smets' pignistic transformation [3.140]
and with the Shapley value of the unanimity game (another
name of the necessity measure) in game theory.
The rationale behind this transformation is to minimize
arbitrariness by preserving the symmetry properties of
the representation. This transformation from possibility
to probability is one-to-one. Note that the definition of
this transformation does not use the nestedness property
of cuts of the possibility distribution. It applies all
the same to nonnested random sets (or belief functions)
defined by pairs $\{(A_i, \nu_i), i = 1, \ldots, m\}$, where $\nu_i$ are
nonnegative reals such that $\sum_{i=1,\ldots,m} \nu_i = 1$.
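
The generalized indifference principle can be sketched in a few lines for a finite, normalized possibility distribution. The function name and the example distribution are illustrative assumptions; the mass of each level cut is taken as the gap between its level and the next lower one.

```python
from fractions import Fraction as F

def pignistic(pi):
    """Spread the mass of each level cut uniformly over its elements.

    Assumes pi is normalized (its maximum is 1) and defined on a finite set.
    """
    levels = sorted(set(pi.values()) | {F(0)}, reverse=True)  # 1 = l_1 > ... > 0
    prob = {s: F(0) for s in pi}
    for hi, lo in zip(levels, levels[1:]):
        cut = [s for s in pi if pi[s] >= hi]   # level cut at height hi
        for s in cut:
            prob[s] += (hi - lo) / len(cut)    # mass hi - lo, shared equally
    return prob

# Hypothetical possibility distribution.
pi = {"a": F(1), "b": F(1, 2), "c": F(1, 5)}
p = pignistic(pi)

# Equal possibility everywhere gives back the Laplace principle.
uniform = pignistic({"x": F(1), "y": F(1), "z": F(1)})
```

Here `p` is (43/60, 13/60, 1/15), which sums to 1 and preserves the ordering of the possibility degrees, and `uniform` assigns 1/3 to each state.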


From Objective Probability to Possibility
From probability to possibility, the rationale of the 
transformation is not the same according to whether 
the probability distribution we start with is subjec- 
tive or objective [3.106]. In the case of a statistically 
induced probability distribution, the rationale is to pre- 
serve as much information as possible. This is in 
line with the handling of A-qualified pieces of in- 
formation representing observed evidence, considered 
in Sect. 3.2; hence we select as the result of the 
transformation of a probability measure P, the most 
specific possibility measure in the set of those dominat- 
ing P [3.137]. This most specific element is generally 
unique if P induces a linear ordering on S. Suppose 
S is a finite set. The idea is to let $\Pi(A) = P(A)$ for
those sets A having minimal probability among other
sets having the same cardinality as A. If
$p_1 > p_2 > \cdots > p_n$, then $\Pi(A) = P(A)$ for sets A of the form
$\{s_i, \ldots, s_n\}$, and the possibility distribution is defined
as $\pi_P(s_i) = \sum_{j=i,\ldots,n} p_j$, with $p_j = P(\{s_j\})$. Note that
$\pi_P$ is a kind of cumulative distribution of P, already
known as a Lorenz curve in the mathematical literature
[3.141]. If there are equiprobable elements, the
uniqueness of the transformation is preserved if equipossibility
of the corresponding elements is enforced. In this
case it is a bijective transformation as well. Recently,
this transformation was used to prove a rather surpris- 
ing agreement between probabilistic indeterminateness 
as measured by Shannon entropy, and possibilistic non- 
specificity. Namely it is possible to compare probability 
measures on finite sets in terms of their relative peaked- 
ness (a concept adapted from Birnbaum [3.142]) by 
comparing the relative specificity of their possibilistic
transforms. Namely, let P and Q be two probability
measures on S and $\pi_P$, $\pi_Q$ the possibility distributions
induced by our transformation. It can be proved
that if $\pi_P \geq \pi_Q$ (i.e., P is less peaked than Q) then
the Shannon entropy of P is higher than the one of
Q [3.143]. This result gives some grounds to the intuitions
developed by Klir [3.135], without assuming
any commensurability between entropy and specificity
indices.
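
The optimal transformation and the peakedness comparison can be illustrated on a finite set. This sketch is an assumption-laden example: the two probability distributions are invented, and ties are handled by giving equiprobable elements the same possibility degree, as the text requires.

```python
import math
from fractions import Fraction as F
from itertools import combinations

def prob_to_poss(p):
    """pi(s) = sum of p(t) over all t with p(t) <= p(s)."""
    return {s: sum((p[t] for t in p if p[t] <= p[s]), F(0)) for s in p}

def entropy(p):
    """Shannon entropy (natural logarithm)."""
    return -sum(float(v) * math.log(float(v)) for v in p.values() if v > 0)

def dominates(pi, p):
    """Check Pi(A) >= P(A) for every nonempty event A."""
    states = list(p)
    return all(
        max(pi[s] for s in A) >= sum(p[s] for s in A)
        for k in range(1, len(states) + 1)
        for A in combinations(states, k)
    )

# Hypothetical probability distributions; Q is more peaked than P.
P = {"a": F(1, 2), "b": F(3, 10), "c": F(1, 5)}
Q = {"a": F(7, 10), "b": F(1, 5), "c": F(1, 10)}
pi_P, pi_Q = prob_to_poss(P), prob_to_poss(Q)
```

On this data `pi_P` is pointwise above `pi_Q`, and the entropy of P is indeed larger than that of Q, in line with the result cited from [3.143].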


Possibility Distributions Induced by Prediction Intervals
In the continuous case, moving from objective prob- 
ability to possibility means adopting a representation 
of uncertainty in terms of prediction intervals around 
the mode viewed as the most frequent value. Extract- 
ing a prediction interval from a probability distribution 
or devising a probabilistic inequality can be viewed 
as moving from a probabilistic to a possibilistic
representation. Namely, suppose a nonatomic probability
measure P on the real line, with unimodal density $\phi$,
and suppose one wishes to represent it by an interval I
with a prescribed level of confidence $P(I) = \gamma$ of hitting
it. The most natural choice is the most precise interval
ensuring this level of confidence. It can be proved that
this interval is of the form of a cut of the density, i.e.,
$I_\gamma = \{s, \phi(s) \geq \theta\}$ for some threshold $\theta$. Moving the
degree of confidence from 0 to 1 yields a nested family
of prediction intervals that form a possibility distribution
$\pi$ consistent with P, the most specific one actually,
having the same support and the same mode as P and
defined by [3.137]


$$\pi(\inf I_\gamma) = \pi(\sup I_\gamma) = 1 - \gamma = 1 - P(I_\gamma) .$$


This kind of transformation again yields a kind of 
cumulative distribution according to the ordering
induced by the density $\phi$. Similar constructs can be
found in the statistical literature (Birnbaum [3.142]). 
More recently Mauris et al. [3.144] noticed that starting 
from any family of nested sets around some charac- 
teristic point (the mean, the median,...), the above 
equation yields a possibility measure dominating P. 
Well-known inequalities of probability theory, such as 
those of Chebyshev and Camp-Meidel, can also be 
viewed as possibilistic approximations of probability 
functions. It turns out that for symmetric unimodal den- 
sities, each side of the optimal possibilistic transform 
is a convex function. Given such a probability density 
on a bounded interval [a, b], the triangular fuzzy number
whose core is the mode of $\phi$ and the support is
[a, b] is thus a possibility distribution dominating P re- 
gardless of its shape (and the tightest such distribution). 
These results justify the use of symmetric triangu- 
lar fuzzy numbers as fuzzy counterparts to uniform 
probability distributions. They provide much tighter 
probability bounds than Chebyshev and Camp-Meidel 
inequalities for symmetric densities with bounded sup- 
port. This setting is adapted to the modeling of sensor 
measurements [3.145]. These results are extended to 
more general distributions by Baudrit et al. [3.146], 
and provide a tool for representing poor probabilis- 
tic information. More recently, Mauris [3.147] unifies, 
by means of possibility theory, many old techniques 
independently developed in statistics for one-point esti- 
mation, relying on the idea of dispersion of an empirical 
distribution. The efficiency of different estimators can 
be compared by means of fuzzy set inclusion applied 
to optimal possibility transforms of probability distribu- 
tions. This unified approach does not presuppose a finite 
variance. 
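
The link between prediction intervals and triangular fuzzy numbers can be checked numerically for symmetric densities. In this sketch the prediction intervals are assumed to be the intervals centered on the mode (which are density cuts for a symmetric unimodal density; for the uniform density this choice is a convention), and both densities live on the hypothetical support [0, 2].

```python
def optimal_transform_symmetric(cdf, mode):
    """Return pi with pi(x) = 1 - P([mode - r, mode + r]), r = |x - mode|."""
    def pi(x):
        r = abs(x - mode)
        return 1.0 - (cdf(mode + r) - cdf(mode - r))
    return pi

def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 2]."""
    return min(1.0, max(0.0, x / 2.0))

def triangular_density_cdf(x):
    """CDF of the symmetric triangular density on [0, 2] with mode 1."""
    if x <= 0.0:
        return 0.0
    if x >= 2.0:
        return 1.0
    return x * x / 2.0 if x <= 1.0 else 1.0 - (2.0 - x) ** 2 / 2.0

def triangular_fuzzy_number(x):
    """Membership of the triangular fuzzy number with core {1}, support [0, 2]."""
    return max(0.0, 1.0 - abs(x - 1.0))

pi_uniform = optimal_transform_symmetric(uniform_cdf, mode=1.0)
pi_peaked = optimal_transform_symmetric(triangular_density_cdf, mode=1.0)
grid = [i / 10.0 for i in range(21)]
```

On the grid, the transform of the uniform distribution coincides with the triangular fuzzy number, while the transform of the more peaked density stays below it, which is the domination property stated above.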


Subjective Possibility Distributions 
The case of a subjective probability distribution is dif- 
ferent. Indeed, the probability function is then supplied 
by an agent who is in some sense forced to express 
beliefs in this form due to rationality constraints, and 
the setting of exchangeable bets. However his actual 
knowledge may be far from justifying the use of a sin- 
gle well-defined probability distribution. For instance in 
case of total ignorance about some value, apart from its 
belonging to an interval, the framework of exchange- 
able bets enforces a uniform probability distribution, 
on behalf of the principle of insufficient reason. Based 
on the setting of exchangeable bets, it is possible to 
define a subjectivist view of numerical possibility the- 
ory, that differs from the proposal of Walley [3.134]. 
The approach developed by Dubois et al. [3.148] relies
on the assumption that when an agent constructs
a probability measure by assigning prices to lotteries, 
this probability measure is actually induced by a be- 
lief function representing the agent’s actual state of 
knowledge. We assume that going from an underly- 
ing belief function to an elicited probability measure 
is achieved by means of the above mentioned pignis- 
tic transformation, changing focal sets into uniform 
probability distributions. The task is to reconstruct this 
underlying belief function under a minimal commit- 
ment assumption. In the paper [3.148], we pose and 
solve the problem of finding the least informative be- 
lief function having a given pignistic probability. We 
prove that it is unique and consonant, thus induced by 
a possibility distribution. The obtained possibility dis- 
tribution can be defined as the converse of the pignistic 
transformation (which is one-to-one for possibility dis- 
tributions). It is subjective in the same sense as in the 
subjectivist school in probability theory. However, it is 
the least biased representation of the agent’s state of 
knowledge compatible with the observed betting be- 
havior. In particular, it is less specific than the one 
constructed from the prediction intervals of an objec- 
tive probability. This transformation was first proposed 
in [3.149] for objective probability, interpreting the em- 
pirical necessity of an event as summing the excess of 
probabilities of realizations of this event with respect 
to the probability of the most likely realization of the 
opposite event. 
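
The converse of the pignistic transformation has a simple closed form on a finite set, $\pi(s) = \sum_t \min(p(s), p(t))$. That formula is not spelled out in the text above and is used here as an assumption; the sketch checks it by a round trip through the pignistic transformation and compares it with the objective transform. The probability values are invented.

```python
from fractions import Fraction as F

def subjective_possibility(p):
    """pi(s) = sum over t of min(p(s), p(t)) (assumed closed form)."""
    return {s: sum((min(p[s], p[t]) for t in p), F(0)) for s in p}

def pignistic(pi):
    """Spread the mass of each level cut of a normalized pi uniformly."""
    levels = sorted(set(pi.values()) | {F(0)}, reverse=True)
    prob = {s: F(0) for s in pi}
    for hi, lo in zip(levels, levels[1:]):
        cut = [s for s in pi if pi[s] >= hi]
        for s in cut:
            prob[s] += (hi - lo) / len(cut)
    return prob

def objective_possibility(p):
    """Most specific possibility distribution dominating p."""
    return {s: sum((p[t] for t in p if p[t] <= p[s]), F(0)) for s in p}

# Hypothetical elicited (betting) probabilities.
p = {"a": F(43, 60), "b": F(13, 60), "c": F(4, 60)}
pi_subj = subjective_possibility(p)
pi_obj = objective_possibility(p)
```

Applying `pignistic` to `pi_subj` returns `p`, so the formula does invert the transformation on this example, and `pi_subj` is pointwise above `pi_obj`, i.e., less specific than the distribution built from an objective probability.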


Possibility Theory and Defuzzification
Possibilistic mean values can be defined using Choquet
integrals with respect to possibility and necessity measures
[3.133, 150], and come close to defuzzification
methods [3.151]. Interpreting a fuzzy interval M,
associated with a possibility distribution $\mu_M$, as a family of
probabilities, upper and lower mean values $E^*(M)$ and
$E_*(M)$ can be defined as [3.152]


$$E_*(M) = \int_0^1 \inf M_\alpha \, d\alpha ; \qquad E^*(M) = \int_0^1 \sup M_\alpha \, d\alpha ,$$


where $M_\alpha$ is the $\alpha$-cut of M.

Then the mean interval $E(M) = [E_*(M), E^*(M)]$ of
M is the interval containing the mean values of all
random variables consistent with M, that is
$E(M) = \{E(P) \mid P \in \mathcal{P}(\mu_M)\}$, where E(P) represents the
expected value associated with the probability measure
P. That the mean value of a fuzzy interval is an
interval seems to be intuitively satisfactory. In particular,
the mean interval of a (regular) interval [a, b] is this
interval itself. The upper and lower mean values are
linear with respect to the addition of fuzzy numbers.
Define the addition M + N as the fuzzy interval whose
cuts are $M_\alpha + N_\alpha = \{s + t, s \in M_\alpha, t \in N_\alpha\}$ defined
according to the rules of interval analysis. Then
E(M + N) = E(M) + E(N), and similarly for the scalar
multiplication E(aM) = aE(M), where aM has membership
grades of the form $\mu_M(s/a)$ for $a \neq 0$. In view of this
property, it seems that the most natural defuzzification
method is the middle point $\hat{E}(M)$ of the mean interval
(originally proposed by Yager [3.153]). Other
defuzzification techniques do not generally possess this kind
of linearity property. $\hat{E}(M)$ has a natural interpretation
in terms of simulation of a fuzzy variable [3.154],
and is the mean value of the pignistic transformation
of M. Indeed it is the mean value of the empirical
probability distribution obtained by the random process
defined by picking an element $\alpha$ in the unit interval
at random, and then an element s in the cut $M_\alpha$ at
random.


3.5 Some Applications


Possibility theory has not been the main framework
for engineering applications of fuzzy sets in the past.
However, on the basis of its connections to symbolic
artificial intelligence, to decision theory and to
imprecise statistics, we consider that it has significant
potential for further applied developments in a number
of areas, including some where fuzzy sets are not yet
always accepted. Only some directions are pointed out
here.


3.5.1 Uncertain Database Querying and Preference Queries


The evaluation of a flexible query in the face of incom- 
plete or fuzzy information amounts to computing the 
possibility and the necessity of the fuzzy event express- 
ing the gradual satisfaction of the query [3.155]. This 
evaluation, known as fuzzy pattern matching [3.156, 
157], corresponds to the extent to which fuzzy sets
(representing the query) overlap, or include the possibility
distributions (representing the available informa-
tion). Such an evaluation procedure has been extended 
to symbolic labels that are no longer represented by 
possibility distributions, but which belong to possi- 
bilistic ontologies where approximate similarity and 
subsumption between labels are estimated in terms of 
possibility and necessity degrees, respectively [3.158]. 
These approaches presuppose a total lack of depen- 
dencies between ill-known attributes. A more general 
approach based on possible world semantics has been 
envisaged [3.159]. However, as for the probabilistic 
counterpart of this latter view, evaluating queries has 
a high computational cost [3.160]. This is why it has 
been proposed to only use certainty qualified values (or 
disjunctions of values), as in possibilistic logic, rather 
than general possibility distributions, for representing 
attribute values pervaded with uncertainty. It has been 
shown that it leads to a tractable extension of relational 
algebra operations [3.161, 162]. 
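
Fuzzy pattern matching on a single attribute can be sketched as follows. The example assumes the usual definitions of the possibility and necessity of a fuzzy event, $\Pi(Q) = \max_x \min(\mu_Q(x), \pi(x))$ and $N(Q) = \min_x \max(\mu_Q(x), 1 - \pi(x))$, on a discretized domain; the attribute, the query, and all degrees are invented.

```python
from fractions import Fraction as F

def possibility_of_match(mu_query, pi_data):
    """Degree to which the query overlaps the ill-known attribute value."""
    return max(min(mu_query[x], pi_data[x]) for x in pi_data)

def necessity_of_match(mu_query, pi_data):
    """Degree to which the query includes the ill-known attribute value."""
    return min(max(mu_query[x], 1 - pi_data[x]) for x in pi_data)

# Hypothetical query "rent is cheap" over a discretized rent domain.
mu_cheap = {300: F(1), 400: F(1), 500: F(1, 2), 600: F(0)}
# Hypothetical possibility distribution describing one apartment's rent.
pi_rent = {300: F(0), 400: F(1), 500: F(1), 600: F(1, 5)}

poss = possibility_of_match(mu_cheap, pi_rent)
nec = necessity_of_match(mu_cheap, pi_rent)
```

The apartment possibly satisfies the query to degree 1 and certainly satisfies it to degree 1/2; a query evaluator would typically rank answers by the pair of degrees.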

Besides, possibility theory is not only useful for 
representing qualitative uncertainty, but it may also 
be of interest for representing preferences, and as 
such may be applied to the handling of preference
queries [3.163]. Thus, requirements of the form A and
preferably B (i.e., it is more satisfactory to have A and 
B than A alone), or A or at least B can be expressed 
using appropriate priority orderings, as in possibilistic 
logic [3.164]. Lastly, in bipolar queries [3.165-167], 
flexible constraints that are more or less compulsory 
are distinguished from additional wishes that are op- 
tional, as for instance in the request find the apartments 
that are cheap and maybe near the train station. Indeed, 
negative preferences express what is (more or less, or 
completely) impossible or undesirable, and by com- 
plementation state flexible constraints restricting the 
possible or acceptable values. Positive preferences are 
not compulsory, but rather express wishes; they state 
what attribute values would be really satisfactory. 


3.5.2 Description Logics 


Description logics (initially named terminological log- 
ics) are tractable fragments of first-order logic repre- 
sentation languages that handle notions of concepts, 
roles and instances, referring at the semantic level to 
the respective notions of set, binary relations, mem- 
bership, and cardinality. They are useful for describing 
ontologies that consist in hierarchies of concepts in 
a particular domain, for the semantic web. Two ideas 
that, respectively, come from fuzzy sets and possibility
theory, and that may be combined, may be used for
extending the expressive power of description logics. 
On one hand, vague concepts can be approximated in 
practice by pairs of nested sets corresponding to the 
cores and the supports of fuzzy sets, thus sorting out 
the typical elements, in a way that agrees with fuzzy set 
operations and inclusions. On the other hand, a possi- 
bilistic treatment of uncertainty and exceptions can be 
performed on top of a description logic in a possibilistic 
logic style [3.168]. In both cases, the underlying prin- 
ciple is to remain as close as possible to classical logic 
for preserving computational efficiency as much as possible.
Thus, formal expressions such as $(P \sqsubseteq_\alpha^X Q, \beta)$
intend to mean that it is certain at least at level $\beta$ that
the degree of subsumption of concept P in Q is at least
$\alpha$, in the sense of some X-implication (e.g., Gödel, or
Kleene-Dienes implication). In particular, it can be expressed
that typical Ps are Qs, or that typical Ps are
typical Qs, or that an instance is typical of a concept.
Such ideas have been developed by Qi et al. [3.169]
toward implemented systems in connection with web 
research. 


3.5.3 Information Fusion 


Possibility theory offers a simple, flexible framework 
for information fusion that can handle incompleteness 
and conflict. For instance, intervals or fuzzy intervals 
can be merged, coming from several sources. The ba- 
sic fusion modes are the conjunctive and disjunctive 
modes, presupposing, respectively, that all sources of 
information are reliable and that at least one is [3.170, 
171]. In the conjunctive mode, the use of the minimum 
operation avoids assuming sources are independent. If 
they are, the product rule can be applied, whereby 
low plausibility degrees reinforce toward impossibil- 
ity. Quite often, the results of a conjunctive aggregation 
are subnormalized, this indicating a conflict. Then, it 
is common to apply a renormalization step that makes 
this mode of combination brittle in case of strong con- 
flict, and anyway the more numerous the sources the 
more conflicting they become. Weighted average of 
possibility degrees can be used but it does not pre- 
serve the properties of possibility measure. The use of 
the disjunctive mode is more cautious: it avoids the 
conflict at the expense of losing information. When 
many sources are involved the result becomes totally 
uninformative. 
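
The basic fusion modes can be compared on two small distributions. This is an illustrative sketch only: the two sources are invented, and the conflict is measured as one minus the height of the conjunctive result, which is one common convention rather than something fixed by the text.

```python
from fractions import Fraction as F

def conjunctive(pi1, pi2):
    """Minimum rule: both sources reliable, no independence assumed."""
    return {s: min(pi1[s], pi2[s]) for s in pi1}

def product(pi1, pi2):
    """Product rule: both sources reliable and independent."""
    return {s: pi1[s] * pi2[s] for s in pi1}

def disjunctive(pi1, pi2):
    """Maximum rule: at least one source reliable."""
    return {s: max(pi1[s], pi2[s]) for s in pi1}

def height(pi):
    return max(pi.values())

def renormalize(pi):
    """Divide by the height; undefined under total conflict."""
    h = height(pi)
    if h == 0:
        raise ValueError("total conflict: nothing to renormalize")
    return {s: v / h for s, v in pi.items()}

# Two hypothetical, partially conflicting sources.
pi1 = {"a": F(1), "b": F(1, 2), "c": F(1, 5)}
pi2 = {"a": F(1, 5), "b": F(1), "c": F(1, 2)}

conj = conjunctive(pi1, pi2)
conflict = 1 - height(conj)
```

The conjunctive result has height 1/2 (conflict 1/2) and must be renormalized, the product rule pushes the jointly doubtful state c further toward impossibility, and the disjunctive result is normalized but much less informative.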

To cope with this problem, some ad hoc adaptive
combination rules have been proposed that focus
on maximal subsets of sources that are either
fully consistent or not completely inconsistent [3.170].
This scheme has been further improved by Oussalah 
et al. [3.172]. Oussalah [3.173] has proposed a num- 
ber of postulates a possibilistic fusion rule should 
satisfy. Another approach is to merge the set of cuts 
of the possibility distributions based on the maximal 
consistent subsets of sources (consistent subsets of 
cuts are merged using conjunction, and the results are 
merged disjunctively). The result is then a belief func- 
tion [3.174]. Another option is to make a guess on 
the number of reliable sources and merge informa- 
tion inside consistent subsets of sources having this 
cardinality. 

Possibilistic information fusion can be performed 
syntactically on more compact representations such as 
possibilistic logic bases [3.175] (the merging of possibilistic
networks [3.176] has also been recently considered).
The latter type of fusion may be of interest both
from a computational and from a representational point
of view. Still it is important to make sure that the syntactic
operations are counterparts of semantic ones. Fusion
should be performed both at the semantic and at the syn- 
tactic levels equivalently. For instance, the conjunctive 
merging of two possibility distributions corresponds to 
the mere union of the possibilistic bases that represent 
them. More details for other operations can be found 
in [3.175, 177], and in the bipolar case in [3.99]. This 
line of research is pursued by Qi et al. [3.178]. They 
also proposed an approach to measuring conflict be- 
tween possibilistic knowledge bases [3.179]. 

The distance-based approach [3.180] that applies to 
the fusion of classical logic bases can be embedded 
in the possibilistic fusion setting as well [3.177]. The 
distance between an interpretation s and each classical
base K is usually defined as $d(s, K) = \min\{H(s, s^*) : s^* \models K\}$
(where $H(s, s^*)$ is the Hamming distance that
evaluates the number of literals with different signs in s
and $s^*$). It is then easy to encode the distance d(s, K)
into a possibilistic knowledge base (interpreting possibility
as Hamming-distance-based similarity to the
models of K, i.e., $\pi(s) = a^{d(s,K)}$, $a \in (0, 1)$). The result
of the possibilistic fusion is a possibilistic knowledge 
base, the highest weight layer of which is the classical 
database that is searched for, provided that the distance 
merging operation is suitably translated to a possibilis- 
tic merging operation. 
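
The encoding of d(s, K) as a possibility degree can be illustrated by a minimal sketch. It assumes that interpretations are tuples of truth values over a fixed list of atoms and that the base K is given directly by its set of models; the base `models_K`, the atom order and the value a = 0.5 are hypothetical choices made only for illustration.

```python
from itertools import product

def hamming(s, t):
    """Number of atoms on which the interpretations s and t disagree."""
    return sum(a != b for a, b in zip(s, t))

def distance_to_base(s, models):
    """d(s, K): smallest Hamming distance from s to a model of K."""
    return min(hamming(s, m) for m in models)

def possibility(s, models, a=0.5):
    """pi_K(s) = a ** d(s, K), with 0 < a < 1 (a = 0.5 is an arbitrary choice)."""
    return a ** distance_to_base(s, models)

# Hypothetical base K over the atoms (p, q, r), described by its two models.
models_K = [(True, True, False), (True, False, False)]
interpretations = list(product([False, True], repeat=3))
pi = {s: possibility(s, models_K) for s in interpretations}
```

Models of K receive possibility 1, and the degree decreases geometrically with the number of literals that must be flipped to reach a model.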

A similar problem exists in belief revision where an 
epistemic state, represented either by a possibility dis- 
tribution or by a possibilistic logic base, is revised by 
an input information p [3.181]. Revision can be viewed 
as prioritized fusion, using for instance conditioning, 


or other operations, depending on whether, in the revised
epistemic state, one wants to enforce N(p) = 1, or N(p) > 0
only, or whether we are dealing with an uncertain input (p, α).
Then, the uncertain input may be understood as enforcing
N(p) ≥ α in any case, or as taking it into account
only if it is sufficiently certain w.r.t. the current
epistemic state.


3.5.4 Temporal Reasoning and Scheduling 


Temporal reasoning may refer to time intervals or to 
time points. When handling time intervals, the basic 
building block is the one provided by Allen relations 
between time intervals. There are 13 relations that de- 
scribe the possible relative locations of two intervals. 
For instance, given the two intervals A = [a, a′] and
B = [b, b′], A is before (resp. after) B means a′ < b
(resp. b′ < a), A meets (resp. is met by) B means a′ = b
(resp. b′ = a), A overlaps (resp. is overlapped by) B iff
b > a and a′ > b and b′ > a′ (resp. a > b and b′ > a,
and a′ > b′). The introduction of fuzzy features in temporal
reasoning can be related to two different issues:


• First, it can be motivated by the need for a gradual,
linguistic-like description of temporal relations
even in the face of complete information. An
extension of Allen relational calculus has then been
proposed, based on fuzzy comparators
expressing linguistic tolerance, which are used in
place of the exact relations >, =, and <. Fuzzy
Allen relations are thus defined from three fuzzy
relations between dates that can be, for instance,
approximately equal, clearly greater, and clearly
smaller, where, e.g., the extent to which x is approximately
equal to y is the degree of membership
of x − y to some fuzzy set expressing something like
small [3.182, 183].

• Second, the possibilistic handling of fuzzy or incomplete
information pervades classical
Allen relations, and more generally fuzzy Allen
relations, with uncertainty. Patterns for propagating
uncertainty and composing the different
(fuzzy) Allen relations in a possibilistic way have
then been laid bare [3.184, 185].
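
As a rough illustration of the first issue, the sketch below implements three crisp Allen relations on intervals given as pairs, together with one possible fuzzy comparator. The trapezoidal shape of the fuzzy set "small" and its parameters `delta` and `eps` are assumptions made here for illustration; they are not taken from [3.182, 183].

```python
def before(A, B):
    """A = (a, a') is before B = (b, b') iff a' < b."""
    return A[1] < B[0]

def meets(A, B):
    """A meets B iff a' = b."""
    return A[1] == B[0]

def overlaps(A, B):
    """A overlaps B iff b > a and a' > b and b' > a'."""
    (a, a2), (b, b2) = A, B
    return b > a and a2 > b and b2 > a2

def approx_equal(x, y, delta=1.0, eps=1.0):
    """Degree to which x - y is 'small' (assumed trapezoidal fuzzy set):
    1 if |x - y| <= delta, 0 if |x - y| >= delta + eps, linear in between."""
    d = abs(x - y)
    if d <= delta:
        return 1.0
    if d >= delta + eps:
        return 0.0
    return (delta + eps - d) / eps

def fuzzy_meets(A, B, delta=1.0, eps=1.0):
    """Graded version of 'meets': a' is approximately equal to b."""
    return approx_equal(A[1], B[0], delta, eps)
```

With `delta = 0` and `eps` tending to 0, `fuzzy_meets` reduces to the crisp relation, which is the sense in which the fuzzy comparators generalize the exact ones.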


Besides, the handling of temporal reasoning in
terms of relations between time points can also be
extended to the case of uncertain information [3.186]. Uncertain
relations between temporal points are represented
by means of possibility distributions over the three basic
relations >, =, and <. Operations for computing
inverse relations, for composing relations, for combining
relations coming from different sources and pertaining 
to the same temporal points, or for handling negation, 
have been defined. This shows that possibilistic tempo- 
ral uncertainty can be handled in the setting of point 
algebra. The possibilistic approach can then be favor- 
ably compared with a probabilistic approach previously 
proposed (first, the approach can be purely qualitative, 
thus avoiding the necessity of quantifying uncertainty 
if information is poor, and second, it is capable of 
modeling ignorance in a nonbiased way). Possibilistic
logic has also been extended to a timed version
in which time intervals during which a proposition is more or
less certainly true are attached to classical propositional
formulas [3.187].

Applications of possibility theory-based decision- 
making can be found in scheduling. One issue is to 
handle fuzzy due dates of jobs using the calculus of 
fuzzy constraints [3.188]. Another issue is to han- 
dle uncertainty in task durations in basic scheduling 
problems such as program evaluation and review tech- 
nique (PERT) networks. A large literature exists on this 
topic [3.189, 190], where the role of fuzzy sets is not always
very clear. Convincing solutions to this problem
start with the works of Chanas and Zielinski [3.191,
192], where the problem is posed in terms of projecting
a joint possibility distribution on quantities of interest (earliest
finishing times, or slack times) and where tasks can
be possibly or certainly critical. A full solution applying
Boolean possibility theory to interval uncertainty of
task durations is described in [3.193], and its fuzzy
extension in [3.194]. Other scheduling problems are
solved in the same possibilistic framework by Kasper- 
ski and colleagues [3.195, 196], as well as more general 
optimization problems [3.197, 198]. 


3.5.5 Risk Analysis 


The aim of risk analysis studies is to perform un- 
certainty propagation under poor data and without 
independence assumptions (see the papers in the spe- 
cial issue [3.199]). Finding the potential of possibilis- 
tic representations in computing conservative bounds 
for such probabilistic calculations is certainly a ma- 
jor challenge [3.200]. An important research direc- 
tion is the comparison between fuzzy interval anal- 
ysis [3.33] and random variable calculations with 


a view to unifying them [3.201]. Methods for joint 
propagation of possibilistic and probabilistic infor- 
mation have been devised [3.202], based on casting 
both in a random set setting [3.203]; the case of 
probabilistic models with fuzzy interval parameters 
has also been dealt with [3.204]. The active area of 
fuzzy random variables is also connected to this ques- 
tion [3.205]. 


3.5.6 Machine Learning 


Applications of possibility theory to learning have 
started to be investigated rather recently in differ- 
ent directions. For instance, taking advantage of the 
proximity between reinforcement learning and partially 
observed Markov decision processes, a possibilistic 
counterpart of reinforcement learning has been pro- 
posed after developing the possibilistic version of the 
latter [3.206]. Besides, by looking for big-stepped prob- 
ability distributions, defined by discrete exponential 
distributions, one can mine data bases for discovering 
default rules [3.207]. Big-stepped probabilities mimic
possibility measures in the sense that P(A) > P(B)
if and only if $\max_{s \in A} p(s) > \max_{s \in B} p(s)$. The version
space approach to learning presents interesting
similarities with the binary bipolar possibilistic rep- 
resentation setting, thinking of examples as positive 
information and of counterexamples as negative in- 
formation [3.208]. The general bipolar setting, where 
intermediary degrees of possibility are allowed, pro- 
vides a basis for extending version space approach 
in a graded way, where examples and counter ex- 
amples can be weighted according to their impor- 
tance. The graded version space approach agrees with 
the possibilistic extension of inductive logic program- 
ming [3.209]. Indeed, where the background knowl- 
edge may be associated with certainty levels, the 
examples may be more or less important to cover, 
and the set of rules that is learnt may be stratified 
in order to have a better management of exceptions 
in multiple-class classification problems, in agree- 
ment with the possibilistic approach to nonmonotonic 
reasoning. 
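
The way big-stepped probabilities mimic possibility measures can be checked on a small example. The sketch below assumes the usual reading of "big-stepped" (each probability exceeds the sum of all smaller ones) and compares only disjoint nonempty events, which is a restriction added here; the distribution `p` is a hypothetical example.

```python
from fractions import Fraction
from itertools import combinations

def is_big_stepped(p):
    """True if each probability exceeds the sum of all smaller ones."""
    vals = sorted(p.values(), reverse=True)
    return all(vals[i] > sum(vals[i + 1:]) for i in range(len(vals) - 1))

def prob(p, A):
    """Probability of the event A."""
    return sum(p[s] for s in A)

def poss(p, A):
    """Possibility-like evaluation of A: the largest elementary probability."""
    return max(p[s] for s in A)

def nonempty_subsets(states):
    states = list(states)
    for r in range(1, len(states) + 1):
        for c in combinations(states, r):
            yield frozenset(c)

def agrees_on_disjoint_events(p):
    """Check P(A) > P(B) <=> max_A p > max_B p over disjoint nonempty A, B."""
    subsets = list(nonempty_subsets(p))
    return all(
        (prob(p, A) > prob(p, B)) == (poss(p, A) > poss(p, B))
        for A in subsets for B in subsets if not (A & B)
    )

# Hypothetical big-stepped distribution: 8/15 > 7/15, 4/15 > 3/15, 2/15 > 1/15.
p = {"s1": Fraction(8, 15), "s2": Fraction(4, 15),
     "s3": Fraction(2, 15), "s4": Fraction(1, 15)}
# A distribution that is not big-stepped, for comparison.
q = {"s1": Fraction(2, 5), "s2": Fraction(3, 10), "s3": Fraction(3, 10)}
```

For `q`, the events {s1} and {s2, s3} are ordered differently by probability and by maximal elementary probability, so the equivalence fails.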

Other applications of possibility theory can be 
found in fields such as data analysis [3.79, 210,211], 
diagnosis [3.212,213], belief revision [3.181], argu- 
mentation [3.68, 214, 215], etc. 
3.6 Some Current Research Lines


A number of ongoing works deal with new research 
lines where possibility theory is central. In the follow- 
ing, we outline a few of those: 


• Formal concept analysis: Formal concept analysis
(FCA) studies Boolean data tables relating objects
and attributes. The key issue of FCA is to extract
so-called concepts from such tables. A concept
is a maximal set of objects sharing a maximal
number of attributes. The enumeration of such concepts
can be carried out via a Galois connection
between objects and attributes, and this Galois connection
uses operators similar to the Δ function of
possibility theory. Based on this analogy, other correspondences
can be laid bare using the three other
set functions of possibility theory [3.216, 217]. In
particular, one of these correspondences detects independent
subtables [3.22]. This approach can be
systematized to fuzzy or uncertain versions of formal
concept analysis.

• Generalized possibilistic logic: Possibilistic logic,
in its basic version, attaches degrees of necessity
to formulas, which turns them into graded modal
formulas of the necessity kind. However, only conjunctions
of weighted formulas are allowed. Yet
very early we noticed that it makes sense to extend
the language toward handling constraints on
the degree of possibility of a formula. This requires
allowing for negation and disjunctions of
necessity-qualified propositions. This extension, still
under study [3.218], puts together the KD modal
logic and basic possibilistic logic. Recently it has
been shown that nonmonotonic logic programming
languages can be translated into generalized possibilistic
logic, making the meaning of negation
by default in rules much more transparent [3.219].
This move from basic to generalized possibilistic
logic also enables further extensions to the multiagent
and the multisource case [3.220] to be
considered. Besides, it has been recently shown
that a Sugeno integral can also be represented in
terms of possibilistic logic, which enables us to lay
bare the logical description of an aggregation process
[3.221].


• Qualitative capacities and possibility measures:
While a numerical possibility measure is equivalent
to a convex set of probability measures, it turns
out that in the qualitative setting, a monotone set 
function can be represented by means of a family 
of possibility measures [3.222, 223]. This line of re- 
search enables qualitative counterparts of results in 
the study of Choquet capacities in the numerical set- 
tings to be established. Especially, a monotone set 
function can be seen as the counterpart of a belief 
function, and various concepts of evidence the- 
ory can be adapted to this setting [3.224]. Sugeno 
integral can be viewed as a lower possibilistic ex- 
pectation in the sense of Sect. 3.3.9 [3.223]. These 
results enable the structure of qualitative monotonic 
set functions to be laid bare, with possible con- 
nection with neighborhood semantics of nonregular 
modal logics [3.225]. 

• Regression and kriging: Fuzzy regression analysis
is seldom envisaged from the point of view of
possibility theory. One exception is the possibilis- 
tic regression initiated by Tanaka and Guo [3.211], 
where the idea is to approximate precise or set- 
valued data in the sense of inclusion by means 
of a set-valued or fuzzy set-valued linear function 
obtained by making the linear coefficients of a lin- 
ear function fuzzy. The alternative approach is the 
fuzzy least squares of Diamond [3.226] where fuzzy 
data are interpreted as functions and a crisp dis- 
tance between fuzzy sets is often used. However, in 
this approach, fuzzy data are questionably seen as 
objective entities [3.227]. The introduction of pos- 
sibility theory in regression analysis of fuzzy data 
comes down to an epistemic view of fuzzy data 
whereby one tries to construct the envelope of all 
linear regression results that could have been ob- 
tained, had the data been precise [3.228]. This view 
has been applied to the kriging problem in
geostatistics [3.229]. Another use of possibility theory
consists in exploiting possibility–probability transforms
to develop a form of quantile regression on
crisp data [3.230], yielding a fuzzy function that is 
much more faithful to the data set than what a fuzzi- 
fied linear function can offer. 
4. Aggregation Functions on [0,1]


Radko Mesiar, Anna Kolesárová, Magda Komorníková


After a brief presentation of the history of aggregation,
we recall the concept of aggregation functions
on [0,1] and on a general interval $I \subseteq [-\infty, \infty]$.
We give a list of basic examples as well as some
peculiar examples of aggregation functions. After
discussing the classification of aggregation
functions on [0,1] and presenting the prototypical
examples for each introduced class, we also
recall several construction methods for aggregation
functions, including optimization methods,
extension methods, constructions based on given
aggregation functions, and introduction of
weights. Finally, a remark on aggregation of more
general inputs, such as intervals, distribution
functions, or fuzzy sets, is added.


Aggregation (fusion, joining) of several input values
into one, in some sense the most informative value,
is a basic processing method in any field dealing with
quantitative information. We only recall mathematics,
physics, economy, sociology or finance, among others.

Basic arithmetical operations of addition and multiplication
on $[0, \infty]$ are typical examples of aggregation
functions. As another example let us recall integration
and its application to geometry allowing us to compute
areas, surfaces, volumes, etc.


4.1 Historical and Introductory Remarks 


It is precisely in the field of integration that one finds the first
historical traces of aggregation known in written
form. Recall the Moscow mathematical papyrus and
its problem no. 14, dating back to 1850 BC, concerning
the computation of the volume of a pyramidal
frustum [4.1], or the method of exhaustion, which allows
several types of areas to be computed, proposed by Eudoxus
of Cnidos around 370 BC [4.2]. The roots of a recent
penalty-based method of constructing aggregation
functions [4.3] can be found in the books of Apollonius
of Perga (who lived about 262–190 BC), who
(motivated by center of gravity problems) proposed
an approach leading to the centroid, i.e., to the arithmetic
mean, minimizing the sum of squares of the
Euclidean distances of the given n points from an unknown
but fixed one. The generalization of the method of Apollonius
of Perga based on a general norm is known as
the Fréchet mean, or also as the Karcher mean, and it
was discussed in depth in [4.4].

Another type of mean, the Heronian mean of two
nonnegative numbers x and y, is given by the formula

$$He(x, y) = \frac{1}{3}\left(x + \sqrt{xy} + y\right). \tag{4.1}$$

It is named after Hero of Alexandria (10–70 AD), who
used this aggregation function for finding the volume of
a conical or pyramidal frustum. He showed that this volume
is equal to the product of the height of the frustum
and the Heronian mean of the areas of the parallel bases.

Another interesting historical example can be found
in multivalued logic. Aristotle (384–322 BC)
was already a classical logician who did not fully accept the
law of excluded middle, but he did not create a system
of multivalued logic to explain this isolated remark (in
the work De Interpretatione, chapter IX). Systems of
multivalued logics considering 3, n (finitely many), and
later also infinitely many truth degrees were introduced
by Łukasiewicz [4.5], Post [4.6], and Gödel [4.7], respectively,
and in each of these systems the aggregation
of truth values was considered (conjunction, disjunction).

Though several particular aggregation functions (or
classes of aggregation functions) were discussed in
many earlier works (we only recall means discussed
around 1930 by Kolmogorov [4.8] and Nagumo [4.9],
or later by Aczél [4.10], and triangular norms and copulas
studied by Schweizer and Sklar in the 1960s
and summarized in [4.11]), an independent
theory of aggregation can be dated only about
20 years back, and the roots of its axiomatization can
be found in [4.12–14]. Probably the first monograph
devoted purely to aggregation is the monograph by
Calvo et al. [4.15]. As basic literature for any scientist
interested in aggregation we recommend the monographs
[4.16–18].

In this chapter, we summarize not only some earlier
but also some recent results concerning aggregation,
including classification, construction methods, and several
examples. We will deal with inputs and outputs
from the unit interval $[0, 1]$. Note that though, in general,
we can consider an arbitrary interval $I \subseteq [-\infty, \infty]$,
there is no loss of generality (up to isomorphism)
when restricting our considerations to $I = [0, 1]$. As
an example, consider the aggregation of nonnegative
inputs, i.e., fix $I = [0, \infty[$. Then any aggregation function
$A$ on $[0, \infty[$ can be seen as an isomorphic transform
of some aggregation function $B$ on $[0, 1]$, restricted
to $[0, 1[$ and satisfying two constraints:


i) $B(\mathbf{x}) = 1$ if and only if $\mathbf{x} = (1, \ldots, 1)$,
ii) $\sup \{B(\mathbf{x}) \mid \mathbf{x} \in [0, 1[^n\} = 1$ for each $n \in \mathbb{N}$.


Note that any increasing bijection $g : [0, 1[ \to [0, \infty[$
can be applied as the considered isomorphism. For
more details about aggregation on a general interval
$I \subseteq [-\infty, \infty]$ refer to [4.17].

We can consider either aggregation functions with
a fixed number $n \in \mathbb{N}$, $n \geq 2$, of inputs or extended
aggregation functions defined for any number $n \in \mathbb{N}$ of
inputs. The number $n$ is called the arity of the aggregation
function.


Definition 4.1

For a fixed $n \in \mathbb{N}$, $n \geq 2$, a function $A : [0, 1]^n \to [0, 1]$
is called an (n-ary) aggregation function whenever it is
increasing in each variable and satisfies the boundary
conditions

$$A(0, \ldots, 0) = 0 \quad \text{and} \quad A(1, \ldots, 1) = 1.$$

A mapping $A : \bigcup_{n \in \mathbb{N}} [0, 1]^n \to [0, 1]$ is called an extended
aggregation function whenever $A(x) = x$ for
each $x \in [0, 1]$, and for each $n \in \mathbb{N}$, $n \geq 2$, $A|_{[0, 1]^n}$ is
an n-ary aggregation function.


The framework of extended aggregation functions 
is rather general, not relating different arities, and thus 
some additional constraints are often considered, such 
as associativity, decomposability, neutral element, etc. 

The Heronian mean He given in (4.1) is an example 
of a binary aggregation function. Prototypical examples 
of extended aggregation functions on [0, 1] are: 


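• The smallest extended aggregation function $A_s$
given by

$$A_s(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{x} = (1, \ldots, 1), \\ 0 & \text{else.} \end{cases}$$

• The greatest extended aggregation function $A_g$
given by

$$A_g(\mathbf{x}) = \begin{cases} 0 & \text{if } \mathbf{x} = (0, \ldots, 0), \\ 1 & \text{else.} \end{cases}$$

• The arithmetic mean $M$ given by

$$M(x_1, \ldots, x_n) = \frac{1}{n} \sum_{i=1}^{n} x_i.$$

• The geometric mean $G$ given by

$$G(x_1, \ldots, x_n) = \left(\prod_{i=1}^{n} x_i\right)^{1/n}.$$

• The product $\Pi$ given by

$$\Pi(x_1, \ldots, x_n) = \prod_{i=1}^{n} x_i.$$

• The minimum Min given by

$$\mathrm{Min}(x_1, \ldots, x_n) = \min\{x_1, \ldots, x_n\}.$$

• The maximum Max given by

$$\mathrm{Max}(x_1, \ldots, x_n) = \max\{x_1, \ldots, x_n\}.$$

• The truncated sum $S_L$ (also known as the
Łukasiewicz t-conorm) given by

$$S_L(x_1, \ldots, x_n) = \min\left\{1, \sum_{i=1}^{n} x_i\right\}.$$

• The 3-$\Pi$-operator $E$ introduced in [4.19] and given
by

$$E(x_1, \ldots, x_n) = \frac{\prod_{i=1}^{n} x_i}{\prod_{i=1}^{n} x_i + \prod_{i=1}^{n} (1 - x_i)},$$

with some convention covering the case $\frac{0}{0}$.

• The Pascal weighted arithmetic mean $W_P$ given by

$$W_P(x_1, \ldots, x_n) = \frac{1}{2^{n-1}} \sum_{i=1}^{n} \binom{n-1}{i-1} x_i.$$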
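
Several of these prototypical examples are easy to state as functions of an input list of arbitrary length. The sketch below is an illustration only: it assumes the value 0.5 as the convention for the 0/0 case of the 3-Π-operator, and it assumes that the Pascal weights are the normalized binomial coefficients C(n−1, i−1)/2^(n−1).

```python
from math import comb, prod

def arithmetic_mean(x):
    return sum(x) / len(x)

def geometric_mean(x):
    return prod(x) ** (1 / len(x))

def product(x):
    return prod(x)

def truncated_sum(x):
    """Lukasiewicz t-conorm: the sum truncated at 1."""
    return min(1.0, sum(x))

def three_pi(x, default=0.5):
    """3-Pi-operator; `default` is an assumed convention for the 0/0 case."""
    num = prod(x)
    den = num + prod(1 - xi for xi in x)
    return default if den == 0 else num / den

def pascal_mean(x):
    """Weighted mean with assumed weights C(n-1, i-1) / 2**(n-1)."""
    n = len(x)
    return sum(comb(n - 1, i) * xi for i, xi in enumerate(x)) / 2 ** (n - 1)

EXAMPLES = [arithmetic_mean, geometric_mean, product, min, max,
            truncated_sum, three_pi, pascal_mean]
```

Each function in `EXAMPLES` satisfies the boundary conditions of Definition 4.1, which is a quick sanity check before using any of them.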


As distinguished examples of n-ary aggregation
functions for a fixed arity $n \geq 2$, recall the projections
$P_i$ and order statistics $OS_i$, $i = 1, \ldots, n$, given by

$$P_i(x_1, \ldots, x_n) = x_i$$

and

$$OS_i(x_1, \ldots, x_n) = x_{\sigma(i)},$$

where $\sigma$ is an arbitrary permutation of $(1, \ldots, n)$ such
that $x_{\sigma(1)} \leq x_{\sigma(2)} \leq \cdots \leq x_{\sigma(n)}$. Observe that the first
projection $P_F = P_1$ and the last projection $P_L = P_n$ can
be seen as instances of extended aggregation functions
$P_F$ and $P_L$, respectively. On the other hand, for any fixed
$n \geq 2$, $OS_1$ is just $\mathrm{Min}|_{[0, 1]^n}$ and $OS_n = \mathrm{Max}|_{[0, 1]^n}$.

As a peculiar example of an extended aggregation
function we can introduce the mapping
$V : \bigcup_{n \in \mathbb{N}} [0, 1]^n \to [0, 1]$ given by

$$V(x_1, \ldots, x_n) = \min\left\{\left(\sum_{i=1}^{n} x_i^n\right)^{1/n}, 1\right\}. \tag{4.2}$$


4.2 Classification of Aggregation Functions 


Let us denote by $\mathcal{A}$ the class of all extended aggregation
functions, and by $\mathcal{A}_n$ (for a fixed $n \geq 2$) the class of
all n-ary aggregation functions. Several classifications
of n-ary aggregation functions can be straightforwardly
extended to the class $\mathcal{A}$. The basic classification proposed
by Dubois and Prade [4.20] distinguishes (both
for n-ary and extended aggregation functions):


• Conjunctive aggregation functions,
$\mathcal{C} = \{A \in \mathcal{A} \mid A \leq \mathrm{Min}\}$,

• Disjunctive aggregation functions,
$\mathcal{D} = \{A \in \mathcal{A} \mid A \geq \mathrm{Max}\}$,

• Averaging aggregation functions,
$\mathcal{A}v = \{A \in \mathcal{A} \mid \mathrm{Min} \leq A \leq \mathrm{Max}\}$,

• Mixed aggregation functions,
$\mathcal{M} = \mathcal{A} \setminus (\mathcal{C} \cup \mathcal{D} \cup \mathcal{A}v)$.


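Considering purely averaging aggregation functions
$\mathcal{A}v' = \mathcal{A}v \setminus \{\mathrm{Min}, \mathrm{Max}\}$, we can see that the set
$\{\mathcal{C}, \mathcal{D}, \mathcal{A}v', \mathcal{M}\}$ forms a partition of $\mathcal{A}$. Note that the
classes $\mathcal{A}$, $\mathcal{C}$, $\mathcal{D}$, $\mathcal{A}v$, $\mathcal{A}v'$ are convex, which is not the
case of the class $\mathcal{M}$. For the previously introduced examples
it holds:

$M, G, W_P, P_F, P_L \in \mathcal{A}v'$,
$\Pi \in \mathcal{C}$,
$S_L, V \in \mathcal{D}$,
$E \in \mathcal{M}$.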


Observe that the n-ary aggregation functions $P_i$ and
$OS_i$, $i = 1, \ldots, n$, are averaging, and so are their convex
sums, i.e., weighted arithmetic means

$$W = \sum_{i=1}^{n} w_i P_i,$$

and ordered weighted averages (OWA operators) [4.21],

$$\mathrm{OWA} = \sum_{i=1}^{n} w_i\, OS_i,$$

with $w_i \geq 0$ and $\sum_{i=1}^{n} w_i = 1$. The binary Heronian
mean He given in (4.1) is a convex combination of
the arithmetic mean $M$ and the geometric mean $G$,
$He = \frac{2}{3} M + \frac{1}{3} G$, and thus it is also averaging.
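
A small sketch of weighted arithmetic means and OWA operators, together with a numerical check that the Heronian mean is the stated convex combination of the arithmetic and geometric means. The function names are illustrative and not taken from the text.

```python
from math import sqrt

def weighted_mean(w, x):
    """Convex sum of projections: sum_i w_i * x_i."""
    return sum(wi * xi for wi, xi in zip(w, x))

def owa(w, x):
    """Convex sum of order statistics: weights act on the increasingly sorted inputs."""
    return sum(wi * xi for wi, xi in zip(w, sorted(x)))

def heronian(x, y):
    """Binary Heronian mean (4.1)."""
    return (x + sqrt(x * y) + y) / 3

def heronian_as_mixture(x, y):
    """(2/3) * arithmetic mean + (1/3) * geometric mean."""
    return (2 / 3) * ((x + y) / 2) + (1 / 3) * sqrt(x * y)
```

With the weight vector (1, 0, ..., 0) the OWA operator reduces to Min and the weighted mean to the first projection, which illustrates how the choice of weights moves an averaging function between its extremes.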

Consider two binary aggregation functions $A_1, A_2 :
[0, 1]^2 \to [0, 1]$ given by

$$A_1(x, y) = \mathrm{Med}(0, 1, x + y - 0.5) \tag{4.3}$$

and

$$A_2(x, y) = \mathrm{Med}(x + y, 0.5, x + y - 1), \tag{4.4}$$

where Med is the standard median operator. Then
$A_1, A_2 \in \mathcal{M}$ but $\frac{1}{2} A_1 + \frac{1}{2} A_2 = M \in \mathcal{A}v$. The 3D plots
of the aggregation functions $A_1$, $A_2$ and $M$ are depicted in
Figs. 4.1–4.3.
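
The claim that the class of mixed aggregation functions is not convex can be verified numerically for this pair. The sketch below uses exact rational arithmetic on a grid (the grid resolution is an arbitrary choice) and checks that the two functions from (4.3) and (4.4) average to the arithmetic mean.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def med(a, b, c):
    """Standard median of three numbers."""
    return sorted((a, b, c))[1]

def A1(x, y):
    return med(0, 1, x + y - HALF)

def A2(x, y):
    return med(x + y, HALF, x + y - 1)

grid = [Fraction(i, 10) for i in range(11)]

# The average of A1 and A2 coincides with the arithmetic mean on the grid.
average_is_mean = all(
    (A1(x, y) + A2(x, y)) / 2 == (x + y) / 2 for x in grid for y in grid
)
```

Both functions are mixed: each of them falls below Min at some points and rises above Max at others.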

More refined classifications of n-ary aggregation
functions are related to order statistics $OS_i$,
$i = 1, \ldots, n$. The conjunctive classification [4.22]
deals with the partition of the class $\mathcal{A}_n$ given by
$\{\mathcal{C}_1, \ldots, \mathcal{C}_n, \mathcal{R}_C\}$, where the class of i-conjunctive
aggregation functions, $i = 1, \ldots, n$, is defined by

$$\mathcal{C}_i = \left\{A \in \mathcal{A}_n \mid \min\{\mathrm{card}\{j \mid x_j \geq A(\mathbf{x})\} \mid \mathbf{x} \in [0, 1]^n\} = i\right\}
= \{A \in \mathcal{A}_n \mid A \leq OS_{n-i+1} \text{ but not } A \leq OS_{n-i}\},$$

where formally $OS_0 = 0$.

In other words, $A$ is i-conjunctive if and only if the
aggregated value $A(\mathbf{x})$ is dominated by at least $i$ input
values independently of $\mathbf{x} \in [0, 1]^n$, but not by $(i + 1)$
values, in general.
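
The first form of this definition suggests a simple numerical probe: count, over a grid of inputs, how many input values dominate the aggregated value, and take the minimum. The sketch below is only an estimate, since a finite grid can overestimate the true minimum; the function name and grid resolution are illustrative choices.

```python
from fractions import Fraction
from itertools import product as cartesian
from math import prod

def conjunctive_index(A, n, steps=4):
    """Grid estimate of min over x of card{j | x_j >= A(x)}.
    A value i >= 1 suggests that A is i-conjunctive; 0 suggests that A
    lies outside every class C_i."""
    grid = [Fraction(k, steps) for k in range(steps + 1)]
    return min(
        sum(1 for xj in x if xj >= A(x))
        for x in cartesian(grid, repeat=n)
    )

def arithmetic_mean(x):
    return sum(x) / len(x)

def truncated_sum(x):
    return min(1, sum(x))
```

For n = 3 the probe returns 3 for Min and for the product (both conjunctive), 1 for Max and the arithmetic mean, and 0 for the truncated sum, whose value can exceed every input.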


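Clearly, the classes $\mathcal{C}_1, \ldots, \mathcal{C}_n$ are pairwise disjoint
and the remaining aggregation functions are members
of the class $\mathcal{R}_C = \mathcal{A}_n \setminus \bigcup_{i=1}^{n} \mathcal{C}_i$. If we come back to
the above-mentioned basic classification of aggregation
functions (applied to $\mathcal{A}_n$), we obtain $\mathcal{C} = \mathcal{C}_n$ and
$\mathcal{W}_C = \bigcup_{i=1}^{n-1} \mathcal{C}_i \supset \mathcal{A}v \setminus \{\mathrm{Min}\}$. The class $\mathcal{W}_C$ is called
weakly conjunctive [4.22].

Similarly, we have a disjunctive type of classification
of $\mathcal{A}_n$ related to the partition $\{\mathcal{D}_1, \ldots, \mathcal{D}_n, \mathcal{R}_D\}$,
with

$$\mathcal{D}_i = \{A \in \mathcal{A}_n \mid A \geq OS_i \text{ but not } A \geq OS_{i+1}\}, \quad i = 1, \ldots, n.$$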


Fig. 4.3 3D plot of the aggregation function $\frac{1}{2} A_1 + \frac{1}{2} A_2 = M$
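Then $\mathcal{D}_n = \mathcal{D}$ and for the class of weakly disjunctive
aggregation functions $\mathcal{W}_D = \bigcup_{i=1}^{n-1} \mathcal{D}_i$ we have
$\mathcal{W}_D \supset \mathcal{A}v \setminus \{\mathrm{Max}\}$. Hence $\mathcal{W}_C \cap \mathcal{W}_D = \mathcal{A}v'$, and
$A \in \bigcup_{i=1}^{n} \mathcal{C}_i$ if and only if $A \leq \mathrm{Max}$, while $A \in \bigcup_{i=1}^{n} \mathcal{D}_i$ if
and only if $A \geq \mathrm{Min}$.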

Note that the conjunctive and disjunctive classifica- 
tions can be applied to aggregation functions defined 
on posets, too [4.22], and that this approach to the 
classification of aggregation functions on [0, 1] was al- 
ready proposed by Marichal in [4.23] as i-tolerant and 
i-intolerant aggregation functions (Marichal’s approach 
based on order statistics is applicable when considering 
chains only). 

Observe that this approach to classification has no 
direct extension to extended aggregation functions. On 
the other hand, we have the next classification valid for 
extended aggregation functions only. We distinguish: 


• Dimension decreasing aggregation functions forming
the class $\mathcal{A}_{\searrow}$, satisfying $A(x_1, \ldots, x_n, x_{n+1}) \leq
A(x_1, \ldots, x_n)$ for any $n \in \mathbb{N}$, $x_1, \ldots, x_{n+1} \in [0, 1]$,
but violating the equality, in general.

• Dimension increasing aggregation functions forming
the class $\mathcal{A}_{\nearrow}$, satisfying $A(x_1, \ldots, x_n, x_{n+1}) \geq
A(x_1, \ldots, x_n)$ for any $n \in \mathbb{N}$, $x_1, \ldots, x_{n+1} \in [0, 1]$,
but violating the equality, in general.

• Dimension averaging aggregation functions forming
the class $\overset{\leftrightarrow}{\mathcal{A}}$, satisfying $A(x_1, \ldots, x_n, 0) \leq
A(x_1, \ldots, x_n) \leq A(x_1, \ldots, x_n, 1)$ for any $n \in \mathbb{N}$,
$x_1, \ldots, x_n \in [0, 1]$, and attaining strict inequalities
for at least one $\mathbf{x} \in [0, 1]^n$.


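Evidently, the classes $\mathcal{A}_{\searrow}$, $\mathcal{A}_{\nearrow}$, and $\overset{\leftrightarrow}{\mathcal{A}}$ are disjoint
and they, together with their remainder $\mathcal{A} \setminus (\mathcal{A}_{\searrow} \cup
\mathcal{A}_{\nearrow} \cup \overset{\leftrightarrow}{\mathcal{A}})$, form a partition of $\mathcal{A}$. Let us note that each
associative conjunctive aggregation function is dimension
decreasing, and thus, $\Pi, \mathrm{Min} \in \mathcal{A}_{\searrow}$. Similarly,
each associative disjunctive aggregation function is dimension
increasing, so, $S_L, \mathrm{Max} \in \mathcal{A}_{\nearrow}$.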

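Recently, Yager has introduced extended aggregation
functions with the self-identity property [4.24],
characterized by the equality

$$A(x_1, \ldots, x_n, A(x_1, \ldots, x_n)) = A(x_1, \ldots, x_n)$$

for any $n \in \mathbb{N}$ and $x_1, \ldots, x_n \in [0, 1]$ (e.g., the arithmetic
mean $M$ or the geometric mean $G$ satisfy this
property). Evidently, each such aggregation function
satisfies

$$A(x_1, \ldots, x_n, 0) \leq A(x_1, \ldots, x_n, x_{n+1}) \leq A(x_1, \ldots, x_n, 1)$$

for all $n \in \mathbb{N}$, $x_1, \ldots, x_{n+1} \in [0, 1]$ (taking
$x_{n+1} = A(x_1, \ldots, x_n)$ gives the inequalities defining
dimension averaging), and thus, if the strict
inequalities are attained for some $n \in \mathbb{N}$ and
$x_1, \ldots, x_{n+1} \in [0, 1]$,
$A$ belongs to $\overset{\leftrightarrow}{\mathcal{A}}$. So, for example, $M, G \in \overset{\leftrightarrow}{\mathcal{A}}$. The extended
aggregation function $V$ (4.2) also belongs to $\overset{\leftrightarrow}{\mathcal{A}}$.
On the other hand, the first projection $P_F$ does not belong
to $\mathcal{A}_{\searrow} \cup \mathcal{A}_{\nearrow} \cup \overset{\leftrightarrow}{\mathcal{A}}$,
and the last projection $P_L$ belongs to $\overset{\leftrightarrow}{\mathcal{A}}$. Recall that if
$A \in \mathcal{A}_{\searrow}$, it is also said to have the downward attitude
property [4.24]. Similarly, the upward attitude property
introduced in [4.24] corresponds to the class $\mathcal{A}_{\nearrow}$.
Dimension increasing aggregation functions were also
considered in [4.25].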
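
The self-identity property lends itself to a direct grid check. The sketch below tests the defining equality with exact rational arithmetic for small arities; passing the test on a grid is only evidence, not a proof, and the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product as cartesian
from math import prod

def has_self_identity(A, max_arity=3, steps=4):
    """Check A(x1, ..., xn, A(x1, ..., xn)) == A(x1, ..., xn) on a grid."""
    grid = [Fraction(k, steps) for k in range(steps + 1)]
    for n in range(1, max_arity + 1):
        for x in cartesian(grid, repeat=n):
            if A(x + (A(x),)) != A(x):
                return False
    return True

def arithmetic_mean(x):
    return sum(x) / len(x)

def first_projection(x):
    return x[0]

def last_projection(x):
    return x[-1]
```

The arithmetic mean, Min, Max and both projections pass the check, whereas the product fails already for a single input, since appending the aggregated value 1/2 to the input (1/2) yields 1/4.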

Let us return to the basic classification of aggregation
functions and recall several distinguished types
of aggregation functions belonging to the classes $\mathcal{C}$, $\mathcal{D}$,
$\mathcal{A}v'$, and $\mathcal{M}$:


• Conjunctive aggregation functions: Triangular
norms [4.26, 27], copulas [4.27, 28], quasi-copulas
[4.29, 30], and semicopulas [4.31].

• Disjunctive aggregation functions: Triangular
conorms [4.26, 27], dual copulas [4.28].

• Averaging aggregation functions: (Weighted)
quasi-arithmetic means [4.10], idempotent uninorms
[4.32], integrals based on capacities,
including the Choquet and Sugeno integrals [4.18,
33–36], also covering OWA [4.21], ordered
weighted maximum (OWMax) [4.37] and ordered
modular average (OMA) [4.38] operators, as well
as lattice polynomials [4.39].

• Mixed aggregation functions: nonidempotent uninorms
[4.40], gamma-operators [4.41], special convex
sums in fuzzy linear programming [4.42].


For more details concerning these aggregation func- 
tions see [4.17] or references given above. 
4.3 Properties and Construction Methods


Properties of aggregation functions are mostly related
to the field of their application, such as multicriteria
decision aid, multivalued logics, or probability theory,
for example. Besides the standard analytical properties
of functions, such as continuity or the Lipschitz property,
and (perhaps adapted) algebraic properties, such
as symmetry, associativity, bisymmetry, neutral element,
annihilator, cancellativity, or idempotency [4.17,
Chapter 2], the above-mentioned applied fields have
brought into aggregation theory properties such as decomposability,
conjunctivity, or n-increasingness. Each of the
mentioned properties (except decomposability) can be
introduced for n-ary aggregation functions, and thus
also for extended aggregation functions. However, in
the case of extended aggregation functions, some properties
can be introduced in a stronger form, involving
different arities in a single formula.

For example, the (weak) idempotency of $A \in \mathcal{A}$
means the idempotency of each $A|_{[0, 1]^n}$, which means
that for each $n \in \mathbb{N}$ and $x \in [0, 1]$,

$$A(\underbrace{x, \ldots, x}_{n\text{-times}}) = x.$$

Note that an extended aggregation function $A$ is idempotent
if and only if it is averaging, i.e., $A \in \mathcal{A}v$. The
strong idempotency [4.15] of an extended aggregation
function $A \in \mathcal{A}$ means that

$$A(\underbrace{\mathbf{x}, \ldots, \mathbf{x}}_{k\text{-times}}) = A(\mathbf{x})$$

for each $k \in \mathbb{N}$ and $\mathbf{x} \in \bigcup_{n \in \mathbb{N}} [0, 1]^n$. For example, the
extended aggregation function $W_P$ is idempotent but not
strongly idempotent.

Similarly, $e \in [0, 1]$ is a (weak) neutral element of
an extended aggregation function $A \in \mathcal{A}$ if and only if
for each $n \geq 2$ and $\mathbf{x} \in [0, 1]^n$ such that $x_j = e$ for $j \neq i$ it
holds $A(\mathbf{x}) = x_i$. On the other hand, $e$ is a strong neutral
element of an extended aggregation function $A \in \mathcal{A}$ if
and only if for any $n \geq 2$, $\mathbf{x} \in [0, 1]^n$ with $x_i = e$, it holds

$$A(x_1, \ldots, x_{i-1}, e, x_{i+1}, \ldots, x_n) = A(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n).$$

Obviously, if $e$ is a strong neutral element of $A \in \mathcal{A}$ then
it is also a (weak) neutral element of $A$. As an example,
consider the extended copula $D \in \mathcal{A}$ given by

$$D(x_1, \ldots, x_n) = x_1 \cdot \min\{x_2, \ldots, x_n\}.$$

Obviously, $e = 1$ is a weak neutral element of $D$. However,
$D\left(1, \tfrac{1}{2}, \tfrac{1}{2}\right) = \tfrac{1}{2} \neq \tfrac{1}{4} = D\left(\tfrac{1}{2}, \tfrac{1}{2}\right)$, i.e., $e = 1$ is not
a strong neutral element of $D$. For a deeper discussion
and exemplification of properties of aggregation functions
we recommend [4.17].
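
The difference between weak and strong neutrality for the extended copula D can be reproduced in a few lines. The sketch uses exact fractions; treating a single input as D(x) = x follows the definition of an extended aggregation function.

```python
from fractions import Fraction

def D(x):
    """Extended copula from the text: x1 * min{x2, ..., xn}, with D(x) = x for n = 1."""
    if len(x) == 1:
        return x[0]
    return x[0] * min(x[1:])

half = Fraction(1, 2)

# Weak neutrality of e = 1: all inputs but one equal 1, the remaining input is returned.
weak_ok = D((half, 1, 1)) == half and D((1, half, 1)) == half and D((1, 1, half)) == half

# Strong neutrality fails: removing the input 1 changes the result.
with_one = D((1, half, half))
without_one = D((half, half))
```

Here `with_one` equals 1/2 while `without_one` equals 1/4, matching the counterexample above.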

Aggregation functions in many fields are constrained
by the required properties, i.e., the axioms of each
considered field. As a typical example recall multivalued
logics (fuzzy logics) with truth values domain
$[0, 1]$, where conjunction is modeled by means of triangular
norms [4.26, 43, 44]. Recall that a binary aggregation
function $T : [0, 1]^2 \to [0, 1]$ is called a triangular
norm (t-norm for short) whenever it is symmetric,
associative and $e = 1$ is its neutral element. Due to associativity,
there is a genuine extension of a t-norm $T$
into an extended aggregation function (we will also use
the same notation $T$ in this case). Then $e = 1$ is a strong
neutral element for the extended $T$. However, without
some additional properties we still cannot determine
a t-norm convenient for our purposes. Requiring, for
example, the idempotency of $T$, we obtain that the
only solution is $T = \mathrm{Min}$, the strongest triangular norm.
Considering continuous triangular norms satisfying the
diagonal inequalities $0 < T(x, x) < x$ for all $x \in\, ]0, 1[$,
we can show that $T$ is isomorphic to the product $\Pi$, i.e.,
there is an automorphism $\varphi : [0, 1] \to [0, 1]$ such that
$T(x, y) = \varphi^{-1}(\Pi(\varphi(x), \varphi(y)))$, and in the extended
form, $T(x_1, \ldots, x_n) = \varphi^{-1}(\Pi(\varphi(x_1), \ldots, \varphi(x_n)))$. For
more details and several other results we recommend
[4.26].
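
The isomorphism with the product can be made concrete by picking one automorphism and building the t-norm from it. The choice φ(x) = 2^x − 1 below is an arbitrary illustration (it yields the Frank t-norm with parameter 2); any increasing bijection of [0, 1] would serve equally well.

```python
from math import log2, isclose

def phi(x):
    """Illustrative automorphism of [0, 1]: phi(0) = 0, phi(1) = 1, increasing."""
    return 2.0 ** x - 1.0

def phi_inv(u):
    """Inverse of phi on [0, 1]."""
    return log2(1.0 + u)

def T(*xs):
    """Extended t-norm phi^{-1}(product of phi(x_i))."""
    p = 1.0
    for x in xs:
        p *= phi(x)
    return phi_inv(p)
```

The resulting function is symmetric and associative, has 1 as neutral element, and satisfies the diagonal inequality 0 < T(x, x) < x inside the unit interval.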

As another example consider probability theory,
namely the relationship between the joint distribution
function $F_Z$ of a random vector $Z = (X_1, \ldots, X_n)$
and the corresponding marginal one-dimensional distribution
functions $F_{X_1}, \ldots, F_{X_n}$. By the Sklar theorem
[4.45], for all $(x_1, \ldots, x_n) \in \mathbb{R}^n$ we have

$$F_Z(x_1, \ldots, x_n) = C\left(F_{X_1}(x_1), \ldots, F_{X_n}(x_n)\right)$$

for some n-ary aggregation function $C$. Obviously,
constrained by the basic properties of probabilities,
$C$ should possess a neutral element $e = 1$
and annihilator (zero element) $a = 0$, and the
function $C$ should be n-increasing (i.e., the probability
$P(Z \in [u_1, v_1] \times \cdots \times [u_n, v_n]) \geq 0$ for any n-dimensional
box $[u_1, v_1] \times \cdots \times [u_n, v_n]$), which yields an
axiomatic definition of copulas. More details for interested
readers can be found in [4.28]. Considering some
additional constraints, we obtain special subclasses of
copulas. For example, if we fix $n = 2$ and consider
the stability of copulas with respect to positive powers,
i.e., the property $C(x^\lambda, y^\lambda) = (C(x, y))^\lambda$ for each
$\lambda \in\, ]0, \infty[$ and each $(x, y) \in [0, 1]^2$, then we obtain extreme
value copulas (EV copulas) [4.46, 47]. Recall that
a copula $C : [0, 1]^2 \to [0, 1]$ is an EV copula if and only
if there is a convex function $d : [0, 1] \to [0, 1]$ such that
for each $t \in [0, 1]$, $\max\{t, 1 - t\} \leq d(t) \leq 1$ and for all
$(x, y) \in\, ]0, 1[^2$,

$$C(x, y) = (xy)^{d\left(\frac{\log x}{\log (xy)}\right)}$$

(observe that on $[0, 1]^2 \setminus\, ]0, 1[^2$ for each copula it holds
$C(x, y) = \min\{x, y\}$).

Our third example comes from economics. In multicriteria decision problems, we often meet the requirement of the comonotone additivity of the considered $n$-ary (extended) aggregation function $A$, i.e., we expect that $A(\mathbf{x}+\mathbf{y}) = A(\mathbf{x}) + A(\mathbf{y})$ for all $\mathbf{x},\mathbf{y} \in [0,1]^n$ such that $\mathbf{x}+\mathbf{y} \in [0,1]^n$ and $(x_i - x_j)(y_i - y_j) \ge 0$ for any $i,j \in \{1,\dots,n\}$. The comonotonicity of $\mathbf{x}$ and $\mathbf{y}$ means that the ordering on $\{1,\dots,n\}$ induced by $\mathbf{x}$ does not contradict the one induced by $\mathbf{y}$. Due to Schmeidler [4.48], we know that then $A$ is necessarily the Choquet integral based on the fuzzy measure $m : 2^{\{1,\dots,n\}} \to [0,1]$, $m(E) = A(1_E)$, given by (4.6).

The axiomatic approach to aggregation character- 
izes some special classes of aggregation functions. 
Another important look at aggregation involves con- 
struction methods. We can roughly divide them into the 
next four groups: 


• Optimization methods,
• Extension methods,
• Constructions based on the given aggregation functions,
• Introduction of weights.


An exhaustive overview of construction methods for 
aggregation functions can be found in [4.17, Chapter 6]. 
Here we briefly recall the most distinguished ones. 

A typical optimization method is the penalty-based 
approach proposed in [4.49] and generalized in [4.3], 
where dissimilarity functions were introduced, see 
also [4.50]. 


Definition 4.2
A function $D : [0,1]^2 \to [0,\infty[$ given by

$$D(x,y) = K(f(x) - f(y)),$$

where $f : [0,1] \to \mathbb{R}$ is a continuous strictly monotone function and $K : \mathbb{R} \to [0,\infty[$ is a convex function attaining the unique minimum $K(0) = 0$, is called a dissimilarity function.


Theorem 4.1
Let $D : [0,1]^2 \to [0,\infty[$ be a dissimilarity function. Then for any $n \in \mathbb{N}$, $x_1,\dots,x_n \in [0,1]$, the function $h : [0,1] \to \mathbb{R}$ given by $h(t) = \sum_{i=1}^n D(x_i, t)$ attains its minimal value exactly on a closed interval $[a,b]$ and the formula

$$A(x_1,\dots,x_n) = \frac{a+b}{2}$$

defines a strongly idempotent symmetric extended aggregation function $A$ on $[0,1]$.


The construction given in Theorem 4.1 covers:

• the arithmetic mean ($D(x,y) = (x-y)^2$),
• quasi-arithmetic means ($D(x,y) = (f(x)-f(y))^2$),
• the median ($D(x,y) = |x-y|$),


among others. This method is a generalization of the Apollonius of Perga method. Note that in general, a function $D$ need not be symmetric, i.e., $K$ need not be an even function (compare with the symmetry of metrics). As a typical example, let us recall the dissimilarity function $D_c : [0,1]^2 \to [0,\infty[$, $c \in ]0,\infty[$, given by

$$D_c(x,y) = \begin{cases} x-y & \text{if } x \ge y, \\ c(y-x) & \text{if } x < y, \end{cases}$$

yielding by means of Theorem 4.1 the $\alpha$-quantile of a sample $(x_1,\dots,x_n)$ with $\alpha = 1/(1+c)$.

As a possible generalization of Theorem 4.1, one can consider different dissimilarity functions $D_i$ (which violates the symmetry of the constructed aggregation function $A$). Consider, for example, $D_1(x,y) = |x-y|$ and $D_2(x,y) = \dots = D_n(x,y) = \dots = (x-y)^2$. Then the minimization of the sum $\sum_{i=1}^n D_i(x_i,t)$ results in the extended aggregation function $A : \bigcup_{n \in \mathbb{N}} [0,1]^n \to [0,1]$ given by

$$A(x_1,\dots,x_n) = \mathrm{Med}\left(x_1,\ M(x_2,\dots,x_n) - 0.5,\ M(x_2,\dots,x_n) + 0.5\right) \tag{4.5}$$

whenever $n > 1$.
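
The penalty-based construction of Theorem 4.1 can be illustrated numerically. The following Python sketch is not part of the source: the names `penalty_aggregate`, `squared`, `absolute` and `quantile_penalty`, as well as the grid resolution and tolerance, are assumptions chosen for illustration. It evaluates $h(t) = \sum_i D(x_i,t)$ on a uniform grid over $[0,1]$ and returns the midpoint of the set of (approximate) minimizers.

```python
def penalty_aggregate(xs, D, steps=10000, tol=1e-12):
    """Approximate A(x_1,...,x_n) = (a + b) / 2, where [a, b] is the set of
    minimizers of h(t) = sum_i D(x_i, t) on [0, 1] (grid search, assumed)."""
    grid = [k / steps for k in range(steps + 1)]
    h = [sum(D(x, t) for x in xs) for t in grid]
    h_min = min(h)
    # Grid points whose penalty equals the minimum up to a small tolerance.
    minimizers = [t for t, v in zip(grid, h) if v <= h_min + tol]
    return (minimizers[0] + minimizers[-1]) / 2


def squared(x, y):
    """D(x, y) = (x - y)^2, which leads to the arithmetic mean."""
    return (x - y) ** 2


def absolute(x, y):
    """D(x, y) = |x - y|, which leads to the median."""
    return abs(x - y)


def quantile_penalty(c):
    """Asymmetric D_c: x - y if x >= y, and c * (y - x) otherwise."""
    def D(x, y):
        return x - y if x >= y else c * (y - x)
    return D


if __name__ == "__main__":
    print(penalty_aggregate([0.2, 0.4, 0.9], squared))
    print(penalty_aggregate([0.2, 0.4, 0.9], absolute))
    print(penalty_aggregate([0.1, 0.4, 0.7, 0.8], quantile_penalty(3.0)))
```

A grid search is used only because it works for any dissimilarity function; for the particular choices above, closed forms (mean, median, quantile) are of course preferable.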

Some other generalizations based on a generalized 
approach to dissimilarity (penalty) functions can be 
found in [4.16]. 
Extension methods are based on partial information that is available about an aggregation function. As a typical example, we recall integral-based aggregation functions. Suppose that for a fixed arity $n$ the values of an aggregation function $A$ are known at Boolean inputs only, i.e., we know $A|_{\{0,1\}^n}$ only. Identifying subsets of the space $X = \{1,\dots,n\}$ with the corresponding characteristic functions, we get the set function $m : 2^X \to [0,1]$ given by $m(E) = A(1_E)$. Obviously, $m$ is monotone, i.e., $m(E_1) \le m(E_2)$ whenever $E_1 \subseteq E_2 \subseteq X$, and $m(\emptyset) = A(0,\dots,0) = 0$, $m(X) = A(1,\dots,1) = 1$. Note that $m$ is often called a fuzzy measure [4.51, 52] or a capacity [4.17].

Among several integral-based extension methods we recall:

• The Choquet integral [4.53], $Ch_m : [0,1]^n \to [0,1]$,

$$Ch_m(\mathbf{x}) = \sum_{i=1}^n x_{\sigma(i)} \left(m(E_{\sigma,i}) - m(E_{\sigma,i+1})\right), \tag{4.6}$$

where $\sigma : X \to X$ is a permutation such that $x_{\sigma(1)} \le x_{\sigma(2)} \le \dots \le x_{\sigma(n)}$, $E_{\sigma,i} = \{\sigma(i),\dots,\sigma(n)\}$ for $i = 1,\dots,n$, and $E_{\sigma,n+1} = \emptyset$. Note that the Choquet integral can be seen as a weighted arithmetic mean with the weights dependent on the ordinal structure of the input vector $\mathbf{x}$. If the capacity $m$ is additive, i.e., $m(E) = \sum_{i \in E} m(\{i\})$, then

$$Ch_m(\mathbf{x}) = \sum_{i=1}^n w_i x_i,$$

where for the weights it holds $w_i = m(\{i\})$, $i \in X$ (hence $\sum_i w_i = 1$).

• The Sugeno integral [4.51], $Su_m : [0,1]^n \to [0,1]$,

$$Su_m(\mathbf{x}) = \max\left\{\min\{x_{\sigma(i)}, m(E_{\sigma,i})\} \mid i \in X\right\}.$$

If $m$ is maxitive, i.e., $m(E) = \max\{m(\{i\}) \mid i \in E\}$, then we recognize the weighted maximum $Su_m(\mathbf{x}) = \max\{\min\{x_i, v_i\} \mid i \in X\}$, with weights $v_i = m(\{i\})$ (hence $\max\{v_i \mid i \in X\} = 1$).

• The copula-based integral [4.34], $I_{C,m} : [0,1]^n \to [0,1]$, where $C : [0,1]^2 \to [0,1]$ is a binary copula,

$$I_{C,m}(\mathbf{x}) = \sum_{i=1}^n \left(C(x_{\sigma(i)}, m(E_{\sigma,i})) - C(x_{\sigma(i)}, m(E_{\sigma,i+1}))\right).$$

This integral covers the Choquet integral if $C$ is equal to the product copula $\Pi$, $I_{\Pi,m} = Ch_m$, as well as the Sugeno integral in the case of the greatest copula Min, $I_{\mathrm{Min},m} = Su_m$. Observe that if the capacity $m$ is symmetric, i.e., $m(E) = v_{\mathrm{card}\,E}$, where $0 = v_0 \le v_1 \le \dots \le v_n = 1$, then $I_{C,m}$ turns into the OMA operator introduced in [4.38]. Its special instances are the OWA operators [4.21] based on the Choquet integral,

$$\mathrm{OWA}(\mathbf{x}) = \sum_{i=1}^n w_i x_{\sigma(n-i+1)},$$

with $w_i = v_i - v_{i-1}$, and the OWMax operator [4.37],

$$\mathrm{OWMax}(\mathbf{x}) = \max\left\{\min\{x_{\sigma(i)}, v_{n-i+1}\} \mid i \in X\right\}.$$

For better understanding, fix $n = 2$, i.e., consider $X = \{1,2\}$. Then $m(\{1\}) = a$ and $m(\{2\}) = b$ are any constants from $[0,1]$, and $m(\emptyset) = 0$, $m(X) = 1$ due to the boundary conditions. The following equalities hold:

• $Ch_m(x,y) = \begin{cases} ax + (1-a)y & \text{if } x \ge y, \\ (1-b)x + by & \text{else,} \end{cases}$
• $Su_m(x,y) = \max\{\min\{a,x\}, \min\{b,y\}, \min\{x,y\}\}$,
• $I_{C,m}(x,y) = \begin{cases} C(x,a) + y - C(y,a) & \text{if } x \ge y, \\ C(y,b) + x - C(x,b) & \text{else.} \end{cases}$

The considered capacity $m$ is symmetric if and only if $a = b$, and then:

• $Ch_m(x,y) = \mathrm{OWA}(x,y) = (1-a) \cdot \min\{x,y\} + a \cdot \max\{x,y\}$,
• $Su_m(x,y) = \mathrm{OWMax}(x,y) = \mathrm{Med}(x,a,y)$ is the so-called $a$-median [4.54, 55],
• $I_{C,m}(x,y) = \mathrm{OMA}(x,y) = f_1(\min\{x,y\}) + f_2(\max\{x,y\})$, where $f_1, f_2 : [0,1] \to [0,1]$ are given by $f_1(t) = t - C(t,a)$ and $f_2(t) = C(t,a)$.

For more details concerning integral-based constructions of aggregation functions we recommend [4.34, 36, 56] or [4.34] by Klement, Mesiar, and Pap.
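
To make the discrete Choquet and Sugeno integrals concrete, here is a minimal Python sketch (an illustration, not taken from the source). It assumes that a capacity is stored as a dictionary mapping `frozenset`s of 0-based indices to values in $[0,1]$; the helper `capacity_two` and all function names are hypothetical. With $X = \{1,2\}$ encoded as indices `0` and `1`, the output can be compared with the $n = 2$ formulas for $Ch_m$ and $Su_m$.

```python
def choquet(x, m):
    """Discrete Choquet integral following (4.6).
    x: sequence of inputs in [0, 1]; m: dict frozenset(indices) -> [0, 1]."""
    order = sorted(range(len(x)), key=lambda i: x[i])  # sigma, increasing inputs
    total = 0.0
    for pos, i in enumerate(order):
        upper = frozenset(order[pos:])           # E_{sigma, i}
        upper_next = frozenset(order[pos + 1:])  # E_{sigma, i+1}
        total += x[i] * (m[upper] - m[upper_next])
    return total


def sugeno(x, m):
    """Discrete Sugeno integral: max_i min(x_sigma(i), m(E_{sigma, i}))."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    return max(min(x[i], m[frozenset(order[pos:])])
               for pos, i in enumerate(order))


def capacity_two(a, b):
    """Capacity on a two-element universe with m({1}) = a and m({2}) = b
    (elements 1 and 2 are encoded as indices 0 and 1)."""
    return {frozenset(): 0.0, frozenset({0}): a,
            frozenset({1}): b, frozenset({0, 1}): 1.0}


if __name__ == "__main__":
    m = capacity_two(0.3, 0.6)
    print(choquet([0.8, 0.5], m))  # compare with a*x + (1 - a)*y for x >= y
    print(sugeno([0.8, 0.5], m))   # compare with the max-min formula
```

For a symmetric capacity (`a == b`) the same functions reproduce the OWA and OWMax operators of the two-argument case.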

Another kind of extension methods exploiting capacities is based on the Möbius transform. Recall that for a capacity $m : 2^X \to [0,1]$, its Möbius transform $\mu : 2^X \to \mathbb{R}$ is given by

$$\mu(E) = \sum_{L \subseteq E} (-1)^{\mathrm{card}(E \setminus L)}\, m(L).$$
Theorem 4.2
[4.57] Let $C : [0,1]^n \to [0,1]$ be an $n$-ary copula, and $m : 2^X \to [0,1]$ a capacity. Then the function $A_{C,m} : [0,1]^n \to [0,1]$ given by

$$A_{C,m}(\mathbf{x}) = \sum_{E \subseteq X} \mu(E) \cdot C(\mathbf{x} \vee 1_{E^c})$$

is an aggregation function.

Special instances of Theorem 4.2 are the Lovász extension [4.58] corresponding to the strongest copula Min ($A_{\mathrm{Min},m} = I_{\Pi,m} = Ch_m$ is just the Choquet integral), and the Owen extension [4.59] corresponding to the product copula $\Pi$ ($A_{\Pi,m}(\mathbf{x}) = \sum_{E \subseteq X} \left(\mu(E) \prod_{i \in E} x_i\right)$).

Several extension methods were introduced for binary copulas, for example, in the case when only the information about their diagonal section $\delta_C : [0,1] \to [0,1]$, $\delta_C(x) = C(x,x)$ is available. If $\delta : [0,1] \to [0,1]$ is any increasing 2-Lipschitz function such that $\delta(0) = 0$, $\delta(1) = 1$, and $\delta(x) \le x$ for each $x \in [0,1]$, then the formula

$$D(x,y) = \min\left\{x,\ y,\ \frac{\delta(x) + \delta(y)}{2}\right\}, \quad (x,y) \in [0,1]^2,$$

defines a binary copula with $\delta_D = \delta$. Note that $D$ is the greatest symmetric copula with the given diagonal section. Among numerous papers dealing with such types of extensions we recommend the overview paper [4.60]. Similarly, one can extend horizontal or vertical sections to copulas [4.61]. An overview of extension methods for triangular norms can be found in [4.26].

The third group of construction methods involves 
methods creating new aggregation functions from the 
given ones. These methods are applied either to aggre- 
gation functions with a fixed arity n, or to extended 
aggregation functions. Some of them can be applied to 
any kind of aggregation functions. As a typical example, recall the transformation of aggregation functions by means of an automorphism $\varphi : [0,1] \to [0,1]$ (i.e., an isomorphic transformation) given by

$$A_\varphi(x_1,\dots,x_n) = \varphi^{-1}\left(A(\varphi(x_1),\dots,\varphi(x_n))\right). \tag{4.7}$$

Transformation (4.7) preserves all algebraic properties as well as the classification of aggregation functions. However, some analytical properties can be broken, for example, the Lipschitz property or $n$-increasingness. Some special classes of aggregation functions can be characterized by a unique member and its isomorphic transforms. Consider, for example, triangular norms. Then strict triangular norms are isomorphic to the product t-norm $\Pi$, and nilpotent t-norms are isomorphic to the Łukasiewicz t-norm $T_L$. Similarly, quasi-arithmetic means with no annihilator are isomorphic to the arithmetic mean $M$. The only $n$-ary aggregation functions invariant under isomorphic transformations are the lattice polynomials [4.62], i.e., the Choquet integrals with respect to $\{0,1\}$-valued capacities. So, for $n = 2$, only Min, Max, $P_F$ and $P_L$ are invariant under isomorphic transformations. There are several generalizations of (4.7). One can consider, for example, decreasing bijections $\eta : [0,1] \to [0,1]$ and define $A_\eta$ via (4.7). This type of transformation reverses the conjunctivity of an aggregation function into disjunctivity, and vice versa. It preserves the existence of a neutral element (annihilator); however, if $e$ is a neutral element of $A$ ($a$ is an annihilator of $A$) then $\eta^{-1}(e)$ is a neutral element of $A_\eta$ ($\eta^{-1}(a)$ is an annihilator of $A_\eta$). If $\eta$ is involutive, i.e., if $\eta \circ \eta = \mathrm{id}_{[0,1]}$, then $(A_\eta)_\eta = A$, so there is a duality between $A$ and $A_\eta$. The most applied duality is based on the standard (or Zadeh's) negation $\eta : [0,1] \to [0,1]$ given by $\eta(x) = 1 - x$. In that case, we use the notation $A^d = A_\eta$ and $A^d(x_1,\dots,x_n) = 1 - A(1-x_1,\dots,1-x_n)$. As a distinguished example recall the class of triangular conorms which are just the dual aggregation functions to triangular norms, i.e., $S$ is a triangular conorm [4.26] if and only if there is a triangular norm $T$ such that $S = T^d$.
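
The transformation (4.7) and the duality $A^d$ are easy to experiment with. The Python sketch below is illustrative only: the helper names `transform`, `dual` and `product` are assumptions, and the automorphism $\varphi(x) = x^2$ is just one convenient choice. It shows that the dual of the product t-norm is the probabilistic sum, that Min is invariant under the chosen automorphism, and that the arithmetic mean is transformed into a quasi-arithmetic (quadratic) mean.

```python
import math


def transform(A, phi, phi_inv):
    """Transformation (4.7): A_phi(x) = phi^{-1}(A(phi(x_1), ..., phi(x_n)))."""
    return lambda *xs: phi_inv(A(*(phi(x) for x in xs)))


def dual(A):
    """Dual aggregation function A^d(x) = 1 - A(1 - x_1, ..., 1 - x_n)."""
    return lambda *xs: 1 - A(*(1 - x for x in xs))


def product(*xs):
    """Product t-norm (extended to any arity)."""
    result = 1.0
    for x in xs:
        result *= x
    return result


def mean(*xs):
    """Arithmetic mean."""
    return sum(xs) / len(xs)


if __name__ == "__main__":
    square = lambda x: x * x
    prob_sum = dual(product)              # triangular conorm dual to the product
    print(prob_sum(0.2, 0.5))             # equals x + y - x*y
    print(transform(min, square, math.sqrt)(0.3, 0.8))   # Min is invariant
    print(transform(mean, square, math.sqrt)(0.3, 0.8))  # quadratic mean
```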

Further generalizations of (4.7) consider different automorphisms $\varphi, \varphi_1, \dots, \varphi_n : [0,1] \to [0,1]$,

$$A_{\varphi,\varphi_1,\dots,\varphi_n}(x_1,\dots,x_n) = \varphi\left(A(\varphi_1(x_1),\dots,\varphi_n(x_n))\right). \tag{4.8}$$

Moreover, it is enough to suppose that $\varphi_1,\dots,\varphi_n$ are monotone (not necessarily strictly) and satisfy $\varphi_i(0) = 0$, $\varphi_i(1) = 1$, $i = 1,\dots,n$, as in such a case it also holds that for any aggregation function $A$, $A_{\varphi,\varphi_1,\dots,\varphi_n}$ given by (4.8) is an aggregation function.

Another construction well known from functional theory is linked to the composition of functions. We have two kinds of composition methods. In the first one, considering a $k$-ary aggregation function $B : [0,1]^k \to [0,1]$, we can choose arbitrary $k$ aggregation functions $C_1,\dots,C_k$ (either all of them are extended aggregation functions, or all of them are $n$-ary aggregation functions for some fixed $n > 1$), and then we can introduce a new aggregation function $A$ (either extended, with the convention $A(x) = x$, $x \in [0,1]$; or $n$-ary) such that

$$A(\mathbf{x}) = B(C_1(\mathbf{x}),\dots,C_k(\mathbf{x})). \tag{4.9}$$

As a typical example of construction (4.9), consider $B$ to be a weighted arithmetic mean $W$, $W(x_1,\dots,x_k) = \sum_{i=1}^k w_i x_i$. Then

$$A(\mathbf{x}) = \sum_{i=1}^k w_i C_i(\mathbf{x}),$$

i.e., $A$ is a convex combination of aggregation functions $C_1,\dots,C_k$.

The second method is based on a partition of the space of coordinates $\{1,\dots,n\}$ into subspaces

$$\{1,\dots,n_1\},\ \{n_1+1,\dots,n_1+n_2\},\ \dots,\ \{n_1+\dots+n_{k-1}+1,\dots,n\}.$$

Then, considering a $k$-ary aggregation function $B : [0,1]^k \to [0,1]$ and aggregation functions $C_i : [0,1]^{n_i} \to [0,1]$, $i = 1,\dots,k$, we can define a composite aggregation function $A : [0,1]^n \to [0,1]$ by

$$A(x_1,\dots,x_n) = B\left(C_1(x_1,\dots,x_{n_1}),\ C_2(x_{n_1+1},\dots,x_{n_1+n_2}),\ \dots,\ C_k(x_{n_1+\dots+n_{k-1}+1},\dots,x_n)\right). \tag{4.10}$$

This method can be generalized by considering an arbitrary partition of $\{1,\dots,n\}$ into $\{J_1,\dots,J_k\}$. As an example, consider the $n$-ary copula $C : [0,1]^n \to [0,1]$ defined for a fixed partition $\{J_1,\dots,J_k\}$ of $\{1,\dots,n\}$ by

$$C(x_1,\dots,x_n) = \prod_{i=1}^k \min\{x_j \mid j \in J_i\}.$$

For more details, see [4.63].

The third group containing constructions based on some given aggregation functions can be seen as a group of patchwork methods. As typical examples, we can recall several types of ordinal sums. Besides the well-known Min-based ordinal sums for conjunctive aggregation functions (especially for triangular norms and copulas) [4.26, 64], W-ordinal sums for copulas (or quasi-copulas) [4.65], as well as g-ordinal sums for copulas [4.66], we recall one kind of ordinal sums introduced in [4.67] which is applicable to arbitrary aggregation functions.

Theorem 4.3
Let $f : [0,1] \to [-\infty,\infty]$ be a continuous strictly monotone function, and let $0 = a_0 < a_1 < \dots < a_k = 1$ be a given sequence of real constants. Then for any system $(A_j)_{j=1}^k$ of $n$-ary (extended) aggregation functions the function $A : [0,1]^n \to [0,1]$ ($A : \bigcup_{n \in \mathbb{N}} [0,1]^n \to [0,1]$) given by

$$A(\mathbf{x}) = f^{-1}\left(\sum_{j=1}^k f\left(a_{j-1} + (a_j - a_{j-1}) A_j(\mathbf{x}^{(j)})\right) - \sum_{j=1}^{k-1} f(a_j)\right), \tag{4.11}$$

where $\mathbf{x}^{(j)} = (x_1^{(j)},\dots,x_n^{(j)})$ and

$$x_i^{(j)} = \max\left\{0, \min\left\{1, \frac{x_i - a_{j-1}}{a_j - a_{j-1}}\right\}\right\},$$

is an $n$-ary (extended) aggregation function.


Observe that if all $A_j$'s are triangular norms (copulas, quasi-copulas, triangular conorms, continuous aggregation functions, idempotent aggregation functions, symmetric aggregation functions) then so is the newly constructed aggregation function $A$.

The fourth group contains construction methods allowing one to introduce weights into the aggregation procedure. The quantitative look at weights can be seen as the corresponding repetition of inputs, and the weights roughly correspond to the occurrence of single input arguments. For example, when considering a strongly idempotent (symmetric) aggregation function constructed by means of a dissimilarity function $D$ (see Theorem 4.1) and weights $w_1,\dots,w_n$ (at least one of them should be positive, and all of them are nonnegative), we look for minimizers of the sum $\sum_{i=1}^n w_i D(x_i,t)$. For example, if $D(x,y) = (x-y)^2$, then we obtain the weighted arithmetic mean

$$W(x_1,\dots,x_n) = \frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i}.$$
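
The quantitative view of weights can be checked numerically. The following Python sketch is an illustration under stated assumptions (the function name `weighted_penalty_aggregate`, the grid resolution and the tolerance are not from the source): it minimizes $\sum_i w_i D(x_i,t)$ on a grid and, for the squared dissimilarity, recovers the weighted arithmetic mean. With integer weights the result coincides with aggregating the correspondingly repeated inputs with unit weights.

```python
def weighted_penalty_aggregate(xs, ws, D, steps=10000, tol=1e-12):
    """Midpoint of the (approximate) set of minimizers of
    h(t) = sum_i w_i * D(x_i, t) over a uniform grid on [0, 1]."""
    grid = [k / steps for k in range(steps + 1)]
    h = [sum(w * D(x, t) for x, w in zip(xs, ws)) for t in grid]
    h_min = min(h)
    minimizers = [t for t, v in zip(grid, h) if v <= h_min + tol]
    return (minimizers[0] + minimizers[-1]) / 2


def squared(x, y):
    """D(x, y) = (x - y)^2."""
    return (x - y) ** 2


if __name__ == "__main__":
    # Weights (3, 1): expected weighted mean (3*0.2 + 1*0.8) / 4.
    print(weighted_penalty_aggregate([0.2, 0.8], [3, 1], squared))
    # Same value from repeating the first input three times with unit weights.
    print(weighted_penalty_aggregate([0.2, 0.2, 0.2, 0.8], [1, 1, 1, 1], squared))
```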
This approach can also be introduced in the case when different dissimilarity functions are applied. As an example, consider the aggregation function $A : [0,1]^n \to [0,1]$ given by (4.5). We look for minimizers of the expression $w_1|x_1 - t| + \sum_{i=2}^n w_i(x_i - t)^2$ and the resulting weighted aggregation function $A_{\mathbf{w}} : [0,1]^n \to [0,1]$ is given by

$$A_{\mathbf{w}}(x_1,\dots,x_n) = \mathrm{Med}\left(x_1,\ M(x_2,\dots,x_n) - \frac{w_1}{2\sum_{i=2}^n w_i},\ M(x_2,\dots,x_n) + \frac{w_1}{2\sum_{i=2}^n w_i}\right).$$

Considering the integer weights $\mathbf{w} = (w_1,\dots,w_n)$, for an extended aggregation function $A$ which is symmetric and strongly idempotent, we obtain the weighted aggregation function $A_{\mathbf{w}} : [0,1]^n \to [0,1]$ given by

$$A_{\mathbf{w}}(x_1,\dots,x_n) = A(\underbrace{x_1,\dots,x_1}_{w_1\text{-times}},\ \underbrace{x_2,\dots,x_2}_{w_2\text{-times}},\ \dots,\ \underbrace{x_n,\dots,x_n}_{w_n\text{-times}}).$$

The strong idempotency of $A$ also allows one to introduce rational weights into aggregation. Observe that
for each $k \in \mathbb{N}$, the weights $k \cdot \mathbf{w}$ result in the same weighted aggregation function as when considering the weights $\mathbf{w}$ only. For general weights the limit approach described in [4.17, Proposition 6.27] should be applied.

The qualitative approach to weights considers a transformation of inputs $x_1,\dots,x_n$ according to the considered weights (importances) $w_1,\dots,w_n \in [0,1]$, with the constraint that $w_i = 1$ holds for at least one $i$. This approach is applied when we consider an extended aggregation function $A$ with a strong neutral element $e \in [0,1]$. Then the weighted aggregation function $A_{\mathbf{w}} : [0,1]^n \to [0,1]$ is given by

$$A_{\mathbf{w}}(x_1,\dots,x_n) = A\left(h(w_1,x_1),\dots,h(w_n,x_n)\right),$$

where $h : [0,1]^2 \to [0,1]$ is a relevancy transformation (RET) operator [4.24, 68] satisfying $h(0,x) = e$, $h(1,x) = x$, which is increasing in the second coordinate as well as in the first coordinate for all $x \ge e$, while $h(\cdot,x)$ is decreasing for all $x \le e$. As an example, consider the RET operator $h$ given by

$$h(w,x) = wx + (1-w)e.$$

For more details, we recommend [4.17, Chapter 6].

As already mentioned, all introduced results (sometimes for special types of aggregation functions only) can be straightforwardly extended to any interval $I \subseteq [-\infty,\infty]$. Moreover, one can aggregate more general objects than real numbers. For example, a rapidly expanding field concerns interval mathematics. The aggregation of interval inputs can be done coordinatewise,

$$A([x_1,y_1],\dots,[x_n,y_n]) = [A_1(x_1,\dots,x_n),\ A_2(y_1,\dots,y_n)],$$

where $A_1, A_2$ are an arbitrary couple of classical aggregation functions such that $A_1 \le A_2$ (mostly $A_1 = A_2$ is considered). However, there are also more sophisticated approaches [4.69].

Already in 1942, Menger [4.43] introduced the aggregation of distribution functions whose supports are contained in $[0,\infty]$ (distance functions), which led not only to the concept of triangular norms [4.44], but also to triangle functions directly aggregating such distribution functions [4.70]. Some triangle functions are derived from special aggregation functions (triangular norms), some of them have a more complex background (as a distinguished example recall the standard convolution of distribution functions). For an overview and details we recommend [4.71, 72].

In 1965, Zadeh [4.73] introduced fuzzy sets. Their aggregation, in particular union and intersection, is again built by means of special aggregation functions on $[0,1]$, namely by means of triangular conorms and triangular norms [4.26]. Triangular norms also play an important role in the Zadeh extension principle [4.74-76], which allows one to extend standard aggregation functions acting on real inputs to generalized aggregation functions acting on fuzzy inputs. As a typical example recall the arithmetic of fuzzy numbers [4.77]. In some special fuzzy logics, uninorms have also found application in modeling conjunctions. Among recent generalizations of fuzzy set theory recall the type-2 fuzzy sets, including interval-valued fuzzy sets, or n-fuzzy sets. In all these fields, a deep study of aggregation functions is one of the major theoretical tasks to build a sound background.

Observe that all the mentioned particular domains are covered by the aggregation on posets, where up to now only some particular general results are known [4.22, 78]. We expect an enormous growth of interest in this field, as can be seen, for example, in its special subdomain dealing with computing and aggregation with words [4.79-81].
5. Monotone Measures-Based Integrals


Erich P. Klement, Radko Mesiar


The theory of classical measures and integrals reflects the genuine property of several quantities in standard physics and/or geometry, namely σ-additivity. Though monotone measures not assuming σ-additivity appeared naturally in models extending the classical ones (for example, inner and outer measures, where the related integral was considered by Vitali already in 1925), their intensive research was initiated in the past 40 years by computer science applications in areas reflecting human decisions, such as economics, psychology, multicriteria decision support, etc. In this chapter, we summarize basic types of monotone measures together with the basic monotone measures-based integrals, including the Choquet and Sugeno integrals, and we introduce the concept of universal integrals proposed by Klement et al. to give a common roof for all mentioned integrals. Benvenuti's integrals linked to semicopulas are shown to be a special class of universal integrals. Besides several other integrals, we also introduce decomposition integrals due to Even and Lehrer, and show which decomposition integrals are inside the framework of universal integrals.


Before Cauchy, there was no definition of the integral in the actual sense of the word definition, though integration was already a well-established method applied in many areas. Recall that constructive approaches to integration can be traced as far back as ancient Egypt around 1850 BC; the Moscow Mathematical Papyrus (Problem 14) contains a formula for the volume of a frustum of a square pyramid [5.1]. The first documented systematic technique capable of determining integrals is the method of exhaustion of the ancient Greek astronomer Eudoxus of Cnidos (ca. 370 BC) [5.2], who tried to find areas and volumes by approximating them by a (large) number of shapes for which the area or volume was known. This method was further developed by Archimedes in the third century BC, who calculated the area of parabolas and gave an approximation to the area of a circle. Similar methods were independently developed in China around the third century AD by Liu Hui, who used them to find the area of the circle. This was further developed in the fifth century by the Chinese mathematicians Zu Chongzhi and Zu Geng to find the volume of a sphere. In the same century, the Indian mathematician Aryabhata used a similar method in order to find the circumference of a circle. More than 1000 years later, Johannes Kepler invented the Kepler'sche Fassregel [5.3] (also known as Simpson's rule) in order to compute the (approximate) volume of (wine) barrels.

Based on the fundamental work of Isaac Newton and Gottfried Wilhelm Leibniz in the 17th century (see [5.4, 5]), the first rigorous approach to integration was given by Bernhard Riemann in his Habilitation Thesis at the University of Göttingen [5.6]. Note that Riemann generalized the Cauchy definition of the integral, which was given for continuous real functions (of one variable) defined on a closed interval $[a,b]$.

Among several other developments of the integration theory, recall the Lebesgue approach covering measurable functions defined on a measurable space and general σ-additive measures. Here we recall the final words of H. Lebesgue from his lecture held at a conference in Copenhagen on May 8, 1926, entitled The Development of the Notion of the Integral (for the full text see [5.7]):


... if you will, that a generalization made not for the vain pleasure of generalizing, but rather for the solution of problems previously posed, is always a fruitful generalization. The diverse applications which have already taken the concepts which we have just examined prove this super-abundantly.


All the approaches to integration mentioned so far are related to measurable spaces, measurable real functions and (σ-)additive real-valued measures. Though there are many generalizations and modifications concerning the range and domain of the considered functions and measures (and thus integrals), in this chapter we will stay in the above-mentioned framework, with the only exception that the (σ-)additivity of measures is relaxed into their monotonicity, thus covering many natural generalizations of (σ-)additive measures, such as outer or inner measures, lower or upper envelopes of systems of measures, etc.

Perhaps the first approach to integration not relying on additivity was due to Vitali [5.8]. Vitali was looking for integration with respect to lower/upper measures and his approach is completely covered by the later, more general, approach of Choquet [5.9], see Sect. 5.1. Note that the Choquet integral is a generalization of the Lebesgue integral in the sense that they coincide whenever the considered measure is σ-additive (i.e., when the Lebesgue integral is meaningful).

A completely different approach, influenced by the early development of fuzzy set theory [5.10], is due to Sugeno [5.11]. Sugeno even called his integral a fuzzy integral (and considered set functions as fuzzy measures), though there is no fuzziness in this concept (Sect. 5.1). Later, several approaches generalizing or modifying the above-mentioned integrals were introduced. In this chapter, we give a brief overview of these integrals, i.e., integrals based on monotone measures. In the next section, some preliminaries and basic notions are recalled, as well as the Choquet and Sugeno integrals. Section 5.2 brings a generalization of both Choquet and Sugeno integrals, now known as the Benvenuti integral. In Sect. 5.3, universal integrals are introduced and discussed as a rather general framework for monotone measures-based integrals, including copula-based integrals, among others. In Sect. 5.4, we present some integrals not giving back the underlying measure. Finally, some possible applications are indicated and some concluding remarks are added. Note that we will not discuss integrals defined only for some special subclasses of monotone measures, such as pseudoadditive integrals [5.12, 13] or t-conorms-based integrals of Weber [5.14]. Moreover, we restrict our considerations to normed measures satisfying $m(X) = 1$, and to functions with range contained in $[0,1]$. This is done for the sake of greater transparency; the generalizations for $m(X) \in ]0,\infty]$ and functions with different ranges will be covered by the relevant citations only.


5.1 Preliminaries, Choquet, and Sugeno Integrals 


For a fixed measurable space $(X, \mathcal{A})$, where $\mathcal{A}$ is a σ-algebra of subsets of the universe $X$, we denote by $\mathcal{F}_{(X,\mathcal{A})}$ the set of all $\mathcal{A}$-measurable functions $f : X \to [0,1]$, and by $\mathcal{M}_{(X,\mathcal{A})}$ the set of all monotone measures $m : \mathcal{A} \to [0,1]$ (i.e., $m(\emptyset) = 0$, $m(X) = 1$ and $m(A) \le m(B)$ whenever $A \subseteq B \subseteq X$). Note that functions $f$ from $\mathcal{F}_{(X,\mathcal{A})}$ can be seen as membership functions of fuzzy events on $(X, \mathcal{A})$, and that monotone measures are in different references also called fuzzy measures, capacities, pre-measures, etc. Moreover, if $X$ is finite, we will always consider $\mathcal{A} = 2^X$ only. In such a case, any monotone measure $m \in \mathcal{M}_{(X,\mathcal{A})}$ is determined by $2^{|X|} - 2$ weights from $[0,1]$ (measures of proper subsets of $X$) constrained by the monotonicity condition only, and to each monotone measure $m : \mathcal{A} \to [0,1]$ we can assign its Möbius transform $M_m : \mathcal{A} \to \mathbb{R}$ given by

$$M_m(A) = \sum_{B \subseteq A} (-1)^{|A \setminus B|} \cdot m(B). \tag{5.1}$$

Then

$$m(A) = \sum_{B \subseteq A} M_m(B). \tag{5.2}$$

Moreover, the dual monotone measure $m^d : \mathcal{A} \to [0,1]$ is given by $m^d(A) = 1 - m(A^c)$.
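
For a finite universe, formulas (5.1) and (5.2) and the dual measure can be computed directly. The Python sketch below is only an illustration: it assumes that a monotone measure is stored as a dictionary from `frozenset`s to numbers, and the helper names `subsets`, `mobius`, `from_mobius` and `dual_measure` are hypothetical.

```python
from itertools import combinations


def subsets(s):
    """Yield all subsets of the finite set s as frozensets."""
    items = tuple(s)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def mobius(m, X):
    """Moebius transform (5.1): M_m(A) = sum_{B subset A} (-1)^{|A \\ B|} m(B)."""
    return {A: sum((-1) ** len(A - B) * m[B] for B in subsets(A))
            for A in subsets(X)}


def from_mobius(M, X):
    """Inverse formula (5.2): m(A) = sum_{B subset A} M_m(B)."""
    return {A: sum(M[B] for B in subsets(A)) for A in subsets(X)}


def dual_measure(m, X):
    """Dual monotone measure: m^d(A) = 1 - m(complement of A)."""
    X = frozenset(X)
    return {A: 1 - m[X - A] for A in subsets(X)}


if __name__ == "__main__":
    X = {1, 2}
    m = {frozenset(): 0.0, frozenset({1}): 0.3,
         frozenset({2}): 0.6, frozenset({1, 2}): 1.0}
    M = mobius(m, X)
    print(M[frozenset({1, 2})])                   # 1 - 0.3 - 0.6
    print(from_mobius(M, X)[frozenset({1, 2})])   # recovers m(X) = 1
    print(dual_measure(m, X)[frozenset({1})])     # 1 - m({2})
```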




Among several distinguished subclasses of monotone measures from $\mathcal{M}_{(X,\mathcal{A})}$ we recall these classes, supposing the finiteness of $X$:

• Additive measures, $m(A \cup B) = m(A) + m(B)$ whenever $A \cap B = \emptyset$;
• Maxitive measures, $m(A \cup B) = m(A) \vee m(B)$ (these measures are also called possibility measures [5.15, 16]);
• k-additive measures, $M_m(A) = 0$ whenever $|A| > k$ (hence additive measures are 1-additive);
• Belief measures, $M_m(A) \ge 0$ for all $A \subseteq X$;
• Plausibility measures, $m^d$ is a belief measure;
• Symmetric measures, $M_m(A)$ depends on $|A|$ only.

For more details on monotone measures, we recommend [5.17-19] and [5.20].

Concerning the functions, for any $c \in [0,1]$, $A \in \mathcal{A}$ we define a basic function $b(c,A) : X \to [0,1]$ by

$$b(c,A)(x) = \begin{cases} c & \text{if } x \in A, \\ 0 & \text{else.} \end{cases}$$

Obviously, basic functions can be related to the characteristic functions, $1_A = b(1,A)$ and $b(c,A) = c \cdot 1_A$. However, as we are considering more general types of multiplication than the standard product, in general, we prefer not to depend on the standard product in our considerations.

The first integral introduced for monotone measures was proposed by Choquet [5.9] in 1953.


Definition 5.1
For a fixed monotone measure $m \in \mathcal{M}_{(X,\mathcal{A})}$, a functional $Ch_m : \mathcal{F}_{(X,\mathcal{A})} \to [0,1]$ given by

$$Ch_m(f) = \int_0^1 m(f \ge t)\, dt \tag{5.3}$$

is called the Choquet integral (with respect to $m$), where the right-hand side of (5.3) is the classical Riemann integral.


Note that the Choquet integral is well defined because of the monotonicity of $m$. Observe that if $m$ is σ-additive, i.e., if it is a probability measure on $(X, \mathcal{A})$, then the function $h : [0,1] \to [0,1]$ given by $h(t) = m(f \ge t)$ is the standard survival function of the random variable $f$, and then $Ch_m(f) = \int_0^1 h(t)\, dt = \int_X f\, dm$ is the standard expectation of $f$ (i.e., the Lebesgue integral of $f$ with respect to $m$).
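
On a finite universe the integrand $t \mapsto m(f \ge t)$ in (5.3) is a step function, so the definition can be approximated by a simple Riemann sum. The Python sketch below is an illustration only; the names `level_set` and `choquet_riemann`, the midpoint rule and the number of steps are assumptions. Functions are stored as dictionaries from elements of $X$ to values, and measures as dictionaries from `frozenset`s to values.

```python
def level_set(f, t):
    """The set {x : f(x) >= t} for a function f given as a dict."""
    return frozenset(x for x, value in f.items() if value >= t)


def choquet_riemann(f, m, steps=10000):
    """Midpoint Riemann sum for (5.3): integral over [0, 1] of m(f >= t) dt."""
    dt = 1.0 / steps
    return sum(m[level_set(f, (k + 0.5) * dt)] for k in range(steps)) * dt


if __name__ == "__main__":
    m = {frozenset(): 0.0, frozenset({"a"}): 0.3,
         frozenset({"b"}): 0.6, frozenset({"a", "b"}): 1.0}
    # Basic function b(c, A) with c = 0.5 and A = {"a"}: value c * m(A).
    print(choquet_riemann({"a": 0.5, "b": 0.0}, m))
    # A general function: 0.5 * m(X) + (0.8 - 0.5) * m({"a"}).
    print(choquet_riemann({"a": 0.8, "b": 0.5}, m))
```

The first call illustrates the identity $Ch_m(b(c,A)) = c \cdot m(A)$, since the level sets of a basic function are either $A$ or empty.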


Due to Schmeidler [5.21, 22], we have the following axiomatization of the Choquet integral.


Theorem 5.1
A functional $I : \mathcal{F}_{(X,\mathcal{A})} \to [0,1]$, $I(1_X) = 1$, is the Choquet integral with respect to the monotone measure $m \in \mathcal{M}_{(X,\mathcal{A})}$ given by $m(A) = I(1_A)$ if and only if $I$ is comonotone additive, i.e., if $I(f+g) = I(f) + I(g)$ for all $f, g \in \mathcal{F}_{(X,\mathcal{A})}$ such that $f+g \in \mathcal{F}_{(X,\mathcal{A})}$ and $f$ and $g$ are comonotone, i.e., $(f(x) - f(y)) \cdot (g(x) - g(y)) \ge 0$ for any $x, y \in X$.


We recall some properties of the Choquet integral.

It is evident that the Choquet integral $Ch_m$ is an increasing functional, $Ch_m(f) \le Ch_m(g)$ for any $m \in \mathcal{M}_{(X,\mathcal{A})}$, $f, g \in \mathcal{F}_{(X,\mathcal{A})}$ such that $f \le g$. Moreover, for each $A \in \mathcal{A}$ it holds $Ch_m(b(c,A)) = c \cdot m(A)$, and especially $Ch_m(1_A) = m(A)$.


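Remark 5.1

i) Due to results of Šipoš [5.23], see also [5.24], the comonotone additivity of the functional $I$ in Theorem 5.1, which implies its positive homogeneity, $I(cf) = c \cdot I(f)$ for all $c > 0$ and $f \in \mathcal{F}_{(X,\mathcal{A})}$ such that $cf \in \mathcal{F}_{(X,\mathcal{A})}$, can be replaced by the positive homogeneity of $I$ and its horizontal additivity, i.e.,

$$I(f) = I(f \wedge a) + I(f - f \wedge a)$$

for all $f \in \mathcal{F}_{(X,\mathcal{A})}$ and $a \in [0,1]$.

ii) The Choquet integral $Ch_m : \mathcal{F}_{(X,\mathcal{A})} \to [0,1]$ is continuous from below,

$$\lim_{n \to \infty} Ch_m(f_n) = Ch_m(f)$$

whenever for $(f_n)_{n \in \mathbb{N}} \in \mathcal{F}_{(X,\mathcal{A})}^{\mathbb{N}}$ we have $f_n \le f_{n+1}$ for all $n \in \mathbb{N}$ and $f = \lim_{n \to \infty} f_n$, if and only if $m$ is continuous from below,

$$\lim_{n \to \infty} m(A_n) = m(A)$$

whenever for $(A_n)_{n \in \mathbb{N}} \in \mathcal{A}^{\mathbb{N}}$ we have $A_n \subseteq A_{n+1}$ for all $n \in \mathbb{N}$ and $A = \bigcup_{n \in \mathbb{N}} A_n$.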

iii) The Choquet integral $Ch_m : \mathcal{F}_{(X,\mathcal{A})} \to [0,1]$ is subadditive (superadditive),

$$Ch_m(f+g) \le Ch_m(f) + Ch_m(g) \quad \left(Ch_m(f+g) \ge Ch_m(f) + Ch_m(g)\right)$$

for all $f, g, f+g \in \mathcal{F}_{(X,\mathcal{A})}$, if and only if $m$ is submodular (supermodular),

$$m(A \cup B) + m(A \cap B) \le m(A) + m(B) \quad \left(m(A \cup B) + m(A \cap B) \ge m(A) + m(B)\right)$$

for all $A, B \in \mathcal{A}$.

iv) For any $m \in \mathcal{M}_{(X,\mathcal{A})}$ and $f \in \mathcal{F}_{(X,\mathcal{A})}$ it holds

$$Ch_{m^d}(f) = 1 - Ch_m(1 - f),$$

i.e., in the framework of aggregation functions [5.25] the dual to a Choquet integral (with respect to a monotone measure $m$) is again the Choquet integral (with respect to the dual monotone measure $m^d$).

For the proofs and details about the above results on the Choquet integral, we recommend [5.18, 19, 24, 26].

Restricting our considerations to finite universes, we have also the next evaluation formula due to Chateauneuf and Jaffray [5.27]

$$Ch_m(f) = \sum_{A \subseteq X} M_m(A) \cdot \min(f(x) \mid x \in A). \tag{5.4}$$

In the Dempster-Shafer theory of evidence [5.28, 29], belief measures are considered, and then the Möbius transform $M_m : 2^X \setminus \{\emptyset\} \to [0,1]$ of a belief measure $m$ is called a basic probability assignment. Evidently, $M_m$ can be seen as a probability measure (of singletons) on the finite space $2^X \setminus \{\emptyset\}$ (with cardinality $2^{|X|} - 1$), and defining a function $F : 2^X \setminus \{\emptyset\} \to [0,1]$ by $F(A) = \min(f(x) \mid x \in A)$, the formula (5.4) can be seen as the Lebesgue integral of $F$ with respect to $M_m$ (i.e., it is the standard expectation of the variable $F$)

$$Ch_m(f) = \sum_{A \in 2^X \setminus \{\emptyset\}} F(A) \cdot M_m(A).$$

Another genuine relationship of Choquet and Lebesgue integrals in the framework of the Dempster-Shafer theory is based on the fact that each belief measure $m$ can be seen as a lower envelope of the class of dominating probability measures, i.e., for each $A \subseteq X$ ($X$ is finite)

$$m(A) = \inf\{P(A) \mid P \ge m\}.$$

Then

$$Ch_m(f) = \inf\left\{\int_X f\, dP \,\Big|\, P \ge m\right\}.$$

Similarly, for the related plausibility measure $m^d$, it holds

$$Ch_{m^d}(f) = \sup\left\{\int_X f\, dP \,\Big|\, P \le m^d\right\} = \sup\left\{\int_X f\, dP \,\Big|\, P \ge m\right\}.$$

For interested readers, we recommend the collection [5.30].

In general, for any monotone measure $m \in \mathcal{M}_{(X,\mathcal{A})}$ and any measurable (continuous from below) function $f \in \mathcal{F}_{(X,\mathcal{A})}$ there is a probability measure $P_{m,f}$ on $(X, \mathcal{A})$ so that

$$Ch_m(f) = \int_X f\, dP_{m,f}, \tag{5.5}$$


see, e.g., [5.24, Theorem 2.6], where the right-hand 
side of (5.5) is the standard Lebesgue integral. More- 
over, if f,g € F.a) are comonotone, one can find 
unique probability measure P allowing to express the 
Choquet integral of f and g with respect to m as 
the Lebesgue integral of f and g with respect to P, 
respectively. As an immediate consequence of (5.5), 
Jensen’s inequality for Choquet integral can be shown 
to be valid. Similarly, if f and g are comonotone, 
based on the above observations, one can prove the 
Minkowski and Chebyshev inequality. For more details, 
see [5.31]. 
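The evaluation formula (5.4) can be checked numerically on a small finite universe. The following Python sketch is only an illustration under stated assumptions: the helper names (`nonempty_subsets`, `mobius`, `choquet_by_levels`, `choquet_by_mobius`) and the sample measure are hypothetical, the measure is stored as a dictionary keyed by `frozenset`, and the "level" version uses the finite form of $\int_0^1 m(f \ge t)\,dt$.

```python
from itertools import combinations


def nonempty_subsets(items):
    """Yield every non-empty subset of `items` as a frozenset."""
    items = sorted(items)
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def mobius(m, universe):
    """Moebius transform: M(A) = sum over non-empty B in A of (-1)^(|A|-|B|) * m(B).

    The empty set is skipped because m(empty set) = 0.
    """
    return {
        A: sum((-1) ** (len(A) - len(B)) * m[B] for B in nonempty_subsets(A))
        for A in nonempty_subsets(universe)
    }


def choquet_by_levels(m, f):
    """Finite form of the integral of m(f >= t) over t in [0, 1]."""
    xs = sorted(f, key=f.get)          # points ordered by increasing f-value
    total, prev = 0.0, 0.0
    for i, x in enumerate(xs):
        total += (f[x] - prev) * m[frozenset(xs[i:])]
        prev = f[x]
    return total


def choquet_by_mobius(m, f):
    """Evaluation formula (5.4): sum of M(A) * min of f on A."""
    M = mobius(m, set(f))
    return sum(M[A] * min(f[x] for x in A) for A in M)


# Assumed sample data: m(A) = (card A / 3)^2 on X = {1, 2, 3}.
X = {1, 2, 3}
m = {A: (len(A) / 3) ** 2 for A in nonempty_subsets(X)}
f = {1: 0.2, 2: 0.5, 3: 0.9}
print(choquet_by_levels(m, f), choquet_by_mobius(m, f))
```

Both functions should agree up to rounding; the Möbius version is exponential in the size of $X$ and is meant only for small examples.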

For $k \in \mathbb{N}$, consider a probability measure $P$ on the product space $(X^k, \mathcal{A}^k)$, and define a set function $m : \mathcal{A} \to [0, 1]$ by $m(A) = P(A^k)$. Then $m \in \mathcal{M}_{(X,\mathcal{A})}$ is a $k$-additive monotone measure (and a belief measure, as well), and for all $f \in \mathcal{F}_{(X,\mathcal{A})}$ it holds

$$\mathrm{Ch}_m(f) = \int_{X^k} F \, dP, \tag{5.6}$$

where $F : X^k \to [0, 1]$ is given by $F(x_1, \ldots, x_k) = \min\,(f(x_1), \ldots, f(x_k))$. For more details see [5.32].
The Sugeno integral (in the original sources called fuzzy integral) was introduced by Sugeno in 1972 in Japanese in [5.33] and in English in 1974 in [5.11]. Inspired by the fuzzy set theory introduced by Zadeh [5.10], Sugeno proposed a way to formalize human subjectivity in a spirit similar to randomness but based only on ordinal scales. His concept is not fuzzy, though both fuzzy set theory and Sugeno's integral theory exploit the same aggregation functions (sup and inf), and considering functions $f \in \mathcal{F}_{(X,\mathcal{A})}$ as membership functions of fuzzy subsets of $X$, the corresponding Sugeno integral can be seen as a version of expectation of fuzzy sets.


Definition 5.2
For a fixed monotone measure $m \in \mathcal{M}_{(X,\mathcal{A})}$, a functional $\mathrm{Su}_m : \mathcal{F}_{(X,\mathcal{A})} \to [0, 1]$ given by

$$\mathrm{Su}_m(f) = \sup\,\{\min\,(t, m(f \ge t)) \mid t \in [0, 1]\} \tag{5.7}$$

is called the Sugeno integral (with respect to $m$).

There is an equivalent formula for the Sugeno integral, compare [5.11, Definition 3.1],

$$\mathrm{Su}_m(f) = \sup\,\{\min\,(m(A), \inf\,(f(x) \mid x \in A)) \mid A \in \mathcal{A}\}, \tag{5.8}$$

which in the case of finite $X$ (and using also the lattice notation $\sup = \vee$, $\min = \wedge$) can be rewritten as

$$\mathrm{Su}_m(f) = \bigvee_{A \subseteq X} \bigl(M_m^{\vee}(A) \wedge \min\,(f(x) \mid x \in A)\bigr), \tag{5.9}$$

showing the striking similarity with the evaluation formula (5.4) for the Choquet integral. Here the set function $M_m^{\vee} : 2^X \setminus \{\emptyset\} \to [0, 1]$ is the so-called possibilistic Möbius transform introduced by Mesiar in [5.34] and given by

$$M_m^{\vee}(A) = \begin{cases} 0 & \text{if } m(A) = m(B) \text{ for some } B \subsetneq A, \\ m(A) & \text{else.} \end{cases}$$


The Sugeno integral has properties similar to those of the Choquet integral. Indeed, it is a nondecreasing functional such that $\mathrm{Su}_m(b(c, A)) = c \wedge m(A)$, and in particular $\mathrm{Su}_m(1_A) = m(A)$. Moreover, $\mathrm{Su}_m$ is comonotone maxitive, i.e., $\mathrm{Su}_m(f \vee g) = \mathrm{Su}_m(f) \vee \mathrm{Su}_m(g)$ for any comonotone $f, g \in \mathcal{F}_{(X,\mathcal{A})}$, and min-homogeneous, $\mathrm{Su}_m(c \wedge f) = c \wedge \mathrm{Su}_m(f)$. We have the next axiomatization of the Sugeno integral due to Marichal [5.35] (compare with Theorem 5.1 for the Choquet integral).
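On a finite universe the two formulas (5.7) and (5.8) can be compared directly. The sketch below is an illustration only; the function names and the sample measure are assumptions, and it relies on the observation that for finite $X$ the supremum in (5.7) is attained at one of the values taken by $f$.

```python
from itertools import combinations


def sugeno_by_levels(m, f):
    """Formula (5.7) on a finite universe: it suffices to test t among the values of f."""
    best = 0.0
    for t in set(f.values()):
        level_set = frozenset(x for x in f if f[x] >= t)
        best = max(best, min(t, m[level_set]))
    return best


def sugeno_by_subsets(m, f):
    """Formula (5.8): maximum over non-empty A of min(m(A), min of f on A)."""
    xs = sorted(f)
    best = 0.0
    for r in range(1, len(xs) + 1):
        for combo in combinations(xs, r):
            A = frozenset(combo)
            best = max(best, min(m[A], min(f[x] for x in A)))
    return best


# Assumed sample measure on X = {1, 2, 3}: m(A) = 0 if card A <= 1, else 1.
X = [1, 2, 3]
m = {
    frozenset(c): (0.0 if r <= 1 else 1.0)
    for r in range(1, 4)
    for c in combinations(X, r)
}
f = {1: 0.3, 2: 0.8, 3: 0.5}
print(sugeno_by_levels(m, f), sugeno_by_subsets(m, f))
```

For this particular measure both functions return the median of the three inputs, which matches Example 5.1 i).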


Theorem 5.2
A functional $I : \mathcal{F}_{(X,\mathcal{A})} \to [0, 1]$, $I(1_X) = 1$, is the Sugeno integral with respect to the monotone measure $m \in \mathcal{M}_{(X,\mathcal{A})}$ given by $m(A) = I(1_A)$ if and only if $I$ is comonotone maxitive and min-homogeneous.

For alternative axiomatizations see [5.24].

Choquet and Sugeno integrals with respect to a monotone measure $m$ may differ by not more than $\frac{1}{4}$, i.e., for all $f \in \mathcal{F}_{(X,\mathcal{A})}$ it holds

$$|\mathrm{Ch}_m(f) - \mathrm{Su}_m(f)| \le \frac{1}{4}.$$

Moreover, $\mathrm{Ch}_m(f) = \mathrm{Su}_m(f)$ for all $f \in \mathcal{F}_{(X,\mathcal{A})}$ if and only if $m(A) \in \{0, 1\}$ for all $A \in \mathcal{A}$, and then

$$\mathrm{Ch}_m(f) = \mathrm{Su}_m(f) = \sup\,\{\inf\,\{f(x) \mid x \in A\} \mid m(A) = 1\},$$

which in case $X$ is finite turns out to be a lattice polynomial.

Note that if $X$ has cardinality $n$ and $m(A) \in \{0, 1\}$ for all $A \subseteq X$, then $\mathrm{Ch}_m = \mathrm{Su}_m : [0, 1]^n \to [0, 1]$ are the only $n$-ary continuous aggregation functions invariant under each automorphism $\varphi : [0, 1] \to [0, 1]$, i.e., $\varphi \circ \mathrm{Ch}_m(f) = \mathrm{Ch}_m(\varphi \circ f)$ for each $f \in [0, 1]^n$ (for $f = (a_1, \ldots, a_n)$, $\varphi \circ f = (\varphi(a_1), \ldots, \varphi(a_n))$). For more details see [5.36].


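Example 5.1

i) Let $X = \{1, 2, 3\}$ and define $m : 2^X \to [0, 1]$ by

$$m(A) = \begin{cases} 0 & \text{if } \operatorname{card} A \le 1, \\ 1 & \text{otherwise.} \end{cases}$$

Then, for each $f = (x, y, z) \in [0, 1]^3$,

$$\mathrm{Ch}_m(f) = \mathrm{Su}_m(f) = (x \wedge y) \vee (x \wedge z) \vee (y \wedge z) = \operatorname{med}(x, y, z)$$

brings the classical median.

ii) Let $X = \{1, 2\}$ and define $m : 2^X \to [0, 1]$ by $m(A) = \frac{\operatorname{card} A}{2}$. Then, for each $f = (x, y) \in [0, 1]^2$,

$$\mathrm{Ch}_m(f) = \frac{x + y}{2}$$

(i.e., $\mathrm{Ch}_m$ is the standard arithmetic mean), while

$$\mathrm{Su}_m(f) = (x \wedge y) \vee \left(\tfrac{1}{2} \wedge (x \vee y)\right).$$

For $f_1 = \left(\tfrac{1}{2}, 1\right)$, $\mathrm{Ch}_m(f_1) = \tfrac{3}{4}$ and $\mathrm{Su}_m(f_1) = \tfrac{1}{2}$. For $f_2 = \left(0, \tfrac{1}{2}\right)$, $\mathrm{Ch}_m(f_2) = \tfrac{1}{4}$ and $\mathrm{Su}_m(f_2) = \tfrac{1}{2}$. In general,

$$|\mathrm{Ch}_m(f) - \mathrm{Su}_m(f)| = \frac{1}{2}\,\bigl(|x - y| \wedge |x + y - 1|\bigr) \le \frac{1}{4}.$$

iii) Let $X = [0, 1]$, $\mathcal{A} = \mathcal{B}([0, 1])$ and let $m : \mathcal{A} \to [0, 1]$ be given by $m(A) = \lambda^p(A)$, where $p \in \,]0, \infty[$ is a fixed constant and $\lambda : \mathcal{A} \to [0, 1]$ is the standard Lebesgue measure. For any Lebesgue measure preserving function $f : X \to [0, 1]$, such as $f(x) = x$, $f(x) = 1 - x$, or $f(x) = |2x - 1|$, we have

$$\mathrm{Ch}_m(f) = \int_0^1 m(f \ge t)\,dt = \int_0^1 (1 - t)^p\,dt = \frac{1}{p + 1},$$

$$\mathrm{Su}_m(f) = \sup\,\{\min\,(t, (1 - t)^p) \mid t \in [0, 1]\} = c_p,$$

where $c_p$ is the unique solution of the equation $t = (1 - t)^p$, $t \in \,]0, 1[$. Hence,

if $p = 1$, $\mathrm{Ch}_m(f) = \mathrm{Su}_m(f) = \frac{1}{2}$;

if $p = 2$, $\mathrm{Ch}_m(f) = \frac{1}{3}$ and $\mathrm{Su}_m(f) = \frac{3 - \sqrt{5}}{2} \approx 0.382$;

if $p = \frac{1}{2}$, $\mathrm{Ch}_m(f) = \frac{2}{3}$ and $\mathrm{Su}_m(f) = \frac{\sqrt{5} - 1}{2} \approx 0.618$.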
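The constant $c_p$ from Example 5.1 iii) has no closed form for general $p$, but it is easy to approximate. The following sketch is a hypothetical helper, not part of the original text; it uses plain bisection, which is justified here because $t - (1 - t)^p$ is increasing on $[0, 1]$ and changes sign.

```python
def sugeno_constant(p, tol=1e-12):
    """Approximate the unique solution of t = (1 - t)**p in ]0, 1[ by bisection."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mid < (1 - mid) ** p:
            lo = mid          # the root lies to the right of mid
        else:
            hi = mid          # the root lies at mid or to its left
    return (lo + hi) / 2


def choquet_constant(p):
    """Closed form 1 / (p + 1) of the Choquet integral in Example 5.1 iii)."""
    return 1 / (p + 1)


for p in (0.5, 1, 2):
    print(p, choquet_constant(p), sugeno_constant(p))
```

For $p = 1$, $2$ and $\frac{1}{2}$ the bisection should reproduce the three closed-form values listed in the example.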
5.2 Benvenuti Integral

Comparing Theorems 5.1 and 5.2, we see a striking similarity in the axiomatic characterization of the Choquet and Sugeno integrals. This similarity was generalized under a common roof by Benvenuti et al. [5.24], who called the integral introduced there the general fuzzy integral. This integral is now also known as the Benvenuti integral (compare [5.25]).

The Choquet integral is linked to the standard arithmetic operations $+$ and $\cdot$ on $[0, \infty]$, while the Sugeno integral deals with the lattice operations $\vee$ and $\wedge$ on $[0, 1]$. To generalize these two couples of operations, pseudoaddition $\oplus$ and pseudomultiplication $\odot$ were introduced in [5.24].

Definition 5.3
Let $u \in [1, \infty]$ be a fixed constant. An operation $\oplus : [0, u]^2 \to [0, u]$ is called a pseudoaddition on $[0, u]$ whenever it is associative, nondecreasing in both components, $0$ is its neutral element, and $\oplus$ is continuous.

Observe that the structure $([0, u], \oplus)$ with $\oplus$ a pseudoaddition on $[0, u]$ is just an $I$-semigroup of Mostert and Shields [5.37] and hence $\oplus$ is also commutative. Moreover, considering the principles of Galois connections, we can introduce a pseudodifference $\ominus$ related to $\oplus$ satisfying, for all $a, b, c \in [0, u]$, $(a \ominus b) \le c$ if and only if $a \le b \oplus c$.

It is not difficult to see the link to the pseudodifference considered already by Weber [5.14].

Lemma 5.1
Let $\oplus : [0, u]^2 \to [0, u]$ be a given pseudoaddition on $[0, u]$. The related pseudodifference $\ominus : [0, u]^2 \to [0, u]$ is given by

$$a \ominus b = \inf\,\{c \in [0, u] \mid b \oplus c \ge a\}.$$


Considering the standard addition $+$ on $[0, \infty]$, and $a \ge b$, the corresponding (pseudo-)difference is the standard difference $a - b$. On the other hand, $\vee$ is a pseudoaddition on any interval $[0, u]$, and its corresponding pseudodifference $\ominus_{\vee}$ is given by

$$a \ominus_{\vee} b = \begin{cases} 0 & \text{if } a \le b, \\ a & \text{otherwise.} \end{cases}$$

Due to [5.37], each pseudoaddition $\oplus$ on $[0, u]$ can be represented as an ordinal sum,

$$a \oplus b = \begin{cases} g_k^{-1}\bigl(g_k(\beta_k) \wedge (g_k(a) + g_k(b))\bigr) & \text{if } (a, b) \in \,]\alpha_k, \beta_k[^2, \\ a \vee b & \text{otherwise,} \end{cases}$$

where $(]\alpha_k, \beta_k[)_{k \in K}$ is a disjoint system of open subintervals of $[0, u]$, and $g_k : [\alpha_k, \beta_k] \to [0, \infty]$ is a continuous strictly increasing function such that $g_k(\alpha_k) = 0$, $k \in K$ ($K$ can also be empty). Two extremal cases correspond to $\oplus = \vee$ (when $K$ is empty) and to an Archimedean pseudoaddition $\oplus$ on $[0, u]$ generated by $g : [0, u] \to [0, \infty]$ (when $K$ is a singleton, say $K = \{1\}$, and $\alpha_1 = 0$, $\beta_1 = u$),

$$a \oplus b = g^{-1}\bigl(g(u) \wedge (g(a) + g(b))\bigr).$$

Then $g$ is called an additive generator of $\oplus$ and it is unique up to a positive multiplicative constant.

Note that if $g$ is a bijection, i.e., $g(u) = \infty$, then $a \oplus b = g^{-1}(g(a) + g(b))$ and $\oplus$ is called a strict pseudoaddition.
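An Archimedean pseudoaddition and its pseudodifference can be sketched from an additive generator as follows. This is an illustrative assumption-laden example: the factory names are hypothetical, the generator $g(x) = x^2$ on $[0, 1]$ is chosen only for demonstration, and the closed form used for $\ominus$ is derived here from Lemma 5.1 (since $b \oplus c \ge a$ is equivalent to $g(c) \ge g(a) - g(b)$ for $a \le u$) rather than quoted from the source.

```python
import math


def archimedean_pseudoaddition(g, g_inv, u):
    """Return a (+) b = g^{-1}(min(g(u), g(a) + g(b))) for a generator g on [0, u]."""
    def oplus(a, b):
        return g_inv(min(g(u), g(a) + g(b)))
    return oplus


def archimedean_pseudodifference(g, g_inv):
    """Return a (-) b = g^{-1}(max(0, g(a) - g(b))), derived from Lemma 5.1."""
    def ominus(a, b):
        return g_inv(max(0.0, g(a) - g(b)))
    return ominus


# Assumed example generator: g(x) = x^2 on [0, 1], so g(u) = 1 is finite
# and the resulting pseudoaddition is Archimedean but not strict.
g = lambda x: x * x
g_inv = math.sqrt
oplus = archimedean_pseudoaddition(g, g_inv, 1.0)
ominus = archimedean_pseudodifference(g, g_inv)

print(oplus(0.3, 0.4))   # sqrt(0.09 + 0.16)
print(oplus(0.8, 0.9))   # saturates at u = 1
print(ominus(0.5, 0.3))  # sqrt(0.25 - 0.09)
```

With the identity generator $g(x) = x$ on $[0, 1]$ the same factory yields the bounded sum $\min(1, a + b)$.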

For a fixed pseudoaddition $\oplus$ on $[0, u]$, Benvenuti et al. [5.24] have introduced a $\oplus$-fitting pseudomultiplication $\odot$.

Definition 5.4
Fix $u, v \in [1, \infty]$ and let $\oplus$ be a given pseudoaddition on $[0, u]$. A mapping $\odot : [0, u] \times [0, v] \to [0, u]$ is called a $\oplus$-fitting pseudomultiplication whenever it is nondecreasing in both components, $0$ is its annihilator, i.e., $0 \odot b = a \odot 0 = 0$ for all $a \in [0, u]$, $b \in [0, v]$, it is left distributive over $\oplus$, i.e., $(a \oplus b) \odot c = (a \odot c) \oplus (b \odot c)$ for all $a, b \in [0, u]$, $c \in [0, v]$, and it is lower semicontinuous, i.e.,

$$\Bigl(\bigvee_{n \in \mathbb{N}} a_n\Bigr) \odot \Bigl(\bigvee_{m \in \mathbb{N}} b_m\Bigr) = \bigvee_{n, m \in \mathbb{N}} (a_n \odot b_m).$$

The left distributivity of a pseudomultiplication $\odot$ over $\vee$ simply means the nondecreasingness of $\odot$ in the first coordinate, and thus there are several kinds of $\vee$-fitting pseudomultiplications $\odot$. On the other hand, this is a rather restrictive constraint when $\oplus$ is Archimedean, i.e., generated by an additive generator $g : [0, u] \to [0, \infty]$.

Proposition 5.1
Let $\oplus : [0, u]^2 \to [0, u]$ be an Archimedean pseudoaddition generated by an additive generator $g : [0, u] \to [0, \infty]$. A mapping $\odot : [0, u] \times [0, v] \to [0, u]$ is a $\oplus$-fitting pseudomultiplication if and only if there is a lower semicontinuous nondecreasing function $h : [0, v] \to [0, \infty]$ such that $h(w) = 0$ for some $w \in [0, v]$, and $g(u) \cdot h(a) \ge g(u)$ for all $a \in \,]w, v]$, so that

$$a \odot b = g^{-1}\bigl(g(u) \wedge g(a) \cdot h(b)\bigr).$$

In particular, if $\oplus$ is a strict pseudoaddition, then $h : [0, v] \to [0, \infty]$ is a lower semicontinuous nondecreasing function satisfying $h(0) = 0$, and $a \odot b = g^{-1}(g(a) \cdot h(b))$.


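Definition 5.5
Let $u, v \in [1, \infty]$ be fixed constants, let $\oplus : [0, u]^2 \to [0, u]$ be a given pseudoaddition, and let $\odot : [0, u] \times [0, v] \to [0, u]$ be a given $\oplus$-fitting pseudomultiplication such that $1 \odot 1 \le 1$. For a fixed monotone measure $m \in \mathcal{M}_{(X,\mathcal{A})}$, a functional $B_m^{\oplus,\odot} : \mathcal{F}_{(X,\mathcal{A})} \to [0, 1]$ given by

$$B_m^{\oplus,\odot}(f) = \sup\left\{\bigoplus_{i=1}^{n} (a_i \odot m(A_i)) \;\middle|\; n \in \mathbb{N},\ \bigoplus_{i=1}^{n} b(a_i, A_i) \le f,\ (A_i)_{i=1}^{n} \text{ is a chain}\right\}$$

is called the Benvenuti integral (with respect to $m$, based on $\oplus$ and $\odot$).

Observe that if $s \in \mathcal{F}_{(X,\mathcal{A})}$ is a simple function, range $s = \{b_1, \ldots, b_n\}$, $b_1 < b_2 < \cdots < b_n$, then

$$B_m^{\oplus,\odot}(s) = \bigoplus_{i=1}^{n} (b_i \ominus b_{i-1}) \odot m(s \ge b_i),$$

with the convention $b_0 = 0$. Then for any $f \in \mathcal{F}_{(X,\mathcal{A})}$,

$$B_m^{\oplus,\odot}(f) = \sup\,\bigl\{B_m^{\oplus,\odot}(s) \mid s \in \mathcal{F}_{(X,\mathcal{A})} \text{ is simple},\ s \le f\bigr\}.$$

Evidently,

$$B_m^{\oplus,\odot}(b(a, A)) = a \odot m(A)$$

and hence

$$B_m^{\oplus,\odot}(1_A) = m(A)$$

for all $m \in \mathcal{M}_{(X,\mathcal{A})}$, $A \in \mathcal{A}$ only if $1 \odot b = b$ for all $b \in [0, 1]$.

If $\oplus$ is a strict pseudoaddition on $[0, u]$ generated by an additive generator $g$, this means that $\odot$ restricted to $[0, 1]^2$ is given by

$$a \odot b = g^{-1}\left(\frac{g(a) \cdot g(b)}{g(1)}\right).$$

If $\oplus$ is a nonstrict pseudoaddition, then there is no $\oplus$-fitting pseudomultiplication $\odot$ such that $1 \odot b = b$ for all $b \in [0, 1]$.

Note that for the standard arithmetic operations $+$ and $\cdot$ on $[0, \infty]$, $B_m^{+,\cdot} = \mathrm{Ch}_m$, i.e., the Choquet integral is recovered. Similarly, $B_m^{\vee,\wedge} = \mathrm{Su}_m$.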
Example 5.2

i) Let $u = v = 1$, $\oplus = \vee$ and $\odot : [0, 1]^2 \to [0, 1]$ be given by $a \odot b = a^p \cdot b^q$, $p, q \in \,]0, \infty[$. Then $B_m^{\vee,\odot}(f) = \sup\,\{t^p \cdot (m(f \ge t))^q \mid t \in [0, 1]\}$ for any $m \in \mathcal{M}_{(X,\mathcal{A})}$ and $f \in \mathcal{F}_{(X,\mathcal{A})}$, and $B_m^{\vee,\odot}(1_A) = (m(A))^q$. Note that if $p = q = 1$, the Shilkret integral $\mathrm{Sh}_m = B_m^{\vee,\cdot}$ is recovered, see [5.19, 38]. In general, $B_m^{\vee,\odot}(f) = \mathrm{Sh}_{m^q}(f^p)$ for any $f \in \mathcal{F}_{(X,\mathcal{A})}$.

ii) For a strict pseudoaddition $\oplus$ on $[0, u]$ and a $\oplus$-fitting pseudomultiplication $\odot$ on $[0, u] \times [0, v]$, see Proposition 5.1, the constraint $1 \odot 1 \le 1$ means $h(1) \le 1$, and then $B_m^{\oplus,\odot}(f) = g^{-1}\bigl(\mathrm{Ch}_{h(m)}(g(f))\bigr)$, i.e., $B_m^{\oplus,\odot}$ is obtained as a transformation of the Choquet integral.

For more details, we recommend the original source [5.24], but also [5.25, 39].


Remark 5.2
When considering $u = 1$, a pseudoaddition $\oplus$ on $[0, 1]$ becomes a (continuous) t-conorm. Integrals based on t-conorms, closely related to Benvenuti integrals, were discussed by Murofushi and Sugeno [5.40], resulting in two classes of t-conorm-based integrals. Those based on the smallest t-conorm $\vee$ coincide with the Benvenuti integral based on $\vee$, with stronger requirements on the corresponding $\vee$-fitting pseudomultiplication $\odot$. The second one, based on continuous Archimedean t-conorms, is a special transform of the Choquet integral, compare Example 5.2 ii),

$$\mathrm{MS}_m(f) = k\bigl(\mathrm{Ch}_{h(m)}(g(f))\bigr),$$

with appropriately chosen functions $k, h, g : [0, 1] \to [0, \infty]$. Note that the Murofushi-Sugeno integral covers also the integral of Weber [5.14] based on strict t-conorms. Another closely related approach to integration, fixing $u = v = \infty$, can be found in [5.41], where Choquet-like integrals were introduced and discussed. For more details on these types of integrals we refer to [5.42, 43].


5.3 Universal Integrals

The concept of universal integrals on $[0, \infty]$ was proposed and discussed in [5.44]. As already mentioned, we will restrict our considerations to the interval $[0, 1]$.

Definition 5.6
Let $\mathcal{S}$ be the class of all measurable spaces. A mapping

$$I : \bigcup_{(X,\mathcal{A}) \in \mathcal{S}} \bigl(\mathcal{M}_{(X,\mathcal{A})} \times \mathcal{F}_{(X,\mathcal{A})}\bigr) \to [0, 1]$$

is called a universal integral whenever it satisfies

UI1 $I$ is nondecreasing in both components;

UI2 there is a semicopula $\otimes : [0, 1]^2 \to [0, 1]$ (i.e., $\otimes$ is nondecreasing in both components and $1 \otimes a = a \otimes 1 = a$ for all $a \in [0, 1]$) such that $I(m, b(a, E)) = a \otimes m(E)$ for all $a \in [0, 1]$, any $(X, \mathcal{A}) \in \mathcal{S}$, $m \in \mathcal{M}_{(X,\mathcal{A})}$ and $E \in \mathcal{A}$;

UI3 $I(m_1, f_1) = I(m_2, f_2)$ whenever $(m_i, f_i) \in \mathcal{M}_{(X_i,\mathcal{A}_i)} \times \mathcal{F}_{(X_i,\mathcal{A}_i)}$, $i = 1, 2$, and $m_1(f_1 \ge t) = m_2(f_2 \ge t)$ for all $t \in [0, 1]$.

Observe that the axiom (UI1) reflects the standard monotonicity of integrals. On the other hand, (UI2) expresses the fact that an integral of a basic function $b(a, E)$ with respect to a monotone measure $m$ depends on the values $a$ and $m(E)$ only, independently of the considered measurable space $(X, \mathcal{A})$ and the monotone measure $m \in \mathcal{M}_{(X,\mathcal{A})}$ (compare the truth values principle in propositional logics). Finally, (UI3) generalizes the well-known fact from probability theory that two random variables (defined possibly on two different probability spaces) have the same expectation whenever their distribution functions coincide (in fact, for a probability measure $P$, $P(f \ge t)$ defines a survival function which is complementary to the related distribution function).

There are several construction methods for universal integrals. First of all, for any given semicopula $\otimes : [0, 1]^2 \to [0, 1]$, one can introduce the smallest universal integral $I_{\otimes}$ and the greatest universal integral $I^{\otimes}$ related to $\otimes$ through (UI2):

$$I_{\otimes}(m, f) = \sup\,\{t \otimes m(f \ge t) \mid t \in [0, 1]\}$$

and

$$I^{\otimes}(m, f) = \operatorname{essup}_m(f) \otimes m(\operatorname{supp} f),$$

where

$$\operatorname{essup}_m(f) = \sup\,\{t \in [0, 1] \mid m(f \ge t) > 0\}$$

and

$$\operatorname{supp} f = \{x \in X \mid f(x) > 0\}.$$

Observe that $I_{\wedge}(m, \cdot) = \mathrm{Su}_m$ is the Sugeno integral, $I_{\Pi}(m, \cdot) = \mathrm{Sh}_m$ is the Shilkret integral ($\Pi$ denotes the product semicopula), while $I_T$ with $T$ a strict t-norm is an integral introduced by Weber in [5.45].

Considering the Benvenuti integral based on a pseudoaddition $\oplus$ on $[0, u]$ and a $\oplus$-fitting pseudomultiplication $\odot : [0, u] \times [0, v] \to [0, u]$, $u, v \in [1, \infty]$, such that $\otimes = \odot|_{[0, 1]^2}$ is a semicopula, one gets a universal integral given by

$$I^{\oplus,\odot}(m, f) = B_m^{\oplus,\odot}(f).$$

Note that $I^{+,\cdot}(m, f) = \mathrm{Ch}_m(f)$ and $I^{\vee,\wedge}(m, f) = \mathrm{Su}_m(f)$.
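The smallest universal integral $I_{\otimes}$ is straightforward to evaluate on a finite universe. The sketch below is illustrative only: the function name and sample data are assumptions, and it uses the fact that, because $\otimes$ is nondecreasing in its first argument and $m(f \ge t)$ is a step function of $t$, the supremum is attained at a value of $f$.

```python
def smallest_universal_integral(m, f, semicopula):
    """Evaluate sup over t of semicopula(t, m({f >= t})) on a finite universe."""
    best = 0.0
    for t in set(f.values()):
        level_set = frozenset(x for x in f if f[x] >= t)
        best = max(best, semicopula(t, m[level_set]))
    return best


# Assumed sample data on X = {1, 2} with m(A) = card A / 2.
m = {frozenset({1}): 0.5, frozenset({2}): 0.5, frozenset({1, 2}): 1.0}
f = {1: 0.4, 2: 0.9}

sugeno = smallest_universal_integral(m, f, min)                    # semicopula Min
shilkret = smallest_universal_integral(m, f, lambda a, b: a * b)   # product semicopula
print(sugeno, shilkret)
```

Passing `min` recovers the Sugeno integral and passing the product recovers the Shilkret integral, in line with the remark above.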

As an important class of universal integrals we introduce copula-based integrals. Recall that a semicopula $C : [0, 1]^2 \to [0, 1]$ is called a copula [5.46] whenever it is supermodular, i.e., for any $x, y \in [0, 1]^2$ it holds

$$C(x \vee y) + C(x \wedge y) \ge C(x) + C(y).$$

Note that there is a one-to-one correspondence between copulas and probability measures on Borel subsets of $[0, 1]^2$ with uniformly distributed margins; this relation is stated by the equality

$$P_C([0, a] \times [0, b]) = C(a, b), \quad (a, b) \in [0, 1]^2.$$

The next result is extracted from [5.44], also compare [5.47, 48].

Proposition 5.2
Let $C : [0, 1]^2 \to [0, 1]$ be a fixed copula. Then the mapping

$$K_C : \bigcup_{(X,\mathcal{A}) \in \mathcal{S}} \bigl(\mathcal{M}_{(X,\mathcal{A})} \times \mathcal{F}_{(X,\mathcal{A})}\bigr) \to [0, 1]$$

given by

$$K_C(m, f) = P_C\bigl(\{(u, v) \in [0, 1]^2 \mid v \le m(f \ge u)\}\bigr)$$

is a universal integral (with $C$ being the corresponding semicopula).

Note that for the product copula $\Pi$, $K_{\Pi}(m, \cdot) = \mathrm{Ch}_m$ is the Choquet integral, while for the greatest copula $\wedge = \mathrm{Min}$, $K_{\wedge}(m, \cdot) = \mathrm{Su}_m$ is the Sugeno integral. For the smallest copula $W : [0, 1]^2 \to [0, 1]$ given by

$$W(a, b) = \max\,(0, a + b - 1),$$

$K_W$ was called the opposite Sugeno integral in [5.49] and it is given by

$$K_W(m, f) = \lambda\bigl(\{t \in [0, 1] \mid m(f \ge t) \ge 1 - t\}\bigr),$$

where $\lambda$ is the standard Lebesgue measure on Borel subsets of $[0, 1]$.


Remark 5.3
The class of universal integrals is convex, i.e., for universal integrals $I_1$, $I_2$ and a constant $c \in [0, 1]$, also

$$I = c\,I_1 + (1 - c)\,I_2$$

is a universal integral (related to the semicopula $\otimes = c \cdot \otimes_1 + (1 - c) \cdot \otimes_2$).

Though the class of semicopulas is also convex, for the weakest universal integrals we can ensure only the inequality

$$I_{c \cdot \otimes_1 + (1 - c) \cdot \otimes_2} \le c\,I_{\otimes_1} + (1 - c)\,I_{\otimes_2}.$$

On the other hand, for the convex class of copulas it holds

$$K_{c\,C_1 + (1 - c)\,C_2} = c\,K_{C_1} + (1 - c)\,K_{C_2},$$

i.e., the class of copula-based integrals is convex.
5.4 General Integrals Which Are Not Universal

There are several integrals defined on any measurable space $(X, \mathcal{A})$, for any monotone measure $m \in \mathcal{M}_{(X,\mathcal{A})}$ and any function $f \in \mathcal{F}_{(X,\mathcal{A})}$, which are not universal. We recall two of them based on the standard arithmetic operations $+$ and $\cdot$.

Definition 5.7
A mapping

$$G : \bigcup_{(X,\mathcal{A}) \in \mathcal{S}} \bigl(\mathcal{M}_{(X,\mathcal{A})} \times \mathcal{F}_{(X,\mathcal{A})}\bigr) \to [0, \infty]$$

given by

$$G(m, f) = \sup\left\{\sum_{i=1}^{n} a_i \cdot m(A_i) \;\middle|\; n \in \mathbb{N},\ a_1, \ldots, a_n \ge 0,\ \sum_{i=1}^{n} b(a_i, A_i) \le f \text{ and } (A_i)_{i=1}^{n} \text{ is a disjoint subsystem of } \mathcal{A}\right\}$$

is called a PAN-integral.

Note that this integral was introduced by Yang [5.50], see also [5.51] for a more general setting on $[0, \infty]$ involving operations $\oplus$ and $\odot$. Due to the results of [5.52], each PAN-integral on $[0, 1]$ is either a transformation of the integral given in Definition 5.7, $I(m, f) = g^{-1}(G(g(m), g(f)))$ for some automorphism $g : [0, 1] \to [0, 1]$, or, if $\oplus = \vee$, it is a special instance of the integrals $I_{\otimes}$ discussed in Sect. 5.3. Also observe that a deep discussion of the PAN-integral $G$ can be found in [5.53].

The PAN-integral allows one to recognize the underlying monotone measure $m$ only if $m$ is superadditive. Moreover, as a major defect of this integral we recall that it does not exclude the equality of integrals based on two different monotone measures, i.e., there are monotone measures $m_1, m_2 \in \mathcal{M}_{(X,\mathcal{A})}$, $m_1 \ne m_2$, such that $G(m_1, f) = G(m_2, f)$ for all $f \in \mathcal{F}_{(X,\mathcal{A})}$. Note that the PAN-integral coincides with the Lebesgue integral whenever $m$ is $\sigma$-additive. A similar situation is linked to the concave integral introduced by Lehrer [5.54], see also [5.55].


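Definition 5.8
A mapping

$$L : \bigcup_{(X,\mathcal{A}) \in \mathcal{S}} \bigl(\mathcal{M}_{(X,\mathcal{A})} \times \mathcal{F}_{(X,\mathcal{A})}\bigr) \to [0, \infty]$$

given by

$$L(m, f) = \sup\left\{\sum_{i=1}^{n} a_i \cdot m(A_i) \;\middle|\; n \in \mathbb{N},\ a_1, \ldots, a_n \ge 0,\ \sum_{i=1}^{n} b(a_i, A_i) \le f\right\}$$

is called a concave integral.

Observe that this integral is concave in the sense that for each $m \in \mathcal{M}_{(X,\mathcal{A})}$, $f, g \in \mathcal{F}_{(X,\mathcal{A})}$ and $c \in [0, 1]$,

$$L(m, cf + (1 - c)\,g) \ge c\,L(m, f) + (1 - c)\,L(m, g).$$

The concave integral coincides with the Choquet integral whenever $m$ is supermodular. However, also here $L(m_1, f) = L(m_2, f)$ may hold for all $f \in \mathcal{F}_{(X,\mathcal{A})}$ for some monotone measures $m_1, m_2 \in \mathcal{M}_{(X,\mathcal{A})}$, $m_1 \ne m_2$. Finally, recall that it trivially holds

$$L(m, f) \ge G(m, f) \quad \text{and} \quad L(m, f) \ge \mathrm{Ch}_m(f)$$

for all $m \in \mathcal{M}_{(X,\mathcal{A})}$ and $f \in \mathcal{F}_{(X,\mathcal{A})}$.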


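Example 5.3

i) Consider $X = [0, 1]$, $\mathcal{A} = \mathcal{B}([0, 1])$ and $\lambda$ the standard Lebesgue measure on $\mathcal{A}$. Let $m = \lambda^p$, $p \in \,]0, 1[$. Then for any $f \in \mathcal{F}_{(X,\mathcal{A})}$ with nonvanishing support (i.e., $m(f > 0) > 0$) it holds

$$G(m, f) = L(m, f) = +\infty.$$

On the other hand, for $m = \lambda^2$ (observe that $m$ is supermodular, and thus also superadditive) we get, considering $f = \mathrm{id}_X$,

$$G(m, f) = \frac{2}{\sqrt{3}} - 1 \quad \text{while} \quad L(m, f) = \mathrm{Ch}_m(f) = \frac{1}{3}.$$

ii) For $X = \{1, 2, 3\}$ and $\mathcal{A} = 2^X$, let $m_a : \mathcal{A} \to \mathbb{R}$ be given by $m_a(\emptyset) = 0$, $m_a(A) = 0.1$ if $\operatorname{card} A = 1$, $m_a(A) = a$ if $\operatorname{card} A = 2$ and $m_a(X) = 1$. Evidently, $m_a \in \mathcal{M}_{(X,\mathcal{A})}$ if and only if $a \in [0.1, 1]$. Let $f \in \mathcal{F}_{(X,\mathcal{A})}$ be given by $f(1) = \frac{1}{3}$, $f(2) = \frac{2}{3}$, $f(3) = 1$. Then

$$G(m_a, f) = \sup\left\{\tfrac{1}{3} \cdot 1,\ \tfrac{1}{3} \cdot 0.1 + \tfrac{2}{3} \cdot a\right\} = \begin{cases} \frac{1}{3} & \text{if } a \in [0.1, 0.45], \\ \frac{1 + 20a}{30} & \text{if } a \in \,]0.45, 1], \end{cases}$$

and

$$L(m_a, f) = \sup\left\{\tfrac{1}{3} \cdot 1 + \tfrac{1}{3} \cdot a + \tfrac{1}{3} \cdot 0.1,\ \tfrac{1}{3} \cdot 1 + \tfrac{1}{3} \cdot 0.1 + \tfrac{2}{3} \cdot 0.1,\ \tfrac{1}{3} \cdot a + \tfrac{2}{3} \cdot a\right\} = \begin{cases} \frac{13}{30} & \text{if } a \in [0.1, 0.2[, \\ \frac{11 + 10a}{30} & \text{if } a \in [0.2, 0.55], \\ a & \text{if } a \in \,]0.55, 1]. \end{cases}$$

Moreover,

$$\mathrm{Ch}_{m_a}(f) = \frac{1.1 + a}{3}.$$

Observe that $m_a$ is supermodular if and only if $a \in [0.2, 0.55]$ and then

$$L(m_a, f) = \mathrm{Ch}_{m_a}(f) = \frac{1.1 + a}{3}.$$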

iii) For $X$ finite and $m \in \mathcal{M}_{(X,\mathcal{A})}$ such that $m(A) \in \{0, 1\}$ for all $A \subseteq X$, all universal integrals coincide, independently of the underlying semicopula $\otimes$, $I(m, f) = \sup\,\{\min\,(f(x) \mid x \in A) \mid m(A) = 1\}$. However, this holds neither for the PAN-integral $G(m, \cdot)$ nor for the concave integral $L(m, \cdot)$. Consider as an example the greatest monotone measure $m^* \in \mathcal{M}_{(X,2^X)}$ given by

$$m^*(A) = \begin{cases} 0 & \text{if } A = \emptyset, \\ 1 & \text{else.} \end{cases}$$

Then for any universal integral $I$ it holds $I(m^*, f) = \max\,(f(x) \mid x \in X)$, but $G(m^*, f) = L(m^*, f) = \sum_{x \in X} f(x)$.

iv) The only monotone measures $m \in \mathcal{M}_{(X,2^X)}$, $X$ finite, such that all universal integrals as well as the PAN and concave integrals coincide, are the so-called unanimity measures $m_B$, $B \subseteq X$, $B \ne \emptyset$,

$$m_B(A) = \begin{cases} 1 & \text{if } B \subseteq A, \\ 0 & \text{else.} \end{cases}$$

Then

$$I(m_B, f) = G(m_B, f) = L(m_B, f) = \min\,(f(x) \mid x \in B).$$
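The PAN-integral values in Example 5.3 ii) can be reproduced by brute force. The following sketch is an illustration under stated assumptions: all names are hypothetical, exact rational arithmetic is used via `fractions.Fraction`, and the PAN-integral on a finite universe is computed as a maximum over set partitions with coefficient $a_i = \min f$ on each block (for a disjoint system this choice of coefficients is the largest admissible one, and adding the leftover points as extra blocks cannot decrease the sum).

```python
from fractions import Fraction as Fr


def partitions(items):
    """Yield all set partitions of the list `items` as lists of frozensets."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put `first` into one of the existing blocks ...
        for i in range(len(part)):
            yield part[:i] + [part[i] | {first}] + part[i + 1:]
        # ... or open a new singleton block for it.
        yield part + [frozenset({first})]


def pan_integral(m, f):
    """PAN-integral on a finite universe: best partition, a_i = min of f on block i."""
    return max(
        sum(min(f[x] for x in A) * m(A) for A in part)
        for part in partitions(sorted(f))
    )


def choquet(m, f):
    """Choquet integral on a finite universe via upper level sets."""
    xs = sorted(f, key=f.get)
    total, prev = Fr(0), Fr(0)
    for i, x in enumerate(xs):
        total += (f[x] - prev) * m(frozenset(xs[i:]))
        prev = f[x]
    return total


def m_a(a):
    """Measure of Example 5.3 ii): depends only on the cardinality of A."""
    table = {0: Fr(0), 1: Fr(1, 10), 2: a, 3: Fr(1)}
    return lambda A: table[len(A)]


f = {1: Fr(1, 3), 2: Fr(2, 3), 3: Fr(1)}
for a in (Fr(3, 10), Fr(9, 10)):
    print(a, pan_integral(m_a(a), f), choquet(m_a(a), f))
```

The concave integral $L$ is not computed here; on a finite universe it is the value of a linear program, which would need an LP solver.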


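Recently, a new concept of decomposition integrals was proposed in [5.56], unifying the PAN-integral $G$, the concave integral $L$, the Choquet integral $\mathrm{Ch}$, and the Shilkret integral $\mathrm{Sh}$.

Definition 5.9
Let $(X, \mathcal{A})$ be a measurable space and let $\mathcal{H}$ be a system of some finite subsystems (i.e., of collections) from $\mathcal{A}$. Then the mapping

$$D_{\mathcal{H}} : \mathcal{M}_{(X,\mathcal{A})} \times \mathcal{F}_{(X,\mathcal{A})} \to [0, \infty]$$

given by

$$D_{\mathcal{H}}(m, f) = \sup\left\{\sum_{i \in I} a_i\, m(A_i) \;\middle|\; a_i \ge 0,\ i \in I,\ \sum_{i \in I} b(a_i, A_i) \le f,\ (A_i)_{i \in I} \in \mathcal{H}\right\}$$

is called an $\mathcal{H}$-decomposition integral.

Consider the next decomposition systems:

$\mathcal{H}^{(n)} = \{(A_i)_{i=1}^{n} \mid (A_i)_{i=1}^{n} \text{ is a chain in } \mathcal{A}\}$, $n \in \mathbb{N}$;
$\mathcal{H}_G = \{(A_i)_{i \in I} \mid (A_i)_{i \in I} \text{ is a finite measurable partition of } X\}$;
$\mathcal{H}_L = \{\mathcal{B} \mid \mathcal{B} \text{ is a finite subsystem of } \mathcal{A}\}$;
$\mathcal{H}_{\mathrm{Ch}} = \{\mathcal{B} \mid \mathcal{B} \text{ is a finite chain in } \mathcal{A}\}$.

Then

$D_{\mathcal{H}^{(1)}}(m, \cdot) = \mathrm{Sh}_m$;
$D_{\mathcal{H}_G} = G$;
$D_{\mathcal{H}_L} = L$;
$D_{\mathcal{H}_{\mathrm{Ch}}}(m, \cdot) = \mathrm{Ch}_m$.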
Fig. 5.1a-c The function $\lambda(\mathrm{id}_X \ge t)$ with shaded areas expressing the corresponding integrals $D_{\mathcal{H}^{(1)}}(\lambda, \mathrm{id}_X)$ (a), $D_{\mathcal{H}^{(2)}}(\lambda, \mathrm{id}_X)$ (b), $\mathrm{Ch}_{\lambda}(\mathrm{id}_X)$ (c)


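Further, the only decomposition integrals which are also universal integrals are the Choquet integral and the $\mathcal{H}$-decomposition integrals $D_{\mathcal{H}^{(n)}}$, and they satisfy

$$\mathrm{Sh} = D_{\mathcal{H}^{(1)}} \le D_{\mathcal{H}^{(2)}} \le \cdots \le D_{\mathcal{H}^{(n)}} \le \cdots \le \mathrm{Ch}.$$

Observe that if $X$ is finite, $\operatorname{card} X = n$, then $D_{\mathcal{H}^{(n)}} = \mathrm{Ch}$, and that

$$\mathrm{Ch} = \lim_{n \to \infty} D_{\mathcal{H}^{(n)}} = \sup\,\{D_{\mathcal{H}^{(n)}} \mid n \in \mathbb{N}\}.$$

For more details and further discussion about decomposition integrals, we recommend [5.56-58].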


Example 5.4
Using the notation from Example 5.3 i), it holds

$$D_{\mathcal{H}^{(n)}}(\lambda, \mathrm{id}_X) = \frac{n}{2(n + 1)}$$

and

$$\lim_{n \to \infty} \frac{n}{2(n + 1)} = \frac{1}{2} = \mathrm{Ch}_{\lambda}(\mathrm{id}_X).$$

For better understanding, see Fig. 5.1 with the graph of the function $\lambda(\mathrm{id}_X \ge t)$ and with shaded areas expressing the corresponding integrals.
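The closed form in Example 5.4 can be verified with exact arithmetic. The sketch below is illustrative and rests on two assumptions that are not spelled out in the text: the chain consists of upper level sets $[t_i, 1]$ with thresholds $0 < t_1 < \cdots < t_n < 1$, and the optimal thresholds are equally spaced, $t_i = \frac{i}{n+1}$ (setting the partial derivatives of the sum to zero gives $t_{i+1} + t_{i-1} = 2 t_i$). The function names are hypothetical.

```python
from fractions import Fraction as Fr


def chain_value(thresholds):
    """Sum of (t_i - t_{i-1}) * lambda(id >= t_i), with lambda(id >= t) = 1 - t."""
    total, prev = Fr(0), Fr(0)
    for t in thresholds:
        total += (t - prev) * (1 - t)
        prev = t
    return total


def equally_spaced(n):
    """Thresholds t_i = i / (n + 1), i = 1, ..., n."""
    return [Fr(i, n + 1) for i in range(1, n + 1)]


for n in (1, 2, 5, 50):
    print(n, chain_value(equally_spaced(n)), Fr(n, 2 * (n + 1)))
```

As $n$ grows the printed values approach $\frac{1}{2}$, the Choquet integral of the identity with respect to $\lambda$.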


5.5 Concluding Remarks, Application Fields 


We have recalled and discussed several kinds of integrals defined on any measurable space for any monotone measure and any nonnegative measurable function, restricting our considerations to the unit interval $[0, 1]$. There are several possible extensions of these integrals to the bipolar scale $[-1, 1]$, i.e., for integrating functions with range in $[-1, 1]$. Recall only the case of the Choquet integral with bipolar extensions of different kinds, such as:

• Asymmetric Choquet integral,

$$\mathrm{Ch}_m^{as}(f) = \mathrm{Ch}_m(f^+) - \mathrm{Ch}_{m^d}(f^-),$$

where $f^+ : X \to [0, 1]$ is given by $f^+(x) = \max\,(0, f(x))$, $f^- : X \to [0, 1]$ is given by $f^-(x) = \max\,(0, -f(x))$, and $m^d : \mathcal{A} \to [0, 1]$ is the monotone measure dual to $m$. For more details see [5.18, 19, 26];

• Symmetric (Šipoš) Choquet integral,

$$\mathrm{Ch}_m^{sy}(f) = \mathrm{Ch}_m(f^+) - \mathrm{Ch}_m(f^-),$$

see [5.18, 19, 23, 26];

• In the case when $X$ is finite, two other extensions, called the balanced Choquet integral [5.59] and the merging Choquet integral [5.60], reflecting (partial) compensation of positive and negative inputs, were also introduced and discussed. Further generalizations yield the background of the cumulative prospect theory (CPT) of Tversky and Kahneman [5.61, 62]; however, then two monotone measures are considered,

$$\mathrm{Ch}_{m_1, m_2}(f) = \mathrm{Ch}_{m_1}(f^+) - \mathrm{Ch}_{m_2}(f^-).$$

Observe that economic applications of CPT contributed to the Nobel Memorial Prize in Economic Sciences awarded to Kahneman in 2002 (Tversky had died in 1996).
Some of the integrals discussed here were introduced in order to solve practical problems. For example, the concave integral of Lehrer [5.54] is a solution of an optimization problem looking for a maximal global performance.

Among many fields where integrals discussed in this chapter are an important tool, we recall decision making under multiple criteria, multiobjective optimization, multiperson decision making, pattern recognition and classification, image analysis, etc. For more details, we recommend [5.25, Appendix B] or [5.19].
Many different types of fuzzy sets have appeared in the
literature since Zadeh introduced the concept of fuzzy 
set (or type-1 fuzzy set) [6.1]. Roughly speaking, the 
basic characteristics of all those definitions are the fol- 
lowing: 


i) They are particular instances of the L-fuzzy sets de- 
fined by Goguen [6.2]. 

ii) They arise from theoretical problems and are very 
efficient to solve such theoretical problems. 

iii) The specific characteristics of the new definitions usually do not play a formal role; quite often the new definitions are an easy adaptation of Zadeh’s fuzzy sets.

iv) It is not always shown to what extent the new pro- 
posal implies a practical advantage when compared 
to Zadeh’s fuzzy sets. 


The last point gives rise to a key criticism: additional information is needed for the management of a new kind of fuzzy sets, but the improvement obtained in practice may not justify the effort required to obtain such information. Even more important is the previous criticism (iii), about the difficulty of building the best family of sets for the application we are considering. Surprisingly, this key issue has not captured the attention of many researchers.

In this chapter, we shall focus on those sets conceived to address the problem stated by Zadeh in 1971, namely the difficulty of finding the membership degree of each element (we shall refer to these sets as extensions of fuzzy sets), and then we shall point out applications found in the literature in which the use of some extensions provides better results than the use of type-1 fuzzy sets, according to the comparisons carried out in the papers where this improvement is shown. Once the definition of extension of fuzzy sets has been introduced, we shall describe some of its properties and remark on the structural problems of the different types of these extensions. Among those extensions we shall consider type-2 fuzzy sets, interval-valued fuzzy sets, Atanassov’s intuitionistic fuzzy sets or type-2 bipolar fuzzy sets and Atanassov’s interval-valued fuzzy sets.

We have organized this chapter as follows. In Sect. 6.2 we start by recalling the reasons that led Zadeh to introduce fuzzy sets. We also recall the basic notions of Brouwer’s intuitionistic theory in order to later justify the terminological problems linked to the sets defined by Atanassov. In Sect. 6.3 we present the origin of the
extensions of fuzzy sets as well as the definitions. Sec- 
tion 6.4 is devoted to type-2 fuzzy sets. We stress the 
problems related to the definition of the basic operations 
and the terminology. In Sect. 6.5 we analyze a particular 
case of the previous sets, namely, interval-valued fuzzy 
sets. We present their properties and different construc- 
tion methods, depending on the application that we are 
dealing with. We also refer to the papers in which it 
is shown that the results that we obtain with these sets 
are better than those obtained with other techniques. 
In Sects. 6.6 and 6.7 we describe the sets defined by 
Atanassov. Section 6.8 explains the links between the 
considered extensions. In Sect. 6.9 we exhibit some 
other definitions of fuzzy sets in the literature that do 
not fall into the scope of our notion of extension. We 
finish with some conclusions and references. 


6.1 Considerations Prior to the Concept of Extension of Fuzzy Sets 


In classical logic, propositions can only be either true 
or false. Aristotle formulated the basic principles of 
this logic: the noncontradiction principle (a statement 
cannot be true and false at the same time) and the 
middle-excluded principle (every statement is either 
true or false). 

It is easy to note that there are many situations for which more than two truth values are needed. This fact led C.S. Peirce to say that Aristotle’s formulation is the simplest hypothesis we can work with. In fact, since human knowledge representation is based upon concepts [6.3], and these concepts are not crisp in nature, we should not expect human beings to use binary logic very often in their daily life.


Everyday situations such as taste, meaning of ad- 
jectives, etc., can only be studied precisely if gradings 
more complex than true or false are considered. Even 
very widely used mathematical models can lead to 
paradoxes. For instance, quite often we are forced to 
establish arbitrary cuts in order to make reality fit our 
binary model. 

These considerations led to the proposal of different logical formulations which allow for more than two truth values, such as Brouwer’s intuitionistic logic (partially captured by the so-called intuitionistic propositional calculus modeled by Heyting algebras), the multivalued logics presented by Łukasiewicz, or Zadeh’s fuzzy logic (which replaces the set {0,1} by the set [0, 1]).






6.1.1 Brouwer's Intuitionistic Logic 


In 1907, the Dutch mathematician L.E.J. Brouwer (1881-1966) introduced intuitionistic logic. Among the precursors of intuitionistic logic, we can include Kronecker, Poincaré, Borel, and Weyl.

For intuitionistic researchers, the objects of study in mathematics are just some intuitions of the mind and the constructions that can be made with them. Hence, intuitionistic mathematics only handles built objects and only recognizes the properties assigned to these objects in their construction. In particular, the negation of the impossibility of a fact is not a construction of such a fact, and so both the double negation principle and the reductio ad absurdum method are not acceptable for the intuitionist. In the same way, it may happen that it is impossible to build both a fact and its negation, so the middle-excluded principle is also rejected by intuitionism.

In 1930 Heyting, a disciple of Brouwer, went one 
step further and defined a propositional calculus in terms 
of axioms and rules in Hilbert’s style. This calculus 
is known as the intuitionistic propositional calculus 
(intuitionistic logic). For several decades, research in 
intuitionism almost stopped, but it has reappeared 
with strength in the logic of categories and topoi [6.4, 
5]. In this sense, the studies by Takeuti and Titani in 
1984 [6.6] on intuitionistic fuzzy logic and intuitionistic 
fuzzy set theory are of special interest for us. In [6.7], 
it is stated that 


Takeuti and Titani’s intuitionistic fuzzy logic is simply 
an extension of intuitionistic logic, i.e., all 
formulas provable in the intuitionistic logic are 
provable in their logic. They give a sequent calculus 
which extends Heyting intuitionistic logic, an exten- 
sion that does not collapse to classical logic and 
keeps the flavor of intuitionism. 


6.1.2 Lukasiewicz's Multivalued Logics 


In the 1920s, Jan Lukasiewicz (1878-1956) along with 
Lesniewski founded a school of logic in Warsaw that 
became one of the most important mathematical teams 
in the world, and among whose members was Alfred 
Tarski. 

Lukasiewicz’s idea consists in distributing the truth 
values uniformly on the [0, 1] interval: if n values are 
considered, they should be $0, \frac{1}{n-1}, \frac{2}{n-1}, \ldots, \frac{n-2}{n-1}, 1$; if 
they are infinite, we should take $\mathbb{Q} \cap [0, 1]$. Negation is 
defined as $n(x) = 1 - x$, and the following operation is 
also defined: $x \oplus y = \min(1, x + y)$. 
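As a small illustration of these definitions, the following Python sketch builds the n evenly spaced truth values and implements the negation and the bounded sum. The function names (truth_values, neg, bounded_sum) and the use of exact fractions are assumptions made for this example only; they are not part of Lukasiewicz's formulation.

```python
from fractions import Fraction


def truth_values(n):
    """Return the n evenly spaced truth values 0, 1/(n-1), ..., 1."""
    if n < 2:
        raise ValueError("at least two truth values are needed")
    return [Fraction(k, n - 1) for k in range(n)]


def neg(x):
    """Negation n(x) = 1 - x."""
    return 1 - x


def bounded_sum(x, y):
    """The operation x (+) y = min(1, x + y)."""
    return min(1, x + y)


values = truth_values(5)  # 0, 1/4, 1/2, 3/4, 1
# Both operations stay inside the chosen set of truth values.
closed = all(neg(x) in values and bounded_sum(x, y) in values
             for x in values for y in values)
print(values, closed)
```

With n = 2 the sketch reduces to the classical values 0 and 1, on which the bounded sum behaves like the classical disjunction.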


6.1.3 Zadeh's Fuzzy Logic. 
First Generalization by Goguen 


In line with Lukasiewicz’s studies, Zadeh [6.1] introduced 
fuzzy logic in his 1965 paper, Fuzzy Sets. 
Born in Azerbaijan in 1921, he moved to the Univer- 
sity of California at Berkeley in 1959. His ideas on 
fuzzy sets were soon applied to different areas such as 
artificial intelligence, natural language, decision mak- 
ing, expert systems, neural networks, control theory, 
etc. 

In mathematics, every subset of a given referential 
universe U can be identified with its characteristic function 
f; that is, the function $f: U \to \{0, 1\}$ which takes the 
value 1 if the element belongs to the considered subset 
and 0 otherwise. In contrast, a fuzzy set is a mapping 
from the universe U to [0, 1]; that is, 


Definition 6.1 
A fuzzy set (or type-1 fuzzy set) A over a referential 
set U is an object 


$A = \{(u_i, \mu_A(u_i)) \mid u_i \in U\}$, 
where $\mu_A: U \to [0, 1]$. 


$\mu_A(u_i)$ represents the degree of membership of the 
element $u_i \in U$ to the set A. The elements for which 
$\mu_A(u_i) = 1$ belong to the set A; those for which 
$\mu_A(u_i) = 0$ do not belong to A, and there are elements 
with a greater or smaller degree of membership to A 
depending on $\mu_A(u_i)$. 

We are going to denote by FS(U) the class of fuzzy 
sets defined over U; that is, $FS(U) = [0, 1]^U$. The 
membership degree of an element $u_i \in U$ to the fuzzy set A 
is usually denoted by $A(u_i)$ instead of $\mu_A(u_i)$. 

From the classical definition of union and inter- 
section for crisp sets, Zadeh proposes the following 
definitions: 


$A \cup B(u_i) = \max(A(u_i), B(u_i))$, 
$A \cap B(u_i) = \min(A(u_i), B(u_i))$. (6.1) 
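Zadeh's pointwise maximum and minimum can be sketched in a few lines of Python. Representing a fuzzy set over a finite referential set as a dictionary from elements to membership degrees is an assumption of this example, as are the names fuzzy_union and fuzzy_intersection and the sample degrees.

```python
def fuzzy_union(a, b):
    """Pointwise maximum of two fuzzy sets over the same referential set."""
    return {u: max(a[u], b[u]) for u in a}


def fuzzy_intersection(a, b):
    """Pointwise minimum of two fuzzy sets over the same referential set."""
    return {u: min(a[u], b[u]) for u in a}


# Two illustrative fuzzy sets over U = {u1, u2, u3}.
A = {"u1": 0.2, "u2": 0.7, "u3": 1.0}
B = {"u1": 0.5, "u2": 0.4, "u3": 0.0}

print(fuzzy_union(A, B))
print(fuzzy_intersection(A, B))
```

When all degrees are 0 or 1, these two functions coincide with the usual union and intersection of crisp sets written as characteristic functions.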


A key concept in the following developments is that 
of a lattice. We now review its definition, which can be 
found, for instance, in [6.8]. 
Recall that an order relation over a set L is a relation 
$\leq_L$ such that 


i) $x \leq_L x$ for all $x \in L$ (reflexivity); 

ii) if $x \leq_L y$ and $y \leq_L z$, then $x \leq_L z$ for any $x, y, z \in L$ 
(transitivity); 

iii) if $x \leq_L y$ and $y \leq_L x$, then $x = y$, for any $x, y \in L$ 
(antisymmetry). 


If $\leq_L$ is an order relation over L, then $(L, \leq_L)$ 
is called a partially ordered set. Now, in order to define 
a lattice, we first need to introduce the following 
definition. 


Definition 6.2 

Let $(L, \leq_L)$ be a partially ordered set and $A \subseteq L$ (in the 
sense of the usual set theory). The greatest lower bound 
of A (if it exists) is the element $x_{\inf} \in L$ such that: 


i) $x_{\inf} \leq_L z$ for all $z \in A$ and 
ii) for any $y \in L$ such that $y \leq_L z$ for all $z \in A$ it follows 
that $y \leq_L x_{\inf}$. 


Analogously, the least upper bound of A (if it exists) is 
the element $x_{\sup} \in L$ such that 


i) $z \leq_L x_{\sup}$ for all $z \in A$ and 
ii) for any $y \in L$ such that $z \leq_L y$ for all $z \in A$ it follows 
that $x_{\sup} \leq_L y$. 


Now we can introduce the notion of lattice. 


Definition 6.3 

A lattice is a partially ordered set $(L, \leq_L)$ such that any 
two elements $x, y \in L$ have a greatest lower bound or 
meet, denoted by $x \wedge y$, and a least upper bound or 
join, denoted by $x \vee y$. A lattice L is called complete 
if any subset of L has a least upper bound and a 
greatest lower bound. 


Given a lattice $(L, \leq_L)$, we will call supremum of 
L, and denote by $1_L$, the least upper bound of L (if it 
exists). Analogously, we will call infimum of L, and 
denote by $0_L$, the greatest lower bound of L. In case both 
$1_L$ and $0_L$ exist, L is called a bounded lattice. 

Observe that if we know how the join and meet operations 
are defined for any two elements of a set L, 
we can recover the ordering $\leq_L$ just by defining, for any 
$x, y \in L$, 


$x \leq_L y$ if and only if $x \wedge y = x$ 


if and only if $x \vee y = y$. 
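The recovery of the order from the meet and the join can be checked on a small lattice. The following Python sketch uses, purely as an assumed example that does not appear in the text, the divisors of 12 ordered by divisibility, where the meet is the greatest common divisor and the join is the least common multiple.

```python
from math import gcd


def meet(x, y):
    """Meet in the divisibility lattice: greatest common divisor."""
    return gcd(x, y)


def join(x, y):
    """Join in the divisibility lattice: least common multiple."""
    return x * y // gcd(x, y)


def leq_from_meet(x, y):
    """x <= y if and only if meet(x, y) == x."""
    return meet(x, y) == x


def leq_from_join(x, y):
    """x <= y if and only if join(x, y) == y."""
    return join(x, y) == y


divisors = [d for d in range(1, 13) if 12 % d == 0]  # 1, 2, 3, 4, 6, 12

# Both characterizations agree with "x divides y" on every pair.
agree = all(leq_from_meet(x, y) == leq_from_join(x, y) == (y % x == 0)
            for x in divisors for y in divisors)
print(agree)
```

Note that 4 and 6 are not comparable in this lattice, so the recovered order is partial rather than linear.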


Taking into account (6.1) and Definition 6.3, it is easy 
to prove the following theorem. 


Theorem 6.1 
$(FS(U), \cup, \cap)$ is a complete lattice. 


From Theorem 6.1 and the concept of lattice, we 
can define the following partial order relation: for 
$A, B \in FS(U)$, 


$A \leq_{FS} B$ if and only if $A(u_i) \leq B(u_i)$ 


for every $u_i \in U$. 


The first criticism of fuzzy set theory arises from 
this order relation $\leq_{FS}$. Although Zadeh presented fuzzy 
sets to represent uncertainty, it turns out that $\leq_{FS}$ is 
a crisp relation. Note that the following may happen: 
let U be a referential set with 1000 elements and let A 
and B be two fuzzy sets over U such that $A(u_i) < B(u_i)$ 
for every element except one. Then, from the 
previous relation, A is not less than or equal to B. This fact led 
Willmott [6.9], Bandler and Kohout [6.10] and others 
to consider the concept of inclusion measure. These 
measures have been widely used in fuzzy mathematical 
morphology [6.11], in image processing [6.12], etc. 
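The situation described above is easy to reproduce. In the following Python sketch the membership degrees (0.3, 0.6 and 0.7), the dictionary representation and the name is_included are assumptions chosen for illustration; only the size of the referential set and the single exception come from the text.

```python
def is_included(a, b):
    """Crisp order on fuzzy sets: A <= B iff A(u) <= B(u) for every u."""
    return all(a[u] <= b[u] for u in a)


universe = range(1000)
A = {u: 0.3 for u in universe}
B = {u: 0.6 for u in universe}
A[0] = 0.7  # the single element where A is not below B

# Number of elements for which A(u) < B(u) holds.
satisfied = sum(A[u] < B[u] for u in universe)
print(satisfied, is_included(A, B))
```

The order answers only yes or no, so 999 favourable elements out of 1000 count for nothing; a graded inclusion measure is meant to avoid exactly this loss of information.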

It is easy to see that with the operations defined 
in (6.1) and the standard negation, $n(x) = 1 - x$ for 
all $x \in [0, 1]$, neither the noncontradiction principle nor 
the excluded middle principle holds. Nowadays, the 
operations in (6.1) are given in terms of t-norms and 
t-conorms [6.13-16]. 
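A minimal numerical check of this claim, written in Python with an assumed dictionary representation and illustrative degrees, is the following. Noncontradiction would require min(A(u), 1 - A(u)) = 0 and the excluded middle would require max(A(u), 1 - A(u)) = 1 for every element.

```python
def neg(x):
    """Standard negation n(x) = 1 - x."""
    return 1 - x


A = {"u1": 0.0, "u2": 0.5, "u3": 0.8, "u4": 1.0}

# Degree of "A and not A" and of "A or not A" for each element.
contradiction = {u: min(m, neg(m)) for u, m in A.items()}
excluded_middle = {u: max(m, neg(m)) for u, m in A.items()}

print(contradiction)
print(excluded_middle)
```

Only the crisp degrees 0 and 1 (elements u1 and u4) satisfy both principles; for u2 both expressions equal 0.5.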

Definition 6.1 can be clearly extended to consider 
mappings valued over any kind of set. In particular, 
for our future developments and following Goguen’s 
work [6.2], it is interesting to consider the case of map- 
pings that take values over a lattice L. In this case, we 
speak of L-fuzzy sets. 

Taking into account Definition 6.3, Goguen presents 
the concept of L-fuzzy set as follows: 


Definition 6.4 
Let $(L, \vee, \wedge)$ be a lattice. An L-fuzzy set over the 
referential set U is a mapping 


$A: U \to L$. 


For a given lattice L, we will denote by L-FS(U) 
the space of L-fuzzy sets over the referential U. That is, 
$L\text{-}FS(U) = L^U$. 
Union and intersection of L-fuzzy sets can be easily 
defined as follows. 


Definition 6.5 

Let L be a lattice, and let $\vee$ and $\wedge$ be its join and meet 
operators, respectively. Then intersection and union are 
defined, respectively, by: 


i) 
$\cap_L: L\text{-}FS(U) \times L\text{-}FS(U) \to L\text{-}FS(U)$ given by 
$\cap_L(A, B)(u_i) = A(u_i) \wedge B(u_i)$. 
In order to recover the usual notation for fuzzy sets, 
we will write $\cap_L(A, B)$ as $A \cap_L B$; 
ii) 
$\cup_L: L\text{-}FS(U) \times L\text{-}FS(U) \to L\text{-}FS(U)$ given by 
$\cup_L(A, B)(u_i) = A(u_i) \vee B(u_i)$. 
In order to recover the usual notation for fuzzy sets, 
we will write $\cup_L(A, B)$ as $A \cup_L B$. 
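Definition 6.5 only needs the join and the meet of L, so it can be sketched generically. In the Python example below, the lattice L is assumed to be the unit square [0, 1] x [0, 1] with the componentwise order; this choice, the sample degrees and all function names are illustrative assumptions and are not taken from Goguen's work.

```python
def l_union(a, b, join):
    """Union of two L-fuzzy sets: apply the join of L elementwise."""
    return {u: join(a[u], b[u]) for u in a}


def l_intersection(a, b, meet):
    """Intersection of two L-fuzzy sets: apply the meet of L elementwise."""
    return {u: meet(a[u], b[u]) for u in a}


# Assumed example lattice: pairs in [0, 1] x [0, 1], ordered componentwise.
def join2(x, y):
    return (max(x[0], y[0]), max(x[1], y[1]))


def meet2(x, y):
    return (min(x[0], y[0]), min(x[1], y[1]))


A = {"u1": (0.2, 0.9), "u2": (0.5, 0.5)}
B = {"u1": (0.6, 0.3), "u2": (0.1, 0.7)}

union = l_union(A, B, join2)
intersection = l_intersection(A, B, meet2)
print(union)
print(intersection)

# The induced order: A <= B if and only if the union of A and B equals B.
print(l_union(A, union, join2) == union)
```

Passing max and min as the join and the meet, with degrees in [0, 1], gives back the operations in (6.1).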


We can state the following result for L-fuzzy sets. 


Proposition 6.1 

Let L be a bounded lattice with a supremum given by 
$1_L$ and an infimum given by $0_L$. Let $\vee$ and $\wedge$ be the 
join and meet operators of L, respectively. Then, the set 
$(L\text{-}FS(U), \leq_{L\text{-}FS(U)})$ is a bounded lattice, where the 
order is defined as 


$A \leq_{L\text{-}FS(U)} B$ if and only if $A \cup_L B = B$ 
or equivalently 

$A \leq_{L\text{-}FS(U)} B$ if and only if $A \cap_L B = A$. 
That is, 


$A \leq_{L\text{-}FS(U)} B$ if and only if $A(u_i) \vee B(u_i) = B(u_i)$ 
for all $u_i \in U$ 


or equivalently 


$A \leq_{L\text{-}FS(U)} B$ if and only if $A(u_i) \wedge B(u_i) = A(u_i)$ 
for all $u_i \in U$. 


The supremum of this lattice is given by 


$1_{L\text{-}FS(U)}: U \to L$, 
$u_i \mapsto 1_L$, 


and the infimum is given by 


$0_{L\text{-}FS(U)}: U \to L$, 
$u_i \mapsto 0_L$. 


6.2 Origin of the Extensions 


In 1971, Zadeh stated in his paper [6.17] that the 
construction of the fuzzy sets, that is, the determination of 
the membership degree of each element to the set, is 
the biggest problem for using fuzzy sets theory in 
applications. This fact led him to introduce the concept of 
type-2 fuzzy set. 

Later, on December 11, 2008, on the BISC-group mailing 
list, Zadeh proposed the following definitions. 


Definition 6.6 

Fuzzy logic is a precise system of reasoning, deduction, 
and computation in which the objects of discourse and 
analysis are associated with information which is, or is 
allowed to be, imperfect. 


Definition 6.7 

Imperfect information is defined as information which 
in one or more respects is imprecise, uncertain, vague, 
incomplete, partially true, or partially possible. 


On the same date and place, Zadeh made the fol- 
lowing remarks: 


1. In fuzzy logic everything is or is allowed to be 
a matter of degree. Degrees are allowed to be 
fuzzy. 

2. Fuzzy logic is not a replacement for bivalent logic 
or bivalent-logic-based probability theory. Fuzzy 
logic adds to bivalent logic and bivalent-logic- 
based probability theory a wide range of concepts 
and techniques for dealing with imperfect informa- 
tion. 

3. Fuzzy logic is designed to address problems in rea- 
soning, deduction, and computation with imperfect 
information which are beyond the reach of tradi- 
tional methods based on bivalent logic and bivalent- 
logic-based probability theory. 

4. In fuzzy logic the writing instrument is a spray pen 
(Fig. 6.1) with a precisely known adjustable spray 
pattern. In bivalent logic the writing instrument is 
a ballpoint pen. 

5. The importance of fuzzy logic derives from the fact 
that in much of the real world imperfect information 
is the norm rather than the exception. 


All these considerations justify the use of fuzzy sets 
theory whenever objects are linked to soft concepts, 
those that do not show clear boundaries. Of course, ap- 
plications might require tools other than fuzzy [6.18]. In 
any case, if we decide to use fuzzy sets and it is hard for 
us to build the characteristic functions of the involved 
sets, then we must use set representations that take into 
account these difficulties, and focus on those fuzzy sets 
that we call extensions. 


So the origin of the concept of extension of fuzzy 
sets is directly associated with the idea of building fuzzy 
sets that allow us to represent objects that are described 
through imperfect information, and that also allow us to 
represent the lack of knowledge or uncertainty associated 
with the membership degrees that are given by the 
experts. 

It is clear that working with extensions implies that 
we need to use more information than in the basic 
model of Zadeh. As already pointed out, in order to justify 
the use of these extensions in practice, the results 
obtained with them must be better than those obtained 
with usual fuzzy sets. 


6.3 Type-2 Fuzzy Sets 


The idea of taking into account the experts’ uncertainty 
when they build the membership degrees of the elements 
of a given fuzzy set led Zadeh to present in 
1971 the notion of type-2 fuzzy set [6.17] as follows: 
A type-2 fuzzy set is a fuzzy set over a referential set U 
for which the membership degrees of the elements are 
given by fuzzy sets defined over the referential set [0, 1]. 

The mathematical formalization of this concept was 
made in 1976 by Mizumoto and Tanaka in [6.19] and in 
1979 by Dubois and Prade in [6.20] as follows: 


Definition 6.8 
A type-2 fuzzy set is a mapping $A: U \to FS([0, 1])$. 


In Fig. 6.2 we show an example of a type-2 fuzzy set. 
We denote by T2FS(U) the set of all type-2 fuzzy 
sets over U. That is, 


$T2FS(U) = (FS([0, 1]))^U$. 


6.3.1 Type-2 Fuzzy Sets as a Lattice 


From Definition 6.8, the following result is obvious. 


Corollary 6.1 
Type-2 fuzzy sets are a particular type of Goguen’s 
L-fuzzy sets. 


Taking into account Corollary 6.1, it is clear that we 
can define the following operations over type-2 fuzzy 
sets [6.21]. 


Definition 6.9 
The operations of union $\cup_{T2}$ and intersection $\cap_{T2}$ of 
$A, B \in T2FS(U)$ (in the sense of lattices) are defined, 
respectively, as 


$\cup_{T2}(A, B): U \to FS([0, 1])$ given by 
$A \cup_{T2} B(u_i) = A(u_i) \cup B(u_i)$ 


and 


$\cap_{T2}(A, B): U \to FS([0, 1])$ given by 
$A \cap_{T2} B(u_i) = A(u_i) \cap B(u_i)$. 


Proposition 6.2 
The set $(T2FS(U), \cup_{T2}, \cap_{T2})$ is a bounded lattice with 
respect to the order 


$A \leq_{T2FS(U)} B$ if and only if $A \cup_{T2} B = B$ 
or equivalently 

$A \leq_{T2FS(U)} B$ if and only if $A \cap_{T2} B = A$. 
That is, 


$A \leq_{T2FS(U)} B$ if and only if $A(u_i) \cup B(u_i) = B(u_i)$ 
for all $u_i \in U$ 


or equivalently 


$A \leq_{T2FS(U)} B$ if and only if $A(u_i) \cap B(u_i) = A(u_i)$ 
for all $u_i \in U$. 


The supremum of this lattice is given by $1_{T2FS(U)}: 
U \to FS([0, 1])$ where, for every $u_i \in U$, $1_{T2FS(U)}(u_i)$ is 
the fuzzy set that assigns to every $t \in [0, 1]$ membership 
equal to 1. The infimum is given by $0_{T2FS(U)}: U \to 
FS([0, 1])$ where, for every $u_i \in U$, $0_{T2FS(U)}(u_i)$ is the fuzzy 
set that assigns to every $t \in [0, 1]$ membership equal 
to 0. 


6.3.2 Remarks on the Notation 


Mizumoto and Tanaka in 1976 [6.19] and Mendel and 
John in 2000 [6.22] used the following notation: 


$A = \int_{u \in U} \int_{t \in J_u} A(u, t)/(u, t), \quad J_u \subseteq [0, 1]$, 


where $J_u$ is the primary membership of $u \in U$ and, for 
each fixed $u = u_0$, the fuzzy set $\int_{t \in J_{u_0}} A(u_0, t)/t$ is the 
secondary membership of $u_0$. 

From our point of view, this notation is not the most 
appropriate one, so we now try to introduce a clearer 
notation. Observe that a type-2 fuzzy set assigns 
to an element u in the referential U a mapping 
$A(u): [0, 1] \to [0, 1]$. To represent fuzzy sets (or type-1 
fuzzy sets) defined by a mapping A, the following 
notation is quite usual: 


$\{(u_i, A(u_i)) \mid u_i \in U\}$. (6.2) 


In this type-1 case, $A(u_i)$ is a real number in [0, 1] for every 
$u_i \in U$. In the case of type-2 fuzzy sets, if we imitate 
this notation, we formally arrive at $\{(u_i, A(u_i)) \mid u_i \in U\}$. 
But now, for each $u_i \in U$, we have that $A(u_i)$ is not a real 
number but a mapping (a type-1 fuzzy set) 


$A(u_i): [0, 1] \to [0, 1]$, 
$t \mapsto A(u_i)(t)$. 

Fig. 6.2 Example of a type-2 fuzzy set 


Taking into account these considerations, Harding 
et al. [6.21] and Aisbett et al. [6.23] suggested the 
following notation for a type-2 fuzzy set A: 


$A = \{(u_i, (t, A(u_i)(t))) \mid u_i \in U, t \in [0, 1]\}$. 
But a notation that is easier to use could be the following. 


Definition 6.10 
Let $A: U \to FS([0, 1])$ be a type-2 fuzzy set. Then A is 
denoted as 

$\{(u_i, A(u_i, t)) \mid u_i \in U, t \in [0, 1]\}$, 
where $A(u_i, \cdot): [0, 1] \to [0, 1]$ is defined as $A(u_i, t) = 
A(u_i)(t)$. 


6.3.3 A First Definition of Operations 
Between Type-2 Fuzzy Sets: 
Lattice-Based Approach 

With Definition 6.10, if we have two type-2 fuzzy sets 

$A = \{(u_i, A(u_i, t)) \mid u_i \in U, t \in [0, 1]\}$ 
and 
$B = \{(u_i, B(u_i, t)) \mid u_i \in U, t \in [0, 1]\}$ 

we have (Fig. 6.3) 

$A \cup_{T2FS} B = \{(u_i, A \cup B(u_i, t)) \mid u_i \in U, t \in [0, 1]\}$, 
where, for each $u_i \in U$ and each $t \in [0, 1]$, we have 


$A \cup B(u_i, t) = \max(A(u_i, t), B(u_i, t))$ 
$= \max(A(u_i)(t), B(u_i)(t))$. (6.3) 

Analogously, 

$A \cap_{T2FS} B = \{(u_i, A \cap B(u_i, t)) \mid u_i \in U, t \in [0, 1]\}$, 
where, for each $u_i \in U$ and each $t \in [0, 1]$, we have 

$A \cap B(u_i, t) = \min(A(u_i, t), B(u_i, t))$ 
$= \min(A(u_i)(t), B(u_i)(t))$. (6.4) 

Observe that this notation is very similar to that 
proposed by Deschrijver and Kerre [6.24, 25]. 


6.3.4 Problems with the Lattice-Based 
Definitions. Operations Based 
on Zadeh's Extension Principle 


Although meaningful from a mathematical point of 
view, as pointed out by Dubois and Prade in [6.26], 
from these definitions we do not recover the usual ones 
for fuzzy sets. To see it, just consider a finite referential 
set $U = \{u_1, u_2, u_3\}$ with three elements, and consider 
the following two fuzzy sets over U. We use the notation 
of (6.2) for the sake of brevity. 

$A = \{(u_1, \tfrac{1}{2}), (u_2, \tfrac{1}{3}), (u_3, 1)\}$ 
and 
$B = \{(u_1, \tfrac{1}{4}), (u_2, \ldots), (u_3, \ldots)\}$. 

Then we have, for instance, 

$A \cup B = \{(u_1, \tfrac{1}{2}), (u_2, \ldots), (u_3, 1)\}$. 

On the other hand, we can also see A and B as type-2 
fuzzy sets, that we denote by $A_{T2}$ and $B_{T2}$, respectively, 
just taking 

$A_{T2}(u_1)(t) = 1$ if $t = \tfrac{1}{2}$, and 0 otherwise, 
$A_{T2}(u_2)(t) = 1$ if $t = \tfrac{1}{3}$, and 0 otherwise, 
$A_{T2}(u_3)(t) = 1$ if $t = 1$, and 0 otherwise, 

and 

$B_{T2}(u_1)(t) = 1$ if $t = \tfrac{1}{4}$, and 0 otherwise, 
$B_{T2}(u_2)(t) = 1$ if $t = \ldots$, and 0 otherwise, 
$B_{T2}(u_3)(t) = 1$ if $t = \ldots$, and 0 otherwise. 

Then we have 

$A_{T2} \cup_{T2FS} B_{T2}(u_1)(t) = 1$ if $t = \tfrac{1}{2}$ or $t = \tfrac{1}{4}$, and 0 otherwise, 
$A_{T2} \cup_{T2FS} B_{T2}(u_2)(t) = 1$ if $t = \tfrac{1}{3}$ or $t = \ldots$, and 0 otherwise, 

and 

$A_{T2} \cup_{T2FS} B_{T2}(u_3)(t) = 1$ if $t = \ldots$ or $t = 1$, and 0 otherwise, 

which does not coincide with our previous result. Moreover, 
observe that we do not even recover a fuzzy set but 
a true type-2 fuzzy set. 
In order to solve this problem, several authors [6.19, 
22, 26] proposed the following definitions of the operations 
of union and intersection. 
6.3.5 Second Definition of the Operations: 
Zadeh's Extension Principle Approach 


Definition 6.11 
Given two type-2 fuzzy sets 


$A = \{(u_i, A(u_i, t)) \mid u_i \in U, t \in [0, 1]\}$ 
and 

$B = \{(u_i, B(u_i, t)) \mid u_i \in U, t \in [0, 1]\}$ 
we can define (Fig. 6.4) 

$A \sqcap B = \{(u_i, A \sqcap B(u_i, t)) \mid u_i \in U, t \in [0, 1]\}$ 
with 


$A \sqcap B(u_i, t) = \sup_{\min(z, w) = t} \min(A(u_i, z), B(u_i, w))$ 


and 
$A \sqcup B = \{(u_i, A \sqcup B(u_i, t)) \mid u_i \in U, t \in [0, 1]\}$ 
with 

$A \sqcup B(u_i, t) = \sup_{\max(z, w) = t} \min(A(u_i, z), B(u_i, w))$. 

For instance, let us recover our previous example. 
Consider the type-2 fuzzy sets $A_{T2}$ and $B_{T2}$. Then we 
have that 

$A_{T2} \sqcup B_{T2}(u_1, t) = 0$ if $t \notin \{\tfrac{1}{4}, \tfrac{1}{2}\}$, and 
$A_{T2} \sqcup B_{T2}(u_1, t) = \sup_{\max(z, w) = t} \min(A_{T2}(u_1, z), B_{T2}(u_1, w))$ 
in other case. 


But if $t = \tfrac{1}{4}$ then, as $\tfrac{1}{2} > \tfrac{1}{4}$ and since $A_{T2}(u_1, z) = 0$ for 
all $z < \tfrac{1}{2}$, it follows that $\min(A_{T2}(u_1, z), B_{T2}(u_1, w)) = 
0$ whenever $\max(z, w) = \tfrac{1}{4}$. Finally, if $t = \tfrac{1}{2}$, then 
$\min(A_{T2}(u_1, \tfrac{1}{2}), B_{T2}(u_1, \tfrac{1}{4})) = 1$, so we finally arrive at 


$A_{T2} \sqcup B_{T2}(u_1, t) = 0$ if $t \neq \tfrac{1}{2}$, and 
$A_{T2} \sqcup B_{T2}(u_1, t) = 1$ if $t = \tfrac{1}{2}$. 

Since for $u_2$ and $u_3$ the same arguments work, we see 
that we indeed recover the fuzzy case. In particular, with 
respect to these new operations, we have the following 
result [6.21]. 

Proposition 6.3 
Let U be a referential set. $(T2FS(U), \sqcup, \sqcap)$ is not a lattice. 


In fact, the problem is that the absorption laws 


$A \sqcap (A \sqcup B) = A$ 
and 
$A \sqcup (A \sqcap B) = A$ 

do not hold. Nevertheless, it is also possible to provide 
a positive result [6.21]. 

Fig. 6.3a-c Two different type-2 fuzzy sets (a), $A \cup_{T2FS} B$ (b), 
and $A \cap_{T2FS} B$ (c) 

Fig. 6.4 Example of intersection of two membership sets $A(u_i, t)$ 
and $B(u_i, t)$, with $A \sqcap B(u_i, t) = \sup_{\min(z, w) = t} \min(A(u_i, z), B(u_i, w))$. 
Green line is the set obtained 

Proposition 6.4 
Let U be a referential set. Then for any $A, B, C \in 
T2FS(U)$ the following properties hold: 
i) $A \sqcup A = A$ and $A \sqcap A = A$; 
ii) $A \sqcup B = B \sqcup A$ and $A \sqcap B = B \sqcap A$; 
iii) $A \sqcup (B \sqcup C) = (A \sqcup B) \sqcup C$. 

That is, $(T2FS(U), \sqcup, \sqcap)$ is a bisemilattice. 

Remark 6.1 

We should remark the following: 
1. If we work with the operations defined in Eqs. (6.3) 
and (6.4), and consider fuzzy sets as particular 
instances of type-2 fuzzy sets, then we do 
not recover the classical operations defined by 
Zadeh. 
2. If we use the operations in Definition 6.11, then 
we recover Zadeh’s classical operations for fuzzy 
sets, but we do not have a lattice structure. This 
fact makes the use of type-2 fuzzy sets in many 
applications, such as decision making, very 
complicated. 
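The contrast between the two families of operations can be checked numerically. The following Python sketch is only an illustration under stated assumptions: the interval [0, 1] is replaced by a five-point grid T (so that each supremum becomes a maximum), a secondary membership function is stored as a dictionary from grid points to degrees, and the helper names as well as the degrees 1/2 and 1/4 are hypothetical choices made for this example.

```python
from fractions import Fraction
from itertools import product

# Assumed discretization of [0, 1]: the grid 0, 1/4, 1/2, 3/4, 1.
T = [Fraction(k, 4) for k in range(5)]


def singleton(value):
    """Embed a type-1 degree: membership 1 at `value`, 0 elsewhere on T."""
    return {t: (1 if t == value else 0) for t in T}


def lattice_union(f, g):
    """Lattice-based union, as in (6.3): pointwise maximum in t."""
    return {t: max(f[t], g[t]) for t in T}


def lattice_intersection(f, g):
    """Lattice-based intersection, as in (6.4): pointwise minimum in t."""
    return {t: min(f[t], g[t]) for t in T}


def extend(f, g, op):
    """Extension principle on the grid:
    h(t) = max over all (z, w) with op(z, w) == t of min(f(z), g(w))."""
    h = {t: 0 for t in T}
    for z, w in product(T, T):
        t = op(z, w)
        h[t] = max(h[t], min(f[z], g[w]))
    return h


def ext_union(f, g):
    return extend(f, g, max)


def ext_intersection(f, g):
    return extend(f, g, min)


# Two type-1 degrees at one element, seen as secondary memberships.
a = singleton(Fraction(1, 2))
b = singleton(Fraction(1, 4))

# The lattice-based union is 1 at both 1/4 and 1/2: not a type-1 degree.
print(lattice_union(a, b) == singleton(Fraction(1, 2)))
# The extension-principle union gives the singleton at max(1/2, 1/4).
print(ext_union(a, b) == singleton(Fraction(1, 2)))

# Absorption fails for the extension-principle operations, for instance
# with the constant-one and constant-zero secondary memberships.
ones = {t: 1 for t in T}
zeros = {t: 0 for t in T}
print(ext_intersection(ones, ext_union(ones, zeros)) == ones)
```

The three printed comparisons evaluate to False, True and False. The pair of constant functions used for the absorption check is just one convenient counterexample on this grid.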


Obviously, an interesting problem is to analyze 
which further conclusions and results can be obtained 
from this new formulation of the operations between 
type-2 fuzzy sets. 


6.3.6 About Computational Efficiency 


Note also that although the computational complexity 
and the time cost of working with type-2 fuzzy sets are not as 
high as they used to be a few years ago, it is clear that the use 
of these kinds of sets introduces additional complexity 
in any given problem. For this reason, in many 
applications the possible improvement of the results is not 
large enough to justify replacing type-1 fuzzy sets by 
type-2 fuzzy sets. 

On the other hand, we can also define type-3 fuzzy 
sets as those fuzzy sets whose membership of each element 
is given by a type-2 fuzzy set [6.27]. Even more, 
it is possible to define recursively type-n fuzzy sets as 
those fuzzy sets whose membership values are 
type-(n-1) fuzzy sets. The computational efficiency of these 
sets decreases as the complexity level of the construction 
increases. From a theoretical point of view, we consider 
that it is necessary to carry out a complete analysis of 
the structures and operations of type-n fuzzy sets. But up to 
now no application has been developed on the basis of 
type-n fuzzy sets. 


6.3.7 Applications 


It is worth mentioning the works by Mendel in computing 
with words and perceptual computing [6.28-31], of 
Hagras [6.32,33], of Sepulveda et al. [6.34] in control, 
of Xia et al. in mobiles [6.35] and of Wang in neural 
networks [6.36]. We will see in the next section that 
the advantage of using these kinds of sets versus usual 
fuzzy sets has been shown only for a particular type 
of them, namely, the so-called interval-valued fuzzy 
sets. 


6.4 Interval-Valued Fuzzy Sets 


These sets were introduced in the 1970s. In May 1975, 
Sambuc [6.37] presented, in his doctoral thesis, the 
concept of an interval-valued fuzzy set named a 
Φ-fuzzy set. That same year, Jahn [6.38] wrote about 
these sets. One year later, Grattan-Guinness [6.39] 
established a definition of an interval-valued membership 
function. In that decade interval-valued fuzzy sets 
appeared in the literature in various guises and it was not 
until the 1980s, with the work of Gorzalczany and 
Türksen [6.40-45], that the importance of these sets, as well 
as their name, was definitely established. 

Let us denote by L([0, 1]) the set of all closed subintervals 
in [0, 1], that is, 


$L([0, 1]) = \{\mathbf{x} = [\underline{x}, \overline{x}] \mid (\underline{x}, \overline{x}) \in [0, 1]^2$ and $\underline{x} \leq \overline{x}\}$. (6.5) 


Definition 6.12 
An interval-valued fuzzy set (or interval type-2 fuzzy 
set) A on the universe $U \neq \emptyset$ is a mapping 


$A: U \to L([0, 1])$, 


such that the membership degree of $u \in U$ is given 
by $A(u) = [\underline{A}(u), \overline{A}(u)] \in L([0, 1])$, where $\underline{A}: U \to [0, 1]$ 
and $\overline{A}: U \to [0, 1]$ are mappings defining the lower and 
the upper bounds of the membership interval $A(u)$, 
respectively (Fig. 6.5). 


From Definition 6.12, it is clear that for these sets 
the membership degree of each element $u_i \in U$ to A is 
given by a closed subinterval in [0, 1]; that is, $A(u_i) = 
[\underline{A}(u_i), \overline{A}(u_i)]$. Obviously, if for every $u_i \in U$ we have 
$\underline{A}(u_i) = \overline{A}(u_i)$, then the considered set is a fuzzy set. So 
fuzzy sets are particular cases of interval-valued fuzzy 
sets. 


Fig. 6.5 Example of interval valued fuzzy set 

In 1989, Deng [6.46] presented the concept of Grey 
sets. Later Dubois proved that these are also interval- 
valued fuzzy sets. 

We denote by IVFS(U) the class of all interval-valued 
fuzzy sets over U; that is, $IVFS(U) = L([0, 1])^U$. 
From Zadeh’s definitions of union and intersections, 
Sambuc proposed the following definition: 


Definition 6.13 
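Given $A, B \in IVFS(U)$, 
$A \cup_{L([0,1])} B(u_i) = [\max(\underline{A}(u_i), \underline{B}(u_i)), 
\max(\overline{A}(u_i), \overline{B}(u_i))]$, 
$A \cap_{L([0,1])} B(u_i) = [\min(\underline{A}(u_i), \underline{B}(u_i)), 
\min(\overline{A}(u_i), \overline{B}(u_i))]$. 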


These operations can be generalized by the use of 
the widely analyzed concepts of IV t-conorm and IV 
t-norm [6.47-49]. 


Corollary 6.2 
Interval valued fuzzy sets are a particular case of 
L-fuzzy sets. 


Proof: Just note that L([0, 1]) with the operations in 
Definition 6.13 is a lattice. 


Proposition 6.5 
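The set $(IVFS(U), \cup_{L([0,1])}, \cap_{L([0,1])})$ is a bounded 
lattice, where the order is defined as 


$A \leq_{IVFS(U)} B$ if and only if $A \cup_{L([0,1])} B = B$ 


or equivalently 
$A \leq_{IVFS(U)} B$ if and only if $A \cap_{L([0,1])} B = A$. 
That is, 


$A \leq_{IVFS(U)} B$ if and only if 
$\max(\underline{A}(u_i), \underline{B}(u_i)) = \underline{B}(u_i)$ and 
$\max(\overline{A}(u_i), \overline{B}(u_i)) = \overline{B}(u_i)$ 


for all $u_i \in U$, or equivalently 


$A \leq_{IVFS(U)} B$ if and only if 
$\min(\underline{A}(u_i), \underline{B}(u_i)) = \underline{A}(u_i)$ and 
$\min(\overline{A}(u_i), \overline{B}(u_i)) = \overline{A}(u_i)$ 


for all $u_i \in U$. 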


From Proposition 6.5, we deduce that the order 
$A \leq_{IVFS(U)} B$ if and only if $\underline{A}(u_i) \leq \underline{B}(u_i)$ and $\overline{A}(u_i) \leq 
\overline{B}(u_i)$ for all $u_i \in U$ is not linear. The use of these 
sets in decision making has led several authors to 
consider the problem of defining total orders between 
intervals [6.50]. In this sense, in [6.51] a construction 
method for such orders by means of aggregation func- 
tions can be found. 
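The operations of Definition 6.13 and the resulting partial order can be sketched for single membership intervals. In the Python example below, an interval is represented as a pair (lower, upper); this representation, the function names and the sample intervals are assumptions made for illustration only.

```python
def iv_union(x, y):
    """Union of two membership intervals: maximum of each bound."""
    return (max(x[0], y[0]), max(x[1], y[1]))


def iv_intersection(x, y):
    """Intersection of two membership intervals: minimum of each bound."""
    return (min(x[0], y[0]), min(x[1], y[1]))


def iv_leq(x, y):
    """Partial order: both bounds of x are below the bounds of y."""
    return x[0] <= y[0] and x[1] <= y[1]


x = (0.2, 0.8)
y = (0.4, 0.6)

print(iv_union(x, y), iv_intersection(x, y))
# Neither interval is below the other, so the order is not linear.
print(iv_leq(x, y), iv_leq(y, x))
```

Here y is nested inside x, which is the typical situation in which two intervals are incomparable and a total order has to be chosen by other means.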


6.4.1 Two Interpretations 
of Interval-Valued Fuzzy Sets 

From our point of view, interval-valued fuzzy sets can 
be understood in two different ways [6.52]: 


1. The membership degree of an element to the set 
is a value that belongs to the considered interval. 
The interval representation is used since we can- 
not say precisely which that number is. For this 
reason, we provide bounds for that number. We 
think this is the correct interpretation for these 
sets. 

2. The membership degree of each element is the 
whole closed subinterval provided as membership. 
From a mathematical point of view, this interpre- 
tation is very interesting, but, in our opinion, it is 
very difficult to understand it in the applied field. 
Moreover, in this case, we find the following para- 
dox [6.53]: 

For fuzzy sets and with the standard negation it 
holds that $\min(A(u_i), 1 - A(u_i)) \leq 0.5$ for all $u_i \in U$. 
But for interval-valued fuzzy sets, if we use the standard 
negation $N(A(u_i)) = [1 - \overline{A}(u_i), 1 - \underline{A}(u_i)]$, we 
have that there is no equivalent bound for 


$\min([\underline{A}(u_i), \overline{A}(u_i)], [1 - \overline{A}(u_i), 1 - \underline{A}(u_i)])$. 


Fig. 6.6 Construction of type-2 fuzzy sets from interval-valued 
fuzzy sets 
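A small numerical check makes the paradox concrete. The Python sketch below assumes that the minimum of two intervals is taken bound by bound, as in the intersection of Definition 6.13, and that an interval is stored as a pair (lower, upper); the function names and the chosen interval are illustrative assumptions.

```python
def iv_neg(x):
    """Standard negation of an interval: [1 - upper, 1 - lower]."""
    return (1 - x[1], 1 - x[0])


def iv_min(x, y):
    """Minimum of two intervals, taken bound by bound."""
    return (min(x[0], y[0]), min(x[1], y[1]))


# Type-1 case: min(a, 1 - a) never exceeds 0.5.
type1_bounded = all(min(k / 10, 1 - k / 10) <= 0.5 for k in range(11))

# Interval case: the widest interval is its own negation.
x = (0.0, 1.0)
result = iv_min(x, iv_neg(x))

print(type1_bounded, result)
```

For the interval [0, 1] the result is again [0, 1], whose upper bound is 1, so no bound analogous to 0.5 can hold when the whole interval is read as the membership degree.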


6.4.2 Shadowed Sets Are a Particular Case 
of Interval-Valued Fuzzy Sets 


The so-called shadowed sets were suggested by 
Pedrycz [6.54] and developed later together with 
Vukovic [6.55,56]. A shadowed set B induced by 
a given fuzzy set A defined in U is an interval-valued 
fuzzy set in U that maps elements of U into 0, 1, 
and the unit interval [0, 1], i.e., B is a mapping 
$B: U \to \{0, 1, [0, 1]\}$, where 0, 1, [0, 1] denote complete 
exclusion from B, complete inclusion in B, and complete 
ignorance, respectively. Shadowed sets are isomorphic 
with a three-valued logic. 


6.4.3 Interval-Valued Fuzzy Sets 
Are a Particular Case of Type-2 
Fuzzy Sets 


In 1995, Klir and Yuan proved in [6.27] that from an 
interval-valued fuzzy set, we can build a type-2 fuzzy 
set as pointed out in Fig. 6.6. 

Later, in 2007, Deschrijver and Kerre [6.24, 25] and 
Mendel [6.57] proved that interval-valued fuzzy sets 
are particular cases of type-2 fuzzy sets. 


6.4.4 Some Problems 
with Interval-Valued Fuzzy Sets 


1. Taking into account the definition of interval-valued 
fuzzy sets, we follow Gorzalczany [6.41] and define 
the compatibility degree between two interval-valued 
fuzzy sets as an element in L([0, 1]). The 
other information measures [6.58-62] (interval-valued 
entropy, interval-valued similarity, etc.) 
should also be given by an interval. However, in 
most of the works about these measures, the results 
are given by a number, and not by an interval. This 
consideration leads us to state that, from a theoretical 
point of view, we should distinguish between 
two different types of information measures: those 
which give rise to a number and those which give 
rise to an interval. Obviously, the problem of inter- 
preting both types of measures arises. Moreover, if 
the result of the measure is an interval, we should 
consider its amplitude as a measure of the lack of 
knowledge [6.63] linked to the considered measure. 
2. In [6.57], Mendel writes: 


It turns out that an interval type-2 fuzzy set is the 
same as an interval-valued fuzzy set for which there 
is a very extensive literature. These two seemingly 
different kinds of fuzzy sets were historically ap- 
proached from very different starting points, which 
as we shall explain next has turned out to be a very 
good thing. 


Nonetheless, we consider that interval-valued fuzzy 
sets are a particular case of interval type-2 fuzzy sets 
and therefore they are not the same thing. 

3. Due to the current characteristics of computers, we 
can say that the computational cost of working with 
these sets is not much higher than the cost of work- 
ing with type-1 fuzzy sets [6.64]. 

4. We have already said that the commonly used 
order is not linear. This is a problem for some 
applications, such as decision making. In [6.65], 
it is shown that the choice of the order should 
depend on the considered application. Often ex- 
perts do not have enough information to choose 
a total order. This is a big problem since the 
choice of the order influences strongly the final 
outcome. 


6.4.5 Applications 


We can say that there already exist applications of 
interval-valued fuzzy sets that provide results which are 
better than those obtained with fuzzy sets. For instance: 


1. In classification problems. Specifically, in [6.66-69] 
a methodology to enhance the performance of 
fuzzy rule-based classification systems (FRBCSs) 
is presented. The methodology used in these papers 
has the following structure: 

1) An initial FRBCS is generated by using a fuzzy 
rule learning algorithm. 

2) The linguistic labels of the learned fuzzy rules 
are modeled with interval-valued fuzzy sets in 
order to take into account the ignorance degree 
associated with the assignment of a number as 
the membership degree of the elements to the 
sets. These sets are constructed starting from the 
fuzzy sets used in the learning process and their 
shape is determined by the value of one or two 
parameters. 

3) The fuzzy reasoning method is extended so as 
to take into account the ignorance represented 
by the interval-valued fuzzy sets throughout the 
inference process. 

4) The values of the system’s parameters, for in- 
stance the ones determining the shape of the 
interval-valued fuzzy sets, are tuned applying 
evolutionary algorithms. See [6.66-69] for de- 
tails about the specific features of each proposal. 

The methodology statistically outperforms the 
following approaches: 

a) In [6.66], the performance of the initial FR- 
BCS generated by the Chi et al. algorithm [6.70] 
and the fuzzy hybrid genetics-based machine 
learning method [6.71] are outperformed. In ad- 
dition, the results of the GAGRAD (genetic 
algorithm gradient) approach [6.72] are notably 
improved. 

b) A new tuning approach is defined in [6.67], 
where the results obtained by the tuning of the 
lateral position of the linguistic labels ([6.73]) 
and the performance provided by the tuning 
approach based on the linguistic 3-tuples repre- 
sentation [6.74] are outperformed. 

c) Fuzzy decision trees (FDTs) are used as the
learning method in [6.68]. In this contribution,
numerous decision trees are enhanced, including
crisp decision trees, FDTs, and FDTs
constructed using genetic algorithms. For instance,
both the well-known C4.5 decision tree [6.75]
and the fuzzy decision tree proposed by
Janikow [6.76] are outperformed.

d) The proposal presented in [6.69] is the most
remarkable one, since it outperforms
two state-of-the-art fuzzy classifiers, namely,
the FARC-HD method [6.77] and the unordered
fuzzy rule induction algorithm (FURIA) [6.78].
Furthermore, it also outperforms the fuzzy
counterpart of the presented approach.

2. Image processing. In [6.63,79-85], it has been
shown that if we use interval-valued fuzzy sets to
represent those areas of an image for which the
experts have difficulty building the fuzzy membership
degrees, then the results of edge detection,
segmentation, etc., are much better.

3. In some decision-making problems, it has also been 
shown that the results obtained with interval-valued 
fuzzy sets are better than the ones obtained with 
fuzzy sets [6.86]. They have also been used in 
Web problems [6.87], pattern recognition [6.88], 
medicine [6.89], etc., see also [6.90, 91]. 


Construction of Interval-Valued Fuzzy Sets 

In many cases, it is easier for experts to give the mem- 
bership degrees by means of numbers instead of by 
means of intervals. In this case it may happen that the 
obtained results are not the best ones. If this is so, we 
should build intervals from the numerical values pro- 
vided by the experts. For this reason, we study methods 
to build intervals from real numbers. For any such meth- 
ods, we require the following: 


i) The numerical value provided by the expert should 
be interior to the considered interval. We require 
this property since we assume that the membership 
degree for the expert is a number but he or she is 
not able to fix it exactly so he or she provides two 
bounds for it. 

ii) The length of the built interval represents the
degree of ignorance of the expert in fixing the
numerical value he or she has provided.


The previous considerations led us to define
in [6.63] the concept of ignorance degree $G_i$ associated
with the value given by an expert. In that definition, it
is established that if the degree of membership given by the
expert is equal to 0 or 1, then the ignorance is equal to
0, since the expert is sure that the element belongs
or does not belong to the considered set. However,
if the provided membership degree is equal to 0.5, then
the ignorance is maximal, since the expert does not know
at all whether or not the element belongs to the set.
Different considerations and construction methods for
such ignorance functions using overlap functions can
be found in [6.92].

Taking into account the previous argumentation, in
Fig. 6.7 we show the schema of construction of an
interval from a membership degree $\mu$ given by the expert
and from an ignorance function $G_i$ chosen for the
considered problem [6.63].


Fig. 6.7 Construction with ignorance functions (the interval built from the membership function of the fuzzy set has length $G(\mu(x))$)
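
The construction just described can be illustrated with a minimal Python sketch. The particular ignorance function `ignorance` (a tent shape that is 0 at 0 and 1 and maximal at 0.5) and the choice of centring the interval on the expert's value are our own assumptions for illustration; they are not the functions defined in [6.63].

```python
from typing import Tuple


def ignorance(mu: float) -> float:
    """Hypothetical ignorance function (an assumption, not the one in [6.63]).

    It vanishes when mu is 0 or 1 and is maximal (equal to 1) at mu = 0.5.
    """
    return 2.0 * min(mu, 1.0 - mu)


def interval_from_ignorance(mu: float) -> Tuple[float, float]:
    """Build an interval of length ignorance(mu) centred on mu.

    Centring is an illustrative choice; it keeps the interval inside [0, 1]
    and keeps mu inside the interval, as required by condition i).
    """
    if not 0.0 <= mu <= 1.0:
        raise ValueError("membership degree must lie in [0, 1]")
    half = ignorance(mu) / 2.0
    return (mu - half, mu + half)


if __name__ == "__main__":
    for mu in (0.0, 0.25, 0.5, 1.0):
        print(mu, interval_from_ignorance(mu))
```

With this choice, a fully certain expert (0 or 1) yields a degenerate interval, while a membership degree of 0.5 yields the whole unit interval.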

There exist other methods for constructing interval-valued
fuzzy sets. The choice of the method depends
on the application at hand. One of the
most widely used methods in magnetic resonance image
processing (for fuzzy theory) is the following: several
doctors are asked to build, for a specific region
of an image, a fuzzy set representing that region. In
the end, we have several fuzzy sets, and with
them we build an interval-valued fuzzy set as follows.
For each element's membership, we take as the lower
bound the minimum of the values provided by the
doctors, and as the upper bound, the maximum. This
method has proved very useful for particular
images [6.83]. In Fig. 6.8, we represent the proposed
construction.
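
The minimum/maximum construction from several experts can be sketched in a few lines of Python. The function name `ivfs_from_experts` and the dictionary representation of a fuzzy set (element to membership degree) are assumptions made only for this illustration.

```python
from typing import Dict, List, Tuple


def ivfs_from_experts(
    expert_sets: List[Dict[str, float]]
) -> Dict[str, Tuple[float, float]]:
    """Combine the fuzzy sets given by several experts into one
    interval-valued fuzzy set: lower bound = minimum of the expert values,
    upper bound = maximum of the expert values.
    """
    if not expert_sets:
        raise ValueError("at least one expert is needed")
    elements = expert_sets[0].keys()
    return {
        u: (min(fs[u] for fs in expert_sets), max(fs[u] for fs in expert_sets))
        for u in elements
    }


if __name__ == "__main__":
    # Three hypothetical experts rating two pixels of a region.
    experts = [
        {"p1": 0.5, "p2": 0.75},
        {"p1": 0.25, "p2": 1.0},
        {"p1": 0.5, "p2": 0.5},
    ]
    print(ivfs_from_experts(experts))
```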


Fig. 6.8 Construction with different experts


In [6.63], it is shown that for some specific ultrasound
images, if we use fuzzy theory to obtain the
objects in the image, results are worse than if we use
interval-valued fuzzy sets built with the method proposed
by Tizhoosh in [6.84]. This method consists of the
following (see Fig. 6.9): from the numerical membership
degree $\mu_A$ given by the expert and from a numerical
coefficient $\alpha > 1$, associated with the doubt of the expert
when he or she constructs $\mu_A$, we generate the
membership interval


$$\left[ \mu_A^{\alpha}(u),\; \mu_A^{1/\alpha}(u) \right].$$
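
As a minimal sketch of this construction, and assuming that the membership interval has the usual form $[\mu^{\alpha}, \mu^{1/\alpha}]$ with $\alpha > 1$, the bounds can be computed as follows. The function name `tizhoosh_interval` is hypothetical.

```python
from typing import Tuple


def tizhoosh_interval(mu: float, alpha: float) -> Tuple[float, float]:
    """Return the interval [mu**alpha, mu**(1/alpha)] for alpha > 1.

    Assumption: this is the form of the interval in Tizhoosh's construction.
    For mu in [0, 1] and alpha > 1, the lower bound is at most mu and the
    upper bound is at least mu.
    """
    if not 0.0 <= mu <= 1.0:
        raise ValueError("membership degree must lie in [0, 1]")
    if alpha <= 1.0:
        raise ValueError("alpha must be greater than 1")
    return (mu ** alpha, mu ** (1.0 / alpha))


if __name__ == "__main__":
    print(tizhoosh_interval(0.25, 2.0))
```

Larger values of the coefficient widen the interval, which matches its reading as the doubt of the expert; the degrees 0 and 1 always give degenerate intervals.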


Fig. 6.9 Tizhoosh's construction (a fuzzy set and the interval-valued fuzzy set built from it; the membership function lies between the lower limit and the upper limit)


6.5 Atanassov's Intuitionistic Fuzzy Sets or Bipolar Fuzzy Sets of Type 2 or IF Fuzzy Sets


In 1983, Atanassov presented his definition of intu- 
itionistic fuzzy set [6.93]. This paper was written in 
Bulgarian, and in 1986 he presented his ideas in English 
in [6.94]. 


Definition 6.14
An intuitionistic fuzzy set over $U$ is an expression $A$
given by

$$A = \{ (u_i, \mu_A(u_i), \nu_A(u_i)) \mid u_i \in U \},$$

where $\mu_A : U \to [0,1]$ and $\nu_A : U \to [0,1]$
are such that $0 \le \mu_A(u_i) + \nu_A(u_i) \le 1$ for every $u_i \in U$.


Atanassov also introduced the following two essential
characteristics of these sets:


1. The complementary of
$A = \{ (u_i, \mu_A(u_i), \nu_A(u_i)) \mid u_i \in U \}$
is
$A_c = \{ (u_i, \nu_A(u_i), \mu_A(u_i)) \mid u_i \in U \}$.


2. For each $u_i \in U$, the intuitionistic or hesitance index
of such element in the considered set $A$ is given by

$$\pi_A(u_i) = 1 - \mu_A(u_i) - \nu_A(u_i).$$


$\pi_A(u_i)$ is a measure of the hesitance of the expert to
assign a numerical value to $\mu_A(u_i)$ and $\nu_A(u_i)$. For this
reason, we consider that these sets are an extension of
fuzzy sets. It is clear that if for each $u_i \in U$ we take
$\nu_A(u_i) = 1 - \mu_A(u_i)$, then the considered set $A$ is a fuzzy
set in Zadeh's sense. So fuzzy sets are a particular case
of those defined by Atanassov.
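
The definition, the complement, and the hesitance index can be made concrete with a short Python sketch. The class name `IFValue` and the helper names are our own; the sketch only encodes the constraint on the sum of the membership and nonmembership degrees and the two characteristics listed above.

```python
from typing import NamedTuple


class IFValue(NamedTuple):
    """Membership (mu) and nonmembership (nu) degrees of one element."""
    mu: float
    nu: float


def make_if_value(mu: float, nu: float) -> IFValue:
    """Validate 0 <= mu, nu and mu + nu <= 1 before building the pair."""
    if mu < 0.0 or nu < 0.0 or mu + nu > 1.0:
        raise ValueError("need mu >= 0, nu >= 0 and mu + nu <= 1")
    return IFValue(mu, nu)


def hesitance(v: IFValue) -> float:
    """Intuitionistic (hesitance) index: 1 - mu - nu."""
    return 1.0 - v.mu - v.nu


def complement(v: IFValue) -> IFValue:
    """The complement swaps membership and nonmembership."""
    return IFValue(v.nu, v.mu)


if __name__ == "__main__":
    v = make_if_value(0.5, 0.25)
    print(hesitance(v), complement(v))
    # A Zadeh fuzzy value is the special case nu = 1 - mu (no hesitance).
    print(hesitance(make_if_value(0.75, 0.25)))
```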

In 1993, Gau and Buehrer [6.95] introduced the concept
of vague set, and later, in 1994, it was shown that
these are the same as those introduced by Atanassov in
1983 [6.96].

We denote by A-IFS(U) the class of all intuitionistic
sets (in the sense of Atanassov) defined over
the referential $U$. Atanassov also gave the following
definition:


Definition 6.15
Given $A, B \in$ A-IFS(U),

$$A \cup_{A\text{-}IFS} B = \{ (u_i, \max(\mu_A(u_i), \mu_B(u_i)), \min(\nu_A(u_i), \nu_B(u_i))) \mid u_i \in U \},$$

$$A \cap_{A\text{-}IFS} B = \{ (u_i, \min(\mu_A(u_i), \mu_B(u_i)), \max(\nu_A(u_i), \nu_B(u_i))) \mid u_i \in U \}.$$


Definitions of connectives for Atanassov’s sets in 
terms of t-norms, etc. can be found in [6.49, 97]. 
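
The max/min union and intersection of Definition 6.15 can be sketched in Python as follows. We assume, for illustration only, that a set is stored as a dictionary mapping each element to a pair (membership, nonmembership) and that both sets share the same referential; the function names are hypothetical.

```python
from typing import Dict, Tuple

IFSet = Dict[str, Tuple[float, float]]  # element -> (mu, nu)


def if_union(a: IFSet, b: IFSet) -> IFSet:
    """Union: maximum of memberships, minimum of nonmemberships."""
    return {u: (max(a[u][0], b[u][0]), min(a[u][1], b[u][1])) for u in a}


def if_intersection(a: IFSet, b: IFSet) -> IFSet:
    """Intersection: minimum of memberships, maximum of nonmemberships."""
    return {u: (min(a[u][0], b[u][0]), max(a[u][1], b[u][1])) for u in a}


if __name__ == "__main__":
    A = {"u1": (0.5, 0.25), "u2": (0.25, 0.5)}
    B = {"u1": (0.25, 0.5), "u2": (0.5, 0.125)}
    print(if_union(A, B))
    print(if_intersection(A, B))
```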


Corollary 6.3
Atanassov's intuitionistic fuzzy sets are a particular
case of L-fuzzy sets.


Proof: Just note that $L = \{ (x_1, x_2) \mid x_1 + x_2 \le 1 \text{ with } x_1, x_2 \in [0,1] \}$
with the operations in Definition 6.15 is
a lattice. $\square$


Proposition 6.6
The set $(A\text{-}IFS(U), \cup_{A\text{-}IFS}, \cap_{A\text{-}IFS})$ is a bounded
lattice, where the order is defined as

$$A \le_{A\text{-}IFS} B \text{ if and only if } A \cup_{A\text{-}IFS} B = B$$

or equivalently

$$A \le_{A\text{-}IFS} B \text{ if and only if } A \cap_{A\text{-}IFS} B = A.$$


From Proposition 6.6, we see that the order

$$A \le_{A\text{-}IFS} B \text{ if and only if } \mu_A(u_i) \le \mu_B(u_i) \text{ and } \nu_A(u_i) \ge \nu_B(u_i) \text{ for all } u_i \in U$$

is not linear. Different methods to get linear orders
for these sets can be found in [6.50, 51].
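
That this order is not linear can be checked on a single element with a small Python sketch. The function name `if_leq` and the two sample pairs are illustrative assumptions; each pair is (membership, nonmembership).

```python
from typing import Tuple

IFPair = Tuple[float, float]  # (mu, nu) with mu + nu <= 1


def if_leq(x: IFPair, y: IFPair) -> bool:
    """x <= y iff membership does not decrease and nonmembership does not increase."""
    return x[0] <= y[0] and x[1] >= y[1]


if __name__ == "__main__":
    x = (0.5, 0.25)
    y = (0.25, 0.125)
    # Neither x <= y nor y <= x holds, so x and y are incomparable.
    print(if_leq(x, y), if_leq(y, x))
```

Here x has the larger membership but also the larger nonmembership, so neither pair dominates the other.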


6.5.1 Relation Between Interval-Valued
Fuzzy Sets and Atanassov's
Intuitionistic Fuzzy Sets:
Two Different Concepts


In 1989, Atanassov and Gargov [6.98] and later
Deschrijver and Kerre [6.24] proved that from an
interval-valued fuzzy set we can build an intuitionistic fuzzy set
and vice versa.

Theorem 6.2
The mapping

$$\Phi : IVFS(U) \to A\text{-}IFS(U), \quad A \mapsto A',$$

where $A' = \{ (u_i, \underline{A}(u_i), 1 - \overline{A}(u_i)) \mid u_i \in U \}$,
with $\underline{A}(u_i)$ and $\overline{A}(u_i)$ the lower and upper bounds of the interval $A(u_i)$,
is a bijection.
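
The mapping of Theorem 6.2 and its inverse can be sketched element-wise in Python, assuming that the membership is the lower bound of the interval and the nonmembership is one minus the upper bound. The function names are hypothetical.

```python
from typing import Tuple


def ivfs_to_ifs(interval: Tuple[float, float]) -> Tuple[float, float]:
    """Map an interval [lower, upper] to the pair (mu, nu) = (lower, 1 - upper)."""
    lower, upper = interval
    return (lower, 1.0 - upper)


def ifs_to_ivfs(pair: Tuple[float, float]) -> Tuple[float, float]:
    """Inverse mapping: (mu, nu) goes to the interval [mu, 1 - nu]."""
    mu, nu = pair
    return (mu, 1.0 - nu)


if __name__ == "__main__":
    interval = (0.25, 0.75)
    pair = ivfs_to_ifs(interval)
    print(pair, ifs_to_ivfs(pair))
```

Under this mapping the hesitance index of the pair equals the length of the interval, which is why the two models are mathematically interchangeable.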


Theorem 6.2 shows that interval-valued fuzzy sets
and Atanassov's intuitionistic fuzzy sets are equivalent
from a mathematical point of view. But, as pointed out
in [6.52], the absence of a structural component in their
description might explain this result, since from a
conceptual point of view they are very different models:


a) The representation of the membership of an ele- 
ment to a set using an interval means that the expert 
doubts about the exact value of such membership, 
so such an expert provides two bounds, and we 
never consider the representation of the nonmem- 
bership to a set. 

b) By means of the intuitionistic index, we represent
the hesitance of the expert in simultaneously
building the membership and the nonmembership
degrees.


From an applied point of view, the conceptual difference
between both concepts has also been clearly
displayed in [6.99]. On page 204 of this paper, Ye
adapts an example by Herrera and Herrera-Viedma
that appeared in 2000 [6.100]. Ye's example runs as
follows: $n$ experts are asked about a money investment in
four different companies. Ye considers that the membership
to the set that represents each company is
given by the number of experts who would invest their
money in that company (normalized by $n$), and the
nonmembership is given by the number of experts who
would not invest their money in that company. Clearly,
the intuitionistic index corresponds to the experts that
do not provide either a positive or a negative answer
about investing in that company. In this way, Ye shows
that:


1. The results obtained with this representation are 
more realistic than those obtained in [6.100] using 
Zadeh’s fuzzy sets. 

2. In the considered problem, the interval interpreta- 
tion does not make much sense besides its use as 
a mathematical tool. 
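
Ye's interpretation can be illustrated with a small Python sketch. The function name `if_value_from_votes` and the vote counts are made up for illustration, and we assume that the nonmembership degree is normalized by the number of experts in the same way as the membership degree.

```python
from typing import Tuple


def if_value_from_votes(yes: int, no: int, n: int) -> Tuple[float, float, float]:
    """Return (membership, nonmembership, hesitance) for one company.

    yes: experts who would invest; no: experts who would not invest;
    the remaining n - yes - no experts gave no answer (hesitance).
    """
    if yes < 0 or no < 0 or yes + no > n or n <= 0:
        raise ValueError("need 0 <= yes, 0 <= no and yes + no <= n with n > 0")
    return (yes / n, no / n, (n - yes - no) / n)


if __name__ == "__main__":
    # Hypothetical panel of 8 experts: 4 in favour, 2 against, 2 undecided.
    print(if_value_from_votes(4, 2, 8))
```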


6.5.2 Some Problems with the Intuitionistic 
Sets Defined by Atanassov 


Besides the missing structural component pointed out
in [6.52], we note the following problems:


1. In these sets, each element has two associated
values. For this reason, we consider that
information measures such as entropy [6.59, 61],
similarity [6.101, 102], etc. should also be given by two
numerical values. That is, in our opinion, we should
distinguish between those measures that provide
a single number and those that provide two
numbers. This fact is discussed in [6.103], where the
two concepts of entropy given in [6.59] and [6.61]
are jointly used to represent the uncertainty linked
to an Atanassov's intuitionistic fuzzy set. So we think
that it is necessary to carry out a conceptual revision
of the definitions of similarity, dissimilarity,
entropy, comparability, etc., given for these sets,
all the more so since nowadays working with two
numbers instead of a single one does not imply a much
larger computational cost.

2. As in the case of interval-valued fuzzy sets, in many
applications it is difficult to choose the most
appropriate linear order associated with that
application [6.50,51]. We should remark that the chosen
order directly influences the final outcome, so it is
necessary to study the conditions that determine the
choice of one order or another [6.65].


6.5.3 Applications 


These extensions have proved very useful in
decision-making problems [6.99, 104-108]. In general,
they work very well in problems in which we have to
represent the difference between the positive and the
negative representation of something [6.109], in
particular in cognitive psychology and medicine [6.110].
They have also often been used in image processing, as
in [6.111, 112]. We should remark that, owing to the
mathematical equivalence between these sets and interval-valued
fuzzy sets, in many applications in which
interval-valued fuzzy sets are useful, so are Atanassov's
intuitionistic fuzzy sets [6.113].


6.5.4 The Problem of the Name 


From Sect. 6.1.1, it is clear that the term intuitionistic
was used in 1907 by Brouwer, in 1930 by Heyting, etc.
So, 75 years before Atanassov used it, it already had
a specific meaning in logic. Moreover, one year after
Atanassov first used it in Bulgarian, Takeuti and Titani
(1984) presented a set representation for Heyting's ideas,
using the expression intuitionistic fuzzy sets. From our
point of view, this means that in fact the correct
terminology is that of Takeuti and Titani. Nevertheless, all
these facts have given rise to a serious notation problem
in the literature on the subject.

In 2005, in order to solve these problems, Dubois
et al. published a paper [6.7] on the subject in which they
proposed to replace the name intuitionistic fuzzy sets
by bipolar fuzzy sets, justifying this change. Atanassov
later answered in [6.114], where he defends
the reasons he had to choose the name intuitionistic and
states a clear fact: the sets he defined are much more
cited and used than those defined by Takeuti and Titani,
so in his opinion the name must not change.


In Dubois and Prade’s works about bipolarity 
types [6.115, 116], these authors stated that Atanassov’s 
sets are included in the type-2 bipolar sets, so they call 
these sets fuzzy bipolar sets of type-2. 

But we must say that nine years before Dubois 
et al.’s paper about the notation, Zhang in [6.117, 118] 
used the word bipolar in connection with the fuzzy 
sets theory and presented the concept of bipolar-valued 
set. 

All these considerations have led some authors to
propose the name Atanassov's intuitionistic fuzzy sets.
However, Atanassov himself disagrees with this notation
and asserts that his notation must be kept; that is,
intuitionistic fuzzy sets. Other authors use the name
IF-sets (intuitionistic fuzzy) [6.119].

In any case, only time will fix the appropriate 
names. 


6.6 Atanassov's Interval-Valued Intuitionistic Fuzzy Sets 


In 1989, Atanassov and Gargov presented the following 
definition [6.98]: 


Definition 6.16
An Atanassov's interval-valued intuitionistic fuzzy set
over $U$ is an expression $A$ given by

$$A = \{ (u_i, M_A(u_i), N_A(u_i)) \mid u_i \in U \},$$

where $M_A : U \to L([0,1])$ and $N_A : U \to L([0,1])$
are such that $0 \le \overline{M_A}(u_i) + \overline{N_A}(u_i) \le 1$ for every $u_i \in U$.


In this definition, the authors adapt Atanassov's
intuitionistic sets to Zadeh's ideas on the problem of
building the membership degrees of the elements to the
fuzzy set. Moreover, if for every $u_i \in U$ we have that
$\underline{M_A}(u_i) = \overline{M_A}(u_i)$ and $\underline{N_A}(u_i) = \overline{N_A}(u_i)$, then we recover
an Atanassov's intuitionistic fuzzy set, so the latter
are a particular case of Atanassov's interval-valued
intuitionistic fuzzy sets. As in the case of Atanassov's
intuitionistic fuzzy sets, the complementary of a set is
obtained by interchanging the membership and
nonmembership intervals.

We represent by A-IVIFS(U) the class of all
Atanassov's interval-valued intuitionistic fuzzy sets
over a referential set $U$.


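Definition 6.17
Given $A, B \in$ A-IVIFS(U),

$$A \cup_{A\text{-}IVIFS} B = \{ (u_i, A \cup_{A\text{-}IVIFS} B(u_i)) \mid u_i \in U \},$$

where

$$A \cup_{A\text{-}IVIFS} B(u_i) = \Big( \big[ \max(\underline{M_A}(u_i), \underline{M_B}(u_i)), \max(\overline{M_A}(u_i), \overline{M_B}(u_i)) \big], \big[ \min(\underline{N_A}(u_i), \underline{N_B}(u_i)), \min(\overline{N_A}(u_i), \overline{N_B}(u_i)) \big] \Big),$$

and

$$A \cap_{A\text{-}IVIFS} B = \{ (u_i, A \cap_{A\text{-}IVIFS} B(u_i)) \mid u_i \in U \},$$

where

$$A \cap_{A\text{-}IVIFS} B(u_i) = \Big( \big[ \min(\underline{M_A}(u_i), \underline{M_B}(u_i)), \min(\overline{M_A}(u_i), \overline{M_B}(u_i)) \big], \big[ \max(\underline{N_A}(u_i), \underline{N_B}(u_i)), \max(\overline{N_A}(u_i), \overline{N_B}(u_i)) \big] \Big).$$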
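
The bound-wise operations of Definition 6.17 can be sketched for a single element in Python. We assume that a value is stored as a pair of intervals, ((lower M, upper M), (lower N, upper N)); the type alias and function names are hypothetical.

```python
from typing import Tuple

Interval = Tuple[float, float]
IVIFValue = Tuple[Interval, Interval]  # (membership interval, nonmembership interval)


def ivif_union(x: IVIFValue, y: IVIFValue) -> IVIFValue:
    """Bound-wise max on membership intervals, bound-wise min on nonmembership."""
    (m1, n1), (m2, n2) = x, y
    return (
        (max(m1[0], m2[0]), max(m1[1], m2[1])),
        (min(n1[0], n2[0]), min(n1[1], n2[1])),
    )


def ivif_intersection(x: IVIFValue, y: IVIFValue) -> IVIFValue:
    """Bound-wise min on membership intervals, bound-wise max on nonmembership."""
    (m1, n1), (m2, n2) = x, y
    return (
        (min(m1[0], m2[0]), min(m1[1], m2[1])),
        (max(n1[0], n2[0]), max(n1[1], n2[1])),
    )


if __name__ == "__main__":
    x = ((0.25, 0.5), (0.125, 0.25))
    y = ((0.125, 0.625), (0.25, 0.375))
    print(ivif_union(x, y))
    print(ivif_intersection(x, y))
```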


Corollary 6.4
Atanassov's interval-valued intuitionistic fuzzy sets are
a particular case of L-fuzzy sets.


Proof: Just note that $LL([0,1]) = \{ (x, y) \in L([0,1])^2 \mid \overline{x} + \overline{y} \le 1 \}$
with the operations in Definition 6.17
is a lattice. $\square$


Proposition 6.7
The set $(A\text{-}IVIFS(U), \cup_{A\text{-}IVIFS}, \cap_{A\text{-}IVIFS})$ is a bounded
lattice, where the order is defined as

$$A \le_{A\text{-}IVIFS} B \text{ if and only if } A \cup_{A\text{-}IVIFS} B = B$$

or equivalently

$$A \le_{A\text{-}IVIFS} B \text{ if and only if } A \cap_{A\text{-}IVIFS} B = A.$$


Note that the order $A \le_{A\text{-}IVIFS} B$ if and only if $M_A(u_i) \le_{IVFS} M_B(u_i)$
and $N_A(u_i) \ge_{IVFS} N_B(u_i)$ for all $u_i \in U$, that is,
$A \le_{A\text{-}IVIFS} B$ if and only if $\underline{M_A}(u_i) \le \underline{M_B}(u_i)$, $\overline{M_A}(u_i) \le \overline{M_B}(u_i)$,
$\underline{N_A}(u_i) \ge \underline{N_B}(u_i)$, and $\overline{N_A}(u_i) \ge \overline{N_B}(u_i)$ for all
$u_i \in U$, is not linear.

We make the following remarks regarding these
extensions:


1. It is necessary to study two different types of
information measures: those whose outcome is a single
number [6.120] and those whose outcomes are two
intervals in [0, 1] [6.120].

2. Nowadays, there are many works using these
sets [6.121-123]. However, none of them displays
an example where the results obtained with these
sets are better than those obtained with fuzzy sets or
other techniques. As happened until recent years
with interval-valued fuzzy sets, it is necessary to
find an application that provides better results using
these extensions rather than other sets. To
do so, we should compare the results with those
obtained with other techniques, which is something
that the papers making use of these sets have not
done so far. For the moment, most of
the studies are just theoretical [6.124-126].


6.7 Links Between the Extensions of Fuzzy Sets 


Taking into account the study carried out in previous 
sections, we can describe the following links between 
the different extensions. 


1. $FS \subset IVFS = \text{Grey sets} = A\text{-}IFS = \text{Vague sets} \subset A\text{-}IVIFS \subset L\text{-}FS$.

2. If we consider the operations in Definition 6.11, we
have the sequence of inclusions:
$FS \subset IVFS = \text{Grey sets} = A\text{-}IFS = \text{Vague sets} \subset T2FS \subset L\text{-}FS$.


6.8 Other Types of Sets


In this section, we present the definition of other types
of sets that have arisen from the idea of Zadeh's fuzzy
set. However, for us none of them should be considered
an extension of a fuzzy set, since we do not represent
with them the degree of ignorance or uncertainty of the
expert.


6.8.1 Probabilistic Sets


These sets were introduced in 1981 by Hirota [6.127].

Definition 6.18
Let $(\Omega, B, P)$ be a probability space and let $B(0,1)$
denote the family of Borel sets in $[0,1]$. A probabilistic set
$A$ over the universe $U$ is a function

$$A : U \times \Omega \to ([0,1], B(0,1)),$$

where $A(u_i, \cdot)$ is measurable for each $u_i \in U$.


6.8.2 Fuzzy Multisets and n-Dimensional 
Fuzzy Sets 


The idea of multiset was given by Yager in 1986 [6.128] 
and later developed by Miyamoto [6.129]. In these mul- 
tilevel sets, several degrees of membership are assigned 
to each element. 


Definition 6.19
Let $U$ be a nonempty set and $n \in \mathbb{N}^+$. A fuzzy multiset
$A$ over $U$ is given by

$$A = \{ (u_i, \mu_{A_1}(u_i), \mu_{A_2}(u_i), \ldots, \mu_{A_n}(u_i)) \mid u_i \in U \},$$

where $\mu_{A_i} : U \to [0,1]$ is called the $i$th membership
degree of $A$.


If in Definition 6.19 we require that $\mu_{A_1} \le \mu_{A_2} \le \cdots \le \mu_{A_n}$,
we have an n-dimensional fuzzy set [6.130, 131].
Nevertheless, it is worth pointing out the relation
of these families of fuzzy sets with the classification
model proposed in [6.132], and the particular model
proposed in [6.133], where fuzzy preference intensity
was arranged according to the basic preference
attitudes.
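
The difference between a fuzzy multiset and an n-dimensional fuzzy set can be sketched in Python as a simple ordering check. The dictionary representation (element to tuple of n membership degrees) and the function name `is_n_dimensional` are assumptions made for illustration.

```python
from typing import Dict, Tuple

FuzzyMultiset = Dict[str, Tuple[float, ...]]  # element -> n membership degrees


def is_n_dimensional(a: FuzzyMultiset) -> bool:
    """True when, for every element, the membership degrees are nondecreasing,
    i.e. the fuzzy multiset is also an n-dimensional fuzzy set.
    """
    return all(
        all(x <= y for x, y in zip(degrees, degrees[1:]))
        for degrees in a.values()
    )


if __name__ == "__main__":
    ordered = {"u1": (0.25, 0.5, 0.75), "u2": (0.0, 0.0, 1.0)}
    unordered = {"u1": (0.5, 0.25, 0.75)}
    print(is_n_dimensional(ordered), is_n_dimensional(unordered))
```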


6.8.3 Bipolar Valued Set or Bipolar Set 


In 1996, Zhang presented the concept of bipolar set as 
follows [6.117]: 


Definition 6.20
A bipolar-valued set or a bipolar set on $U$ is an object

$$A = \{ (u_i, \varphi^{+}(u_i), \varphi^{-}(u_i)) \mid u_i \in U \}$$

with $\varphi^{+} : U \to [0,1]$ and $\varphi^{-} : U \to [-1,0]$.


In these sets, the value $\varphi^{-}(u_i)$ must be understood as
how much the environment of the problem opposes
the fulfillment of $\varphi^{+}(u_i)$. Nowadays, interesting studies
exist about these sets [6.134-138].


6.8.4 Neutrosophic Sets or Symmetric 
Bipolar Sets 


These sets were first studied by Smarandache in 
2002 [6.139]. They arise from Atanassov’s intuitionis- 
tic fuzzy sets ignoring the restriction on the sum of the 
membership and the nonmembership degrees. 


Definition 6.21
A neutrosophic set or symmetric bipolar set on $U$ is an
object

$$A = \{ (u_i, \mu_A(u_i), \nu_A(u_i)) \mid u_i \in U \},$$

with $\mu_A : U \to [0,1]$ and $\nu_A : U \to [0,1]$.


6.8.5 Hesitant Sets 


These sets were introduced by Torra and Naru- 
kawa in 2009 to deal with decision-making 
problems [6.140, 141]. 


Definition 6.22
Let $\wp([0,1])$ be the set of all subsets of the unit interval
and $U$ be a nonempty set. Let $\mu_A : U \to \wp([0,1])$; then
a hesitant fuzzy set (HFS in short) $A$ defined over $U$ is
given by

$$A = \{ (u_i, \mu_A(u_i)) \mid u_i \in U \}. \quad (6.6)$$


6.8.6 Fuzzy Soft Sets 


Based on the definition of soft set [6.142], Maji et al. 
present the following definition [6.143]. 


Definition 6.23
A pair $(F, A)$ is called a fuzzy soft set over $U$, where $F$
is a mapping given by $F : A \to FP(U)$ and $FP(U)$
denotes the set of all fuzzy subsets of $U$.


6.8.7 Fuzzy Rough Sets


From the concept of rough set given by Pawlak
in [6.144], Dubois and Prade in 1990 proposed
the following definition [6.145]. From different points
of view, these sets could be considered an extension
of fuzzy sets in our sense. Moreover, these
sets are being exhaustively studied, and for this
reason we consider that they need another
chapter.


Definition 6.24
Let $U$ be a referential set and $R$ be a fuzzy similarity
relation on $U$. Take $A \in FS(U)$. A fuzzy rough
set over $U$ is a pair $(R{\downarrow}A, R{\uparrow}A) \in FS(U) \times FS(U)$,
where

• $R{\downarrow}A : U \to [0,1]$ is given by
$R{\downarrow}A(u) = \inf_{v \in U} \max(1 - R(v,u), A(v))$

• $R{\uparrow}A : U \to [0,1]$ is given by
$R{\uparrow}A(u) = \sup_{v \in U} \min(R(v,u), A(v))$.
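
For a finite referential set, the infimum and supremum in Definition 6.24 reduce to a minimum and a maximum, so both approximations can be computed directly. The following Python sketch assumes that the relation is stored as a dictionary keyed by pairs (v, u); the function names and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple

Relation = Dict[Tuple[str, str], float]  # (v, u) -> R(v, u)
FuzzySet = Dict[str, float]


def lower_approximation(R: Relation, A: FuzzySet, U: List[str]) -> FuzzySet:
    """Lower approximation: min over v of max(1 - R(v, u), A(v))."""
    return {u: min(max(1.0 - R[(v, u)], A[v]) for v in U) for u in U}


def upper_approximation(R: Relation, A: FuzzySet, U: List[str]) -> FuzzySet:
    """Upper approximation: max over v of min(R(v, u), A(v))."""
    return {u: max(min(R[(v, u)], A[v]) for v in U) for u in U}


if __name__ == "__main__":
    U = ["a", "b"]
    # A reflexive, symmetric relation on two elements (illustrative values).
    R = {("a", "a"): 1.0, ("b", "b"): 1.0, ("a", "b"): 0.5, ("b", "a"): 0.5}
    A = {"a": 1.0, "b": 0.25}
    print(lower_approximation(R, A, U))
    print(upper_approximation(R, A, U))
```

Because the relation in this example is reflexive, the lower approximation stays below the set and the upper approximation stays above it.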


6.9 Conclusions


In this chapter, we have reviewed the main types of
fuzzy sets defined since 1965. We have classified these
sets in two groups: those that take into account the
problem of building the membership functions, which
we have included in the so-called extensions of fuzzy
sets, and those that appear as an answer to such a key
issue.

We have introduced the definitions and first properties
of the extensions, that is, type-2 fuzzy sets,
interval-valued fuzzy sets, Atanassov's intuitionistic fuzzy sets
or type-2 bipolar fuzzy sets, and Atanassov's
interval-valued intuitionistic fuzzy sets. We have described the
properties and problems linked to type-2 fuzzy sets, and we
have presented several construction methods for interval-valued
fuzzy sets, depending on the application. We have also
referred to some papers where it is shown that the use of
interval-valued fuzzy sets improves the results obtained
with fuzzy sets.

In general, we have stated the main problem in
fuzzy sets extensions, namely, to find applications for
which the results obtained with these sets are better
than those obtained with other techniques. This has only
been proved, up to now, for interval-valued fuzzy sets.
We think that the great challenge for some sets that are
initially justified as a theoretical need is to prove their
practical usefulness.


Irina Perfilieva


The theory of the F-transform is presented and 
discussed from the perspective of the latest devel- 
opments and applications. Various fuzzy partitions 
are considered. The definition of the F-transform 
is given with respect to a generalized fuzzy parti- 
tion, and the main properties of the F-transform 
are listed. The applications to image processing,
namely image compression, fusion and edge detection,
are discussed with sufficient technical details.


7.1 Fuzzy Modeling


Fuzzy modeling is still regarded as a modern technique 
with a nonclassical background. The goal of this chap- 
ter is to bridge standard mathematical methods and 
methods for the construction of fuzzy approximation 
models. We will present the theory of the fuzzy trans- 
form (the F-transform), which was introduced in [7.1] 
for the purpose of encompassing both classical (usu- 
ally, integral) transforms and approximation models 
based on fuzzy IF-THEN rules (fuzzy approximation 
models). We start with an informal characterization of 
integral transforms, and from this discussion, we ex- 
amine the similarities and differences among integral 
transforms, the F-transform, and fuzzy approximation 
models. An integral transform is performed using some 
kernel. The kernel is represented by a function of two 
variables and can be understood as a collection of lo- 
cal factors or closeness areas around elements of an 
original space. Each factor is then assigned an average
value of a transforming object (usually, a function).
Consequently, the transformed object is a new function 
defined on a space of local factors. The F-transform 
can be implicitly characterized by a discrete kernel that 
is associated with a finite collection of fuzzy subsets 
(local factors or closeness areas around chosen nodes) 
of an original space. We say that this collection estab- 
lishes a fuzzy partition of the space. Then, similar to 
integral transforms, the F-transform assigns an aver- 
age value of a transforming object to each fuzzy subset 
from the fuzzy partition of the space. Consequently, 
the F-transformed object is a finite vector of average 
values. 

Similar to the F-transform, a fuzzy approximation 
model can also be implicitly characterized by a discrete 
kernel that establishes a fuzzy partition of an original 
space. Each element of the established fuzzy partition 
is a fuzzy set in the IF part (antecedent) of the
respective fuzzy IF-THEN rule. The rule characterizes
a correspondence between an antecedent and an aver- 
age value of a transforming object (singleton model) or 
a fuzzy subset of a space of object values (fuzzy set 
model). 

To emphasize the differences among integral trans- 
forms, the F-transform, and fuzzy approximation mod- 
els, we note that the last two are actually finite collec- 
tions of local descriptions of a considered object. Each 
collection produces a global description of the consid- 
ered object in the form of the direct F-transform or the 
system of fuzzy IF-THEN rules. 

The idea of producing collections of local descrip- 
tions by fuzzy IF-THEN rules originates from the 
early works of Zadeh [7.2-5] and from the Takagi- 
Sugeno [7.6] approximation models. 

Similar to the conventional integral transforms (the
Fourier and Laplace transforms, for example), the
F-transform performs a transformation of an original
universe of functions into a universe of their skeleton
models (vectors of F-transform components) for which
further computations are easier (see, e.g., an application
to the initial value problem with fuzzy initial
conditions [7.7]). In this respect, the F-transform can be as
useful in applications as traditional transforms (see
applications to image compression [7.8, 9] and time series
processing [7.10-14], for example). Moreover, sometimes
the F-transform can be more efficient than its
counterparts; see the details below.

The structure of this chapter is as follows. In
Sect. 7.2, we consider various fuzzy partitions: uniform
and with and without the Ruspini condition, among others;
in Sect. 7.3, definitions of the F-transforms (direct
and inverse) and their main properties are considered;
in Sect. 7.4, the discrete F-transform is defined; in
Sect. 7.5, the direct and inverse F-transform of a function
of two variables is introduced; in Sect. 7.6, a higher
degree F-transform is considered; in Sect. 7.7,
applications of the F-transform and $F^1$-transform to image
processing are discussed.


7.2 Fuzzy Partitions


In this section, we present a short overview of various
fuzzy partitions of a universe in which transforming
objects (functions) are defined. As we learned from
Sect. 7.1, a fuzzy partition is a finite collection of fuzzy
subsets of the universe that determines a discrete kernel
and thus a respective transform. Therefore, we have as
many F-transforms as fuzzy partitions.


7.2.1 Fuzzy Partition
with the Ruspini Condition


The fuzzy partition with the Ruspini condition (7.1)
(simply, Ruspini partition) was introduced in [7.1].
This condition implies normality of the respective
fuzzy partition, i.e., the partition-of-unity. It then leads
to a simplified version of the inverse F-transform.
In later publications [7.15,16], the Ruspini condition
was weakened to obtain an additional degree of freedom
and a better approximation by the inverse
F-transform.


Definition 7.1
Let $x_1 < \cdots < x_n$ be fixed nodes within $[a,b]$ such
that $x_1 = a$, $x_n = b$ and $n \ge 2$. We say that the fuzzy
sets $A_1, \ldots, A_n$, identified with their membership
functions defined on $[a,b]$, establish a Ruspini partition of
$[a,b]$ if they fulfill the following conditions for
$k = 1, \ldots, n$:


1. $A_k : [a,b] \to [0,1]$, $A_k(x_k) = 1$

2. $A_k(x) = 0$ if $x \notin (x_{k-1}, x_{k+1})$, where for uniformity
of notation, we set $x_0 = a$ and $x_{n+1} = b$

3. $A_k(x)$ is continuous

4. $A_k(x)$, for $k = 2, \ldots, n$, strictly increases on
$[x_{k-1}, x_k]$ and $A_k(x)$, for $k = 1, \ldots, n-1$, strictly
decreases on $[x_k, x_{k+1}]$

5. for all $x \in [a,b]$,

$$\sum_{k=1}^{n} A_k(x) = 1. \quad (7.1)$$


The condition (7.1) is known as the Ruspini condition.
The membership functions $A_1, \ldots, A_n$ are called
the basic functions. A point $x \in [a,b]$ is covered by the
basic function $A_k$ if $A_k(x) > 0$.

The shape of the basic functions is not predetermined
and therefore, it can be chosen according to
additional requirements (e.g., smoothness). Let us give
examples of various fuzzy partitions with the Ruspini
condition. In Fig. 7.1, two such partitions with triangular
and cosine basic functions are shown. The following
formulas represent generic fuzzy partitions with the
Ruspini condition and triangular functions

$$A_1(x) = \begin{cases} 1 - \dfrac{x - x_1}{h_1}, & x \in [x_1, x_2], \\ 0, & \text{otherwise}, \end{cases}$$

$$A_k(x) = \begin{cases} \dfrac{x - x_{k-1}}{h_{k-1}}, & x \in [x_{k-1}, x_k], \\ 1 - \dfrac{x - x_k}{h_k}, & x \in [x_k, x_{k+1}], \\ 0, & \text{otherwise}, \end{cases}$$

$$A_n(x) = \begin{cases} \dfrac{x - x_{n-1}}{h_{n-1}}, & x \in [x_{n-1}, x_n], \\ 0, & \text{otherwise}, \end{cases}$$

where $k = 2, \ldots, n-1$ and $h_k = x_{k+1} - x_k$.
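
The triangular basic functions and the Ruspini condition (7.1) can be checked numerically with a short Python sketch. We assume the usual triangular shapes, in which each basic function rises linearly from the previous node to its own node and falls linearly to the next one; the function name `triangular_partition` and the sample nodes are our own.

```python
from typing import List


def triangular_partition(nodes: List[float], x: float) -> List[float]:
    """Return [A_1(x), ..., A_n(x)] for triangular basic functions on the nodes.

    nodes must be strictly increasing with nodes[0] = a and nodes[-1] = b.
    """
    n = len(nodes)
    values = [0.0] * n
    for k in range(n):
        xk = nodes[k]
        if k > 0 and nodes[k - 1] <= x <= xk:
            # Rising branch on [x_{k-1}, x_k].
            values[k] = (x - nodes[k - 1]) / (xk - nodes[k - 1])
        elif k < n - 1 and xk <= x <= nodes[k + 1]:
            # Falling branch on [x_k, x_{k+1}].
            values[k] = 1.0 - (x - xk) / (nodes[k + 1] - xk)
    return values


if __name__ == "__main__":
    nodes = [0.0, 1.0, 3.0]  # non-uniform nodes on [0, 3]
    for x in (0.0, 0.5, 1.0, 2.0, 3.0):
        values = triangular_partition(nodes, x)
        # The values at each point should add up to 1 (Ruspini condition).
        print(x, values, sum(values))
```

At every point of the interval at most two basic functions are nonzero, and their values add up to one, which is exactly the partition-of-unity property.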

We say that a Ruspini partition of $[a,b]$ is $h$-uniform
if its nodes $x_1, \ldots, x_n$, where $n \ge 3$, are equidistant,
i.e., $x_k = a + h(k-1)$, for $k = 1, \ldots, n$, where
$h = (b-a)/(n-1)$, and the two additional properties are met:


6. $A_k(x_k - x) = A_k(x_k + x)$, for all $x \in [0,h]$,
$k = 2, \ldots, n-1$,

7. $A_k(x) = A_{k-1}(x - h)$, for all $k = 2, \ldots, n-1$ and
$x \in [x_k, x_{k+1}]$, and $A_{k+1}(x) = A_k(x - h)$, for all
$k = 2, \ldots, n-1$ and $x \in [x_k, x_{k+1}]$.


Fig. 7.1a,b Two Ruspini partitions with triangular (a) and cosine basic functions (b)


7.2.2 Fuzzy Partitions with the
Generalized Ruspini Condition


Fuzzy partitions with the generalized Ruspini condition
were introduced in [7.15]. The generalization
consists in replacing partition-of-unity (7.1) by fuzzy
$r$-partition (7.2). This type of partition was investigated
in [7.15,17], where the focus was on smoothing or
filtering data using the inverse F-transform. The following
definition is taken from [7.15].


Definition 7.2
Let $r \ge 1$ and $n \ge 2$ be fixed integers such that $r \le n$.
Let $a = x_1 < \cdots < x_n = b$ be nodes within $[a,b]$, and
let $x_{-r+1} < \cdots < x_0 < a$ and $b < x_{n+1} < \cdots < x_{n+r}$ be
nodes outside of $[a,b]$. A fuzzy $r$-partition of $[a,b]$ is
a family of $n + 2r - 2$ continuous, normal, convex fuzzy
sets

$$A^{(r)}_{-r+2}, \ldots, A^{(r)}_{1}, \ldots, A^{(r)}_{n}, \ldots, A^{(r)}_{n+r-1}$$

such that the following conditions are fulfilled:


1. For $k = 1, \ldots, n$, $A^{(r)}_k$ is a continuous function on
$[a,b]$ such that $A^{(r)}_k(x_k) = 1$ and $A^{(r)}_k(x) = 0$ for
$x \notin [\max(x_{k-r}, a), \min(x_{k+r}, b)]$

2. For $k = 1, \ldots, n$, $A^{(r)}_k$ is increasing on
$[\max(x_{k-r}, a), x_k]$ and decreasing on
$[x_k, \min(x_{k+r}, b)]$

3. For $k = -r+2, \ldots, 0$, $A^{(r)}_k$ is decreasing on
$[\max(x_k, a), x_{k+r}]$

4. For $k = n+1, \ldots, n+r-1$, $A^{(r)}_k$ is increasing on
$[x_{k-r}, \min(x_k, b)]$

5. For all $x \in [a,b]$, the following partition-of-$r$ condition
holds

$$\sum_{k=-r+2}^{n+r-1} A^{(r)}_k(x) = r. \quad (7.2)$$


If $r = 1$, then a fuzzy $r$-partition in the sense of
Definition 7.2 becomes the standard fuzzy partition in the
sense of Definition 7.1, i.e., the partition-of-unity. In
Fig. 7.2, the fuzzy 2-partition with triangular basic
functions is shown.


Fig. 7.2 An example of a fuzzy 2-partition with triangular basic functions


7.2.3 Generalized Fuzzy Partitions


A generalized fuzzy partition appeared in [7.16] in 
connection with the notion of the higher degree F- 
transform. Its even weaker version was implicitly in- 
troduced in [7.18] with the purpose of meeting the 
requirements of image compression. We summarize 
both these notions and propose the following definition. 


Definition 7.3
Let $[a,b]$ be an interval on $\mathbb{R}$, $n \ge 2$, and let $x_0, x_1, \dots, x_n, x_{n+1}$ be nodes such that

$$a = x_0 \le x_1 < \dots < x_n \le x_{n+1} = b.$$

We say that the fuzzy sets $A_1,\dots,A_n : [a,b] \to [0,1]$ constitute a generalized fuzzy partition of $[a,b]$ if for every $k = 1,\dots,n$ there exist $h'_k, h''_k \ge 0$ such that

$$h'_k + h''_k > 0, \qquad [x_k - h'_k,\ x_k + h''_k] \subseteq [a,b]$$

and the following three conditions are fulfilled:

1. (locality): $A_k(x) > 0$ if $x \in (x_k - h'_k, x_k + h''_k)$, and $A_k(x) = 0$ if $x \in [a,b] \setminus [x_k - h'_k, x_k + h''_k]$

2. (continuity): $A_k$ is continuous on $[x_k - h'_k, x_k + h''_k]$

3. (covering): for $x \in [a,b]$, $\sum_{k=1}^{n} A_k(x) > 0$.

It is important to remark that by conditions of locality and continuity,

$$\int_a^b A_k(x)\,dx > 0.$$

An $(h,h')$-uniform generalized fuzzy partition of $[a,b]$ is defined for equidistant nodes

$$x_k = a + h(k-1), \quad k = 1,\dots,n,$$

where $h = (b-a)/(n-1)$, $h' > h/2$ and two additional properties are satisfied:

4. $A_k(x) = A_{k-1}(x-h)$ for all $k = 2,\dots,n-1$ and $x \in [x_k, x_{k+1}]$, and $A_{k+1}(x) = A_k(x-h)$ for all $k = 2,\dots,n-1$ and $x \in [x_k, x_{k+1}]$.

5. $h'_1 = h''_n = 0$, $h''_1 = h'_2 = \dots = h''_{n-1} = h'_n = h'$ and for all $k = 2,\dots,n-1$ and all $x \in [0,h]$, $A_k(x_k - x) = A_k(x_k + x)$.

An $(h,h')$-uniform generalized fuzzy partition of $[a,b]$ can also be defined using the generating function $A_0 : [-1,1] \to [0,1]$, which is assumed to be even, continuous, and positive everywhere except on the boundaries, where it vanishes. (The function $A_0 : [-1,1] \to \mathbb{R}$ is even if for all $x \in [0,1]$, $A_0(-x) = A_0(x)$.) Then, basic functions $A_k$ of an $(h,h')$-uniform generalized fuzzy partition are shifted copies of $A_0$ in the sense that

$$A_1(x) = \begin{cases} A_0\!\left(\dfrac{x - x_1}{h'}\right), & x \in [x_1, x_1 + h'], \\ 0, & \text{otherwise}, \end{cases}$$

and for $k = 2,\dots,n-1$,

$$A_k(x) = \begin{cases} A_0\!\left(\dfrac{x - x_k}{h'}\right), & x \in [x_k - h', x_k + h'], \\ 0, & \text{otherwise}, \end{cases}$$

$$A_n(x) = \begin{cases} A_0\!\left(\dfrac{x - x_n}{h'}\right), & x \in [x_n - h', x_n], \\ 0, & \text{otherwise}. \end{cases} \tag{7.3}$$


As an example, we note that the function $A_0(x) = 1 - |x|$ is a generating function for all uniform triangular partitions. The difference between them is in parameters $h$ and $h'$. An $(h,h)$-uniform generalized fuzzy partition is simply called an h-uniform one (Fig. 7.3).

Fig. 7.3 Generating function $A_0$ of an h-uniform generalized fuzzy partition (after [7.19])
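The construction of basic functions as shifted copies of a generating function can be sketched numerically. The following is a minimal illustration, not part of the cited works: the function names `a0_triangular` and `uniform_partition` and the sampling grid are assumptions made for this example.

```python
import numpy as np

def a0_triangular(t):
    """Generating function A_0(t) = 1 - |t| on [-1, 1], zero outside."""
    return np.clip(1.0 - np.abs(t), 0.0, None)

def uniform_partition(x, a, b, n, h_prime=None, a0=a0_triangular):
    """Evaluate A_1..A_n of an (h, h')-uniform partition at points x in [a, b].

    Row k-1 of the result holds A_k(x) = A_0((x - x_k) / h').
    If h_prime is omitted, h' = h is used (the h-uniform case).
    """
    x = np.asarray(x, dtype=float)
    h = (b - a) / (n - 1)
    if h_prime is None:
        h_prime = h
    nodes = a + h * np.arange(n)
    return a0((x[None, :] - nodes[:, None]) / h_prime)

x = np.linspace(0.0, 6.0, 601)
A = uniform_partition(x, 0.0, 6.0, n=7)                 # h = h' = 1
ruspini_ok = np.allclose(A.sum(axis=0), 1.0)            # partition of unity
A_wide = uniform_partition(x, 0.0, 6.0, n=7, h_prime=1.5)
covering_ok = bool(np.all(A_wide.sum(axis=0) > 0.0))    # covering only
```

With $h' = h$ the triangular basic functions sum to one, whereas a wider $h'$ keeps only the weaker covering property required of a generalized fuzzy partition.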


Remark 7.1
A generalized fuzzy partition can also be considered in connection with radial membership functions; see [7.20]. In this case, every basic function has a generic representation in terms of a kernel $\varphi : \mathbb{R}^{+} \to \mathbb{R}$ such that

$$A_k(x) = \varphi(|x - x_k|), \quad k = 1,\dots,n.$$


7.3 Fuzzy Transform


The F-transform establishes a correspondence between a set of continuous functions on an interval of real numbers and the set of n-dimensional (real) vectors. Each component of the resulting vector is a weighted local mean of a corresponding function over an area covered by a corresponding basic function. The vector of the F-transform components is a simplified representation of an original function that can be used instead of the original function in many applications. Among them, let us mention applications to image compression [7.8, 9], image fusion, image reduction, time series processing [7.10-14], and the initial value problem with fuzzy initial conditions [7.7].


7.3.1 Direct F-Transform


In this section, we give the definition of the F-transform according to [7.1] and recall its main properties. We assume that the universe is an interval $[a,b]$ and $x_1 < \dots < x_n$ are fixed nodes from $[a,b]$ such that $x_1 = a$, $x_n = b$ and $n \ge 2$. Let us formally extend the set of nodes by $x_0 = a$ and $x_{n+1} = b$. Let $A_1,\dots,A_n$ be the basic functions that form a fuzzy partition of $[a,b]$ according to Definition 7.3. Let $C([a,b])$ be the set of continuous functions on the interval $[a,b]$. The following definition introduces the fuzzy transform of a function $f \in C([a,b])$.


Definition 7.4

Let $A_1,\dots,A_n$ be the basic functions that form a generalized fuzzy partition of $[a,b]$ and $f$ be any function from $C([a,b])$. We say that the n-tuple of real numbers $F[f] = (F_1,\dots,F_n)$ given by

$$F_k = \frac{\int_a^b f(x) A_k(x)\,dx}{\int_a^b A_k(x)\,dx}, \quad k = 1,\dots,n, \tag{7.4}$$

is the (integral) F-transform of $f$ with respect to $A_1,\dots,A_n$.


The elements $F_1,\dots,F_n$ are called the components of the F-transform. If $A_1,\dots,A_n$ is an h-uniform Ruspini partition, then (7.4) may be simplified as follows,

$$F_1 = \frac{2}{h}\int_{x_1}^{x_2} f(x) A_1(x)\,dx,$$

$$F_n = \frac{2}{h}\int_{x_{n-1}}^{x_n} f(x) A_n(x)\,dx,$$

$$F_k = \frac{1}{h}\int_{x_{k-1}}^{x_{k+1}} f(x) A_k(x)\,dx, \quad k = 2,\dots,n-1. \tag{7.5}$$

The following is a list of some properties of the F-transform of $f$ with respect to a generalized fuzzy partition of $[a,b]$:


(a) If for all $x \in [a,b]$, $f(x) = C$, then $F_k = C$, $k = 1,\dots,n$.

(b) If $f = \alpha g + \beta h$, then $F[f] = \alpha F[g] + \beta F[h]$.

(c) If $[c,d] = \{f(x) \mid x \in [a,b]\}$, then $F_k$ is the minimizer of the weighted squared deviation, i.e., $F_k = \arg\min_{y \in [c,d]} \int_a^b (f(x) - y)^2 A_k(x)\,dx$, $k = 1,\dots,n$.

(d) If $f$ is twice continuously differentiable on $[a,b]$, then $F_k = f(x_k) + O(h^2)$, $k = 1,\dots,n$. (This is true for an h-uniform Ruspini partition of $[a,b]$ only. A similar estimation of the F-transform component $F_k$ as a linear combination of $f(x_{k-r+1}),\dots,f(x_k),\dots,f(x_{k+r-1})$ can be established for a fuzzy r-partition [7.15].)

(e) If a generalized fuzzy partition is $(h,h')$-uniform, then for each $k = 1,\dots,n-1$,

$$|f(t) - F_k| \le 2\omega(\bar h, f), \qquad |f(t) - F_{k+1}| \le 2\omega(\bar h, f),$$

where $\bar h = \max(h,h')$, $t \in [x_k, x_{k+1}]$, and

$$\omega(\bar h, f) = \max_{|\delta| \le \bar h}\ \max_{x \in [a,\, b-\delta]} |f(x+\delta) - f(x)|. \tag{7.6}$$

(f) $$\int_a^b f(x)\,dx = h\left(\frac{1}{2}F_1 + \sum_{k=2}^{n-1} F_k + \frac{1}{2}F_n\right).$$

(This is true for an h-uniform Ruspini partition of $[a,b]$ only.)
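The definition (7.4) and the integral identity in property (f) can be checked numerically. The sketch below is illustrative only: it assumes an h-uniform triangular Ruspini partition, approximates the integrals by a composite trapezoidal rule on a grid that contains the nodes, and uses the hypothetical helper names `trapezoid` and `f_transform`.

```python
import numpy as np

def trapezoid(y, x):
    """Composite trapezoidal rule along the last axis of y."""
    return np.sum((y[..., 1:] + y[..., :-1]) * np.diff(x) / 2.0, axis=-1)

def f_transform(f, a, b, n, per_interval=500):
    """Integral F-transform (7.4) for the h-uniform triangular partition.

    Returns the components F_1..F_n, the step h and the quadrature grid.
    """
    h = (b - a) / (n - 1)
    nodes = np.linspace(a, b, n)
    # The grid is chosen so that every node is a grid point.
    x = np.linspace(a, b, (n - 1) * per_interval + 1)
    A = np.clip(1.0 - np.abs(x[None, :] - nodes[:, None]) / h, 0.0, None)
    F = trapezoid(A * f(x)[None, :], x) / trapezoid(A, x)
    return F, h, x

f = lambda x: 10.0 * np.exp(-(x - np.pi) ** 2)
F, h, x = f_transform(f, 0.0, 6.0, n=29)

# Property (f): the integral of f equals h * (F_1/2 + F_2 + ... + F_{n-1} + F_n/2).
lhs = trapezoid(f(x), x)
rhs = h * (0.5 * F[0] + F[1:-1].sum() + 0.5 * F[-1])
```

Because the same quadrature rule is used on both sides, `lhs` and `rhs` agree up to rounding, which mirrors the exact identity for the continuous integrals.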


7.3.2 Inverse F-Transform 


It is clear that an original nonconstant function $f$ cannot be precisely reconstructed from its F-transform $F[f]$ because we lose information when passing from $f$ to $F[f]$. However, the inverse F-transform $\hat f$ that can be reconstructed (using the inversion formula (7.7)) approximates $f$ in such a way that universal convergence can be established.


Definition 7.5
Let $A_1,\dots,A_n$ be the basic functions that form a generalized fuzzy partition of $[a,b]$ and $f$ be a function from $C([a,b])$. Let $F[f] = (F_1,\dots,F_n)$ be the F-transform of $f$ with respect to $A_1,\dots,A_n$. Then, the function $\hat f : [a,b] \to \mathbb{R}$ represented by

$$\hat f(x) = \frac{\sum_{k=1}^{n} F_k A_k(x)}{\sum_{k=1}^{n} A_k(x)} \tag{7.7}$$

is called the inverse F-transform.


Remark 7.2

If a fuzzy partition of $[a,b]$ fulfills the generalized Ruspini condition (7.2) with $r \ge 1$, then the inversion formula (7.7) can be simplified to

$$\hat f(x) = \frac{1}{r}\sum_{k=1}^{n} F_k A_k(x)$$

or to (in the case of the Ruspini partition for which $r = 1$)

$$\hat f(x) = \sum_{k=1}^{n} F_k A_k(x).$$


The following theorem demonstrates that the inverse F-transform $\hat f$ can approximate a continuous function $f$ with arbitrary precision. Thus, it explains why the F-transform has convincing applications in various fields, including image and time series processing, and data mining [7.21]. In Fig. 7.4, we illustrate how the inverse F-transform approximates the function $10e^{-(x-\pi)^2}$.

Fig. 7.4 The function $f(x) = 10e^{-(x-\pi)^2}$ (gray) and its inverse F-transform (brown) with respect to the uniform Ruspini partition of [0,6] by 29 triangular-shaped basic functions. The F-transform components are marked by small circles

Theorem 7.1

Let $f$ be a continuous function on $[a,b]$. Then, for any $\varepsilon > 0$, there exist $n_\varepsilon$ and a generalized fuzzy partition $A_1,\dots,A_{n_\varepsilon}$ of $[a,b]$ such that for all $x \in [a,b]$,

$$|f(x) - \hat f_\varepsilon(x)| \le \varepsilon, \tag{7.8}$$

where $\hat f_\varepsilon$ is the inverse F-transform of $f$ with respect to the fuzzy partition $A_1,\dots,A_{n_\varepsilon}$.


From Theorem 7.2, which is given below, we learn 
that for a pointwise approximation (as in Theorem 7.1), 
it is sufficient to compute the F-transform with respect 
to the simplest triangular fuzzy partition. Therefore, al- 
most all applications of the F-transform are based on 
this type of partition. 


Theorem 7.2

Let $f$ be any continuous function on $[a,b]$, and let $A'_1,\dots,A'_n$ and $A''_1,\dots,A''_n$, for $n \ge 3$, be the basic functions that form different $(h,h')$-uniform generalized fuzzy partitions of $[a,b]$. Let $\hat f'$ and $\hat f''$ be the two inverse F-transforms of $f$ with respect to the different sets of basic functions $A'_1,\dots,A'_n$ and $A''_1,\dots,A''_n$. Then, for arbitrary $x \in [a,b]$,

$$|\hat f'(x) - \hat f''(x)| \le 4\omega(\bar h, f),$$

where $h = \frac{b-a}{n-1}$, $\bar h = \max(h,h')$ and $\omega(\bar h, f)$ is the modulus of continuity (7.6) of $f$ on the interval $[a,b]$.

Fig. 7.5 (a) Function $f(x) = 10e^{-(x-\pi)^2}$ (gray) and its inverse F-transform (brown) with respect to the Ruspini partition given by the triangular-shaped basic functions $A_1,\dots,A_5$ (gray). (b) Noisy function $f + s$ (gray), where $s(x) = \sin(2x) + 0.6\sin(8x) + 0.3\sin(16x)$, and its inverse F-transform (brown) with respect to the same fuzzy partition. Both inverse F-transforms $\hat f$ and $\widehat{f+s}$ are equal on $[x_2, x_4]$


The proofs of Theorems 7.1 and 7.2 can be obtained from the respective proofs in [7.22, Theorems 2 and 3] after some necessary changes caused by the usage of the generalized fuzzy partition.

Below, we list some properties of the inverse F-transform $\hat f$ of $f$ that were considered and proved in [7.1, 15, 23]. If not specially mentioned, it is assumed that the F-transform is computed with respect to a generalized fuzzy partition of $[a,b]$:


(a) If for all $x \in [a,b]$, $f(x) = C$, then $\hat f(x) = C$.

(b) If $f = \alpha g + \beta h$, then $\hat f = \alpha \hat g + \beta \hat h$.

(c) $\int_a^b \hat f(x)\,dx = \int_a^b f(x)\,dx$. (This is true for the fuzzy r-partition ($r \ge 1$) of $[a,b]$ only.)

(d) Let $A_1,\dots,A_n$ be an h-uniform Ruspini partition of $[a,b]$, where $h = (b-a)/(n-1)$ and $n \ge 3$. Let $s : [a,b] \to \mathbb{R}$ be a continuous function such that one of the following two conditions is fulfilled:

(i) $s$ is 2h-periodical and for all $x \in [0,h]$, $s(x_k - x) = -s(x_k + x)$, where $k = 2,\dots,n-1$

(ii) $s$ is h-periodical and $\int_{x_{k-1}}^{x_k} s(x)\,dx = 0$, where $k = 2,\dots,n-1$.

Then, for $x \in [x_2, x_{n-1}]$,

$$\hat f(x) = \widehat{f+s}(x).$$

The last property is known as noise removal. This phrase implies that both functions $f$ (non-noisy) and $f + s$ (noisy) have the same inverse F-transform. The noise is represented by $s$ and characterized by conditions (i) or (ii). We illustrate this property in Fig. 7.5.


7.4 Discrete F-Transform


The discrete case of the F-transform, for which an original function $f$ is defined (may be computed) on a finite set $P = \{p_1,\dots,p_l\} \subseteq [a,b]$, was introduced in [7.1]. We will adapt the mentioned definition to the case of a generalized fuzzy partition of $[a,b]$.

We assume that the domain $P$ of the function $f$ is sufficiently dense with respect to the fixed partition, i.e.,

$$(\forall k)(\exists j)\quad A_k(p_j) > 0.$$

Then, the (discrete) F-transform of $f$ is defined as follows.


Definition 7.6
Let $A_1,\dots,A_n$, for $n \ge 2$, be the basic functions that form a generalized fuzzy partition of $[a,b]$, and let function $f$ be defined on the set $P = \{p_1,\dots,p_l\} \subseteq [a,b]$, which is sufficiently dense with respect to the partition. We say that the n-tuple of real numbers $(F_1,\dots,F_n)$ is the discrete F-transform of $f$ with respect to $A_1,\dots,A_n$ if

$$F_k = \frac{\sum_{j=1}^{l} f(p_j) A_k(p_j)}{\sum_{j=1}^{l} A_k(p_j)}, \quad k = 1,\dots,n. \tag{7.9}$$

It is not difficult to demonstrate that the components of the discrete F-transform have similar properties to those listed in Sect. 7.3.1.

In the discrete case, we define the inverse F-transform on the same set $P$ on which the original function is defined.


Definition 7.7

Let $A_1,\dots,A_n$, for $n \ge 2$, be the basic functions that form a generalized fuzzy partition of $[a,b]$, and let function $f$ be defined on the set $P = \{p_1,\dots,p_l\} \subseteq [a,b]$, which is sufficiently dense with respect to the partition. Moreover, let $F[f] = (F_1,\dots,F_n)$ be the discrete F-transform of $f$ w.r.t. $A_1,\dots,A_n$. Then, the function $\hat f : P \to \mathbb{R}$ represented by

$$\hat f(p_j) = \frac{\sum_{k=1}^{n} F_k A_k(p_j)}{\sum_{k=1}^{n} A_k(p_j)} \tag{7.10}$$

is the inverse discrete F-transform of $f$.


Remark 7.3

If a fuzzy partition of $[a,b]$ fulfills the generalized Ruspini condition (7.2) with $r \ge 1$, i.e., for all $p_j \in P$, $\sum_{k=1}^{n} A_k(p_j) = r$, then the inversion formula (7.10) can be simplified to

$$\hat f(p_j) = \frac{1}{r}\sum_{k=1}^{n} F_k A_k(p_j)$$

or (in the case of Ruspini partition, i.e., $r = 1$) to

$$\hat f(p_j) = \sum_{k=1}^{n} F_k A_k(p_j).$$


Analogous to Theorem 7.1, we can show that the inverse discrete F-transform $\hat f$ can approximate the original discrete function $f$ on $P$ with arbitrary precision [7.1]. Moreover, the properties (a)-(c) that are listed in Sect. 7.3.2 have valid discrete analogies.

An interesting comparison between the discrete F-transform and the least-squares approximation was made in [7.20]. It was demonstrated that the discrete F-transform is invariant with respect to the interpolating and least-squares approximation of the set $\{(p_j, f(p_j)) \mid j = 1,\dots,l\}$. This means that the best approximation of $f$ on $P$ in the form of $\sum_{i=1}^{n} a_i A_i$, where $n \le l$, has the same direct discrete F-transform as the original $f$.
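The discrete direct and inverse F-transforms, (7.9) and (7.10), reduce to two weighted averages. The sketch below is an illustration under stated assumptions: it uses an h-uniform triangular partition, and the helper names `triangular_basis`, `direct_ft` and `inverse_ft` are hypothetical.

```python
import numpy as np

def triangular_basis(p, a, b, n):
    """Matrix with entries A_k(p_j) for the h-uniform triangular partition of [a, b]."""
    nodes = np.linspace(a, b, n)
    h = (b - a) / (n - 1)
    return np.clip(1.0 - np.abs(p[None, :] - nodes[:, None]) / h, 0.0, None)

def direct_ft(values, A):
    """Discrete F-transform (7.9): weighted local means of the samples."""
    weights = A.sum(axis=1)
    if np.any(weights <= 0.0):
        # Some basic function covers no sample point.
        raise ValueError("P is not sufficiently dense for this partition")
    return (A @ values) / weights

def inverse_ft(F, A):
    """Inverse discrete F-transform (7.10), evaluated on the same set P."""
    return (F @ A) / A.sum(axis=0)

p = np.linspace(0.0, 6.0, 301)
f = 10.0 * np.exp(-(p - np.pi) ** 2)
A = triangular_basis(p, 0.0, 6.0, n=29)
F = direct_ft(f, A)
f_hat = inverse_ft(F, A)
max_err = np.max(np.abs(f - f_hat))
```

The density check corresponds to the requirement that every basic function is positive at some point of $P$; without it, the denominator in (7.9) could vanish.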


7.5 F-Transforms of Functions of Two Variables 


The direct and inverse F-transform of a function of two 
(and more) variables is a direct generalization of the 
case of one variable. We introduce it briefly and refer 
to [7.1] for more details. 

Suppose that the universe is a rectangle $[a,b] \times [c,d] \subseteq \mathbb{R} \times \mathbb{R}$ and that $x_1 < \dots < x_n$ are the fixed nodes of $[a,b]$ and $y_1 < \dots < y_m$ are the fixed nodes of $[c,d]$ such that $x_1 = a$, $x_n = b$, $y_1 = c$, $y_m = d$ and $n, m \ge 2$. Let us formally extend the set of nodes by setting $x_0 = a$, $y_0 = c$, $x_{n+1} = b$, and $y_{m+1} = d$. Assume that $A_1,\dots,A_n$ are the basic functions that form a generalized fuzzy partition of $[a,b]$ and $B_1,\dots,B_m$ are basic functions that form a generalized fuzzy partition of $[c,d]$. Then, the rectangle $[a,b] \times [c,d]$ is partitioned into fuzzy sets $A_k \times B_l$ with the membership functions $(A_k \times B_l)(x,y) = A_k(x) B_l(y)$, $k = 1,\dots,n$, $l = 1,\dots,m$. Let $C([a,b] \times [c,d])$ be the set of continuous functions of two variables on the domain and $f \in C([a,b] \times [c,d])$.


Definition 7.8

Let $A_1,\dots,A_n$ be the basic functions that form a generalized fuzzy partition of $[a,b]$ and $B_1,\dots,B_m$ be the basic functions that form a generalized fuzzy partition of $[c,d]$. Let $f$ be any function from $C([a,b] \times [c,d])$. We say that the $n \times m$-matrix of real numbers $F[f] = (F_{kl})_{n \times m}$ is the (integral) F-transform of $f$ with respect to $A_1,\dots,A_n$ and $B_1,\dots,B_m$ if for each $k = 1,\dots,n$, $l = 1,\dots,m$,

$$F_{kl} = \frac{\int_c^d \int_a^b f(x,y) A_k(x) B_l(y)\,dx\,dy}{\int_c^d \int_a^b A_k(x) B_l(y)\,dx\,dy}. \tag{7.11}$$

The components $F_{kl}$ (7.11) have properties (adapted to the case of two variables) similar to those listed in Sect. 7.3.1. For example, the property (f) has the following form (we assume that $A_1,\dots,A_n$ form an $h_1$-uniform Ruspini partition of $[a,b]$ and $B_1,\dots,B_m$ form an $h_2$-uniform Ruspini partition of $[c,d]$)

$$\int_c^d \int_a^b f(x,y)\,dx\,dy = \frac{h_1 h_2}{4}\left(F_{11} + F_{1m} + F_{n1} + F_{nm}\right) + \frac{h_1 h_2}{2}\left(\sum_{k=2}^{n-1} F_{k1} + \sum_{k=2}^{n-1} F_{km} + \sum_{l=2}^{m-1} F_{1l} + \sum_{l=2}^{m-1} F_{nl}\right) + h_1 h_2 \sum_{k=2}^{n-1}\sum_{l=2}^{m-1} F_{kl}.$$
In the discrete case, when an original function $f$ is known only at points $(p_i, q_j) \in [a,b] \times [c,d]$, where $i = 1,\dots,N$ and $j = 1,\dots,M$, the (discrete) F-transform of $f$ can be introduced in a manner analogous to the case of a function of one variable. This case is important for applications of the F-transform to image processing [7.8, 9, 18, 24-26].


Definition 7.9

Let a function $f$ be given at nodes $(p_i, q_j) \in [a,b] \times [c,d]$, for which $i = 1,\dots,N$ and $j = 1,\dots,M$, and $A_1,\dots,A_n$ and $B_1,\dots,B_m$, where $n < N$ and $m < M$, be the basic functions that form generalized fuzzy partitions of $[a,b]$ and $[c,d]$, respectively. Suppose that sets $P$ and $Q$ of these nodes are sufficiently dense with respect to the chosen partitions. We say that the $n \times m$-matrix of real numbers $F[f] = (F_{kl})_{nm}$ is the discrete F-transform of $f$ with respect to $A_1,\dots,A_n$ and $B_1,\dots,B_m$ if

$$F_{kl} = \frac{\sum_{j=1}^{M}\sum_{i=1}^{N} f(p_i, q_j) A_k(p_i) B_l(q_j)}{\sum_{j=1}^{M}\sum_{i=1}^{N} A_k(p_i) B_l(q_j)} \tag{7.12}$$

holds for all $k = 1,\dots,n$, $l = 1,\dots,m$.


The inverse F-transform of a function of two variables is a simple extension of (7.7). It will be given below for the continuous version of a function.


Definition 7.10

Let $A_1,\dots,A_n$ and $B_1,\dots,B_m$ be the basic functions that form generalized fuzzy partitions of $[a,b]$ and $[c,d]$, respectively. Let $f$ be a function from $C([a,b] \times [c,d])$ and $F[f]$ be the F-transform of $f$ with respect to $A_1,\dots,A_n$ and $B_1,\dots,B_m$. Then, the function $\hat f : [a,b] \times [c,d] \to \mathbb{R}$ represented by

$$\hat f(x,y) = \frac{\sum_{k=1}^{n}\sum_{l=1}^{m} F_{kl} A_k(x) B_l(y)}{\sum_{k=1}^{n}\sum_{l=1}^{m} A_k(x) B_l(y)} \tag{7.13}$$

is called the inverse F-transform.


Similar to the case of a function of one variable, we can prove that the inverse F-transform $\hat f$ can approximate the original continuous function $f$ with arbitrary precision, and the (adapted) properties (a)-(c), which are listed in Sect. 7.3.2, are fulfilled.


7.6 $F^1$-Transform


In [7.16], a higher degree F-transform was introduced for the purpose of advanced applications in time series and image processing [7.26, 27]. In this section, we give a description of the $F^1$-transform, which has working applications, and refer to [7.16] for the $F^m$-transform for which $m > 1$.

Throughout this section, we assume that $A_1,\dots,A_n$, $n \ge 2$, is an h-uniform generalized fuzzy partition of $[a,b]$ such that there exists a generating function $A_0 : [-1,1] \to [0,1]$ such that for all $k = 1,\dots,n$, $A_k$ is defined by (7.3) (the illustration is in Fig. 7.3).

Let $k$ be a fixed integer from $\{1,\dots,n\}$, and let $L_2(A_k)$ be a normed space of square-integrable functions $f : [x_{k-1}, x_{k+1}] \to \mathbb{R}$, where the norm $\|f\|_k$ is given by

$$\|f\|_k = \sqrt{\int_{x_{k-1}}^{x_{k+1}} f^2(x) A_k(x)\,dx}.$$

By $L_2(A_1,\dots,A_n)$ we denote a set of functions $f : [a,b] \to \mathbb{R}$ such that for all $k = 1,\dots,n$, $f|_{[x_{k-1},x_{k+1}]} \in L_2(A_k)$, where $f|_{[x_{k-1},x_{k+1}]}$ is the restriction of $f$ on $[x_{k-1}, x_{k+1}]$.

For any function $f$ from $L_2(A_1,\dots,A_n)$ we define the $F^1$-transform of $f$ with respect to $A_1,\dots,A_n$ as the vector of linear functions

$$F^1[f] = \left(c_{1,0} + c_{1,1}(x - x_1),\ \dots,\ c_{n,0} + c_{n,1}(x - x_n)\right), \tag{7.14}$$

where for every $k = 1,\dots,n$,

$$c_{k,0} = \frac{\int_{x_{k-1}}^{x_{k+1}} f(x) A_k(x)\,dx}{h s_0}, \qquad c_{k,1} = \frac{\int_{x_{k-1}}^{x_{k+1}} f(x)(x - x_k) A_k(x)\,dx}{\int_{x_{k-1}}^{x_{k+1}} (x - x_k)^2 A_k(x)\,dx}, \tag{7.15}$$

and

$$s_0 = \int_{-1}^{1} A_0(x)\,dx.$$

The kth component of the vector $F^1[f]$ is denoted by $F_k^1$.

The following is a list of properties of the $F^1$-transform of $f$ with respect to a generalized fuzzy partition of $[a,b]$. They are particular cases of the properties of the $F^m$-transform proved in [7.16]:


(a) Let $F_k$ and $c_{k,0} + c_{k,1}(x - x_k)$, for $k = 1,\dots,n$, be the respective kth components of $F[f]$ and $F^1[f]$. Then, $F_k = c_{k,0}$.

(b) If for all $x \in [a,b]$, $f(x) = d + cx$, then all the components of the $F^1$-transform of $d + cx$ are equal to $(d + c x_k) + c(x - x_k)$, $k = 1,\dots,n$.

(c) If $f = \alpha g + \beta h$, then $F^1[f] = \alpha F^1[g] + \beta F^1[h]$.

(d) $c_{k,0} + c_{k,1}(x - x_k) = \arg\min \|f(x) - (d + c(x - x_k))\|_k$, $k = 1,\dots,n$, where the minimum is considered over the set of functions of the form $d + c(x - x_k)$.

(e) If $f$ is four times continuously differentiable on $[a,b]$, then

$$c_{k,0} = f(x_k) + O(h^2), \qquad c_{k,1} = f'(x_k) + O(h^2), \quad k = 1,\dots,n.$$

Fig. 7.6 Function $f$ and its $F^1$-transform components $F_1^1,\dots,F_k^1,\dots,F_n^1$ (star nodes) (after [7.16])

In Fig. 7.6, we show a schematic representation of the $F^1$-transform components of a generic function $f$.

Finally, we give simplified expressions of $F^1$-transform components with respect to an h-uniform triangular fuzzy partition [7.16]. For a full triangular basic function, $\int_{x_{k-1}}^{x_{k+1}} A_k(x)\,dx = h$ and $\int_{x_{k-1}}^{x_{k+1}} (x - x_k)^2 A_k(x)\,dx = h^3/6$, so that (7.15) reduces to

$$c_{k,0} = \frac{1}{h}\int_{x_{k-1}}^{x_{k+1}} f(x) A_k(x)\,dx, \tag{7.16}$$

$$c_{k,1} = \frac{6}{h^3}\int_{x_{k-1}}^{x_{k+1}} f(x)(x - x_k) A_k(x)\,dx, \tag{7.17}$$

where $k = 1,\dots,n$.
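The coefficients in (7.15) can be evaluated numerically and checked against property (b). The following sketch is illustrative and makes several assumptions: it uses a triangular h-uniform partition, computes only the interior components $k = 2,\dots,n-1$ (whose basic functions have full symmetric support), and approximates the integrals by the trapezoidal rule; the names `trapezoid` and `f1_components` are hypothetical.

```python
import numpy as np

def trapezoid(y, x):
    """Composite trapezoidal rule for samples y on grid x."""
    return np.sum((y[1:] + y[:-1]) * np.diff(x) / 2.0)

def f1_components(f, a, b, n, per_interval=400):
    """Interior F^1-transform coefficients c_{k,0}, c_{k,1} following (7.15)."""
    h = (b - a) / (n - 1)
    nodes = a + h * np.arange(n)
    x = np.linspace(a, b, (n - 1) * per_interval + 1)
    fx = f(x)
    c0, c1 = [], []
    for xk in nodes[1:-1]:
        Ak = np.clip(1.0 - np.abs(x - xk) / h, 0.0, None)  # triangular A_k
        t = x - xk
        c0.append(trapezoid(fx * Ak, x) / trapezoid(Ak, x))
        c1.append(trapezoid(fx * t * Ak, x) / trapezoid(t * t * Ak, x))
    return nodes[1:-1], np.array(c0), np.array(c1)

# Property (b): for f(x) = d + c*x each component is (d + c*x_k) + c*(x - x_k).
nodes, c0, c1 = f1_components(lambda x: 2.0 + 3.0 * x, 0.0, 6.0, n=13)

# Property (e): c_{k,1} approximates the derivative at the node.
s_nodes, s0, s1 = f1_components(np.sin, 0.0, 6.0, n=13)
deriv_err = np.max(np.abs(s1 - np.cos(s_nodes)))
```

For the linear test function the recovered slope is the constant 3 at every interior node, and for the sine function `deriv_err` shrinks as the partition is refined.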


7.7 Applications 


In this section, we consider applications of the F-transform and $F^1$-transform to image processing.


7.7.1 Image Compression 
and Reconstruction 


A method of lossy image compression and reconstruction using fuzzy relations was proposed in [7.19]. The dominant idea was a choice of suitable granulation (represented by a fuzzy relation) of an image domain. We will refer to this method as FEQ. F-transform image compression (FTR) is based on the same idea of granulation but connects it with fuzzy partitions [7.1, 9]. In the cited papers, two approaches were proposed: a uniform fuzzy partition of the entire domain [7.1] and a two-step partition [7.9] in which first the entire domain is partitioned into blocks and second, each block is uniformly partitioned into fuzzy sets. Both approaches were compared with JPEG and other compression techniques (including FEQ) [7.9], and the conclusion was that the F-transform-based method is slightly worse than JPEG but better than FEQ. Two further improvements of the F-transform-based compression have been proposed in [7.18, 28], where an advantage over JPEG was achieved in many cases.

In this section, after reiterating the principles of 
image compression and reconstruction using the F- 
transform and its inverse, we explain how a proper 
choice of a fuzzy partition improves the quality of the 
reconstructed image. A detailed elaboration and com- 
parison with other existing techniques is in [7.18, 28] 
and will be presented in subsequent papers. 


Principles of Image Compression 

Using the F-Transform 
Let a grayscale image of size $N \times M$ pixels be represented by a function of two variables $u : N \times M \to [0,1]$. The value $u(i,j)$ represents the intensity range of each pixel in the gray scale. The problem of image compression is to reduce the image's size to save space or transmission time. A desirable size $n \times m$ (where $n < N$ and $m < M$) of a compressed image can be obtained from the compression ratio, $p = nm/(NM)$. If a compression method is lossy (JPEG, FEQ, and the F-transform, for example), then the respective reconstruction $\hat u$ to a full-size image is compared with the original image using the two quality indices PSNR (peak signal-to-noise ratio) and RMSE (root-mean-square error), where

$$\mathrm{PSNR} = 20 \log_{10} \frac{255}{\mathrm{RMSE}}$$

and

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N}\sum_{j=1}^{M} \left(u(i,j) - \hat u(i,j)\right)^2}{NM}}.$$

Simple F-Transform Compression 
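In [7.1], we proposed representing a compressed image by the $n \times m$ matrix $U$ of F-transform components

$$U = \begin{pmatrix} U_{11} & \cdots & U_{1m} \\ \vdots & \ddots & \vdots \\ U_{n1} & \cdots & U_{nm} \end{pmatrix}$$

computed over uniform fuzzy partitions (usually, triangular) $A_1,\dots,A_n$ and $B_1,\dots,B_m$ of the entire domains $[1,N]$ and $[1,M]$, respectively

$$U_{kl} = \frac{\sum_{j=1}^{M}\sum_{i=1}^{N} u(i,j) A_k(i) B_l(j)}{\sum_{j=1}^{M}\sum_{i=1}^{N} A_k(i) B_l(j)}, \quad k = 1,\dots,n,\ l = 1,\dots,m.$$

We proposed reconstructing $U$ to a full-size image using the inverse F-transform of $u$ such that

$$\hat u(i,j) = \sum_{k=1}^{n}\sum_{l=1}^{m} U_{kl} A_k(i) B_l(j).$$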


This method does not take advantage of any property 
of the original image and therefore, its quality is not 
very high. Let us illustrate it on the image Camera- 
man taken from the Corel Gallery. In Fig. 7.7, we 
show the original image and its reconstruction using the 
simple F-transform compression described above. The 
compression ratio is p = 0.25, and PSNR = 25.422 
(compare with PSNR = 38.8 for JPEG with a similar 
compression ratio). 
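The simple F-transform compression and reconstruction can be written as two matrix products. The sketch below is a minimal illustration, assuming uniform triangular partitions of $[1,N]$ and $[1,M]$ and a synthetic smooth image; the names `triangular_basis`, `compress` and `reconstruct` are hypothetical and do not refer to any published implementation.

```python
import numpy as np

def triangular_basis(size, n):
    """Matrix with entries A_k(i), i = 1..size, for n uniform triangular sets on [1, size]."""
    pts = np.arange(1, size + 1, dtype=float)
    nodes = np.linspace(1.0, float(size), n)
    h = (size - 1) / (n - 1)
    return np.clip(1.0 - np.abs(pts[None, :] - nodes[:, None]) / h, 0.0, None)

def compress(u, n, m):
    """Discrete 2D F-transform components U_kl of the image u."""
    N, M = u.shape
    A, B = triangular_basis(N, n), triangular_basis(M, m)
    numerator = A @ u @ B.T
    denominator = np.outer(A.sum(axis=1), B.sum(axis=1))
    return numerator / denominator, A, B

def reconstruct(U, A, B):
    """Inverse F-transform for Ruspini partitions: sum of U_kl * A_k(i) * B_l(j)."""
    return A.T @ U @ B

i, j = np.meshgrid(np.arange(64), np.arange(64), indexing="ij")
u = 0.5 + 0.5 * np.sin(i / 10.0) * np.cos(j / 13.0)   # synthetic test image
U, A, B = compress(u, n=32, m=32)                     # ratio nm/(NM) = 0.25
u_hat = reconstruct(U, A, B)
max_err = np.max(np.abs(u - u_hat))
```

On a smooth synthetic image the reconstruction error is small; on natural images with edges and texture it is considerably larger, which is consistent with the moderate PSNR reported above for this simple method.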


F-Transform Compression with Block 

Decomposition 
This F-transform-based compression [7.9] was inspired by the JPEG method in which, at first, the entire domain was decomposed into blocks and then, each block was compressed according to a compression ratio. In [7.9], the same principle is used. In the first step, a decomposition into blocks of the same size is performed, where the size (chosen experimentally) is such that a certain quality of approximation by the inverse F-transform should be guaranteed (Theorem 7.1). Each block is then uniformly partitioned into cosine-shaped fuzzy sets and compressed by the simple F-transform method according to a compression ratio. In comparison with the simple F-transform compression, this method considers the peculiarities of the original images when making the block decomposition. In Fig. 7.8, we show the PSNR quality measure of the image Cameraman compressed using three methods: FEQ, F-transform with block decomposition and JPEG. It is easily observed that the JPEG method is still better than the F-transform with block decomposition, whereas the latter is better than FEQ. However, for the particular image Cameraman and the compression ratio p = 0.25, the value of PSNR of the F-transform with block decomposition is similar to that of the simple F-transform compression: 25.0676 versus 25.422, respectively. This means that the uniform partition, even when applied to both steps independently, is not effective with respect to the quality estimated by PSNR. In the next subsection, we propose an F-transform compression method [7.18] that is almost nonlossy and is based on a nonuniform generalized partition adapted to each particular image.

Fig. 7.7a,b Original image Cameraman (a) and its reconstruction after applying the simple F-transform compression (b) with PSNR = 25.422

Fig. 7.8 The PSNR values of the image Cameraman compressed using three methods: FEQ, the F-transform with block decomposition, and JPEG (after [7.29])

Advanced Image Compression
If we analyze the properties of the F-transform (Sect. 7.3.1), then it is immediate from (a) that the more the function behaves like a constant, the better is the approximation quality of the inverse F-transform. Thus, the following recommendation regarding the choice of a proper generalized fuzzy partition can be made:

• A generalized fuzzy partition of the domain $[1,N] \times [1,M]$ into fuzzy sets $A_k \times B_l$, where $k = 1,\dots,n$ and $l = 1,\dots,m$, should guarantee that the difference between extremal values of the image over each $A_k \times B_l$ is not greater than $\varepsilon > 0$ or (if the preceding condition cannot be fulfilled) the area of $A_k \times B_l$ is not greater than $\delta > 0$.

Fig. 7.9 The quad tree algorithm and the generalized fuzzy partition on its base


Fig. 7.10a,b Two reconstructions of the image Cameraman after 
applying the advanced F-transform compression (the ratio is 0.188) 
with the histogram restoring (a) and without it (b). The PSNR val- 
ues are 29 (a) and 30 (b) 


There are several algorithms that can produce a gen- 
eralized fuzzy partition with the mentioned property. 
In [7.18], we used the quad tree algorithm for this pur- 
pose; see the illustration in Fig. 7.9. 
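The recommendation above suggests a recursive splitting of the image domain. The sketch below shows only a plain quad tree decomposition into crisp rectangular blocks that satisfy the $\varepsilon$/$\delta$ stopping rule; how [7.18] builds the fuzzy sets on top of these blocks is not reproduced here, and the function name `quadtree_blocks` and its parameters are assumptions for illustration.

```python
import numpy as np

def quadtree_blocks(u, eps, delta):
    """Split the image into blocks (r0, r1, c0, c1), half-open in rows and columns.

    A block is accepted when the difference between its extremal values is
    at most eps, or when its area is at most delta; otherwise it is split
    into up to four sub-blocks.
    """
    blocks = []
    stack = [(0, u.shape[0], 0, u.shape[1])]
    while stack:
        r0, r1, c0, c1 = stack.pop()
        block = u[r0:r1, c0:c1]
        area = (r1 - r0) * (c1 - c0)
        if block.max() - block.min() <= eps or area <= delta:
            blocks.append((r0, r1, c0, c1))
            continue
        rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
        # A side of length 1 cannot be halved, so it is kept whole.
        rows = [(r0, r1)] if r1 - r0 < 2 else [(r0, rm), (rm, r1)]
        cols = [(c0, c1)] if c1 - c0 < 2 else [(c0, cm), (cm, c1)]
        stack.extend((ra, rb, ca, cb) for ra, rb in rows for ca, cb in cols)
    return blocks

u = np.zeros((8, 8))
u[:4, :4] = 1.0                     # one bright quadrant
blocks = quadtree_blocks(u, eps=0.1, delta=1)
covered = sum((r1 - r0) * (c1 - c0) for r0, r1, c0, c1 in blocks)
```

Nearly constant regions end up in large blocks, while regions with strong intensity variation are split further, which is the behaviour the recommendation asks for.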

Let us add that the advanced image compression al- 
gorithm [7.18] uses the following two tricks to increase 
the quality of the reconstructed image: 


• Preserve sharp edges
• Restore the histogram of the original image.


Figure 7.10 shows how the histogram restoration 
influences the quality of the reconstructed image. In 
Fig. 7.11, we see that the PSNR values of the advanced 
F-transform and the JPEG are almost equal.


7.7.2 Image Fusion


Image fusion aims to integrate complementary distorted 
multisensor, multitemporal, and/or multiview scenes 
into one new image that contains the best parts of each 
scene. Thus, the primary problem in image fusion is to 
find the least distorted scene for every pixel. 

A local focus measure is traditionally used for the 
selection of an undistorted scene. The scene that maxi- 
mizes the focus measure is selected. Usually, the focus 
measure is a measure of high-frequency occurrences 
in the image spectrum. This measure is used when 
a source of distortion is connected with blurring, which 
suppresses high frequencies in an image. In this case, 
it is desirable that a focus measure decreases with an 
increase in blurring. 

There are various fusion methodologies that are cur- 
rently in use. They can be classified according to the 
primary technique: aggregation operators [7.22], fuzzy 
methods [7.30], optimization methods (e.g., neural net- 
works and genetic algorithms [7.29]), and multiscale 
decomposition methods based on various transforms 
(e.g., discrete wavelet transforms; [7.31]). 

The F-transform approach to image fusion was proposed in [7.32, 33]. The primary idea is a combination of (at least) two fusion operators, both of which are based on the F-transform. The first fusion operator is applied to the F-transform components of scenes and is based on a robust partition of the scene domain. The second fusion operator is applied to the residuals of scenes with respect to inverse F-transforms with fused components and is based on a finer partition of the same domain. Although this approach is not explicitly based on focus measures, it uses the fusion operator, which is able to choose an undistorted scene among the available blurred scenes.

Fig. 7.11 The PSNR values of the image Cameraman compressed using four methods: FEQ, the F-transform with block decomposition, the advanced F-transform, and JPEG


Principles of Image Fusion 

Using the F-Transform 
In this subsection, we present a short overview of the two methods of fusion that were proposed in [7.32, 33] and introduce a new method [7.34] that is a weighted combination of those two. We will demonstrate that the new method is computationally more effective than the first two.

The F-transform fusion is based on a certain decomposition of an image. We assume that the image $u$ is a discrete real function $u = u(x,y)$ defined on the $N \times M$ array of pixels $P = \{(i,j) \mid i = 1,\dots,N,\ j = 1,\dots,M\}$ such that $u : P \to \mathbb{R}$. Moreover, let fuzzy sets $A_1,\dots,A_n$ and $B_1,\dots,B_m$, where $2 \le n \le N$, $2 \le m \le M$, establish uniform Ruspini partitions of $[1,N]$ and $[1,M]$, respectively. We begin with the following representation of $u$ on $P$,

$$u(x,y) = u_{nm}(x,y) + e(x,y), \tag{7.18}$$

$$e(x,y) = u(x,y) - u_{nm}(x,y), \tag{7.19}$$

where $u_{nm}$ is the inverse F-transform of $u$ and $e$ is the respective first difference. If we replace $e$ in (7.18) by its inverse F-transform $e_{NM}$ with respect to the finest partition of $[1,N] \times [1,M]$, the above representation can then be rewritten as follows,

$$u(x,y) = u_{nm}(x,y) + e_{NM}(x,y). \tag{7.20}$$

We call (7.20) a one-level decomposition of $u$ on $P$. If $u$ is smooth, then the function $e_{NM}$ is small (this claim follows from the property (e) in Sect. 7.3.1), and we can stop at this level. In the opposite case, we continue with the decomposition of the first difference $e$ in (7.18). We decompose $e$ into its inverse F-transform $e_{n'm'}$ (with respect to a finer fuzzy partition of $[1,N] \times [1,M]$ with $n' : n < n' \le N$ and $m' : m < m' \le M$ basic functions) and the second difference $e'$. Thus, we obtain the second-level decomposition of $u$ on $P$

$$u(x,y) = u_{nm}(x,y) + e_{n'm'}(x,y) + e'(x,y),$$

$$e'(x,y) = e(x,y) - e_{n'm'}(x,y).$$

In the same manner, we can obtain a higher level decomposition of $u$ on $P$

$$u(x,y) = u_{n_1 m_1}(x,y) + e^{(1)}_{n_2 m_2}(x,y) + \dots + e^{(k-2)}_{n_{k-1} m_{k-1}}(x,y) + e^{(k-1)}(x,y), \tag{7.21}$$

where

$$0 < n_1 \le n_2 \le \dots \le n_{k-1} \le N,$$

$$0 < m_1 \le m_2 \le \dots \le m_{k-1} \le M,$$

$$e^{(1)}(x,y) = u(x,y) - u_{n_1 m_1}(x,y),$$

$$e^{(i)}(x,y) = e^{(i-1)}(x,y) - e^{(i-1)}_{n_i m_i}(x,y), \quad i = 2,\dots,k-1. \tag{7.22}$$

Three Algorithms for Image Fusion 
In [7.33], we proposed two algorithms: 


1. The simple F-transform-based fusion algorithm 
(SA) and 

2. The complete F-transform-based fusion algorithm 
(CA). 


The principal role in the fusion algorithms CA and SA is played by the fusion operator $\kappa : \mathbb{R}^K \to \mathbb{R}$, which is defined as follows:

$$\kappa(x_1,\dots,x_K) = x_p, \quad \text{if } |x_p| = \max(|x_1|,\dots,|x_K|). \tag{7.23}$$
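The fusion operator simply returns the argument of largest magnitude. A minimal sketch follows; the tie-breaking rule (the first maximal argument wins) is an assumption, since the definition does not say which argument is chosen when several have the same absolute value.

```python
def kappa(*values):
    """Fusion operator (7.23): the argument with the largest absolute value.

    On ties, the first such argument is returned (an assumption of this sketch).
    """
    return max(values, key=abs)

selected = kappa(1.0, -3.0, 2.0)   # -3.0 has the largest magnitude
tied = kappa(2.0, -2.0)            # tie: the first argument is kept
```

Applied to F-transform components of several channel images, this operator keeps, at each position, the component with the strongest response.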


The Simple F-Transform-Based Fusion Algorithm
In this subsection, we present a block description of the SA without technical details, which can be found in [7.33]. We assume that $K \ge 2$ input (channel) images $c_1, \dots, c_K$ with various types of degradation are given. Our aim is to recognize undistorted parts in the given images and to fuse them into one image. The algorithm is based on the decompositions given in (7.20), which are applied to each channel image:


1. Choose values $n$ and $m$ such that $2 \le n \le N$, $2 \le m \le M$ and create a fuzzy partition of $[1, N] \times [1, M]$ by fuzzy sets $A_k \times B_l$, where $k = 1, \dots, n$ and $l = 1, \dots, m$.

2. Decompose the input images $c_1, \dots, c_K$ into inverse F-transforms and error functions according to the one-level decomposition (7.20).

3. Apply the fusion operator (7.23) to the respective F-transform components of $c_1, \dots, c_K$, and obtain the fused F-transform components of a new image.

4. Apply the fusion operator to the respective F-transform components of the error functions $e_i$, $i = 1, \dots, K$, and obtain the fused F-transform components of a new error function.

5. Reconstruct the fused image from the inverse F-transforms with the fused components of the new image and the fused components of the new error function.
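
The five steps can be sketched in Python with NumPy as follows. This is a minimal sketch under stated assumptions, not the implementation of [7.33]: all function names are hypothetical, the partition is uniform and triangular, pixels are indexed from 0 rather than 1, and the error functions are transformed with respect to the finest partition, for which the F-transform components coincide with the pixel values.

```python
import numpy as np

def triangular_partition(size, n):
    """Row k holds the k-th uniform triangular basic function sampled at pixels 0..size-1."""
    nodes = np.linspace(0.0, size - 1.0, n)
    h = nodes[1] - nodes[0]
    pixels = np.arange(size, dtype=float)
    return np.clip(1.0 - np.abs(pixels[None, :] - nodes[:, None]) / h, 0.0, None)

def direct_ft(u, A, B):
    """Direct F-transform: weighted local means of u over A_k x B_l."""
    return (A @ u @ B.T) / np.outer(A.sum(axis=1), B.sum(axis=1))

def inverse_ft(F, A, B):
    """Inverse F-transform: components recombined with the basic functions."""
    return A.T @ F @ B

def fuse(stack):
    """Operator (7.23), elementwise over the first axis."""
    stack = np.asarray(stack, dtype=float)
    winner = np.abs(stack).argmax(axis=0)
    return np.take_along_axis(stack, winner[None, ...], axis=0)[0]

def simple_fusion(images, n, m):
    images = [np.asarray(c, dtype=float) for c in images]
    N, M = images[0].shape
    A, B = triangular_partition(N, n), triangular_partition(M, m)   # step 1
    comps = [direct_ft(c, A, B) for c in images]                    # step 2
    errors = [c - inverse_ft(F, A, B) for c, F in zip(images, comps)]
    fused_comps = fuse(comps)                                       # step 3
    fused_error = fuse(errors)                                      # step 4 (finest partition)
    return inverse_ft(fused_comps, A, B) + fused_error              # step 5
```

A quick sanity check on this sketch is that fusing several copies of the same image returns that image unchanged, because the decomposition (7.20) is exact.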


The SA-based fusion is very efficient if we can guess values $n$ and $m$ that characterize a proper fuzzy partition. Usually, this is performed manually according to the user's skills. The dependence on fuzzy partition parameters can be considered as a primary shortcoming of this otherwise effective algorithm. Two recommendations follow from our experience:


• For complex images (with many small details), higher values of $n$ and $m$ yield better results.

• If a triangular shape for a basic function is chosen, then the generic choice of $n$ and $m$ is such that the corresponding values of $n_p$ and $m_p$ are equal to 3 (recall that $n_p$ is the number of points that are covered by every full basic function $A_k$).


The Complete F-Transform-Based Fusion Algorithm
The CA-based fusion does not depend on the choice of only one fuzzy partition (as in the case of the SA) because it runs through a sequence (7.22) of increasing values of $n$ and $m$. The algorithm is based on the decomposition presented in (7.21), which is applied to each channel image. The description of the CA is similar to that of the SA except for step 4, which is repeated in a cycle. Therefore, the quality of fusion is high, but the implementation of the CA is rather slow and memory consuming, especially for large images. For an illustration, see Fig. 7.12 and Tables 7.1 and 7.2.


Table 7.1 Basic characteristics of the three algorithms applied to the image Balls


Image   Resolution     Time (s)               Memory (MB)
                       CA     SA     ESA      CA     SA     ESA
Balls   1600 × 1200    340    1.2    36       270    St3    sy)


Table 7.2 MSE (mean-square error) and PSNR characteristics of the three fusion methods applied to the image Balls


Image set   MSE                      PSNR
            CA     SA     ESA        CA      SA      ESA
Balls       1.28   6.03   0.86       48.91   43.81   52.57
Enhanced Simple Fusion Algorithm
In [7.34], we proposed an algorithm that is as fast as the SA and as efficient as the CA. We aimed at achieving the following goals:


• Avoid running through a long sequence of possible partitions (as in the case of CA).

• Automatically adjust the parameters of the fusion algorithm according to the level of blurring and the location of blurred areas in input images.


The algorithm adds another run of the F-transform over the first difference (7.18). The explanation is as follows: the first run of the F-transform is aimed at edge detection in each input image, whereas the second run propagates only sharp edges (and their local areas) to the fused image. We refer to this algorithm as the enhanced simple algorithm (ESA) and give its informal description:

for all input (channel) images do
    Compute the inverse F-transform
    Compute the first absolute difference between the original image and its inverse F-transform
    Compute the second absolute difference between the first one and its inverse F-transform, and set it as the pixel weights
end for
for all pixels in an image do
    Compute the value of sow, the sum of the weights over all input images
    for all input images do
        Compute the value of wr, the ratio between the weight of the current pixel and sow
    end for
    Compute the fused value of the pixel in the resulting image as a weighted (by wr) sum of input image values
end for
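
The informal description above can be turned into a short vectorized Python sketch. It is a hedged illustration, not the code of [7.34]: the helper names are hypothetical, a uniform triangular partition with 0-based pixel indices is assumed, and pixels where all weights vanish are given equal weights, a fallback that the informal description does not specify.

```python
import numpy as np

def triangular_partition(size, n):
    """Row k holds the k-th uniform triangular basic function sampled at pixels 0..size-1."""
    nodes = np.linspace(0.0, size - 1.0, n)
    h = nodes[1] - nodes[0]
    pixels = np.arange(size, dtype=float)
    return np.clip(1.0 - np.abs(pixels[None, :] - nodes[:, None]) / h, 0.0, None)

def smooth(u, A, B):
    """Inverse F-transform of u (direct transform followed by reconstruction)."""
    F = (A @ u @ B.T) / np.outer(A.sum(axis=1), B.sum(axis=1))
    return A.T @ F @ B

def esa_fusion(images, n, m, eps=1e-12):
    images = np.stack([np.asarray(c, dtype=float) for c in images])
    _, N, M = images.shape
    A, B = triangular_partition(N, n), triangular_partition(M, m)
    weights = []
    for c in images:
        first = np.abs(c - smooth(c, A, B))               # first absolute difference
        second = np.abs(first - smooth(first, A, B))      # second absolute difference
        weights.append(second)                            # used as pixel weights
    W = np.stack(weights)
    sow = W.sum(axis=0)                                   # sum of weights per pixel
    # wr: ratio weight / sow; equal weights where sow is (numerically) zero (assumption)
    wr = np.where(sow > eps, W / np.maximum(sow, eps), 1.0 / len(images))
    return (wr * images).sum(axis=0)                      # weighted sum of input values
```

Because the ratios wr sum to one at every pixel, each fused value is a convex combination of the input values at that pixel.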


The primary advantages of the ESA are:


• Time: the execution time is smaller than for the CA (Table 7.1).

• Quality: the quality of the ESA fusion is better than that of the SA and, for particular cases (Table 7.2), it is better than that of the CA.


Fig. 7.12a-c The SA (a), CA (b) and ESA (c) fusions of the image Balls. The ESA fusion has the best quality (Table 7.2)


Because of space limitations, we present only one 
illustration of the F-transform fusion performed using 
the three algorithms, SA, CA, and ESA. We chose the 
image Balls with geometric figures to demonstrate that 
our fusion methods are able to reconstruct edges. In 
Fig. 7.13, two (channel) inputs of the image Balls are 
given, and in Fig. 7.12, three fusions of the same image 
are demonstrated. 

In Table 7.1, we demonstrate that the complexity 
(measured by the execution time or by the memory 
used) of the ESA is greater than the complexity of the 
SA and less than the complexity of the CA. 

In Table 7.2, we demonstrate that for the particular 
image Balls, the quality of fusion (measured by the val- 
ues of MSE and PSNR) of the ESA result is better (the 
MSE value is smaller) than the quality of the SA result 
and even than the quality of the CA result. 
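
For reference, the two quality measures can be computed as in the following Python sketch. The function names are hypothetical, and the peak value of 255 (8-bit gray levels) is an assumption; the text does not state which peak value was used for Table 7.2.

```python
import numpy as np

def mse(reference, fused):
    """Mean-square error between a reference image and a fused image."""
    r = np.asarray(reference, dtype=float)
    f = np.asarray(fused, dtype=float)
    return float(np.mean((r - f) ** 2))

def psnr(reference, fused, peak=255.0):
    """Peak signal-to-noise ratio in dB; `peak` is the assumed maximal gray level."""
    err = mse(reference, fused)
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / err))
```

A smaller MSE always gives a larger PSNR for a fixed peak value, which is why the two columns of Table 7.2 rank the methods in the same order.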


7.7.3 $F^1$-Transform Edge Detector


Edge detection is an essential part of image processing. In particular, it is a first step in feature extraction and image segmentation. We focused on the Canny edge detector [7.35], which is widely used in computer vision. It was developed to ensure three basic criteria: good detection, good localization, and minimal response. In these aspects, the Canny detector can be considered an optimal edge detector. In [7.26], we proposed using the $F^1$-transform with the purpose of simplifying the first two steps of the Canny algorithm. Below, we provide the details of our proposal.


Fig. 7.13a,b Two inputs for the image Balls. The central ball is blurred in (a), and conversely, it is the only sharp ball in (b)


Fig. 7.14a-d Original images (a,c) and their $F^1$-transform edges (b,d)

The Canny algorithm is a multistep procedure for 
detecting edges as the local maxima of the gradient 
magnitude. The first step, performed using a Gaus- 
sian filter, is image smoothing and filtering noise. 
The second step is computation of a gradient of the 
image function to find the local maxima of the gra- 
dient magnitude and the gradient’s direction at each 
point. This step is performed using a convolution of 
the original image with directional masks (edge de- 
tection operators, such as those of Roberts, Prewitt, 
and Sobel, are some examples of these filters). The 
next step is called nonmaximum suppression [7.36], 
and it selects those points whose gradient magnitudes 
are maximal in the corresponding gradient direction. 
The final step is tracing edges and hysteresis thresh- 
olding, which leads to preserving the continuity of 
edges. 

In our experiment, we removed the first two steps in the Canny algorithm and replaced them by computation of approximate gradient values using the $F^1$-transform. The reason is that the $F^1$-transform (similar to the ordinary F-transform) filters out noise when computing approximate values of the first partial derivatives given by (7.15). We assume that the image is represented by a discrete function $u : P \to \mathbb{R}$ of two variables, where $P = \{(i, j) \mid i = 1, \dots, N,\ j = 1, \dots, M\}$ is an $N \times M$ array of pixels, and the fuzzy sets $A_1, \dots, A_n$ and $B_1, \dots, B_m$ establish a uniform triangular fuzzy partition of $[1, N]$ and $[1, M]$, respectively.

Let $x_1, \dots, x_n \in [1, N]$ and $y_1, \dots, y_m \in [1, M]$ be the $h_x$- and $h_y$-equidistant nodes of $[1, N]$ and $[1, M]$, respectively.

According to property (e) in Sect. 7.6, the coefficients $c^{(1)}_{k,l}$ and $c^{(2)}_{k,l}$ of the linear polynomials of the $F^1$-transform components are approximate values of the first partial derivatives of the image function at nodes $(x_k, y_l)$ (for simplicity, we assume $k = 2, \dots, n-1$ and $l = 2, \dots, m-1$), where by (7.17) and (7.5) the following hold,

$$c^{(1)}_{k,l} = \frac{12}{h_x^3 h_y} \sum_{i=1}^{N} \sum_{j=1}^{M} u(i, j)\,(i - x_k)\,A_k(i) B_l(j) ,$$

$$c^{(2)}_{k,l} = \frac{12}{h_x h_y^3} \sum_{i=1}^{N} \sum_{j=1}^{M} u(i, j)\,(j - y_l)\,A_k(i) B_l(j) .$$


Then, we can write approximations of the first partial derivatives as the respective inverse F-transforms.
All other steps of the Canny algorithm (namely, finding the local maxima of the gradient magnitude and its direction, nonmaximum suppression, tracing edges through the image, and hysteresis thresholding) are the same as in the original procedure.
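
The gradient step that replaces smoothing and directional masks can be sketched in Python as follows. The sketch makes several assumptions that go beyond the text: the names are hypothetical, the coefficients are normalized by the discrete weighted sum of squared offsets (the weighted least-squares form of the linear coefficient) rather than by a closed-form constant, the node spacing is at least two pixels, and the derivative images are obtained by recombining the coefficients with the basic functions.

```python
import numpy as np

def triangular_partition(size, n):
    """Basic functions (rows) on pixels 1..size and the offsets (pixel - node)."""
    nodes = np.linspace(1.0, float(size), n)
    h = nodes[1] - nodes[0]
    pixels = np.arange(1, size + 1, dtype=float)
    offsets = pixels[None, :] - nodes[:, None]
    return np.clip(1.0 - np.abs(offsets) / h, 0.0, None), offsets

def f1_gradient(u, n, m):
    """Approximate partial derivatives of the image u from F^1-type coefficients."""
    u = np.asarray(u, dtype=float)
    A, dx = triangular_partition(u.shape[0], n)
    B, dy = triangular_partition(u.shape[1], m)
    # slope coefficients: weighted correlation of u with the offset, normalized
    c1 = ((A * dx) @ u @ B.T) / np.outer((A * dx ** 2).sum(axis=1), B.sum(axis=1))
    c2 = (A @ u @ (B * dy).T) / np.outer(A.sum(axis=1), (B * dy ** 2).sum(axis=1))
    gx = A.T @ c1 @ B          # derivative image along the first index
    gy = A.T @ c2 @ B          # derivative image along the second index
    return c1, c2, gx, gy

# The gradient magnitude and direction then feed the remaining Canny steps.
i, j = np.meshgrid(np.arange(1, 14), np.arange(1, 14), indexing="ij")
c1, c2, gx, gy = f1_gradient(2.0 * i + 3.0 * j, 5, 5)
magnitude = np.hypot(gx, gy)
direction = np.arctan2(gy, gx)
```

For the linear test image used here the coefficients at the inner nodes reproduce the slopes 2 and 3, which matches the role of the coefficients as derivative estimates.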

In the two examples in Fig. 7.14, we demonstrate the results of the $F^1$-transform edge detector on images chosen from the dataset available at ftp://figment.csee.usf.edu/pub/ROC/edge_comparison_dataset.tar.gz.

We observe that many thin edges/lines are detected and that their connectedness and smoothness are preserved. Moreover, the following properties are retained:


• Smoothness of circular lines
• Concentricity of circles
• Smoothness of sharp connections.
7.8 Conclusions


In this chapter, the theory of the F-transform has been discussed from the perspective of the latest developments and applications. The importance of a proper choice of fuzzy partition has been stressed. Various fuzzy partitions have been considered, including the most general partition (currently known). The definition of the F-transform has been adapted to the generalized fuzzy partition, and the main properties of the F-transform have been re-established. The applications to image processing, namely image compression, fusion and edge detection, have been discussed with sufficient technical details.
8. Fuzzy Linear Programming and Duality


Jaroslav Ramik, Milan Vlach 


The chapter is concerned with linear programming 
problems whose input data may be fuzzy while 
the values of variables are always real numbers. We 
propose a rather general approach to these types of 
problems, and present recent results for problems 
in which the notions of feasibility and optimality 
are based on the fuzzy relations of possibility and 
necessity. Special attention is devoted to the weak 
and strong duality. 


Formulation of an abstract model applicable to a com- 
plex decision problem usually involves a tradeoff be- 
tween the accuracy of the problem description and 
the tractability of the resulting model. One of the 
widespread models of decision problems is based on 
the assumption of linearity of constraints and optimiza- 
tion criteria, in spite of the fact that, in most instances 
of real decision problems, not all constraints and opti- 
mization criteria are linear. Fortunately, in many such 
cases, solutions of decision problems obtained through 
linear programming are exact or numerically tractable 
approximations. Given the practical relevance of lin- 
ear programming, it is not surprising that attempts to 
extend linear programming theory to problems involv- 
ing fuzzy data have been appearing since the early 
days of fuzzy sets. To obtain a meaningful extension of linear programming to problems involving fuzzy data, one has to specify a suitable class of permitted fuzzy numbers, introduce fundamental arithmetic operations with such fuzzy numbers, define inequalities between fuzzy numbers, and clarify the meaning of feasibility and optimality. Because this can be done in many different ways, we can hardly expect a unique extension that would be as clean and clear as the theory of linear programming without fuzzy data. Instead, there exist several variants of the theory for fuzzy linear programming, the results of which resemble in various degrees some of the useful results established in the conventional linear programming.

Certainly, the most influential papers for the early 
development of optimization theory for problems with 
fuzzy data were papers written by Bellman and 
Zadeh [8.1], and Zimmermann [8.2]. As pointed out 
in a recent paper by Dubois [8.3], fuzzy optimization 
that is based on the Bellman and Zadeh, and Zimmermann ideas comes down to max-min bottleneck
optimization. Thus, strictly speaking, the fuzzy linear
programming problems are not necessarily linear in the 
standard sense. 

Throughout the chapter, we assume that some or all 
of the input data defining the problem may be fuzzy 
while the values of variables are always real num- 
bers. For problems with fuzzy decision variables, see 
e.g. [8.4]. Moreover, we do not always satisfy the requirement
of the symmetric model of [8.1, 2], which demands
same way. In general, we take into consideration the 
fact that in many situations, in practice, the degree of 
feasibility may be essentially different from the degree 
of optimality attainment. 

The structure of the chapter is briefly described as follows. In the next section, we first recall the basic results of the conventional linear programming, especially the results on duality. As a canonical problem we consider the problem of the form: Given real numbers $b_1, b_2, \dots, b_m$, $c_1, c_2, \dots, c_n$, $a_{11}, a_{12}, \dots, a_{mn}$,

$$
\begin{aligned}
&\text{maximize} && c_1 x_1 + c_2 x_2 + \cdots + c_n x_n \\
&\text{subject to} && a_{i1} x_1 + a_{i2} x_2 + \cdots + a_{in} x_n \le b_i , \quad i = 1, 2, \dots, m , \\
& && x_j \ge 0 , \quad j = 1, 2, \dots, n .
\end{aligned}
$$

Then we review the basic notions and terminology of fuzzy set theory, which we need for precise formulation and description of results of linear programming problems involving fuzzy data. After these necessary preliminaries, in Sect. 8.2, we introduce and study fuzzy linear programming problems. We focus attention on the analogous canonical form, namely on the following problem

$$
\begin{aligned}
&\text{maximize} && \tilde{c}_1 x_1 + \cdots + \tilde{c}_n x_n \\
&\text{subject to} && \tilde{a}_{i1} x_1 + \cdots + \tilde{a}_{in} x_n \; P \; \tilde{b}_i , \quad i = 1, 2, \dots, m , \\
& && x_j \ge 0 , \quad j = 1, 2, \dots, n ,
\end{aligned}
$$

where $\tilde{c}_j$, $\tilde{a}_{ij}$, and $\tilde{b}_i$ are fuzzy quantities and the meanings of subject to and maximize are based on the standard possibility and necessity relations introduced in [8.5]. The final section is devoted to duality theory for fuzzy linear programming problems. First, we recall some of the early approaches that are based on the ideas of Bellman and Zadeh [8.1], and Zimmermann [8.2]. Then we present recent results of Ramik [8.6, 7].


8.1 Preliminaries


8.1.1 Linear Programming


Linear programming is concerned with optimization problems whose objective functions are linear in the unknowns and whose constraints are linear inequalities or linear equalities in the unknowns. The form of a linear programming problem may differ from one problem to another but, fortunately, there are several standard forms to which any linear programming problem can be transformed. We shall use the following canonical form.


Given real numbers $b_1, b_2, \dots, b_m$, $c_1, c_2, \dots, c_n$, $a_{11}, a_{12}, \dots, a_{mn}$,

$$
\begin{align}
&\text{maximize} && c_1 x_1 + c_2 x_2 + \cdots + c_n x_n \tag{8.1} \\
&\text{subject to} && a_{i1} x_1 + a_{i2} x_2 + \cdots + a_{in} x_n \le b_i , \quad i = 1, 2, \dots, m , \tag{8.2} \\
& && x_j \ge 0 , \quad j = 1, 2, \dots, n . \tag{8.3}
\end{align}
$$


The set of all n-tuples $(x_1, x_2, \dots, x_n)$ of real numbers that simultaneously satisfy inequalities (8.2) and (8.3) is called the feasible region of problem (8.1)-(8.3) and the elements of the feasible region are called feasible solutions. A feasible solution $\hat{x}$ such that no other feasible solution $x$ satisfies

$$c_1 x_1 + c_2 x_2 + \cdots + c_n x_n > c_1 \hat{x}_1 + c_2 \hat{x}_2 + \cdots + c_n \hat{x}_n$$

is called an optimal solution of (8.1)-(8.3), and the set of all optimal solutions is called the optimal region.

Using the same data $b_1, b_2, \dots, b_m$, $c_1, c_2, \dots, c_n$, $a_{11}, a_{12}, \dots, a_{mn}$, we can associate with problem (8.1)-(8.3) another linear programming problem, namely, the problem

$$
\begin{align}
&\text{minimize} && y_1 b_1 + y_2 b_2 + \cdots + y_m b_m \tag{8.4} \\
&\text{subject to} && y_1 a_{1j} + y_2 a_{2j} + \cdots + y_m a_{mj} \ge c_j , \quad j = 1, 2, \dots, n , \tag{8.5} \\
& && y_i \ge 0 , \quad i = 1, 2, \dots, m . \tag{8.6}
\end{align}
$$


Analogously to the case of maximization, we say that the set of all m-tuples $(y_1, y_2, \dots, y_m)$ of real numbers that simultaneously satisfy inequalities (8.5) and (8.6) is the feasible region of problem (8.4)-(8.6), and that an element $\hat{y}$ of the feasible region such that no other element $y$ of the feasible region satisfies

$$b_1 y_1 + b_2 y_2 + \cdots + b_m y_m < b_1 \hat{y}_1 + b_2 \hat{y}_2 + \cdots + b_m \hat{y}_m$$

is an optimal solution of (8.4)-(8.6).

The problem (8.1)-(8.3) is then called the primal problem and the associated problem (8.4)-(8.6) is called the dual problem to (8.1)-(8.3). However, this terminology is relative because if we rewrite the dual problem into the form of the equivalent primal problem and again construct the corresponding dual, then we obtain a linear programming problem which is equivalent to the original primal problem. In other words, the dual to the dual is the primal. Consequently, it is just a matter of convenience which of these problems is taken as the primal problem.

Let $\mathbb{R}^n$ and $\mathbb{R}^n_+$ denote the set of real n-vectors and real nonnegative n-vectors equipped with the usual Euclidean distance. For $n = 1$, we simplify the notation to $\mathbb{R}$ and $\mathbb{R}_+$. The scalar product of vectors $x$ and $y$ from $\mathbb{R}^n$ is denoted by $xy$.

The main theoretical results on linear programming are concerned with the mutual relationship between the primal problem and its dual problem. They can be summarized as follows, see also [8.8, 9]:


1. If $x$ is a feasible solution of the primal problem and if $y$ is a feasible solution of the dual problem, then $cx \le yb$.

2. If $x$ is a feasible solution of the primal problem, and if $y$ is a feasible solution of the dual problem, and if $cx = yb$, then $x$ is optimal for the primal problem and $y$ is optimal for the dual problem.

3. If the feasible region of the primal problem is nonempty and the objective function $x \mapsto cx$ is not bounded above on it, then the feasible region of the dual problem is empty.

4. If the feasible region of the dual problem is nonempty and the objective function $y \mapsto yb$ is not bounded below on it, then the feasible region of the primal problem is empty.


It turns out that the following deeper results concerning the mutual relation between the primal and dual problems hold:


5. If either of the problems (8.1)-(8.3) or (8.4)-(8.6) has an optimal solution, so does the other, and the corresponding values of the objective functions are equal.

6. If both problems (8.1)-(8.3) and (8.4)-(8.6) have feasible solutions, then both of them have optimal solutions and the corresponding optimal values are equal.

7. A necessary and sufficient condition that feasible solutions $x$ and $y$ of the primal and dual problems are optimal is that

$$
\begin{aligned}
x_j > 0 &\;\Rightarrow\; y A^j = c_j , & 1 \le j \le n , \\
x_j = 0 &\;\Leftarrow\; y A^j > c_j , & 1 \le j \le n , \\
y_i > 0 &\;\Rightarrow\; A_i x = b_i , & 1 \le i \le m , \\
y_i = 0 &\;\Leftarrow\; A_i x < b_i , & 1 \le i \le m ,
\end{aligned}
$$

where $A^j$ and $A_i$ stand for the j-th column and i-th row of $A = \{a_{ij}\}$, respectively.
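
Results 1, 5, and 7 can be checked numerically on a tiny instance. The Python sketch below is an illustration under stated assumptions: the data $c$, $A$, $b$ are made up for the example, the function name is hypothetical, and the solver simply enumerates vertices, which is adequate only for very small problems that are known to have an optimal solution.

```python
import itertools
import numpy as np

def solve_canonical_lp(c, A, b):
    """Maximize c.x subject to A x <= b, x >= 0 by enumerating vertices.

    Assumes the problem has an optimal solution; not suitable for large problems.
    """
    c, A, b = (np.asarray(v, dtype=float) for v in (c, A, b))
    m, n = A.shape
    G = np.vstack([A, -np.eye(n)])            # all constraints written as G x <= g
    g = np.concatenate([b, np.zeros(n)])
    best_x, best_val = None, -np.inf
    for rows in itertools.combinations(range(m + n), n):
        sub = G[list(rows)]
        if abs(np.linalg.det(sub)) < 1e-12:   # constraints do not define a point
            continue
        x = np.linalg.solve(sub, g[list(rows)])
        if np.all(G @ x <= g + 1e-9) and c @ x > best_val:
            best_x, best_val = x, float(c @ x)
    return best_x, best_val

# Example data (an assumption of this sketch).
c = np.array([3.0, 2.0])
A = np.array([[1.0, 1.0], [1.0, 3.0]])
b = np.array([4.0, 6.0])

x_opt, primal_value = solve_canonical_lp(c, A, b)
# Dual (8.4)-(8.6): min y.b s.t. y A >= c, y >= 0, rewritten as a maximization.
y_opt, neg_dual_value = solve_canonical_lp(-b, -A.T, -c)
dual_value = -neg_dual_value

# Complementary slackness of result 7, checked as products that must vanish.
slack_primal = b - A @ x_opt          # b_i - A_i x
slack_dual = y_opt @ A - c            # y A^j - c_j
complementary = np.allclose(y_opt * slack_primal, 0) and np.allclose(x_opt * slack_dual, 0)
```

For this instance the sketch finds $x = (4, 0)$ and $y = (3, 0)$ with the common optimal value 12.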


It is also well known that the essential duality results of linear programming can be expressed as a saddle-point property of the Lagrangian function, see [8.10]:


8. Let $L : \mathbb{R}^n_+ \times \mathbb{R}^m_+ \to \mathbb{R}$ be the Lagrangian function for the primal problem (8.1)-(8.3), that is, $L(x, y) = cx + y(b - Ax)$. The necessary and sufficient condition that $\hat{x} \in \mathbb{R}^n_+$ be an optimal solution of the primal problem (8.1)-(8.3) and $\hat{y} \in \mathbb{R}^m_+$ be an optimal solution of the dual problem (8.4)-(8.6) is that $(\hat{x}, \hat{y})$ be a saddle point of $L$; that is, for all $x \in \mathbb{R}^n_+$ and $y \in \mathbb{R}^m_+$,

$$L(x, \hat{y}) \le L(\hat{x}, \hat{y}) \le L(\hat{x}, y) . \tag{8.7}$$

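
The saddle-point inequality (8.7) can be probed numerically. In the Python sketch below, the data and the optimal pair $\hat{x} = (4, 0)$, $\hat{y} = (3, 0)$ are assumptions chosen for illustration (they are optimal for this particular instance), and random sampling of nonnegative points only spot-checks the inequality; it is not a proof.

```python
import numpy as np

# Example data and a primal/dual optimal pair for it (assumptions of this sketch).
c = np.array([3.0, 2.0])
A = np.array([[1.0, 1.0], [1.0, 3.0]])
b = np.array([4.0, 6.0])
x_hat = np.array([4.0, 0.0])
y_hat = np.array([3.0, 0.0])

def lagrangian(x, y):
    """L(x, y) = c x + y (b - A x)."""
    return float(c @ x + y @ (b - A @ x))

rng = np.random.default_rng(0)
xs = rng.uniform(0.0, 10.0, size=(1000, 2))   # sample points of R^n_+
ys = rng.uniform(0.0, 10.0, size=(1000, 2))   # sample points of R^m_+

L_hat = lagrangian(x_hat, y_hat)
left_ok = all(lagrangian(x, y_hat) <= L_hat + 1e-9 for x in xs)
right_ok = all(L_hat <= lagrangian(x_hat, y) + 1e-9 for y in ys)
```

Here $L(x, \hat{y}) = 12 - x_2$ and $L(\hat{x}, y) = 12 + 2 y_2$, so both inequalities hold for all nonnegative arguments.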
8.1.2 Sets and Fuzzy Sets 


A well-known fact about subsets of a given set is that their properties and their mutual relations can be studied by means of their characteristic functions. However, these two notions are different, and the notion of characteristic function of a subset of a set is more complicated than that of a subset of a set. Indeed, because the characteristic function $\chi_A$ of a subset $A$ of a fixed given set $X$ is a mapping from $X$ into the set $\{0, 1\}$, we not only need the underlying set $X$ and its subset $A$ but also one additional set; in particular, the set $\{0, 1\}$. In addition, we also need the notion of an ordered pair and the notion of the Cartesian product of sets because functions are specially structured binary relations; in this case, special subsets of $X \times \{0, 1\}$.

The phrases the membership function of a fuzzy set ... or the fuzzy set defined by membership function ... (and similar ones), which are very common in the fuzzy set literature, clearly indicate that a fuzzy set and its membership function are different mathematical objects. If we introduce fuzzy sets by means of their membership functions, that is, by replacing the range $\{0, 1\}$ of characteristic functions with the unit interval $[0, 1]$ of real numbers ordered by the standard ordering $\le$, then we are tacitly assuming that the membership functions of fuzzy sets on $X$ are related to fuzzy sets on $X$ in an analogous way as the characteristic functions of subsets of $X$ are related to subsets of $X$. What are those objects that we call fuzzy sets on $X$ in set-theoretic terms? Obviously, they are more complex than just subsets of $X$ because the class of functions mapping $X$ into the lattice $([0, 1], \le)$ is much richer than the class of functions mapping $X$ into $\{0, 1\}$. We follow the opinion that fuzzy sets are special nested families of subsets of a set, see [8.11].


Definition 8.1 

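A fuzzy subset of a nonempty set $X$ (or a fuzzy set on $X$) is a family $\{A_\alpha\}_{\alpha \in [0,1]}$ of subsets of $X$ such that $A_0 = X$, $A_\beta \subset A_\alpha$ whenever $0 \le \alpha \le \beta \le 1$, and $A_\beta = \bigcap_{0 \le \alpha < \beta} A_\alpha$ whenever $0 < \beta \le 1$.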


Definition 8.2 

If $A = \{A_\alpha\}_{\alpha \in [0,1]}$ is a fuzzy subset of $X$, then the membership function of $A$ is the function $\mu_A$ from $X$ into the unit interval $[0, 1]$ defined by $\mu_A(x) = \sup\{\alpha : x \in A_\alpha\}$.
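
Definitions 8.1 and 8.2 can be illustrated on a finite set. In this Python sketch the names are hypothetical, and the family is stored only for finitely many levels $\alpha$; this finite grid is an assumption that works here because every membership degree in the example is itself one of the stored levels.

```python
def alpha_family(mu, levels):
    """Nested family {A_alpha}: A_alpha = {x : mu[x] >= alpha}, so A_0 is all of X."""
    return {a: {x for x, degree in mu.items() if degree >= a} for a in levels}

def membership(family, x):
    """mu_A(x) = sup{alpha : x in A_alpha}, taken over the stored levels."""
    return max(a for a, subset in family.items() if x in subset)

mu = {"a": 1.0, "b": 0.5, "c": 0.0}
levels = [0.0, 0.25, 0.5, 0.75, 1.0]

family = alpha_family(mu, levels)
recovered = {x: membership(family, x) for x in mu}   # equals mu for this example
```

The family shrinks as $\alpha$ grows, and taking the supremum of the levels that still contain $x$ recovers the membership degree of $x$.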


Remark 8.1 

It is worth noting that by defining a fuzzy subset of a set $X$ as a special family of subsets of $X$, we can easily avoid certain troublesome phrases. For example, we are used to saying a subset $A$ of $X$ and not a subset $\chi_A : X \to \{0, 1\}$ of a set $X$. Similarly, it is more natural to say a fuzzy subset $A$ of $X$ than to say a fuzzy subset $\mu_A : X \to [0, 1]$ in $X$. Moreover, if a fuzzy set on $X$ were defined as a function $\mu$ from $X$ to $[0, 1]$, then we would obtain statements like fuzzy set $\mu$ is function $\mu$, or a fuzzy set $\mu$ is convex if and only if $\mu$ is quasiconcave.


Let $A$ be a subset of a set $X$ and let $\{A_\alpha\}_{\alpha \in [0,1]}$ be the family of subsets of $X$ defined by $A_0 = X$ and $A_\alpha = A$ for each positive $\alpha$ from $[0, 1]$. It can easily be seen that this family is a fuzzy set on $X$ and that its membership function is equal to the characteristic function of $A$; see [8.12, 13] for details. This one-to-one correspondence between the characteristic functions of subsets of $X$ and the membership functions of certain fuzzy sets on $X$ provides an embedding of the set of subsets of $X$ into the set of fuzzy sets on $X$. Consequently, we can view subsets of $X$ as special fuzzy sets on $X$. When we need to distinguish the latter from the other fuzzy sets on $X$, we call them the crisp fuzzy sets on $X$. Moreover, we can also view the elements of $X$ as special fuzzy sets on $X$ by additionally employing the one-to-one correspondence that assigns to each element $x$ of $X$ the singleton $\{x\}$. When we need to distinguish an element $x \in X$ from the crisp fuzzy set on $X$ corresponding to $\{x\}$, we write $k(x)$ for the latter.

We denote the collection of all fuzzy sets on $X$ by $\mathcal{F}(X)$. When $A$ is from $\mathcal{F}(X)$ and $\mu_A$ is the membership function of $A$, then we use the following terminology. The value $\mu_A(x)$ is called the membership degree of $x$ in $A$. The set $\{x \in X : \mu_A(x) = 1\}$ is called the core of $A$. If the core of $A$ is nonempty, then $A$ is said to be normalized. The complement of $A$ is the fuzzy set $c(A)$ on $X$ whose membership function is $\mu_{c(A)}(x) = 1 - \mu_A(x)$. For each $\alpha \in [0, 1]$, the set $\{x \in X \mid \mu_A(x) \ge \alpha\}$ is called the $\alpha$-cut of $A$ and is denoted by $[A]_\alpha$. If $X$ is a nonempty subset of a real finite-dimensional normed space, then a fuzzy set $A$ in $X$ is called closed, bounded, compact, or convex if the $\alpha$-cut $[A]_\alpha$ is a closed, bounded, compact or convex subset of $X$ for every $\alpha \in (0, 1]$, respectively.

Following the terminology of [8.7], we say that a fuzzy subset $A$ of $\mathbb{R}$ is a fuzzy quantity whenever $A$ is normalized, compact, and its membership function $\mu_A$ is semistrictly quasiconcave in the following sense: The membership function $\mu_A$ of $A$ is semistrictly quasiconcave on $\mathbb{R}$ if there exist $a, b, c, d \in \mathbb{R}$, $-\infty < a \le b \le c \le d < +\infty$, such that

$\mu_A(t) = 0$ if $t < a$ or $t > d$,
$\mu_A$ is strictly increasing on the interval $[a, b]$,
$\mu_A(t) = 1$ if $b \le t \le c$,
$\mu_A$ is strictly decreasing on the interval $[c, d]$.


The set of all fuzzy quantities is denoted by $\mathcal{F}_0(\mathbb{R})$. Note that $\mathcal{F}_0(\mathbb{R})$ contains well-known classes of fuzzy numbers: crisp (real) numbers, crisp intervals, triangular fuzzy numbers, trapezoidal, and bell-shaped fuzzy numbers etc. However, $\mathcal{F}_0(\mathbb{R})$ does not contain fuzzy sets with stair-like membership functions.
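
A trapezoidal fuzzy number is a convenient concrete fuzzy quantity. The Python sketch below is an illustration only: the function names are hypothetical, the sides are assumed to be linear, and the strict inequalities $a < b$ and $c < d$ are assumed so that no division by zero occurs (crisp numbers and intervals would need separate handling).

```python
import numpy as np

def trapezoid_membership(t, a, b, c, d):
    """Membership degree of t in the trapezoidal fuzzy quantity (a, b, c, d).

    Assumes a < b <= c < d and linear sides (assumptions of this sketch).
    """
    t = np.asarray(t, dtype=float)
    rising = (t - a) / (b - a)        # strictly increasing on [a, b]
    falling = (d - t) / (d - c)       # strictly decreasing on [c, d]
    return np.clip(np.minimum(rising, falling), 0.0, 1.0)

def alpha_cut(alpha, a, b, c, d):
    """End points of the closed interval [A]_alpha for 0 < alpha <= 1."""
    return a + alpha * (b - a), d - alpha * (d - c)

left, right = alpha_cut(0.5, 1.0, 2.0, 3.0, 5.0)    # (1.5, 4.0)
core = alpha_cut(1.0, 1.0, 2.0, 3.0, 5.0)           # (2.0, 3.0), the core [b, c]
```

Each $\alpha$-cut is a compact interval, and its end points are the points where the membership function equals $\alpha$ on the increasing and decreasing sides.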

Recall that the binary relations on $X$ are subsets of the Cartesian product $X \times X$ and that the fuzzy sets on $X \times X$ are called the fuzzy binary relations on $X$, or simply fuzzy relations on $X$. Because the binary relations on $X$ are subsets of $X \times X$, we can view them as special fuzzy relations on $X$; namely, as those fuzzy relations on $X$ whose membership functions are equal to the characteristic functions of the corresponding binary relations. Again, we call them crisp. Since the membership functions of fuzzy sets provide a mathematical tool for introducing grades in the notion of set membership, the fuzzy relations on $X$ can be used for introducing grades in comparison of elements of $X$. However, if we need to compare not only elements of $X$ but also fuzzy sets on $X$, then we need binary relations and fuzzy binary relations on the set of fuzzy sets on $X$, that is, on $\mathcal{F}(X) \times \mathcal{F}(X)$.

Let $R$ be a fuzzy relation on $X$ and let $Q$ be a fuzzy relation on $\mathcal{F}(X)$, that is, $R$ belongs to $\mathcal{F}(X \times X)$ and $Q$ belongs to $\mathcal{F}(\mathcal{F}(X) \times \mathcal{F}(X))$. We say that $Q$ is a fuzzy extension (or briefly an extension) of $R$ from $X$ to $\mathcal{F}(X)$ if, for each pair $x$ and $y$ in $X$,

$$\mu_Q(k(x), k(y)) = \mu_R(x, y) . \tag{8.8}$$

Because the set of the conventional binary relations on $X$ can be embedded into the set of fuzzy relations on $X$, we also obtain from (8.8) extensions of conventional binary relations on $X$ to fuzzy relations on $\mathcal{F}(X)$.


8.2 Fuzzy Linear Programming


As mentioned in the beginning, we can hardly expect that some unique extension of the conventional linear programming to problems with fuzzy data can be established which would be as clean and clear as the theory of the conventional linear programming in finite-dimensional spaces. This can also be easily seen from the current literature where we can find a number of different extensions, the results of which resemble in various degrees some of the useful results established in the conventional linear programming.

When dealing with problems that arise from the canonical linear programming problem (8.1)-(8.3) by permitting the input data $c_j$, $a_{ij}$, and $b_i$ in (8.1)-(8.3) to be fuzzy quantities, we distinguish the fuzzy quantities from real numbers by writing the tilde above the corresponding symbol. Thus, we write $\tilde{c}_j$, $\tilde{a}_{ij}$, and $\tilde{b}_i$ and consequently $\mu_{\tilde{c}_j} : \mathbb{R} \to [0, 1]$, $\mu_{\tilde{a}_{ij}} : \mathbb{R} \to [0, 1]$ and $\mu_{\tilde{b}_i} : \mathbb{R} \to [0, 1]$, respectively, for $i \in M = \{1, 2, \dots, m\}$ and $j \in N = \{1, 2, \dots, n\}$. When the tilde is omitted, it signifies that the corresponding data or values of variables are considered to be real numbers. Notice that if $\tilde{c}_j$ and $\tilde{a}_{ij}$ are fuzzy quantities, then, for every $(x_1, x_2, \dots, x_n)$ from $\mathbb{R}^n$, the fuzzy subsets $\tilde{c}_1 x_1 + \cdots + \tilde{c}_n x_n$ and $\tilde{a}_{i1} x_1 + \cdots + \tilde{a}_{in} x_n$ of $\mathbb{R}$ defined by the extension principle are again fuzzy quantities. Also notice that it is possible to consider the conventional linear programming problems as special cases of such fuzzy problems because the real numbers can be identified with crisp fuzzy quantities.

As the canonical fuzzy counterpart of the canonical linear programming problem (8.1)-(8.3), we consider the problem

$$
\begin{aligned}
&\text{maximize} && \tilde{c}_1 x_1 + \cdots + \tilde{c}_n x_n \\
&\text{subject to} && (\tilde{a}_{i1} x_1 + \cdots + \tilde{a}_{in} x_n) \; P_i \; \tilde{b}_i , \quad i \in M , \\
& && x_j \ge 0 , \quad j \in N ,
\end{aligned}
\tag{8.9}
$$

where, for each $i \in M$, the fuzzy quantities $\tilde{a}_{i1} x_1 + \cdots + \tilde{a}_{in} x_n$ and $\tilde{b}_i$ from $\mathcal{F}_0(\mathbb{R})$ are compared by a fuzzy relation $P_i$ on $\mathcal{F}_0(\mathbb{R})$, and where the meanings of subject to and maximize, that is, the meanings of feasibility and optimality, remain to be specified.


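Primarily, we shall study the case in which all $P_i$ appearing in the constraints of problem (8.9) are the same. Namely, let $P$ be a fuzzy relation on $\mathcal{F}_0(\mathbb{R})$ and let us assume that $P_i = P$ for all $i \in M$. Then (8.9) simplifies to

$$
\begin{aligned}
&\text{maximize} && \tilde{c}_1 x_1 + \cdots + \tilde{c}_n x_n \\
&\text{subject to} && (\tilde{a}_{i1} x_1 + \cdots + \tilde{a}_{in} x_n) \; P \; \tilde{b}_i , \quad i \in M , \\
& && x_j \ge 0 , \quad j \in N ,
\end{aligned}
\tag{8.10}
$$

where the meanings of feasibility and optimality are specified as follows.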


• Feasibility: Let $\beta$ be a positive number from $(0, 1]$. By a $\beta$-feasible region of problem (8.10) we understand the $\beta$-cut of the fuzzy subset $\tilde{X}$ of $\mathbb{R}^n$ whose membership function $\mu_{\tilde{X}}$ is given by

$$
\mu_{\tilde{X}}(x) =
\begin{cases}
\min\limits_{1 \le i \le m} \mu_P(\tilde{a}_{i1} x_1 + \cdots + \tilde{a}_{in} x_n , \tilde{b}_i) & \text{if } x_j \ge 0 \text{ for all } j \in N , \\
0 & \text{otherwise} .
\end{cases}
\tag{8.11}
$$

The elements of the $\beta$-feasible region are called $\beta$-feasible solutions of problem (8.10), and $\tilde{X}$ defined by (8.11) is called the feasible region of problem (8.10). It is worth mentioning that when the data in (8.10) are crisp, then $\tilde{X}$ becomes the feasible region of the canonical linear programming problem (8.1)-(8.3).

© Optimality: When specifying the meaning of opti- 
mization, we have to take into account that the set of 
fuzzy values of the objective function is not linearly 
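ordered, and that the relation used for comparing elements of this set may be independent of that used in the notion of feasibility. We propose to use the notion of $\alpha$-efficient ($\alpha$-nondominated) solution of the fuzzy linear programming (FLP) problem. (Some other approaches can be found in the literature; for example, see [8.6].)

First, we observe that a feasible solution $\hat{x}$ of the nonfuzzy problem (8.1)–(8.3) is optimal exactly when there is no other feasible solution $x$ such that $cx > c\hat{x}$. This suggests introducing suitable fuzzy extensions of $>$. Let $\tilde{Q}$ be a fuzzy relation on $\mathbb{R}$ and let $\alpha \in (0,1]$. If $\tilde{a}$ and $\tilde{b}$ are fuzzy quantities, then we write

\[ \tilde{a}\, \tilde{Q}_\alpha\, \tilde{b}, \quad \text{if } \mu_{\tilde{Q}}(\tilde{a},\tilde{b}) \ge \alpha \tag{8.12} \]

and call $\tilde{Q}_\alpha$ the $\alpha$-relation on $\mathbb{R}$ associated to $\tilde{Q}$. We also write

\[ \tilde{a}\, \tilde{Q}^*_\alpha\, \tilde{b}, \quad \text{if } \bigl(\tilde{a}\, \tilde{Q}_\alpha\, \tilde{b} \text{ and } \mu_{\tilde{Q}}(\tilde{b},\tilde{a}) < \alpha\bigr), \tag{8.13} \]

and call $\tilde{Q}^*_\alpha$ the strict $\alpha$-relation on $\mathbb{R}$ associated to $\tilde{Q}$. Now let $\alpha$ and $\beta$ be positive numbers from $[0,1]$. We say that a $\beta$-feasible solution $\hat{x}$ of (8.10) is an $(\alpha,\beta)$-maximal solution of (8.10) if there is no $\beta$-feasible solution $x$ of (8.10) different from $\hat{x}$ such that

\[ \tilde{c}_1 \hat{x}_1 \,\tilde{+}\, \tilde{c}_2 \hat{x}_2 \,\tilde{+} \cdots \tilde{+}\, \tilde{c}_n \hat{x}_n \;\tilde{Q}^*_\alpha\; \tilde{c}_1 x_1 \,\tilde{+}\, \tilde{c}_2 x_2 \,\tilde{+} \cdots \tilde{+}\, \tilde{c}_n x_n . \tag{8.14} \]

Remark 8.2
Note that $\tilde{Q}_\alpha$ and $\tilde{Q}^*_\alpha$ are binary relations on the set of fuzzy quantities $F_0(\mathbb{R})$ that are constructed from the fuzzy relation $\tilde{Q}$ at the level $\alpha \in (0,1]$, and that the relation $\tilde{Q}^*_\alpha$ is the strict relation associated with the relation $\tilde{Q}_\alpha$. Also notice that if $\tilde{a}$ and $\tilde{b}$ are crisp fuzzy numbers corresponding to real numbers $a$ and $b$, respectively, and $\tilde{Q}$ is a fuzzy extension of the relation $\le$, then $\tilde{a}\, \tilde{Q}_\alpha\, \tilde{b}$ holds if and only if $a \le b$ does. Then, for $\alpha \in (0,1)$, $\tilde{a}\, \tilde{Q}^*_\alpha\, \tilde{b}$ if and only if $a < b$.

Significance and usefulness of duality results for linear programming problems with fuzzy data depend crucially on the choice of the fuzzy relations $\tilde{P}$ and $\tilde{Q}$ appearing in the definition of feasibility and optimality. In what follows, we use the natural extensions of the binary relations $\le$ and $\ge$ on $\mathbb{R}$ to fuzzy relations on $F(\mathbb{R})$ that are based on the possibility and necessity relations Pos and Nec defined on $F(\mathbb{R})$ by

\[ \mu_{\mathrm{Pos}}(\tilde{a},\tilde{b}) = \sup\{\min(\mu_{\tilde{a}}(x), \mu_{\tilde{b}}(y), \mu_R(x,y)) \mid x,y \in \mathbb{R}\}, \tag{8.15} \]
\[ \mu_{\mathrm{Nec}}(\tilde{a},\tilde{b}) = \inf\{\max(1-\mu_{\tilde{a}}(x), 1-\mu_{\tilde{b}}(y), \mu_R(x,y)) \mid x,y \in \mathbb{R}\}. \tag{8.16} \]

Equivalently, we write $\tilde{a} \preceq^{\mathrm{Pos}} \tilde{b}$ and $\tilde{a} \preceq^{\mathrm{Nec}} \tilde{b}$, instead of $\mu_{\mathrm{Pos}}(\tilde{a},\tilde{b})$ and $\mu_{\mathrm{Nec}}(\tilde{a},\tilde{b})$, respectively, and by $\tilde{a} \succeq^{\mathrm{Pos}} \tilde{b}$ we mean $\tilde{b} \preceq^{\mathrm{Pos}} \tilde{a}$.

The proofs of the following propositions can be found in [8.7].

Proposition 8.1
Let $\tilde{a}, \tilde{b} \in F_0(\mathbb{R})$ be fuzzy quantities. Then, for each $\alpha \in (0,1)$, we have

\[ \mu_{\mathrm{Pos}}(\tilde{a},\tilde{b}) \ge \alpha \quad \text{iff} \quad \inf[\tilde{a}]_\alpha \le \sup[\tilde{b}]_\alpha , \tag{8.17} \]
\[ \mu_{\mathrm{Nec}}(\tilde{a},\tilde{b}) \ge \alpha \quad \text{iff} \quad \sup[\tilde{a}]_{1-\alpha} \le \inf[\tilde{b}]_{1-\alpha} . \tag{8.18} \]

Let $\tilde{d} \in F(\mathbb{R})$ be a fuzzy quantity, let $\beta \in [0,1]$, and let $\tilde{d}^L(\beta)$ and $\tilde{d}^R(\beta)$ be defined by

\[ \tilde{d}^L(\beta) = \inf\{t \mid t \in [\tilde{d}]_\beta\} = \inf[\tilde{d}]_\beta , \qquad \tilde{d}^R(\beta) = \sup\{t \mid t \in [\tilde{d}]_\beta\} = \sup[\tilde{d}]_\beta . \tag{8.19} \]

Proposition 8.2
i) Let $\tilde{P} = \preceq^{\mathrm{Pos}}$ and let $\beta \in [0,1]$. A vector $x = (x_1,\dots,x_n)$ is a $\beta$-feasible solution of the FLP problem (8.10) if and only if it is a nonnegative solution of the system of inequalities

\[ \sum_{j \in N} \tilde{a}^L_{ij}(\beta)\, x_j \le \tilde{b}^R_i(\beta), \quad i \in M . \]

ii) Let $\tilde{P} = \preceq^{\mathrm{Nec}}$. A vector $x = (x_1,\dots,x_n)$ is a $\beta$-feasible solution of the FLP problem (8.10) if and only if it is a nonnegative solution of the system of inequalities

\[ \sum_{j \in N} \tilde{a}^R_{ij}(1-\beta)\, x_j \le \tilde{b}^L_i(1-\beta), \quad i \in M . \]

The following proposition is a simple consequence of the above results applied to the particular fuzzy relations $\tilde{P} = \preceq^{\mathrm{Pos}}$ and $\tilde{P} = \preceq^{\mathrm{Nec}}$.

Proposition 8.3
Let $\tilde{a}$ and $\tilde{b}$ be fuzzy quantities, $\alpha \in (0,1]$.

i) Let $\tilde{P} = \preceq^{\mathrm{Pos}}$ be a fuzzy relation on $\mathbb{R}$ defined by (8.15). Then

\[ \tilde{a}\, \tilde{P}_\alpha\, \tilde{b} \ \text{ iff } \ \tilde{a}^L(\alpha) \le \tilde{b}^R(\alpha), \qquad \tilde{a}\, \tilde{P}^*_\alpha\, \tilde{b} \ \text{ iff } \ \tilde{a}^R(\alpha) < \tilde{b}^L(\alpha). \tag{8.20} \]

ii) Let $\tilde{P} = \preceq^{\mathrm{Nec}}$ be a fuzzy relation on $\mathbb{R}$ defined by (8.16). Then

\[ \tilde{a}\, \tilde{P}_\alpha\, \tilde{b} \ \text{ iff } \ \tilde{a}^R(1-\alpha) \le \tilde{b}^L(1-\alpha), \]
\[ \tilde{a}\, \tilde{P}^*_\alpha\, \tilde{b} \ \text{ iff } \ \tilde{a}^R(1-\alpha) \le \tilde{b}^L(1-\alpha) \ \text{ and } \ \tilde{a}^L(1-\alpha) < \tilde{b}^R(1-\alpha). \tag{8.21} \]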
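Conditions (8.17) and (8.18) reduce the possibility and necessity relations to comparisons of endpoints of level cuts. The following Python sketch illustrates this for triangular fuzzy numbers. The triple representation `(left, peak, right)` and the names `cut`, `pos_at_least`, and `nec_at_least` are assumptions made for this illustration only; they do not come from the chapter.

```python
from typing import Tuple

# Hypothetical representation: a triangular fuzzy number (left, peak, right).
Tri = Tuple[float, float, float]


def cut(t: Tri, level: float) -> Tuple[float, float]:
    """Return (inf, sup) of the level-cut of a triangular fuzzy number."""
    left, peak, right = t
    return (left + level * (peak - left), right - level * (right - peak))


def pos_at_least(a: Tri, b: Tri, alpha: float) -> bool:
    """Condition (8.17): inf [a]_alpha <= sup [b]_alpha."""
    return cut(a, alpha)[0] <= cut(b, alpha)[1]


def nec_at_least(a: Tri, b: Tri, alpha: float) -> bool:
    """Condition (8.18): sup [a]_(1-alpha) <= inf [b]_(1-alpha)."""
    return cut(a, 1 - alpha)[1] <= cut(b, 1 - alpha)[0]


a, b = (1.0, 2.0, 3.0), (2.0, 3.0, 4.0)
print(pos_at_least(a, b, 0.5))   # possibility of "a <= b" reaches level 0.5
print(nec_at_least(a, b, 0.25))  # necessity reaches level 0.25
print(nec_at_least(a, b, 0.8))   # necessity does not reach level 0.8
```

The example shows the typical pattern: the necessity-based relation is more demanding than the possibility-based one, so it holds only at lower levels for overlapping fuzzy numbers.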


As to the optimal solution of FLP problem, we ob- 
tain the following result, see also [8.7]. 


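Proposition 8.4
Let $\alpha, \beta \in (0,1)$ and let $\tilde{X}$ be a feasible region of the FLP problem (8.10) with $\tilde{P} = \preceq^{\mathrm{Pos}}$. Let $c_j$ be such that $\tilde{c}^L_j(\alpha) \le c_j \le \tilde{c}^R_j(\alpha)$ for all $j \in N$. If $x^* = (x^*_1,\dots,x^*_n)$ is an optimal solution of the LP problem

\[
\begin{aligned}
\text{maximize} \quad & \sum_{j \in N} c_j x_j \\
\text{subject to} \quad & \sum_{j \in N} \tilde{a}^L_{ij}(\beta)\, x_j \le \tilde{b}^R_i(\beta), \quad i \in M, \\
& x_j \ge 0, \quad j \in N,
\end{aligned}
\tag{8.22}
\]

then $x^*$ is an $(\alpha,\beta)$-maximal solution of the FLP problem (8.10).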
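To make the construction in Proposition 8.4 concrete, the sketch below derives the crisp data of an LP of the form (8.22) from triangular fuzzy coefficients and checks whether a given point satisfies its constraints. The triangular representation, the sample numbers, and the helper names `crisp_lp_data` and `satisfies_constraints` are illustrative assumptions; solving the LP itself is left to any standard solver.

```python
def cut(t, level):
    """(inf, sup) of the level-cut of a triangular fuzzy number (l, m, r)."""
    left, peak, right = t
    return (left + level * (peak - left), right - level * (right - peak))


def crisp_lp_data(c, A, b, alpha, beta):
    """Crisp data behind (8.22) for the possibility-based relation.

    Returns the admissible interval for each objective coefficient c_j,
    the lower endpoints a_ij^L(beta), and the upper endpoints b_i^R(beta).
    """
    c_box = [cut(cj, alpha) for cj in c]
    A_low = [[cut(aij, beta)[0] for aij in row] for row in A]
    b_up = [cut(bi, beta)[1] for bi in b]
    return c_box, A_low, b_up


def satisfies_constraints(x, A_low, b_up):
    """Check x >= 0 and sum_j a_ij^L(beta) x_j <= b_i^R(beta) for all i."""
    nonneg = all(xj >= 0 for xj in x)
    rows_ok = all(
        sum(a * xj for a, xj in zip(row, x)) <= bi
        for row, bi in zip(A_low, b_up)
    )
    return nonneg and rows_ok


# Hypothetical data: one constraint, two variables.
c = [(1.0, 2.0, 3.0), (2.0, 3.0, 4.0)]
A = [[(1.0, 2.0, 3.0), (0.0, 1.0, 2.0)]]
b = [(8.0, 10.0, 12.0)]

c_box, A_low, b_up = crisp_lp_data(c, A, b, alpha=0.5, beta=0.5)
print(c_box, A_low, b_up)
print(satisfies_constraints([4.0, 6.0], A_low, b_up))
print(satisfies_constraints([8.0, 0.0], A_low, b_up))
```

Any crisp objective vector chosen inside `c_box` gives an LP whose optimal solutions are candidates for $(\alpha,\beta)$-maximal solutions in the sense of the proposition.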


8.3 Duality in Fuzzy Linear Programming 


8.3.1 Early Approaches 


Dual Pairs of Rödder and Zimmermann
One of the early approaches to duality in linear programming problems involving fuzziness is due to Rödder and Zimmermann [8.14]. To be able to state the problems considered by Rödder and Zimmermann concisely, we first observe that conditions (8.7) bring up the pair of optimization problems

\[ \text{maximize } \min_{y \ge 0} L(x,y) \quad \text{subject to } x \in \mathbb{R}^n_+ , \tag{8.23} \]
\[ \text{minimize } \max_{x \ge 0} L(x,y) \quad \text{subject to } y \in \mathbb{R}^m_+ . \tag{8.24} \]

Let $\mu$ and $\mu'$ be real-valued functions on $\mathbb{R}^n_+$ and $\mathbb{R}^m_+$, respectively, and let $\{\nu_x\}_{x \in \mathbb{R}^n_+}$ and $\{\nu'_y\}_{y \in \mathbb{R}^m_+}$ be families of real-valued functions on $\mathbb{R}^m_+$ and $\mathbb{R}^n_+$, respectively. Furthermore, let $\varphi_y$ and $\psi_x$ be real-valued functions on $\mathbb{R}^n_+$ and $\mathbb{R}^m_+$ defined by

\[ \varphi_y(x) = \min(\mu(x), \nu_x(y)) , \tag{8.25} \]
\[ \psi_x(y) = \min(\mu'(y), \nu'_y(x)) . \tag{8.26} \]

Now let us consider the following pair of families of optimization problems.

Family $\{P_y\}$: Given $y \in \mathbb{R}^m_+$,
maximize $\varphi_y(x)$ subject to $x \in \mathbb{R}^n_+$.

Family $\{D_x\}$: Given $x \in \mathbb{R}^n_+$,
maximize $\psi_x(y)$ subject to $y \in \mathbb{R}^m_+$.


Motivated and supported by economic interpretation, Rödder and Zimmermann [8.14] propose to specify the functions $\mu$ and $\mu'$ and the families $\{\nu_x\}$ and $\{\nu'_y\}$ as follows: Given an $m \times n$ matrix $A$, an $m \times 1$ vector $b$, a $1 \times n$ vector $c$, and real numbers $\gamma$ and $\delta$, define the functions $\mu$, $\mu'$, $\nu_x$ and $\nu'_y$ by

\[ \mu(x) = \min(1, 1 - (\gamma - cx)) , \qquad \mu'(y) = \min(1, 1 - (yb - \delta)) ; \tag{8.27} \]
\[ \nu_x(y) = \max(0, y(b - Ax)) , \qquad \nu'_y(x) = \max(0, (yA - c)x) . \tag{8.28} \]


Strictly speaking, we do not obtain a duality scheme as conceived by Kuhn because there is no relationship between the numbers $\gamma$ and $\delta$. Indeed, if the family $\{P_y\}_{y \ge 0}$ is considered to be the primal problem, then the primal problem is completely specified by the data $A$, $b$, $c$, and $\gamma$. However, these data are not sufficient for the specification of the family $\{D_x\}_{x \ge 0}$ because the definition of $\{D_x\}_{x \ge 0}$ requires knowledge of $\delta$. Thus, from the point of view that the dual problem is to be constructed only on the basis of the primal problem data, every choice of $\delta$ determines a certain family dual to $\{P_y\}_{y \ge 0}$. In this sense, we could say that every choice of $\delta$ gives a duality, the $\delta$-duality. Analogously, if the primal problem is $\{D_x\}_{x \ge 0}$, then every choice of $\gamma$ determines some family $\{P_y\}_{y \ge 0}$ dual to $\{D_x\}_{x \ge 0}$, and we obtain the $\gamma$-duality. In other words, for every $\gamma, \delta$, we obtain $(\gamma,\delta)$-duality. It is worth noticing that the families $\{P_y\}$ and $\{D_x\}$ consist of uncountably many linear optimization problems. Moreover, every problem of each of these families may have uncountably many optimal solutions. Consequently, the solution of the problem given by the family $\{P_y\}_{y \ge 0}$ is the family $\{X(y)\}_{y \ge 0}$ of subsets of $\mathbb{R}^n_+$, where $X(y)$ is the set of maximizers of $\varphi_y$ over $\mathbb{R}^n_+$. Analogously, the family $\{Y(x)\}_{x \ge 0}$ of maximizers of $\psi_x$ over $\mathbb{R}^m_+$ is the solution of the problem given by the family $\{D_x\}_{x \ge 0}$. Rödder and Zimmermann propose to replace the families $\{P_y\}$ and $\{D_x\}$ by the families $\{P'_u\}$ and $\{D'_x\}$ of problems defined as follows


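• Family $\{P'_u\}$: For every $u \ge 0$,

\[
\begin{aligned}
\text{maximize} \quad & \lambda \\
\text{subject to} \quad & \lambda \le 1 + cx - \gamma , \\
& \lambda \le u(b - Ax) , \\
& x \ge 0 ,
\end{aligned}
\tag{8.29}
\]

• Family $\{D'_x\}$: For every $x \ge 0$,

\[
\begin{aligned}
\text{minimize} \quad & \eta \\
\text{subject to} \quad & \eta \ge ub - \delta - 1 , \\
& \eta \ge (c - uA)x , \\
& u \ge 0 .
\end{aligned}
\tag{8.30}
\]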


They call these families of optimization problems the fuzzy dual pair and claim that the families $\{P_y\}$ and $\{D_x\}$ become the families $\{P'_u\}$ and $\{D'_x\}$ when $\mu$, $\mu'$, $\nu_x$ and $\nu'_y$ are defined by (8.27) and (8.28). To see that this claim cannot be substantiated, it suffices to observe that the value of the function $\varphi_y$ cannot be greater than 1, whereas the value of $\lambda$ is not bounded above whenever $A$ and $b$ are such that both $cx$ and $-yAx$ are positive for some $x \in \mathbb{R}^n_+$.

To obtain a valid conversion, one needs to add the inequalities $\lambda \le 1$ and $\eta \ge -1$ to the constraints. Thus, it seems that a more suitable choice of the functions $\nu_x$ and $\nu'_y$ in the Rödder and Zimmermann duality scheme would be

\[ \nu_x(y) = \min(1, 1 + y(b - Ax)) , \tag{8.31} \]
\[ \nu'_y(x) = \min(1, 1 + (yA - c)x) . \tag{8.32} \]

Another objection to the Rödder and Zimmermann model arises from the fact that the duality results for the proposed fuzzy dual pair do not reduce to the standard duality results for the crisp scenario, that is, for $\lambda = 1$, $\eta = -1$. Again an easy remedy is to work with $\nu_x$ and $\nu'_y$ defined by (8.31) and (8.32) instead of $\nu_x$ and $\nu'_y$ from (8.28). Similar approaches can be found in [8.15, 16].


Dual Pairs of Bector and Chandra
In contrast to the usual practice, in the Rödder and Zimmermann model, the range of the membership functions $\mu$ and $\mu'$ is $(-\infty, 1]$, and the range of the membership functions $\nu_x$ and $\nu'_y$ is $[0,\infty)$ or $[1,\infty)$ instead of the usual $[0,1]$. Bector and Chandra [8.17] proposed to replace the relations $\le$ and $\ge$ appearing in the dual pair of linear programming problems by suitable fuzzy relations on $\mathbb{R}$. In particular, the inequality $\le$ appearing in the $i$-th constraint of the primal problem (8.1)–(8.3) is replaced by the fuzzy relation $\lesssim_i$ whose membership function $\mu_{\lesssim_i}\colon \mathbb{R} \times \mathbb{R} \to [0,1]$ is defined by

\[
\mu_{\lesssim_i}(\alpha,\beta) =
\begin{cases}
1 & \text{if } \alpha \le \beta , \\
1 - \dfrac{\alpha - \beta}{p_i} & \text{if } \beta < \alpha \le \beta + p_i , \\
0 & \text{if } \beta + p_i < \alpha ,
\end{cases}
\]

where $p_i$ is a positive number. Analogously, the inequality $\ge$ appearing in the $j$-th constraint of the dual problem (8.4)–(8.6) is replaced by the fuzzy relation $\gtrsim_j$ with the membership function

\[
\mu_{\gtrsim_j}(\alpha,\beta) =
\begin{cases}
1 & \text{if } \alpha \ge \beta , \\
1 - \dfrac{\beta - \alpha}{q_j} & \text{if } \beta > \alpha \ge \beta - q_j , \\
0 & \text{if } \beta - q_j > \alpha ,
\end{cases}
\]

where $q_j$ is a positive number. The degree of satisfaction with which $x \in \mathbb{R}^n$ fulfills the $i$-th fuzzy constraint $A_i x \lesssim_i b_i$ of the primal problem is expressed by the fuzzy subset of $\mathbb{R}^n$ whose membership function $\mu_i$ is defined by $\mu_i(x) = \mu_{\lesssim_i}(A_i x, b_i)$, and the degree of satisfaction with which $y \in \mathbb{R}^m$ fulfills the $j$-th fuzzy constraint $yA^j \gtrsim_j c_j$ of the dual problem is expressed by the fuzzy subset of $\mathbb{R}^m$ whose membership function $\mu'_j$ is defined by $\mu'_j(y) = \mu_{\gtrsim_j}(yA^j, c_j)$.
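The piecewise-linear membership functions of the fuzzy inequalities above are easy to evaluate numerically. The following Python sketch is one possible rendering; the function names `mu_leq`, `mu_geq`, and `constraint_degree` and the sample numbers are assumptions introduced for this example.

```python
def mu_leq(a: float, b: float, p: float) -> float:
    """Degree to which 'a is at most b' holds with tolerance p > 0."""
    if a <= b:
        return 1.0
    if a <= b + p:
        return 1.0 - (a - b) / p
    return 0.0


def mu_geq(a: float, b: float, q: float) -> float:
    """Degree to which 'a is at least b' holds with tolerance q > 0.

    Swapping the arguments of mu_leq gives exactly the three cases of
    the fuzzy relation used in the dual constraints.
    """
    return mu_leq(b, a, q)


def constraint_degree(row, x, b_i: float, p_i: float) -> float:
    """Degree of satisfaction of the i-th primal constraint A_i x <~ b_i."""
    lhs = sum(a * xj for a, xj in zip(row, x))
    return mu_leq(lhs, b_i, p_i)


print(mu_leq(12.0, 10.0, 4.0))   # violated by 2 with tolerance 4
print(mu_geq(9.0, 10.0, 2.0))    # short by 1 with tolerance 2
print(constraint_degree([1.0, 2.0], [2.0, 5.0], 10.0, 4.0))
```

A degree of 1 means the crisp inequality holds, a degree of 0 means it is violated by more than the tolerance, and intermediate values decrease linearly with the size of the violation.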

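Similarly, we can express the degree of satisfaction with a prescribed aspiration level $\gamma$ of the objective function value $cx$ by the fuzzy subset of $\mathbb{R}^n$ given by $\mu_0(x) = \mu_{\gtrsim_0}(cx, \gamma)$ where, for the tolerance given by a positive number $p_0$, the membership function $\mu_{\gtrsim_0}$ is defined by

\[
\mu_{\gtrsim_0}(\alpha,\beta) =
\begin{cases}
1 & \text{if } \alpha \ge \beta , \\
1 - \dfrac{\beta - \alpha}{p_0} & \text{if } \beta > \alpha \ge \beta - p_0 , \\
0 & \text{if } \beta - p_0 > \alpha .
\end{cases}
\]

Analogously, for the degree of satisfaction with the aspiration level $\delta$ and tolerance $q_0$ in the dual problem, we have $\mu'_0(y) = \mu_{\lesssim_0}(yb, \delta)$ where

\[
\mu_{\lesssim_0}(\alpha,\beta) =
\begin{cases}
1 & \text{if } \alpha \le \beta , \\
1 - \dfrac{\alpha - \beta}{q_0} & \text{if } \beta < \alpha \le \beta + q_0 , \\
0 & \text{if } \beta + q_0 < \alpha .
\end{cases}
\]

This leads to the following pair of linear programming problems.

Given positive numbers $p_0, p_1, \dots, p_m$, and a real number $\gamma$, maximize $\lambda$ subject to

\[
\begin{aligned}
(\lambda - 1)p_0 &\le cx - \gamma , \\
(\lambda - 1)p_i &\le b_i - A_i x , \quad 1 \le i \le m , \\
0 \le \lambda &\le 1 , \quad x \ge 0 .
\end{aligned}
\tag{8.33}
\]

Given positive numbers $q_0, q_1, \dots, q_n$, and a real number $\delta$, minimize $-\eta$ subject to

\[
\begin{aligned}
(\eta - 1)q_0 &\le \delta - yb , \\
(\eta - 1)q_j &\le yA^j - c_j , \quad 1 \le j \le n , \\
0 \le \eta &\le 1 , \quad y \ge 0 .
\end{aligned}
\tag{8.34}
\]

Bector and Chandra call this pair the modified fuzzy pair of primal dual linear programming problems, and they show that if $x, \lambda$ and $y, \eta$ are feasible solutions of the corresponding problems, then

\[ (\lambda - 1)\sum_{i=1}^{m} y_i p_i + (\eta - 1)\sum_{j=1}^{n} q_j x_j \le \sum_{i=1}^{m} y_i b_i - \sum_{j=1}^{n} c_j x_j , \]
\[ (\lambda - 1)p_0 + (\eta - 1)q_0 \le \sum_{j=1}^{n} c_j x_j - \sum_{i=1}^{m} y_i b_i + (\delta - \gamma) . \]

The first inequality follows by multiplying the $i$-th constraint of (8.33) by $y_i \ge 0$ and the $j$-th constraint of (8.34) by $x_j \ge 0$ and summing; the second is the sum of the two aspiration-level constraints. It follows that, for the crisp scenario $\lambda = 1$ and $\eta = 1$, we have

\[ \sum_{j=1}^{n} c_j x_j \le \sum_{i=1}^{m} y_i b_i \le \sum_{j=1}^{n} c_j x_j + (\delta - \gamma) . \]

Moreover, for $\gamma \le \delta$, feasible solutions $x, \lambda$ and $y, \eta$ are optimal if

• $(\lambda - 1)\sum_{i=1}^{m} y_i p_i + (\eta - 1)\sum_{j=1}^{n} q_j x_j = \sum_{i=1}^{m} y_i b_i - \sum_{j=1}^{n} c_j x_j$,
• $(\lambda - 1)p_0 + (\eta - 1)q_0 = \sum_{j=1}^{n} c_j x_j - \sum_{i=1}^{m} y_i b_i + (\delta - \gamma)$.

Again we see that the dual problem is not stated by using only the data available in the primal problem. Indeed, if problem (8.33) is considered to be the primal problem, then to state its dual problem one needs additional information; namely, a number $\delta$ and numbers $q_0, q_1, \dots, q_n$; if problem (8.34) is considered to be primal, then one needs a number $\gamma$ and numbers $p_0, p_1, \dots, p_m$.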


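Dual Pairs of Verdegay
Verdegay's approach to duality in fuzzy linear problems presented in [8.18] is based on two natural ideas: (i) solutions to problems involving fuzziness should be fuzzy; (ii) the dual problem to a problem with fuzziness only in constraints should involve fuzziness only in the objective.

The primal problem considered in [8.19] has the form

\[
\begin{aligned}
\text{maximize} \quad & cx \\
\text{subject to} \quad & A_i x \lesssim_i b_i , \quad i = 1,2,\dots,m , \\
& x \ge 0 ,
\end{aligned}
\tag{8.35}
\]

where the valued relation $\lesssim_i$ in the $i$-th constraint is the same as in the previous section, that is,

\[
\mu_{\lesssim_i}(\alpha,\beta) =
\begin{cases}
1 & \text{if } \alpha \le \beta , \\
1 - \dfrac{\alpha - \beta}{p_i} & \text{if } \beta < \alpha \le \beta + p_i , \\
0 & \text{if } \beta + p_i < \alpha .
\end{cases}
\]

The fuzzy solution of problem (8.35) is given by the fuzzy subset of $\mathbb{R}^n$ each of whose $\gamma$-cuts, $0 < \gamma \le 1$, is the solution set of the problem

\[
\begin{aligned}
\text{maximize} \quad & cx \\
\text{subject to} \quad & \mu_{\lesssim_i}(A_i x, b_i) \ge \gamma , \quad i = 1,2,\dots,m , \\
& x \ge 0 .
\end{aligned}
\tag{8.36}
\]

Consequently, we obtain the following problem of parametric linear programming. For $0 < \gamma \le 1$,

\[
\begin{aligned}
\text{maximize} \quad & cx \\
\text{subject to} \quad & A_i x \le b_i + (1 - \gamma)p_i , \quad i = 1,2,\dots,m , \\
& x \ge 0 .
\end{aligned}
\tag{8.37}
\]

Consider now the ordinary dual problem to (8.37), that is, for $0 < \gamma \le 1$,

\[
\begin{aligned}
\text{minimize} \quad & \sum_{i=1}^{m} u_i \bigl(b_i + (1 - \gamma)p_i\bigr) \\
\text{subject to} \quad & uA \ge c , \\
& u \ge 0 .
\end{aligned}
\tag{8.38}
\]

This suggests introducing variables $y_1, y_2, \dots, y_m$ by

\[ y_i = b_i + \delta p_i , \quad i = 1,2,\dots,m , \]

with $\delta = 1 - \gamma$, and considering the family of problems: Given $0 \le \delta < 1$ and $u \ge 0$ with $uA \ge c$,

\[
\begin{aligned}
\text{minimize} \quad & \sum_{i=1}^{m} u_i y_i \\
\text{subject to} \quad & y_i \ge b_i + \delta p_i , \quad i = 1,2,\dots,m .
\end{aligned}
\tag{8.39}
\]

Consequently, in terms of the membership functions $\mu_{\lesssim_i}$, we obtain the family of problems: Given $0 \le \delta < 1$ and $u \ge 0$ with $uA \ge c$,

\[
\begin{aligned}
\text{minimize} \quad & \sum_{i=1}^{m} u_i y_i \\
\text{subject to} \quad & \mu_{\lesssim_i}(y_i, b_i) \le 1 - \delta , \quad i = 1,2,\dots,m .
\end{aligned}
\tag{8.40}
\]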


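8.3.2 More General Approach
In this section, we return to the general canonical FLP problem, that is, to the problem

\[
\begin{aligned}
\text{maximize} \quad & \tilde{c}_1 x_1 \,\tilde{+} \cdots \tilde{+}\, \tilde{c}_n x_n \\
\text{subject to} \quad & (\tilde{a}_{i1} x_1 \,\tilde{+} \cdots \tilde{+}\, \tilde{a}_{in} x_n)\ \tilde{P}\ \tilde{b}_i , \quad i \in M , \\
& x_j \ge 0 , \quad j \in N .
\end{aligned}
\tag{8.41}
\]

We will call it the primal FLP problem and denote it by $\mathcal{P}$. The feasible region of $\mathcal{P}$ is introduced by (8.11) and the meaning of $(\alpha,\beta)$-maximal solution is explained in (8.14) and (8.22).

To introduce the dual problem to problem $\mathcal{P}$, we first define a suitable notion of duality for fuzzy relations on $F(X)$. Let $\Phi$ and $\Psi$ be mappings from $F(X \times X)$ into $F(F(X) \times F(X))$, and let $\Theta$ be a nonempty subset of $F(X \times X)$. We say that the mapping $\Phi$ is dual to the mapping $\Psi$ on $\Theta$, if

\[ \Phi(c(P)) = c(\Psi(P)) \tag{8.42} \]

for each $P \in \Theta$. Moreover, if $P$ is in $\Theta$ and $\Phi$ is dual to $\Psi$ on $\Theta$, then we say that the fuzzy relation $\Phi(P)$ on $F(X)$ is dual to the fuzzy relation $\Psi(P)$.

The dual FLP problem (denoted by $\mathcal{D}$) to problem $\mathcal{P}$ is formulated as

\[
\begin{aligned}
\text{minimize} \quad & \tilde{b}_1 y_1 \,\tilde{+} \cdots \tilde{+}\, \tilde{b}_m y_m \\
\text{subject to} \quad & \tilde{c}_j\ \tilde{Q}\ (\tilde{a}_{1j} y_1 \,\tilde{+} \cdots \tilde{+}\, \tilde{a}_{mj} y_m) , \quad j \in N , \\
& y_i \ge 0 , \quad i \in M .
\end{aligned}
\tag{8.43}
\]

Here, $\tilde{P}$ and $\tilde{Q}$ are fuzzy relations dual to each other, particularly $\tilde{P} = \preceq^{\mathrm{Pos}}$, $\tilde{Q} = \preceq^{\mathrm{Nec}}$, or $\tilde{P} = \preceq^{\mathrm{Nec}}$, $\tilde{Q} = \preceq^{\mathrm{Pos}}$. In problem $\mathcal{P}$, maximization is considered with respect to the fuzzy relation $\tilde{P}$; in problem $\mathcal{D}$, minimization is considered with respect to the fuzzy relation $\tilde{Q}$. The pair of FLP problems $\mathcal{P}$ and $\mathcal{D}$, that is, (8.41) and (8.43), is called the primal-dual pair of FLP problems. Now, we introduce a concept of feasible region of problem $\mathcal{D}$, which is a modification of the feasible region of the primal problem $\mathcal{P}$, see also [8.20].

Let $\mu_{\tilde{a}_{ij}}$ and $\mu_{\tilde{c}_j}$, $i \in M$, $j \in N$, be the membership functions of the fuzzy quantities $\tilde{a}_{ij}$ and $\tilde{c}_j$, respectively. Let $\tilde{P}$ be a fuzzy extension of a binary relation $P$ on $\mathbb{R}$. A fuzzy set $\tilde{Y}$, whose membership function $\mu_{\tilde{Y}}$ is defined for all $y \in \mathbb{R}^m$ by

\[
\mu_{\tilde{Y}}(y) =
\begin{cases}
\min\{\mu_{\tilde{P}}(\tilde{c}_1, \tilde{a}_{11} y_1 \,\tilde{+} \cdots \tilde{+}\, \tilde{a}_{m1} y_m), \dots, \mu_{\tilde{P}}(\tilde{c}_n, \tilde{a}_{1n} y_1 \,\tilde{+} \cdots \tilde{+}\, \tilde{a}_{mn} y_m)\} & \text{if } y_i \ge 0 \text{ for all } i \in M , \\
0 & \text{otherwise} ,
\end{cases}
\tag{8.44}
\]

is called a fuzzy set of feasible region or shortly feasible region of the dual FLP problem (8.43). Moreover, if $\beta \in (0,1]$, then the vectors belonging to $[\tilde{Y}]_\beta$ are called $\beta$-feasible solutions of problem (8.43).

In a parallel way, we define an optimal solution of the dual FLP problem $\mathcal{D}$.

Let $\tilde{c}_j$, $\tilde{a}_{ij}$, and $\tilde{b}_i$, $i \in M$, $j \in N$, be fuzzy quantities on $\mathbb{R}$. Let $\tilde{Q}$ be a fuzzy relation on $F(\mathbb{R})$ that is a fuzzy extension of the usual binary relation $\le$ on $\mathbb{R}$, and let $\alpha, \beta \in (0,1]$. A $\beta$-feasible solution $y \in [\tilde{Y}]_\beta$ of (8.43) is called an $(\alpha,\beta)$-minimal solution of (8.43) if there is no $y' \in [\tilde{Y}]_\beta$, $y' \ne y$, such that

\[ \tilde{b}_1 y'_1 \,\tilde{+}\, \tilde{b}_2 y'_2 \,\tilde{+} \cdots \tilde{+}\, \tilde{b}_m y'_m \;\tilde{Q}^*_\alpha\; \tilde{b}_1 y_1 \,\tilde{+}\, \tilde{b}_2 y_2 \,\tilde{+} \cdots \tilde{+}\, \tilde{b}_m y_m , \tag{8.45} \]

where $\tilde{Q}^*_\alpha$ is the strict $\alpha$-relation on $\mathbb{R}$ associated to $\tilde{Q}$.

Let $P$ be the usual binary relation $\le$ on $\mathbb{R}$. Now, we shall investigate the FLP problems (8.41) and (8.43) with pairs of dual fuzzy relations in the constraints, particularly $\tilde{P} = \preceq^{\mathrm{Pos}}$, $\tilde{Q} = \preceq^{\mathrm{Nec}}$, or $\tilde{P} = \preceq^{\mathrm{Nec}}$, $\tilde{Q} = \preceq^{\mathrm{Pos}}$. The values of the objective functions $z$ and $w$ are maximized and minimized with respect to the fuzzy relations $\tilde{P}$ and $\tilde{Q}$, respectively.

The feasible region of the primal FLP problem $\mathcal{P}$ is denoted by $\tilde{X}$, the feasible region of the dual FLP problem $\mathcal{D}$ by $\tilde{Y}$. Clearly, $\tilde{X}$ is a fuzzy subset of $\mathbb{R}^n$, and $\tilde{Y}$ is a fuzzy subset of $\mathbb{R}^m$.

Note that in the crisp case, that is, when the parameters $\tilde{c}_j$, $\tilde{a}_{ij}$, and $\tilde{b}_i$ are crisp fuzzy quantities, then by (8.15) and (8.16), the relations $\preceq^{\mathrm{Pos}}$ and $\preceq^{\mathrm{Nec}}$ coincide with $\le$. Hence, $\mathcal{P}$ and $\mathcal{D}$ form a primal-dual pair of linear programming problems in the classical sense. The following proposition is a useful modification of Proposition 8.4 and gives a sufficient condition for $y^*$ to be an $(\alpha,\beta)$-minimal solution of the FLP problem (8.43).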


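Proposition 8.5
Let $\tilde{c}_j$, $\tilde{a}_{ij}$ and $\tilde{b}_i$ be fuzzy quantities for all $i \in M$ and $j \in N$, $\alpha, \beta \in (0,1)$. Let $\tilde{Y}$ be a feasible region of the FLP problem (8.43) with $\tilde{P} = \preceq^{\mathrm{Pos}}$. Let $b_i$ be such that $\tilde{b}^L_i(\alpha) \le b_i \le \tilde{b}^R_i(\alpha)$ for all $i \in M$. If $y^* = (y^*_1,\dots,y^*_m)$ is an optimal solution of the LP problem

\[
\begin{aligned}
\text{minimize} \quad & \sum_{i \in M} b_i y_i \\
\text{subject to} \quad & \sum_{i \in M} \tilde{a}^R_{ij}(\beta)\, y_i \ge \tilde{c}^L_j(\beta) , \quad j \in N , \\
& y_i \ge 0 , \quad i \in M ,
\end{aligned}
\tag{8.46}
\]

then $y^*$ is an $(\alpha,\beta)$-minimal solution of the FLP problem (8.43).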


Dual Pairs of Ramík

When presenting duality theorems obtained by Ramík in [8.21] (see also [8.7]), we always present two versions: i) for the fuzzy relation $\preceq^{\mathrm{Pos}}$ and ii) for the fuzzy relation $\preceq^{\mathrm{Nec}}$. In order to prove duality results we assume that the level of satisfaction $\alpha$ of the objective function is equal to the level of satisfaction $\beta$ of the constraints. Otherwise, the duality theorems in our formulation do not hold. The proofs of the following theorems can be found in [8.7].


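Theorem 8.1 First Weak Duality Theorem

Let $\tilde{c}_j$, $\tilde{a}_{ij}$ and $\tilde{b}_i$ be fuzzy quantities, $i \in M$ and $j \in N$, $\alpha \in (0,1)$.

i) Let $\tilde{X}$ be a feasible region of the FLP problem (8.41) with $\tilde{P} = \preceq^{\mathrm{Pos}}$, and $\tilde{Y}$ be a feasible region of the FLP problem (8.43) with $\tilde{Q} = \preceq^{\mathrm{Nec}}$. If a vector $x = (x_1,\dots,x_n) \ge 0$ belongs to $[\tilde{X}]_\alpha$ and $y = (y_1,\dots,y_m) \ge 0$ belongs to $[\tilde{Y}]_{1-\alpha}$, then

\[ \sum_{j \in N} \tilde{c}^R_j(\alpha)\, x_j \le \sum_{i \in M} \tilde{b}^R_i(\alpha)\, y_i . \tag{8.47} \]

ii) Let $\tilde{X}$ be a feasible region of the FLP problem (8.41) with $\tilde{P} = \preceq^{\mathrm{Nec}}$, and $\tilde{Y}$ be a feasible region of the FLP problem (8.43) with $\tilde{Q} = \preceq^{\mathrm{Pos}}$. If a vector $x = (x_1,\dots,x_n) \ge 0$ belongs to $[\tilde{X}]_{1-\alpha}$ and $y = (y_1,\dots,y_m) \ge 0$ belongs to $[\tilde{Y}]_\alpha$, then

\[ \sum_{j \in N} \tilde{c}^L_j(\alpha)\, x_j \le \sum_{i \in M} \tilde{b}^L_i(\alpha)\, y_i . \tag{8.48} \]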


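Theorem 8.2 Second Weak Duality Theorem
Let $\tilde{c}_j$, $\tilde{a}_{ij}$ and $\tilde{b}_i$ be fuzzy quantities for all $i \in M$ and $j \in N$, $\alpha \in (0,1)$.

i) Let $\tilde{X}$ be a feasible region of the FLP problem (8.41) with $\tilde{P} = \preceq^{\mathrm{Pos}}$, and $\tilde{Y}$ be a feasible region of the FLP problem (8.43) with $\tilde{Q} = \preceq^{\mathrm{Nec}}$. If for some $x = (x_1,\dots,x_n) \ge 0$ belonging to $[\tilde{X}]_\alpha$ and $y = (y_1,\dots,y_m) \ge 0$ belonging to $[\tilde{Y}]_{1-\alpha}$ it holds

\[ \sum_{j \in N} \tilde{c}^R_j(\alpha)\, x_j = \sum_{i \in M} \tilde{b}^R_i(\alpha)\, y_i , \tag{8.49} \]

then $x$ is an $(\alpha,\alpha)$-maximal solution of the FLP problem $\mathcal{P}$, (8.41), and $y$ is a $(1-\alpha, 1-\alpha)$-minimal solution of the FLP problem $\mathcal{D}$, (8.43).

ii) Let $\tilde{X}$ be a feasible region of the FLP problem (8.41) with $\tilde{P} = \preceq^{\mathrm{Nec}}$, and $\tilde{Y}$ be a feasible region of the FLP problem (8.43) with $\tilde{Q} = \preceq^{\mathrm{Pos}}$. If for some $x = (x_1,\dots,x_n) \ge 0$ belonging to $[\tilde{X}]_{1-\alpha}$ and $y = (y_1,\dots,y_m) \ge 0$ belonging to $[\tilde{Y}]_\alpha$ it holds

\[ \sum_{j \in N} \tilde{c}^L_j(\alpha)\, x_j = \sum_{i \in M} \tilde{b}^L_i(\alpha)\, y_i , \tag{8.50} \]

then $x$ is a $(1-\alpha, 1-\alpha)$-maximal solution of the FLP problem $\mathcal{P}$, (8.41), and $y$ is an $(\alpha,\alpha)$-minimal solution of the FLP problem $\mathcal{D}$, (8.43).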


Remark 8.3 
In the crisp case, Theorems 8.1 and 8.2 are the standard 
linear programming weak duality theorems. 
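The mechanism behind the first weak duality theorem can be checked numerically in the possibility/necessity case, where the level-cut endpoints turn both feasibility conditions into ordinary linear inequalities. The sketch below assumes triangular fuzzy data and a single level $\alpha = 0.5$; the data and all function names are illustrative assumptions, not part of the chapter.

```python
def cut(t, level):
    """(inf, sup) of the level-cut of a triangular fuzzy number (l, m, r)."""
    left, peak, right = t
    return (left + level * (peak - left), right - level * (right - peak))


alpha = 0.5
# Hypothetical fuzzy data: one constraint (m = 1), two variables (n = 2).
A = [[(1.0, 2.0, 3.0), (0.0, 1.0, 2.0)]]
b = [(8.0, 10.0, 12.0)]
c = [(1.0, 2.0, 3.0), (0.0, 1.0, 2.0)]

A_L = [[cut(a, alpha)[0] for a in row] for row in A]  # lower endpoints of a_ij
b_R = [cut(bi, alpha)[1] for bi in b]                 # upper endpoints of b_i
c_R = [cut(cj, alpha)[1] for cj in c]                 # upper endpoints of c_j


def primal_feasible(x):
    """x >= 0 and sum_j a_ij^L(alpha) x_j <= b_i^R(alpha) for every i."""
    return all(xj >= 0 for xj in x) and all(
        sum(a * xj for a, xj in zip(row, x)) <= bi for row, bi in zip(A_L, b_R)
    )


def dual_feasible(y):
    """y >= 0 and sum_i a_ij^L(alpha) y_i >= c_j^R(alpha) for every j."""
    columns = zip(*A_L)
    return all(yi >= 0 for yi in y) and all(
        sum(a * yi for a, yi in zip(col, y)) >= cj for col, cj in zip(columns, c_R)
    )


x, y = [2.0, 4.0], [3.0]
primal_value = sum(cj * xj for cj, xj in zip(c_R, x))
dual_value = sum(bi * yi for bi, yi in zip(b_R, y))
print(primal_feasible(x), dual_feasible(y), primal_value <= dual_value)
```

For any pair of points passing both feasibility checks, the primal value cannot exceed the dual value, which is the inequality (8.47) specialized to this data.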


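Remark 8.4

Let $\alpha \ge 0.5$. Then $[\tilde{X}]_\alpha \subset [\tilde{X}]_{1-\alpha}$, $[\tilde{Y}]_\alpha \subset [\tilde{Y}]_{1-\alpha}$, hence in the first weak duality theorem we can change the assumptions as follows: $x \in [\tilde{X}]_\alpha$ and $y \in [\tilde{Y}]_\alpha$. However, the statements of the theorem remain unchanged. The same holds for the second weak duality theorem.


Finally, let us direct our attention to strong duality. Motivated by the pairs of LP problems in Propositions 8.4 and 8.5 and in Theorem 8.2, we consider a pair of dual LP problems corresponding to the FLP problems (8.41) and (8.43) with fuzzy relations $\tilde{P} = \preceq^{\mathrm{Pos}}$, $\tilde{Q} = \preceq^{\mathrm{Nec}}$, and $\alpha = \beta$:

\[
\text{(P1)} \qquad
\begin{aligned}
\text{maximize} \quad & \sum_{j \in N} \tilde{c}^R_j(\alpha)\, x_j \\
\text{subject to} \quad & \sum_{j \in N} \tilde{a}^L_{ij}(\alpha)\, x_j \le \tilde{b}^R_i(\alpha) , \quad i \in M , \\
& x_j \ge 0 , \quad j \in N ,
\end{aligned}
\tag{8.51}
\]

\[
\text{(D1)} \qquad
\begin{aligned}
\text{minimize} \quad & \sum_{i \in M} \tilde{b}^R_i(\alpha)\, y_i \\
\text{subject to} \quad & \sum_{i \in M} \tilde{a}^L_{ij}(\alpha)\, y_i \ge \tilde{c}^R_j(\alpha) , \quad j \in N , \\
& y_i \ge 0 , \quad i \in M .
\end{aligned}
\tag{8.52}
\]

Moreover, we consider a pair of dual LP problems with fuzzy relations $\tilde{P} = \preceq^{\mathrm{Nec}}$, $\tilde{Q} = \preceq^{\mathrm{Pos}}$:

\[
\text{(P2)} \qquad
\begin{aligned}
\text{maximize} \quad & \sum_{j \in N} \tilde{c}^L_j(\alpha)\, x_j \\
\text{subject to} \quad & \sum_{j \in N} \tilde{a}^R_{ij}(\alpha)\, x_j \le \tilde{b}^L_i(\alpha) , \quad i \in M , \\
& x_j \ge 0 , \quad j \in N ,
\end{aligned}
\tag{8.53}
\]

\[
\text{(D2)} \qquad
\begin{aligned}
\text{minimize} \quad & \sum_{i \in M} \tilde{b}^L_i(\alpha)\, y_i \\
\text{subject to} \quad & \sum_{i \in M} \tilde{a}^R_{ij}(\alpha)\, y_i \ge \tilde{c}^L_j(\alpha) , \quad j \in N , \\
& y_i \ge 0 , \quad i \in M .
\end{aligned}
\tag{8.54}
\]

Notice that (P1) and (D1) are classical dual linear programming problems and the same holds for (P2) and (D2).

Theorem 8.3 Strong Duality Theorem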
Let $\tilde{c}_j$, $\tilde{a}_{ij}$, and $\tilde{b}_i$ be fuzzy quantities for all $i \in M$ and $j \in N$.

i) Let $\tilde{X}$ be a feasible region of the FLP problem (8.41) with $\tilde{P} = \preceq^{\mathrm{Pos}}$, and $\tilde{Y}$ be a feasible region of the FLP problem (8.43) with $\tilde{Q} = \preceq^{\mathrm{Nec}}$. If for some $\alpha \in (0,1)$, $[\tilde{X}]_\alpha$ and $[\tilde{Y}]_{1-\alpha}$ are nonempty, then there exists $x^*$, an $(\alpha,\alpha)$-maximal solution of the FLP problem $\mathcal{P}$, and there exists $y^*$, a $(1-\alpha, 1-\alpha)$-minimal solution of the FLP problem $\mathcal{D}$, such that

\[ \sum_{j \in N} \tilde{c}^R_j(\alpha)\, x^*_j = \sum_{i \in M} \tilde{b}^R_i(\alpha)\, y^*_i . \tag{8.55} \]

ii) Let $\tilde{X}$ be a feasible region of the FLP problem (8.41) with $\tilde{P} = \preceq^{\mathrm{Nec}}$, and $\tilde{Y}$ be a feasible region of the FLP problem (8.43) with $\tilde{Q} = \preceq^{\mathrm{Pos}}$. If for some $\alpha \in (0,1)$, $[\tilde{X}]_{1-\alpha}$ and $[\tilde{Y}]_\alpha$ are nonempty, then there exists $x^*$, a $(1-\alpha, 1-\alpha)$-maximal solution of the FLP problem $\mathcal{P}$, and $y^*$, an $(\alpha,\alpha)$-minimal solution of the FLP problem $\mathcal{D}$, such that

\[ \sum_{j \in N} \tilde{c}^L_j(\alpha)\, x^*_j = \sum_{i \in M} \tilde{b}^L_i(\alpha)\, y^*_i . \tag{8.56} \]

Remark 8.5

In the crisp case, Theorem 8.3 is the standard linear programming (strong) duality theorem.


Remark 8.6

Let $\alpha \ge 0.5$. Then $[\tilde{X}]_\alpha \subset [\tilde{X}]_{1-\alpha}$, $[\tilde{Y}]_\alpha \subset [\tilde{Y}]_{1-\alpha}$, hence in the strong duality theorem, we can assume $x \in [\tilde{X}]_\alpha$ and $y \in [\tilde{Y}]_\alpha$. Evidently, the statement of the theorem remains unchanged.


Remark 8.7

Theorem 8.3 provides only the existence of the $(\alpha,\alpha)$-maximal solution (or $(1-\alpha, 1-\alpha)$-maximal solution) of the FLP problem $\mathcal{P}$, and of the $(1-\alpha, 1-\alpha)$-minimal solution ($(\alpha,\alpha)$-minimal solution) of the FLP problem $\mathcal{D}$ such that (8.55) or (8.56) holds. However, the proof of the theorem also gives a method for finding the solutions by solving the linear programming problems (P1) and (D1).

8.4 Conclusion


The leading idea of this chapter is based on the fact that, in many cases, the solutions of decision problems obtained through linear programming are numerically tractable approximations of the original nonlinear problems. Because of the practical relevance of linear programming and taking into account a vast literature on this subject, we extended linear programming theory to problems involving fuzzy data. To obtain a meaningful extension of linear programming to problems involving fuzzy data, we specified a suitable class of permitted fuzzy values called fuzzy quantities or fuzzy numbers, introduced fundamental arithmetic operations with such fuzzy numbers, defined inequalities between fuzzy numbers, and clarified

9. Basic Solutions of Fuzzy Coalitional Games


Tomas Kroupa, Milan Vlach 


This chapter is concerned with basic concepts of solution for coalitional games with fuzzy coalitions in the case of finitely many players and transferable utility. The focus is on those solutions which preoccupy the main part of cooperative game theory (the core and the Shapley value). A detailed discussion or just the comprehensive overview of current trends in fuzzy games is beyond the reach of this chapter. Nevertheless, we mention current developments and briefly discuss other solution concepts.


The theory of cooperative games builds and analyses 
mathematical models of situations in which players can 
form coalitions and make binding agreements on how 
to share results achieved by these coalitions. One of 
the basic models of cooperative games is a cooperative 
game in coalitional form (briefly a coalitional game or 
a game). Following Osborne and Rubinstein [9.1] we 
assume that the data specifying a coalitional game are 
composed of: 


• A nonempty set $\Omega$ (the set of players) and a nonempty set $X$ (the set of consequences),

• A mapping $V$ that assigns to every subset $S$ of $\Omega$ a subset $V(S)$ of $X$, and

• A family $\{\succsim_i\}_{i \in \Omega}$ of binary relations on $X$ (players' preference relations).


The set $\Omega$ of all players is usually referred to as the grand coalition, subsets of $\Omega$ are called coalitions, and the mapping $V$ is called the characteristic function (or coalition function) of the game.

This definition provides a rather general frame- 
work for analyzing many classes of coalitional games. 
The games of this type are usually called coalitional 
games without side payments or without transferable 
payoff (or utility). Obviously, for many purposes, this 
framework is too general because it neither speci- 
fies some useful structure of the set of consequences 
nor properties of preference relations. At the same
time, this framework is also too restrictive because of
requiring that the domain of the characteristic func- 
tion must be the system of all subsets of the player 
set. 

In this chapter, we are mainly concerned with coalitional games in which the number of players is finite. The number of players will be denoted by $n$ and, without loss of generality, the players will be named by integers $1,2,\dots,n$. In other words, we set $\Omega = N$ where $N = \{1,2,\dots,n\}$. Moreover, we assume that the sets $V(S)$ of consequences are subsets of the $n$-dimensional real linear space $\mathbb{R}^n$, and that each player $i$ prefers $(x_1,\dots,x_n)$ to $(y_1,\dots,y_n)$ if and only if $x_i > y_i$. Furthermore, we significantly restrict the generality by considering only the so-called coalitional games with transferable payoff or utility. This class of games is a subclass of games without transferable utility that is characterized by the property: for each coalition $S$, there exists a real number $v(S)$ such that

\[ V(S) = \Bigl\{ x \in \mathbb{R}^n : \sum_{i \in S} x_i \le v(S) \text{ and } x_j = 0 \text{ if } j \notin S \Bigr\} . \]

Evidently, each such game can be identified with the corresponding real-valued function $v$ defined on the system of all subsets of $N$.

In coalitional games, whether with transferable or nontransferable utility, each player has only two alternatives of participation in a nonempty coalition: full participation or no participation. This assumption is too restrictive in many situations, and there has been a need for models that give players the possibility of participation in some or all intermediate levels between these two extreme involvements.

The first mathematical models in the form of coali- 
tional games in which the players are permitted to 
participate in a coalition not only fully or not at all 
but also partially were proposed by Butnariu [9.2] 
and Aubin [9.3]. Aubin notices that the idea of par- 
tial participation in a coalition was used already in the 
Shapley—Shubik paper on market games [9.4]. In these 
models, the subsets of N no longer represent every pos- 
sible coalition. Instead, a notion of a coalition has to be 
introduced that makes it possible to represent the partial 
membership degrees. 

It has become customary to assume that a membership degree of player $i \in N$ is determined by a number $a_i$ in the unit interval $I = [0,1]$, and to call the resulting vector $a = (a_1,\dots,a_n) \in I^n$ a fuzzy coalition. The $n$-dimensional cube $I^n$ is thus identified with the set of all fuzzy coalitions. Every subset $S$ of $N$, that is, every coalition $S$, can be viewed as an $n$-vector from $\{0,1\}^n$ whose $i$th component is 1 when $i \in S$ and 0 when $i \notin S$. These special fuzzy coalitions are often called crisp coalitions. Hence, we may think of the set of all fuzzy coalitions $I^n$ as the convex closure of the set $\{0,1\}^n$ of all crisp coalitions. This leads to the notion of an $n$-player coalitional game with fuzzy coalitions and transferable utility (briefly a fuzzy game) as a bounded function $v\colon I^n \to \mathbb{R}$ satisfying $v(0) = 0$.
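The identification of crisp coalitions with vertices of the unit cube can be sketched in a few lines of Python. The helper names and the sample game `example_game` below are hypothetical and serve only to illustrate the definitions; the chapter does not single out any particular fuzzy game at this point.

```python
def crisp_to_fuzzy(S, n):
    """Indicator vector in {0,1}^n of a crisp coalition S of players 1..n."""
    return tuple(1.0 if i in S else 0.0 for i in range(1, n + 1))


def is_fuzzy_coalition(a):
    """A fuzzy coalition is any vector with all membership degrees in [0, 1]."""
    return all(0.0 <= ai <= 1.0 for ai in a)


def is_crisp(a):
    """Crisp coalitions are exactly the fuzzy coalitions with degrees 0 or 1."""
    return all(ai in (0.0, 1.0) for ai in a)


def example_game(a):
    """Hypothetical fuzzy game: bounded on the unit cube and zero at the origin."""
    return sum(a) ** 2


print(crisp_to_fuzzy({1, 3}, 3))             # coalition {1, 3} among 3 players
print(is_fuzzy_coalition((0.5, 0.2, 1.0)))   # partial participation is allowed
print(is_crisp((0.5, 0.2, 1.0)))
print(example_game((0.0, 0.0, 0.0)))
```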

It turns out that most classes of coalitional games 
with transferable utility and most solution concepts 
have natural counterparts in the theory of fuzzy games 
with transferable utility. Therefore, in what follows, we 
start with the classical case (Sect. 9.1) and then deal 
with the fuzzy case (Sect. 9.2). Taking into account 
that, in comparison with the classical case, the theory 
of fuzzy games is relatively less developed, we focus 
attention on two well-established solution concepts of 
fuzzy games: the core and the Shapley value. 


9.1 Coalitional Games with Transferable Utility 


We know from the beginning of this chapter that from the mathematical point of view, every $n$-player coalitional game with transferable utility can be identified with a real-valued function $v$ defined on the system of all subsets of the set $N = \{1,2,\dots,n\}$. For convenience, we always assume that $v(\emptyset) = 0$.

It is customary to interpret the value $v(S)$ of the characteristic function $v$ at coalition $S$ as the worth of coalition $S$ or the total payoff that coalition $S$ will be able to distribute among its members, provided exactly the coalition $S$ forms. However, equally well, the number $v(S)$ may represent the total cost of reaching some common goal of coalition $S$ that must be shared by the members of $S$; or some other quantity, depending on the application field. In conformity with the players' preferences stated previously, we usually assume that $v(S)$ represents the total payoff that $S$ can distribute among its members.

Since the preferences are fixed, we denote the game given through $N$ and $v$ by $(N,v)$, or simply $v$, and the collection of all games with fixed $N$ by $G_N$. The sum $v + w$ of games from $G_N$ defined by $(v + w)(S) = v(S) + w(S)$ for each coalition $S$ is again a game from $G_N$. Moreover, if multiplication of $v \in G_N$ by a real number $\alpha$ is defined by $(\alpha v)(S) = \alpha v(S)$ for each coalition $S$, then $\alpha v$ also belongs to $G_N$. An important and well-known fact is that $G_N$ endowed with these two algebraic operations is a real linear space.


Example 9.1 Simple games

If the range of a game $v$ is the two-element set $\{0,1\}$ only, then the game can be viewed as a model of a voting system where each coalition $A \subseteq N$ is either winning ($v(A) = 1$) or losing ($v(A) = 0$). Then it is natural to assume that the game also satisfies monotonicity; that is, if coalition $A$ is winning and $B$ is a coalition with $A \subseteq B$, then $B$ is also winning. It is also natural to consider only games with at least one winning coalition. Thus, we define a simple game [9.5, Section 2.2.3] to be a $\{0,1\}$-valued coalitional game $v$ such that the grand coalition is winning and $v(A) \le v(B)$, whenever $A \subseteq B$ for each $A, B \subseteq N$.


We say that a game $v$ is superadditive if $v(A \cup B) \ge v(A) + v(B)$ for every disjoint pair of coalitions $A, B \subseteq N$. In a superadditive game, it may be advantageous for members of disjoint coalitions $A$ and $B$ to form the coalition $A \cup B$ because every pair of disjoint coalitions can obtain jointly at least as much as they could have obtained separately. Consequently, it is advantageous to form the largest possible coalitions, that is, the grand coalition.

A strengthening of the property of superadditivity is the assumption of nondecreasing marginal contribution of a player to each coalition with respect to coalition inclusion: a game $v$ is said to be convex whenever

\[ v(A \cup \{i\}) - v(A) \le v(B \cup \{i\}) - v(B) \]

for each $i \in N$ and every $A \subseteq B \subseteq N \setminus \{i\}$. It can be directly checked that convexity of $v$ is equivalent to

\[ v(A \cup B) + v(A \cap B) \ge v(A) + v(B) \]

for every $A, B \subseteq N$.


Example 9.2
Let $B$ be a nonempty coalition in a simple game $(N,v)$. Then the game $v_B$ given by

\[
v_B(A) =
\begin{cases}
1 , & A \supseteq B , \\
0 , & \text{otherwise} ,
\end{cases}
\qquad A \subseteq N ,
\]

is a convex simple game.


Example 9.3 Bankruptcy game [9.6]

Let $e > 0$ be the total value of assets held in a bankruptcy estate of a debtor and let $N$ be the set of all creditors. Furthermore, let $d_i > 0$ be the debt to creditor $i \in N$. Assume that $e \le \sum_{i \in N} d_i$. The bankruptcy game is then the game such that, for every $A \subseteq N$,

\[ v(A) = \max\Bigl(0,\; e - \sum_{i \in N \setminus A} d_i\Bigr) . \]

It can be shown that the bankruptcy game is convex.
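For small player sets, convexity of a bankruptcy game can be verified by brute force using the inequality $v(A \cup B) + v(A \cap B) \ge v(A) + v(B)$. The following Python sketch does this; the estate, the debts, and the function names are assumptions chosen for illustration.

```python
from itertools import combinations


def coalitions(players):
    """Yield every subset of the player set as a frozenset."""
    players = list(players)
    for k in range(len(players) + 1):
        for combo in combinations(players, k):
            yield frozenset(combo)


def bankruptcy_game(estate, debts):
    """v(A) = max(0, estate - total debt owed to creditors outside A)."""
    players = frozenset(debts)
    return {
        A: max(0.0, estate - sum(debts[i] for i in players - A))
        for A in coalitions(players)
    }


def is_convex(v, players):
    """Check v(A | B) + v(A & B) >= v(A) + v(B) for all coalitions A, B."""
    subsets = list(coalitions(players))
    return all(
        v[A | B] + v[A & B] >= v[A] + v[B] for A in subsets for B in subsets
    )


# Hypothetical data: estate 200, three creditors with debts 100, 200, 300.
debts = {1: 100, 2: 200, 3: 300}
v = bankruptcy_game(200, debts)
print(v[frozenset({2, 3})], v[frozenset(debts)])
print(is_convex(v, debts))
```

The check enumerates all pairs of coalitions, so it is practical only for a handful of players, but it is a convenient sanity test when experimenting with particular games.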


There is a variety of solution concepts for coali- 
tional games with n players. Some, like the core, stable 
set or bargaining set, may consist of sets of real n- 
vectors, while others offer as a solution of a game 
a single real n-vector. 


9.1.1 The Core 


Let $v$ be an $n$-player coalitional game with transferable
utility. The core of $v$ is the set of all efficient payoff
vectors $x \in \mathbb{R}^n$ upon which no coalition can improve,
that is,

$$C(v) = \Big\{ x \in \mathbb{R}^n \;\Big|\; \sum_{i \in N} x_i = v(N) \ \text{and} \ \sum_{i \in A} x_i \geq v(A) \ \text{for each } A \subseteq N \Big\}. \tag{9.1}$$


The Bondareva–Shapley theorem [9.5, Theorem 3.1.4] gives
a necessary and sufficient condition for nonemptiness of the core
in terms of the so-called balanced systems. It is easy to see
that the core of every game is a (possibly empty) convex
polytope. Moreover, the core of a convex game is always nonempty
and its vertices can be explicitly characterized [9.7].
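The definition (9.1) translates directly into a membership test. The sketch below is illustrative only: it reuses a hypothetical bankruptcy game (estate 200, debts 100, 200, 300) and the helper names `in_core` and `marginal_vector` are not from the source. It also relies on the standard fact that, in a convex game, every vector of marginal contributions along an ordering of the players belongs to the core.

```python
from itertools import combinations, permutations

DEBTS = {1: 100, 2: 200, 3: 300}  # hypothetical debts
ESTATE = 200                      # hypothetical estate


def v(coalition):
    """Bankruptcy game: what is left after paying the creditors outside the coalition."""
    outside = set(DEBTS) - set(coalition)
    return max(0, ESTATE - sum(DEBTS[i] for i in outside))


def in_core(x, tol=1e-9):
    """Check efficiency and coalitional rationality of the payoff dict x, as in (9.1)."""
    players = sorted(DEBTS)
    if abs(sum(x[i] for i in players) - v(players)) > tol:
        return False
    for size in range(1, len(players)):
        for combo in combinations(players, size):
            if sum(x[i] for i in combo) < v(combo) - tol:
                return False
    return True


def marginal_vector(order):
    """Pay each player the marginal contribution to the players preceding them."""
    x, room = {}, []
    for i in order:
        x[i] = v(room + [i]) - v(room)
        room.append(i)
    return x


all_marginals_in_core = all(
    in_core(marginal_vector(order)) for order in permutations(DEBTS)
)
```

The test enumerates all proper nonempty coalitions, so its cost grows exponentially with the number of players.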


9.1.2 The Shapley Value 


Let $f = (f_1, f_2, \ldots, f_n)$ be a mapping that assigns to
every game $v$ from some collection of games from $G_N$ a real
$n$-vector $f(v) = (f_1(v), f_2(v), \ldots, f_n(v))$. Following
the basic interpretation of the values of a characteristic
function as total payoffs, we can interpret the components of
$f(v)$ as payoffs to the individual players in game $v$.

Let $A$ be a nonempty collection of games from $G_N$. A solution
function on $A$ is a mapping $f$ from $A$ into the
$n$-dimensional real linear space $\mathbb{R}^n$. If the domain
$A$ of $f$ is not explicitly specified, then it is assumed to be
$G_N$. The collection of all such mappings is too broad to
contain only mappings that lead to sensible solution concepts.
Hence, to obtain reasonable solution concepts, we have to require
that the solution functions have some reasonable properties. One
natural property in many contexts is the following property of
efficiency.


Property 9.1 Efficiency 

A solution function $f$ on a subset $A$ of $G_N$ is efficient on
$A$ if $f_1(v) + f_2(v) + \cdots + f_n(v) = v(N)$ for every game
$v$ from $A$.


This property can be interpreted as a combination of the
requirements of feasibility, defined by
$f_1(v) + f_2(v) + \cdots + f_n(v) \leq v(N)$, and collective
rationality, defined by
$f_1(v) + f_2(v) + \cdots + f_n(v) \geq v(N)$.

In addition to satisfying the efficiency condition, 
solution functions are required to satisfy a number of 
other desirable properties. To introduce some of them, 
we need further definitions. 

Player $i$ from $N$ is a null player in game $v$ if
$v(S \cup \{i\}) = v(S)$ for every coalition $S$ that does not
contain player $i$; that is, participation of a null player in
a coalition does not contribute anything to the coalition in
question.

Player $i$ from $N$ is a dummy player in game $v$ if
$v(S \cup \{i\}) = v(S) + v(\{i\})$ for every coalition $S$ that
does not contain player $i$; that is, a dummy player contributes
to every coalition the same amount, namely his or her own value
$v(\{i\})$ of the characteristic function.

Players $i$ and $j$ from $N$ are interchangeable in game $v$ if
$v(S \cup \{i\}) = v(S \cup \{j\})$ for every coalition $S$ that
contains neither player $i$ nor player $j$. In other words, two
players are interchangeable if they can replace each other in
every coalition that contains one of them.


Property 9.2 Null player 
A solution function $f$ satisfies the null player property if
$f_i(v) = 0$ whenever $v \in G_N$ and $i$ is a null player in $v$.


Property 9.3 Dummy player

A solution function $f$ satisfies the dummy player property if
$f_i(v) = v(\{i\})$ whenever $v \in G_N$ and $i$ is a dummy
player in $v$.


Property 9.4 Equal treatment

A solution function $f$ satisfies the equal treatment property if
$f_i(v) = f_j(v)$ for every $v \in G_N$ and every pair of players
$i, j$ that are interchangeable in $v$.


These three properties are quite reasonable and attractive,
especially from the point of view of fairness and impartiality:
a player who contributes nothing should get nothing; a player who
contributes the same amount to every coalition cannot expect to
get anything other than what he or she contributed; and two
players who contribute the same to each coalition should be
treated equally by the solution function.

The next property reflects the natural requirement that the
solution function should be independent of the players' names.
Let $v$ be a game from $G_N$ and $\pi \colon N \to N$ be
a permutation of $N$, and let the image of coalition $S$ under
$\pi$ be denoted by $\pi(S)$. It is obvious that, for every
$v \in G_N$, the function $\pi v$ defined on the coalitions
$S \subseteq N$ by $(\pi v)(S) = v(\pi(S))$ is again a game from
$G_N$. Evidently, the game $\pi v$ differs from game $v$ only in
the players' names; they are interchanged by the permutation
$\pi$.


Property 9.5 Anonymity 
A solution function $f$ is said to be anonymous if, for every
permutation $\pi$ of $N$, we have $f_i(\pi v) = f_{\pi(i)}(v)$
for every game $v \in G_N$ and every player $i \in N$.


When a game consists of two independent games 
played separately by the same players or if a game is 
split into a sum of games, then it is natural to require 
the following property of additivity. 


Property 9.6 Additivity 

A solution function $f$ on $G_N$ is said to be additive if
$f(u + v) = f(u) + f(v)$ for every pair of games $u$ and $v$
from $G_N$.


The requirement of additivity differs from the previ- 
ous conditions in one important aspect. It involves two 
different games that may or may not be mutually depen- 
dent. In contrast, the dummy player and equal treatment 
properties involve only one game, and the anonymity 
property involves only those games which are com- 
pletely determined by a single game. 


Remark 9.1 

The terminology introduced in the literature for various 
properties of players and solution functions is not com- 
pletely standardized. For example, some authors use the 
term dummy player and symmetric players (or substi- 
tutes), for what we call null player and interchangeable 
players, respectively. Moreover, the term symmetry is 
sometimes used for our equal treatment and sometimes 
for our anonymity. 


One of the most studied and most influential single-valued
solution concepts for coalitional games with transferable utility
is the Shapley solution function, or briefly the Shapley value,
proposed by Shapley in 1953 [9.8]. The simplest way of
introducing the Shapley value is to define it explicitly by the
following well-known formula for the calculation of its
components.


Definition 9.1

The Shapley value on a subset $A$ of $G_N$ is a solution function
$\varphi$ on $A$ whose components
$\varphi_1(v), \varphi_2(v), \ldots, \varphi_n(v)$ at game
$v \in A$ are defined by

$$\varphi_i(v) = \sum_{S \subseteq N,\; i \in S} \frac{(s-1)!\,(n-s)!}{n!} \big[ v(S) - v(S \setminus \{i\}) \big], \tag{9.2}$$

where the sum is meant over all coalitions $S$ containing player
$i$, and $s$ generically stands for the number of players in
coalition $S$.

To clarify the basic idea behind this definition, we first recall
the notion of players' marginal contributions to coalitions.


Definition 9.2 

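For each player $i$ and each coalition $S$, the marginal
contribution of player $i$ to coalition $S$ in game $v$ from
$G_N$ is the number $m_i^v(S)$ defined by

$$m_i^v(S) = \begin{cases} v(S) - v(S \setminus \{i\}) & \text{if } i \in S, \\ v(S \cup \{i\}) - v(S) & \text{if } i \notin S. \end{cases}$$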


Now imagine a procedure for dividing the total payoff $v(N)$
among the members of $N$ in which the players enter a room in
some prescribed order and each player receives as payoff his or
her marginal contribution to the coalition of players already in
the room. Suppose that the prescribed order is
$(\pi(1), \pi(2), \ldots, \pi(n))$, where $\pi \colon N \to N$ is
a fixed permutation of $N$. Then the procedure under
consideration determines the payoffs to individual players as
follows: before the first player $\pi(1)$ enters the room, the
empty coalition is waiting in the room. After player $\pi(1)$
enters, the coalition in the room becomes $\{\pi(1)\}$ and the
player receives $v(\{\pi(1)\}) - v(\emptyset)$. Similarly, before
the second player $\pi(2)$ enters, the coalition $\{\pi(1)\}$ is
waiting in the room. After player $\pi(2)$ enters, the coalition
$\{\pi(1), \pi(2)\}$ is formed in the room and player $\pi(2)$
receives $v(\{\pi(1), \pi(2)\}) - v(\{\pi(1)\})$. This continues
until the last player $\pi(n)$ enters and receives
$v(N) - v(N \setminus \{\pi(n)\})$.

Let $S_i^\pi$ denote the coalition of players preceding player
$i$ in the order given by $(\pi(1), \pi(2), \ldots, \pi(n))$;
that is, $S_i^\pi = \{\pi(1), \pi(2), \ldots, \pi(j-1)\}$, where
$j$ is the uniquely determined member of $N$ such that
$i = \pi(j)$. Because there are $n!$ possible orders, the
arithmetical average of the marginal contributions of player $i$
taken over all possible orderings is equal to the number
$(1/n!) \sum_\pi m_i^v(S_i^\pi)$, where the sum is understood
over all permutations $\pi$ of $N$. This number is exactly the
$i$-th component of the Shapley value. Therefore, in addition to
the equality (9.2), we also have the equality

$$\varphi_i(v) = \frac{1}{n!} \sum_{\pi} m_i^v(S_i^\pi) \tag{9.3}$$

for computing the components of the Shapley value.
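Both ways of computing the Shapley value can be compared on a small example. The sketch below is an illustration under stated assumptions: the three-player game `v` is hypothetical (it is worth 100 to coalition {2, 3}, 200 to the grand coalition, and 0 otherwise), and the function names `shapley_subsets` and `shapley_orders` are introduced here only. Exact rational arithmetic avoids rounding issues when the two results are compared.

```python
from fractions import Fraction
from itertools import combinations, permutations
from math import factorial

PLAYERS = (1, 2, 3)


def v(coalition):
    """Hypothetical three-player game used only for illustration."""
    members = frozenset(coalition)
    if members == frozenset(PLAYERS):
        return 200
    if members == frozenset({2, 3}):
        return 100
    return 0


def shapley_subsets(players, game):
    """Formula (9.2): weighted sum over all coalitions S containing player i."""
    n = len(players)
    phi = {}
    for i in players:
        others = [j for j in players if j != i]
        total = Fraction(0)
        for size in range(n):
            for combo in combinations(others, size):
                rest = frozenset(combo)      # S without player i
                s = len(rest) + 1            # number of players in S
                weight = Fraction(factorial(s - 1) * factorial(n - s), factorial(n))
                total += weight * (game(rest | {i}) - game(rest))
        phi[i] = total
    return phi


def shapley_orders(players, game):
    """Formula (9.3): average marginal contribution over all n! entry orders."""
    phi = {i: Fraction(0) for i in players}
    for order in permutations(players):
        room = frozenset()
        for i in order:
            phi[i] += game(room | {i}) - game(room)
            room = room | {i}
    n_factorial = factorial(len(players))
    return {i: total / n_factorial for i, total in phi.items()}


by_subsets = shapley_subsets(PLAYERS, v)
by_orders = shapley_orders(PLAYERS, v)
```

The subset formula needs on the order of $2^{n-1}$ evaluations per player, whereas the permutation formula needs $n!$ orderings, so the former is preferable for anything but very small $n$.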

In addition to satisfying the condition of efficiency, the
Shapley value has a number of other useful properties. In
particular, it satisfies all Properties 9.2-9.6. Remarkably, no
other efficient solution function on $G_N$ satisfies the
properties of null player, equal treatment, and additivity at the
same time.


Theorem 9.1 Shapley

For each $N$, there exists a unique solution function on $G_N$
satisfying the properties of efficiency, null player, equal
treatment, and additivity; this solution function is the Shapley
value introduced by Definition 9.1.


The standard proof of this basic result follows from the
following facts:


• The collection $\{u_T : T \neq \emptyset,\ T \subseteq N\}$ of
unanimity games defined by

$$u_T(S) = \begin{cases} 1 & \text{if } T \subseteq S, \\ 0 & \text{otherwise}, \end{cases} \tag{9.4}$$

forms a basis of the linear space $G_N$.

• Together with efficiency, the null player and equal treatment
properties guarantee that $\varphi$ is determined uniquely on
multiples of unanimity games.

• The property of additivity (combined with the fact that the
unanimity games form a basis) makes it possible to extend
$\varphi$ in a unique way to the whole space $G_N$.
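The three facts above also suggest a way of computing the Shapley value. The sketch below is an illustration that goes slightly beyond the text: it uses the standard inversion formula $\lambda_T = \sum_{R \subseteq T} (-1)^{|T|-|R|} v(R)$ for the coefficients of $v$ in the basis of unanimity games, and the fact that efficiency, null player, and equal treatment force $\varphi_i(\lambda u_T) = \lambda/|T|$ for $i \in T$ and $0$ otherwise. The game `v` and all function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

PLAYERS = (1, 2, 3)


def v(coalition):
    """Hypothetical game: 100 for {2, 3}, 200 for the grand coalition, 0 otherwise."""
    members = frozenset(coalition)
    if members == frozenset(PLAYERS):
        return 200
    if members == frozenset({2, 3}):
        return 100
    return 0


def nonempty_subsets(players):
    """Yield every nonempty subset of the given players as a frozenset."""
    items = sorted(players)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def unanimity_coefficients(players, game):
    """Coefficients lambda_T of the game in the basis of unanimity games u_T."""
    return {
        T: sum((-1) ** (len(T) - len(R)) * game(R) for R in nonempty_subsets(T))
        for T in nonempty_subsets(players)
    }


def shapley_from_coefficients(players, coefficients):
    """Additivity: each lambda_T is shared equally among the members of T."""
    return {
        i: sum(Fraction(c, len(T)) for T, c in coefficients.items() if i in T)
        for i in players
    }


coefficients = unanimity_coefficients(PLAYERS, v)
phi = shapley_from_coefficients(PLAYERS, coefficients)

# Reconstruct the game from its coefficients: v(S) is the sum of lambda_T over T inside S.
reconstructed = {
    S: sum(c for T, c in coefficients.items() if T <= S)
    for S in nonempty_subsets(PLAYERS)
}
```

For this game only the coalitions {2, 3} and {1, 2, 3} carry a nonzero coefficient, which makes the equal division among their members easy to follow by hand.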


In addition to the original axiomatization by Shap- 
ley, there exist several equally beautiful alternative 
axiomatizations of the Shapley value that do not use the 
property of additivity [9.9, 10]. 


9.1.3 Probabilistic Values 


Let us fix some player $i$ and, for every coalition $S$ that does
not contain player $i$, denote by $\alpha_i(S)$ the number
$s!\,(n-s-1)!/n!$. The family
$\{\alpha_i(S) : S \subseteq N \setminus \{i\}\}$ is
a probability distribution over the set of coalitions not
containing player $i$. Because the $i$-th component of the
Shapley value can be computed by

$$\varphi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \alpha_i(S) \big[ v(S \cup \{i\}) - v(S) \big],$$

we see that the $i$-th component of the Shapley value is the
expected marginal contribution of player $i$ with respect to the
probability measure
$\{\alpha_i(S) : S \subseteq N \setminus \{i\}\}$ and that the
Shapley value belongs to the following class of solution
functions:


Definition 9.3
A solution function $f$ on a subset $A$ of $G_N$ is called
probabilistic on $A$ if, for each player $i$, there exists
a probability distribution
$\{p_i(S) : S \subseteq N \setminus \{i\}\}$ on the collection of
coalitions not containing $i$ such that

$$f_i(v) = \sum_{S \subseteq N \setminus \{i\}} p_i(S) \big[ v(S \cup \{i\}) - v(S) \big] \tag{9.5}$$

for every $v \in A$.


The family of probabilistic solution functions embraces 
an enormous number of functions [9.11]. The efficient 
probabilistic solution functions are often called quasi- 
values, and the anonymous probabilistic solution func- 
tions are called semivalues. Since the Shapley value 
is anonymous and efficient on Gy, we know that it is 
both a quasivalue and a semivalue on Gy. Moreover, the 
Shapley value is the only probabilistic solution function 
with these properties. 


Theorem 9.2 Weber 

If $N$ has at least three elements, then the Shapley value is the
unique probabilistic solution function on $G_N$ that is anonymous
and efficient.


Another widely known probabilistic solution func- 
tion is the function proposed originally only for voting 
games by Banzhaf [9.12]. 


Definition 9.4 
The Banzhaf value on $G_N$ is a solution function $\psi$ on $G_N$
whose components at game $v$ are defined by

$$\psi_i(v) = \frac{1}{2^{n-1}} \sum_{S \subseteq N \setminus \{i\}} \big[ v(S \cup \{i\}) - v(S) \big]. \tag{9.6}$$


Again, by simple computation, we can verify that the Banzhaf
solution is a probabilistic solution function. Consequently, the
$i$-th component of the Banzhaf solution is the expected marginal
contribution of player $i$ with respect to the probability
measure $\{\beta_i(S) : S \subseteq N \setminus \{i\}\}$, where
$\beta_i(S) = 1/2^{n-1}$ for each subset $S$ of
$N \setminus \{i\}$. From the probabilistic point of view, the
Banzhaf solution is based on the assumption that each player $i$
is equally likely to join any subcoalition of
$N \setminus \{i\}$. On the other hand, the Shapley value is
based on the assumption that the coalition that player $i$ enters
is equally likely to be of any size $s$, $0 \leq s \leq n-1$, and
that all coalitions of this size are equally likely.
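The probabilistic reading of both values can be made concrete with one generic routine. In the sketch below, which is illustrative only, the probabilities depend just on the size of the coalition, which covers the Shapley and Banzhaf weights but not the general case of Definition 9.3; the three-player majority game and all function names are assumptions made here.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial

PLAYERS = (1, 2, 3)


def majority(coalition):
    """Hypothetical simple game: a coalition wins (worth 1) with at least two members."""
    return 1 if len(set(coalition)) >= 2 else 0


def shapley_weight(s, n):
    """alpha_i(S) = s! (n - s - 1)! / n! for a coalition S of size s without player i."""
    return Fraction(factorial(s) * factorial(n - s - 1), factorial(n))


def banzhaf_weight(s, n):
    """beta_i(S) = 1 / 2^(n-1): every coalition without player i is equally likely."""
    return Fraction(1, 2 ** (n - 1))


def probabilistic_value(players, game, weight):
    """Expected marginal contribution (9.5) for size-dependent probabilities."""
    n = len(players)
    value = {}
    for i in players:
        others = [j for j in players if j != i]
        total = Fraction(0)
        for size in range(n):
            for combo in combinations(others, size):
                S = frozenset(combo)
                total += weight(size, n) * (game(S | {i}) - game(S))
        value[i] = total
    return value


def total_probability(weight, n):
    """Sum of the weights over all coalitions not containing a fixed player."""
    return sum(
        weight(size, n) * Fraction(factorial(n - 1), factorial(size) * factorial(n - 1 - size))
        for size in range(n)
    )


shapley = probabilistic_value(PLAYERS, majority, shapley_weight)
banzhaf = probabilistic_value(PLAYERS, majority, banzhaf_weight)
```

In this example the Shapley value distributes exactly $v(N) = 1$, while the Banzhaf components add up to more than 1, which illustrates that the Banzhaf value need not be efficient.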


9.2 Coalitional Games with Fuzzy Coalitions 


Since the publication of Aubin's seminal paper [9.3], cooperative
scenarios allowing for players' fractional membership degrees in
coalitions have been studied. In such situations, the subsets of
$N$ no longer model every possible coalition. Instead, a notion
of coalition has to be introduced that makes it possible to
represent partial membership degrees. It has become customary to
assume that the membership degree of player $i \in N$ is
determined by a number $a_i$ in the unit interval $I = [0, 1]$,
and to call the resulting vector
$a = (a_1, \ldots, a_n) \in I^n$ a fuzzy coalition. (The choice
of $I^n$ is not the only possible choice; see [9.13] or the
discussion in [9.14].) The $n$-dimensional cube $I^n$ is thus
identified with the set of all fuzzy coalitions. Every subset $A$
of $N$, that is, every classical coalition, can be viewed as
a vector $1_A \in \{0, 1\}^n$ with coordinates

$$(1_A)_i = \begin{cases} 1 & \text{if } i \in A, \\ 0 & \text{otherwise}. \end{cases}$$

These special fuzzy coalitions are also called crisp coalitions.
When $A = \{i\}$ is a singleton, we write simply $1_i$ in place
of $1_{\{i\}}$. Hence, we may think of the set of all fuzzy
coalitions $I^n$ as the convex closure of the set $\{0, 1\}^n$ of
all crisp coalitions; see [9.14] for further explanation of this
convexification process.

Several definitions of fuzzy games appear in the literature
[9.3, 15]. We adopt the one used by Azrieli and Lehrer [9.13].
However, note that the authors of [9.13] use a slightly more
general definition, since they consider a fuzzy coalition $a$ to
be any nonnegative real vector such that $a \leq q$, where
$q \in \mathbb{R}^n$ is a given nonnegative vector.


Definition 9.5

An $n$-player game (with fuzzy coalitions and transferable
utility) is a bounded function $v \colon I^n \to \mathbb{R}$
satisfying $v(1_\emptyset) = 0$.


If we want to emphasize the dependence of Definition 9.5 on the
number $n$ of players, then we write $(I^n, v)$ in place of $v$.
Further, by $\bar{v}$ we denote the restriction of $v$ to all
crisp coalitions:

$$\bar{v}(A) = v(1_A), \quad A \subseteq N. \tag{9.7}$$

Hence, every game with fuzzy coalitions $v$ induces a classical
coalition game $\bar{v}$ with transferable utility. Most solution
concepts of cooperative game theory have been generalized to
games with fuzzy coalitions. A payoff vector is any vector $x$
with $n$ real coordinates,
$x = (x_1, \ldots, x_n) \in \mathbb{R}^n$. In a particular game
with fuzzy coalitions $(I^n, v)$, each player $i \in N$ obtains
the amount of utility $x_i$ as a result of his or her cooperative
activity. Consequently, a fuzzy coalition $a \in I^n$ gains the
amount

$$\langle a, x \rangle = \sum_{i=1}^{n} a_i x_i,$$

which is just the sum of the players' payoffs $x_i$ weighted by
their participation levels in the fuzzy coalition $a$. By
a feasible payoff in game $(I^n, v)$, we understand a payoff
vector $x$ with $\langle 1_N, x \rangle \leq v(1_N)$.

The following general definition captures most so- 
lution concepts for games with fuzzy coalitions. 


Definition 9.6 

Let $\Gamma_N$ be the class of all games with fuzzy coalitions
$(I^n, v)$ and let $A_N$ be its nonempty subclass. A solution on
$A_N$ is a function $\sigma$ that associates with each game
$(I^n, v)$ in $A_N$ a subset $\sigma(I^n, v)$ of the set

$$\{ x \in \mathbb{R}^n \mid \langle 1_N, x \rangle \leq v(1_N) \}$$

of all feasible payoffs in game $(I^n, v)$.


The choice of $\sigma$ is governed by all conceivable rules of
economic rationality. Every solution $\sigma$ is thus determined
by a system of restrictions on the set of all feasible payoff
vectors in the game. For example, we may formulate a set of
axioms for $\sigma$ to satisfy or single out inequalities making
the payoffs in $\sigma(I^n, v)$ stable, in some sense.


9.2.1 Multivalued Solutions 


Core 
The core is a solution concept $\sigma$ defined on the whole
class $\Gamma_N$ of games with fuzzy coalitions. We present the
definition that appeared in [9.3].


Definition 9.7
Let $N = \{1, \ldots, n\}$ be the set of all players and
$v \in \Gamma_N$. The core of $v$ is the set

$$C(v) = \{ x \in \mathbb{R}^n \mid \langle 1_N, x \rangle = v(1_N),\ \langle a, x \rangle \geq v(a) \ \text{for every } a \in I^n \}. \tag{9.8}$$


In words, the core of $v$ is the set of all payoff vectors $x$
such that no coalition $a \in I^n$ is better off when accepting
any other payoff vector $y \notin C(v)$. This is a consequence of
the two conditions in (9.8): Pareto efficiency
$\langle 1_N, x \rangle = v(1_N)$ requires that the profit of the
grand coalition is distributed among all the players in $N$, and
coalitional rationality $\langle a, x \rangle \geq v(a)$ means
that no coalition $a \in I^n$ accepts less than its profit
$v(a)$.

Observe that the core $C(v)$ of a game with fuzzy coalitions $v$
is an intersection of uncountably many halfspaces
$\langle a, x \rangle \geq v(a)$ with the affine hyperplane
$\langle 1_N, x \rangle = v(1_N)$. This implies that the core is
a possibly empty compact convex subset of $\mathbb{R}^n$, since
$C(v)$ is included in the core (9.1) of the classical coalition
game $\bar{v}$ given by (9.7). In this way, we may think of
Aubin's core $C(v)$ as a refinement of the classical core (9.1).

A payoff $x$ in the core $C(v)$ must meet uncountably many
restrictions represented by all coalitions $a \in I^n$. This
raises several questions:


1. When is $C(v)$ nonempty/empty?

2. When is $C(v)$ reducible to the intersection of finitely many
sets only?

3. For every $a \in I^n$, is there a core element $x \in C(v)$
giving coalition $a$ exactly its worth $v(a)$?

4. Is there an allocation rule for assigning payoffs in $C(v)$ to
fuzzy coalitions?


Azrieli and Lehrer formulated a necessary and sufficient
condition for nonemptiness of the core [9.13], thus generalizing
the well-known Bondareva–Shapley theorem for classical coalition
games. We will need an additional notion in order to state their
result. The strong superadditive cover of a game
$v \in \Gamma_N$ is a game $\hat{v} \in \Gamma_N$ such that, for
every $a \in I^n$,

$$\hat{v}(a) = \sup \Big\{ \sum_{k=1}^{\ell} \lambda_k v(a^k) \;\Big|\; \ell \in \mathbb{N},\ a^k \in I^n,\ a^k \leq a,\ \lambda_k \geq 0,\ \sum_{k=1}^{\ell} \lambda_k a^k = a \Big\}.$$

The nonemptiness of $C(v)$ depends on the value of $\hat{v}$ at
one point only.


Theorem 9.3 Azrieli and Lehrer [9.13]
Let $v \in \Gamma_N$. The core $C(v)$ is nonempty if and only if
$v(1_N) = \hat{v}(1_N)$.


The above theorem answers Question 1. Nevertheless, it may be
difficult to check the condition $\hat{v}(1_N) = v(1_N)$. Can we
simplify this task for some classes of games? In particular, can
we show that the shape of the core is simpler on some class of
games? This leads naturally to Question 2. Branzei et al. [9.16]
showed that the class of games for which this holds true is the
class of convex games. We say that a game $v \in \Gamma_N$ is
convex whenever the inequality

$$v(a + c) - v(a) \leq v(b + c) - v(b) \tag{9.9}$$

is satisfied for every $a, b, c \in I^n$ such that
$b + c \in I^n$ and $a \leq b$. A word of caution is in order
here: in general, as shown in [9.13], the convexity of the game
$v \in \Gamma_N$ does not imply and is not implied by the
convexity of $v$ as an $n$-place real function. The
game-theoretic convexity captures the economic principle of
nondecreasing marginal utility. Interestingly, this property
makes it possible to simplify the structure of $C(v)$.


Theorem 9.4 Branzei et al. [9.16]

Let $v \in \Gamma_N$ be a convex game. Then
$C(v) \neq \emptyset$ and, moreover, $C(v)$ coincides with the
core $C(\bar{v})$ of the classical coalition game $\bar{v}$.


The previous theorem, which solves Question 2, provides in fact
the complete characterization of the core on the class of convex
games with fuzzy coalitions. Indeed, since the game $\bar{v}$ is
convex, we can use the result of Shapley [9.7] to describe the
shape of $C(v) = C(\bar{v})$.
Question 3 motivates the following definition. A game
$v \in \Gamma_N$ is said to be exact whenever, for every
$a \in I^n$, there exists $x \in C(v)$ such that
$\langle a, x \rangle = v(a)$. The class of exact games can be
explicitly described [9.13].


Theorem 9.5
Let $v \in \Gamma_N$. Then the following properties are
equivalent:


i) $v$ is exact;
ii) $v(a) = \min \{ \langle a, x \rangle \mid x \in C(v) \}$;
iii) $v$ is simultaneously
a) a concave, positively homogeneous function on $I^n$, and
b) $v(\lambda a + (1 - \lambda) 1_N) = \lambda v(a) + (1 - \lambda) v(1_N)$
for every $a \in I^n$ and every $0 \leq \lambda \leq 1$.


The second equivalent property enables us to generate many
examples of exact games: it is enough to take the minimum of
a family of linear functions, all of which coincide at the point
$1_N$.
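The recipe for generating exact games can be tried out numerically. The sketch below is only an illustration: the three payoff vectors are hypothetical (each sums to 6, so the corresponding linear functions coincide at $1_N$), and the names `make_exact_game` and `dot` are not from the source. By construction every generating payoff vector $x^k$ satisfies $\langle a, x^k \rangle \geq v(a)$ for all $a$ and $\langle 1_N, x^k \rangle = v(1_N)$, so it lies in the core, and for each $a$ one of them attains $v(a)$.

```python
import random


def dot(a, x):
    """Inner product <a, x> of a fuzzy coalition a and a payoff vector x."""
    return sum(ai * xi for ai, xi in zip(a, x))


def make_exact_game(payoffs):
    """Return v(a) = min over the given payoff vectors x of <a, x>."""
    def v(a):
        return min(dot(a, x) for x in payoffs)
    return v


# Hypothetical payoff vectors; every one of them distributes the same total of 6.
PAYOFFS = [(1, 2, 3), (3, 2, 1), (2, 2, 2)]
v = make_exact_game(PAYOFFS)

# Sample some fuzzy coalitions from the cube to inspect the definition of exactness.
rng = random.Random(0)
samples = [tuple(rng.random() for _ in range(3)) for _ in range(1000)]

core_constraints_hold = all(
    dot(a, x) >= v(a) for a in samples for x in PAYOFFS
)
worth_is_attained = all(
    any(dot(a, x) == v(a) for x in PAYOFFS) for a in samples
)
```

The sample only illustrates the two conditions; they hold for every fuzzy coalition because $v$ is defined as a pointwise minimum.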


Question 4 amounts to asking for the existence of allocation
rules in the sense of Lehrer [9.17] or dynamic procedures for
approximating the core elements by Wu [9.18]. A bargaining
procedure for recovering the elements of Aubin's core $C(v)$ is
discussed in [9.19], where the authors present the so-called
Cimmino-style bargaining scheme. For a game $v \in \Gamma_N$ and
some initial payoff $x^0 \in \mathbb{R}^n$, the goal is to
recover a sequence of payoffs converging to a core element,
provided that $C(v) \neq \emptyset$. We consider a probability
measure that captures the bargaining power of coalitions
$a \in I^n$: a coalitional assessment is any complete probability
measure $\nu$ on $I^n$. In what follows, we will require that
$\nu$ is such that, for every Lebesgue measurable set
$A \subseteq I^n$,

$$\nu(A) > 0 \quad \text{whenever } A \text{ is open or } 1_N \in A. \tag{9.10}$$


Let $x \in \mathbb{R}^n$ be an arbitrary payoff and $a \in I^n$.
We denote

$$C_a(v) = \begin{cases} \{ y \in \mathbb{R}^n \mid \langle a, y \rangle \geq v(a) \} & a \in I^n \setminus \{1_N\}, \\ \{ y \in \mathbb{R}^n \mid \langle 1_N, y \rangle = v(1_N) \} & a = 1_N. \end{cases}$$

What happens when payoff $x$ is accepted by $a$, that is,
$x \in C_a(v)$? Then coalition $a$ has no incentive to bargain
for another payoff. On the contrary, if $x \notin C_a(v)$, then
$a$ may seek the payoff $P_a x \in C_a(v)$ such that $P_a x$ is
the closest to $x$ in some sense. Specifically, we will assume
that $P_a x$ minimizes the Euclidean distance of $x$ from the set
$C_a(v)$. This yields the formula

$$P_a x = \arg\min_{y \in C_a(v)} \| y - x \| = \begin{cases} x + \dfrac{\max\{0,\ v(a) - \langle a, x \rangle\}}{\|a\|^2}\, a & a \in I^n \setminus \{1_\emptyset, 1_N\}, \\[2ex] x + \dfrac{v(1_N) - \langle 1_N, x \rangle}{n}\, 1_N & a = 1_N, \\[2ex] x & a = 1_\emptyset, \end{cases}$$

where $\| \cdot \|$ is the Euclidean norm. After all coalitions
$a \in I^n$ have raised their requests on the new payoff
$P_a x$, we will average their demands with respect to the
coalitional assessment $\nu$ in order to obtain a new proposal
payoff vector $Px$. Hence, $Px$ is computed as

$$Px = \int_{I^n} P_a x \, d\nu(a).$$

The integral on the right-hand side is well defined whenever $v$
is Lebesgue measurable. The amalgamated projection operator $P$
is the main tool in the Cimmino-style bargaining procedure: an
initial payoff $x^0$ is arbitrary, and we put
$x^k = P x^{k-1}$ for each $k = 1, 2, \ldots$
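The iteration $x^k = P x^{k-1}$ can be sketched numerically under strong simplifying assumptions. In the code below the coalitional assessment is replaced by a finite discrete measure: mass 0.3 on the grand coalition $1_N$ and the remaining mass spread evenly over 200 randomly drawn fuzzy coalitions. Such a measure does not satisfy (9.10), so the run only enforces the sampled restrictions and is not the procedure analyzed in [9.19]. The game $v(a) = (a_1 + a_2 + a_3)^2 / 3$, the weights, the sample size, and the number of iterations are all hypothetical choices; the payoff $(1, 1, 1)$ belongs to the core of this game, so the core is nonempty.

```python
import random


def dot(a, x):
    """Inner product <a, x>."""
    return sum(ai * xi for ai, xi in zip(a, x))


def v(a):
    """Hypothetical continuous game on the cube with v(0) = 0 and v(1_N) = 3."""
    return sum(a) ** 2 / 3.0


def project(a, x):
    """Euclidean projection P_a x of the payoff x onto the set C_a(v)."""
    n = len(x)
    if all(ai == 1.0 for ai in a):            # a = 1_N: efficiency hyperplane
        shift = (v(a) - dot(a, x)) / n
        return [xi + shift for xi in x]
    norm_sq = dot(a, a)
    if norm_sq == 0.0:                        # empty coalition: no restriction
        return list(x)
    step = max(0.0, v(a) - dot(a, x)) / norm_sq
    return [xi + step * ai for xi, ai in zip(x, a)]


def cimmino_step(x, coalitions, weights):
    """Average the projections P_a x with respect to the discrete weights."""
    new_x = [0.0] * len(x)
    for a, w in zip(coalitions, weights):
        for i, p_i in enumerate(project(a, x)):
            new_x[i] += w * p_i
    return new_x


rng = random.Random(0)
sample = [tuple(rng.random() for _ in range(3)) for _ in range(200)]
coalitions = [(1.0, 1.0, 1.0)] + sample
weights = [0.3] + [0.7 / len(sample)] * len(sample)

x = [0.0, 0.0, 0.0]                           # arbitrary initial payoff x^0
for _ in range(500):
    x = cimmino_step(x, coalitions, weights)

largest_violation = max(max(0.0, v(a) - dot(a, x)) for a in sample)
```

After the run, `x` distributes approximately $v(1_N) = 3$ and satisfies the sampled coalitional restrictions up to rounding; whether it belongs to the full core $C(v)$ is not guaranteed by a finite sample.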


Theorem 9.6 
Let $v \in \Gamma_N$ be a continuous game with fuzzy coalitions
and let $\nu$ be a coalitional assessment satisfying (9.10):


1. If the sequence $(x^k)_{k \in \mathbb{N}}$ generated by the
Cimmino procedure is bounded and

$$\lim_{k \to \infty} \int_{[0,1]^n} \| P_a x^k - x^k \| \, d\nu(a) = 0, \tag{9.11}$$

then $C(v) \neq \emptyset$ and
$\lim_{k \to \infty} x^k \in C(v)$.


2. If the sequence $(x^k)_{k \in \mathbb{N}}$ is unbounded or
(9.11) does not hold, then $C(v) = \emptyset$.


The interested reader is invited to consult [9.19] for fur- 
ther details and numerical experiments. 


9.2.2 Single-Valued Solutions 


Shapley value 
Aubin defined the Shapley value on spaces of games with fuzzy
coalitions possessing nice analytical properties
[9.3, 14, Chapter 13.4]. Specifically, let a function
$v \colon \mathbb{R}^n \to \mathbb{R}$ be positively homogeneous
and Lipschitz in the neighborhood of $1_N$. Such functions are
termed generalized sharing games with side payments by Aubin
[9.14, Chap. 13.4]. The restriction of $v$ to the cube $I^n$ is
clearly a game with fuzzy coalitions and therefore we do not make
any distinction between $v$ and its restriction to $I^n$. In
addition, assume that the function $v$ is continuously
differentiable at $1_N$ and denote by $G_N^1$ the class of all
such games with fuzzy coalitions. Hence, we may put

$$\sigma(v) = \nabla v(1_N), \quad v \in G_N^1. \tag{9.12}$$

Each coordinate $\sigma_i(v)$ of the gradient vector $\sigma(v)$
captures the marginal contribution of player $i \in N$ to the
grand coalition $1_N$. As pointed out by Aubin, the gradient
measures the roles of the players as pivots in game $v$.
Moreover, the operator $\sigma$ given by (9.12) can be considered
as a generalized Shapley value on the class of games $G_N^1$
(cf. Theorem 9.1): Aubin proved [9.14, Chapter 13.4] that the
operator defined by (9.12) satisfies

$$\langle 1_N, \sigma(v) \rangle = v(1_N)$$

for every game $v \in G_N^1$ and

$$\sigma_i(\pi v) = \sigma_{\pi(i)}(v)$$

for every player $i \in N$ and every permutation $\pi$ of $N$.
Moreover, $\sigma$ fulfills a certain variant of the dummy
player property.

When defining a value on games with fuzzy coali- 
tions, many other authors [9.15,20] proceed in the 
following way: a classical cooperative game is extended 
from the set of all crisp coalitions to the set of all fuzzy 
coalitions. The main issue is to decide on the nature of 
this extension procedure and to check that the extended 
game with fuzzy coalitions inherits all or at least some 
properties of the function that is extended (such as su- 
peradditivity or convexity). Clearly, there are as many 
choices for the extension as there are possible interpo- 
lations of a real function on {0, 1}” to the cube [0, 1]”. 

Tsurumi et al. [9.20] used the Choquet integral as an extension.
Specifically, for every $a \in I^n$, let
$V_a = \{ a_i \mid a_i > 0,\ i \in N \}$ and let $n_a = |V_a|$.
Without loss of generality, we may assume that the elements of
$V_a$ are ordered and write them as $b_1 < \cdots < b_{n_a}$.
Further, put $[a]_y = \{ i \in N \mid a_i \geq y \}$ for each
$a \in I^n$ and for each $y \in [0, 1]$.


Definition 9.8
A game with fuzzy coalitions $(I^n, v)$ is a game with Choquet
integral form whenever

$$v(a) = \sum_{i=1}^{n_a} \bar{v}([a]_{b_i})\,(b_i - b_{i-1}), \quad a \in I^n,$$

where $b_0 = 0$. Let $\Gamma_N^C$ be the class of all games with
Choquet integral form.
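The Choquet integral form of Definition 9.8 can be sketched in a few lines. The code below is illustrative only: the crisp game is a hypothetical three-player majority game, players are numbered 1 to $n$, and the function name `choquet_extension` is introduced here. The distinct positive membership degrees are processed in increasing order, and each level set is weighted by the increment of the membership degree.

```python
def majority(coalition):
    """Hypothetical crisp game: worth 1 for coalitions with at least two players."""
    return 1 if len(coalition) >= 2 else 0


def choquet_extension(crisp_game):
    """Extend a crisp game (a function on frozensets of players) to the cube."""
    def v(a):
        levels = sorted({ai for ai in a if ai > 0})   # b_1 < ... < b_{n_a}
        total, previous = 0.0, 0.0                    # previous plays the role of b_0 = 0
        for b in levels:
            # Level set [a]_b: players whose membership degree is at least b.
            level_set = frozenset(
                i for i, ai in enumerate(a, start=1) if ai >= b
            )
            total += crisp_game(level_set) * (b - previous)
            previous = b
        return total
    return v


v = choquet_extension(majority)

# For a = (0.2, 0.5, 1.0) the level sets are {1, 2, 3}, {2, 3}, and {3}.
worth = v((0.2, 0.5, 1.0))
```

On a crisp coalition $1_A$ there is a single level $b_1 = 1$ with level set $A$, so the extension returns the worth of $A$ in the crisp game.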


In the above definition, the function $v$ is the so-called
Choquet integral [9.21] of $a$ with respect to the restriction
$\bar{v}$ of $v$ to all crisp coalitions. It was shown that every
game $v \in \Gamma_N^C$ is monotone whenever $\bar{v}$ is
monotone [9.20, Lemma 2] and that $v$ is a continuous function on
$I^n$ [9.20, Theorem 2]. The authors define a mapping

$$f \colon \Gamma_N^C \to \big( [0, \infty)^n \big)^{I^n},$$

which is called a Shapley function, by the following assignment:

$$f_i(v)(a) = \sum_{k=1}^{n_a} f_i^0(\bar{v})([a]_{b_k})\,(b_k - b_{k-1}), \quad i \in N,\ v \in \Gamma_N^C,\ a \in I^n,$$

where

$$f_i^0(\bar{v})(B) = \sum_{A \subseteq B,\; i \in A} \frac{(|A| - 1)!\,(|B| - |A|)!}{|B|!}\, \big( \bar{v}(A) - \bar{v}(A \setminus \{i\}) \big), \quad B \subseteq N,$$

whenever $i \in B$, and $f_i^0(\bar{v})(B) = 0$ otherwise. Observe
that $f_i(v)(a)$ is the Choquet integral of $a$ with respect to
$f_i^0(\bar{v})$ and that $f_i^0(\bar{v})(B)$ is the Shapley value
of $\bar{v}$ with the grand coalition $N$ replaced with the
coalition $B$. Before we show that the Shapley function has some
expected properties, we prepare the following definitions. Let
$a \in I^n$ and $i, j \in N$. For each $b \in I^n$ with
$b \leq a$, define a vector $b_{ij}^a$ whose coordinates are

$$(b_{ij}^a)_k = \begin{cases} b_j \wedge a_i & k = i, \\ b_i \wedge a_j & k = j, \\ b_k & \text{otherwise}, \end{cases} \quad k \in N.$$

For an arbitrary $b \in I^n$, put

$$(b_{ij}[a])_k = \begin{cases} b_j & k = i, \\ b_i & k = j, \\ b_k & \text{otherwise}, \end{cases} \quad k \in N.$$

Clearly, we have both $b_{ij}^a \leq a$ and
$b_{ij}[a] \leq a$. The following theorem is proved in [9.20].


Theorem 9.7 
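The operator $f \colon \Gamma_N^C \to ([0, \infty)^n)^{I^n}$ has
the following properties:


1. If $v \in \Gamma_N^C$ and $a \in I^n$, then

$$\sum_{i \in N} f_i(v)(a) = v(a) \quad \text{and} \quad f_j(v)(a) = 0$$

for every $j \in N$ such that $a_j = 0$.


2. If $v \in \Gamma_N^C$, $a \in I^n$, and $b \in I^n$ is such
that $v(b \wedge c) = v(c)$ for every $c \in I^n$ with
$c \leq a$, then

$$f_i(v)(a) = f_i(v)(b) \quad \text{for every } i \in N.$$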
3. If $v \in \Gamma_N^C$, $a \in I^n$, $a_{ij}^a$ is such that
$v(a_{ij}^a \wedge c) = v(c)$ for every $c \in I^n$ with
$c \leq a$, and $v(b) = v(b_{ij}[a])$ for every $b \in I^n$ with
$b \leq a_{ij}^a$, then

$$f_i(v)(a) = f_j(v)(a).$$


4. If $v, w \in \Gamma_N^C$, then $v + w \in \Gamma_N^C$ and

$$f_i(v + w)(a) = f_i(v)(a) + f_i(w)(a)$$

for every $i \in N$ and every $a \in I^n$.


The previous theorem thus says that the Shapley function $f$ on
the class of games $\Gamma_N^C$ has properties analogous to those
of the Shapley value: efficiency, the carrier property, symmetry,
and additivity.

Butnariu and Kroupa [9.15] studied a value operator on the class
of fuzzy games $(I^n, v)$ satisfying

$$v(a) = \sum_{t \in [0,1]} \psi(t)\, \bar{v}(a^t), \quad a \in I^n,$$

where $\psi \colon [0, 1] \to \mathbb{R}$ fulfills

$$(\psi(t) = 0 \ \text{iff} \ t = 0) \quad \text{and} \quad \psi(1) = 1,$$

and

$$a^t = \{ i \in N \mid a_i = t \}.$$

The class of such fuzzy games is denoted by $\Gamma_N^\psi$. The
so-called Shapley mapping can be axiomatized on $\Gamma_N^\psi$
[9.15, Axioms 1-3]: it turns out that there is only one Shapley
mapping $\Phi \colon \Gamma_N^\psi \to (\mathbb{R}^n)^{I^n}$
[9.15, Theorem 1].


Theorem 9.8 
There exists a unique Shapley mapping
$\Phi \colon \Gamma_N^\psi \to (\mathbb{R}^n)^{I^n}$ and it is
given by the following formula:

$$\Phi_i(v)(a) = \begin{cases} \psi(r) \displaystyle\sum_{S \in P_i(a^r)} \frac{(|S| - 1)!\,(|a^r| - |S|)!}{|a^r|!}\, \big( \bar{v}(S) - \bar{v}(S \setminus \{i\}) \big), & \text{if } a_i = r > 0, \\[2ex] 0, & \text{otherwise}, \end{cases}$$

where

$$P_i(a^r) = \{ R \subseteq N \mid i \in R \ \text{and} \ R \subseteq a^r \}.$$

The expected total allocation of player $i \in N$ is then
obtained as

$$\hat{\Phi}_i(v) = \int_{I^n} \Phi_i(v)(a) \, da,$$

provided that the above Lebesgue integral exists. The operator
$\hat{\Phi} = (\hat{\Phi}_1, \ldots, \hat{\Phi}_n)$ is called the
cumulative value of $v$. If the weight function $\psi$ is bounded
and Lebesgue integrable, then [9.15, Theorem 2] shows that the
cumulative value is well defined and its coordinates are

$$\hat{\Phi}_i(v) = v(1_i) \int_0^1 \psi(t) \, dt$$

for each $i \in N$.

Owen's approach to the classical Shapley value [9.22] cannot be,
strictly speaking, classified as an attempt to define
a Shapley-style value on some class of games with fuzzy
coalitions, but we mention his construction for the sake of
completeness. The idea is to extend a game $v \in G_N$ with crisp
coalitions from its domain $\{ 1_A \mid A \subseteq N \}$ to the
whole unit cube $I^n$ by way of multilinear interpolation. The
resulting multilinear extension $\tilde{v}$ can be described
explicitly as the function

$$\tilde{v}(a) = \sum_{A \subseteq N} \Big( \prod_{i \in A} a_i \Big) \Big( \prod_{i \notin A} (1 - a_i) \Big) v(A), \quad a = (a_1, \ldots, a_n) \in I^n. \tag{9.13}$$
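The multilinear extension (9.13) is straightforward to evaluate for small games. The sketch below is an illustration with assumed names and data: the crisp game (worth 100 to {2, 3}, 200 to the grand coalition, 0 otherwise) is hypothetical, and `diagonal_shapley` approximates the integral in Owen's diagonal formula (9.14) by a midpoint rule. Because the extension is linear in each variable, the partial derivative with respect to the membership degree of player $i$ equals the difference of the extension at degree 1 and at degree 0 of that player.

```python
from itertools import combinations

N_PLAYERS = 3


def crisp_game(coalition):
    """Hypothetical crisp game on players 1, 2, 3."""
    members = frozenset(coalition)
    if members == frozenset({1, 2, 3}):
        return 200
    if members == frozenset({2, 3}):
        return 100
    return 0


def multilinear_extension(a):
    """Evaluate (9.13) at the fuzzy coalition a = (a_1, ..., a_n)."""
    players = range(1, N_PLAYERS + 1)
    total = 0.0
    for size in range(N_PLAYERS + 1):
        for combo in combinations(players, size):
            weight = 1.0
            for i in players:
                weight *= a[i - 1] if i in combo else 1.0 - a[i - 1]
            total += weight * crisp_game(combo)
    return total


def diagonal_shapley(i, steps=1000):
    """Midpoint-rule approximation of the integral of the i-th partial derivative
    of the extension along the diagonal (t, ..., t), 0 <= t <= 1."""
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) / steps
        upper = [t] * N_PLAYERS
        lower = [t] * N_PLAYERS
        upper[i - 1] = 1.0
        lower[i - 1] = 0.0
        # Linearity in a_i makes this difference the exact partial derivative.
        total += multilinear_extension(upper) - multilinear_extension(lower)
    return total / steps


approximate_shapley = [diagonal_shapley(i) for i in (1, 2, 3)]
```

For this game the approximation is close to (100/3, 250/3, 250/3), which is the Shapley value obtained from the combinatorial formula (9.2).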


Function $\tilde{v}$ is linear in each of its variables
separately and $v(A) = \tilde{v}(1_A)$ for each
$A \subseteq N$. The usual formula (9.2) for the Shapley value
$\varphi(v)$ of $v$ now takes the following diagonal form [9.22]:

$$\varphi_i(v) = \int_0^1 \frac{\partial \tilde{v}}{\partial a_i}(t, \ldots, t) \, dt. \tag{9.14}$$

Hence, $\varphi_i(v)$ is completely determined by the behavior of
the function $\tilde{v}$ in the neighborhood of the diagonal in
$I^n$. The formula (9.14) is important from the computational
point of view: its use in connection with statistical techniques
can enhance computations with the Shapley value; see
[9.23, Chap. XII.4] for further details.

Since the space of games with crisp coalitions is finite
dimensional, unlike the space of games with fuzzy coalitions,
there is no general approach to the Shapley value of fuzzy games.
Even a direct comparison of the cumulative value introduced above
with the Shapley function on the space of games $\Gamma_N^C$ of
Tsurumi et al. [9.20] is hardly possible since the domains of the
Shapley operators are essentially different. The selection of the
right space of games and an appropriate solution thus varies from
one application to another.


9.3 Final Remarks


We presented results concerning basic solution concepts for
coalitional games with fuzzy coalitions and finitely many players
in the case of transferable utility. We concentrated on those
solutions which occupy the main part of cooperative game theory
(the core and the Shapley value). A detailed discussion or even
a comprehensive overview of the current trends in fuzzy games is
beyond the reach of this chapter. Nevertheless, in this section
we mention current developments and briefly discuss other
solution concepts. The reader should always consult the relevant
reference for the specification of the concepts used by the cited
authors; for example, we can find at least two definitions of
a convex fuzzy game:


1. Azrieli and Lehrer [9.13] and [9.16] use the definition (9.9)
employed herein;

2. Tsurumi et al. [9.20] call a game with fuzzy coalitions $v$
convex whenever

$$v(a \vee b) + v(a \wedge b) \geq v(a) + v(b)$$

holds true for every $a, b \in I^n$.


Shellshear [9.24] employs the concavification of the fuzzy game,
that is, the strong superadditive cover, in order to show
[9.24, Theorem 4.4] that the strong superadditive cover has
a stable core if and only if the original game has a stable core.
Further, he investigates important properties of the
concavification and its superdifferential; new necessary and
sufficient conditions for core stability are given in
[9.23, Chap. XII.4].

Yang et al. [9.25] introduced the concept of bargaining sets for
games with fuzzy coalitions; they proved that the bargaining set
coincides with the Aubin core whenever a game is continuous and
convex. Liu and Liu [9.26] extended the results from [9.25] in
order to overcome some weaknesses of the previously used fuzzy
bargaining sets. The concept of the classical Mas-Colell
bargaining set was also generalized and the authors proved
existence theorems for such fuzzy bargaining sets. Moreover, both
the Aumann and Maschler and the Mas-Colell fuzzy bargaining sets
of a continuous convex cooperative fuzzy game coincide with its
Aubin core.

A fuzzy game is represented as a convex program in [9.27]. It is
shown that the optimum of the program determines the optimal
coalitions as well as the optimal rewards for the players.
Further, this framework seems to unify a number of existing
representations of solutions: the core, the least core, and the
nucleolus.

Wu [9.28] investigates various types of cores based 
on the dominance among payoff vectors and the con-

Janos C. Fodor, Imre J. Rudas


In this chapter we summarize basic knowledge on fuzzy logics and fuzzy
sets. After a short historical overview of ideas strongly connected to and
preceding the notion of fuzzy logics and fuzzy sets, we outline links
between many-valued and fuzzy logics. Then fuzzy subsets of a universe are
introduced. Interpretations of unary and binary connectives in fuzzy logics
as appropriate functions (operations) on the unit interval are central to
the approach. Fundamental knowledge on these function classes is then
presented, including results on triangular norms and conorms, as well as on
implications. Our concluding remarks suggest further reading, beyond the
basics.


In everyday life we use and process vague, imprecise linguistic terms like
young, hot, or around midnight. Classical mathematics is inadequate to
provide models that can express the complex semantics of such terms. Fuzzy
sets, introduced by Zadeh [10.1] on the basis of his observation that


more often than not, the classes of objects encountered in the real
physical world do not have precisely defined criteria of membership,


are appropriate for modeling the semantics of vague linguistic terms. Fuzzy
sets offer a framework to deal with predicates whose satisfaction is
a matter of degree.

Some forerunners discussed ideas or formal definitions for describing vague
predicates or classes with imprecise boundaries, very close to the basic
notions introduced by Zadeh [10.1]. We should mention Peirce [10.2],
Russell [10.3], Łukasiewicz [10.4], Black [10.5], Weyl [10.6], and Kaplan
and Schott [10.7]. The mathematician Karl Menger was the first (in 1951) to
use the term ensemble flou (the French counterpart of fuzzy set) in the
title of a paper in French [10.8]. In addition, Menger's work on
probabilistic metric spaces also led to the introduction of so-called
triangular norms and conorms, extensively studied by Schweizer and
Sklar [10.9], which later turned out to be basic operators for fuzzy
sets [10.10]. For more historical facts we refer to [10.11].

We want to emphasize that Zadeh’s motivations 
and background were quite different from those of the 
above-mentioned authors. He introduced the concept of 
a fuzzy set completely independently of their proposals 
in order to provide a tool for representing and reason- 
ing with the available information in a manner similar 
to the way humans express knowledge and summarize 
data. 

This chapter is organized as follows. In the next section we briefly recall
some notions from classical mathematics and its underlying two-valued
(Boolean) logic. We extend this material in Sect. 10.2, where we introduce
key terms related to fuzzy sets. Sect. 10.3 contains the core knowledge on
interpretations of connectives in fuzzy logic and fuzzy set-theoretic
operations. This includes fundamentals of negations, triangular norms and
conorms, together with the most important parametric families and
particular operations. Fuzzy implications are handled in a similar way.
Concluding remarks are given at the end, including several suggested
literature items for further reading.

Part B | Fuzzy Logic


10.1 Classical Mathematics and Logic 


Classical mathematics is based on two-valued logic, in which the set of
truth values consists of two elements: {0, 1}. There are two basic binary
operations, $\wedge$ (AND) and $\vee$ (OR), and the unary complement
$\neg$ (NOT). All other logical operations, e.g., the implication $\to$,
the logical equivalence $\leftrightarrow$, and the exclusive or XOR, can be
constructed from the three basic operations $\wedge$, $\vee$, $\neg$.

A proposition is either an atomic propositional variable
$p_1, p_2, \dots$, or a compound expression $(p \wedge q)$, $(p \vee q)$,
or $\neg p$, where p and q are propositions. A proposition is either true
(with truth value 1) or false (with truth value 0), but not both.

A set A is a collection of objects in a given universe X, where each
possible object x from X either belongs to the set A (in symbols:
$x \in A$) or does not ($x \notin A$). A set A is a subset of B if all
objects in A are in B as well (in symbols: $A \subseteq B$). We write
$A \subset B$ if $A \subseteq B$ and there is at least one element in B
which is not in A. The set of all subsets of X is denoted by
$\mathcal{P}(X)$. The empty set, which does not contain any object, is
denoted by $\emptyset$.


We consider three fundamental operations on sets. The intersection of two
sets A and B, denoted by $A \cap B$, is the set of objects from X which
belong to both A and B. The union of two sets A and B, denoted by
$A \cup B$, is the set of objects from X which belong to at least one of
the sets A and B. The complement of a set A, denoted by $A^c$, is the set
of objects from X which do not belong to A.

A function $\chi_A : X \to \{0, 1\}$ is called the characteristic function
of the set A if

$$\chi_A(x) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A \end{cases}$$

The characteristic function discriminates between members and nonmembers of
the set A. With the help of characteristic functions, set operations can be
expressed as follows

$$\chi_{A \cap B}(x) = \chi_A(x) \wedge \chi_B(x),$$
$$\chi_{A \cup B}(x) = \chi_A(x) \vee \chi_B(x),$$
$$\chi_{A^c}(x) = \neg \chi_A(x).$$
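As a small illustration of these identities, the following sketch recovers intersection, union and complement from characteristic functions. The universe `X` and the sets `A` and `B` are hypothetical; on the truth values {0, 1} the operations AND, OR and NOT are realized here as `min`, `max` and `1 - v`.

```python
X = range(10)          # hypothetical finite universe
A = {1, 2, 3, 4}
B = {3, 4, 5}

def chi(s):
    """Characteristic function of the crisp set s."""
    return lambda x: 1 if x in s else 0

chi_A, chi_B = chi(A), chi(B)

# On {0, 1}: AND is min, OR is max, NOT is 1 - v.
intersection = {x for x in X if min(chi_A(x), chi_B(x)) == 1}
union = {x for x in X if max(chi_A(x), chi_B(x)) == 1}
complement = {x for x in X if 1 - chi_A(x) == 1}
```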


10.2 Fuzzy Logic, Membership Functions, and Fuzzy Sets 


The idea behind fuzzy logic is to replace the set of truth values {0, 1} by
the entire unit interval [0, 1]. Then a fuzzy set on a universe X is
represented by a function which maps each element $x \in X$ to a degree of
membership from the unit interval [0, 1]. Larger values indicate higher
degrees of membership.

For several decades, many-valued logic was considered a purely mathematical
topic. The introduction of fuzzy sets [10.1] gave new impetus to the
investigation of many-valued logics. Informally speaking, fuzzy logic is
understood as an extension of many-valued logics, with the ultimate goal of
providing foundations for approximate reasoning with imprecise propositions
using fuzzy set theory as the principal tool.

A many-valued propositional logic in which the class of truth values is
modelled by the unit interval [0, 1], and which forms an extension of the
classical Boolean logic, i.e., the two-valued logic with truth values
{0, 1}, is quite often called a fuzzy logic [10.10]. For the sake of
simplicity, it is assumed that all fuzzy logics have the same syntax; they
may differ only in their semantics.


A membership function $\mu_A$ is a mapping from the universal set X to the
unit interval, i.e., $\mu_A : X \to [0, 1]$. Membership functions are
direct generalizations of characteristic functions. In a logical setting,
the degree of membership $\mu_A(x)$ can also be seen as the truth value of
the statement "x is an element of A".

Notice that a membership grade can have three meanings:


• Degree of similarity. The membership grade $\mu_A(x)$ represents the
  degree of proximity of x to prototype elements of A.

• Degree of preference. A represents a set of more or less preferred
  objects, and $\mu_A(x)$ represents an intensity of preference in favor of
  object x.

• Degree of uncertainty. The degree $\mu_A(x)$ can be viewed as the degree
  of plausibility that a parameter p has value x, given that all that is
  known about it is that p is A.


These three semantics of fuzzy sets appear in the works of Zadeh and he was
the first to propose each of them.

A fuzzy set on X (or a fuzzy subset of X) is defined as the collection of
the ordered pairs of elements of X and their membership grades.
Practically, a fuzzy set A on X is identified with the membership function
$\mu_A$. The family of all fuzzy subsets of X is denoted by
$\mathcal{F}(X)$. Classical subsets of X are special fuzzy subsets of X,
and are called crisp sets. Note that membership grades may be taken not
only from the unit interval but also from a (partially or completely)
ordered set.

Given two fuzzy subsets A and B of X, we say that A is equal to B (in
symbols $A = B$) if $\mu_A = \mu_B$, and that A is a subset of B (in
symbols $A \subseteq B$) if $\mu_A \le \mu_B$.
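The identification of a fuzzy set with its membership function translates directly into code. The sketch below is illustrative only: the universe of ages, the piecewise-linear shape of `young`, and the second set `very_young` (obtained here by squaring the grades) are assumptions and are not taken from the text. Inclusion is tested pointwise, and a crisp set appears as the special case of a {0, 1}-valued membership function.

```python
from fractions import Fraction

X = range(0, 101, 10)      # hypothetical universe: ages 0, 10, ..., 100

def young(age):
    """Hypothetical membership function: 1 up to age 20, 0 from age 40 on."""
    if age <= 20:
        return Fraction(1)
    if age >= 40:
        return Fraction(0)
    return Fraction(40 - age, 20)

def very_young(age):
    """A second, smaller fuzzy set (illustrative choice: squared grades)."""
    return young(age) ** 2

def crisp_teen(age):
    """A crisp set viewed as a fuzzy set with grades in {0, 1}."""
    return Fraction(1 if 10 <= age <= 20 else 0)

def is_subset(mu_a, mu_b, universe):
    """A is a subset of B iff mu_A(x) <= mu_B(x) for every x."""
    return all(mu_a(x) <= mu_b(x) for x in universe)

def is_equal(mu_a, mu_b, universe):
    """A = B iff the membership functions coincide."""
    return all(mu_a(x) == mu_b(x) for x in universe)
```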


As emphasized in [10.11], membership functions express a vertical view of
fuzzy sets. Another view is to consider a fuzzy set as a nested family of
classical sets, by using the notion of $\alpha$-cuts. For any
$\alpha \in [0, 1]$ we can introduce the $\alpha$-cut $A_\alpha$ of a fuzzy
set A. By definition, $A_\alpha$ is the crisp subset of X that contains all
the elements of X that have a membership grade greater than or equal to the
specified value $\alpha$. More formally,
$A_\alpha = \{x \in X \mid \mu_A(x) \ge \alpha\}$, $\alpha \in [0, 1]$.

In the sequel, membership functions and fuzzy sets will be denoted by the
same symbol: we write simply $A(x)$ instead of $\mu_A(x)$ for
$A \in \mathcal{F}(X)$ and $x \in X$.


10.3 Connectives in Fuzzy Logic


In order to generalize the classical set-theoretical operations like
intersection, union and complement, it is quite natural to use
interpretations of the logic connectives $\wedge$, $\vee$ and $\neg$,
respectively. Indeed, the values $(A \cap B)(x)$, $(A \cup B)(x)$ and
$A^c(x)$ describe the truth values of the statements "x is an element of A
AND x is an element of B", "x is an element of A OR x is an element of B",
and "x is NOT an element of A", respectively.

We introduce appropriate classes of functions $N : [0, 1] \to [0, 1]$,
$T : [0, 1]^2 \to [0, 1]$ and $S : [0, 1]^2 \to [0, 1]$ in order to
interpret the logic operations $\neg$, $\wedge$ and $\vee$, respectively,
on the evaluation set [0, 1]. In addition, fuzzy implications are
introduced later on.

Then the complement of a fuzzy set A and the intersection and union of
fuzzy sets A, B are specified by the functions N, T and S, respectively,
such that

$$A^c_N(x) = N(A(x)),$$
$$(A \cap_T B)(x) = T(A(x), B(x)),$$
$$(A \cup_S B)(x) = S(A(x), B(x)), \tag{10.1}$$

where $x \in X$ and $A, B \in \mathcal{F}(X)$. Therefore, desired
properties of fuzzy set-theoretic (or equivalently, logic) operations can
be obtained through the corresponding properties of the above functions N,
T and S.


10.3.1 Negations


Starting with the negation $\neg$, it is clear that its interpretation
should map 1 to 0 and 0 to 1, in order to be an extension of the
interpretation of the classical two-valued negation. Another natural
property is that the interpretation of the negation $\neg$ be
a non-increasing function. To simplify notation, and since no confusion is
possible, an interpretation of the negation $\neg$ will also be called
a negation.


Definition 10.1 

A decreasing function $N : [0, 1] \to [0, 1]$ with $N(0) = 1$, $N(1) = 0$
is called a negation. A strictly decreasing continuous negation is called
a strict negation. A strict negation N is said to be a strong negation if N
is also involutive: $N(N(x)) = x$ holds for all $x \in [0, 1]$.


Since a strict negation N is a strictly decreasing and continuous function,
its inverse $N^{-1}$ is also a strict negation, generally different from N.
Obviously, we have $N^{-1} = N$ if and only if N is involutive, that is,
$N(N(x)) = x$ holds for all $x \in [0, 1]$. This means that the graph of
the function N is symmetric with respect to the line
$\{(x, y) \mid x = y\}$.

Another important property of a strict negation N is that there exists
a unique value $0 < \nu < 1$ such that $N(\nu) = \nu$. Then we also have
$N^{-1}(\nu) = \nu$.
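These properties can be observed on a concrete strict negation. The sketch below uses $N(x) = 1 - x^2$ as an illustrative choice (it is strictly decreasing and continuous, hence strict, but not involutive) and locates its fixed point by bisection; the function names and the tolerance are assumptions made for the example.

```python
import math

def n_strict(x):
    """Illustrative strict negation N(x) = 1 - x^2 (not involutive)."""
    return 1.0 - x * x

def n_strict_inverse(y):
    """Inverse of n_strict on [0, 1]; itself a strict negation."""
    return math.sqrt(1.0 - y)

def fixed_point(negation, tol=1e-12):
    """Bisection for the unique nu in ]0, 1[ with negation(nu) == nu.

    Works because negation(x) - x is strictly decreasing, positive at 0
    and negative at 1.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if negation(mid) > mid:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

nu = fixed_point(n_strict)   # solves nu^2 + nu - 1 = 0
```

For this particular N the fixed point is the positive root of $\nu^2 + \nu - 1 = 0$, and the inverse negation shares it, as stated above.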

A negation which is neither strong nor strict is the Gödel (or
intuitionistic) negation given by

$$N_G(x) = \begin{cases} 1 & \text{if } x = 0 \\ 0 & \text{if } x \in \,]0, 1] \end{cases}$$

By duality, we can define the dual Gödel negation as follows

$$N_{dG}(x) = \begin{cases} 1 & \text{if } x \in [0, 1[ \\ 0 & \text{if } x = 1 \end{cases}$$

It is easy to see that for any negation N we have

$$N_G \le N \le N_{dG}.$$

A strict but not strong negation can be given by $N(x) = 1 - x^2$.

A parametric family of strong negations is defined as follows (see [10.12],
where it appears under the name $\lambda$-complement)

$$N_\lambda(x) = \frac{1 - x}{1 + \lambda x}, \quad \lambda > -1.$$


The standard negation $N_s$ is defined simply as $N_s(x) = 1 - x$,
$x \in [0, 1]$. This is the most frequently used negation, which is
obviously a strong negation. It plays a key role in the representation of
strong negations presented in the following theorem. In this chapter we
call a continuous, strictly increasing function
$\varphi : [0, 1] \to [0, 1]$ with $\varphi(0) = 0$, $\varphi(1) = 1$ an
automorphism of the unit interval.


Theorem 10.1

A function $N : [0, 1] \to [0, 1]$ is a strong negation if and only if
there exists an automorphism $\varphi$ of the unit interval such
that [10.13]

$$N(x) = \varphi^{-1}(1 - \varphi(x)), \quad x \in [0, 1]. \tag{10.2}$$

In this case $N_\varphi$ denotes N in (10.2) and is called
a $\varphi$-transform of the standard negation. If the complement of fuzzy
sets on X is defined by $N_\varphi$, we use the short notation
$A^c_\varphi$ for $A \in \mathcal{F}(X)$, instead of writing
$A^c_{N_\varphi}$.
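The representation (10.2) is easy to try out numerically. The following sketch builds the transform of the standard negation for the automorphism $\varphi(x) = x^2$ (an arbitrary illustrative choice) and checks involutivity on a grid. It also includes the $\lambda$-complement, assuming its usual form $N_\lambda(x) = (1 - x)/(1 + \lambda x)$ with $\lambda > -1$; the helper names and the tolerance are hypothetical.

```python
import math

def phi_transform_negation(phi, phi_inv):
    """N_phi(x) = phi^{-1}(1 - phi(x)), as in (10.2)."""
    return lambda x: phi_inv(1.0 - phi(x))

# Illustrative automorphism phi(x) = x^2, giving N(x) = sqrt(1 - x^2).
n_phi = phi_transform_negation(lambda x: x * x, math.sqrt)

def sugeno(lam):
    """lambda-complement, assuming the usual form (1 - x) / (1 + lam * x)."""
    return lambda x: (1.0 - x) / (1.0 + lam * x)

GRID = [k / 100 for k in range(101)]

def is_involutive(negation, tol=1e-9):
    """Check N(N(x)) = x on a finite grid (a numerical sanity check only)."""
    return all(abs(negation(negation(x)) - x) < tol for x in GRID)
```

A grid check of this kind cannot prove involutivity, but it quickly exposes a candidate that is strict without being strong.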


10.3.2 Triangular Norms and Conorms 


It is assumed that the conjunction $\wedge$, which is always in the tuple
of connectives, is interpreted by a t-norm, which, in a canonical way, is
a generalization of the interpretation of the conjunction in Boolean logic.
In a logical sense, a t-conorm is an ideal candidate for the interpretation
of the disjunction $\vee$, since it is a canonical extension of the
interpretation of the two-valued disjunction. This is clear from the
following definition.


Definition 10.2

A triangular norm (shortly: a t-norm) is a function
$T : [0, 1]^2 \to [0, 1]$ which is associative, commutative and increasing,
and satisfies the boundary condition $T(1, x) = x$ for all $x \in [0, 1]$.

A triangular conorm (shortly: a t-conorm) is a function
$S : [0, 1]^2 \to [0, 1]$ which is associative, commutative and increasing,
with boundary condition $S(0, x) = x$ for all $x \in [0, 1]$.


The class of t-norms (with slightly different axioms) was introduced in the
theory of statistical (probabilistic) metric spaces as a tool for
generalizing the classical triangular inequality by Menger [10.14] (see
also Schweizer and Sklar [10.9], Alsina et al. [10.15], and Klement
et al. [10.10]).

Notice that continuity of a t-norm and a t-conorm is not taken for granted.
Moreover, the conditions in Definition 10.2 do not even imply that all
t-norms, as two-place functions, are measurable (see [10.16] for
a counter-example).

However, the definition implies the following properties

$$T(x, y) \le \min(x, y), \quad S(x, y) \ge \max(x, y) \quad (x, y \in [0, 1]),$$

and

$$T(0, y) = 0, \quad S(1, y) = 1 \quad \text{for all } y \in [0, 1].$$

The smallest t-norm is the drastic product $T_D$ given by

$$T_D(x, y) = \begin{cases} 0 & \text{if } (x, y) \in [0, 1[^2 \\ \min(x, y) & \text{otherwise.} \end{cases}$$

The greatest (and the only idempotent) t-norm is obviously $T_M = \min$,
the minimum t-norm. Thus, for any t-norm T we have

$$T_D \le T \le T_M.$$

The smallest and greatest t-norms ($T_D$ and $T_M$), together with the
product t-norm $T_P(x, y) = xy$ and the Łukasiewicz t-norm $T_L$ given by

$$T_L(x, y) = \max(0, x + y - 1),$$

are called basic t-norms.

The first known left-continuous and not continuous t-norm is the nilpotent
minimum [10.17] defined by

$$T_{nM}(x, y) = \begin{cases} 0 & \text{if } x + y \le 1 \\ \min(x, y) & \text{otherwise.} \end{cases}$$
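The t-norms introduced so far are short enough to write down and test directly. The sketch below is a finite sanity check, not a proof: it verifies the axioms of Definition 10.2 only on a grid of rational points (exact arithmetic via `Fraction`), and the function and variable names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def t_min(x, y):
    return min(x, y)

def t_prod(x, y):
    return x * y

def t_luk(x, y):
    return max(0, x + y - 1)

def t_drastic(x, y):
    # 0 on [0, 1[^2, min(x, y) as soon as one argument equals 1
    return min(x, y) if max(x, y) == 1 else 0

def t_nilpotent_min(x, y):
    return 0 if x + y <= 1 else min(x, y)

GRID = [Fraction(k, 10) for k in range(11)]

def is_t_norm_on_grid(t):
    """Check commutativity, boundary condition, monotonicity, associativity."""
    pairs = list(product(GRID, repeat=2))
    commutative = all(t(x, y) == t(y, x) for x, y in pairs)
    boundary = all(t(1, x) == x for x in GRID)
    monotone = all(t(x, y) <= t(x, z)
                   for x, y in pairs for z in GRID if y <= z)
    associative = all(t(t(x, y), z) == t(x, t(y, z))
                      for x, y in pairs for z in GRID)
    return commutative and boundary and monotone and associative
```

On the same grid one can also observe the chain drastic product ≤ Łukasiewicz ≤ product ≤ minimum, which refines the bounds stated above for the four basic t-norms.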


A t-conorm S is called the dual to the t-norm T if
$S(x, y) = 1 - T(1 - x, 1 - y)$ holds for all $x, y \in [0, 1]$.

The t-conorm $S_M = \max$ is the smallest t-conorm and it is dual to the
greatest t-norm min. The dual to the drastic product is the t-conorm $S_D$
given by

$$S_D(x, y) = \begin{cases} 1 & \text{if } (x, y) \in \,]0, 1]^2 \\ \max(x, y) & \text{otherwise.} \end{cases}$$

For each t-conorm S, we have $S_M \le S \le S_D$.

The dual t-conorm to the product $T_P$ is called the probabilistic sum and
it is denoted by $S_P$, with $S_P(x, y) = x + y - xy$.

The Łukasiewicz t-conorm $S_L$, also called the bounded sum, is given by
$S_L(x, y) = \min(1, x + y)$.

From an algebraic point of view, a function $T : [0, 1]^2 \to [0, 1]$ is
a t-norm if and only if $([0, 1], T, \le)$ is a fully ordered commutative
semigroup with neutral element 1 and annihilator 0. Similarly, a function
$S : [0, 1]^2 \to [0, 1]$ is a t-conorm if and only if $([0, 1], S, \le)$
is a fully ordered commutative semigroup with neutral element 0 and
annihilator 1.

Clearly, for every t-norm T and strict negation N, the operation S defined
by

$$S(x, y) = N^{-1}(T(N(x), N(y))), \quad x, y \in [0, 1] \tag{10.3}$$

is a t-conorm. In addition, if N is a strong negation then $N^{-1} = N$,
and we have for $x, y \in [0, 1]$ that $T(x, y) = N(S(N(x), N(y)))$. In
this case S and T are called N-duals. In the case of the standard negation
(i.e., when $N = N_s$) we simply speak about duals. Obviously, equality
(10.3) expresses De Morgan's law.
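Equation (10.3) is itself a recipe for constructing t-conorms. The following sketch applies it with the standard negation and, as a second illustrative choice, with the strong negation $N(x) = (1 - x)/(1 + x)$; exact rational arithmetic is used so that the identities can be compared with `==`. All names are hypothetical.

```python
from fractions import Fraction

def n_standard(x):
    return 1 - x

def n_sugeno1(x):
    """Illustrative strong negation N(x) = (1 - x) / (1 + x)."""
    return (1 - x) / (1 + x)

def dual(t, n=n_standard, n_inv=n_standard):
    """S(x, y) = N^{-1}(T(N(x), N(y))), as in (10.3)."""
    return lambda x, y: n_inv(t(n(x), n(y)))

def t_prod(x, y):
    return x * y

def t_luk(x, y):
    return max(0, x + y - 1)

s_max = dual(min)             # expected: max(x, y)
s_prob = dual(t_prod)         # expected: x + y - x*y
s_luk = dual(t_luk)           # expected: min(1, x + y)
s_prod_sugeno = dual(t_prod, n_sugeno1, n_sugeno1)

GRID = [Fraction(k, 10) for k in range(11)]
```

Because `n_sugeno1` is involutive, it serves as its own inverse in the call to `dual`, and the pair (`t_prod`, `s_prod_sugeno`) satisfies the De Morgan identity stated above.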


Definition 10.3 

A triplet (T, S, N) is called a De Morgan triplet if and only if T is
a t-norm, S is a t-conorm, N is a strong negation and they satisfy (10.3).


It is worth noting that, given a De Morgan triplet (T, S, N), the tuple
$([0, 1], T, S, N, 0, 1)$ can never be a Boolean algebra [10.18]: in order
to satisfy distributivity we must have $T = \min$ and $S = \max$, in which
case it is impossible to have both $T(x, N(x)) = 0$ and $S(x, N(x)) = 1$
for all $x \in [0, 1]$. Depending on the operations used, one can, however,
obtain rather general and useful structures such as, for instance,
De Morgan algebras, residuated lattices, l-monoids, Girard algebras, and
MV-algebras, see [10.10].

There are several examples of De Morgan triplets. We list in Table 10.1
those that are related to the examples above. Let $\varphi$ be an
automorphism of the unit interval, $N_\varphi$ the corresponding strong
negation, and $x, y \in [0, 1]$.

A function $K : [0, 1]^2 \to [0, 1]$ will often be called a binary
operation on [0, 1]. For an automorphism $\varphi$ of [0, 1], the
$\varphi$-transform $K_\varphi$ of such a K is defined by

$$K_\varphi(x, y) = \varphi^{-1}(K(\varphi(x), \varphi(y))), \quad x, y \in [0, 1].$$

Thus, Table 10.1 contains $\varphi$-transforms of some fundamental t-norms
and t-conorms.


Continuous Archimedean t-Norms and t-Conorms
A broad class of problems consists of the representation of multi-place
functions in general by composition of simpler functions and functions of
fewer variables (see Ling [10.19] for a brief survey), such as

$$K(x, y) = g(f(x) + f(y)),$$

where K is a two-place function and f, g are real functions. In that
general framework, the representation of (two-place) associative functions
by appropriate one-place functions is a particular problem. It was Abel who
first obtained such a representation in 1826 [10.20], by assuming also
commutativity, strict monotonicity and differentiability. Since Abel's
result, many contributions have been made to representations of associative
functions (and generally speaking, of abstract semigroups).

For any $x \in [0, 1]$, any $n \in \mathbb{N}$, and any associative binary
operation K on [0, 1], denote by $x_K^{(n)}$ the n-th power of x defined by

$$x_K^{(1)} = x \quad \text{and} \quad x_K^{(n)} = K(\underbrace{x, \dots, x}_{n\text{-times}}) \quad \text{for } n \ge 2.$$


Definition 10.4 
A t-norm T (resp. a t-conorm S) is said to be:


a) Continuous if T (resp. S) as a function is continuous on the unit
   interval;
b) Archimedean if for each $(x, y) \in \,]0, 1[^2$ there is an
   $n \in \mathbb{N}$ such that $x_T^{(n)} < y$ (resp. $x_S^{(n)} > y$).


Note that the definition of the Archimedean property is borrowed from the
theory of semigroups.
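The Archimedean property can be explored by simply iterating the operation. In the sketch below, `power` computes the n-th power of x under an associative operation, and `archimedean_witness` searches for the smallest n whose power falls strictly below y. The search bound `max_n` is an assumption of the example: failing to find a witness within the bound does not prove that none exists.

```python
from fractions import Fraction

def power(k, x, n):
    """n-th power of x under the associative binary operation k (n >= 1)."""
    result = x
    for _ in range(n - 1):
        result = k(result, x)
    return result

def archimedean_witness(t, x, y, max_n=1000):
    """Smallest n <= max_n whose n-th power of x under t is < y, else None."""
    current = x
    for n in range(1, max_n + 1):
        if current < y:
            return n
        current = t(current, x)
    return None

def t_prod(x, y):
    return x * y

def t_luk(x, y):
    return max(Fraction(0), x + y - 1)

x, y = Fraction(9, 10), Fraction(1, 2)
n_prod = archimedean_witness(t_prod, x, y)   # powers 0.9, 0.81, 0.729, ...
n_luk = archimedean_witness(t_luk, x, y)     # powers 0.9, 0.8, 0.7, ...
n_min = archimedean_witness(min, x, y)       # min(x, x) = x never drops
```

The minimum is idempotent, so its powers never decrease; this is why it is continuous but not Archimedean.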


Table 10.1 Some $N_\varphi$-dual triangular norms and conorms

$$\begin{array}{ll}
T & S \\
\hline
\min(x, y) & \max(x, y) \\
\varphi^{-1}(\varphi(x)\varphi(y)) & \varphi^{-1}(\varphi(x) + \varphi(y) - \varphi(x)\varphi(y)) \\
\varphi^{-1}(\max(\varphi(x) + \varphi(y) - 1, 0)) & \varphi^{-1}(\min(\varphi(x) + \varphi(y), 1)) \\
\begin{cases} 0 & \text{if } \varphi(x) + \varphi(y) \le 1 \\ \min(x, y) & \text{otherwise} \end{cases} &
\begin{cases} \max(x, y) & \text{if } \varphi(x) + \varphi(y) < 1 \\ 1 & \text{otherwise} \end{cases}
\end{array}$$

We state here the representation theorem of continuous Archimedean t-norms
and t-conorms, very often attributed to Ling [10.19]. In fact, her main
theorem can be deduced from previously known results on topological
semigroups, see [10.21–23]. Nevertheless, the advantage of Ling's approach
is twofold: it treats two different cases in a unified manner and it
establishes elementary proofs.


Theorem 10.2 
A t-norm T is continuous and Archimedean if and only if there exists
a strictly decreasing and continuous function $t : [0, 1] \to [0, \infty]$
with $t(1) = 0$ such that

$$T(x, y) = t^{(-1)}(t(x) + t(y)) \quad (x, y \in [0, 1]), \tag{10.4}$$

where $t^{(-1)}$ is the pseudoinverse of t defined by

$$t^{(-1)}(x) = \begin{cases} t^{-1}(x) & \text{if } x \le t(0) \\ 0 & \text{otherwise.} \end{cases}$$

Moreover, representation (10.4) is unique up to a positive multiplicative
constant [10.19].


We say that T is generated by t if T has representation (10.4). In this
case t is said to be an additive generator of T.
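Representation (10.4) can be turned into a small constructor. The sketch below is a floating-point illustration under stated assumptions: the caller supplies the generator, its ordinary inverse, and the value t(0) (possibly infinite), and the helper names are hypothetical. The generator $t(x) = 1 - x$ yields the Łukasiewicz t-norm and $t(x) = -\log x$ yields the product.

```python
import math

def t_norm_from_generator(t, t_inv, t_at_zero):
    """T(x, y) = t^(-1)(t(x) + t(y)) with the pseudoinverse of (10.4)."""
    def pseudo_inverse(u):
        return t_inv(u) if u <= t_at_zero else 0.0
    return lambda x, y: pseudo_inverse(t(x) + t(y))

# t(x) = 1 - x with t(0) = 1: Lukasiewicz t-norm
t_luk = t_norm_from_generator(lambda x: 1.0 - x, lambda u: 1.0 - u, 1.0)

# Same generator multiplied by 2: representation is unique up to a constant
t_luk_scaled = t_norm_from_generator(
    lambda x: 2.0 * (1.0 - x), lambda u: 1.0 - u / 2.0, 2.0)

def neg_log(x):
    """t(x) = -log x, extended by t(0) = +infinity."""
    return math.inf if x == 0.0 else -math.log(x)

# t(x) = -log x with t(0) = +inf: product t-norm
t_prod = t_norm_from_generator(neg_log, lambda u: math.exp(-u), math.inf)

GRID = [k / 20 for k in range(21)]
```

The scaled generator produces the same t-norm as the unscaled one, which illustrates the uniqueness statement of the theorem.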


Theorem 10.3 
A t-conorm S is continuous and Archimedean if and only if there exists
a strictly increasing and continuous function $s : [0, 1] \to [0, \infty]$
with $s(0) = 0$ such that

$$S(x, y) = s^{(-1)}(s(x) + s(y)) \quad (x, y \in [0, 1]), \tag{10.5}$$

where $s^{(-1)}$ is the pseudoinverse of s defined by

$$s^{(-1)}(x) = \begin{cases} s^{-1}(x) & \text{if } x \le s(1) \\ 1 & \text{otherwise.} \end{cases}$$

Moreover, representation (10.5) is unique up to a positive multiplicative
constant [10.19].


We say that a continuous Archimedean t-conorm S is generated by s if S has
representation (10.5). In this case s is said to be an additive generator
of S.

Note that Aczél published the representation of strictly increasing,
continuous and associative two-place functions on open or half-open real
intervals [10.24–26]. This was the starting point that Ling generalized
in [10.19].


Definition 10.5 

We say that a t-norm T has zero divisors if there exist
$x, y \in \,]0, 1[$ such that $T(x, y) = 0$. T is said to be positive if
$x, y > 0$ imply $T(x, y) > 0$. A t-norm T or a t-conorm S is called strict
if it is a continuous and strictly increasing function in each place on
$]0, 1[^2$. T is called nilpotent if it is continuous and Archimedean with
zero divisors. Triangular conorms which are duals of nilpotent t-norms are
also called nilpotent.


The representation theorem of t-norms (resp. t-conorms) does not impose any
condition on the value of a generator function at 0 (resp. at 1). On the
basis of this value, one can classify continuous Archimedean t-norms (resp.
t-conorms), as stated in the following theorem.


Theorem 10.4

Let T be a continuous Archimedean t-norm with additive generator t, and S
be a continuous Archimedean t-conorm with additive generator s. Then:


a) T is nilpotent if and only if $t(0) < +\infty$;
b) T is strict if and only if $t(0) = \lim_{x \to 0} t(x) = +\infty$;
c) S is nilpotent if and only if $s(1) < +\infty$;
d) S is strict if and only if $s(1) = \lim_{x \to 1} s(x) = +\infty$.


Using the general representation theorem of continuous Archimedean t-norms,
we can give another form of representation for a class of continuous
t-norms with zero divisors. More exactly, we consider continuous t-norms T
such that $T(x, N(x)) = 0$ holds with a strict negation N for all
$x \in [0, 1]$. Such t-norms are Archimedean, as was proved in [10.27]. The
following theorem is established after [10.28].


Theorem 10.5

A continuous t-norm T is such that $T(x, N(x)) = 0$ holds for all
$x \in [0, 1]$ with a strict negation N if and only if there exists an
automorphism $\varphi$ of the unit interval such that for all
$x, y \in [0, 1]$ we have

$$T(x, y) = \varphi^{-1}(\max\{\varphi(x) + \varphi(y) - 1, 0\})$$

and

$$N(x) \le \varphi^{-1}(1 - \varphi(x)). \tag{10.6}$$

As a consequence, we obtain that any nilpotent t-norm is isomorphic to
(i.e., is a $\varphi$-transform of) the Łukasiewicz t-norm
$T_L(x, y) = \max(x + y - 1, 0)$. Similarly, any strict t-norm T is
isomorphic to the algebraic product: there is an automorphism $\varphi$ of
[0, 1] such that $T(x, y) = \varphi^{-1}(\varphi(x)\varphi(y))$ for all
$x, y \in [0, 1]$.

Similar statements can be proved for t-conorms, see [10.29] for more
details. For instance, any strict t-conorm is isomorphic to the
probabilistic sum $S_P$, and any nilpotent t-conorm is isomorphic to the
bounded sum $S_L$.
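The isomorphism statements can be made concrete by transforming the two prototypes. In the sketch below the automorphisms are arbitrary illustrative choices, $\varphi(x) = x^2$ and $\varphi(x) = \sin(\pi x / 2)$, and the helper `transform` is a hypothetical name. The first transform of the Łukasiewicz t-norm keeps zero divisors; the second transform of the product stays positive for positive arguments.

```python
import math

def transform(k, phi, phi_inv):
    """phi-transform: K_phi(x, y) = phi^{-1}(K(phi(x), phi(y)))."""
    return lambda x, y: phi_inv(k(phi(x), phi(y)))

def t_luk(x, y):
    return max(x + y - 1.0, 0.0)

def t_prod(x, y):
    return x * y

# Nilpotent example: phi(x) = x^2 gives T(x, y) = sqrt(max(x^2 + y^2 - 1, 0)).
t_nilpotent = transform(t_luk, lambda x: x * x, math.sqrt)

# Strict example: phi(x) = sin(pi x / 2), an automorphism of [0, 1].
t_strict = transform(
    t_prod,
    lambda x: math.sin(math.pi * x / 2),
    lambda u: 2 * math.asin(u) / math.pi,
)
```

Any other continuous, strictly increasing bijection of [0, 1] could be substituted for the two automorphisms used here.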


Continuous t-Norms and t-Conorms 
Suppose that $\{[\alpha_m, \beta_m]\}_{m \in M}$ is a countable family of
non-overlapping, closed, proper subintervals of [0, 1]. With each
$[\alpha_m, \beta_m]$ associate a continuous Archimedean t-norm $T_m$. Let
T be a function defined on $[0, 1]^2$ by

$$T(x, y) = \begin{cases}
\alpha_m + (\beta_m - \alpha_m)\, T_m\!\left(\dfrac{x - \alpha_m}{\beta_m - \alpha_m}, \dfrac{y - \alpha_m}{\beta_m - \alpha_m}\right) & \text{if } (x, y) \in [\alpha_m, \beta_m]^2 \\
\min(x, y) & \text{otherwise.}
\end{cases} \tag{10.7}$$

Then T is a continuous t-norm. In this case T is called the ordinal sum of
$\{([\alpha_m, \beta_m], T_m)\}_{m \in M}$ and each $T_m$ is called
a summand.

A similar construction works for t-conorms S: just replace $T_m$ with
a continuous Archimedean t-conorm $S_m$, and min with max in (10.7). The
function S thus defined is a continuous t-conorm, called the ordinal sum of
$\{([\alpha_m, \beta_m], S_m)\}_{m \in M}$, where each $S_m$ is called
a summand.
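The ordinal sum construction (10.7) can be sketched as a higher-order function. The example below is illustrative: the two intervals and their summands (product on [0, 1/2], Łukasiewicz on [1/2, 1]) are hypothetical choices, and exact rational arithmetic is used so that the results can be compared with `==`.

```python
from fractions import Fraction

def ordinal_sum(summands):
    """Ordinal sum of t-norms as in (10.7).

    summands: list of (a, b, t) with non-overlapping intervals [a, b], a < b.
    """
    def t_sum(x, y):
        for a, b, t in summands:
            if a <= x <= b and a <= y <= b:
                # rescale to [0, 1], apply the summand, rescale back
                return a + (b - a) * t((x - a) / (b - a), (y - a) / (b - a))
        return min(x, y)
    return t_sum

def t_prod(x, y):
    return x * y

def t_luk(x, y):
    return max(0, x + y - 1)

half = Fraction(1, 2)
T = ordinal_sum([(Fraction(0), half, t_prod), (half, Fraction(1), t_luk)])

GRID = [Fraction(k, 20) for k in range(21)]
```

Points where the two intervals meet are handled consistently: at (1/2, 1/2) both summands return 1/2.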

Assume now that T is a continuous t-norm. Then T is either the minimum, or
T is Archimedean, or there exists a family
$\{([\alpha_m, \beta_m], T_m)\}_{m \in M}$ with continuous Archimedean
summands $T_m$ such that T is equal to the ordinal sum of this family,
see [10.19, 22]. It has also been proved there that a continuous t-conorm S
is either the maximum, or Archimedean, or there exists a family
$\{([\alpha_m, \beta_m], S_m)\}_{m \in M}$ with continuous Archimedean
summands $S_m$ such that S is the ordinal sum of this family.


Parametric Families of Triangular Norms 
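We close this subsection by giving a taste of the wide variety of
parametric t-norm families. For a comprehensive list with detailed
properties, see [10.10].


Frank t-norms $\{T^F_\lambda\}_{\lambda \in [0, \infty]}$. Let
$\lambda > 0$, $\lambda \ne 1$ be a real number. Define a continuous
Archimedean t-norm $T^F_\lambda$ in the following way

$$T^F_\lambda(x, y) = \log_\lambda\left(1 + \frac{(\lambda^x - 1)(\lambda^y - 1)}{\lambda - 1}\right) \quad (x, y \in [0, 1]).$$

We can extend this definition to $\lambda = 0$, $\lambda = 1$ and
$\lambda = \infty$ by taking the appropriate limits. Thus we get $T^F_0$,
$T^F_1$ and $T^F_\infty$ as follows

$$T^F_0(x, y) = \lim_{\lambda \to 0} T^F_\lambda(x, y) = \min\{x, y\},$$
$$T^F_1(x, y) = \lim_{\lambda \to 1} T^F_\lambda(x, y) = xy,$$
$$T^F_\infty(x, y) = \lim_{\lambda \to \infty} T^F_\lambda(x, y) = \max\{x + y - 1, 0\}.$$

Each $T^F_\lambda$ is a strict t-norm for $\lambda \in \,]0, \infty[$. The
corresponding additive generators $t^F_\lambda$ are given by

$$t^F_\lambda(x) = \begin{cases} -\log x & \text{if } \lambda = 1 \\ -\log \dfrac{\lambda^x - 1}{\lambda - 1} & \text{if } \lambda \in \,]0, \infty[,\ \lambda \ne 1. \end{cases}$$

The family $\{T^F_\lambda\}_{\lambda \in [0, \infty]}$ is called the Frank
family of t-norms (see Frank [10.30]). Note that members of this family are
decreasing functions of the parameter $\lambda$ (see e.g. [10.31]).

The De Morgan law enables us to define the Frank family of t-conorms
$\{S^F_\lambda\}_{\lambda \in [0, \infty]}$ by

$$S^F_\lambda(x, y) = 1 - T^F_\lambda(1 - x, 1 - y)$$

for any $\lambda \in [0, \infty]$.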
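The Frank family and its dual can be evaluated directly from these formulas. The sketch below is a floating-point illustration with hypothetical function names; the three limiting cases are handled explicitly rather than numerically. It also makes it possible to observe, on a grid, the identity $T(x, y) + S(x, y) = x + y$ that Frank's functional equation (10.8) is built on.

```python
import math

def frank_t_norm(lam):
    """Frank t-norm for lam in [0, inf], with the limit cases spelled out."""
    if lam == 0:
        return min
    if lam == 1:
        return lambda x, y: x * y
    if lam == math.inf:
        return lambda x, y: max(x + y - 1.0, 0.0)

    def t(x, y):
        inner = 1.0 + (lam ** x - 1.0) * (lam ** y - 1.0) / (lam - 1.0)
        return math.log(inner, lam)
    return t

def frank_t_conorm(lam):
    """Dual t-conorm S(x, y) = 1 - T(1 - x, 1 - y)."""
    t = frank_t_norm(lam)
    return lambda x, y: 1.0 - t(1.0 - x, 1.0 - y)

GRID = [k / 10 for k in range(11)]
```

Evaluating the family at a fixed point for increasing values of the parameter shows the values decreasing from the minimum towards the Łukasiewicz t-norm.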
In [10.30] one can find the following interesting 
characterization of these parametric families. 


Theorem 10.6 
A t-norm T and a t-conorm S satisfy the functional equation

$$T(x, y) + S(x, y) = x + y \quad (x, y \in [0, 1]) \tag{10.8}$$

if and only if


a) there is a number $\lambda \in [0, \infty]$ such that $T = T^F_\lambda$
   and $S = S^F_\lambda$, or

b) T is representable as an ordinal sum of t-norms, each of which is
   a member of the family $\{T^F_\lambda\}$, $0 < \lambda \le \infty$, and
   S is obtained from T via (10.8) [10.30].


Hamacher t-norms $\{T^H_\lambda\}_{\lambda \in [0, \infty]}$. Let us define
three parameterized families of t-norms, t-conorms and strong negations,
respectively, as follows.

$$T^H_\lambda(x, y) = \frac{xy}{\lambda + (1 - \lambda)(x + y - xy)}, \quad \lambda \ge 0,$$
$$S^H_\beta(x, y) = \frac{x + y + (\beta - 1)xy}{1 + \beta xy}, \quad \beta \ge -1,$$
$$N_\gamma(x) = \frac{1 - x}{1 + \gamma x}, \quad \gamma > -1.$$

Hamacher proved the following characterization theorem [10.32].


Theorem 10.7
(T, S, N) is a De Morgan triplet such that

$$T(x, y) = T(x, z) \implies y = z,$$
$$S(x, y) = S(x, z) \implies y = z,$$
$$\forall z \le x \ \exists y, y' \text{ such that } T(x, y) = z,\ S(z, y') = x$$

and T and S are rational functions if and only if there are numbers
$\lambda \ge 0$, $\beta \ge -1$ and $\gamma > -1$ such that
$\lambda = \frac{1 + \beta}{1 + \gamma}$ and $T = T^H_\lambda$,
$S = S^H_\beta$ and $N = N_\gamma$.


Note that another characterization of the Hamacher family of t-norms with
positive parameter has been obtained in [10.33] as solutions of
a functional equation.


Dombi t-norms $\{T^D_\lambda\}_{\lambda \in [0, \infty]}$. The formula for
this t-norm family is given by

$$T^D_\lambda(x, y) = \begin{cases}
T_D(x, y) & \text{if } \lambda = 0 \\
T_M(x, y) & \text{if } \lambda = \infty \\
\dfrac{1}{1 + \left(\left(\frac{1 - x}{x}\right)^\lambda + \left(\frac{1 - y}{y}\right)^\lambda\right)^{1/\lambda}} & \text{if } \lambda \in \,]0, \infty[
\end{cases}$$

Essential properties of these t-norms and other well-known families can be
found in [10.31].


10.3.3 Fuzzy Implications


Turning to the interpretation of the implication $\to$ in fuzzy logics, it
becomes apparent that there are several logical formulae which, in the
Boolean two-valued logic, are equivalent to the implication, but give rise
to different interpretations when replacing {0, 1} by the unit interval
[0, 1]. As in the case of the negation, we call an interpretation of the
implication $\to$ also an implication (or sometimes, a fuzzy implication).
A comprehensive study of fuzzy implications can be found in the
book [10.34].

In a very broad sense, any function $I : [0, 1]^2 \to [0, 1]$ which is
decreasing in its first argument, increasing in its second argument, and
preserves the values of the crisp implication on {0, 1} is considered as
a fuzzy implication.


Definition 10.6 

A function $I : [0, 1]^2 \to [0, 1]$ is called a fuzzy implication if and
only if it satisfies the following conditions:

I1. $I(0, 0) = I(0, 1) = I(1, 1) = 1$; $I(1, 0) = 0$.

I2. If $x \le z$ then $I(x, y) \ge I(z, y)$ for all $y \in [0, 1]$.

I3. If $y \le t$ then $I(x, y) \le I(x, t)$ for all $x \in [0, 1]$.


The reason behind I1 is obvious, while a fuzzy implication is required to
be decreasing/increasing (i.e., I2 and I3 should be satisfied) because it
measures that the consequent is more true than the antecedent [10.35].

Clearly, a fuzzy implication I has the following properties (as
a consequence of the definition):


I4. $I(0, x) = 1$ for all $x \in [0, 1]$.
I5. $I(x, 1) = 1$ for all $x \in [0, 1]$.
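Conditions I1 to I3 are straightforward to test on a finite grid. The sketch below is a necessary-condition check only (passing it on a grid does not prove the properties on all of [0, 1]). The candidate used, the Łukasiewicz implication $I(x, y) = \min(1, 1 - x + y)$, is a standard example that does not appear in this passage, and the function names are hypothetical.

```python
from fractions import Fraction

GRID = [Fraction(k, 10) for k in range(11)]

def is_fuzzy_implication(i):
    """Check I1-I3 of Definition 10.6 on GRID."""
    boundary = (i(0, 0), i(0, 1), i(1, 1), i(1, 0)) == (1, 1, 1, 0)
    decreasing_first = all(
        i(x, y) >= i(z, y)
        for x in GRID for z in GRID if x <= z for y in GRID)
    increasing_second = all(
        i(x, y) <= i(x, t)
        for y in GRID for t in GRID if y <= t for x in GRID)
    return boundary and decreasing_first and increasing_second

def lukasiewicz_implication(x, y):
    return min(1, 1 - x + y)
```

A conjunction such as `min` fails already at I1, since it maps (0, 0) to 0 rather than 1.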


Note that originally we defined a fuzzy implication in a slightly different
form [10.29, Definition 1.15], which is equivalent to Definition 10.6.

Further properties may be required for a fuzzy implication that can be
important also in some applications:


I6. $I(1, x) = x$ for all $x \in [0, 1]$. [10.36]

I7. $I(x, I(y, z)) = I(y, I(x, z))$ for all $x, y, z \in [0, 1]$. [10.36]

I8. $x \le y$ if and only if $I(x, y) = 1$ for all $x, y \in [0, 1]$.
    [10.37]

I9. $N(x) = I(x, 0)$ is a strong negation ($x \in [0, 1]$).

I10. $I(x, y) \ge y$ for all $x, y \in [0, 1]$. [10.38]

I11. $I(x, x) = 1$ for all $x \in [0, 1]$. [10.39]

I12. $I(x, y) = I(N(y), N(x))$ with a strong negation N, for all
     $x, y \in [0, 1]$.

I13. I is a continuous function.


Property I6 expresses that a tautology cannot justify anything. Condition
I7 is called the exchange principle, and is based on the following
equivalence:


if $P_1$ then (if $P_2$ then $P_3$) $\iff$ if ($P_1$ AND $P_2$)
then $P_3$.

I8 expresses that implication defines an ordering, I9 reflects that
$P \to Q = \neg P$ if Q is false. I10 is the numerical counterpart of
$P \to (Q \to P)$. I11 is called the identity principle and it yields that
$P \to P$ is always true. I12, the contraposition law (or in other words,
the contrapositive symmetry), expresses a relationship between modus ponens
and modus tollens, see [10.35]. In general, this is a strong condition,
see [10.17]. I13 prevents implication from reacting in a chaotic way to
a small change of the truth value of either the antecedent or the
consequent. This is also a fairly restrictive condition.


Fuzzy Implications Defined by t-Norms, t-Conorms and Negations
To be consistent, implications and conjunctions (or implications and
disjunctions) cannot be studied independently. Thus, we introduce two
particular classes of fuzzy implications based on t-norms, t-conorms and
negations. These were identified in [10.36, 40–44].

For a left-continuous t-norm T, its T-residuum [10.10] $I_T$ generalizes
the Boolean implication; we prefer the name R-implication for $I_T$ (see
the next definition and the remark after it).

Another way to introduce an implication (called S-implication) which is an
extension of the Boolean implication is to exploit the fact that, in
a two-valued logic, the formulae $p \to q$ and $\neg p \vee q$ are
equivalent.


Definition 10.7
Suppose $(T, S, N)$ is a De Morgan triplet.

An R-implication $I_T$ associated with the t-norm $T$ is defined by

$$I_T(x, y) = \sup\{z \mid T(x, z) \le y\} \quad (x, y \in [0, 1]). \tag{10.9}$$

An S-implication $I_{S,N}$ associated with the t-conorm $S$ and the strong negation $N$ is defined by

$$I_{S,N}(x, y) = S(N(x), y) \quad (x, y \in [0, 1]). \tag{10.10}$$


It is easy to see that both $I_T$ and $I_{S,N}$ satisfy properties I1-I3 for any t-norm $T$, t-conorm $S$ and strong negation $N$; thus they are fuzzy implications. Note that if $T$ is a continuous Archimedean t-norm with additive generator $t$, then

$$I_T(x, y) = t^{-1}(\max\{t(y) - t(x), 0\}) \quad (x, y \in [0, 1]).$$
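The three constructions above can be compared on a concrete case. The sketch below assumes the Łukasiewicz t-norm $T(x, y) = \max(x + y - 1, 0)$ with additive generator $t(x) = 1 - x$, its dual t-conorm and the standard negation; these choices and all function names are illustrative and not taken from the text. The supremum in (10.9) is approximated by a maximum over a finite grid, which happens to be exact here because the grid is closed under the operations involved.

```python
from fractions import Fraction
from itertools import product

ZERO, ONE = Fraction(0), Fraction(1)
GRID = [Fraction(k, 20) for k in range(21)]  # exact sample points in [0, 1]


def t_luk(x, y):
    """Lukasiewicz t-norm (assumed example)."""
    return max(x + y - 1, ZERO)


def s_luk(x, y):
    """Lukasiewicz t-conorm, dual to t_luk under the standard negation."""
    return min(x + y, ONE)


def neg(x):
    """Standard strong negation N(x) = 1 - x."""
    return 1 - x


def gen(x):
    """Additive generator t(x) = 1 - x of the Lukasiewicz t-norm."""
    return 1 - x


def gen_inv(u):
    """Inverse of the generator on [0, 1]."""
    return 1 - u


def r_implication_sup(T, x, y, grid=GRID):
    """Grid version of (10.9): the largest grid point z with T(x, z) <= y."""
    return max(z for z in grid if T(x, z) <= y)


def r_implication_generator(t, t_inv, x, y):
    """Generator formula: t^{-1}(max{t(y) - t(x), 0})."""
    return t_inv(max(t(y) - t(x), ZERO))


def s_implication(S, N, x, y):
    """S-implication (10.10): S(N(x), y)."""
    return S(N(x), y)


if __name__ == "__main__":
    x, y = Fraction(7, 10), Fraction(1, 5)
    print(r_implication_sup(t_luk, x, y))                # 1/2
    print(r_implication_generator(gen, gen_inv, x, y))   # 1/2
    print(s_implication(s_luk, neg, x, y))               # 1/2
```

For this particular triplet the R-implication and the S-implication coincide with $\min(1, 1 - x + y)$; this is a special feature of the Łukasiewicz connectives and does not hold for arbitrary De Morgan triplets.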


Let us emphasize an important link between R-implications defined by left-continuous t-norms and residua in lattice-ordered monoids.

Assume that $L$ is a non-empty set, $(L, \preceq)$ is a lattice and $(L, *)$ is a semigroup with neutral element. We introduce some definitions; for more details see [10.45].

i) The triplet $(L, *, \preceq)$ is called a lattice-ordered monoid (or an l-monoid) if for all $x, y, z \in L$ we have:

   LM1) $x * (y \lor z) = (x * y) \lor (x * z)$,
   LM2) $(x \lor y) * z = (x * z) \lor (y * z)$.

ii) An l-monoid $(L, *, \preceq)$ is said to be commutative if the semigroup $(L, *)$ is commutative.

iii) A commutative l-monoid $(L, *, \preceq)$ is called a commutative, residuated l-monoid if there exists a further binary operation $\to_*$ on $L$, i.e., a function $\to_* : L^2 \to L$ (the $*$-residuum), such that for all $x, y, z \in L$ we have

   (R) $x * y \preceq z$ if and only if $x \preceq (y \to_* z)$.

iv) An l-monoid $(L, *, \preceq)$ is called integral if there is a greatest element in the lattice $(L, \preceq)$ (often called the universal upper bound) which coincides with the neutral element of the semigroup $(L, *)$.


It is evident that $([0, 1], T, \le)$ is a commutative integral l-monoid if and only if the function $T : [0, 1]^2 \to [0, 1]$ is a t-norm. It turns out that the left-continuity of a t-norm can be characterized by the fact that the corresponding l-monoid is residuated. In this case the $T$-residuum $I_T$ is given by (10.9), see [10.10]. Because of its interpretation in $[0, 1]$-valued logics, the $T$-residuum is also called a residual implication (or briefly, an R-implication).

Given a left-continuous t-norm $T$, the R-implication $I_T$ is left-continuous in its first and right-continuous in its second argument, and it is continuous if and only if the underlying t-norm is nilpotent [10.29, 46].
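As a worked illustration of the continuity statement, consider two standard t-norms that are assumed here as examples and are not singled out in the text at this point: the algebraic product $T_P$ (strict, hence not nilpotent) and the Łukasiewicz t-norm $T_L$ (nilpotent). Applying (10.9) to each gives:

```latex
% Product t-norm T_P(x,z) = xz  (assumed example, strict)
I_{T_P}(x,y) = \sup\{z \in [0,1] \mid xz \le y\}
             = \begin{cases} 1 & \text{if } x \le y,\\[2pt] y/x & \text{if } x > y. \end{cases}

% Discontinuity at (0,0): the value and the limit along y = 0 disagree.
I_{T_P}(0,0) = 1, \qquad \lim_{x \to 0^+} I_{T_P}(x,0) = \lim_{x \to 0^+} \frac{0}{x} = 0 .

% Lukasiewicz t-norm T_L(x,z) = \max(x+z-1,0)  (assumed example, nilpotent)
I_{T_L}(x,y) = \sup\{z \in [0,1] \mid x+z-1 \le y\} = \min(1,\, 1-x+y),
% which is continuous on [0,1]^2.
```

So the residuum of the product fails to be continuous at the single point $(0, 0)$, while the residuum of the nilpotent Łukasiewicz t-norm is continuous everywhere, in line with the characterization above.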

For the sake of completeness we mention a third type of connective, used in quantum logic and called QL-implication, defined as follows:

$$I_{T,S,N}(x, y) = S(N(x), T(x, y)) \quad (x, y \in [0, 1]). \tag{10.11}$$

For the idea behind QL-implications, see [10.47]. In general, $I_{T,S,N}$ violates property I2, so it is not a fuzzy implication in the sense of Definition 10.6. Conditions under which I2 is satisfied can be found in [10.17].


Negations Defined by Implications
As we have seen, several types of negations can be introduced in fuzzy logic. The link between fuzzy implications and negations can be expressed by requiring that the function $N$ defined by

$$N(x) = I(x, 0) \quad \text{for all } x \in [0, 1],$$

be a negation, where $I$ is a fuzzy implication. This is motivated by the corresponding classical rule.

Suppose that $I : [0, 1]^2 \to [0, 1]$ is a function satisfying I3, I7, I8 and define $N(x) = I(x, 0)$ ($x \in [0, 1]$). Then (a) $N$ is a negation; (b) $x \le N(N(x))$ for all $x \in [0, 1]$; (c) $N(N(N(x))) = N(x)$ for all $x \in [0, 1]$. If, in addition, $N$ is continuous then it is involutive [10.29]. Thus, under the above conditions, $N$ cannot be a noninvolutive strict negation: it is either discontinuous or a strong negation. If $N$ is continuous then $I$ fulfils I12 with this $N$.

For a positive t-norm $T$ (like min or the algebraic product), the negation obtained via its R-implication is not continuous. In fact, in this case we have that

$$I_T(x, 0) = \begin{cases} 1 & \text{if } x = 0, \\ 0 & \text{if } x > 0. \end{cases}$$


10.4 Concluding Remarks

The study of fuzzy implications, triangular norms and their extensions is a never-ending story. During such research, fundamental new properties and classes have been discovered, and essential results have been proved and applied to diverse problem classes. These are beyond the goal of the present chapter. Nevertheless, we name just a few directions.

Firstly, we mention uninorms [10.48, 49], a joint extension of both t-norms and t-conorms, with the neutral element being an arbitrary number between 0 and 1. Their study includes the Frank functional equation [10.50], their residual operators [10.51], different extensions [10.52, 53], the characterization of their mathematical properties [10.54] and important subclasses such as idempotent [10.55] and representable uninorms [10.56]. It turns out that some special uninorms were already hidden, without using the name uninorm, in the classical expert system MYCIN [10.57].

Secondly, some recent papers on fuzzy implications are briefly listed, in which the interested reader can find further references. After the book [10.34] was published, several important contributions have been made by the authors themselves, like [10.58]. Some algebraic properties of fuzzy implications such as distributivity [10.59] and contrapositive symmetry [10.60], the law of importation or the exchange principle [10.61], typically in the form of functional equations, have also been studied intensively. New construction methods have also been introduced and deeply studied [10.62-65].


References
10.1 L. Zadeh: Fuzzy sets, Inf. Control 8, 338-353 (1965)
10.2 C. Hartshorne, P. Weiss (Eds.): Principles of Philosophy, Collected Papers of Charles Sanders Peirce (Harvard University Press, Cambridge 1931)
10.3 B. Russell: Vagueness, Austr. J. Philos. 1, 84-92 (1923)
10.4 J. Łukasiewicz: Philosophical remarks on many-valued systems of propositional logic. In: Selected Works, Studies in Logic and the Foundations of Mathematics, ed. by L. Borkowski (North-Holland, Amsterdam 1970) pp. 153-179
10.5 M. Black: Vagueness, Philos. Sci. 4, 427-455 (1937)
10.6 H. Weyl: The ghost of modality. In: Philosophical Essays in Memory of Edmund Husserl, ed. by M. Farber (Cambridge, Cambridge 1940) pp. 278-303
10.7 A. Kaplan, H.F. Schott: A calculus for empirical classes, Methods III, 165-188 (1951)
10.8 K. Menger: Ensembles flous et fonctions aléatoires, C. R. Acad. Sci. Paris 232, 2001-2003 (1951)
10.9 B. Schweizer, A. Sklar: Probabilistic Metric Spaces (North-Holland, Amsterdam 1983)
10.10 E.P. Klement, R. Mesiar, E. Pap: Triangular Norms (Kluwer, Dordrecht 2000)
10.11 D. Dubois, W. Ostasiewicz, H. Prade: Fuzzy sets: History and basic notions. In: Fundamentals of Fuzzy Sets (Kluwer, Dordrecht 2000), Chap. 1, pp. 21-124
10.12 M. Sugeno: Fuzzy measures and fuzzy integrals: A survey. In: Fuzzy Automata and Decision Processes, ed. by G.N. Saridis, M.M. Gupta, B.R. Gaines (North-Holland, Amsterdam 1977) pp. 89-102


10.13 E. Trillas: Sobre funciones de negación en la teoría de conjuntos difusos, Stochastica III, 47-60 (1979)
10.14 K. Menger: Statistical metric spaces, Proc. Natl. Acad. Sci. USA 28, 535-537 (1942)
10.15 C. Alsina, M.J. Frank, B. Schweizer: Associative Functions: Triangular Norms and Copulas (World Scientific, Hoboken 2006)
10.16 E.P. Klement: Operations on fuzzy sets - An axiomatic approach, Inf. Sci. 27, 221-232 (1982)
10.17 J.C. Fodor: Contrapositive symmetry of fuzzy implications, Fuzzy Sets Syst. 69, 141-156 (1995)
10.18 R. Sikorski: Boolean Algebras (Springer, Berlin 1964)
10.19 C.H. Ling: Representation of associative functions, Publ. Math. Debr. 12, 189-212 (1965)
10.20 N.H. Abel: Untersuchung der Functionen zweier unabhängig veränderlichen Grössen x und y wie f(x, y), welche die Eigenschaft haben, dass f(z, f(x, y)) eine symmetrische Function von x, y und z ist, J. Reine Angew. Math. 1, 11-15 (1826)
10.21 W.M. Faucett: Compact semigroups irreducibly connected between two idempotents, Proc. Am. Math. Soc. 6, 741-747 (1955)
10.22 P.S. Mostert, A.L. Shields: On the structure of semigroups on a compact manifold with boundary, Ann. Math. 65, 117-143 (1957)
10.23 A.B. Paalman-de Miranda: Topological Semigroups, Technical Report (Mathematisch Centrum, Amsterdam 1964)
10.24 J. Aczél: Über eine Klasse von Funktionalgleichungen, Comment. Math. Helv. 54, 247-256 (1948)
10.25 J. Aczél: Sur les opérations définies pour des nombres réels, Bull. Soc. Math. Fr. 76, 59-64 (1949)
10.26 J. Aczél: Lectures on Functional Equations and their Applications (Academic, New York 1966)
10.27 S. Ovchinnikov, M. Roubens: On fuzzy strict preference, indifference and incomparability relations, Fuzzy Sets Syst. 47, 313-318 (1992)
10.28 S. Ovchinnikov, M. Roubens: On strict preference relations, Fuzzy Sets Syst. 43, 319-326 (1991)
10.29 J. Fodor, M. Roubens: Fuzzy Preference Modelling and Multicriteria Decision Support (Kluwer, Dordrecht 1994)
10.30 M.J. Frank: On the simultaneous associativity of F(x, y) and x + y - F(x, y), Aeq. Math. 19, 194-226 (1979)
10.31 E.P. Klement, R. Mesiar, E. Pap: A characterization of the ordering of continuous t-norms, Fuzzy Sets Syst. 86, 189-195 (1997)
10.32 H. Hamacher: Über logische Aggregationen nicht-binär explizierter Entscheidungskriterien; Ein axiomatischer Beitrag zur normativen Entscheidungstheorie (Fischer, Frankfurt 1978)
10.33 J.C. Fodor, T. Keresztfalvi: Characterization of the Hamacher family of t-norms, Fuzzy Sets Syst. 65, 51-58 (1994)
10.34 M. Baczyński, B. Jayaram: Fuzzy Implications (Springer, Berlin 2008)


10.35 P. Smets, P. Magrez: Implication in fuzzy logic, Int. J. Approx. Reason. 1, 327-347 (1987)
10.36 E. Trillas, L. Valverde: On some functionally expressable implications for fuzzy set theory, Proc. 3rd Int. Seminar on Fuzzy Set Theory (Johannes Kepler Universität, Linz 1981) pp. 173-190
10.37 B.R. Gaines: Foundations of fuzzy reasoning, Int. J. Man-Mach. Stud. 8, 623-668 (1976)
10.38 R.R. Yager: An approach to inference in approximate reasoning, Int. J. Man-Mach. Stud. 13, 323-338 (1980)
10.39 W. Bandler, L.J. Kohout: Fuzzy power sets and fuzzy implication operators, Fuzzy Sets Syst. 4, 13-30 (1980)
10.40 E. Trillas, L. Valverde: On implication and indistinguishability in the setting of fuzzy logic. In: Management Decision Support Systems using Fuzzy Sets and Possibility Theory, ed. by J. Kacprzyk, R.R. Yager (Verlag TÜV Rheinland, Köln 1985) pp. 198-212
10.41 H. Prade: Modèles mathématiques de l'imprécis et de l'incertain en vue d'applications au raisonnement naturel, Ph.D. Thesis (Université P. Sabatier, Toulouse 1982)
10.42 S. Weber: A general concept of fuzzy connectives, negations and implications based on t-norms and t-conorms, Fuzzy Sets Syst. 11, 115-134 (1983)
10.43 D. Dubois, H. Prade: Fuzzy logics and the generalized modus ponens revisited, Int. J. Cybern. Syst. 15, 293-331 (1984)
10.44 D. Dubois, H. Prade: Fuzzy set-theoretic differences and inclusions and their use in the analysis of fuzzy equations, Control Cybern. 13, 129-145 (1984)
10.45 G. Birkhoff: Lattice Theory, Collected Publications, Vol. 25 (Am. Math. Soc., Providence 1967)
10.46 U. Bodenhofer: A Similarity-Based Generalization of Fuzzy Orderings, Schriften der Johannes-Kepler-Universität Linz, Vol. 26 (Universitätsverlag Rudolf Trauner, Linz 1999)
10.47 D. Dubois, H. Prade: Fuzzy sets in approximate reasoning, part 1: Inference with possibility distributions, Fuzzy Sets Syst. 40, 143-202 (1991)
10.48 R.R. Yager, A. Rybalov: Uninorm aggregation operators, Fuzzy Sets Syst. 80, 111-120 (1996)
10.49 J.C. Fodor, R.R. Yager, A. Rybalov: Structure of uninorms, Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 5(4), 411-427 (1997)
10.50 T. Calvo, B. De Baets, J. Fodor: The functional equations of Frank and Alsina for uninorms and nullnorms, Fuzzy Sets Syst. 120, 385-394 (2001)
10.51 B. De Baets, J. Fodor: Residual operators of uninorms, Soft Comput. 3, 89-100 (1999)
10.52 M. Mas, G. Mayor, J. Torrens: T-operators and uninorms on a finite totally ordered set, Int. J. Intell. Syst. 14, 909-922 (1999)
10.53 M. Mas, M. Monserrat, J. Torrens: On left and right uninorms, Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 9, 491-507 (2001)


10.54 M. Mas, G. Mayor, J. Torrens: The modularity condition for uninorms and t-operators, Fuzzy Sets Syst. 126, 207-218 (2002)
10.55 B. De Baets: Idempotent uninorms, Eur. J. Oper. Res. 118, 631-642 (1999)
10.56 J. Fodor, B. De Baets: A single-point characterization of representable uninorms, Fuzzy Sets Syst. 202, 89-99 (2012)
10.57 B. De Baets, J.C. Fodor: Van Melle's combining function in MYCIN is a representable uninorm: An alternative proof, Fuzzy Sets Syst. 104, 133-136 (1999)
10.58 B. Jayaram, M. Baczyński, R. Mesiar: R-implications and the exchange principle: The case of border continuous t-norms, Fuzzy Sets Syst. 224, 93-105 (2013)
10.59 M. Baczyński: On two distributivity equations for fuzzy implications and continuous, Archimedean t-norms and t-conorms, Fuzzy Sets Syst. 211, 34-54 (2013)
10.60 M. Baczyński, F. Qin: Some remarks on the distributive equation of fuzzy implication and the contrapositive symmetry for continuous, Archimedean t-norms, Int. J. Approx. Reason. 54, 290-296 (2013)
10.61 S. Massanet, J. Torrens: The law of importation versus the exchange principle on fuzzy implications, Fuzzy Sets Syst. 168, 47-69 (2011)
10.62 S. Massanet, J. Torrens: On a new class of fuzzy implications: h-implications and generalizations, Inf. Sci. 181, 2111-2127 (2011)
10.63 S. Massanet, J. Torrens: On some properties of threshold generated implications, Fuzzy Sets Syst. 205, 30-49 (2012)
10.64 S. Massanet, J. Torrens: Threshold generation method of construction of a new implication from two given ones, Fuzzy Sets Syst. 205, 50-75 (2012)
10.65 S. Massanet, J. Torrens: On the vertical threshold generation method of fuzzy implication and its properties, Fuzzy Sets Syst. 206, 32-52 (2013)


11. Fuzzy Relations: Past, Present, and Future 


Susana Montes, Ignacio Montes, Tania Iglesias 


Relations are used in many branches of mathematics to model concepts like is lower than, is equal to, etc. Initially, only crisp relations were considered, but in recent years fuzzy relations have proved to be a very useful tool in psychology, engineering, medicine, economics and any mathematically based field. A first approach to the concept of fuzzy relations is given in this chapter. Operations among fuzzy relations are defined in general. For the particular case of fuzzy binary relations, their main properties are studied. Also, some particular classes of fuzzy binary relations are considered and related to one another. Of course, this chapter is just a starting point for a more detailed study of the specialized literature.


11.1 Fuzzy Relations
    11.1.1 Operations on Fuzzy Relations
    11.1.2 Specific Operations on Fuzzy Relations


The notion of relation plays a central role in various 
fields of mathematics. As a consequence of that, it is 
a very important concept in all engineering, science, 
and mathematically based fields. 

Crisp or classical relations have a limitation: they cannot express partial levels of relationship between two elements. This is a problem in many practical situations, since an element is not always clearly related to another one. The valued theory arises with the aim of assigning degrees to the relations between alternatives. As is well known, according to fuzzy set theory [11.1], the connection established between two alternatives admits different degrees of intensity, and that intensity is represented by a value in the interval [0, 1]. The idea of working with values different from 0 and 1 to express the relationship between two elements was already considered by Łukasiewicz in the 1920s when he introduced his three-valued logic, and later by Luce [11.2] or Menger [11.3], but it was Zadeh [11.4] who formally defined the concept of a fuzzy (or multivalued) relation.

In the history of fuzzy mathematics, fuzzy relations were considered early on to be useful in various applications: fuzzy modeling, fuzzy diagnosis, and fuzzy control. They also have applications in fields such as psychology, medicine, economics, and sociology. For this reason, they have been extensively investigated. For a contemporary general approach to fuzzy relations one should look at Bělohlávek's book [11.5], and also at other general publications, for instance the books by Klir and Yuan [11.6] and Turunen [11.7].

Since this chapter is entirely devoted to fuzzy relations, our aim is to give a detailed introduction to them for a nonexpert reader.

11.1 Fuzzy Relations


Assume that $X$ and $Y$ are two given sets. A fuzzy relation $R$ is a mapping from the Cartesian product $X \times Y$ to the interval $[0, 1]$. Therefore, $R$ is basically a fuzzy set in the universe $X \times Y$. This means that, for any $x \in X$ and any $y \in Y$, the value $R(x, y)$ measures the strength with which $R$ connects $x$ with $y$. If $R(x, y)$ is close to 1, $x$ is strongly related to $y$ by $R$; if $R(x, y)$ is close to 0, then $R$ hardly connects $x$ with $y$, and so on.

This definition can be extended to Cartesian products of more than two sets, in which case they are called n-ary fuzzy relations. Note that fuzzy sets may be viewed as degenerate, 1-ary fuzzy relations.


Example 11.1
If we consider the case $X = Y = [0, 3]$, we could define the fuzzy relation approximately equal to as follows:

$$R(x, y) = e^{-(x - y)^2}, \quad \forall (x, y) \in [0, 3] \times [0, 3],$$

which is represented in Fig. 11.1.


The domain of a fuzzy relation $R$ is a fuzzy set on $X$, whose membership function is given by

$$\operatorname{dom} R(x) = \sup_{y \in Y} R(x, y),$$

and the range is the fuzzy set on $Y$ given by

$$\operatorname{ran} R(y) = \sup_{x \in X} R(x, y).$$

When $X$ and $Y$ are finite sets, we can consider a matrix representation for any fuzzy relation. The entry in row $x$ and column $y$ of the associated matrix is the value $R(x, y)$.


Example 11.2
Let us consider the set $X$ formed by three papers, $X = \{p_1, p_2, p_3\}$, and let $Y$ be a set formed by five different topics, $Y = \{t_1, t_2, t_3, t_4, t_5\}$. The fuzzy relation $R$ measuring the degree of relationship of any paper with any topic is defined by

    R     t1    t2    t3    t4    t5
    p1    1.0   0.7   0.9   0.4   0.2
    p2    0.5   0.8   1.0   0.3   0.9
    p3    0.7   0.5   0.8   0.3   0.8

Its domain is given by the fuzzy subset of $X$

$$\operatorname{dom} R = \{(p_1, 1), (p_2, 1), (p_3, 0.8)\},$$

and its range by

$$\operatorname{ran} R = \{(t_1, 1), (t_2, 0.8), (t_3, 1), (t_4, 0.4), (t_5, 0.9)\}.$$
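For finite universes the suprema in the definitions of domain and range become row and column maxima of the matrix. The following minimal sketch recomputes the domain and range of Example 11.2; the dictionary layout and function names are illustrative choices, and the entry for $(p_3, t_1)$ is read as 0.7, which is the reading consistent with $\operatorname{dom} R(p_3) = 0.8$.

```python
# Fuzzy relation of Example 11.2 stored as a dict of rows (papers) to columns (topics).
R = {
    "p1": {"t1": 1.0, "t2": 0.7, "t3": 0.9, "t4": 0.4, "t5": 0.2},
    "p2": {"t1": 0.5, "t2": 0.8, "t3": 1.0, "t4": 0.3, "t5": 0.9},
    "p3": {"t1": 0.7, "t2": 0.5, "t3": 0.8, "t4": 0.3, "t5": 0.8},
}


def domain(rel):
    """dom R(x) = max over y of R(x, y): the row maxima."""
    return {x: max(row.values()) for x, row in rel.items()}


def range_(rel):
    """ran R(y) = max over x of R(x, y): the column maxima."""
    columns = next(iter(rel.values())).keys()
    return {y: max(row[y] for row in rel.values()) for y in columns}


if __name__ == "__main__":
    print(domain(R))  # {'p1': 1.0, 'p2': 1.0, 'p3': 0.8}
    print(range_(R))  # {'t1': 1.0, 't2': 0.8, 't3': 1.0, 't4': 0.4, 't5': 0.9}
```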


Fig. 11.1 Membership function of the fuzzy relation R (approximately equal to) introduced in Example 11.1


11.1.1 Operations on Fuzzy Relations

All concepts and operations applicable to fuzzy sets are applicable to fuzzy relations as well. Thus, for any fuzzy relations $R$ and $Q$ on the Cartesian product $X \times Y$, we have:

• Given a t-norm $T$ (for a complete study of t-norms and t-conorms, we refer to [11.8]), the $T$-intersection (or just intersection if there is no ambiguity) of $R$ and $Q$ is the fuzzy relation on $X \times Y$ defined by

  $$R \cap_T Q(x, y) = T(R(x, y), Q(x, y)), \quad \forall (x, y) \in X \times Y.$$

  Initially, $T$ was considered to be the minimum operator.

• Given a t-conorm $S$, the $S$-union of $R$ and $Q$ is the fuzzy relation on $X \times Y$ defined by

  $$R \cup_S Q(x, y) = S(R(x, y), Q(x, y)), \quad \forall (x, y) \in X \times Y.$$

  In the initial proposal, the maximum t-conorm was considered.

• The transpose or inverse of the fuzzy relation $R$, denoted as in the classical case by $R^{-1}$, is the fuzzy relation that satisfies

  $$R^{-1}(x, y) = R(y, x), \quad \forall (x, y) \in Y \times X.$$

• The complement of a fuzzy relation is not unique. It depends on the negator $n$ we choose. The $n$-complement of the fuzzy relation $R$, denoted by $R^c$, is the fuzzy relation defined by

  $$R^c(x, y) = n(R(x, y)), \quad \forall (x, y) \in X \times Y.$$

  Although the definition is given for any negator, the most widely used one is the standard negator ($n(x) = 1 - x$). In this case, it is called the standard complement and is defined by $R^c(x, y) = 1 - R(x, y)$.

• The dual of the fuzzy relation $R$ is defined and denoted as in the classical case. The fuzzy relation $R^d$ is the complement of the transpose of $R$:

  $$R^d(x, y) = n(R(y, x)), \quad \forall (x, y) \in Y \times X.$$

  That is, $R^d = (R^{-1})^c$.

• We say that $R$ is contained in $Q$, and we denote it by $R \subseteq Q$, if and only if for all $(x, y) \in X \times Y$ the inequality $R(x, y) \le Q(x, y)$ holds.

• $R$ and $Q$ are said to be equal if and only if for all $(x, y) \in X \times Y$ we have the equality $R(x, y) = Q(x, y)$, that is, $R \subseteq Q$ and $Q \subseteq R$.


11.1.2 Specific Operations on Fuzzy Relations

In the previous items, we only used the fact that fuzzy relations can be seen as fuzzy sets on $X \times Y$, and we adapted the corresponding definitions. However, fuzzy relations involve additional concepts and operations. The most important are compositions, projections and cylindrical extensions, among others.

• Let $R$ and $Q$ be two fuzzy relations on $X \times Y$ and $Y \times Z$, respectively. The $T$-composition of $R$ and $Q$ is the fuzzy relation defined by

  $$R \circ_T Q(x, z) = \sup_{y \in Y} T(R(x, y), Q(y, z)), \quad \forall (x, z) \in X \times Z.$$

  Due to the associativity and nondecreasingness of t-norms, the following result can be easily proven.

  Proposition 11.1
  Let $R$, $Q$, and $P$ be three fuzzy relations on $X \times Y$, $Y \times Z$, $Z \times U$, respectively. Then:

  i) $R \circ_T (Q \circ_T P) = (R \circ_T Q) \circ_T P$,
  ii) If $R'$ is another fuzzy relation on $X \times Y$ such that $R \subseteq R'$, then $R \circ_T Q \subseteq R' \circ_T Q$.

• Let $R$ be a fuzzy relation on $X \times Y$. We can project $R$ with respect to $X$ and $Y$ as follows:

  $$R_X(x) = \sup_{y \in Y} R(x, y), \quad \forall x \in X, \quad \text{and} \quad R_Y(y) = \sup_{x \in X} R(x, y), \quad \forall y \in Y,$$

  where $R_X$ and $R_Y$ denote the projected relation of $R$ to $X$ and $Y$, respectively.

  It is clear that the projection to $X$ coincides with the domain of $R$ and the projection to $Y$ with the range. The definition given for 2-ary fuzzy relations can be generalized to n-ary relations. Thus, if $R$ is a fuzzy relation on $X_1 \times X_2 \times \dots \times X_n$, the projected relation of $R$ to the subspace $X_{i_1} \times X_{i_2} \times \dots \times X_{i_k}$ is defined by

  $$R_{X_{i_1} \times X_{i_2} \times \dots \times X_{i_k}}(x_{i_1}, x_{i_2}, \dots, x_{i_k}) = \sup_{x_{j_1}, x_{j_2}, \dots, x_{j_m}} R(x_1, x_2, \dots, x_n),$$

  where $X_{j_1}, X_{j_2}, \dots, X_{j_m}$ represent the omitted dimensions and $X_{i_1}, X_{i_2}, \dots, X_{i_k}$ the remaining ones. That is,

  $$\{i_1, i_2, \dots, i_k\} \cup \{j_1, j_2, \dots, j_m\} = \{1, 2, \dots, n\}$$

  and

  $$\{i_1, i_2, \dots, i_k\} \cap \{j_1, j_2, \dots, j_m\} = \emptyset.$$

• Another operation on relations, which is in some sense an inverse to the projection, is called a cylindrical extension. If $A$ is a fuzzy subset of $X$, then its cylindrical extension to $Y$ is the fuzzy relation defined by

  $$\operatorname{cyl} A(x, y) = A(x), \quad \forall (x, y) \in X \times Y.$$
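For finite universes, composition, projection and cylindrical extension reduce to max and min computations over matrices. The sketch below is an illustration under stated assumptions: the relations, element names and function names are made up for the example, relations are stored as nested dictionaries, and the default t-norm is the minimum (giving the max-min composition).

```python
def compose(R, Q, T=min):
    """sup-T composition of R (on X x Y) with Q (on Y x Z) for finite universes."""
    zs = list(next(iter(Q.values())))
    return {x: {z: max(T(R[x][y], Q[y][z]) for y in Q) for z in zs} for x in R}


def project_X(R):
    """Projection to X: R_X(x) = max over y of R(x, y)."""
    return {x: max(row.values()) for x, row in R.items()}


def project_Y(R):
    """Projection to Y: R_Y(y) = max over x of R(x, y)."""
    ys = list(next(iter(R.values())))
    return {y: max(R[x][y] for x in R) for y in ys}


def cylindrical_extension(A, ys):
    """cyl A(x, y) = A(x) for every y in ys."""
    return {x: {y: a for y in ys} for x, a in A.items()}


# Hypothetical data: R on {a, b} x {u, v} and Q on {u, v} x {s, t}.
R = {"a": {"u": 0.3, "v": 0.8}, "b": {"u": 1.0, "v": 0.2}}
Q = {"u": {"s": 0.5, "t": 0.9}, "v": {"s": 0.6, "t": 0.1}}

if __name__ == "__main__":
    print(compose(R, Q))   # {'a': {'s': 0.6, 't': 0.3}, 'b': {'s': 0.5, 't': 0.9}}
    print(project_X(R))    # {'a': 0.8, 'b': 1.0}
    print(project_Y(R))    # {'u': 1.0, 'v': 0.8}
    print(cylindrical_extension(project_X(R), ["u", "v"]))
```

Extending the projection $R_X$ back to $X \times Y$ yields a relation that contains $R$, which is one way to read the remark that cylindrical extension is only "in some sense" an inverse of projection. Passing another t-norm, for instance `T=lambda a, b: a * b`, gives the corresponding sup-product composition.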
11.2 Cut Relations


Any fuzzy relation $R$ has an associated family of crisp relations $\{R_\alpha \mid \alpha \in [0, 1]\}$, called cut relations, which are defined by

$$R_\alpha = \{(x, y) \in X \times Y \mid R(x, y) \ge \alpha\}.$$

It is clear that they are just the $\alpha$-cuts of $R$, considered as a fuzzy set. Thus, it is immediate that they form a chain (nested family) of relations, that is,

$$\emptyset \subseteq R_{\alpha_m} \subseteq R_{\alpha_{m-1}} \subseteq \dots \subseteq R_{\alpha_1} \subseteq X \times Y$$

if $0 \le \alpha_1 \le \dots \le \alpha_{m-1} \le \alpha_m \le 1$.

Moreover, it is possible to represent a fuzzy relation by means of its cut relations, since

$$R(x, y) = \sup_{\alpha \in [0, 1]} \min(\alpha, R_\alpha(x, y)), \quad \forall (x, y) \in X \times Y,$$

where $R_\alpha(x, y)$ is 1 if $(x, y) \in R_\alpha$ and 0 otherwise. This is denoted by

$$R = \sup_{\alpha \in [0, 1]} \alpha R_\alpha.$$


11.3 Fuzzy Binary Relations

In the particular case $X = Y$, fuzzy relations are called fuzzy binary relations or valued binary relations, and they have specific and interesting properties. Detailed proofs of the results presented here can be found in [11.9].

The first specific characteristic of fuzzy binary relations is that, apart from the matrix representation, they admit a graph representation if $X$ is finite. In this directed graph, $X$ is the set of nodes (vertices) and $R$ is the set of arcs (edges). The arc from $x$ to $y$ exists if and only if $x$ and $y$ are related in some sense ($R(x, y) > 0$). A number on each arc represents the membership degree of the corresponding pair to $R$.

Example 11.3
If we consider $X = \{x, y, z, t\}$, the fuzzy binary relation

    R     x     y     z     t
    x     1.0   0.4   0.2   0.0
    y     0.6   0.9   0.0   0.0
    z     0.0   0.0   0.0   0.0
    t     0.0   0.0   0.8   0.0

can also be represented by the graph in Fig. 11.2.

Fig. 11.2 Directed graph associated to a fuzzy binary relation

In the following, we will list some basic properties of fuzzy binary relations. Usually, these properties are translations of the equivalent ones for the particular case of (crisp) binary relations.


11.3.1 Reflexivity

The most used definition of reflexivity for fuzzy binary relations was given by Zadeh in 1971 [11.4]. Thus, a fuzzy binary relation $R$ on $X$ is said to be reflexive iff

$$R(x, x) = 1, \quad \forall x \in X.$$

This means that every vertex in the graph originates a simple loop.

Other less restrictive definitions have also been considered in the literature. Thus, we say that $R$ is $\epsilon$-reflexive [11.10], with $\epsilon \in (0, 1]$, iff

$$R(x, x) \ge \epsilon, \quad \forall x \in X,$$

and weakly reflexive [11.10] iff

$$R(x, x) \ge R(x, y), \quad \forall x, y \in X.$$

Of course, in the particular case of crisp binary relations, all of them reduce to the usual definition of reflexivity. Moreover, if $R$ is reflexive, it is $\epsilon$-reflexive for any $\epsilon \in (0, 1]$ and weakly reflexive. The remaining implications are not true in general.

A cutworthy study of this property is given in the following proposition.

Proposition 11.2
Let $R$ be a fuzzy binary relation on $X$. If $R$ is reflexive, then its associated cut relations $R_\alpha$, with $\alpha \in (0, 1]$, are reflexive.

The cutworthy property is not fulfilled, in general, by $\epsilon$-reflexive or weakly reflexive fuzzy binary relations.
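Cut relations and the statement of Proposition 11.2 can be checked mechanically for finite relations. The following is a minimal sketch under stated assumptions: the relation `R` is the matrix of Example 11.3 as read here, the reflexive relation `E` is invented for illustration, and the function names are hypothetical. The supremum over $\alpha \in [0, 1]$ is replaced by a maximum over the finitely many positive membership values, which suffices for a finite relation.

```python
# Relation of Example 11.3 on X = {x, y, z, t}, stored as a dict of rows.
R = {
    "x": {"x": 1.0, "y": 0.4, "z": 0.2, "t": 0.0},
    "y": {"x": 0.6, "y": 0.9, "z": 0.0, "t": 0.0},
    "z": {"x": 0.0, "y": 0.0, "z": 0.0, "t": 0.0},
    "t": {"x": 0.0, "y": 0.0, "z": 0.8, "t": 0.0},
}

# Hypothetical reflexive relation used to illustrate Proposition 11.2.
E = {"a": {"a": 1.0, "b": 0.3}, "b": {"a": 0.5, "b": 1.0}}


def alpha_cut(rel, alpha):
    """Crisp cut relation: all pairs whose membership degree is at least alpha."""
    return {(a, b) for a, row in rel.items() for b, v in row.items() if v >= alpha}


def positive_levels(rel):
    """Distinct positive membership values occurring in the relation."""
    return sorted({v for row in rel.values() for v in row.values() if v > 0})


def rebuild_from_cuts(rel):
    """R(a, b) = max over levels alpha of min(alpha, indicator of (a, b) in the cut)."""
    cuts = {alpha: alpha_cut(rel, alpha) for alpha in positive_levels(rel)}
    return {
        a: {b: max((al for al, cut in cuts.items() if (a, b) in cut), default=0.0)
            for b in row}
        for a, row in rel.items()
    }


def is_reflexive(rel):
    """Fuzzy reflexivity: R(a, a) = 1 for every a."""
    return all(rel[a][a] == 1.0 for a in rel)


def cut_is_reflexive(rel, alpha):
    """Crisp reflexivity of the alpha-cut."""
    cut = alpha_cut(rel, alpha)
    return all((a, a) in cut for a in rel)


if __name__ == "__main__":
    print(sorted(alpha_cut(R, 0.6)))
    print(rebuild_from_cuts(R) == R)
    print(all(cut_is_reflexive(E, al) for al in positive_levels(E)))
```

The relation of Example 11.3 is not reflexive (for instance $R(z, z) = 0$), so it is used here only to illustrate the cuts and their nesting.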


11.3.2 Irreflexivity 


A fuzzy binary relation that is irreflexive, or antireflexive, is a fuzzy binary relation where no element is related in any degree to itself. Formally,

$$R(x, x) = 0, \quad \forall x \in X.$$

This means that no vertex in the graph originates a simple loop.

Analogously to the case of reflexivity, we can consider some generalizations of this concept:

• $\epsilon$-irreflexive, with $\epsilon \in [0, 1)$: $R(x, x) \le \epsilon$, $\forall x \in X$.
• Weakly irreflexive: $R(x, x) \le R(x, y)$, $\forall x, y \in X$.

The behavior of cut relations is similar to the previous case.

Proposition 11.3
Let $R$ be a fuzzy binary relation on $X$. If $R$ is irreflexive, then its associated cut relations $R_\alpha$, with $\alpha \in (0, 1]$, are irreflexive.

Again, this condition is not fulfilled, in general, by $\epsilon$-irreflexive or weakly irreflexive relations.


11.3.3 Symmetry 


For symmetry, there is no change with respect to the classical definition for crisp relations. Thus, a fuzzy binary relation $R$ on $X$ is said to be symmetric if

$$R(x, y) = R(y, x), \quad \forall x, y \in X.$$

This is equivalent to requiring that $R$ and its inverse are equal:

$$R = R^{-1}.$$

Of course, if $R$ is symmetric, so are its associated cut relations for any $\alpha \in (0, 1]$.


11.3.4 Antisymmetry 


In the crisp case, a binary relation $R$ is antisymmetric iff $xRy$ and $yRx$ imply that $x = y$. This is equivalent to requiring that $x \ne y$ implies $(x, y) \notin R \cap R^{-1}$. Thus, the intersection can be used in order to define antisymmetry. In the fuzzy case, the intersection will be defined, as usual, by means of a t-norm $T$. The definition will be directly related to the t-norm, and therefore the t-norm used should appear in the name of the property. Thus, a fuzzy binary relation $R$ on $X$ is said to be $T$-antisymmetric if

$$x \ne y \implies T(R(x, y), R(y, x)) = 0$$

or, equivalently,

$$R \cap_T R^{-1}(x, y) = 0, \quad \forall x \ne y.$$

It is immediate that if $T$ and $T'$ are t-norms such that $T' \le T$, then the $T$-antisymmetry of a fuzzy relation implies its $T'$-antisymmetry.

In 1971, Zadeh [11.4] proposed to use the minimum t-norm for this aim, and he called it perfect antisymmetry or just antisymmetry. In that case, its cut relations are antisymmetric for any $\alpha \in (0, 1]$. This is also true for $T$-antisymmetry with a positive t-norm ($x, y > 0 \implies T(x, y) > 0$), since in this case $T$-antisymmetry and perfect antisymmetry are equivalent. However, the cutworthy property is not fulfilled, in general, for other t-norms.

Clearly, perfect antisymmetry implies $T$-antisymmetry for any t-norm $T$. However, perfect antisymmetry can be too restrictive in many cases, since it excludes relations where $R(x, y)$ and $R(y, x)$ are almost zero. If we consider t-norms with zero divisors, this problem is solved. In this way, the case when $R(x, y)$ and $R(y, x)$ are both high is avoided, but equality to zero is not required. For instance, if we consider the Łukasiewicz t-norm ($T(x, y) = \max(x + y - 1, 0)$), $T$-antisymmetry is equivalent to requiring that

$$R(x, y) + R(y, x) \le 1, \quad \forall x, y \in X \text{ such that } x \ne y.$$
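The difference between perfect antisymmetry and antisymmetry with respect to a t-norm with zero divisors can be seen on a two-element universe. This is a minimal sketch with invented membership values and hypothetical function names; exact fractions are used so that the comparison with zero is not affected by rounding.

```python
from fractions import Fraction as F


def t_min(a, b):
    """Minimum t-norm (perfect antisymmetry uses this one)."""
    return min(a, b)


def t_luk(a, b):
    """Lukasiewicz t-norm, which has zero divisors."""
    return max(a + b - 1, F(0))


def is_T_antisymmetric(R, T):
    """Check T(R(x, y), R(y, x)) = 0 for all x != y."""
    return all(T(R[x][y], R[y][x]) == 0 for x in R for y in R if x != y)


def sums_at_most_one(R):
    """Equivalent condition for the Lukasiewicz t-norm: R(x, y) + R(y, x) <= 1."""
    return all(R[x][y] + R[y][x] <= 1 for x in R for y in R if x != y)


# Hypothetical relations on X = {a, b}.
R = {"a": {"a": F(1), "b": F(3, 5)}, "b": {"a": F(3, 10), "b": F(1, 2)}}
P = {"a": {"a": F(1), "b": F(3, 5)}, "b": {"a": F(0), "b": F(1)}}

if __name__ == "__main__":
    print(is_T_antisymmetric(R, t_min), is_T_antisymmetric(R, t_luk))  # False True
    print(is_T_antisymmetric(P, t_min), is_T_antisymmetric(P, t_luk))  # True True
```

Here `R` has both off-diagonal degrees positive, so it is not perfectly antisymmetric, yet their sum does not exceed 1, so it is antisymmetric with respect to the Łukasiewicz t-norm. The relation `P` is perfectly antisymmetric and therefore antisymmetric for both t-norms.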
11.3.5 Asymmetry

Asymmetry is a stronger condition, since it is required not only for pairs of elements $(x, y)$ such that $x \ne y$, but also for any pair of elements in $X \times X$. Thus, the $T$-asymmetry of a fuzzy binary relation $R$ on $X$ is defined by

$$T(R(x, y), R(y, x)) = 0, \quad \forall x, y \in X,$$

or, equivalently,

$$R \cap_T R^{-1} = \emptyset.$$

Clearly, $T$-asymmetry implies $T'$-asymmetry if $T' \le T$, and hence classic asymmetry ($T = \min$), usually called just asymmetry, implies $T$-asymmetry for any t-norm $T$. In particular, asymmetry is equivalent to $T$-asymmetry if $T$ is positive. In that case, the associated cut relations are crisp asymmetric relations for any $\alpha \in (0, 1]$.

It is possible to relate asymmetry and irreflexivity as follows:

Proposition 11.4
Let $R$ be a $T$-asymmetric fuzzy binary relation on $X$. The following statements hold:

i) $R$ is irreflexive if and only if $T$ is a positive t-norm;
ii) $R$ is $\epsilon$-irreflexive for $\epsilon < 1$ if and only if $T$ has zero divisors and $\epsilon$ belongs to the interval $(0, \sup\{x \in [0, 1] \mid T(x, x) = 0\})$.


11.3.6 Transitivity 


The pairwise comparison of possible alternatives is 
a first step in many approaches to decision making. If 
this first step lacks coherence, the whole decision pro- 
cess might become meaningless. A popular criterion for 
coherence is the transitivity of the involved relations, 
expressing that the strength of the link between two al- 
ternatives cannot be weaker than the strength of any 
chain involving another alternative [11.11]. 

The usual definition of transitivity for fuzzy rela- 
tions is related to a t-norm and it is a generalization 
of the proposal given by Zadeh in 1971 [11.4]. Thus, 
a fuzzy binary relation R on X is said to be T-transitive 
if 


T(R(x, y), RO, Z)) < R(x, z) 
for all x, y, z € X. 


As it happened for the concepts of T-asymmetry 
and T-antisymmetry, T-transitivity is not unique as it 


happened for classical relations. When T is the mini- 
mum t-norm that definition can also be expressed as 


R(x,z) > max(min(R(x, y),RQ,z))), Yx,zEX. 


Then, it is sometimes called max-min-transitivity. This 
coincides with the initial definition proposed by Zadeh. 

The T-transitivity is a natural way of extending the 
original definition by Zadeh, specially after t-norms and 
t-conorms began to be used in the 1970s by different 
authors to generalize the intersection and the union. 
However, many other types of transitivity were defined. 
From the least restrictive ones as the minimal transitiv- 
ity [11.12] defined by 


R(x, y) = 1 and R(y, z) = 1 = > R(x,z) = 1 


or the preference sensitive transitivity [11.13], also 
called quasitransitivity [11.12], defined by 


R(x,y)>0 and RỌy,z)>0 = R(x,z)>0. 


Other examples are the weak and parametric transitivities [11.14],
defined, respectively, by

$$R(x, y) > R(y, x) \text{ and } R(y, z) > R(z, y) \;\Rightarrow\; R(x, z) > R(z, x)$$

and by

$$R(x, y) > \theta \ge R(y, x) \text{ and } R(y, z) > \theta \ge R(z, y) \;\Rightarrow\; R(x, z) > \theta \ge R(z, x),$$

where $\theta$ is a fixed value in the interval $[0, 1)$, and the
weighted mean transitivity [11.15], which states that the inequalities
$R(x, y) > 0$ and $R(y, z) > 0$ require the existence of some
$\theta \in (0, 1)$ such that

$$R(x, z) \ge \theta \cdot \max(R(x, y), R(y, z)) + (1 - \theta) \cdot \min(R(x, y), R(y, z)),$$

among others.
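Three of these weaker notions (minimal transitivity, quasitransitivity and weak transitivity) are easy to test on a finite relation. The sketch below follows the definitions above; the function names and the two sample matrices `A` and `C` are assumptions made for illustration.

```python
from itertools import product

def triples(R):
    """All index triples (x, y, z) of a square matrix R."""
    return product(range(len(R)), repeat=3)

def is_minimal_transitive(R):
    # R(x,y) = 1 and R(y,z) = 1 imply R(x,z) = 1
    return all(R[x][z] == 1 for x, y, z in triples(R)
               if R[x][y] == 1 and R[y][z] == 1)

def is_quasitransitive(R):
    # R(x,y) > 0 and R(y,z) > 0 imply R(x,z) > 0
    return all(R[x][z] > 0 for x, y, z in triples(R)
               if R[x][y] > 0 and R[y][z] > 0)

def is_weakly_transitive(R):
    # R(x,y) > R(y,x) and R(y,z) > R(z,y) imply R(x,z) > R(z,x)
    return all(R[x][z] > R[z][x] for x, y, z in triples(R)
               if R[x][y] > R[y][x] and R[y][z] > R[z][y])

# A: element 0 is preferred to 1, 1 to 2, and 0 to 2 (no cycle).
A = [[0.5, 0.7, 0.6],
     [0.3, 0.5, 0.9],
     [0.4, 0.1, 0.5]]
# C: same as A except that 2 is preferred to 0, which creates a cycle.
C = [[0.5, 0.7, 0.2],
     [0.3, 0.5, 0.9],
     [0.8, 0.1, 0.5]]

print(is_weakly_transitive(A), is_weakly_transitive(C))
```

The cyclic relation `C` violates weak transitivity, yet it is still quasitransitive (all degrees are positive) and trivially minimally transitive (no degree equals 1), which shows how undemanding the latter two conditions are.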

T-norms offer a way of defining transitivity for fuzzy relations, but
these operators are known to be too restrictive in some cases. If we
look at the properties that an operator defining transitivity must
satisfy, associativity is only necessary if we want to extend the
definition to more than three elements. As for commutativity and the
boundary condition, a much weaker requirement suffices to generalize
the classical definition: commutativity and the boundary condition on
{0, 1}. Thus, recent studies of transitivity for fuzzy binary relations
are not restricted to t-norms; they consider a much more general
definition, obtained by imposing only the necessary conditions [11.16].
We therefore consider a conjunctor, that is, an increasing binary
operator $f : [0, 1]^2 \to [0, 1]$ that coincides with the Boolean
conjunction on {0, 1}. This definition preserves the concept of
conjunction for classical relations. The notion of conjunctor is
clearly much more general than that of t-norm: neither associativity
nor commutativity is required, and conjunctors are not even required to
have neutral element 1.

Given a conjunctor f, we can define the f-transitivity of a fuzzy
relation R in the same way as it is defined for t-norms:

$$f(R(x, y), R(y, z)) \le R(x, z), \quad \forall x, y, z \in X.$$


Since conjunctors are a much wider family of oper- 
ators, this definition includes more types of transitivity 
than the definition given just for t-norms. It is trivial 
that if we restrict this definition to t-norms, we get the 
classical definition of T-transitivity. 

In one particular case, however, the definition for conjunctors is too
general: if a reflexive relation R is f-transitive for a conjunctor f,
then f must be bounded by the minimum. That is, only conjunctors
smaller than or equal to the minimum can define the transitivity of
reflexive fuzzy relations.
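The bound can be seen from a short computation, sketched here for the values actually taken by R. It uses only the monotonicity of f, the reflexivity of R (so that R(y, y) = 1) and f-transitivity applied to the triples (x, y, y) and (y, y, z); this is a reconstruction of the argument, not a quotation from [11.16].

```latex
% Sketch: R reflexive and f-transitive, f increasing in each argument.
\begin{align*}
f(R(x,y), R(y,z)) &\le f(R(x,y), 1) = f(R(x,y), R(y,y)) \le R(x,y), \\
f(R(x,y), R(y,z)) &\le f(1, R(y,z)) = f(R(y,y), R(y,z)) \le R(y,z),
\end{align*}
% hence
\[
f(R(x,y), R(y,z)) \le \min(R(x,y), R(y,z)).
\]
```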


11.3.7 Negative Transitivity 


Another important property is negative transitivity, 
which is a dual property of transitivity. Thus, given a t- 
conorm S, a fuzzy binary relation R on X is said to be 
negatively S-transitive if 


$$R(x, z) \le S(R(x, y), R(y, z))$$

for all $x, y, z \in X$.

If T is a t-norm and S is its dual t-conorm, R is T-transitive if and
only if its dual $R^d$ is negatively S-transitive.

Clearly, if R is negatively S-transitive, then it is negatively
S'-transitive for any other t-conorm S' such that $S \le S'$. In
particular, negative transitivity with respect to the maximum implies
negative S-transitivity for any t-conorm S.
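The duality between T-transitivity and negative S-transitivity can be checked numerically. The sketch below assumes that the dual relation is given by $R^d(x, y) = n(R(y, x))$ with the standard negation n(x) = 1 - x, and uses the Łukasiewicz t-norm together with its dual t-conorm. These choices, the function names and the sample relations are assumptions made for the example; exact rational arithmetic avoids rounding issues.

```python
from fractions import Fraction as F
from itertools import product

def t_luk(a, b):
    """Lukasiewicz t-norm."""
    return max(F(0), a + b - 1)

def s_luk(a, b):
    """Lukasiewicz t-conorm (dual of t_luk under n(x) = 1 - x)."""
    return min(F(1), a + b)

def is_t_transitive(R, T):
    n = len(R)
    return all(T(R[x][y], R[y][z]) <= R[x][z]
               for x, y, z in product(range(n), repeat=3))

def is_negatively_s_transitive(R, S):
    n = len(R)
    return all(R[x][z] <= S(R[x][y], R[y][z])
               for x, y, z in product(range(n), repeat=3))

def dual(R):
    """Assumed dual relation: Rd(x, y) = 1 - R(y, x)."""
    n = len(R)
    return [[1 - R[y][x] for y in range(n)] for x in range(n)]

# R is Lukasiewicz-transitive, Q is not (Q[0][1] = Q[1][2] = 1 but Q[0][2] = 0).
R = [[F(1), F(4, 5), F(2, 5)],
     [F(4, 5), F(1), F(1, 2)],
     [F(2, 5), F(1, 2), F(1)]]
Q = [[F(1), F(1), F(0)],
     [F(1), F(1), F(1)],
     [F(0), F(1), F(1)]]

for rel in (R, Q):
    print(is_t_transitive(rel, t_luk),
          is_negatively_s_transitive(dual(rel), s_luk))
```

In both cases the two truth values printed on a line agree, as the equivalence predicts.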


11.3.8 Semitransitivity 


In the classical case, a crisp relation R is semitransitive if xRy and
yRz imply that xRt or tRz for every $t \in X$.

If we consider t-norms and t-conorms to generalize 
AND and OR, respectively, we obtain that a fuzzy bi- 
nary relation R on X is T-S-semitransitive if 


$$T(R(x, y), R(y, z)) \le S(R(x, t), R(t, z))$$

for every $x, y, z, t \in X$.

It is clear that T-S-semitransitivity implies T'-S'-semitransitivity
for any t-norm T' such that $T' \le T$ and any t-conorm S' such that
$S \le S'$. Thus, the classical semitransitivity with the minimum
t-norm and the maximum t-conorm implies T-S-semitransitivity for any
t-norm T and any t-conorm S.

As a consequence of the definition, for any T-S- 
semitransitive fuzzy relation R we have that: 


• If R is reflexive, then it is negatively S-transitive
• If R is irreflexive, then it is T-transitive.


Moreover, we can easily prove the following propo- 
sitions [11.9]. 


Proposition 11.5 
If R is T-transitive and negatively S-transitive, then R is 
T-S-semitransitive. 


Proposition 11.6 

Suppose that T is a continuous t-norm in the De Morgan 
triple (T, S,n). If R is T-asymmetric and negatively S- 
transitive then R is T-S-semitransitive. 


11.3.9 Completeness 


In crisp set theory, the concept of completeness is clear: a relation R
defined on X is complete if every two distinct elements are related by
R, that is, if xRy or yRx for any pair of distinct elements x and y of
X. For reflexive relations, this is equivalent to the concept of strong
completeness (xRy or yRx, $\forall x, y \in X$). In the setting of
classical relations, completeness is thus equivalent to the absence of
incomparability: there is no pair of elements that cannot be compared,
since they are related at least by R or $R^{-1}$. When we try to
generalize this concept to fuzzy relations, the difficulty lies in
fuzzifying the notion "related at least by R or $R^{-1}$". In the
classical case, it is clear that x and y are related by R if and only
if R(x, y) = 1 or R(y, x) = 1. This suggests a first way to define the
concept of completeness for fuzzy relations.

Thus, a fuzzy binary relation R defined on X is 
strongly complete if 


$$\max(R(x, y), R(y, x)) = 1, \quad \forall x, y \in X.$$


Perny and Roy [11.17] call it just complete. 

But this condition could be considered too restric- 
tive for fuzzy relations. Consider, for example, the 
case in which both R(x, y) = 0.95 and R(y, x) = 0.95. 
By the definition given above (strong completeness), 
R is not complete but it is clear that x and y are re- 
lated by R. Taking into account this type of situations, 
other less restrictive completeness conditions were pro- 
posed. 

Among the most employed in the literature, we find 
the one known as weak completeness. A fuzzy relation 
R defined on X is weakly complete if 


$$R(x, y) + R(y, x) \ge 1, \quad \forall x, y \in X.$$


This condition is called connectedness in [11.13, 
18] while in other works [11.14] this name makes ref- 
erence to the strong completeness. 

It is clear that this definition is much less restrictive 
than the one called strong completeness. Strongly com- 
plete relations are a particular type of weakly complete 
relations. 
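The difference between the two conditions can be illustrated with the example used above, R(x, y) = R(y, x) = 0.95. This is a minimal sketch for relations stored as matrices; the function names, the tolerance and the second relation `Q` are illustrative assumptions.

```python
def is_strongly_complete(R, tol=1e-12):
    """max(R(x,y), R(y,x)) = 1 for all x, y (up to rounding)."""
    n = len(R)
    return all(max(R[x][y], R[y][x]) >= 1.0 - tol
               for x in range(n) for y in range(n))

def is_weakly_complete(R, tol=1e-12):
    """R(x,y) + R(y,x) >= 1 for all x, y (up to rounding)."""
    n = len(R)
    return all(R[x][y] + R[y][x] >= 1.0 - tol
               for x in range(n) for y in range(n))

# The example from the text: both degrees equal 0.95.
R = [[1.0, 0.95],
     [0.95, 1.0]]
# A strongly complete relation, hence also weakly complete.
Q = [[1.0, 1.0],
     [0.2, 1.0]]

print(is_strongly_complete(R), is_weakly_complete(R))
print(is_strongly_complete(Q), is_weakly_complete(Q))
```

The relation `R` is weakly but not strongly complete, whereas `Q` satisfies both conditions.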

If we take a careful look at these two definitions, we can express them
by means of a t-conorm. On the one hand, we can quickly identify the
maximum t-conorm as the operator that relates R and its transpose in
the first definition. On the other hand, since $R(x, y) + R(y, x) \ge 1$
is equivalent to $\min(R(x, y) + R(y, x), 1) = 1$, weak completeness
relates R and $R^{-1}$ by means of the Łukasiewicz t-conorm.

These two conditions are special cases of what is 
called S-completeness. A fuzzy relation R defined on X 
is called S-complete [11.19], where S is a t-conorm, if 


$$S(R(x, y), R(y, x)) = 1, \quad \forall x, y \in X.$$


It is immediate that this is equivalent to requiring that
$R \cup_S R^{-1} = X \times X$.
Note that the previous definition corresponds, according to [11.19], to
strong S-completeness, while the concept of S-completeness only
requires the equality $S(R(x, y), R(y, x)) = 1$ for pairs of different
elements, $x \ne y$.

A direct consequence of the definition is that for any two t-conorms S
and S' such that $S \le S'$, S-completeness (respectively, strong
S-completeness) implies S'-completeness (respectively, strong
S'-completeness). It is easy to check that strong completeness can be
identified not only with the S-completeness defined by the maximum
t-conorm, but also with the S-completeness defined by any t-conorm
whose dual t-norm has no zero divisors. The S-completeness of a fuzzy
relation R is equivalent to the T-antisymmetry of its dual $R^d$
[11.19], where T is the dual t-norm of S by means of a strong
negation n.

The behavior of cut relations is the same as for the previous
properties. Thus, if R is a max-complete (resp. strongly max-complete)
fuzzy binary relation, then its associated cut relations $R_\alpha$ are
complete (resp. strongly complete) crisp binary relations, for any
$\alpha \in (0, 1]$.

As it happens in the crisp case, we can relate 
completeness and reflexivity. In this case, we obtain dif- 
ferent results depending on the chosen t-norm. 


Proposition 11.7 
Let R be a strongly S-complete fuzzy binary relation 
on X: 


1. R is reflexive if and only if the dual t-norm associated to S is
a positive t-norm.

2. R is $\varepsilon$-reflexive with $\varepsilon < 1$ if, and only if, S is
a nilpotent t-conorm and $\varepsilon$ belongs to the interval
$[\inf\{x \in [0, 1] \mid S(x, x) = 1\}, 1)$.


Table 11.1 Some properties of fuzzy binary relations

Property                   Definition
Reflexivity                $R(x, x) = 1, \ \forall x \in X$
Irreflexivity              $R(x, x) = 0, \ \forall x \in X$
Symmetry                   $R(x, y) = R(y, x), \ \forall x, y \in X$
T-antisymmetry             $T(R(x, y), R(y, x)) = 0, \ \forall x, y \in X, x \ne y$
T-asymmetry                $T(R(x, y), R(y, x)) = 0, \ \forall x, y \in X$
T-transitivity             $T(R(x, y), R(y, z)) \le R(x, z), \ \forall x, y, z \in X$
Negative S-transitivity    $R(x, z) \le S(R(x, y), R(y, z)), \ \forall x, y, z \in X$
T-S-semitransitivity       $T(R(x, y), R(y, z)) \le S(R(x, t), R(t, z)), \ \forall x, y, z, t \in X$
S-completeness             $S(R(x, y), R(y, x)) = 1, \ \forall x, y \in X$
Strong completeness        $\max(R(x, y), R(y, x)) = 1, \ \forall x, y \in X$
Weak completeness          $R(x, y) + R(y, x) \ge 1, \ \forall x, y \in X$
T-linearity                $n_T(R(x, y)) \le R(y, x), \ \forall x, y \in X$
Completeness also plays an important role in relating transitivity and
negative transitivity [11.9]. Thus, given a De Morgan triple (T, S, n),
with T a continuous t-norm, and a strongly S-complete fuzzy binary
relation R, the T-transitivity of R implies:


• Its negative S-transitivity
• The T-transitivity of $R^d$.


11.3.10 Linearity 


S-completeness is also closely related to the concept of T-linearity.
Given a t-norm T, a fuzzy relation R defined on X is called T-linear
[11.20] if

$$n_T(R(x, y)) \le R(y, x), \quad \forall x, y \in X,$$

where $n_T$ stands for the negator
$n_T(x) = \sup\{z \in [0, 1] \mid T(x, z) = 0\}$.

S-completeness is equivalent to T-linearity whenever T is nilpotent and
S is the dual t-conorm of T by using $n_T$, that is,
$S(x, y) = n_T(T(n_T(x), n_T(y)))$.

As the Łukasiewicz t-norm $T_L$ is in particular a nilpotent t-norm,
weak completeness, that is, $S_L$-completeness, is equivalent to
$T_L$-linearity.
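For the Łukasiewicz case the equivalence can be verified by hand. The following worked computation is a sketch that assumes the usual formula $T_L(x, z) = \max(0, x + z - 1)$, which is not restated in this section.

```latex
% Negator induced by the Lukasiewicz t-norm T_L(x,z) = max(0, x + z - 1):
\[
n_{T_L}(x) = \sup\{z \in [0,1] \mid \max(0, x + z - 1) = 0\}
           = \sup\{z \in [0,1] \mid z \le 1 - x\} = 1 - x .
\]
% Hence T_L-linearity reads
\[
1 - R(x,y) \le R(y,x)
\quad\Longleftrightarrow\quad
R(x,y) + R(y,x) \ge 1 ,
\]
% which is exactly weak completeness.
```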

We summarize the properties and definitions we 
have introduced in this section in Table 11.1. 


11.4 Particular Cases of Fuzzy Binary Relations 


In this section, we deal with some particular cases of fuzzy binary
relations that are very important in several fields and that
generalize classical concepts.


11.4.1 Similarity Relation 


The notion of similarity is essentially a generalization 
of the notion of equivalence. 

More concretely, a T-indistinguishability relation R is a fuzzy binary
relation which is reflexive, symmetric, and T-transitive. Sometimes it
is also called fuzzy equivalence relation or equality relation. R(x, y)
is interpreted as the degree of indistinguishability (or similarity)
between x and y.

In this definition, reflexivity expresses the fact that every object is
completely indistinguishable from itself. Symmetry says that the degree
to which x and y are indistinguishable is the same as the degree to
which y and x are indistinguishable. Transitivity, as it depends on
a t-norm, is a more flexible property. In particular, when we use the
product t-norm, we obtain the so-called probabilistic relations
introduced by Menger in [11.3]; if we choose the Łukasiewicz t-norm, we
obtain the relations called likeness introduced by Ruspini [11.21];
while for the minimum t-norm we obtain similarity relations [11.22].

When the transitivity is not required, the relation R 
is said to be a proximity relation. 
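The three defining properties can be tested together on a finite set. The sketch below uses the relation E(p, q) = 1 - |p - q| on a few points of [0, 1] as an illustrative example (it is not taken from the chapter); the function name `is_indistinguishability` is likewise an assumption.

```python
from fractions import Fraction as F
from itertools import product

def t_luk(a, b):
    """Lukasiewicz t-norm."""
    return max(F(0), a + b - 1)

def is_indistinguishability(R, T):
    """Reflexive, symmetric and T-transitive on a finite index set."""
    idx = range(len(R))
    reflexive = all(R[x][x] == 1 for x in idx)
    symmetric = all(R[x][y] == R[y][x] for x in idx for y in idx)
    transitive = all(T(R[x][y], R[y][z]) <= R[x][z]
                     for x, y, z in product(idx, repeat=3))
    return reflexive and symmetric and transitive

# Illustrative relation: E(p, q) = 1 - |p - q| on four points of [0, 1].
points = [F(0), F(1, 4), F(1, 2), F(1)]
E = [[1 - abs(p - q) for q in points] for p in points]

print(is_indistinguishability(E, t_luk))  # Lukasiewicz t-norm
print(is_indistinguishability(E, min))    # minimum t-norm
```

Here `E` passes the test for the Łukasiewicz t-norm but not for the minimum, so on these points it behaves as a likeness relation without being a similarity relation.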


11.4.2 Fuzzy Order 


Next, we make a quick overview on some different 
fuzzy ordering relations, focusing on the properties they 
shall satisfy. Consider a fuzzy binary relation R on the 
set X. R is called: 


• Partial T-preorder or T-quasiorder if R is reflexive and T-transitive;

• Total T-preorder or linear T-quasiorder if R is strongly complete and
  T-transitive;

• Partial T-order if R is antisymmetric and T-transitive;

• Strict partial T-order if R is asymmetric and T-transitive;

• Total T-order or linear T-order if R is a complete partial T-order,
  that is, R is antisymmetric, T-transitive and complete;

• Strict total T-order if R is a complete strict partial T-order, that
  is, R is asymmetric, T-transitive and complete.


As in the previous concepts, when T is the t-norm 
of the minimum, we call R simply total preorder, total 
order, etc. 

The previous definitions are summarized in Ta- 
ble 11.2. 
Table 11.2 Fuzzy binary relations by properties

                        Reflexivity  Symmetry  Antisymmetry  Asymmetry  Transitivity
Preorder                Yes                                             Yes
Order                   Yes          No        Yes                      Yes
Strict order            Yes          No        Yes           Yes        Yes
Proximity               Yes          Yes       No            No
T-indistinguishability  Yes          Yes       No            No         Yes

11.5 Present and Future of Fuzzy Relations 


In this chapter, we have tried to give the reader a first approach to
the concept of fuzzy relations. Of course, this is just a starting
point. Given the current development of the topics related to fuzzy
relations, a researcher interested in this notion should study more
specialized materials in detail.

Here we have presented the definition of classic fuzzy relations, which
take values in the interval [0, 1].
12. Fuzzy Implications: Past, Present, and Future


Michał Baczyński, Balasubramaniam Jayaram, Sebastià Massanet, Joan Torrens


Fuzzy implications are a generalization of the classical two-valued
implication to the multi-valued setting. They play a very important
role both in theory and in applications, as can be seen from their use
in, among others, multivalued mathematical logic, approximate
reasoning, fuzzy control, image processing, and data analysis. The goal
of this chapter is to present the evolution of fuzzy implications from
their beginnings to the present day. From the theoretical point of
view, we present the basic facts, as well as the main topics and lines
of research around fuzzy implications. We also devote a specific
section to a list of the main application fields where fuzzy
implications are employed, and another one to the main open problems on
the topic.


Fuzzy logic connectives play a fundamental role in the theory of fuzzy
sets and fuzzy logic. The basic fuzzy connectives that perform the role
of generalized And, Or, and Not are t-norms, t-conorms, and negations,
respectively, whereas fuzzy conditionals are usually managed through
fuzzy implications. Fuzzy implications play a very important role both
in theory and applications, as can be seen from their use in, among
others, multivalued mathematical logic, approximate reasoning, fuzzy
control, image processing, and data analysis. Thus, it is hardly
surprising that many researchers have devoted their efforts to the
study of implication functions. This interest has become more evident
in the last decade, when many works have appeared and have led to some
surveys [12.1,2] and even some research monographs entirely devoted to
this topic [12.3,4]. Most of the known results and applications of
fuzzy implications up to its publication date were collected in [12.3],
and very recently the edited volume [12.4] has been published,
complementing the earlier monograph with the most recent lines of
investigation on fuzzy implications.

In this regard, we have decided to devote this chapter, as the title
suggests, to presenting the evolution of fuzzy implications from their
beginnings to the present time. The idea is not to focus on a list of
results already collected in other works, but to unravel the relations
among them and to highlight the development and progress that fuzzy
implications have experienced over time. From the theoretical point of
view, we present the basic facts, as well as the main topics and lines
of research around fuzzy implications, recalling in most cases where
the corresponding results can be found instead of listing them. Of
course, we also devote a specific section to a list of the main
application fields where fuzzy implications are employed. A final
section looks ahead to the future by listing some of the main open
problems, the solutions of which are certain to enrich the existing
literature on the topic.
12.1 Fuzzy Implications: Examples, Properties, and Classes


Fuzzy implications are a generalization of the classical implication to
fuzzy logic. It is a well-established fact that fuzzy concepts have to
generalize the corresponding crisp ones, and consequently fuzzy
implications restricted to $\{0, 1\}^2$ must coincide with the
classical implication. Currently, the most accepted definition of
a fuzzy implication is the following one.


Definition 12.1 [12.3, Definition 1.1.1]
A function $I : [0, 1]^2 \to [0, 1]$ is called a fuzzy implication if
it satisfies the following conditions:

(I1) $I(x, z) \ge I(y, z)$ when $x \le y$, for all $z \in [0, 1]$
(I2) $I(x, y) \le I(x, z)$ when $y \le z$, for all $x \in [0, 1]$
(I3) $I(0, 0) = I(1, 1) = 1$ and $I(1, 0) = 0$.


This definition is flexible enough to allow uncountably many fuzzy
implications. This great repertoire allows a researcher to pick,
depending on the context, a fuzzy implication that satisfies some
desired additional properties. Many additional properties, all of them
arising from tautologies in classical logic, have been postulated in
many works. The most important of them are collected below:


• (NP): The left neutrality principle,

$$I(1, y) = y, \quad y \in [0, 1].$$

• (EP): The exchange principle,

$$I(x, I(y, z)) = I(y, I(x, z)), \quad x, y, z \in [0, 1].$$

• (OP): The ordering property,

$$x \le y \iff I(x, y) = 1, \quad x, y \in [0, 1].$$

• (IP): The identity principle,

$$I(x, x) = 1, \quad x \in [0, 1].$$

• (CP(N)): The contrapositive symmetry with respect to a fuzzy
  negation N,

$$I(x, y) = I(N(y), N(x)), \quad x, y \in [0, 1].$$

Given a fuzzy implication I, its natural negation is defined as
$N_I(x) = I(x, 0)$ for all $x \in [0, 1]$. This function is always
a fuzzy negation. For the definitions of basic fuzzy logic connectives
like fuzzy negations, t-norms and t-conorms please see [12.5].
Moreover, $N_I$ can be continuous, strict, or strong and these are also
additional properties usually required of a fuzzy implication I.
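Conditions (I1)-(I3) and the properties (NP), (EP), (IP) and (OP) can be tested numerically on a finite grid. The sketch below does this for the Łukasiewicz and Reichenbach implications with exact rational arithmetic. The function names and the grid are assumptions made for the example, and a grid test can refute a property but cannot prove it on the whole unit square.

```python
from fractions import Fraction as F
from itertools import product

GRID = [F(k, 10) for k in range(11)]  # 0, 1/10, ..., 1

def i_lk(x, y):
    """Lukasiewicz implication."""
    return min(F(1), 1 - x + y)

def i_rc(x, y):
    """Reichenbach implication."""
    return 1 - x + x * y

def is_fuzzy_implication(I):
    i1 = all(I(x, z) >= I(y, z)
             for x, y, z in product(GRID, repeat=3) if x <= y)
    i2 = all(I(x, y) <= I(x, z)
             for x, y, z in product(GRID, repeat=3) if y <= z)
    i3 = I(0, 0) == 1 and I(1, 1) == 1 and I(1, 0) == 0
    return i1 and i2 and i3

def satisfies_np(I):
    return all(I(1, y) == y for y in GRID)

def satisfies_ep(I):
    return all(I(x, I(y, z)) == I(y, I(x, z))
               for x, y, z in product(GRID, repeat=3))

def satisfies_ip(I):
    return all(I(x, x) == 1 for x in GRID)

def satisfies_op(I):
    return all((x <= y) == (I(x, y) == 1)
               for x, y in product(GRID, repeat=2))

def natural_negation(I, x):
    """N_I(x) = I(x, 0)."""
    return I(x, 0)

for name, I in (("LK", i_lk), ("RC", i_rc)):
    print(name, is_fuzzy_implication(I), satisfies_np(I),
          satisfies_ep(I), satisfies_ip(I), satisfies_op(I))
```

On this grid both functions pass (I1)-(I3), (NP) and (EP), while only the Łukasiewicz implication passes (IP) and (OP); both have the classical negation 1 - x as natural negation.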

Table 12.1 lists the most well-known fuzzy implications along with the
additional properties they satisfy [12.3, Chap. 1].


Table 12.1 Basic fuzzy implications and the additional properties they
satisfy, where $N_C$, $N_{D_1}$, and $N_{D_2}$ stand for the classical,
the least and the greatest fuzzy negations, respectively

Name            Formula
Łukasiewicz     $I_{LK}(x, y) = \min\{1, 1 - x + y\}$
Gödel           $I_{GD}(x, y) = 1$ if $x \le y$; $y$ if $x > y$
Reichenbach     $I_{RC}(x, y) = 1 - x + xy$
Kleene-Dienes   $I_{KD}(x, y) = \max\{1 - x, y\}$
Goguen          $I_{GG}(x, y) = 1$ if $x \le y$; $y/x$ if $x > y$
Rescher         $I_{RS}(x, y) = 1$ if $x \le y$; $0$ if $x > y$
Yager           $I_{YG}(x, y) = 1$ if $(x, y) = (0, 0)$; $y^x$ if $(x, y) \ne (0, 0)$
Weber           $I_{WB}(x, y) = 1$ if $x < 1$; $y$ if $x = 1$
Fodor           $I_{FD}(x, y) = 1$ if $x \le y$; $\max\{1 - x, y\}$ if $x > y$

Name            (NP)  (EP)  (IP)  (OP)  (CP(N))  $N_I$
Łukasiewicz     ✓     ✓     ✓     ✓     $N_C$    $N_C$
Gödel           ✓     ✓     ✓     ✓     ×        $N_{D_1}$
Reichenbach     ✓     ✓     ×     ×     $N_C$    $N_C$
Kleene-Dienes   ✓     ✓     ×     ×     $N_C$    $N_C$
Goguen          ✓     ✓     ✓     ✓     ×        $N_{D_1}$
Rescher         ×     ×     ✓     ✓     $N_C$    $N_{D_1}$
Yager           ✓     ✓     ×     ×     ×        $N_{D_1}$
Weber           ✓     ✓     ✓     ×     ×        $N_{D_2}$
Fodor           ✓     ✓     ✓     ✓     $N_C$    $N_C$


In addition, the following two implications

$$I_0(x, y) = \begin{cases} 1, & \text{if } x = 0 \text{ or } y = 1, \\ 0, & \text{if } x > 0 \text{ and } y < 1, \end{cases}$$

$$I_1(x, y) = \begin{cases} 1, & \text{if } x < 1 \text{ or } y > 0, \\ 0, & \text{if } x = 1 \text{ and } y = 0, \end{cases}$$

are the least and the greatest fuzzy implications, respectively, of the
family of all fuzzy implications.
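The basic implications above translate directly into code, which also makes it easy to confirm on a grid that each of them lies between the least and the greatest fuzzy implication. This is a minimal floating-point sketch; the function names, the grid and the tolerance are assumptions made for illustration.

```python
def i_lk(x, y): return min(1.0, 1.0 - x + y)                    # Lukasiewicz
def i_gd(x, y): return 1.0 if x <= y else y                     # Goedel
def i_rc(x, y): return 1.0 - x + x * y                          # Reichenbach
def i_kd(x, y): return max(1.0 - x, y)                          # Kleene-Dienes
def i_gg(x, y): return 1.0 if x <= y else y / x                 # Goguen
def i_rs(x, y): return 1.0 if x <= y else 0.0                   # Rescher
def i_yg(x, y): return 1.0 if x == 0 and y == 0 else y ** x     # Yager
def i_wb(x, y): return 1.0 if x < 1 else y                      # Weber
def i_fd(x, y): return 1.0 if x <= y else max(1.0 - x, y)       # Fodor

def i_least(x, y):
    """Least fuzzy implication."""
    return 1.0 if x == 0 or y == 1 else 0.0

def i_greatest(x, y):
    """Greatest fuzzy implication."""
    return 1.0 if x < 1 or y > 0 else 0.0

BASIC = {"LK": i_lk, "GD": i_gd, "RC": i_rc, "KD": i_kd, "GG": i_gg,
         "RS": i_rs, "YG": i_yg, "WB": i_wb, "FD": i_fd}
GRID = [k / 10 for k in range(11)]

def between_bounds(I, tol=1e-12):
    """Check i_least <= I <= i_greatest on the grid (up to rounding)."""
    return all(i_least(x, y) <= I(x, y) + tol and
               I(x, y) <= i_greatest(x, y) + tol
               for x in GRID for y in GRID)

for name, I in BASIC.items():
    print(name, between_bounds(I))
```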

Beyond these examples of fuzzy implications, sev- 
eral families of these operations have been proposed 
and deeply studied. There exist basically two strate- 
gies in order to define classes of fuzzy implications. 
The most usual strategy is based on some combina- 
tions of aggregation functions. In this way, t-norms and 
t-conorms [12.5] were the first classes of aggregation 
functions used to generate fuzzy implications. Thus, the 
following are the three most important classes of fuzzy 
implications of this type: 


1) (S,N)-implications defined as

$$I_{S,N}(x, y) = S(N(x), y), \quad x, y \in [0, 1],$$

where S is a t-conorm and N a fuzzy negation. They are the immediate
generalization of the classical Boolean material implication
$p \to q \equiv \neg p \lor q$. If N is involutive, they are called
strong or S-implications.

2) Residual or R-implications defined by

$$I_T(x, y) = \sup\{z \in [0, 1] \mid T(x, z) \le y\}, \quad x, y \in [0, 1],$$

where T is a t-norm. When they are obtained from left-continuous
t-norms, they come from residuated lattices based on the residuation
property

$$T(x, y) \le z \iff I_T(x, z) \ge y, \quad \text{for all } x, y, z \in [0, 1].$$

3) QL-operations defined by

$$I_{T,S,N}(x, y) = S(N(x), T(x, y)), \quad x, y \in [0, 1],$$

where S is a t-conorm, T is a t-norm and N is a fuzzy negation. They
originate in the logic of quantum mechanics.
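The three constructions can be written as higher-order functions. In the sketch below the supremum in the R-implication is restricted to a finite grid, which is only an approximation in general; it is exact here because the grid is closed under the operations involved. The function names, the grid and the particular connectives are assumptions chosen for the example.

```python
from fractions import Fraction as F

GRID = [F(k, 20) for k in range(21)]

def n_c(x):
    """Classical negation."""
    return 1 - x

def t_min(x, y):
    return min(x, y)

def t_luk(x, y):
    """Lukasiewicz t-norm."""
    return max(F(0), x + y - 1)

def s_prob(x, y):
    """Probabilistic sum t-conorm."""
    return x + y - x * y

def s_luk(x, y):
    """Lukasiewicz t-conorm."""
    return min(F(1), x + y)

def sn_implication(S, N):
    return lambda x, y: S(N(x), y)

def r_implication(T, grid=GRID):
    # Supremum taken over the grid only; z = 0 always qualifies.
    return lambda x, y: max(z for z in grid if T(x, z) <= y)

def ql_operation(S, N, T):
    return lambda x, y: S(N(x), T(x, y))

i_sn = sn_implication(s_prob, n_c)       # expected: Reichenbach
i_r = r_implication(t_luk)               # expected: Lukasiewicz
i_ql = ql_operation(s_luk, n_c, t_min)   # expected: Lukasiewicz

print(i_sn(F(1, 2), F(1, 2)), i_r(F(1, 2), F(1, 4)), i_ql(F(1, 2), F(1, 4)))
```

On the grid, the first construction reproduces the Reichenbach implication and the other two reproduce the Łukasiewicz implication, which shows that one implication can belong to several classes at once.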


Note that R- and (S,N)-implications are always 
implications in the sense of Definition 12.1, whereas 
QL-operations are not implications in general (they 
are called QL-implications when they actually are). 


A characterization of those QL-operations which are also implications
is still open (Sect. 12.4), but a common necessary condition is
$S(N(x), x) = 1$ for all $x \in [0, 1]$. Yet another class of fuzzy
implications is that of Dishkant or D-operations [12.6], which are the
contrapositions of QL-operations with respect to a strong fuzzy
negation.

These initial classes were successfully general- 
ized considering more general classes of aggregation 
functions, mainly uninorms, generating new classes 
of fuzzy implications with interesting properties. In 
this way, (U, N), RU-implications and QLU-operations 
have been deeply analyzed [12.3, Chap. 5], [12.6]. 

A second approach to obtaining fuzzy implications is based on the
direct use of unary monotonic functions. In this way, the most
important families are Yager's f- and g-generated fuzzy implications,
which can be seen as implications generated from additive generators of
continuous Archimedean t-norms and t-conorms, respectively
[12.3, Chap. 3]:


1) Yager's f-generated implications are defined as

$$I_f(x, y) = f^{-1}(x \cdot f(y)), \quad x, y \in [0, 1],$$

with the understanding $0 \cdot \infty = 0$, where
$f : [0, 1] \to [0, \infty]$ is a strictly decreasing and continuous
function with f(1) = 0.

2) Yager's g-generated implications are defined as

$$I_g(x, y) = g^{-1}\left(\min\left\{\frac{1}{x} \cdot g(y),\ g(1)\right\}\right), \quad x, y \in [0, 1],$$

with the understanding $\frac{1}{0} = \infty$ and
$\infty \cdot 0 = \infty$, where $g : [0, 1] \to [0, \infty]$ is
a strictly increasing and continuous function with g(0) = 0.
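Both constructions can be sketched in a few lines once the conventions for 0 and infinity are made explicit. The example below uses the generators f(y) = -ln y and g(y) = y, which are assumptions chosen because the results are recognizable: the first yields the Yager implication and the second the Goguen implication. The function names are hypothetical.

```python
import math

def f_generated(f, f_inv):
    """Yager's f-generated implication with the convention 0 * inf = 0."""
    def I(x, y):
        t = 0.0 if x == 0 else x * f(y)
        return f_inv(t)
    return I

def g_generated(g, g_inv, g_one):
    """Yager's g-generated implication; g_one stands for g(1).

    Conventions: 1/0 = inf and inf * 0 = inf, so x = 0 always gives inf.
    """
    def I(x, y):
        t = math.inf if x == 0 else g(y) / x
        return g_inv(min(t, g_one))
    return I

# Generator f(y) = -ln(y), with f(0) = inf; its inverse is exp(-t).
def f(y):
    return math.inf if y == 0 else -math.log(y)

def f_inv(t):
    return math.exp(-t)

i_f = f_generated(f, f_inv)                                  # y ** x
i_g = g_generated(lambda y: y, lambda t: t, g_one=1.0)       # min(1, y / x)

print(i_f(0.5, 0.25), i_g(0.5, 0.25))
```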


The above classes give rise to fuzzy implications with different
additional properties, which are collected in Table 12.2. All the
results referred to in Table 12.2 are from [12.3, Chaps. 2 and 3].


Table 12.2 Classes of fuzzy implications and the additional properties
they satisfy

Class / Properties   (NP)  (EP)         (IP)         (OP)         (CP(N))       $N_I$
(S,N)-imp.           ✓     ✓            Thm. 2.4.17  Thm. 2.4.19  Prop. 2.4.3   $N$
R-imp. with l-c. T   ✓     ✓            ✓            ✓            Prop. 2.5.28  $N_T$
QL-imp.              ✓     Thm. 2.6.19  Sect. 2.6.3  Sect. 2.6.4  Sect. 2.6.5   $N$
f-gen.               ✓     ✓            ×            ×            Thm. 3.1.7    Prop. 3.1.6
g-gen.               ✓     ✓            Thm. 3.2.8   Thm. 3.2.9   ×             $N_{D_1}$


One of the main topics in this field is the characterization of each of
these families of fuzzy implications through algebraic properties. This
is an essential step in order to understand the behavior of these
families. The available characterization results of the above families
of implications are collected below.


Theorem 12.1 [12.3, Theorem 2.4.10]
For a function $I : [0, 1]^2 \to [0, 1]$ the following statements are
equivalent:

i) I is an (S,N)-implication with a continuous (strict, strong) fuzzy
negation N.

ii) I satisfies (I1), (EP), and $N_I$ is a continuous (strict, strong)
fuzzy negation.

Moreover, in this case the representation $I(x, y) = S(N(x), y)$ is
unique with $N = N_I$ and $S(x, y) = I(\mathfrak{R}_N(x), y)$ (for the
definition of $\mathfrak{R}_N$ see [12.3, Lemma 1.4.10]).


Theorem 12.2 [12.3, Theorem 2.5.17]
For a function $I : [0, 1]^2 \to [0, 1]$ the following statements are
equivalent:

i) I is an R-implication generated from a left-continuous t-norm.

ii) I satisfies (I2), (EP), (OP) and it is right continuous with
respect to the second variable.

Moreover, the representation

$$I(x, y) = \max\{t \in [0, 1] \mid T(x, t) \le y\}$$

is unique with

$$T(x, y) = \min\{t \in [0, 1] \mid I(x, t) \ge y\}.$$


As already said, it is still an open question when QL-operations are
fuzzy implications. However, in the continuous case, when S and N are
the $\varphi$-conjugates of the Łukasiewicz t-conorm $S_{LK}$ and the
classical negation $N_C$, respectively, for some order automorphism
$\varphi$ on the unit interval, the QL-operation has the following
expression

$$I_{T,S,N}(x, y) = I_{\varphi,T}(x, y) = \varphi^{-1}\bigl(1 - \varphi(x) + \varphi(T(x, y))\bigr), \quad x, y \in [0, 1],$$

and we have the following characterization result.


Theorem 12.3 [12.3, Theorem 2.6.12]
For a QL-operation $I_{\varphi,T}$, where T is a t-norm and $\varphi$
is an automorphism on the unit interval, the following statements are
equivalent:

i) $I_{\varphi,T}$ is a QL-implication.

ii) $T_{\varphi^{-1}}$ satisfies the Lipschitz condition, i.e.,

$$|T_{\varphi^{-1}}(x_1, y_1) - T_{\varphi^{-1}}(x_2, y_2)| \le |x_1 - x_2| + |y_1 - y_2|, \quad x_1, x_2, y_1, y_2 \in [0, 1].$$

In addition, (U,N)-implications are characterized in
[12.3, Theorem 5.3.12] and, more recently, Yager's f- and g-generated
implications [12.7] and RU-implications [12.8] have also been
characterized. Finally, due to its importance in many results, we
recall the characterization of the family of the conjugates of the
Łukasiewicz implication.


Theorem 12.4 [12.3, Theorem 7.5.1]
For a function $I : [0, 1]^2 \to [0, 1]$ the following statements are
equivalent:

i) I is continuous and satisfies both (EP) and (OP).

ii) I is a $\varphi$-conjugate with the Łukasiewicz implication
$I_{LK}$, i.e., there exists an automorphism $\varphi$ on the unit
interval, which is uniquely determined, such that I has the form

$$I(x, y) = (I_{LK})_\varphi(x, y) = \varphi^{-1}\bigl(\min\{1 - \varphi(x) + \varphi(y), 1\}\bigr), \quad x, y \in [0, 1].$$
Fig. 12.1 Intersections between the main classes of fuzzy implications
(diagram not reproduced)
For the conjugates of the other basic implications in Table 12.1, see
the characterization results in [12.3, Sect. 7.5].

The great number of classes of fuzzy implications motivates the study
of the intersections between the different classes, which brings out
both the unity that exists among this diversity of classes and where
the basic implications from Table 12.1 are located. The intersections
among the main classes of fuzzy implications were studied in
[12.3, Chap. 4] and are graphically displayed in Fig. 12.1 (note that
$\mathcal{FI}$, $\mathbb{I}_{S,N}$, $\mathbb{I}_T$, $\mathbb{I}_{QL}$,
$\mathbb{I}_F$ and $\mathbb{I}_G$ denote the families of all fuzzy
implications, (S,N)-implications, R-implications, QL-implications,
Yager's f-generated implications and Yager's g-generated implications,
respectively). In this figure, we have included the fuzzy implications
of Table 12.1 and the following fuzzy implications, which are examples
of implications lying in intersections between some families


3 y 
Ia (x, y) = min f1, =| : 
XX 


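$$I_D(x, y) = \begin{cases} 1, & \text{if } x = 0, \\ y, & \text{if } x > 0, \end{cases}$$

$$I_{PC}(x, y) = 1 - \bigl(\max\{x(x + xy^2 - 2y), 0\}\bigr)^{1/2}.$$


Also note that it is still an open problem to prove whether
$(\mathbb{I}_{QL} \cap \mathbb{I}_T) \setminus \mathbb{I}_{S,N} = \emptyset$.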


12.2 Current Research on Fuzzy Implications 


In the previous sections, we have seen some func- 
tional equations, namely, the exchange property (EP), 
the contrapositive symmetry (CP(N)) and the like. In 
this section, we deal with a few functional equations (or 
inequalities) involving fuzzy implications. These equa- 
tions, once again, arise as the generalizations of the 
corresponding tautologies in classical logic involving 
boolean implications. 


12.2.1 Functional Equations and Properties 


A study of such equations stems from their applica- 
bility. The need for a plethora of fuzzy implications 
possessing various properties is quite obvious. On the 
one hand, they allow us to clearly classify and charac- 
terize different fuzzy implications, while on the other 
hand, they make themselves appealing to different ap- 
plications. Thus, the functional equations presented in 
this section are chosen to reflect this dichotomy. 


Distributivity over other Fuzzy Logic Operations 
The distributivity of fuzzy implications over different 
fuzzy logic connectives, like t-norms, t-conorms, and 
uninorms is reduced to four equations 


$$I(x, C_1(y, z)) = C_2(I(x, y), I(x, z)), \tag{12.1}$$
$$I(x, D_1(y, z)) = D_2(I(x, y), I(x, z)), \tag{12.2}$$
$$I(C(x, y), z) = D(I(x, z), I(y, z)), \tag{12.3}$$
$$I(D(x, y), z) = C(I(x, z), I(y, z)), \tag{12.4}$$

satisfied for all $x, y, z \in [0, 1]$, where I is some generalization
of classical implication, $C$, $C_1$, $C_2$ are some generalizations of
classical conjunction and $D$, $D_1$, $D_2$ are some generalizations of
classical disjunction.

All the above equations can be investigated in two different ways. On
the one hand, one can assume that the function I belongs to some known
class of fuzzy implications and investigate the connectives $C_i$,
$D_i$ that satisfy (12.1)-(12.4), as is done in, e.g., Trillas and
Alsina [12.9], Balasubramaniam and Rao [12.10], Ruiz-Aguilera and
Torrens [12.11, 12] and Massanet and Torrens [12.13]. On the other
hand, one can assume that the connectives $C_i$, $D_i$ come from known
classes of functions and investigate the fuzzy implications I that
satisfy (12.1)-(12.4). See the works of Baczyński [12.14, 15],
Baczyński and Jayaram [12.16], Baczyński and Qin [12.17, 18] for such
an approach.

The above distributive equations play an important 
role in reducing the complexity of fuzzy systems, since 
the number of rules directly affects the computational 
duration of the overall application (we will discuss this 
problem again in Sect. 12.3.2). 
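For a concrete feel of equations (12.1)-(12.4), they can be tested on a grid. The sketch below uses the Reichenbach implication with the minimum and maximum as conjunction and disjunction, for which all four equations hold by monotonicity, and the product as a conjunction for which (12.1) fails. The helper names and the grid are assumptions made for the example.

```python
from fractions import Fraction as F
from itertools import product

GRID = [F(k, 10) for k in range(11)]

def i_rc(x, y):
    """Reichenbach implication."""
    return 1 - x + x * y

def t_prod(x, y):
    return x * y

def distributes_right(I, A1, A2):
    """Shape of (12.1)/(12.2): I(x, A1(y, z)) = A2(I(x, y), I(x, z))."""
    return all(I(x, A1(y, z)) == A2(I(x, y), I(x, z))
               for x, y, z in product(GRID, repeat=3))

def distributes_left(I, A, B):
    """Shape of (12.3)/(12.4): I(A(x, y), z) = B(I(x, z), I(y, z))."""
    return all(I(A(x, y), z) == B(I(x, z), I(y, z))
               for x, y, z in product(GRID, repeat=3))

print(distributes_right(i_rc, min, min))        # (12.1) with C1 = C2 = min
print(distributes_right(i_rc, max, max))        # (12.2) with D1 = D2 = max
print(distributes_left(i_rc, min, max))         # (12.3) with C = min, D = max
print(distributes_left(i_rc, max, min))         # (12.4) with D = max, C = min
print(distributes_right(i_rc, t_prod, t_prod))  # (12.1) with C1 = C2 = product
```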


Law of Importation 
One of the desirable properties of a fuzzy implication is 
the law of importation as given below 


$$I(x, I(y, z)) = I(T(x, y), z), \quad x, y, z \in [0, 1], \tag{12.5}$$

where T is a t-norm (or, in general, some conjunction). It generalizes
the classical tautology $(p \land q) \to r \equiv p \to (q \to r)$ to
the fuzzy logic context. This equation has been investigated for many
different families of fuzzy implications (for results connected with
the main classes see [12.3, Sect. 7.3]). Fuzzy implications satisfying
(12.5) have been found extremely useful in fuzzy relational inference
mechanisms, since one can obtain an equivalent hierarchical scheme
which significantly decreases the computational complexity of the
system without compromising the approximation capability of the
inference scheme. For more on this, we refer the reader to
[12.19, 20]. A related question, the equivalence of (12.5) with (EP),
remained an open problem until the recent paper [12.21], where it is
proved that (12.5) is stronger than (EP), and equivalent to it when
$N_I$ is continuous.
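Whether a given pair (I, T) satisfies the law of importation can be explored on a grid. The sketch below pairs the Goguen implication with the product t-norm and the Gödel implication with the minimum, and also shows a mismatched pair that fails. The pairs and helper names are assumptions chosen for illustration, and exact rational arithmetic keeps the equality test meaningful.

```python
from fractions import Fraction as F
from itertools import product

GRID = [F(k, 10) for k in range(11)]

def i_gg(x, y):
    """Goguen implication."""
    return F(1) if x <= y else y / x

def i_gd(x, y):
    """Goedel implication."""
    return F(1) if x <= y else y

def t_prod(x, y):
    return x * y

def satisfies_importation(I, T):
    """Check I(x, I(y, z)) == I(T(x, y), z) on the grid."""
    return all(I(x, I(y, z)) == I(T(x, y), z)
               for x, y, z in product(GRID, repeat=3))

print(satisfies_importation(i_gg, t_prod))
print(satisfies_importation(i_gd, min))
print(satisfies_importation(i_gd, t_prod))  # fails, e.g. at x = y = 1/2, z = 3/10
```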


T-Conditionality or Modus Ponens 
Another property investigated in the scientific litera- 
ture, which is of great practical importance (see also 
Sect. 12.3.1), is the so-called T-conditionality, defined 
in the following way. If $I$ is a fuzzy implication and $T$ is
a t-norm, then $I$ is called an MP-fuzzy implication for $T$
if

$$T(x, I(x, y)) \le y, \quad x, y \in [0, 1]. \qquad (12.6)$$

Investigations of (12.6) have been done for the three
main families of fuzzy implications, namely, (S, N)-,
R-, and QL-implications [12.3, Sect. 7.4].
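
The inequality (12.6) can be probed numerically. The sketch below assumes the Łukasiewicz implication min(1, 1 - x + y) and tests it against the Łukasiewicz and the minimum t-norms on a rational grid; the names are illustrative and the check is a finite sanity test only.

```python
from fractions import Fraction
from itertools import product


def t_lukasiewicz(x, y):
    """Lukasiewicz t-norm: max(0, x + y - 1)."""
    return max(Fraction(0), x + y - 1)


def t_minimum(x, y):
    """Minimum t-norm."""
    return min(x, y)


def i_lukasiewicz(x, y):
    """Lukasiewicz implication: min(1, 1 - x + y)."""
    return min(Fraction(1), 1 - x + y)


def is_mp_implication(impl, tnorm, grid):
    """Check T(x, I(x, y)) <= y on all pairs of grid points."""
    return all(tnorm(x, impl(x, y)) <= y for x, y in product(grid, repeat=2))


GRID = [Fraction(k, 10) for k in range(11)]

print(is_mp_implication(i_lukasiewicz, t_lukasiewicz, GRID))
print(is_mp_implication(i_lukasiewicz, t_minimum, GRID))
```

With the Łukasiewicz t-norm the left-hand side reduces to min(x, y), so the inequality holds; with the minimum t-norm it fails, e.g., at x = 1/2, y = 1/5.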


Nonsaturating Fuzzy Implications 

Investigations connected with subsethood measures 
(see Sect. 12.3.3) and constructing strong equality func- 
tions by aggregation of implication functions by the 
formula W(x, y) = M (I(x, y), I(y, x)), where M is some 
symmetric function, have led researchers to consider 
under which properties a fuzzy implication $I$ satisfies
the following conditions:


(P1) $I(x, y) = 1$ if and only if $x = 0$ or $y = 1$;
(P2) $I(x, y) = 0$ if and only if $x = 1$ and $y = 0$.


In [12.22], the authors considered the possible re- 
lationships between these two properties and the prop- 
erties usually required of implication operations. More- 
over, they developed different construction methods of 
strong equality indexes using fuzzy implications that 
satisfy these two additional properties. 


Special Fuzzy Implications 
Special implications were introduced by Hájek and Ko- 
hout [12.23] in their investigations on some statistics on 
marginals. The authors further have shown that they are 
related to special GUHA-implicative quantifiers (see, 
for instance, [12.24–26]). Thus, special fuzzy implications
are related to data mining. In their quest to
obtain some many-valued connectives as extremal values
of some statistics on contingency tables with fixed
marginals, they focused especially on special homogeneous
implicational quantifiers and showed that:


Each special implicational quantifier determines 
a special implication. Conversely, each special 
implication is given by a special implicational 
quantifier. 


Definition 12.2 

A fuzzy implication $I$ is said to be special if, for any
$\varepsilon > 0$ and for all $x, y \in [0, 1]$ such that
$x + \varepsilon, y + \varepsilon \in [0, 1]$,
the following condition is satisfied

$$I(x, y) \le I(x + \varepsilon, y + \varepsilon). \qquad (12.7)$$
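
To make condition (12.7) concrete, the following sketch tests it on a rational grid for three standard implications that we assume here for illustration: Gödel (1 if x <= y, else y), Goguen (1 if x <= y, else y/x), and Reichenbach (1 - x + xy). Only shifts that are grid differences are tested, so a positive answer is merely indicative.

```python
from fractions import Fraction
from itertools import product


def godel(x, y):
    """Godel implication: 1 if x <= y, otherwise y."""
    return Fraction(1) if x <= y else y


def goguen(x, y):
    """Goguen implication: 1 if x <= y, otherwise y / x."""
    return Fraction(1) if x <= y else y / x


def reichenbach(x, y):
    """Reichenbach implication: 1 - x + x * y."""
    return 1 - x + x * y


def is_special_on_grid(impl, grid):
    """Check I(x, y) <= I(x + eps, y + eps) for all admissible grid shifts eps > 0."""
    return all(
        impl(x, y) <= impl(x + eps, y + eps)
        for x, y, eps in product(grid, repeat=3)
        if eps > 0 and x + eps <= 1 and y + eps <= 1
    )


GRID = [Fraction(k, 10) for k in range(11)]

for name, impl in [("Godel", godel), ("Goguen", goguen), ("Reichenbach", reichenbach)]:
    print(name, is_special_on_grid(impl, GRID))
```

The Reichenbach implication fails the test, for instance at x = y = 1/5 with a shift of 1/10, where the value drops from 21/25 to 79/100.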


Recently, Jayaram and Mesiar [12.27] have investi- 
gated the above functional equation. Their study shows 
that among the main classes of fuzzy implications, no
f-implication is a special implication, while the Goguen
implication $I_{GG}$ is the only special g-implication. Based
on the available results, they have conjectured that the 
(S, N)-implications that are special also turn out to be 
R-implications. However, in the case of R-implications 
(generated from any t-norm) they have obtained the fol- 
lowing result. 


Theorem 12.5 [12.27, Theorem 4.6] 

Let $T$ be any t-norm and $I_T$ be the R-implication
obtained from $T$. Then the following statements are
equivalent:


i) $I_T$ satisfies (12.7).

ii) $T$ satisfies the 1-Lipschitz condition.

iii) $T$ has an ordinal sum representation
$(\langle e_\alpha, d_\alpha, T_\alpha \rangle)_{\alpha \in A}$,
where each t-norm $T_\alpha$, $\alpha \in A$, is generated
by a convex additive generator (for the definition of
ordinal sum, see [12.5]).


Having shown that the families of (S,N)-, f-, and 
g-implications do not lead to any new special implica- 
tions, Jayaram and Mesiar [12.27] turned to the most 
natural question: Are there any other special implica- 
tions, than those that could be obtained as residuals 
of t-norms? This led them to propose some interest- 
ing constructions of fuzzy implications which were 
also special — one such construction is given in Defi- 
nition 12.4 in Sect. 12.2.2. 

12.2.2 New Classes and Generalizations


Another current research line on fuzzy implications is 
devoted to the study of new classes and generalizations 
of the already known families. The research in this di- 
rection has been extensively developed in recent years. 
Among many generalizations of already known classes 
of implications that have been dealt with in the litera- 
ture, we highlight the following ones. 


Generalizations of R-implications 
The family of residual implications is one of the most 
commonly selected families for generalization. As al- 
ready mentioned in Sect. 12.1, the RU-implications 
were the first generalization obtained via residuation 
from uninorms instead of from t-norms. In the same 
line, many other families of aggregation functions have 
been used to derive residual implications: 


1. Copulas, quasi-copulas, and semicopulas were used 
in [12.28]. The main results in this work relate to 
the axiomatic characterizations of those functions $I$
that are the residual implications of left-continuous 
commutative semicopulas, the residuals of quasi- 
copulas, and the residuals of associative copulas. 
For details on these characterizations, that involve 
up to ten different axioms, see [12.28]. 

2. Representable aggregation functions (RAFs) were
used in [12.29]. These are aggregation functions
constructed from additive generators of continuous
Archimedean t-conorms and strong negations. The
interest in the residual implications obtained from
them lies in the fact that they are always continuous
and in many cases they also satisfy the modus
ponens with a nilpotent t-conorm. In particular,
residual implications that depend only on a strong
negation $N$ are deduced from the general method
just by considering specific generators of continuous
Archimedean t-conorms.

3. A more general situation is studied in [12.30] where 
residual implications derived from binary functions 
$F: [0, 1]^2 \to [0, 1]$ are studied. In this case, the paper
deals with the minimal conditions that $F$ must
satisfy in order to obtain an implication by residu- 
ation. The same is done in order to obtain residual 
implications satisfying each one of the most usual 
properties. 

4. It is well known that residual implications de- 
rived from continuous Archimedean t-norms can 
be expressed directly from the additive genera- 
tor of the t-norm. A generalization of this idea is 


presented in [12.31], where strictly decreasing functions
$f: [0, 1] \to [0, +\infty]$ with $f(1) = 0$ are used to
derive implications as follows

$$I_f(x, y) = \begin{cases} 1, & \text{if } x \le y, \\ f^{(-1)}\big(f(y^+) - f(x)\big), & \text{if } x > y, \end{cases}$$

where $f(y^+) = \lim_{t \to y^+} f(t)$ and $f(1^+) = f(1)$.
Properties of these implications are studied and
many new examples are also derived in [12.31].
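
The well-known fact recalled at the start of item 4 can be sketched as follows: for a continuous Archimedean t-norm with additive generator f, the residual implication equals 1 for x <= y and f^{-1}(f(y) - f(x)) otherwise. The two generators below (f(x) = -ln x for the product t-norm, f(x) = 1 - x for the Łukasiewicz t-norm) are standard choices that we assume for illustration; the helper names are ours.

```python
import math


def residual_from_generator(f, f_inv):
    """Build I(x, y) = 1 if x <= y, else f_inv(f(y) - f(x))."""
    def impl(x, y):
        if x <= y:
            return 1.0
        return f_inv(f(y) - f(x))
    return impl


def f_product(x):
    """Additive generator of the product t-norm, with f(0) = +infinity."""
    return math.inf if x == 0 else -math.log(x)


def f_product_inv(t):
    """Inverse of the product generator."""
    return math.exp(-t)


def f_lukasiewicz(x):
    """Additive generator of the Lukasiewicz t-norm."""
    return 1.0 - x


def f_lukasiewicz_inv(t):
    """Pseudo-inverse of the Lukasiewicz generator."""
    return max(0.0, 1.0 - t)


i_from_product = residual_from_generator(f_product, f_product_inv)
i_from_lukasiewicz = residual_from_generator(f_lukasiewicz, f_lukasiewicz_inv)

print(i_from_product(0.8, 0.4))      # close to 0.4 / 0.8, the Goguen value
print(i_from_lukasiewicz(0.8, 0.4))  # close to 1 - 0.8 + 0.4, the Lukasiewicz value
```

The first construction reproduces the Goguen implication and the second the Łukasiewicz implication, up to floating-point rounding.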


Generalizations of (S, N)-Implications 
Once again a first generalization of this class of im- 
plications has been done using uninorms leading to 
the (U, N)-implications mentioned in Sect. 12.1, but 
recently many other aggregation functions were also 
employed. 

This is the case, for instance, in [12.32], where
the authors make use of TS-functions obtained from
a t-norm $T$, a t-conorm $S$, and a continuous, strictly
monotone function $f: [0, 1] \to [-\infty, +\infty]$ through the
expression

$$TS_{\lambda, f}(x, y) = f^{-1}\big((1 - \lambda) f(T(x, y)) + \lambda f(S(x, y))\big)$$

for $x, y \in [0, 1]$, where $\lambda \in (0, 1)$. Operators defined by
$I(x, y) = TS_{\lambda, f}(N(x), y)$ are studied in [12.32], giving
the conditions under which they are fuzzy implications.

Another approach is based on the use of dual repre- 
sentable aggregation functions G, that are simply the 
N-dual of RAFs, introduced earlier. In this case, the 
corresponding (G, N)-operator is always a fuzzy impli- 
cation and several examples and properties of this class 
can be found in [12.33]. See also [12.34] where it is 
proven that they satisfy (EP) (or (12.5)) if and only if G 
is in fact a nilpotent t-conorm. 


Generalizations of Yager's Implications 
In this case, the generalizations usually deal with the 
possibility of varying the generator used in the defini- 
tion of the implication. A first step in this line was taken 
in [12.35] by considering multiplicative generators of
t-conorms, but it was proven in [12.36] that this new class
is included in the family of all (S, N)-implications obtained
from t-conorms and continuous fuzzy negations.

Another approach was given in [12.37] introducing 
(f, g)-implications. In this case, the idea is to general- 
ize f-generated Yager’s implications by substituting the 
factor x by g(x) where g: [0, 1] — [0, 1] is an increasing 
function satisfying g(0) = 0 and g(1) = 1. 

In the same direction, a generalization of f- and
g-generated Yager’s implications based on aggregation 
operators is presented and studied in [12.38], where the 
implications are constructed by replacing the product 
t-norm in Yager’s implications by any aggregation func- 
tion. 

Finally, h-implications were introduced in [12.39] 
and are constructed from additive generators of repre- 
sentable uninorms as follows. 


Definition 12.3 ([12.39]) 

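Let $h: [0, 1] \to [-\infty, +\infty]$ be a strictly increasing and
continuous function with $h(0) = -\infty$, $h(e) = 0$ for an
$e \in (0, 1)$, and $h(1) = +\infty$. The function
$I^h: [0, 1]^2 \to [0, 1]$ defined by

$$I^h(x, y) = \begin{cases} 1, & \text{if } x = 0, \\ h^{-1}(x \cdot h(y)), & \text{if } x > 0 \text{ and } y \le e, \\ h^{-1}\left(\frac{1}{x} \cdot h(y)\right), & \text{if } x > 0 \text{ and } y > e, \end{cases}$$

is called an h-implication.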


This kind of implications maintains several properties 
of those satisfied by Yager’s implications, like (EP) and 
(12.5) with the product t-norm, but at the same time 
they satisfy other interesting ones. For more details on 
this kind of implications, as well as some generaliza- 
tions of them, see [12.39]. 
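
A minimal numerical sketch of an h-implication follows. It assumes the piecewise form h^{-1}(x h(y)) for y <= e and h^{-1}(h(y)/x) for y > e (with value 1 at x = 0), and picks the logit function h(y) = ln(y / (1 - y)), for which e = 1/2; this particular generator and all names are our own illustrative choices.

```python
import math

E = 0.5  # the point where the assumed generator h vanishes


def h(y):
    """Assumed generator: logit, with h(0) = -inf, h(1/2) = 0, h(1) = +inf."""
    if y == 0:
        return -math.inf
    if y == 1:
        return math.inf
    return math.log(y / (1 - y))


def h_inv(t):
    """Inverse of the logit (logistic function), written to avoid overflow."""
    if t >= 0:
        return 1 / (1 + math.exp(-t))
    z = math.exp(t)
    return z / (1 + z)


def h_implication(x, y):
    """h-implication generated by h, under the piecewise form assumed above."""
    if x == 0:
        return 1.0
    if y <= E:
        return h_inv(x * h(y))
    return h_inv(h(y) / x)


# Spot check of the law of importation (12.5) with the product t-norm.
x, y, z = 0.5, 0.9, 0.7
print(h_implication(x * y, z), h_implication(x, h_implication(y, z)))
```

The two printed values agree up to rounding, which is consistent with the statement above that h-implications satisfy (12.5) with the product t-norm.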


12.2.3 New Construction Methods 


In this section, we recall some construction meth- 
ods of fuzzy implications. The relevance of these 
methods is based on their capability of preserv- 
ing the additional properties satisfied by the ini- 
tial implication(s). First, note that some of them 
were already collected in [12.3, Chaps. 6 and 7], 
like: 


• The φ-conjugation of a fuzzy implication $I$

$$I_\varphi(x, y) = \varphi^{-1}\big(I(\varphi(x), \varphi(y))\big), \quad x, y \in [0, 1],$$

where $\varphi$ is an order automorphism on $[0, 1]$.
• The min and max operations from two given fuzzy
implications

$$(I \vee J)(x, y) = \max\{I(x, y), J(x, y)\}, \quad x, y \in [0, 1],$$
$$(I \wedge J)(x, y) = \min\{I(x, y), J(x, y)\}, \quad x, y \in [0, 1].$$

• The convex combinations of two fuzzy implications,
where $\lambda \in [0, 1]$

$$I^{\lambda}_{I,J}(x, y) = \lambda \cdot I(x, y) + (1 - \lambda) \cdot J(x, y), \quad x, y \in [0, 1].$$

• The N-reciprocation of a fuzzy implication $I$

$$I_N(x, y) = I(N(y), N(x)), \quad x, y \in [0, 1],$$

where $N$ is a fuzzy negation.
• The upper, lower, and medium contrapositivization
of a fuzzy implication $I$ defined, respectively, as

$$I_N^u(x, y) = \max\{I(x, y), I_N(x, y)\} = (I \vee I_N)(x, y),$$
$$I_N^l(x, y) = \min\{I(x, y), I_N(x, y)\} = (I \wedge I_N)(x, y),$$
$$I_N^m(x, y) = \min\{I(x, y) \vee N(x), I_N(x, y) \vee y\},$$

where $N$ is a fuzzy negation and $x, y \in [0, 1]$. Please
note that the lower (upper) contrapositivization is
based on applying the min (max) method to a fuzzy
implication $I$ and its N-reciprocal.
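
The N-reciprocal and the upper and lower contrapositivizations listed above can be sketched as higher-order functions. The example assumes the Goguen implication and the classical negation N(x) = 1 - x, and checks on a rational grid that the contrapositivized implications satisfy I(x, y) = I(N(y), N(x)) while the Goguen implication itself does not; names are illustrative.

```python
from fractions import Fraction
from itertools import product


def goguen(x, y):
    """Goguen implication: 1 if x <= y, otherwise y / x."""
    return Fraction(1) if x <= y else y / x


def n_classical(x):
    """Classical (strong) negation N(x) = 1 - x."""
    return 1 - x


def n_reciprocal(impl, neg):
    """N-reciprocal: (x, y) -> I(N(y), N(x))."""
    return lambda x, y: impl(neg(y), neg(x))


def upper_contrapositivization(impl, neg):
    """Pointwise maximum of I and its N-reciprocal."""
    rec = n_reciprocal(impl, neg)
    return lambda x, y: max(impl(x, y), rec(x, y))


def lower_contrapositivization(impl, neg):
    """Pointwise minimum of I and its N-reciprocal."""
    rec = n_reciprocal(impl, neg)
    return lambda x, y: min(impl(x, y), rec(x, y))


def has_contrapositive_symmetry(impl, neg, grid):
    """Check I(x, y) == I(N(y), N(x)) on all pairs of grid points."""
    return all(impl(x, y) == impl(neg(y), neg(x)) for x, y in product(grid, repeat=2))


GRID = [Fraction(k, 10) for k in range(11)]
upper = upper_contrapositivization(goguen, n_classical)
lower = lower_contrapositivization(goguen, n_classical)

print(has_contrapositive_symmetry(goguen, n_classical, GRID))
print(has_contrapositive_symmetry(upper, n_classical, GRID))
print(has_contrapositive_symmetry(lower, n_classical, GRID))
```

The symmetry of the upper and lower versions relies on N being involutive, which is why a strong negation is used in this sketch.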


It should be emphasized that the first major work to 
explore contrapositivization in detail, in its own right, 
was that of Fodor [12.40], where he discusses the con- 
trapositive symmetry of fuzzy implications for the three 
main families, namely, S-, R-, and QL-implications. 
In fact, during this study Fodor discovered the nilpotent
minimum t-norm $T_{nM}$, which is the first
left-continuous but noncontinuous t-norm known in
the literature. This study had a major impact on the 
development of left-continuous t-norms with strong 
natural negation, for instance, see the early works of 
Jenei [12.41, and references therein]. 

The above fact clearly illustrates how the study of
functional equations involving fuzzy implications has
also had interesting spin-offs and has greatly benefited
other areas and topics in fuzzy logic connectives.

Among the new construction methods proposed in 
the recent literature, we can roughly divide them into 
the following categories. 


Implications Generated from Negations 
The first method was introduced by Jayaram and 
Mesiar in [12.42], while they were studying special im- 
plications (see Definition 12.2). From this study, they 
introduced the neutral special implications with a given 
negation and they studied the main properties of this 
new class. 

Definition 12.4 [12.42]
Let $N$ be a fuzzy negation such that $N \le N_C$. Then the
function $I_{[N]}: [0, 1]^2 \to [0, 1]$ given by

$$I_{[N]}(x, y) = \begin{cases} 1, & \text{if } x \le y, \\ y + \dfrac{N(x - y)(1 - x)}{1 - x + y}, & \text{if } x > y, \end{cases}$$

with the understanding $\frac{0}{0} = 0$, is called the neutral special
implication generated from $N$.


The second method of generation of fuzzy implica- 
tions from fuzzy negations was introduced in [12.43]. 


Definition 12.5 [12.43] 
Let $N$ be a fuzzy negation. The function
$I^{[N]}: [0, 1]^2 \to [0, 1]$ is defined by

$$I^{[N]}(x, y) = \begin{cases} 1, & \text{if } x \le y, \\ \dfrac{(1 - N(x))\, y}{x} + N(x), & \text{if } x > y. \end{cases}$$


Again, several properties of these new implications
can be derived, especially when the following classes of
fuzzy negations are considered

$$N_A(x) = \begin{cases} 1, & \text{if } x \in A, \\ 0, & \text{if } x \notin A, \end{cases} \qquad N_{A,\beta}(x) = \begin{cases} 1, & \text{if } x \in A, \\ \dfrac{1 - x}{1 + \beta x}, & \text{if } x \notin A, \end{cases}$$

where $A = [0, \alpha)$ with $\alpha \in (0, 1)$ or $A = [0, \alpha]$ with
$\alpha \in [0, 1]$. Note that $N_{\{0\}} = N_{D_1}$ and $N_{\{0\},\beta}$ is the Sugeno
class of negations. Note also that $I^{[N]}$ can be expressed
as $I^{[N]}(x, y) = S_P(N(x), I_{GG}(x, y))$ for all $x, y \in [0, 1]$.
From this observation, replacing $S_P$ by any t-conorm
$S$ and $I_{GG}$ by any implication $I$, the function

$$I^{[N,S,I]}(x, y) = S(N(x), I(x, y)), \quad x, y \in [0, 1],$$

is always a fuzzy implication.
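
The construction S(N(x), I(x, y)) described above is easy to experiment with. The sketch below assumes the probabilistic sum S_P(a, b) = a + b - ab, the Goguen implication, and the classical negation N(x) = 1 - x as inputs, and compares the result with the closed form (1 - N(x)) y / x + N(x) for x > y; the function names are our own.

```python
from fractions import Fraction
from itertools import product


def s_probabilistic(a, b):
    """Probabilistic sum t-conorm S_P(a, b) = a + b - a * b."""
    return a + b - a * b


def goguen(x, y):
    """Goguen implication: 1 if x <= y, otherwise y / x."""
    return Fraction(1) if x <= y else y / x


def n_classical(x):
    """Classical negation N(x) = 1 - x."""
    return 1 - x


def implication_from_negation(neg, s=s_probabilistic, impl=goguen):
    """Build (x, y) -> S(N(x), I(x, y))."""
    return lambda x, y: s(neg(x), impl(x, y))


GRID = [Fraction(k, 10) for k in range(11)]
i_n = implication_from_negation(n_classical)

# Boundary values expected of any fuzzy implication.
print(i_n(Fraction(0), Fraction(0)), i_n(Fraction(1), Fraction(1)), i_n(Fraction(1), Fraction(0)))
```

With these inputs the value for x > y simplifies to 1 - x + y, since (1 - N(x)) y / x = y when N(x) = 1 - x.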


Implications Constructed from Two Given Implications
In this section, we present methods that generate a fuzzy 
implication from two given ones. 

The first method is based on an adequate scaling of 
the second variable of the two initial implications and it 
is called the threshold generation method [12.44]. 


Definition 12.6 [12.44] 
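Let $I_1$ and $I_2$ be two fuzzy implications and $e \in (0, 1)$.
The function $I_{I_1 - I_2}: [0, 1]^2 \to [0, 1]$ defined by

$$I_{I_1 - I_2}(x, y) = \begin{cases} 1, & \text{if } x = 0, \\ e \cdot I_1\left(x, \dfrac{y}{e}\right), & \text{if } x > 0 \text{ and } y \le e, \\ e + (1 - e) \cdot I_2\left(x, \dfrac{y - e}{1 - e}\right), & \text{if } x > 0 \text{ and } y > e, \end{cases}$$

is called the e-threshold generated implication from $I_1$
and $I_2$.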


This method allows for a certain degree of con- 
trol over the rate of increase in the second variable of 
the generated implication. Furthermore, the importance 
of this method derives from the fact that it allows us 
to characterize h-implications as the threshold gener- 
ated implications of an f-generated and a g-generated 
implication [12.13, Theorem 2 and Remark 30]. Fur- 
ther, in contrast to many other generation methods of 
fuzzy implications from two given ones, it preserves 
(EP) and (12.5) if the initial implications possess them. 
Moreover, for an e € (0,1), the e-threshold generated 
implications can be characterized as those implications 
that satisfy I(x, e) = e for all x > 0. 
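
The horizontal threshold method can be sketched in a few lines. The code assumes the usual scaling (the lower region uses e · I_1(x, y/e), the upper region uses e + (1 - e) · I_2(x, (y - e)/(1 - e)), and the value is 1 at x = 0) and, purely for illustration, takes the Goguen and Łukasiewicz implications as the two inputs with e = 2/5.

```python
from fractions import Fraction
from itertools import product


def goguen(x, y):
    """Goguen implication: 1 if x <= y, otherwise y / x."""
    return Fraction(1) if x <= y else y / x


def lukasiewicz(x, y):
    """Lukasiewicz implication: min(1, 1 - x + y)."""
    return min(Fraction(1), 1 - x + y)


def threshold_generated(i1, i2, e):
    """e-threshold generated implication from i1 (below e) and i2 (above e)."""
    def impl(x, y):
        if x == 0:
            return Fraction(1)
        if y <= e:
            return e * i1(x, y / e)
        return e + (1 - e) * i2(x, (y - e) / (1 - e))
    return impl


E = Fraction(2, 5)
GRID = [Fraction(k, 10) for k in range(11)]
i_thr = threshold_generated(goguen, lukasiewicz, E)

# The characteristic property I(x, e) = e for all x > 0.
print(all(i_thr(x, E) == E for x in GRID if x > 0))
```

The check prints True: at y = e the lower branch evaluates I_1(x, 1) = 1, so the value is exactly e.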

The threshold generation method given above is 
based on splitting the domain of the implication with 
a horizontal line and then scaling the two initial impli- 
cations in order to be well defined in those two regions. 
An alternate but analogous method can be proposed 
by using a vertical line instead of a horizontal line. 
This is the idea behind the vertical threshold generation 
method of fuzzy implications. This method does not 
preserve as many properties as the horizontal threshold 
method, but some results can still be proven. In particular,
the resulting implications are characterized as those fuzzy
implications such that $I(e, y) = e$ for all $y < 1$ [12.45].

The following two construction methods were presented
in [12.46]. Given two implications $I$, $J$, the
following operations are introduced

$$(I \triangledown J)(x, y) = I(J(y, x), J(x, y)),$$
$$(I \circledast J)(x, y) = I(x, J(x, y)),$$

for all $x, y \in [0, 1]$. The properties of these new operations
as well as the structure of the set of all implications
FI equipped with each one of these operations is
studied in [12.46].

Other Construction Methods
In addition to the above methods, we would like to 
recall the following interesting method based on condi- 
tional probability and conditional distribution functions 
presented by Grzegorzewski in [12.47]. 


Definition 12.7 [12.47, 48] 
The function $I_C: [0, 1]^2 \to [0, 1]$ given by

$$I_C(x, y) = \begin{cases} 1, & \text{if } x = 0, \\ \dfrac{C(x, y)}{x}, & \text{if } x > 0, \end{cases}$$

where $C$ is a copula, is called a probabilistic implication
based on copula $C$.


Conditions on copula $C$ ensuring that the corresponding
$I_C$ is an implication, as well as properties of
these implications, are detailed in [12.48]. The main interest
in this kind of implications lies in the fact that
they are a powerful link between probability theory and 
fuzzy implications theory that can be useful in approxi- 
mate reasoning. Moreover, results on these probabilistic 
implications can also be useful for examining and inter- 
preting the behavior of some stochastic events. Some 
early results in this direction have appeared in [12.49, 
50], where some generalizations of the previous idea 
are considered. In particular in [12.51], survival impli- 
cations based on the probability that a given object will 
survive a fixed time into a population are studied. In this 
case, the survival implications are defined by 


$$I_C^*(x, y) = \begin{cases} 1, & \text{if } x = 0, \\ \dfrac{x + y - 1 + C(1 - x, 1 - y)}{x}, & \text{if } x > 0, \end{cases}$$


where C is again a copula. 
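
Both copula-based constructions can be explored with three standard copulas, which we assume here for illustration: the product copula xy, the minimum copula min(x, y), and the Łukasiewicz copula max(x + y - 1, 0). The sketch also tests monotonicity in the first argument on a grid, which shows that not every copula yields an implication; all names are illustrative.

```python
from fractions import Fraction
from itertools import product


def copula_product(x, y):
    return x * y


def copula_min(x, y):
    return min(x, y)


def copula_lukasiewicz(x, y):
    return max(Fraction(0), x + y - 1)


def probabilistic_implication(c):
    """(x, y) -> 1 if x == 0, else C(x, y) / x."""
    return lambda x, y: Fraction(1) if x == 0 else c(x, y) / x


def survival_implication(c):
    """(x, y) -> 1 if x == 0, else (x + y - 1 + C(1 - x, 1 - y)) / x."""
    return lambda x, y: Fraction(1) if x == 0 else (x + y - 1 + c(1 - x, 1 - y)) / x


def decreasing_in_first(impl, grid):
    """Check that x1 <= x2 implies I(x1, y) >= I(x2, y) on the grid."""
    return all(
        impl(x1, y) >= impl(x2, y)
        for x1, x2, y in product(grid, repeat=3)
        if x1 <= x2
    )


GRID = [Fraction(k, 10) for k in range(11)]

for name, c in [("product", copula_product), ("min", copula_min),
                ("Lukasiewicz", copula_lukasiewicz)]:
    print(name, decreasing_in_first(probabilistic_implication(c), GRID))
```

With the minimum copula both constructions reduce to the Goguen implication, whereas the Łukasiewicz copula violates monotonicity in the first variable and therefore does not generate a fuzzy implication.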

Finally, we only briefly mention that there exist 
other construction methods. For instance, Massanet 
and Torrens [12.13,44,45] have proposed methods of 
constructing implications derived from a given impli- 
cation I and a fuzzy negation N as part of their study 
on some properties of horizontal and vertical threshold 
generated implications. 


12.2.4 Fuzzy Implications in Nonclassical 
Settings 


When we deal with uncertainty through fuzzy sets and 
fuzzy logic the natural framework is the unit inter- 
val [0, 1] and hence the logical connectives to be used 


are interpreted as operators on this interval. However, 
there are many different tools that have been proposed 
for managing uncertainty. In this context, some ex- 
tensions of fuzzy logic and fuzzy sets have also been 
developed. One can list at least the following exten- 
sions: interval-valued fuzzy sets, Atanassov intuitionis- 
tic fuzzy sets (that are equivalent to the interval-valued 
approach, [12.52]), interval-valued intuitionistic fuzzy 
sets, type-2 fuzzy sets, fuzzy multisets, n-dimensional 
fuzzy sets, and hesitant fuzzy sets. 

For all these extensions, the usual logical connec- 
tives like fuzzy conjunctions and fuzzy disjunctions 
need to be studied to develop a comprehensive theory, 
and especially fuzzy implications in order to make in- 
ferences in each one of these extensions. Due to space 
constraints, we only recall some aspects of interval- 
valued (or intuitionistic) fuzzy implications and the 
references where they can be found. 


Interval-Valued Approach 

A good compilation of the known results related to 
fuzzy implications (and other operations) in the interval- 
valued framework, can be found in [12.53] or [12.54] 
wherein, interval-valued or intuitionistic (S, N)- and R- 
implications are developed and some of their properties 
are presented. Works that deal with the construction of 
these classes of interval-valued implications can also be 
found in the literature. For instance, in [12.55] a con- 
struction method for the residual implication associated 
with a representable t-norm (constructed from two standard
t-norms $T_1$ and $T_2$ with $T_1 \le T_2$) is presented. Similarly,
(S, N)- and R-implications generated from:


i) Aggregation functions and a standard fuzzy nega- 
tion are presented in [12.56]. 

ii) Some classes of interval-valued aggregation functions
based on t-norms and t-conorms are dealt with
in [12.57].

iii) The so-called $K_\alpha$-operators have been proposed in
[12.58].


Discrete Approach 
Note that all the above mentioned tools are mainly used 
in the management of imprecise quantitative informa- 
tion. However, experts deal with many problems where 
qualitative information is usually expressed through 
linguistic terms. Qualitative information is often inter- 
preted to take values in a totally ordered finite scale like 


$$\mathcal{L} = \{\text{Extremely Bad, Very Bad, Bad, Fair, Good, Very Good, Extremely Good}\}. \qquad (12.8)$$

In these cases, the representative finite chain
$L_n = \{0, 1, \ldots, n\}$ is usually considered to model these linguistic
hedges and several researchers have developed
an extensive study of operations on $L_n$, usually
called discrete operations. This approach allows
avoiding numerical interpretations and, consequently,
the fuzzification and defuzzification steps become unnecessary.
In this framework, the smoothness condition
is usually considered as the discrete counterpart
of continuity. In fact, in the discrete framework
this property is equivalent to the divisibility property
as well as to the Lipschitz condition. In this
way, smooth discrete t-norms and t-conorms were
studied and characterized in [12.59] and also discrete
fuzzy implications derived from them have been
introduced.

As in the case of [0, 1], the four most usual ways
to construct discrete implications from t-norms and
t-conorms on $L_n$ are (S, N)-, R-, QL-, and D-implications.
The first two classes derived from smooth t-norms
and t-conorms and the only strong negation on $L_n$
(given by $N_0(x) = n - x$) were studied in [12.60]. In
the smooth case, it is proven that the intersection between
(S, N)- and R-implications contains only the
Łukasiewicz implication [12.60, Proposition 10]. Further,
the nonsmooth case has also been investigated,
showing a parameterized family of nonsmooth t-norms
$T$ for which the corresponding R-implication coincides
with the (S, N)-implication derived from the $N_0$-dual of
$T$. The case of discrete QL- and D-operators is studied
in [12.61], where characterization results on when such 
operators are in fact implications are given and, more- 
over, it is proven that both these classes coincide in the 
smooth case. 
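
On a finite chain the residual of a t-norm can be computed by brute force, which makes the discrete setting convenient for experiments. The sketch below assumes the chain {0, ..., 6}, the discrete Łukasiewicz operations T(x, y) = max(0, x + y - n) and S(x, y) = min(n, x + y), and the negation x -> n - x; it compares the R-implication with the (S, N)-implication built from the dual t-conorm. All names are illustrative.

```python
N_TOP = 6                      # chain L_n = {0, 1, ..., n} with n = 6
CHAIN = range(N_TOP + 1)


def t_lukasiewicz(x, y):
    """Discrete Lukasiewicz t-norm."""
    return max(0, x + y - N_TOP)


def s_lukasiewicz(x, y):
    """Discrete Lukasiewicz t-conorm (dual of the t-norm above)."""
    return min(N_TOP, x + y)


def t_minimum(x, y):
    return min(x, y)


def s_maximum(x, y):
    return max(x, y)


def negation(x):
    """The strong negation x -> n - x on the chain."""
    return N_TOP - x


def residual(tnorm):
    """R-implication: I(x, y) = max{z in L_n : T(x, z) <= y}."""
    return lambda x, y: max(z for z in CHAIN if tnorm(x, z) <= y)


def sn_implication(s, neg):
    """(S, N)-implication: I(x, y) = S(N(x), y)."""
    return lambda x, y: s(neg(x), y)


def coincide(i1, i2):
    return all(i1(x, y) == i2(x, y) for x in CHAIN for y in CHAIN)


print(coincide(residual(t_lukasiewicz), sn_implication(s_lukasiewicz, negation)))
print(coincide(residual(t_minimum), sn_implication(s_maximum, negation)))
```

For the Łukasiewicz pair both constructions give min(n, n - x + y), while for the minimum t-norm the residual and the (S, N)-implication from its dual differ.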


However, the modeling of linguistic information is 
limited because the information provided by experts for 
each variable must be expressed by a simple linguistic 
term. In most cases, this is a problem for experts be- 
cause their opinion does not agree with a concrete term. 
On the contrary, experts’ values are usually expressions 
like better than Good, between Fair and Very Good, or 
other even more complex expressions. 

To avoid the limitation above, an approach has re- 
cently appeared trying to increase the flexibility of 
the elicitation of linguistic information. This approach 
deals with the possibility of extending monotonic operations
on $L_n$ to operations on the set of discrete
fuzzy numbers whose support is a subinterval of $L_n$,
usually denoted by $\mathcal{A}_1^{L_n}$. The idea lies in the fact that
any discrete fuzzy number $A \in \mathcal{A}_1^{L_6}$ can be considered
(identifying the scale $\mathcal{L}$ given in (12.8) with the
chain $L_6$) as an assignment of a [0, 1]-value to each
term in our linguistic scale. As an example, the above
mentioned expression between Fair and Very Good can
be represented, for instance, by a discrete fuzzy number
$A \in \mathcal{A}_1^{L_6}$, with support given by the subinterval


[Fair, Very Good] = {Fair, Good, Very Good} , 


(that corresponds to the subinterval [3, 5] in $L_6$). The
values of $A$ in its support should be described by
experts, allowing in this way a complete flexibility
of the qualitative valuation. Usual operations like t-norms,
t-conorms, strong negations, aggregation functions,
and also fuzzy implications have been introduced
in this framework. The case of (S,N)-, QL- and D- 
implications can be found in [12.62, 63] and the case 
of R-implications in [12.64]. 


12.3 Fuzzy Implications in Applications 


So far, we have discussed the theoretical aspects of 
fuzzy implications, namely, analytical and algebraic. In 
this section, we discuss their applicational value which 
shows a wide spectrum of areas wherein they are em- 
ployed and how the gamut of properties that a fuzzy 
implication possesses plays an important role in its em- 
ployability. 


12.3.1 $FL_n$ – Fuzzy Logic in the Narrow Sense


Boolean implications are employed in inference 
schemas like modus ponens, modus tollens, etc., where 


the reasoning is done with statements or propositions 
whose truth-values are two valued. Fuzzy implica- 
tions play a similar role in the generalizations of the 
above inference schemas, where reasoning is done with 
fuzzy statements whose truth-value lies in [0, 1] instead 
of {0, 1}. 


Fuzzy Propositions 
An expression of the form $x$ is $A$, where $A$ is a fuzzy
set on an appropriate domain $U$, with reference to the
context, is termed a Fuzzy Statement or a Fuzzy
Proposition. (The above two interpretations bear a close
resemblance to the Adjunctive and Connective interpretations
as given in [12.65, p. 331], though they are
originally given for a binary operator. For other views 
and interpretation of the above statement, see, for in- 
stance, Bezdek et al., [12.66].) 

Let it be given that x is A and also that x assumes 
the precise value, let us say, x = u, where u € U, the 
domain of A. Then the truth value of the above fuzzy 
statement is obtained as follows 


t(x is A) = A(u) , 


i.e., the truth value of the above fuzzy statement, given
that $x$ is precisely known, is equal to the degree to
which $u$ (the value $x$ assumes) is itself compatible
with the fuzzy set $A$. Thus, the greater the membership
degree of $u$ in the concept $A$, the higher the truth value of
the fuzzy statement.

Consider the statement John is Tall and that x — the 
height of John — is precisely given to be 5'10” € U. 
Now, A(5’10”) gives the membership degree of 5'10” 
in the concept A = Tall, which can be interpreted as 
how much John belongs to the set of all Tall men, or 
equivalently, how much John is Tall is true, which is 
nothing but the truth-value t(John is Tall). 


Fuzzy Conditionals or Fuzzy IF-THEN Rules 
A fuzzy statement of the type discussed above, $\tilde{x}$ is $A$,
can be interpreted in yet another way, namely, as a linguistic
statement, i.e., as an assignment of a fuzzy set
to a variable.

Let $A: U \to [0, 1]$ be a fuzzy set on a suitable domain
$U$. Then $A$ can be taken to represent a concept.
A linguistic variable of $U$ is a symbol $\tilde{x}$ that can assume
or be assigned any fuzzy subset of $U$. Then a linguistic
statement $\tilde{x}$ is $A$ is interpreted as the linguistic variable
$\tilde{x}$ taking the linguistic value $A$.

For example, let $U$ denote the set of all values in
degrees centigrade. If the linguistic variable $\tilde{x}$ denotes
Temperature, then it can assume the following linguistic
values $A$, namely, high, more or less high, medium,
cool, very cold, etc. Each of the linguistic values (say
$A$ = cool) is represented by a fuzzy set on the domain
$U$ of the linguistic variable $\tilde{x}$, i.e., $A: U \to [0, 1]$.

The shape of the graph of the function represents 
the concept (say high temperature). The concept of high 
temperature is itself again context dependent. For ex- 
ample, high temperature (fever) for a human being is 
different from the high temperature in a blast furnace, 
and accordingly the domain of the linguistic variable is 
selected. 


A fuzzy IF-THEN rule is of the form

$$\text{IF } \tilde{x} \text{ is } A \text{ THEN } \tilde{y} \text{ is } B, \qquad (12.9)$$

where $A$, $B$ are linguistic expressions/values assumed
by the linguistic variables $\tilde{x}$, $\tilde{y}$. For example,


IF $\tilde{x}$ (temperature) is $A$ (high)
THEN $\tilde{y}$ (pressure) is $B$ (low).


Generalized Modus Ponens 
Let $\alpha$, $\beta$ be two fuzzy propositions as given above and
let $\alpha \to \beta$ be the fuzzy conditional, which is a fuzzy
IF-THEN rule as above. In classical logic, one uses
rules of deduction, like modus ponens and modus tollens,
to deduce new knowledge from a given set of
propositions. For instance, modus ponens states that
$\alpha \wedge (\alpha \to \beta) \vdash \beta$.

In fuzzy logic, since we deal with fuzzy propositions
whose truth values vary over the entire [0, 1]
interval, we employ fuzzy logic operations. Typically, $\wedge$
is interpreted as a t-norm $T$, and for $\to$ a fuzzy implication
$I$ is used.

Unlike with classical propositions, when we deal
with fuzzy propositions it is not always given that from
$\alpha \wedge (\alpha \to \beta)$ one obtains $\beta$. This type of deduction
is known as generalized modus ponens (GMP), and the
study of pairs of operators $(\wedge, \to)$, or alternately, a
t-norm and a fuzzy implication $(T, I)$, that can be employed
in GMP becomes important. It can be shown that this
property translates to studying pairs $(T, I)$ that satisfy
the functional equation $T(x, I(x, y)) \le y$ for $x, y \in [0, 1]$,
which is nothing but T-conditionality as dealt with in 
Sect. 12.2.1. 


Proof by Contradiction 
In classical logic, one often proves a statement
of the form $\alpha \to \beta$ by proving its contrapositive, i.e.,
$\neg\beta \to \neg\alpha$. However, in the setting of fuzzy logic, often
the negation $\neg$ used is noninvolutive, i.e.,
$\neg\neg\alpha \ne \alpha$.

For instance, when the underlying fuzzy logic
operations come from the Gödel residuated lattice
$([0, 1], T_M, I_{GD}, \wedge, \vee)$, the natural negation of the fuzzy
implication $I_{GD}$ is not involutive and $I_{GD}$ is not contrapositive
w.r.t. any fuzzy negation. This led to the study
of contrapositivization of fuzzy implications which was 
begun by Fodor [12.40] and is dealt with in Sect. 12.2.3 
above. 
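
The noninvolutive behaviour mentioned for the Gödel case can be seen directly. The sketch assumes the Gödel implication (1 if x <= y, else y) and computes its natural negation N(x) = I(x, 0); names are illustrative.

```python
from fractions import Fraction


def godel(x, y):
    """Godel implication: 1 if x <= y, otherwise y."""
    return Fraction(1) if x <= y else y


def natural_negation(impl):
    """Natural negation of an implication: N(x) = I(x, 0)."""
    return lambda x: impl(x, Fraction(0))


n_gd = natural_negation(godel)
x = Fraction(3, 10)

print(n_gd(x))        # 0 for every x > 0
print(n_gd(n_gd(x)))  # 1, so N(N(x)) differs from x
```

Since the natural negation sends every positive value to 0 and 0 to 1, applying it twice returns 1 for every x > 0 instead of x.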


12.3.2 Approximate Reasoning


One of the best known application areas of fuzzy logic 
is approximate reasoning (AR), wherein from impre- 
cise inputs and fuzzy premises or rules we obtain, often, 
imprecise conclusions [12.67]. AR with fuzzy sets encompasses
a wide variety of inference schemes and
has been readily embraced in many fields, especially
decision making, expert systems, and
control. Fuzzy implications play a vital role in many of 
these inference mechanisms, a brief discussion of which 
is presented below. 


Inference Mechanisms in AR 
Let us be given a set of $n$ fuzzy IF-THEN rules of the
form

$$\text{IF } \tilde{x} \text{ is } A_i \text{ THEN } \tilde{y} \text{ is } B_i, \quad i = 1, 2, \ldots, n, \qquad (12.10)$$

where $A_i$, $B_i$ are fuzzy sets on the input and output domains.
Now, given a fuzzy input, i.e., a fuzzy proposition or
a statement of the form $\tilde{x}$ is $A'$, the role of an inference
mechanism is to obtain a fuzzy output $B'$ that satisfies
some desirable properties [12.68, 69].

Note that, if we denote the fuzzy rules as
$A_i \to B_i$, $i = 1, 2, \ldots, n$, as is typically done, then these
are exactly the fuzzy conditionals discussed above in
Sect. 12.3.1. Further, if we denote the input as $A'$, then
an inference mechanism implements the generalized
modus ponens by composing the fuzzy input $A'$ with
all the rules $A_i \to B_i$ to obtain the fuzzy output $B'$.

There are two established ways to accomplish the 
above, namely, fuzzy relational inference (FRI) and sim- 
ilarity based reasoning (SBR). Fuzzy implications play 
a major role in both the types of inference mechanisms 
as detailed below. 


Fuzzy Relational Inference (FRI) 
In a fuzzy relational inference, all the rules $A_i \to B_i$
are combined into a single fuzzy relation R and the out- 
put B’ is obtained as an image of the input A’ composed 
with R. 

A fuzzy IF-THEN rule base of the form (12.10) is
modeled as a fuzzy relation $R: X \times Y \to [0, 1]$ as
follows

$$R(x, y) = \bigwedge_{i=1}^{n} \big(A_i(x) \to B_i(y)\big) = \bigwedge_{i=1}^{n} I(A_i(x), B_i(y)), \qquad (12.11)$$

which reflects the conditional nature of the rules and
where $I$ is usually a fuzzy implication. Then given a fact
$\tilde{x}$ is $A'$, the inferred output $B'$ is obtained either as:


i) A sup-T composition, as in the compositional rule of
inference (CRI) of Zadeh [12.70], or

ii) An inf-I composition, as in the Bandler-Kohout
subproduct (BKS) [12.71],


of $A'(x)$ and $R(x, y)$, i.e.,

$$B'(y) = A'(x) \circ R(x, y) = \sup_{x \in X} T(A'(x), R(x, y)), \qquad (12.12)$$

$$B'(y) = A'(x) \triangleleft R(x, y) = \inf_{x \in X} I(A'(x), R(x, y)), \qquad (12.13)$$

where $T$ can be any t-norm and $I$ is any fuzzy implication.

It is clear from (12.12) and (12.13) that fuzzy
implications and their properties play an important role in
the goodness of an inference scheme. In the following
subsection, we present a few issues where this role is 
highlighted. 
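
On finite input and output domains, (12.11)-(12.13) reduce to min/max computations over lists. The following sketch assumes the Łukasiewicz t-norm and implication and a single hypothetical rule A -> B with a normal antecedent; the membership values and function names are invented for illustration only.

```python
from fractions import Fraction as F


def t_lukasiewicz(a, b):
    """Lukasiewicz t-norm: max(0, a + b - 1)."""
    return max(F(0), a + b - 1)


def i_lukasiewicz(a, b):
    """Lukasiewicz implication: min(1, 1 - a + b)."""
    return min(F(1), 1 - a + b)


def rule_relation(rules, impl):
    """R(x, y) = min over rules of I(A_i(x), B_i(y)), as in (12.11)."""
    n_x, n_y = len(rules[0][0]), len(rules[0][1])
    return [[min(impl(a[x], b[y]) for a, b in rules) for y in range(n_y)]
            for x in range(n_x)]


def cri(a_in, rel, tnorm):
    """Sup-T composition (12.12); sup becomes max on a finite domain."""
    return [max(tnorm(a_in[x], rel[x][y]) for x in range(len(a_in)))
            for y in range(len(rel[0]))]


def bks(a_in, rel, impl):
    """Inf-I composition (12.13); inf becomes min on a finite domain."""
    return [min(impl(a_in[x], rel[x][y]) for x in range(len(a_in)))
            for y in range(len(rel[0]))]


# Hypothetical membership vectors over a 4-point input and 3-point output domain.
A = [F(0), F(1, 2), F(1), F(1, 2)]
B = [F(1, 4), F(1), F(1, 2)]
R = rule_relation([(A, B)], i_lukasiewicz)

print(cri(A, R, t_lukasiewicz) == B)
print(bks(A, R, i_lukasiewicz) == B)
```

In this small example both schemes return the consequent B when the input coincides with the antecedent A, which is one of the desirable properties usually asked of an inference mechanism.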


Issues in FRI 
While the rule base (12.10) is an example of the single-input
single-output (SISO) case, in practice we need multi-input
single-output (MISO) rules of the form given below,
with $m$ input domains $X_j$, $j = 1, 2, \ldots, m$,

$$R_i: \text{ IF } \tilde{x}_1 \text{ is } A_{i1} \text{ AND } \tilde{x}_2 \text{ is } A_{i2} \text{ AND } \ldots \text{ AND } \tilde{x}_m \text{ is } A_{im} \text{ THEN } \tilde{y} \text{ is } B_i.$$

While MISO rule bases are of great practical necessity,
they raise some new issues when they are employed
in FRIs.


Combinatorial Explosion of Rules and Distributivity of Fuzzy Implications
Let there be $k_j$ fuzzy sets defined on each of the domains $X_j$, $j = 1, 2, \dots, m$. Then a complete MISO rule base has $n = k_1 \times k_2 \times \dots \times k_m$ rules. Clearly, as $m$ or any $k_j$ increases, $n$ increases and we have a combinatorial explosion of rules; for instance, $m = 4$ inputs with $k_j = 5$ fuzzy sets each already require $5^4 = 625$ rules.

In a seminal work on this issue, Combs and Andrews [12.72] proposed an equivalent transformation of the CRI to mitigate the computational cost. The authors showed that the distributivity of fuzzy implications over t-norms plays a major role in this transformation. This was further studied by Balasubramaniam and Rao [12.10], and its use in SBR was later demonstrated by Jayaram [12.73].

Computational Complexity, Hierarchical Systems, and the Law of Importation
Let us consider a MISO rule base. From (12.11), it is clear that the relation $R$ obtained is a multidimensional matrix, with $R\colon X_1 \times X_2 \times \dots \times X_m \times Y \to [0, 1]$. In fact, when one uses the First-Infer-Then-Aggregate mechanism in an FRI, either CRI or BKS, one needs to store $n$ such multidimensional matrices. Further, the input $A'$ is also an $m$-dimensional matrix, and the computation of the output becomes costlier.

To overcome this, Jayaram [12.19] proposed an alternative hierarchical inference scheme, which can be shown to be equivalent in both the CRI [12.19] and BKS [12.20] settings when the underlying operators are such that the t-norm $T$ and the fuzzy implication $I$ satisfy the law of importation (12.5):

$$I(T(x, y), z) = I(x, I(y, z)), \quad x, y, z \in [0, 1].$$
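Whether a given pair $(T, I)$ satisfies the law of importation, $I(T(x, y), z) = I(x, I(y, z))$, can be probed numerically. The sketch below is an illustration only: it checks the identity on a finite rational grid (an assumption; passing on a grid is evidence, not a proof), using exact fractions to avoid rounding artifacts, and the operator pairs are chosen as examples.

```python
from fractions import Fraction
from itertools import product

ONE, ZERO = Fraction(1), Fraction(0)


def t_product(x, y):
    return x * y


def t_lukasiewicz(x, y):
    return max(ZERO, x + y - 1)


def t_minimum(x, y):
    return min(x, y)


def i_goguen(x, y):
    # Residual implication of the product t-norm.
    return ONE if x <= y else y / x


def i_lukasiewicz(x, y):
    # Residual implication of the Lukasiewicz t-norm.
    return min(ONE, 1 - x + y)


def satisfies_li(implication, t_norm, steps=10):
    """Check I(T(x, y), z) == I(x, I(y, z)) on a (steps+1)^3 grid."""
    grid = [Fraction(k, steps) for k in range(steps + 1)]
    return all(
        implication(t_norm(x, y), z) == implication(x, implication(y, z))
        for x, y, z in product(grid, repeat=3)
    )


goguen_with_product = satisfies_li(i_goguen, t_product)
lukasiewicz_with_own_tnorm = satisfies_li(i_lukasiewicz, t_lukasiewicz)
lukasiewicz_with_minimum = satisfies_li(i_lukasiewicz, t_minimum)
```

On this grid the first two pairs pass, while the Łukasiewicz implication paired with the minimum fails (for example at $x = y = 1/2$, $z = 0$), which shows that the law depends on the pairing of $T$ and $I$.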


12.3.3 Fuzzy Subsethood Measures 


Inclusion or subsethood of sets is an important concept. The first definition of inclusion of a fuzzy set $A$ over $X$ in another fuzzy set $B$ was given by Zadeh [12.74] as follows

$$A \subseteq B \iff A(x) \le B(x) \quad \text{for all } x \in X .$$

Note that this definition is essentially crisp, since $A$ is either contained in $B$ or not; a more general notion of degree of inclusion is missing from it. Subsequently, many fuzzy subsethood measures, usually denoted Inc, were proposed.


Axiomatic Studies on Fuzzy Subsethood Measures
From the isomorphism that exists between classical set theory and classical logic, we know that $A \subseteq B$ is equivalent to $\chi_A \Rightarrow \chi_B$, where $\chi_X$ is the characteristic function of the set $X$. Thus, early fuzzy subsethood measures mimicked this equivalence by being defined in terms of fuzzy implications. Many researchers, in particular Sinha and Dougherty [12.75], Kitainik [12.76], and Bandler and Kohout [12.77], proposed sets of axioms for an Inc to satisfy.

It is easy to see that all of the above axiomatic approaches eventually lead to employing implications as the underlying operators to define the corresponding Inc measure, as given below

$$\mathrm{Inc}_{SD}(A, B) = \inf_{x \in X} \min\bigl(1, \lambda(A(x)) + \lambda(1 - B(x))\bigr),$$

$$\mathrm{Inc}_{K}(A, B) = \inf_{x \in X} \varphi\bigl(I_{KD}(B(x), A(x)),\, 1 - I_{KD}(A(x), B(x))\bigr),$$

$$\mathrm{Inc}_{BK}(A, B) = \inf_{x \in X} I\bigl(A(x), B(x)\bigr),$$

where $\lambda\colon [0, 1] \to [0, 1]$ is a decreasing function with some additional properties, $\varphi\colon \Delta \to [0, 1]$ is a function with additional properties, where $\Delta = \{(x, y) \in [0, 1]^2 \mid x \ge y\}$, and $I$ is any fuzzy implication.

From the above formulae, the important position that a fuzzy implication $I$ holds in measuring fuzzy subsethood is apparent. Note that the Inc measure is used extensively in similarity-based reasoning (SBR) and in fuzzy mathematical morphology (FMM), which are discussed below.
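The contrast between Zadeh's crisp inclusion and an implication-based degree of inclusion can be sketched on a finite universe. This is a minimal illustration, assuming fuzzy sets are given as lists of membership degrees over the same sampled domain; the two implications are example choices, and the measure implemented is the infimum of $I(A(x), B(x))$ over $x$.

```python
def i_lukasiewicz(a, b):
    return min(1.0, 1.0 - a + b)


def i_goedel(a, b):
    return 1.0 if a <= b else b


def zadeh_included(a_set, b_set):
    """Crisp inclusion: A(x) <= B(x) for every x."""
    return all(a <= b for a, b in zip(a_set, b_set))


def inc(a_set, b_set, implication):
    """Degree of inclusion: infimum (here minimum) of I(A(x), B(x))."""
    return min(implication(a, b) for a, b in zip(a_set, b_set))


A = [0.25, 0.5, 1.0]   # hypothetical fuzzy sets on a three-point universe
B = [0.5, 0.5, 0.75]
C = [0.25, 0.5, 0.5]

crisp_a_in_b = zadeh_included(A, B)          # False: A exceeds B at the last point
graded_a_in_b = inc(A, B, i_lukasiewicz)     # 0.75: A is "mostly" included in B
graded_c_in_b = inc(C, B, i_goedel)          # 1.0: C is fully included in B
```

For the two implications used here, the degree equals 1 exactly when the crisp inclusion holds, so the graded measure refines Zadeh's definition rather than contradicting it.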


12.3.4 Fuzzy Control 


While Sect. 12.3.2 dealt with FRIs which are largely 
used in the context of decision making and expert sys- 
tems, in this section we deal with another type of fuzzy 
inference mechanism (FIM) that is used in fuzzy con- 
trol, where the approximation properties of the FIM are 
important. 


Similarity-Based Reasoning (SBR) 
Let us once again consider a fuzzy IF-THEN rule base of the form (12.10) and a fuzzy input $A'$. In an SBR inference scheme, the following steps are employed to produce the output:


• Matching: The input $A'$ is matched against each of the antecedents $A_i$ of the rules (12.10) using a matching function $M$ to obtain the corresponding similarity values $s_i = M(A', A_i) \in [0, 1]$ for $i = 1, 2, \dots, n$.

• Modification: Each of the similarity values $s_i$ is used to modify the corresponding consequent $B_i$ of the rule (12.10) using a modification function $J$ to obtain the modified output $B_i' = J(s_i, B_i)$.

• Aggregation: Finally, all the modified outputs $B_i'$ are aggregated to obtain the overall output $B' = G(B_1', \dots, B_n')$.


In symbols, we can write the above as

$$B'(y) = G_{i=1}^{n}\bigl(J(M(A', A_i), B_i(y))\bigr), \quad y \in Y . \tag{12.14}$$

Fuzzy Implications and Matching Functions
Clearly, since $A', A_i \in \mathcal{F}(X)$, the matching function is a map $M\colon \mathcal{F}(X) \times \mathcal{F}(X) \to [0, 1]$. Typically, a fuzzy subsethood measure Inc is employed as $M$. While there exist $M$ that are not based on fuzzy implications, those that are based on fuzzy implications often satisfy many of the desirable properties required of the matching function $M$ in different contexts, for instance, when the SBR is required to be interpolative or monotonic, or to possess good approximation properties. For more on this topic, see the works of Jayaram [12.73] or Mandal and Jayaram [12.78].
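The matching, modification, and aggregation steps of (12.14) can be sketched as follows. The concrete operators are assumptions chosen for illustration: an implication-based subsethood degree as $M$, the Gödel implication as $J$, and the minimum as $G$; the rule base is hypothetical sample data on three-point universes.

```python
def i_lukasiewicz(a, b):
    return min(1.0, 1.0 - a + b)


def i_goedel(a, b):
    return 1.0 if a <= b else b


def match_inc(a_prime, antecedent):
    """Matching function M: subsethood degree of A' in A_i."""
    return min(i_lukasiewicz(a, b) for a, b in zip(a_prime, antecedent))


def sbr(a_prime, antecedents, consequents, match, modify, aggregate):
    """Similarity-based reasoning following the three steps of (12.14)."""
    # Matching: one similarity value s_i per rule.
    sims = [match(a_prime, a_i) for a_i in antecedents]
    # Modification: B_i'(y) = J(s_i, B_i(y)) for every sampled y.
    modified = [[modify(s, b) for b in b_i] for s, b_i in zip(sims, consequents)]
    # Aggregation: combine the modified outputs pointwise.
    return [aggregate(column) for column in zip(*modified)]


antecedents = [[1.0, 0.5, 0.0], [0.0, 0.5, 1.0]]
consequents = [[1.0, 0.5, 0.0], [0.0, 0.5, 1.0]]
output = sbr(antecedents[0], antecedents, consequents, match_inc, i_goedel, min)
```

Here the first rule matches fully ($s_1 = 1$) and the second not at all ($s_2 = 0$), so the second modified output is constantly 1 and the aggregated result coincides with the first consequent.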


Fuzzy Implications and Modification Functions 
From (12.14), it is clear that the modification function $J$ can be seen simply as a binary function on $[0, 1]$. While any fuzzy logic operation could be used for $J$, fuzzy implications are preferred either due to their properties or due to the conditional nature of the underlying rules. For instance, when $J = I$ is a fuzzy implication, if the original output $B_i$ is normal then the modified output $B_i'$ is also normal, which is usually not the case when one uses, say, a t-norm. In fact, different properties of $J$, like (OP) and (IP), and the nature of its natural negation $N_J$ all play a role in the reasonableness of the final output of an SBR.

In real-life systems, the input and output domains $X$, $Y$ are subsets of $\mathbb{R}$. Now, let the consequents $B_i$ be of bounded support, i.e., $\{y \in Y \subseteq \mathbb{R} \mid B_i(y) > 0\} = [a, b] \subsetneq Y$ for some finite $a, b \in \mathbb{R}$. When a $J$ whose natural negation $N_J$ is not the Gödel least negation $N_{D1}$ is employed, the support of $B_i'$ becomes larger, and if $N_J$ is involutive then the support of the modified output sets $B_i'$ becomes the whole of $Y$. This often makes the modified output sets $B_i'$ nonconvex (and of larger support) and makes it difficult to apply standard defuzzification methods. For more on this, see the work of Štěpnička and De Baets [12.79]. The above discussion brings out an interesting aspect of fuzzy implications. While fuzzy implications $I$ whose $N_I$ are strong are to be preferred in the setting of fuzzy logic FL$_n$ for inferencing, as noted in Sect. 12.3.1 above, a $J$ with an $N_J$ that is not even continuous is to be preferred in inference mechanisms used in fuzzy control.

By the core of a fuzzy set $B$ on $Y$, we mean the set $\{y \in Y \mid B(y) = 1\}$. Now, a $J$ which possesses (OP) or (IP) is preferred in an SBR to ensure that there is an overlap between the cores of the modified outputs $B_i'$, a property that is important to ensure coherence in the system [12.80] and that, once again, allows standard defuzzification methods to be applied.


12.3.5 Fuzzy Mathematical Morphology 


Consider a 2D binary image $P$, i.e., the value at each pixel is either 0 or 1. $P$ can be seen as a function from $X \subseteq \mathbb{R}^2 \to \{0, 1\}$ or simply as a classical subset $X \subseteq \mathbb{R}^2$. Mathematical morphology (MM) is a set-theoretic method for the extraction of shape information from a scene. Here, a set $Y \subseteq \mathbb{R}^2$, which can be seen as another image $Q$ and is often referred to as the structuring element, is used to transform the original image $P$ by well-defined local operators termed dilation and erosion, defined below

$$D(P, Q) = \{v \in \mathbb{R}^2 \mid A_v(Q) \cap P \neq \emptyset\}, \tag{12.15}$$

$$E(P, Q) = \{v \in \mathbb{R}^2 \mid A_v(Q) \subseteq P\}, \tag{12.16}$$

where $A_v(Q) = \{u \in \mathbb{R}^2 \mid u - v \in Q\}$ is the translation of $Q$ by $v \in \mathbb{R}^2$.
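The crisp dilation (12.15) and erosion (12.16) can be tried out on a small pixel grid. The sketch below makes two assumptions for illustration: images are finite sets of integer pixel coordinates rather than subsets of the plane, and the candidate positions $v$ are restricted to a finite window.

```python
def translate(q, v):
    """Translation of Q by v: all points u with u - v in Q."""
    return {(qx + v[0], qy + v[1]) for qx, qy in q}


def dilation(p, q, candidates):
    """Positions v where the translated structuring element meets P."""
    return {v for v in candidates if translate(q, v) & p}


def erosion(p, q, candidates):
    """Positions v where the translated structuring element fits inside P."""
    return {v for v in candidates if translate(q, v) <= p}


# Hypothetical example: a 3x3 square and a horizontal 3-pixel element.
P = {(x, y) for x in range(1, 4) for y in range(1, 4)}
Q = {(-1, 0), (0, 0), (1, 0)}
window = {(x, y) for x in range(0, 5) for y in range(0, 5)}

dilated = dilation(P, Q, window)
eroded = erosion(P, Q, window)
```

Because the origin belongs to this structuring element, the eroded set is contained in $P$ and $P$ is contained in the dilated set: the square shrinks to its middle column and grows by one pixel on the left and right.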

FMM is the extension of MM to gray-level images by using fuzzy sets and possibility theory. Note that a gray-level image $P$ can be interpreted as a fuzzy set $X \subseteq \mathbb{R}^2 \to [0, 1]$, where the pixel value is interpreted as its membership degree to the original data set. This fuzzified image is then processed via morphological operators that are extensions of the Boolean ones.

In the literature, one finds two approaches to this 
extension: 


i) As a formal translation of crisp equations using t-norms and negations, by employing a fuzzy intersection for $\cap$ in (12.15) and a fuzzy subsethood measure Inc for $\subseteq$ in (12.16), and

ii) Using adjunction and residual implications. 


While the first approach is based on the duality be- 
tween dilation and erosion, the second approach stems 
more from an algebraic setting. 

De Baets [12.81, 82] took the second approach and defined the fuzzy dilation and erosion as follows

$$D(P, Q)(y) = \sup_{x \in A_y(Y) \cap X} C\bigl(P(x - y), Q(x)\bigr),$$

$$E(P, Q)(y) = \inf_{x \in A_y(Y)} I\bigl(P(x - y), Q(x)\bigr),$$

where $C$ is any fuzzy conjunction and $I$ is a fuzzy implication.

When the pair of operations $(C, I)$ satisfies the adjunction property, or equivalently, $I$ is the residual implication obtained from $C$, many interesting aspects emerge. Firstly, it can be shown that the opening and closing operations, which are morphological operations obtained from the $D$ and $E$ defined above, turn out to be idempotent, which is highly desirable [12.83]. Secondly, it can be shown, as was done by Nachtegael and Kerre [12.84], that this approach is more general and many other approaches become specific cases of it. Thirdly, Bloch [12.85] recently showed that both of the above approaches, based on duality and adjunction, are equivalent under conditions that are rather general and mild, yet often lead to highly desirable settings.

Recently, the approach initiated by De Baets has been extended by considering uninorms instead of t-norms, together with their residual implications, with good results in edge detection as well as in noise reduction [12.86, 87].


12.4 Future of Fuzzy Implications


Since the publication of [12.2, 3], the surge of interest in fuzzy implications has led to rapid progress in attempts to solve open problems on this topic. In particular, many open problems were presented in [12.3], covering all the subtopics of this field: characterizations, intersections, additional properties, etc. Many of these problems have already been solved, and the solutions have been collected in [12.88]. However, there still remain many open problems involving fuzzy implications. Thus, in this section, we list some of them, chosen either for the importance of the problem or for the significance of its solution.

The first subset corresponds to open problems dealing with the satisfaction of particular additional properties of fuzzy implications. The first one deals with the law of importation (LI). Recently, some works have dealt with this property and its equivalence to the exchange principle, and from them some new characterizations of (S, N)- and R-implications based on (12.5) have been proposed, see [12.21]. However, some questions are still open. Firstly, (12.5) with a t-norm (or a more general conjunction) and (EP) are equivalent when $N_I$ is a continuous negation, but the equivalence in general is not fully determined.


Problem 12.1
Characterize all the cases when (LI) and (EP) are equivalent.


Secondly, it is not yet known which fuzzy implications satisfy (LI) when the conjunction operation is fixed.


Problem 12.2

Given a conjunction $C$ (usually a t-norm or a conjunctive uninorm), characterize all fuzzy implications $I$ that satisfy (LI) with this conjunction $C$. For instance, which implications $I$ satisfy the following functional equation

$$I(xy, z) = I(x, I(y, z))$$

that comes from (LI) with $T = T_P$?


Another problem, now concerning only the exchange principle, follows.


Problem 12.3
Give a necessary condition on a nonborder continuous t-norm $T$ for the corresponding $I_T$ to satisfy (EP).


It should be mentioned that some related work on 
the above problem appeared in [12.89]. 

Some other open problems with respect to the sat- 
isfaction of particular additional properties are based 
on the preservation of these properties from some ini- 
tial fuzzy implications to the generated one using some 
construction methods like max, min, or the convex com- 
bination method. 


Problem 12.4 
Characterize all fuzzy implications $I$, $J$ such that $I \vee J$, $I \wedge J$, and $K^{\lambda}$ satisfy (EP) or (LI), where $\lambda \in [0, 1]$.


The above problem is also related to the following 
one: 


Problem 12.5 

Characterize the convex closures of the following fam- 
ilies of fuzzy implications: (S,N)-, R- and Yager’s f- 
and g-generated implications. 


Another open problem which has immense applica- 
tional value is the satisfaction of the T-conditionality by 
the Yager’s families of fuzzy implications. 


Problem 12.6 

Characterize Yager's f-generated and g-generated implications satisfying the T-conditionality property with some t-norm $T$.


The following two open problems are related to the characterization of some particular classes of fuzzy implications.


Problem 12.7
What is the characterization of (S, N)-implications generated from noncontinuous negations?


Problem 12.8
Characterize triples $(T, S, N)$ such that the corresponding QL-operation $I_{T,S,N}$ satisfies (I1).


Finally, a fruitful topic where many open problems 
are still to be solved is the study of the intersections 
among the classes of fuzzy implications (Fig. 12.1). 


Problem 12.9


i) Is there a fuzzy implication $I$, other than the Weber implication $I_{WB}$, which is both an (S, N)-implication and an R-implication which is obtained from a nonborder continuous t-norm and cannot be obtained as the residual of any other left-continuous t-norm?


13. Fuzzy Rule-Based Systems


Luis Magdalena 


Fuzzy rule-based systems are one of the most important areas of application of fuzzy sets and fuzzy logic. Constituting an extension of classical rule-based systems, these have been successfully applied to a wide range of problems in different domains for which uncertainty and vagueness emerge in multiple ways. In a broad sense, fuzzy rule-based systems are rule-based systems, where fuzzy sets and fuzzy logic are used as tools for representing different forms of knowledge about the problem at hand, as well as for modeling the interactions and relationships existing between its variables. The use of fuzzy statements as one of the main constituents of the rules allows capturing and handling the potential uncertainty of the represented knowledge. On the other hand, thanks to the use of fuzzy logic, inference methods have become more robust and flexible. This chapter will mainly analyze what a fuzzy rule-based system is (from both conceptual and structural points of view), how it is built, and how it can be used. The analysis will start by considering the two main conceptual components of these systems, knowledge and reasoning, and how they are represented. Then, the main structural approaches to fuzzy rule-based systems will be reviewed. Hierarchical fuzzy systems will also be analyzed. Once the components, structure, and approaches to those systems have been defined, the question of design will be considered. Finally, some conclusions will be presented.


From the point of view of applications, one of the most important areas of fuzzy sets theory is that of fuzzy rule-based systems (FRBSs). This kind of system constitutes an extension of classical rule-based systems, considering IF-THEN rules whose antecedents and consequents are composed of fuzzy logic (FL) statements, instead of classical logic ones.

Conventional approaches to knowledge representation are based on bivalent logic, which has a serious shortcoming: the inability to reason in situations of uncertainty and imprecision. As a consequence, conventional approaches do not provide an adequate framework for this mode of reasoning familiar to humans, and most commonsense reasoning falls into this category.

In a broad sense, an FRBS is a rule-based system where fuzzy sets and FL are used as tools for representing different forms of knowledge about the problem at hand, as well as for modeling the interactions and relationships existing between its variables. The use of fuzzy statements as one of the main constituents of the rules allows capturing and handling the potential uncertainty of the represented knowledge. On the other hand, thanks to the use of fuzzy logic, inference methods have become more robust and flexible.

Due to these properties, FRBSs have been success- 
fully applied to a wide range of problems in different 
domains for which uncertainty and vagueness emerge 
in multiple ways [13.1-5]. 

The analysis of FRBSs will start by considering the two main conceptual components of these systems, knowledge and reasoning, and how they are represented. Then, the main structural approaches to FRBSs will be reviewed. Hierarchical fuzzy systems could arguably be covered in that same section, but since the hierarchical approach can be combined with any of the structural models defined there, it seems better to consider it independently. Once the components, structure, and approaches to those systems have been defined, the question of design will be considered. Finally, some conclusions will be presented. It is important to note that this chapter concentrates on the general aspects of FRBSs without delving into the foundations of FL, which are extensively covered in previous chapters.


13.1 Components of a Fuzzy Rule-Based System 


Knowledge representation in FRBSs is enhanced with 
the use of linguistic variables and their linguistic val- 
ues, that are defined by context-dependent fuzzy sets 
whose meanings are specified by gradual membership 
functions [13.6-8]. On the other hand, FL inference 
methods such as generalized Modus Ponens, general- 
ized Modus Tollens, etc., form the basis for approx- 
imate reasoning [13.9]. Hence, FL provides a unique 
computational framework for inference in rule-based 
systems. This idea implies the presence of two clearly 
different concepts in FRBSs: knowledge and reasoning. 
This clear separation between knowledge and reason- 
ing (the knowledge base (KB) and processing structure 
shown in Fig. 13.1) is the key aspect of knowledge- 
based systems, so that from this point of view, FRBSs 
can be considered as a type of knowledge-based system. 

The first implementation of an FRBS dealing with real inputs and outputs was proposed by Mamdani [13.10], who, building on the ideas published just a few months earlier by Zadeh [13.9], was able to augment his initial formulation, allowing the application of fuzzy systems (FSs) to a control problem and so creating the first fuzzy control application. These kinds of FSs are also referred to as FRBSs with fuzzifier and defuzzifier or, more commonly, as fuzzy logic controllers (FLCs), as proposed by the author in his pioneering paper [13.11], or Mamdani FRBSs. From the beginning, the term FLC became popular since control systems design constituted the main application of Mamdani FRBSs. At present, control is only one of the many application areas of FRBSs.

The generic structure of a Mamdani FRBS is shown in Fig. 13.1. The KB stores the available knowledge about the problem in the form of fuzzy IF-THEN rules. The processing structure, by means of these rules, carries out the inference process on the system inputs. The fulfillment of a rule antecedent gives rise to the execution of its consequent, i.e., one output is produced. The overall process includes several steps. The input and output scalings produce domain adaptations. The fuzzification interface establishes a mapping between crisp values in the input domain $U$ and fuzzy sets defined on the same universe of discourse. On the other hand, the defuzzification interface performs the opposite operation by defining a mapping between fuzzy sets defined in the output domain $V$ and crisp values defined in the same universe. The central step of the process is inference.

Fig. 13.1 General structure of a Mamdani FRBS

The next two subsections analyze in depth the two 
main components of an FRBS, the KB and the pro- 
cessing structure, considering the case of a Mamdani 
FRBS. 


13.1.1 Knowledge Base 


The KB of an FRBS serves as the repository of the 
problem-specific knowledge — that models the rela- 
tionship between input and output of the underlying 
system — upon which the inference process reasons 
to obtain from an observed input, an associated out- 
put. 

This knowledge is represented in the form of rules, and the most common rule structure in Mamdani FRBSs involves the use of linguistic variables [13.6-8]. Hence, when dealing with multiple-input single-output (MISO) systems, these linguistic rules possess the following form

$$\text{IF } X_1 \text{ is } LT_1 \text{ and } \dots \text{ and } X_n \text{ is } LT_n \text{ THEN } Y \text{ is } LT_o , \tag{13.1}$$

with $X_i$ and $Y$ being, respectively, the input and output linguistic variables, and with $LT_i$ being linguistic terms associated with these variables.

Note that the KB contains two different informa- 
tion levels, i. e., the linguistic variables (providing fuzzy 
rule semantics in the form of fuzzy partitions) and 
the linguistic rules representing the expert knowledge. 
Apart from that, a third component, scaling functions, is 
added in many FRBSs to act as an interfacing compo- 
nent for domain adaptation between the external world 
and the universes of discourse used at the level of the 
fuzzy partitions. This conceptual distinction leads to
the three separate entities that constitute the KB:


• The fuzzy partitions (also called Frames of Cognition) describe the sets of linguistic terms associated with each variable and considered in the linguistic rules, and the membership functions defining the semantics of these linguistic terms. Each linguistic variable involved in the problem has an associated fuzzy partition of its domain. Figure 13.2 shows a fuzzy partition using triangular membership functions. This structure provides a natural framework to include expert knowledge in the form of fuzzy rules. The fuzzy partition shown in the figure uses five linguistic terms {very small, small, medium, large, and very large} (represented as VS, S, M, L, and VL, respectively), with the interval $[l, r]$ being its domain (universe of discourse). The figure also shows the membership function associated with each of these five terms.

• A rule base (RB) comprises a collection of linguistic rules (such as the one shown in (13.1)) that are joined by the also operator. In other words, multiple rules can fire simultaneously for the same input.

• Moreover, the KB also comprises the scaling functions or scaling factors that are used to transform between the universe of discourse in which the fuzzy sets are defined and the domain of the system input and output variables.
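A five-term triangular partition of the kind described for Fig. 13.2 can be sketched as below. The uniform spacing of the peaks over $[l, r]$ and the helper names are assumptions for illustration; the figure itself does not specify the exact breakpoints.

```python
def triangular(x, a, b, c):
    """Triangular membership function with feet a and c and peak b."""
    if x == b:
        return 1.0
    if x <= a or x >= c:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def uniform_partition(left, right, labels):
    """Assumed layout: peaks evenly spaced on [left, right], feet at neighbouring peaks."""
    step = (right - left) / (len(labels) - 1)
    return {
        label: (left + (k - 1) * step, left + k * step, left + (k + 1) * step)
        for k, label in enumerate(labels)
    }


def memberships(x, partition):
    """Membership degree of a crisp value x in every linguistic term."""
    return {label: triangular(x, *abc) for label, abc in partition.items()}


partition = uniform_partition(0.0, 1.0, ["VS", "S", "M", "L", "VL"])
degrees = memberships(0.375, partition)
```

With this layout adjacent terms cross at membership 0.5 and the degrees of any value in $[l, r]$ sum to 1, so a crisp value activates at most two neighbouring terms.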


It is important to note that the RB can present 
several structures. The usual one is the list of rules, 
although a decision table (also called rule matrix) be- 
comes an equivalent and more compact representation 
for the same set of linguistic rules when only a few in- 
put variables (usually one or two) are considered by the 
FRBS. 

Fig. 13.2 Example of a fuzzy partition

Let us consider an FRBS where two input variables ($x_1$ and $x_2$) and a single output variable ($y$) are involved, with the following term sets associated: {small, medium, large}, {short, medium, long}, and {bad, medium, good}, respectively. The following RB, composed of five linguistic rules,

R_1: IF X_1 is small and X_2 is short THEN Y is bad, also
R_2: IF X_1 is small and X_2 is medium THEN Y is bad, also
R_3: IF X_1 is medium and X_2 is short THEN Y is medium, also
R_4: IF X_1 is large and X_2 is medium THEN Y is medium, also
R_5: IF X_1 is large and X_2 is long THEN Y is good, (13.2)

can be represented by the decision table shown in Table 13.1.
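The equivalence between the rule list (13.2) and its decision table can be made concrete with a small data-structure sketch. The tuple encoding of rules and the function names are assumptions for illustration; only the five rules themselves come from the text.

```python
# Each rule of (13.2) as (term of X1, term of X2, term of Y).
RULES = [
    ("small", "short", "bad"),
    ("small", "medium", "bad"),
    ("medium", "short", "medium"),
    ("large", "medium", "medium"),
    ("large", "long", "good"),
]


def to_decision_table(rules):
    """Index the consequents by the pair of antecedent terms."""
    table = {}
    for x1_term, x2_term, y_term in rules:
        key = (x1_term, x2_term)
        if key in table and table[key] != y_term:
            raise ValueError(f"conflicting rules for {key}")
        table[key] = y_term
    return table


def render(table, x1_terms, x2_terms, empty="-"):
    """Rows indexed by X2 terms, columns by X1 terms; uncovered cells are marked."""
    return [
        [x2_term] + [table.get((x1_term, x2_term), empty) for x1_term in x1_terms]
        for x2_term in x2_terms
    ]


table = to_decision_table(RULES)
rows = render(table, ["small", "medium", "large"], ["short", "medium", "long"])
```

The empty cells make visible that this rule base is not complete: four of the nine antecedent combinations have no rule.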

Before concluding this section, we should notice 
two aspects. On one hand, the structure of a lin- 
guistic rule may be more generic if a connective 
other than the and operator is used to aggregate the 
terms in the rule antecedent. However, it has been 
demonstrated that the above rule structure is generic 
enough to subsume other possible rule representa- 
tions [13.12]. The above rules are therefore com- 
monly used throughout the literature due to their sim- 
plicity and generality. On the other hand, linguistic 
rules are not the only option and rules with a dif- 
ferent structure can be considered, as we shall see in 
Sect. 13.2. 


13.1.2 Processing Structure 


The functioning of FRBSs has been described as the interaction of knowledge and reasoning. Having briefly considered the knowledge component, this section analyzes the reasoning (processing) structure. The processing structure of a Mamdani FRBS is composed of the following five components:


• The input scaling that transforms the values of the input variables from their domain to the one where the input fuzzy partitions are defined.

• A fuzzification interface that transforms the crisp input data into fuzzy values that serve as the input to the fuzzy reasoning process.

• An inference engine that infers from the fuzzy inputs to several resulting output fuzzy sets according to the information stored in the KB.

• A defuzzification interface that converts the fuzzy sets obtained from the inference process into a crisp value.

• The output scaling that transforms the defuzzified value from the domain of the output fuzzy partitions to that of the output variables, constituting the global output of the FRBS.


Table 13.1 Example of a decision table

              x1
x2        small    medium   large
short     bad      medium   -
medium    bad      -        medium
long      -        -        good


In the following, the five elements will be briefly 
described. 


The Input/Output Scaling 
Input/output scaling maps (applying the corresponding 
scaling functions or factors contained in the KB) the in- 
put/output variables to/from the universes of discourse 
over which the corresponding linguistic variables were 
defined. 

This mapping can be performed with different func- 
tions ranging from a simple scaling factor to linear and 
nonlinear functions. 

The initial idea for scaling was the use of scaling 
factors with a tuning purpose [13.13], giving a certain 
adaptation capability to the fuzzy system. 

Additional degrees of freedom can be obtained by using a more complex scaling function. A second option is the use of linear scaling with a function of the form

$$f(x) = \lambda \cdot x + \nu , \tag{13.3}$$

where the scaling factor $\lambda$ enlarges or reduces the operating range, which in turn decreases or increases the sensitivity of the system with respect to that input variable, or the corresponding gain in the case of an output variable. The parameter $\nu$ shifts the operating range and plays the role of an offset for the corresponding variable.

Finally, it is possible to use more complex mappings generating nonlinear scaling. A common nonlinear scaling function is

$$f(x) = \operatorname{sign}(x) \cdot |x|^{\alpha} . \tag{13.4}$$

This nonlinear scaling increases ($\alpha > 1$) or decreases ($\alpha < 1$) the relative sensitivity in the region closer to the central point of the interval and has the opposite effect when moving far from the central point [13.14].
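The linear scaling (13.3) and the nonlinear scaling (13.4) translate directly into code. This is a minimal sketch; the parameter names `lam`, `nu`, and `alpha` are illustrative, and the nonlinear variant is assumed to operate on a variable already centered at zero.

```python
import math


def linear_scaling(x, lam, nu):
    """f(x) = lam * x + nu: lam rescales the range, nu acts as an offset."""
    return lam * x + nu


def nonlinear_scaling(x, alpha):
    """f(x) = sign(x) * |x| ** alpha, an odd function keeping 0 and +/-1 fixed."""
    return math.copysign(abs(x) ** alpha, x)


scaled = linear_scaling(2.0, 0.5, 1.0)
compressed = nonlinear_scaling(-0.5, 2.0)
expanded = nonlinear_scaling(0.25, 0.5)
```

Since the nonlinear map leaves $-1$, $0$, and $1$ fixed, it only redistributes values inside $[-1, 1]$, which is why it is usually applied after a linear normalization to that interval.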


The Fuzzification Interface 

The fuzzification interface enables Mamdani FRBSs to handle crisp input values. Fuzzification establishes a mapping from crisp input values to fuzzy sets defined in the universe of discourse of those inputs. The membership function of the fuzzy set $A'$ defined over the universe of discourse $U$ associated with a crisp input value $x_0$ is computed as

$$\mu_{A'} = F(x_0) , \tag{13.5}$$

in which $F$ is a fuzzification operator.

The most common choice for the fuzzification operator $F$ is the pointwise or singleton fuzzification, where $A'$ is built as a singleton with support $x_0$, i.e., it presents the following membership function:

$$\mu_{A'}(x) = \begin{cases} 1, & \text{if } x = x_0 , \\ 0, & \text{otherwise} . \end{cases} \tag{13.6}$$

Nonsingleton options [13.15] are also possible and have been considered in some cases as a tool to represent the imprecision of measurements.


The Inference System 
The inference system is the component that derives the fuzzy outputs from the input fuzzy sets according to the relation defined through the fuzzy rules. The usual fuzzy inference scheme employs the generalized Modus Ponens, an extension of the classical Modus Ponens [13.9]

$$\begin{array}{l} \text{IF } X \text{ is } A \text{ THEN } Y \text{ is } B \\ X \text{ is } A' \\ \hline Y \text{ is } B' . \end{array} \tag{13.7}$$

In this expression, IF X is A THEN Y is B describes a conditional statement that in this case is a fuzzy conditional statement, since $A$ and $B$ are fuzzy sets, and $X$ and $Y$ are linguistic variables. A fuzzy conditional statement like this one represents a fuzzy relation between $A$ and $B$ defined in $U \times V$. This fuzzy relation is expressed again by a fuzzy set ($R$) whose membership function $\mu_R(x, y)$ is given by

$$\mu_R(x, y) = I\bigl(\mu_A(x), \mu_B(y)\bigr), \quad \forall x \in U,\ y \in V , \tag{13.8}$$

in which $\mu_A(x)$ and $\mu_B(y)$ are the membership functions of the fuzzy sets $A$ and $B$, and $I$ is a fuzzy implication operator that models the existing fuzzy relation.


Going back to (13.7), the result of applying generalized Modus Ponens is the fuzzy set $B'$ (through its membership function), obtained by means of the compositional rule of inference [13.9]:


If $R$ is a fuzzy relation defined in $U$ and $V$, and $A'$ is a fuzzy set defined in $U$, then the fuzzy set $B'$, induced by $A'$, is obtained from the composition of $R$ and $A'$,


that is

$$B' = A' \circ R . \tag{13.9}$$

Now the fuzzy set $B'$ must be computed from $A'$ and $R$. According to the definition of composition (T-composition) given in the chapter devoted to fuzzy relations, the result will be

$$\mu_{B'}(y) = \sup_{x \in U} T\bigl(\mu_{A'}(x), \mu_R(x, y)\bigr) , \tag{13.10}$$

where $T$ is a triangular norm (t-norm). The concept and properties of t-norms have been previously introduced in the chapter devoted to fuzzy sets.

Given now an input value $X = x_0$, obtaining $A'$ in accordance with (13.6) (where $\mu_{A'}(x) = 0\ \forall x \neq x_0$), and considering the properties of t-norms ($T(1, a) = a$, $T(0, a) = 0$), every $x \neq x_0$ contributes 0 to the supremum and the previous expression reduces to

$$\mu_{B'}(y) = T\bigl(\mu_{A'}(x_0), \mu_R(x_0, y)\bigr) = T\bigl(1, \mu_R(x_0, y)\bigr) = \mu_R(x_0, y) . \tag{13.11}$$


The only additional point needed to arrive at the final value of $\mu_{B'}(y)$ is the definition of $R$, the fuzzy relation representing the implication. This is a somewhat controversial question. Since the very first applications of FRBSs [13.10, 11], this relation has been implemented with the minimum (the product has also been a common choice). If we analyze the definition of fuzzy implication given in the corresponding chapter, it is clear that the minimum does not satisfy all the conditions to be a fuzzy implication, so why is it used? It can be said that initially it was a sort of heuristic decision, which produced really good results and was therefore accepted and reproduced in all subsequent applications. Further analysis can offer different explanations for this choice [13.16-18].

In any case, assuming the minimum as the represen- 
tation for R, (13.11) produces the following final result: 


$\mu_{B'}(y) = \min(\mu_A(x_0), \mu_B(y)).$ (13.12)
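
The reduction from (13.10) to (13.12) can be checked numerically. The following is a minimal sketch, not part of the original chapter: it discretizes the sup-T composition on a grid and compares it with the clipped consequent. The triangular shapes, the grids, and the names `tri` and `compose` are illustrative assumptions.

```python
def tri(a, b, c):
    """Triangular membership function with feet a, c and peak b (a < b < c)."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return mu


def compose(mu_a_prime, mu_r, xs, ys, t_norm=min):
    """Discretized sup-T composition (13.10): B'(y) = max_x T(A'(x), R(x, y))."""
    return [max(t_norm(mu_a_prime(x), mu_r(x, y)) for x in xs) for y in ys]


# Hypothetical antecedent A on U = [0, 10] and consequent B on V = [0, 20].
mu_a = tri(0.0, 5.0, 10.0)
mu_b = tri(0.0, 10.0, 20.0)


def mu_r(x, y):
    """Fuzzy relation R modeled with the minimum, as assumed for (13.12)."""
    return min(mu_a(x), mu_b(y))


x0 = 4.0


def singleton(x):
    """Input fuzzy set A': 1 at x0 and 0 elsewhere (singleton fuzzification)."""
    return 1.0 if x == x0 else 0.0


xs = [i * 0.5 for i in range(21)]   # grid on U; it contains x0 exactly
ys = [float(j) for j in range(21)]  # grid on V

b_composed = compose(singleton, mu_r, xs, ys)      # general form (13.10)
b_clipped = [min(mu_a(x0), mu_b(y)) for y in ys]   # closed form (13.12)
```

With a singleton input the composition collapses to the consequent B clipped at the matching degree $\mu_A(x_0)$, so both lists coincide on the grid.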
Considering now an n-dimensional input space, the
inference will establish a mapping between fuzzy sets
defined in the Cartesian product ($U = U_1 \times U_2 \times \cdots \times U_n$)
of the universes of discourse of the input variables
$X_1, \ldots, X_n$, and fuzzy sets defined in V, being the
universe of discourse of the output variable Y. Therefore,
when applied to the ith rule of the RB, defined as

IF $X_1$ is $A_{i1}$ and ... and $X_n$ is $A_{in}$ THEN Y is $B_i$, (13.13)

considering an input value $x_0 = (x_1, \ldots, x_n)$, the
output fuzzy set B’ will be obtained by replacing $\mu_A(x_0)$
in (13.12) with

$\mu_{A_i}(x_0) = T(\mu_{A_{i1}}(x_1), \ldots, \mu_{A_{in}}(x_n)),$

where T is a fuzzy conjunctive operator (a t-norm).

The Defuzzification Interface
The inference process in Mamdani-type FRBSs operates
at the level of individual rules. Thus, the application
of the compositional rule of inference to the current
input, using the m rules in the KB, generates m output
fuzzy sets $B'_i$. The defuzzification interface has to
aggregate the information provided by the m individual
outputs and obtain a crisp output value from the
aggregated set. This task can be done in two different
ways [13.1, 12, 19]: Mode A-FATI (first aggregate, then
infer) and Mode B-FITA (first infer, then aggregate).

Mamdani originally suggested the mode A-FATI
in his first conception of FLCs [13.10]. In the last
few years, the Mode B-FITA is becoming more popular
[13.19-21], in particular, in real-time applications
which demand a fast response time.


Mode A-FATI: First Aggregate, then Infer. In this
case, the defuzzification interface operates as follows:


• Aggregate the individual fuzzy sets $B'_i$ into an overall
fuzzy set B’ by means of a fuzzy aggregation
operator G (usually named as the also operator):

$\mu_{B'}(y) = G\{\mu_{B'_1}(y), \mu_{B'_2}(y), \ldots, \mu_{B'_m}(y)\}.$ (13.14)

• Employ a defuzzification method, D, transforming
the fuzzy set B’ into a crisp output value $y_0$:

$y_0 = D(\mu_{B'}(y)).$ (13.15)


Usually, the aggregation operator G is implemented
by the maximum (a t-conorm), and the defuzzifier D
is the center of gravity (CG) or the mean of maxima
(MOM), whose expressions are as follows:

• CG:

$y_0 = \frac{\int_Y y \cdot \mu_{B'}(y)\, dy}{\int_Y \mu_{B'}(y)\, dy}.$ (13.16)

• MOM:

$y_{\inf} = \inf\{z \mid \mu_{B'}(z) = \max_y \mu_{B'}(y)\}$
$y_{\sup} = \sup\{z \mid \mu_{B'}(z) = \max_y \mu_{B'}(y)\}$
$y_0 = \frac{y_{\inf} + y_{\sup}}{2}.$ (13.17)


Mode B-FITA: First Infer, then Aggregate. In this
second approach, the contribution of each fuzzy set is
considered separately and the final crisp value is obtained
by means of an averaging or selection operation
applied to the set of crisp values derived from each of
the individual fuzzy sets $B'_i$.

The most common choice is either the CG or the
maximum value (MV), then weighted by the matching
degree. Its expression is shown as follows:


$y_0 = \frac{\sum_{i=1}^{m} h_i \cdot y_i}{\sum_{i=1}^{m} h_i},$ (13.18)


with $y_i$ being the CG or the MV of the fuzzy set $B'_i$,
inferred from rule $R_i$, and $h_i = \mu_{A_i}(x_0)$ being the
matching between the system input $x_0$ and the antecedent
(premise) of rule i.

Hence, this approach avoids aggregating the rule 
outputs to generate the final fuzzy set B’, reducing the 
computational burden compared to mode A-FATI de- 
fuzzification. 

This defuzzification mode constitutes a different ap- 
proach to the notion of the also operator, and it is 
directly related to the idea of interpolation and the
approach of Takagi-Sugeno-Kang (TSK) fuzzy systems,
as can be seen by comparing (13.18) and (13.25). 
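
The two defuzzification modes can be contrasted on a discretized output universe. The sketch below is illustrative only: the consequents, the matching degrees `h`, and the grid are assumptions, the integrals of (13.16) are replaced by sums, and the CG of each clipped set is taken as $y_i$ in the mode B-FITA.

```python
def tri(a, b, c):
    """Triangular membership function with feet a, c and peak b (a < b < c)."""
    def mu(y):
        if y <= a or y >= c:
            return 0.0
        return (y - a) / (b - a) if y <= b else (c - y) / (c - b)
    return mu


def centroid(mu, ys):
    """Discrete center of gravity, cf. (13.16), of degrees mu sampled on ys."""
    return sum(y * m for y, m in zip(ys, mu)) / sum(mu)


def mean_of_maxima(mu, ys):
    """Discrete MOM, cf. (13.17): midpoint of smallest and largest maximizer."""
    top = max(mu)
    maximizers = [y for y, m in zip(ys, mu) if m == top]
    return (maximizers[0] + maximizers[-1]) / 2


ys = [j * 0.1 for j in range(101)]           # grid on V = [0, 10]
consequents = [tri(0, 2, 4), tri(4, 6, 8)]   # hypothetical B_1 and B_2
h = [0.25, 0.75]                             # assumed matching degrees h_i

# Inferred sets B'_i: each consequent clipped at its matching degree.
inferred = [[min(hi, b(y)) for y in ys] for hi, b in zip(h, consequents)]

# Mode A-FATI: aggregate with the maximum, then defuzzify once.
aggregated = [max(values) for values in zip(*inferred)]
y_fati = centroid(aggregated, ys)
y_mom = mean_of_maxima(aggregated, ys)

# Mode B-FITA: defuzzify each B'_i, then average weighted by h_i.
y_i = [centroid(mu, ys) for mu in inferred]
y_fita = sum(hi * yi for hi, yi in zip(h, y_i)) / sum(h)
```

In this toy setting the mode B-FITA needs only one centroid per rule and a weighted average, whereas the mode A-FATI first builds the aggregated set over the whole grid; the two crisp outputs are close but not identical.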
13.2 Types of Fuzzy Rule-Based Systems


As discussed earlier, the first proposal of an FRBS was 
that of Mamdani, and this kind of system has been 
considered as the basis for the general description of 
the previous section. This section will focus on the different
structures that can be considered when building an
FRBS. 


13.2.1 Linguistic Fuzzy Rule-Based Systems 


This approach corresponds to the original Mamdani
FRBS [13.10, 11]. It is the main tool to develop
linguistic models, and it is the approach that has
been mainly considered to this point in the chapter.

A Mamdani FRBS provides a natural framework 
to include expert knowledge in the form of linguis- 
tic rules. This knowledge can be easily combined with 
rules which are automatically generated from data sets 
that describe the relation between system input and 
output. In addition, this knowledge is highly inter- 
pretable. The fuzzy rules are composed of input and 
output variables, which take values from their term sets 
having a meaning (a semantics) associated with each 
linguistic term. Therefore, each rule is a description of 
a condition-action statement that exhibits a clear in- 
terpretation to a human — for this reason, these kinds 
of systems are usually called linguistic or descrip- 
tive Mamdani FRBSs. This property makes Mamdani 
FRBSs appropriate for applications in which the em- 
phasis lies on model interpretability, such as fuzzy 
control [13.20, 22, 23] and linguistic modeling [13.4,
21].


13.2.2 Variants of Mamdani Fuzzy 
Rule-Based Systems 


Although Mamdani FRBSs possess several advantages, 
they also come with some drawbacks. One of the prob- 
lems, especially in linguistic modeling applications, 
is their limited accuracy in some complex problems, 
which is due to the structure of the linguistic rules. 
The analyses in [13.24] and [13.25] conclude
that the structure of the fuzzy linguistic IF-THEN
rule is subject to certain restrictions because of the use
of linguistic variables:


• There is a lack of flexibility in the FRBS due
to the rigid partitioning of the input and output
spaces.

• When the input variables are mutually dependent, it
becomes difficult to find a proper fuzzy partition of
the input space.

• The homogeneous partition of the input and output
space becomes inefficient and does not scale well
as the dimensionality and complexity of the
input-output mapping increases.

• The size of the KB increases rapidly with the number
of variables and linguistic terms in the system.
This problem is known as the curse of dimensionality.
In order to obtain an accurate FRBS, a fine
level of granularity is needed, which requires additional
linguistic terms. This increase in granularity
causes the number of rules to grow, which complicates
the interpretation of the system by a human.
Moreover, in the vast majority of cases, it is possible
to obtain an equivalent FRBS that achieves the
same accuracy with fewer rules whose
fuzzy sets are not restricted to a fixed input space
partition.


Both variants of linguistic Mamdani FRBSs de- 
scribed in this section attempt to solve the said prob- 
lems by making the linguistic rule structure more 
flexible. 


DNF Mamdani Fuzzy Rule-Based Systems 
The first extension to Mamdani FRBSs aims at a differ- 
ent rule structure, the so-called disjunctive normal form 
(DNF) fuzzy rule, which has the following form [13.26,
27]:


IF $X_1$ is $\tilde{A}_1$ and ... and $X_n$ is $\tilde{A}_n$ THEN Y is B, (13.19)


where each input variable $X_i$ takes as its value a set
of linguistic terms $\tilde{A}_i$, whose members are joined by
a disjunctive operator, while the output variable remains
a usual linguistic variable with a single label associated.
Thus, the complete syntax for the antecedent of the rule
is


$X_1$ is $\tilde{A}_1 = \{A_{11}$ or ... or $A_{1l_1}\}$ and ...
and $X_n$ is $\tilde{A}_n = \{A_{n1}$ or ... or $A_{nl_n}\}$. (13.20)
An example of this kind of rule is shown as follows. Let
us suppose we have three input variables, $X_1$, $X_2$, and
$X_3$, and one output variable, Y, such that the linguistic
term sets $D_i$ (i = 1, 2, 3) and F, associated with each
variable, are


$D_1 = \{A_{11}, A_{12}, A_{13}\}$
$D_2 = \{A_{21}, A_{22}, A_{23}, A_{24}, A_{25}\}$
$D_3 = \{A_{31}, A_{32}\}$
$F = \{B_1, B_2, B_3\}$. (13.21)

In this case, an example of DNF rule will be

IF $X_1$ is $\{A_{11}$ or $A_{12}\}$ and $X_2$ is $\{A_{23}$ or $A_{24}\}$
and $X_3$ is $\{A_{31}$ or $A_{32}\}$ THEN Y is $B_2$. (13.22)


This expression contains an additional connective,
different from the and considered in all previous rules.
The or connective is computed through a t-conorm, the
maximum being the most commonly used.

The main advantage of this rule structure is its
ability to integrate in a single expression (a single
DNF rule) the information corresponding to several
elemental rules (the rules commonly used in Mamdani
FRBSs). In this example, (13.22) corresponds to
8 (2 × 2 × 2) rules of the equivalent system expressed as
(13.1). This property produces a certain level of
compression of the rule base, which is quite helpful when the
number of input variables increases, since it alleviates the
effect of the curse of dimensionality.
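
The compression achieved by a DNF rule can be illustrated with a small sketch. It is an assumption-laden example rather than the authors' implementation: the and connective is taken as the minimum, the or connective as the maximum, the label names follow (13.22) as read here, and the membership degrees in `degrees` are invented for the illustration.

```python
from itertools import product


def dnf_matching(degrees, antecedent):
    """Matching degree of a DNF antecedent: min over variables of max over terms."""
    return min(
        max(degrees[var][term] for term in terms)
        for var, terms in antecedent.items()
    )


def expand(antecedent):
    """Expand a DNF antecedent into its equivalent elemental antecedents."""
    variables = list(antecedent)
    return [
        dict(zip(variables, combo))
        for combo in product(*(antecedent[v] for v in variables))
    ]


# Antecedent of the DNF rule (13.22); its consequent is B2.
antecedent = {
    "X1": ["A11", "A12"],
    "X2": ["A23", "A24"],
    "X3": ["A31", "A32"],
}

# Hypothetical membership degrees of a crisp input in every linguistic term.
degrees = {
    "X1": {"A11": 0.2, "A12": 0.7, "A13": 0.1},
    "X2": {"A21": 0.0, "A22": 0.0, "A23": 0.4, "A24": 0.6, "A25": 0.0},
    "X3": {"A31": 0.9, "A32": 0.1},
}

elemental = expand(antecedent)
h_dnf = dnf_matching(degrees, antecedent)

# Best matching degree among the elemental rules (and = min).
h_elemental = max(
    min(degrees[var][term] for var, term in rule.items()) for rule in elemental
)
```

With the minimum and maximum operators, the matching degree of the DNF rule equals the largest matching degree among its eight elemental rules, so a single rule carries the same information for this input.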


Approximate Mamdani-Type Fuzzy Rule-Based Systems
While the previous DNF fuzzy rule structure does not
involve an important loss in the linguistic Mamdani
FRBS interpretability, the point of departure for the
second extension is to obtain an FRBS which achieves
a better accuracy at the cost of reduced interpretability.
These systems are called approximate Mamdani-type
FRBSs [13.1, 25, 28-30], in opposition to the previous
descriptive or linguistic Mamdani FRBSs.

Fig. 13.3a,b Comparison between a descriptive KB and an
approximate fuzzy rule base. (a) Descriptive knowledge base:
both X and Y are partitioned into the linguistic terms NB, NM,
NS, ZR, PS, PM, PB, and the rules refer to these labels
(R1: If X is NB then Y is NB; R2: If X is NM then Y is NM;
R3: If X is NS then Y is NS; R4: If X is ZR then Y is ZR;
R5: If X is PS then Y is PS; R6: If X is PM then Y is PM;
R7: If X is PB then Y is PB). (b) Approximate fuzzy rule base:
rules R1 to R4, each with its own input and output fuzzy sets
drawn directly in the rule.

The structure of an approximate FRBS is similar to 
that of a descriptive one shown in Fig. 13.1. The dif- 
ference is that in this case, the rules do not refer in 
their definition to predefined fuzzy partitions of the lin- 
guistic variables. In an approximate FRBS, each rule 
defines its own fuzzy sets instead of using a linguistic 
label pointing to a particular fuzzy set of the partition 
of the underlying linguistic variable. Thus, an approxi- 
mate fuzzy rule has the following form: 


IF $X_1$ is $A_1$ and ... and $X_n$ is $A_n$ THEN Y is B. (13.23)


The major difference with respect to the rule struc- 
ture considered in linguistic Mamdani FRBSs is the fact 
that the input variables $X_i$ and the output one Y are fuzzy
variables instead of linguistic variables and, thus, $A_i$ and
B are not linguistic terms ($LT_i$) as they were in (13.1),
but independently defined fuzzy sets that elude an in- 
tuitive linguistic interpretation. In other words, rules of 
approximate nature are semantic free, whereas descrip- 
tive rules operate in the context formulated by means of 
the linguistic terms semantics. 

Therefore, approximate FRBSs do not rely on
fuzzy partitions defining a semantic context in the form
of linguistic terms. Instead, the fuzzy partitions are
integrated into the fuzzy rule base, in which each rule
subsumes the definition of its underlying input and
output fuzzy sets, as shown in Fig. 13.3(b).

Approximate FRBSs demonstrate some specific ad- 
vantages over linguistic FRBSs making them particu- 
larly useful for certain types of applications [13.25]: 


• The major advantage of the approximate approach
is that each rule employs its own distinct fuzzy sets,
resulting in additional degrees of freedom and an
increase in expressiveness. It means that the tuning of
a certain fuzzy set in a rule will have no effect on
other rules, while changing a fuzzy set of a fuzzy
partition in a descriptive model affects all rules
considering the corresponding linguistic label.

• Another important advantage is that the number of
rules can be adapted to the complexity of the problem.
Simple input-output relationships are modeled
with a few rules, but still more rules can be added as
the complexity of the problem increases. Therefore,
approximate FRBSs constitute a potential remedy
to the curse of dimensionality that emerges when
scaling to multidimensional systems.


These properties enable approximate FRBSs to 
achieve a better accuracy than linguistic FRBS in com- 
plex problem domains. However, despite their benefits, 
they also come with some drawbacks: 


• Their main drawback compared to the descriptive
FRBS is the degradation in terms of interpretability
of the RB as the fuzzy variables no longer
share a unique linguistic interpretation. Still, unlike
other kinds of approximate models such as
neural networks that store knowledge implicitly,
the knowledge in an approximate FRBS remains
explicit as the system behavior is described by local
rules. Therefore, approximate FRBSs can be
considered as a compromise between the apparent
interpretability of descriptive FRBSs and the type of
black-box behavior, typical for nondescriptive,
implicit models.

• The capability to approximate a set of training data
accurately can lead to over-fitting and therefore to
a poor generalization capability to cope with
previously unseen input data.


According to their properties, fuzzy model- 
ing [13.1] constitutes the major application of approx- 
imate FRBSs, as model accuracy is more relevant than 
description ability. Approximate FRBSs are usually not 
the first choice for linguistic modeling and fuzzy control 
problems. Hence, descriptive and approximate FRBSs 
are considered as complementary rather than competi- 
tive approaches. Depending on the problem domain and 
requirements on the obtained model, one should use one 
or the other approach. Approximate FRBSs are advisable
when one wants to trade interpretability for
improved accuracy.


13.2.3 Takagi-Sugeno-Kang 
Fuzzy Rule-Based Systems 


Instead of working with linguistic rules of the kind in- 
troduced in the previous section, Sugeno et al. [13.31, 
32] proposed a new model based on rules whose an- 
tecedent is composed of linguistic variables and the 
consequent is represented by a function of the input 
variables. The most common form of this kind of rules
is the one in which the consequent expression constitutes
a linear combination of the variables involved in
the antecedent


IF $X_1$ is $A_1$ and ... and $X_n$ is $A_n$
THEN $Y = p_0 + p_1 \cdot X_1 + \cdots + p_n \cdot X_n$, (13.24)

where $X_i$ are the input variables, Y is the output variable,
and $p = (p_0, p_1, \ldots, p_n)$ is a vector of real parameters.
Regarding $A_i$, they are either a direct specification of
a fuzzy set (thus $X_i$ being fuzzy variables) or a linguistic
label that points to a particular member of a fuzzy
partition of a linguistic variable. These rules, and
consequently the systems using them, are usually called TSK
fuzzy rules, in reference to the names of their first
proponents.

The output of a TSK FRBS, using a KB composed
of m rules, is obtained as a weighted average of the
individual outputs provided by each rule, $Y_i$, $i = 1, \ldots, m$,
as follows:


$\frac{\sum_{i=1}^{m} h_i \cdot Y_i}{\sum_{i=1}^{m} h_i},$ (13.25)


in which $h_i = T(A_{i1}(x_1), \ldots, A_{in}(x_n))$ is the matching
degree between the antecedent part of the ith rule
and the current inputs to the system, $x_0 = (x_1, \ldots, x_n)$.
T stands for a conjunctive operator modeled by a t-norm.
Therefore, to design the inference engine of TSK
FRBSs, the designer only selects this conjunctive
operator T, with the most common choices being the
minimum and the product. As a consequence, TSK systems
do not need defuzzification, since their outputs are
real numbers.

This type of FRBS divides the input space into several
fuzzy subspaces and defines a linear input-output
relationship in each one of these subspaces [13.31].
In the inference process, these partial relationships are
combined in the said way for obtaining the global
input-output relationship, taking into account the
dominance of the partial relationships in their respective
areas of application and the conflicts emerging in the
overlapping zones. As a result, the overall system
performs as a sort of interpolation of the local models
represented by each individual rule.

TSK FRBSs have been successfully applied to
a large variety of practical problems. The main
advantage of these systems is that they present a set of
compact system equations that allows the parameters $p_i$
to be estimated by means of classical regression methods,
which facilitates the design process. However, the
main drawback associated with TSK FRBSs is the form
of the rule consequents, which does not provide a natural
framework for representing expert knowledge that is
afflicted with uncertainty. Still, it becomes possible to
integrate expert knowledge in these FRBSs by slightly
modifying the rule consequent: for each linguistic rule
with consequent Y is B, provided by an expert, its
consequent is substituted by $Y = p_0$, with $p_0$ standing for
the modal point of the fuzzy set associated with the
label B. These kinds of rules are usually called simplified
TSK rules or zero-order TSK rules.

However, TSK FRBSs are more difficult to interpret
than Mamdani FRBSs for two reasons:


• The structure of the rule consequents is difficult
for human experts to understand, except for
zero-order TSK.

• Their overall output simultaneously depends on the
activation of the rule antecedents and on the
evaluation of the function defining the rule consequent,
which itself depends on the crisp inputs rather than
being constant.


TSK FRBSs are used in fuzzy modeling [13.4, 31] 
as well as control problems [13.31, 33]. 

As with Mamdani FRBSs, it is also possible to build
descriptive as well as approximate TSK systems. 
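
The TSK inference of (13.24) and (13.25) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `tsk_output`, the two shoulder-shaped membership functions, and all rule parameters are hypothetical, and a rule whose parameter vector holds only $p_0$ is treated as a zero-order rule.

```python
from functools import reduce


def tsk_output(x, rules, t_norm=min):
    """Weighted average (13.25) of the rule outputs for the crisp input x.

    Each rule is a pair (antecedent, p): antecedent holds one membership
    function per input variable, and p = (p0, p1, ..., pn) are the
    consequent parameters of (13.24).
    """
    numerator = 0.0
    denominator = 0.0
    for antecedent, p in rules:
        # Matching degree h_i = T(A_i1(x_1), ..., A_in(x_n)).
        h = reduce(t_norm, (mu(xi) for mu, xi in zip(antecedent, x)))
        # Rule output Y_i = p0 + p1 * x1 + ... + pn * xn.
        y = p[0] + sum(pj * xj for pj, xj in zip(p[1:], x))
        numerator += h * y
        denominator += h
    if denominator == 0.0:
        raise ValueError("no rule fires for this input")
    return numerator / denominator


def low(x):
    """Hypothetical 'low' term on [0, 10]: 1 at 0, decreasing to 0 at 10."""
    return max(0.0, min(1.0, (10.0 - x) / 10.0))


def high(x):
    """Hypothetical 'high' term on [0, 10]: 0 at 0, increasing to 1 at 10."""
    return max(0.0, min(1.0, x / 10.0))


# First-order rules: IF X is low THEN Y = X; IF X is high THEN Y = 10 + 3 X.
first_order = [((low,), (0.0, 1.0)), ((high,), (10.0, 3.0))]
# Zero-order rules: IF X is low THEN Y = 2; IF X is high THEN Y = 8.
zero_order = [((low,), (2.0,)), ((high,), (8.0,))]

y_first = tsk_output((5.0,), first_order)
y_zero = tsk_output((2.5,), zero_order)
```

Passing `t_norm=lambda a, b: a * b` selects the product instead of the minimum; with a single input variable both choices give the same matching degrees.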


13.2.4 Singleton Fuzzy Rule-Based Systems 


The singleton FRBS, where the rule consequent takes 
a single real-valued number, may be considered as 
a particular case of the linguistic FRBS (the consequent 
is a fuzzy set where the membership function is one 
for a specific value and zero for the remaining ones) or 
of the TSK-type FRBS (the previously described
zero-order TSK systems).

Its rule structure is the following:

IF $X_1$ is $A_1$ and ... and $X_n$ is $A_n$
THEN Y is $y_0$. (13.26)


Since the single consequent seems to be more easily 
interpretable than a polynomial function, the singleton 
FRBS may be used to develop linguistic fuzzy mod- 
els. Nevertheless, compared with the linguistic FRBS, 
the fact of having a different consequent value for each 
rule (no global semantic is used for the output variable) 
worsens the interpretability. 


13.2.5 Fuzzy Rule-Based Classifiers 


Previous sections have implicitly considered FRBSs
working with inputs and, what is more important,
outputs which are real variables. These kinds of fuzzy
systems show an interpolative behavior where the overall
output is a combination of the individual outputs of
the fired rules. This interpolative behavior is explicit in
TSK models but it is also present in Mamdani systems.
It gives FRBSs a sort of smooth output,
generating soft transitions between rules, and is one
of the significant properties of FRBSs.

A completely different situation is that of having 
a problem where the output takes values from a finite 
list of possible values representing categories or classes. 
Under those circumstances, the interpolative approach
of the previously defined aggregation and defuzzification
methods makes no sense. As a consequence, some
additional comments will be added to highlight the main
characteristics of fuzzy rule-based classifiers (FRBCs),
and the differences with other FRBSs.

A fuzzy rule-based classifier is an automatic
classification system that uses fuzzy rules as a knowledge
representation tool. Therefore, the fuzzy classification
rule structure is as follows


IF $X_1$ is $A_1$ and ... and $X_n$ is $A_n$
THEN Y is C, (13.27)


with Y being a categorical variable, so that C is a class
label. The processing structure is similar to that previously
described as far as the evaluation of the
matching degree between each rule’s antecedent and the
current input is concerned, i.e., for each rule $R_i$ we obtain
$h_i = T(A_{i1}(x_1), \ldots, A_{in}(x_n))$. Once $h_i$ has been obtained,
the winner rule criterion can be applied, so that the overall
output is assigned the consequent of the rule achieving
the highest matching degree (highest value of $h_i$). More
elaborate evaluations such as voting are also possible.

Other alternative representations that include a
certainty degree or weight for each rule have also been
considered [13.34]. In this case, the previously
described rule will also include a rule weight $w_i$ that
weights the matching degree during the inference
process. The effect will be that the winning rule will be
that achieving the highest value of $h_i \cdot w_i$, or in the case
of voting schemes, the influence of the vote of the rule
will be proportional to this value.
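
Both decision schemes can be sketched once the matching degrees $h_i$ are available. The example below is only illustrative: the function names, the class labels, and the numeric values are assumptions, and the voting scheme shown (adding $h_i \cdot w_i$ per class) is one simple possibility among those the text alludes to.

```python
from collections import defaultdict


def winner_rule(h, classes, weights=None):
    """Return the class of the rule with the highest h_i * w_i."""
    if weights is None:
        weights = [1.0] * len(h)
    scores = [hi * wi for hi, wi in zip(h, weights)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return classes[best]


def additive_vote(h, classes, weights=None):
    """Return the class with the largest sum of h_i * w_i over its rules."""
    if weights is None:
        weights = [1.0] * len(h)
    votes = defaultdict(float)
    for hi, wi, label in zip(h, weights, classes):
        votes[label] += hi * wi
    return max(votes, key=votes.get)


# Hypothetical matching degrees of three rules and their class consequents.
h = [0.6, 0.5, 0.5]
classes = ["c1", "c2", "c2"]

by_winner = winner_rule(h, classes)
by_vote = additive_vote(h, classes)
by_weighted_winner = winner_rule(h, classes, weights=[0.5, 1.0, 1.0])
```

Here the single best rule points to one class while the accumulated support of the other two rules points to a different one, which shows how the two criteria may disagree; lowering the weight of the first rule also changes the winner.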


13.2.6 Type-2 Fuzzy Rule-Based Systems 


The idea of extending fuzzy sets by allowing membership
functions to include some kind of uncertainty was
already mentioned by Zadeh in early papers [13.6-8].
The idea, which was not really exploited for a long period,
has now achieved a significant presence in the literature
with the proposal of type-2 fuzzy systems and interval
type-2 fuzzy systems [13.35]. The main concept is that
the membership degree is not a value but a fuzzy set or
an interval, respectively. The effect is that additional
degrees of freedom become available in the design
process, at the price of a more complex processing
structure, which now requires a type-reduction step
added to the overall process described in the previous
section. As the complexity of the type-reduction process
is much lower for interval type-2 fuzzy systems than
in the general case of type-2 fuzzy systems, interval
approaches are now the most widely considered in the area
of type-2 fuzzy sets.


13.2.7 Fuzzy Systems with Implicative Rules 


Rule-based systems mentioned to this point consider
rules that, having the form if X is A then Y is B, model
the inference through a t-norm, usually minimum or
product (Sect. 13.1.2). With this interpretation, rules are
described as conjunctive rules, representing joint sets
of possible input and output values. As mentioned in
the chapter devoted to fuzzy control, these rules should
be seen not as logical implications but rather as
input-output associations.

That kind of rule is the one commonly used in
real applications to date. However, different
authors have pointed out that the same rule has
a completely different meaning when modeled in terms
of material implications (the approach for Boolean
if-then statements in propositional logic) [13.18]. As
a result, in addition to the common interpretation
of fuzzy rules that is widely considered in the
literature, some authors are exploring the modeling of
fuzzy rules (with exactly the same structure
previously mentioned) by means of material
implications [13.36]. Although it is still at a quite preliminary
stage of development, this idea is worth mentioning
since it constitutes a completely different
interpretation of FRBSs, thus offering new possibilities to the
field.


13.3 Hierarchical Fuzzy Rule-Based Systems 


The knowledge structure of FRBSs offers different 
options to introduce hierarchical structures. Rules, par- 
titions, or variables can be distributed at different levels 
according to their specificity, granularity, relevance, etc. 
This section will introduce different approaches to hier- 
archical FRBSs. 

It would be possible to consider hierarchical fuzzy
systems as a different type of FRBS, and so include them
in the previous section, or as a design option to build
simpler FRBSs, and so include them in the next section.
Including them in the previous section could be somewhat
confusing, since the hierarchical approach can be
combined with several of the structural models defined
there. It therefore seems better to consider them
independently and devote a section to their analysis.

Defining hierarchical fuzzy systems as
a method to solve problems of higher
complexity than those usually addressed with FRBSs
has produced some good results. In most cases,
the underlying idea is to cope with the complexity of
a problem by applying some kind of decomposition
that generates a hierarchy of lower complexity
systems [13.37].

Several methods to establish hierarchies in fuzzy
controllers have been proposed. These methods
can be grouped according to the way they structure
the inference process and the knowledge applied.

A first approach defines the hierarchy as a prioritization
of rules, in such a way that rules with a different
level of specificity receive a different priority, with
more specific rules having higher priority [13.38,
39]. With this kind of hierarchy, a generic rule is
applied only when no suitable specific rule is available. In
this case, the hierarchy is the effect of a particular
implication mechanism that applies the rules by taking into
account their priority. This methodology defines the
hierarchy (the decomposition) at the level of rules. The
rules are grouped into prioritized levels to design
a hierarchical fuzzy controller.

Another option is that of considering a hierarchy of
fuzzy partitions with different granularity [13.40]. From
that point, an FRBS is structured in layers, where each
layer contains fuzzy partitions with a different
granularity, as well as the rules using those fuzzy partitions.
Usually, every partition in a certain layer has the same
number of fuzzy terms. In this case, rules at different
layers have different granularity, which is somewhat
related to the idea of specificity of the previous paragraph.
It is even possible to generate a multilevel grid-like
partition where a higher granularity is considered only
for some specific regions of the input space (usually
those regions showing poor performance) [13.41], with
a similar approach to that already considered in some
neuro-fuzzy systems [13.42].

A completely different point of view is that of in- 
troducing the decomposition at the level of variables. In 
this case, the input space is decomposed into subspaces 
of lower dimensionality, and each input variable is only 
considered at a certain level of the hierarchy. The re- 
sult is a cascade structure of FRBSs where, in addition 
to a subset of the input variables, the output of each 
level is considered as one of the inputs to the follow- 
ing level [13.43]. As a result, the system is decomposed 
into a finite number of reduced-order subsystems, elim- 
inating the need for a large-sized inference engine. This 
decomposition is usually presented as a way to keep
under control the problems generated by the so-called
curse of dimensionality, the exponential growth of the
number of rules with the number of variables of
the system.

The number of rules of an FRBS with n input
variables and l linguistic terms per variable will be $l^n$. In
this approach to hierarchical FRBSs, the variables (and
rules) are divided into different levels in such a way that
those considered the most influential variables are
chosen as input variables at the first level, the next most
important variables are chosen as input variables at the
second level, and so on. The output variable of each
level is introduced as input variable at the following
level.

With that structure, the rules at the first level of the
FRBS have a similar structure to any Mamdani FRBS,
i.e., that described by (13.1), but at the k-th level (k > 1),
rules include the output of the previous level as input


IF $X_{N_k+1}$ is $LT_{N_k+1}$ and ... and $X_{N_k+n_k}$ is $LT_{N_k+n_k}$
and $O_{k-1}$ is $LT_{O_{k-1}}$ THEN $O_k$ is $LT_{O_k}$, (13.28)


where the value $N_k$ determines the input variables
considered in previous levels


$N_k = \sum_{t=1}^{k-1} n_t,$ (13.29)


with $n_t$ being the number of system variables applied at
level t. Variable $O_k$ represents the output of the k-th level
of the hierarchy. All outputs are intermediate variables
except for the output of the last level, which will be Y (the
overall output of the system).

With this structure it is shown [13.43] that the
number of rules in a complete rule base could be reduced
to a linear function of the number of variables, while in
a conventional FRBS it was an exponential function of
the number of variables.
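
The effect on the size of a complete rule base can be made concrete with a short count. The sketch below rests on assumptions that are not stated in the text: every variable, including the intermediate outputs $O_k$, uses l linguistic terms, and each level holds a complete rule base over its own inputs. The function names are hypothetical.

```python
def flat_rule_count(n, l):
    """Complete rule base of a conventional FRBS: l ** n rules."""
    return l ** n


def hierarchical_rule_count(inputs_per_level, l):
    """Total rules of a cascade whose k-th level receives n_k system variables.

    The first level has n_1 inputs; every later level has n_k + 1 inputs,
    the extra one being the output O_{k-1} of the previous level.
    """
    total = l ** inputs_per_level[0]
    for n_k in inputs_per_level[1:]:
        total += l ** (n_k + 1)
    return total


n, l = 6, 5
flat = flat_rule_count(n, l)

# Two variables at the first level and one new variable at each later level.
levels = [2, 1, 1, 1, 1]
hierarchical = hierarchical_rule_count(levels, l)
```

With this particular allocation every level has two inputs, so the total is (n - 1) · l² rules, which grows linearly with n, against $l^n$ for the conventional system.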


13.4 Fuzzy Rule-Based Systems Design 


Once the components and functioning of an FRBS
have been defined, it is time to consider its design, i.e.,
how to build an FRBS to solve a certain problem while
showing some specific properties. The present section
will focus on this question.

An FRBS can be characterized according to its
structure and its behavior. When referring to its
structure, we can consider questions such as the dimension of
the system (number of variables, fuzzy sets, rules, etc.)
as well as other aspects related to properties of its
components (distinguishability of the fuzzy sets,
redundancy of the fuzzy rules, etc.). On the other hand,
the characterization related to the behavior mostly
analyzes properties considering the input-output relation
defined by the FRBS. In this area, we can include
questions such as stability or accuracy. Finally, there is a third
question that simultaneously involves structure and
behavior. This question is interpretability, a central aspect
in fuzzy systems design that is considered in an
independent chapter.


13.4.1 FRBS Properties 


All the structural properties to be mentioned are related 
to properties of the KB, and basically cover charac- 
teristics related to the individual fuzzy sets, the fuzzy 
partitions related to each input and output variable, the 
fuzzy rules, and the rule set as a whole. 

The elemental components of the KB are fuzzy sets.
At this level, we have several questions to be analyzed,
such as normality, convexity, or differentiability of fuzzy
sets; all of them are related to the properties of the
membership function ($\mu_A(x)$) defining the fuzzy set (A).
In most applications the considered fuzzy sets adopt
predefined shapes such as triangular, trapezoidal, Gaussian,
or bell; the fuzzy sets are then defined by only changing
some parameters of these parameterized functions. In
summary, most fuzzy sets considered in FRBSs are normal
and convex sets belonging to one of two possible
families: piecewise linear functions and differentiable
functions. Piecewise linear functions are basically
triangular and trapezoidal functions, offering a reduced
complexity from the processing point of view. On the
other hand, differentiable functions are mainly Gaussian,
bell, and sigmoidal functions, which are better adapted
to some kinds of differential learning approaches such as
those used in neuro-fuzzy systems, but add complexity
from the processing point of view.

Once individual fuzzy sets have been considered, 
the following level is that of fuzzy partitions related 
to each variable. The main characteristics of a fuzzy 
partition are cardinality, coverage, and distinguishabil- 
ity. Cardinality corresponds to the number of fuzzy sets 
that compose the fuzzy partitions. In most cases, this 
number ranges from 3 to 9, with 9 being an upper limit 
commonly accepted after the ideas of Miller [13.44]. 
The larger the number of fuzzy sets in the partition, the
more difficult the design and interpretation of the FRBS.
Coverage corresponds to the minimum membership degree
with which any value of the variable ($x$), through
its universe of discourse ($U$), will be assigned to at least
one fuzzy set ($A_i$) in the partition. Coverage is then defined
as

$$\min_{x \in U} \; \max_{i=1,\dots,n} \mu_{A_i}(x) , \qquad (13.30)$$

where $n$ is the cardinality of the partition. As an example,
the fuzzy partition in Fig. 13.2 has a coverage of 0.5. 
Finally, distinguishability of fuzzy sets is related to the 
level of overlapping of their membership functions, be- 
ing analyzed with different expressions. 
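
As a minimal sketch of how the coverage in (13.30) could be evaluated numerically, the following Python fragment builds a uniformly distributed strong partition of triangular fuzzy sets and approximates the min-max expression on a grid. The function names, the universe [0, 10], and the grid resolution are illustrative assumptions and do not come from the chapter.

```python
def triangular(x, a, b, c):
    """Triangular membership function with feet a, c and peak b (a < b < c)."""
    return max(0.0, min((x - a) / (b - a), (c - x) / (c - b)))


def uniform_strong_partition(lo, hi, n):
    """Parameters (a, b, c) of n evenly spaced triangular sets on [lo, hi].

    The two outermost triangles extend beyond the universe so that the
    memberships add up to 1 everywhere inside [lo, hi].
    """
    step = (hi - lo) / (n - 1)
    return [(lo + (i - 1) * step, lo + i * step, lo + (i + 1) * step)
            for i in range(n)]


def coverage(partition, lo, hi, samples=1001):
    """Grid approximation of min over x in U of max over i of mu_{A_i}(x)."""
    grid = [lo + (hi - lo) * k / (samples - 1) for k in range(samples)]
    return min(max(triangular(x, *params) for params in partition)
               for x in grid)


# Five sets on the (assumed) universe [0, 10].
partition = uniform_strong_partition(0.0, 10.0, 5)
print(coverage(partition, 0.0, 10.0))
```

For a strong partition of this kind the minimum is reached halfway between two neighboring peaks, where both sets have membership 0.5, which agrees with the value quoted for Fig. 13.2.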

On the basis of the fuzzy sets and fuzzy partitions, 
the fuzzy rules are built. The first structural question 
regarding fuzzy rules is the type of fuzzy rule to be con- 
sidered: Mamdani, TSK, descriptive or approximate, 
DNF, etc. If we now consider the interaction between
the different fuzzy rules of a fuzzy system, questions such as
knowledge consistency or redundancy appear, i.e., does
a fuzzy system include pieces of knowledge (usually
rules) providing contradictory (or redundant) information
for a specific situation? Finally, when considering
the rule base as a whole, completeness and complex- 
ity are to be considered. Completeness refers to the 
fact that any potential input value will fire at least one 
rule. 
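
The completeness property can be illustrated with a small check, given here only as a sketch under stated assumptions: two input variables sharing the same three triangular terms on [0, 10], conjunction modeled by the minimum, and a rule regarded as fired when its degree exceeds a threshold. The term names, parameters, and the grid-based test are hypothetical choices made for illustration; rule consequents are omitted because they play no role in completeness.

```python
from itertools import product


def triangular(x, a, b, c):
    """Triangular membership function with feet a, c and peak b (a < b < c)."""
    return max(0.0, min((x - a) / (b - a), (c - x) / (c - b)))


# Assumed linguistic terms, shared by both input variables.
TERMS = {
    "low": (-5.0, 0.0, 5.0),
    "medium": (0.0, 5.0, 10.0),
    "high": (5.0, 10.0, 15.0),
}


def firing_degree(rule, x1, x2):
    """Degree to which the antecedent (term1, term2) matches (x1, x2)."""
    term1, term2 = rule
    return min(triangular(x1, *TERMS[term1]), triangular(x2, *TERMS[term2]))


def is_complete(rules, grid, threshold=0.0):
    """True if every grid point fires at least one rule above the threshold."""
    return all(
        any(firing_degree(rule, x1, x2) > threshold for rule in rules)
        for x1, x2 in product(grid, repeat=2)
    )


grid = [0.5 * k for k in range(21)]          # 0.0, 0.5, ..., 10.0
full_rule_base = list(product(TERMS, repeat=2))
partial_rule_base = [r for r in full_rule_base if r != ("high", "high")]

print(is_complete(full_rule_base, grid))     # all nine antecedent combinations
print(is_complete(partial_rule_base, grid))  # the input (10, 10) fires no rule
```

A grid test of this kind can only reveal incompleteness at the sampled points; it is not a proof of completeness over the whole input domain.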

Considering now behavioral properties, the most
widely analyzed are stability and accuracy. It is also
possible to take into account other properties such as continuity
or robustness, but we will concentrate on those
with the largest presence in the literature. Behavioral
properties are related to the overall system, i.e., to the
processing structure as well as to the KB.

Stability is a key aspect of dynamical systems analysis
and plays a central role in control theory. FRBSs
are nonlinear dynamical systems, and after their early application
to control problems, the absence of a formal
stability analysis was seriously criticized. As a consequence,
the stability question received significant
attention from the very beginning, and is at present
a widely studied problem [13.45] for both Mamdani
and TSK fuzzy systems, considering the use of different
approaches such as Lyapunov's methods, the Popov criterion, or
norm-based analysis, among others.

Another question with a continuous presence in the
literature is that of accuracy and the related
concept of universal approximation. The idea of fuzzy
systems as universal approximators states that, given
any continuous real-valued function on a compact subset
of $\mathbb{R}^n$, we can, at least in theory, find an FRBS that
approximates this function to any degree of accuracy. This property
has been established for different types of fuzzy
systems [13.46–48]. On this basis, the idea of building
fuzzy models with an unbounded level of accuracy
can be considered. In any case, it is important to notice
that those papers prove the existence of such a model
while assuming at the same time an unbounded complexity,
i.e., the number of fuzzy sets and rules involved in
the fuzzy system will usually grow as the accuracy im- 
proves. That means that improving accuracy is possible 
but always with a cost related either to the complexity 
of the system or to the relaxation of some of its proper- 
ties (usually interpretability). 
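
The accuracy-complexity relation described above can be observed in a toy experiment. The sketch below is an illustration only and is not the construction used in [13.46–48]: it assumes a one-input zero-order TSK system whose antecedents form a uniform strong triangular partition and whose rule consequents are the target function evaluated at the peaks; the target (the sine function on [0, 2π]) and all names are hypothetical.

```python
import math


def triangular(x, a, b, c):
    """Triangular membership function with feet a, c and peak b (a < b < c)."""
    return max(0.0, min((x - a) / (b - a), (c - x) / (c - b)))


def build_tsk(f, lo, hi, n_rules):
    """Zero-order TSK model with rules: IF x is A_i THEN y = f(peak_i)."""
    step = (hi - lo) / (n_rules - 1)
    peaks = [lo + i * step for i in range(n_rules)]
    consequents = [f(p) for p in peaks]

    def predict(x):
        weights = [triangular(x, p - step, p, p + step) for p in peaks]
        # Weighted average of the rule consequents.
        return sum(w * y for w, y in zip(weights, consequents)) / sum(weights)

    return predict


def max_error(f, model, lo, hi, samples=2001):
    """Largest absolute error of the model on an evenly spaced grid."""
    grid = [lo + (hi - lo) * k / (samples - 1) for k in range(samples)]
    return max(abs(f(x) - model(x)) for x in grid)


lo, hi = 0.0, 2.0 * math.pi
errors = {n: max_error(math.sin, build_tsk(math.sin, lo, hi, n), lo, hi)
          for n in (5, 9, 17, 33)}
for n, err in errors.items():
    print(n, err)
```

In this setting the error shrinks as rules are added, while the number of fuzzy sets a reader has to interpret grows accordingly, which is the cost pointed out in the text.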


13.4.2 Designing FRBSs 


Given a modeling, classification, or control problem to
be solved, and assuming it will be addressed through
an FRBS, there are several steps in the design
process. The first decision is the choice between the
different types of systems mentioned in Sect. 13.2, particularly
the Mamdani and TSK approaches. They offer
different characteristics related to questions such as
accuracy and interpretability, as well as different methods
for the derivation of their KB.

Once a type of FRBS has been chosen, its design
implies the construction of its processing structure as
well as the derivation of its KB. Even though
there are several options to modify the
processing structure of the system (Sect. 13.1.2),
most designers consider a standard inference engine
and concentrate on the knowledge extraction problem.

Turning now to the knowledge extraction problem,
some of its parts are common to any modeling process
(fuzzy or not). Questions such as the selection of
the input and output variables and the determination of
the range of those variables are generic to any modeling
approach. The specific aspects related to the fuzzy
environment are the definition of the fuzzy sets or the 
fuzzy partition related to each of those variables, and 
the derivation of a suitable set of fuzzy rules. These 
two components can be jointly derived in a single pro- 
cess, or sequentially performed by considering first the 
design of the fuzzy partition associated with each vari- 
able and then the fuzzy rules. The design process can 
be based on two main sources of information: expert 
knowledge and experimental data. 

If we first consider the definition of fuzzy sets and
fuzzy partitions, quite different approaches [13.49] can
be applied. Even the idea of simply generating a uniformly
distributed strong fuzzy partition of a certain
cardinality is widely considered.

Turning now to rules, Mamdani FRBSs are particularly
well suited to expert knowledge extraction, and
knowledge elicitation for that kind of system has been
widely considered in the literature. In any case, there
is no standard methodology for fuzzy knowledge
extraction from experts, and at present most practical
works consider either a direct data-driven approach or
the integration of expert and data-driven knowledge
extraction [13.50].

When considering data-driven knowledge extraction,
there is an almost endless list of approaches.
Some options are the use of ad hoc methods based
on data covering measures (as in [13.46]), the generation
of fuzzy decision trees [13.51], the use of clustering
techniques [13.52], and the use of hybrid systems,
where genetic fuzzy systems [13.53] and neuro-fuzzy
systems [13.54] represent the most widely considered
approaches to fuzzy systems design. Some of those
techniques produce both the partitions (or fuzzy sets)
and the rules in a single process.


13.5 Conclusions


Fuzzy rule-based systems constitute a tool for representing
knowledge and reasoning on it. Jointly with
fuzzy clustering techniques, FRBSs are probably the
developments of fuzzy sets theory leading to the largest
number of applications. These systems, being a kind of
rule-based system, can be analyzed as knowledge-based
systems showing a structure with two main components:
knowledge and processing. The processing structure
relies on many concepts presented in previous
chapters, such as fuzzy implications, connectives, relations,
and so on. In addition, some new concepts such as fuzzification
and defuzzification are required when constructing
a fuzzy rule-based system. But the central concept
of fuzzy rule-based systems is the fuzzy rule. Different
types of fuzzy rules have been considered, particularly
those having a fuzzy (or not) consequent, producing different
types of FRBS. In addition, new formulations are
being considered, e.g., implicative rules. Eventually, the
representation capabilities of fuzzy sets have been considered
as too limited to represent some specific kinds
of knowledge or information, and some extended types
of fuzzy sets have been defined. Type-2 fuzzy sets are
an example of an extension of fuzzy sets.


As has been said, FRBSs are knowledge-based
systems and, as a consequence, their design involves,
apart from aspects related to the processing structure,
the elicitation of a suitable KB properly describing the
way to solve the problem under consideration. Even
considering the large number of problems solved using
FRBSs, there is no clear design methodology
defining a well-established design protocol. In addition,
two completely different sources of knowledge,
requiring different extraction approaches, have been 
considered when building FRBSs: expert knowledge 
and data. Many expert and data-driven knowledge ex- 
traction techniques and methods are described in the 
literature and can be considered. Connected to this 
question, as part of the process to provide automatic 
knowledge extraction capabilities to FRBSs, many hy- 
brid approaches have been proposed, genetic fuzzy 
systems and neuro-fuzzy systems being the most widely 
considered. 

In summary, FRBSs are a powerful tool to solve 
real world problems, but many theoretical aspects and 
design questions remain open for further investigation.


14. Interpretability of Fuzzy Systems:
Current Research Trends and Prospects


Jose M. Alonso, Ciro Castiello, Corrado Mencar 


Fuzzy systems are universally acknowledged as 
valuable tools to model complex phenomena 
while preserving a readable form of knowledge 
representation. The resort to natural language for 
expressing the terms involved in fuzzy rules, in 
fact, is a key factor to conjugate mathematical [...]
[...]gated, it appears that (a) the simple adoption of
fuzzy sets in modeling is not enough to ensure
interpretability; (b) fuzzy knowledge representation
must confront the problem of preserving the
overall system accuracy, thus yielding a trade-off [...]
the current literature panorama of computational
intelligence. This chapter gives an overview of the
topics related to fuzzy system interpretability, fac- [...]
considering? How to ensure interpretability, and
how to assess (quantify) it? Finally, how to design 
interpretable fuzzy models? 

The objective of this chapter is to provide some 
answers for the questions posed above. Section 14.1 
deals with the challenging task of setting a proper 
definition of interpretability. Section 14.2 intro- 
duces the main constraints and criteria that can 
be adopted to ensure interpretability when de- 
signing interpretable fuzzy systems. Section 14.3 
gives a brief overview of the soundest indexes for 


assessing interpretability. Section 14.4 presents 
the most popular approaches for designing fuzzy 
systems endowed with a good interpretability- 
accuracy trade-off. Section 14.5 enumerates some 
application fields where interpretability is a main 
concern. Section 14.6 sketches a number of chal- 
lenging tasks which should be addressed in the 
near future. Finally, some conclusions are drawn 
in Sect. 14.7.


The key factor for the success of fuzzy logic lies
in its ability to model and process perceptions
instead of measurements [14.1]. In most cases, such
perceptions are expressed in natural language. Thus, 
fuzzy logic acts as a mathematical underpinning for 
modeling and processing perceptions described in nat- 
ural language. 

Historically, it has been acknowledged that fuzzy
systems are endowed with the capability to combine
a complex behavior with a simple description in terms
of linguistic rules. In many cases, the compilation of
fuzzy systems has been accomplished manually, with
human knowledge purposely injected in fuzzy rules in
order to model the desired behavior (the rules could
later be tuned to improve the system accuracy).
In addition, the great success of fuzzy logic led to the 
development of many algorithms aimed at acquiring 
knowledge from data (expressing it in terms of fuzzy 
rules). This made the automatic design of fuzzy sys- 
tems (through data-driven design techniques) feasible. 
Moreover, theoretical studies proved the universal ap- 
proximation capabilities of such systems [14.2]. 


The adoption of data-driven design techniques is
a common practice nowadays. Nevertheless, while
fuzzy sets can be generally used to model perceptions,
some of them do not lead to a straightforward interpretation
in natural language. In consequence, the adoption of
accuracy-driven algorithms for acquiring knowledge
from data often results in unintelligible models. In
those cases, the fundamental advantage of fuzzy logic is
lost and the derived models are comparable to other
measurement-based models (like neural networks) in
terms of knowledge interpretability.

In a nutshell, interpretability is not guaranteed by
the adoption of fuzzy logic, which represents a necessary
yet not sufficient requirement for modeling
and processing perceptions. However, interpretability
is a quality that is not easy to define and quantify.
Several open and challenging questions arise while considering
interpretability in fuzzy modeling: What is
interpretability? Why is interpretability worth considering?
How to ensure interpretability? How to assess
(quantify) interpretability? How to design interpretable
fuzzy models? And so on.


14.1 The Quest for Interpretability


Answering the question What is interpretability? is
not straightforward. Defining interpretability is a challenging
task since it deals with the analysis of the
relation occurring between two heterogeneous entities:
a model of the system to be designed (usually formalized
through a mathematical definition) and a human
user (meant not as a passive beneficiary of a system’s
outcome, but as an active reader and interpreter of the
model’s working engine). In this sense, interpretability
is a quality which is inherent in the model and yet it
refers to an act performed by the user who is willing to
grasp and explain the meaning of the model.

To pave the way for the definition of such a relation,
a common ground must be established. This could be represented
by a number of fundamental properties to be
incorporated into a model, so that its formal description
becomes compatible with the user’s knowledge representation.
In this way, the human user may interface with the
mathematical model by relying on concepts that appear to
be suitable to deal with it. The quest for interpretability,
therefore, calls for the identification of several features.
Among them, resorting to an appropriate framework for
knowledge representation is a crucial element, and the
adoption of a fuzzy inference engine based on fuzzy
rules is a straightforward way to approach the linguistic-based
formulation of concepts which is typical of human
abstract thought.

A distinguishing feature of a fuzzy rule-based 
model is the double level of knowledge representation. 
The lower level of representation is constituted by the 
formal definition of the fuzzy sets in terms of their 
membership functions, as well as the aggregation func- 
tions used for inference. This level of representation 
defines the semantics of a fuzzy rule-based model as 
it determines the behavior of the model, i.e. the in- 
put/output mapping for which it is responsible. 

On the higher level of representation, knowledge is 
represented in the form of rules. They define a formal 
structure where linguistic variables are involved and re- 
ciprocally connected by some formal operators, such as 
AND, THEN, and so on. Linguistic variables correspond 
to the inputs and outputs of the model. The (sym- 
bolic) values they assume are related to linguistic terms 
which, in turn, are mapped to the fuzzy sets defined 
in the lower level of representation. The formal oper- 
ators are likewise mapped to the aggregation functions. 
This mapping provides the interpretative transition that 
is quite common in the mathematical context: a formal
structure is assigned semantics by mapping symbols
(linguistic terms and operators) to objects (fuzzy sets 
and aggregation functions). 
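
The double level of representation can be made concrete with a small sketch, offered here as an assumption-laden illustration rather than as the authors' formalization. The variable names, linguistic terms, triangular parameters, and the single rule are all hypothetical. The rule is stored purely symbolically, and its behavior appears only once terms are mapped to fuzzy sets and the AND operator is mapped to an aggregation function.

```python
def triangular(x, a, b, c):
    """Triangular membership function with feet a, c and peak b (a < b < c)."""
    return max(0.0, min((x - a) / (b - a), (c - x) / (c - b)))


# Lower level (semantics): linguistic terms are mapped to fuzzy sets.
FUZZY_SETS = {
    "temperature": {"cold": (-20.0, 0.0, 20.0), "warm": (0.0, 20.0, 40.0)},
    "humidity": {"low": (-50.0, 0.0, 50.0), "high": (50.0, 100.0, 150.0)},
}


def product_and(u, v):
    """Alternative interpretation of AND as the product t-norm."""
    return u * v


# Higher level (symbols): the rule mentions only variables, terms, operators.
RULE = {
    "if": [("temperature", "cold"), ("humidity", "high")],  # joined by AND
    "then": ("heating", "strong"),
}


def firing_strength(rule, inputs, and_operator):
    """Evaluate the antecedent by mapping symbols to their semantics."""
    degrees = [triangular(inputs[var], *FUZZY_SETS[var][term])
               for var, term in rule["if"]]
    strength = degrees[0]
    for degree in degrees[1:]:
        strength = and_operator(strength, degree)
    return strength


inputs = {"temperature": 10.0, "humidity": 80.0}
print(firing_strength(RULE, inputs, min))          # AND mapped to minimum
print(firing_strength(RULE, inputs, product_and))  # AND mapped to product
```

The same symbolic rule yields different behavior under the two mappings of AND, which illustrates that the semantics reside in the lower level, while readability depends on the symbols chosen at the higher level.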

In principle, the mapping of linguistic terms to 
fuzzy sets is arbitrary. It just suffices that identical lin- 
guistic terms are mapped to identical fuzzy sets. Of 
course, this is not completely true for formal opera- 
tors (e.g., t-norms, implications, etc.). The correspond- 
ing aggregation functions should satisfy a number of 
constraints; however some flexibility is possible. Nev- 
ertheless, the mere use of symbols in the high level of 
knowledge representation implies the establishment of 
a number of semiotic relations that are fundamental for 
the quest for interpretability of a fuzzy model. In particular,
linguistic terms, as usually picked from natural
language, must be fully meaningful for the expected
reader since they denote concepts, i.e., mental representations
that allow people to draw appropriate inferences
about the entities they encounter. 

Concepts and fuzzy sets, therefore, are both denoted 
by linguistic terms. Additionally, concepts and fuzzy 
sets play a similar role: the former (being part of the 
human knowledge) contribute to determine the behav- 
ior of a person; the latter (being the basic elements of 
a fuzzy rule base) contribute to determine the behavior 
of a system to be modeled. As a consequence, concepts 
and fuzzy sets are implicitly connected by means of 
common linguistic terms they are related to, which re- 
fer to object classes in the real world. The key essence 
of interpretability is therefore the property of cointen- 
sion [14.3] between fuzzy sets and concepts, consisting 
in the possibility of referring to similar classes of ob- 
jects: such a possibility is assured by the use of common 
linguistic terms. 

Semantic cointension is a key issue when dealing 
with interpretability of fuzzy systems. It has been in- 
troduced and centered on the role of fuzzy sets, but 
it can be easily extended to refer to some more com- 
plex structures, such as fuzzy rules or the whole fuzzy 
models. In this regard, a crisp assertion about the importance
of cointension at the level of the
whole model is given by Michalski's Comprehensibility
Postulate [14.4]:


The results of computer induction should be sym- 
bolic descriptions of given entities, semantically 
and structurally similar to those a human expert 
might produce observing the same entities. Com- 
ponents of these descriptions should be compre- 
hensible as single chunks of information, directly 
interpretable in natural language, and should relate
quantitative and qualitative concepts in an integrated
fashion.


It should be observed that the above postulate 
has been formulated in the general area of machine 
learning. Nevertheless, the assertion made by Michal- 
ski has important consequences in the specific area 
of fuzzy modeling (FM) too. According to the Com- 
prehensibility Postulate, results of computer induction 
should be described symbolically. Symbols are nec- 
essary to communicate information and knowledge; 
hence, pure numerical methods, such as neural net- 
works, are not suited for meeting interpretability unless 
an interpretability-oriented postprocessing of the result- 
ing knowledge is performed. 

The key point of Michalski's postulate is the
human centrality of the results of a computer induction
process. The importance of the human component
nent implicitly suggests a novel aspect to be taken 
into account in the quest for interpretability. Actu- 
ally, the semantic cointension is related to one facet 
of the interpretability process, which can be referred 
to as comprehensibility of the content and behavior of 
a fuzzy model. In other words, cointension concerns 
the semantic interpretation performed by a user de- 
termined to comprehend such a model. On the other 
hand, when we turn to consider the cognitive capa- 
bilities of human brains and their intrinsic limitations, 
then a different facet of the interpretability process 
can be defined in terms of readability of the bulk 
of information conveyed by a fuzzy model. In that 
case, simplicity is required to perform the interpretation 
process because of the limited ability to store informa- 
tion in the human brain’s short-term memory [14.5]. 
Therefore, structural measures concerning the com- 
plexity of a rule base affect the cognitive efforts of 
a user determined to read and interpret a fuzzy model. 


Comprehensibility and readability represent two 
facets of a common issue and both of them are to 
be considered while assessing the interpretability pro- 
cess. In particular, this distinction should be acknowl- 
edged when criteria are specifically designed to provide 
a quantitative definition of interpretability. 


14.1.1 Why Is Interpretability So Important? 


A great number of inductive modeling techniques are 
currently available to acquire knowledge from data. 
Many of these techniques provide predictive models 
that are very accurate and flexible enough to be applied
in a wide range of applications. Nevertheless, the resulting
models are usually considered as black boxes,
i.e. models whose behavior cannot be easily explained 
in terms of the model structure. On the other hand, the 
use of fuzzy rule-based models is a matter of design 
choice: whenever interpretability is a key factor, fuzzy 
rule-based models should be naturally preferred. It is 
worth noting that interpretability is a distinguishing fea- 
ture of fuzzy rule-based models. Several reasons justify 
a choice inclined toward interpretability. They include 
but are not limited to: 


• Integration: In an interpretable fuzzy rule-based
model the acquired knowledge can be easily verified
and related to the domain knowledge of a human
expert. In particular, it is easy to verify whether
the acquired knowledge expresses new and interesting
relations about the data; also, the acquired
knowledge can be refined and integrated with expert
knowledge.

• Interaction: The use of natural language as a means
of knowledge communication enables
interaction between the user and the model.
Interactivity is meant to explore the acquired knowledge.
In practice, it can be done at the symbolic level
(by adding new rules or modifying existing ones)
and/or at the numerical level (by modifying the fuzzy
sets denoted by linguistic terms, or by adding new
linguistic terms denoting new fuzzy sets).

• Validation: The acquired knowledge can be easily
validated against common-sense knowledge and
domain-specific knowledge. This capability enables
the detection of semantic inconsistencies that may
have different causes (misleading data involved in
the inductive process, a local minimum where the
inductive process may have been trapped, data overfitting,
etc.). This kind of anomaly detection is
important to drive the inductive process toward
a qualitative improvement of the acquired knowledge.

• Trust: The most important reason to adopt interpretable
fuzzy models is their inherent ability to
convince end users about the reliability of a model
(especially those users not concerned with knowledge
acquisition techniques). An interpretable fuzzy
rule-based model is endowed with the capability of
explaining its inference process so that users may
be confident about how it produces its outcomes. This
is particularly important in domains such as medical
diagnosis, where a human expert is ultimately
responsible for a decision.


14.1.2 A Historical Review 


It has been a long time since Zadeh’s seminal work on
fuzzy sets [14.6], and nowadays there are many fruitful
research lines related to fuzzy logic [14.7]. Hence,
we can state that fuzzy sets and systems have become 
the subjects of a mature research field counting several 
works both theoretical and applied in their scope. Fig- 
ure 14.1 shows the distribution of publications per year 
regarding interpretability issues. Three main phases can 
be identified taking into account the historical evolution 
of FM. 


From 1965 to 1990 

During this initial period, interpretability emerged 
naturally as the main advantage of fuzzy systems. 
Researchers concentrated on building fuzzy models 
mainly working with expert knowledge and a few sim- 
ple linguistic variables [14.8—10] and linguistic rules 
usually referred to as Mamdani rules [14.11]. As a re- 
sult, those designed fuzzy models were characterized 
by their high interpretability. Moreover, interpretability
was assumed to be an intrinsic property of fuzzy systems.
Therefore, there were only a few publications regarding
interpretability issues. Note that the first proposal
of a fuzzy rule-based system (FRBS) was presented 
by Mamdani who was able to augment Zadeh’s initial 
formulation allowing the application of fuzzy systems 
to a control problem. These kinds of fuzzy systems 
are also referred to as fuzzy logic controllers, as pro- 
posed by the author in his pioneering paper. In addition, 
Mamdani-type FRBSs soon became the main tool to de- 
velop linguistic models. Of course, many other rule for- 
mats were arising and gaining importance. In addition 
to Mamdani FRBSs, probably the most famous FRBSs 
are those proposed by Takagi and Sugeno [14.12], the 
popular TSK fuzzy systems, where the conclusion is 
a function of the input values. Due to their current popu- 
larity, in the following we will use the term fuzzy system 
to denote Mamdani-type FRBSs and their subsequent 
extensions. 


From 1990 to 2000

In the second period the focus was set on accuracy.
Researchers realized that expert knowledge was not
enough to deal with complex systems. Thus, they explored
the use of fuzzy machine learning techniques
to automatically extract knowledge from data [14.13,
14]. Accordingly, the resulting fuzzy models became
composed of extremely complicated fuzzy rules with
high accuracy, but at the cost of disregarding interpretability
as a side effect. Obviously, automatically
generated rules were rarely as readable as desired.
During this period some researchers started claiming
that fuzzy models are not interpretable per se: interpretability
is a matter of careful design. Thus, interpretability
issues must be deeply analyzed and seriously
discussed. Although the number of publications related
to interpretability issues was still small in this period,
note that publications began
to grow exponentially at the end of this second phase.

Fig. 14.1 Publications per year related to interpretability issues


From 2000 to 2013 
After the two previous periods, researchers realized that 
both expert-driven (from 1965 to 1990) and data-driven 
(from 1990 to 2000) design approaches have their own 
advantages and drawbacks, but they are somehow com- 
plementary. For instance, expert knowledge is general 
and easy to interpret but hard to formalize. On the con- 
trary, knowledge derived from data can be extracted 
automatically but it becomes quite specific and its inter- 
pretation is usually hard [14.15]. Moreover, researchers 
were aware of the need of taking into account simulta- 
neously interpretability and accuracy during the design 
of fuzzy models. As a result, during this third phase 
the main challenge was how to combine expert knowl- 
edge and knowledge extracted from data, with the aim 
of designing compact and robust systems with a good 
interpretability—accuracy trade-off. When considering 
both interpretability and accuracy in FM, two main 
strategies turn up naturally [14.16]: linguistic fuzzy 
modeling (LFM) and precise fuzzy modeling (PFM). On 
the one hand, in LFM, designers first focus on the inter- 
pretability of the model, and then they try to improve its 
accuracy [14.17]. On the other hand, in PFM, designers
first build a fuzzy model maximizing its accuracy,
and then they try to improve its interpretability [14.18]. 
As an alternative, since accuracy and interpretability 
represent conflicting goals by nature, multiobjective 
fuzzy modeling strategies (considering accuracy and 
interpretability as objectives) have become very popu- 
lar [14.19, 20]. 

At the same time, there has been a great effort 
for formalizing interpretability issues. As a result, the 
number of publications has grown. Researchers have 
actively looked for the right definition of interpretabil- 
ity. In addition, several interpretability constraints have 
been identified. Moreover, interpretability assessment 
has become a hot research topic. In fact, several in- 
terpretability indexes (able to guide the FM design 
process) have been defined. Nevertheless, a universal 
index widely admitted is still missing. Hence, further 
research on interpretability issues is demanded. 

Unfortunately, although the number of publications
was growing exponentially until 2009, it later started
decreasing. We would like to emphasize the impact of
the two pioneering books [14.17, 18] edited in 2003. They
contributed to make the fuzzy community aware of the 
need to take into account again interpretability as a main 
research concern. It is worth noting that the first formal 
definition of interpretability (in the fuzzy literature) was 
included in [14.18]. It was given by Bodenhofer and 
Bauer [14.21] who established an axiomatic treatment 
of interpretability at the level of linguistic variables. 

We encourage the fuzzy community to keep pay- 
ing attention to interpretability issues because there is 
still a lot of research to be done. Interpretability must
be the central point in system modeling. In fact, some
of the hottest and most recent research topics, like precisiated
natural language, computing with words, and
human-centric computing, strongly rely on the interpretability
of the designed models. The challenge is
to better exploit fuzzy logic techniques for improving
the human-centric character of many intelligent systems.
Therefore, interpretability deserves consideration
as a main research concern and the number of publications
should grow again in the next years.


14.2 Interpretability Constraints and Criteria 


Interpretability is a quality of fuzzy systems that is not
straightforward to quantify. Nevertheless, a quantitative definition
is required both for assessing the interpretability
of a fuzzy system and for designing new fuzzy systems. 
This requirement is especially stringent when fuzzy 
systems are automatically designed from data, through 
some knowledge extraction procedure. 

A common approach for defining interpretability 
is based on the adoption of a number of constraints 
and criteria that, taken as a whole, provide for a def- 
inition of interpretability. This approach is inherent 
to the subjective nature of interpretability, because 
the validity of some conditions/criteria is not univer- 
sally acknowledged and may depend on the application 
context. 

In the literature, a large number of interpretability 
constraints and criteria can be found. Some of them 
are widely accepted, while others are controversial. The 
nature of these constraints and criteria is also diverse. 
Some are neatly defined as a mathematical condition, 
others have a fuzzy character and their satisfaction is 
a matter of degree. This section aims to give
a brief yet homogeneous outline of the best known
interpretability constraints and criteria. The reader is referred
to the specialized literature for deeper insights on
this topic [14.22, 23]. 

Several ways are available to categorize inter- 
pretability constraints and criteria. It could be possible 
to refer to their specific nature (e.g., crisp vs. fuzzy), 
to the components of the fuzzy system where they are 
applied, or to the description level of the fuzzy system 
itself. Here, as depicted in Fig. 14.2, we choose a hi- 
erarchical organization that starts from the most basic 
components of a fuzzy system, namely the involved 
fuzzy sets, and goes on toward more complex levels, 
such as fuzzy partitions, fuzzy rules, up to considering 
the model as a whole. 


14.2.1 Constraints and Criteria for Fuzzy Sets 


Fuzzy sets are the basic elements of fuzzy systems and 
their role is to express elementary yet imprecise con- 
cepts that can be denoted by linguistic labels. Here 
we assume that fuzzy sets are defined on a universe 
of discourse represented by a closed interval of the 
real line (this is the case of most fuzzy systems, espe- 
cially those acquired from data). Thus, fuzzy sets are 
the building blocks for translating a numerical domain into
a linguistically quantified domain that can be used to
communicate knowledge. 

Generally speaking, single fuzzy sets are employed 
to express elementary concepts and, through the use of 
connectives, are combined to represent more complex 
concepts. However, not all fuzzy sets can be related to 
elementary concepts, since the membership function of 
a fuzzy set may be very awkward but still legitimate 
from a mathematical viewpoint. Actually, a subclass of 
fuzzy sets should be considered, so that its members 
can be easily associated with elementary concepts and 
tagged by the corresponding linguistic labels. Fuzzy 
sets of this subclass must verify a number of basic in- 
terpretability constraints, including: 


• Normality: At least one element of the universe
of discourse is a prototype for the fuzzy set, i.e.
it is characterized by a full membership degree.
A normal fuzzy set represents a concept that fully
qualifies at least one element of the universe of
discourse, i.e. the concept has at least one example that
fulfills it. On the other hand, a subnormal fuzzy set
is usually a consequence of a partial contradiction
(it is easy to show that the degree of inclusion of
a subnormal fuzzy set in the empty set is nonzero).

• Continuity: The membership function is continuous
on the universe of discourse. As a matter of
fact, most concepts that can be naturally represented
through fuzzy sets derive from a perceptual act,
which comes from external stimuli that usually vary
continuously. Therefore, continuous fuzzy sets are
better in accordance with the perceptive nature of
the represented concepts.

• Convexity: In a convex fuzzy set, given three elements
linearly placed on the axis related to the
universe of discourse, the degree of membership of 
the middle element is always greater than or equal 
to the minimum membership degree of the side ele- 
ments [14.24]. This constraint encodes the rule that 
if a property is satisfied by two elements, then it is 
also satisfied by an element settled between them. 
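
As an illustration, normality and convexity can be checked
numerically on a membership function sampled over an ordered
grid. The following Python sketch is only one possible check:
the function names, the sampling grid, and the tolerance are
assumptions introduced here, and continuity is not tested
because it cannot be decided from samples alone.

```python
def is_normal(mu, tol=1e-9):
    """Normality: at least one sampled element has full membership."""
    return max(mu) >= 1.0 - tol


def is_convex(mu, tol=1e-9):
    """Convexity on an ordered grid: degrees never decrease before the
    first maximum and never increase after it."""
    peak = mu.index(max(mu))
    rising = all(b >= a - tol for a, b in zip(mu[:peak], mu[1:peak + 1]))
    falling = all(b <= a + tol for a, b in zip(mu[peak:], mu[peak + 1:]))
    return rising and falling


def triangle(x, center, half_width):
    """Symmetric triangular membership function (illustrative shape)."""
    return max(0.0, 1.0 - abs(x - center) / half_width)


# Universe of discourse [0, 10], sampled every 0.1 (hypothetical grid).
xs = [i / 10.0 for i in range(101)]
single = [triangle(x, 5.0, 3.0) for x in xs]
subnormal = [0.6 * m for m in single]
two_humps = [max(triangle(x, 2.0, 1.0), triangle(x, 8.0, 1.0)) for x in xs]
```

Here `single` satisfies both constraints, `subnormal` is convex
but not normal, and `two_humps` is normal but not convex, so it
could hardly be tagged with a single elementary concept.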


14.2.2 Constraints and Criteria 
for Fuzzy Partitions 


The key success factor of fuzzy logic in modeling
is the ability to express knowledge linguistically.
Technically, this is realized by linguistic variables, i.e. 
variables that assume symbolic values called linguis- 
tic terms. The peculiarity of linguistic variables with 
respect to classical symbolic approaches is the interpre- 
tation of linguistic terms as fuzzy sets. The collection of 
fuzzy sets used as interpretation of the linguistic terms 
of a linguistic variable forms a fuzzy partition of the 
universe of discourse. 

To understand the role of a fuzzy partition, we 
should consider that it is meant to define a relation 
among fuzzy sets. Such a relation must be co-intensive 
with the one connecting the elementary concepts repre- 
sented by the fuzzy sets involved in the fuzzy partition. 
That is the reason why the design of fuzzy partitions 
is so crucial for the overall interpretability of a fuzzy 
system. The most critical interpretability constraints for 
fuzzy partitions are: 


• Justifiable number of elements: The number of
fuzzy sets included in a linguistic variable must be
small enough so that they can be easily remembered
and recalled by users. Psychological studies suggest
at most nine fuzzy sets or even fewer [14.5, 25]. Usually,
three to five fuzzy sets are convenient choices
for the partition cardinality.

• Distinguishability: Since fuzzy sets are denoted
by distinct linguistic terms, they should refer to
well-distinguished concepts. Therefore, fuzzy sets
in a partition should be well separated, although
some overlapping is admissible because
perception-based concepts are usually not completely
disjoint. Several alternatives are available to quantify
distinguishability, including similarity and
possibility [14.26].

• Coverage: Distinguishable fuzzy sets are necessary,
but if they are too widely separated they risk
under-representing some subset of the universe of
discourse. The coverage constraint requires that each
element of the universe of discourse belong to
at least one fuzzy set of the partition with a membership
degree not less than a threshold [14.22].
This requirement implies that each element of the
universe of discourse has some quality that is well
represented in the fuzzy partition. On the other
hand, a lack of coverage signals incompleteness
of the fuzzy partition, which may hamper the
overall comprehensibility of the system’s knowledge.
Coverage and distinguishability are somewhat
conflicting requirements that are usually balanced
by fuzzy partitions in which adjacent fuzzy sets
intersect at elements whose maximum
membership degree is equal to a threshold (usually
the value of this threshold is set to 0.5).

• Relation preservation: The concepts that are
represented by the fuzzy sets in a fuzzy partition are
usually cross related. The most immediate relation
that can be conceived among concepts is related
to the order (e.g., Low preceding Medium, preceding
High, and so on). Relations of this type must
be preserved by the corresponding fuzzy sets in the
fuzzy partition [14.27].

• Prototypes on special elements: In many problems,
some elements of the universe of discourse have
some special meaning. A common case is the meaning
of the bounds of the universe of discourse,
which usually represent some extreme qualities
(e.g., Very Large or Very Small). Other examples
are possible, which are unrelated to the bounds
of the universe of discourse and are, instead, more
problem-specific (e.g., prototypes could be conceived
for the freezing point of water, the typical
human body temperature, etc.). In all these cases,
the prototypes of some fuzzy sets of the partition
must coincide with such special elements. 
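
The coverage and distinguishability constraints above can be
made concrete with a small numerical check. The Python sketch
below is a minimal example under stated assumptions: three
triangular fuzzy sets (Low, Medium, High) on the universe
[0, 10], a sampled grid, and a Jaccard-type similarity used as
one possible distinguishability measure. All names and values
are hypothetical and do not come from the cited references.

```python
def triangular(x, a, b, c):
    """Triangular membership: support (a, c), prototype b, with a < b < c."""
    return max(0.0, min((x - a) / (b - a), (c - x) / (c - b)))


def coverage(partition, xs):
    """Smallest, over the universe, of the best membership degree."""
    return min(max(mf(x) for mf in partition) for x in xs)


def similarity(mf_a, mf_b, xs):
    """Jaccard-type similarity with min as intersection and max as union."""
    inter = sum(min(mf_a(x), mf_b(x)) for x in xs)
    union = sum(max(mf_a(x), mf_b(x)) for x in xs)
    return inter / union


# Universe [0, 10] sampled every 0.01. The outer sets extend beyond
# the universe so that its bounds are prototypes of Low and High.
xs = [i / 100.0 for i in range(1001)]
low = lambda x: triangular(x, -5.0, 0.0, 5.0)
medium = lambda x: triangular(x, 0.0, 5.0, 10.0)
high = lambda x: triangular(x, 5.0, 10.0, 15.0)
partition = [low, medium, high]

cov = coverage(partition, xs)               # adjacent sets cross at 0.5
sim_low_medium = similarity(low, medium, xs)
sim_low_high = similarity(low, high, xs)
```

In this example adjacent sets intersect at a membership degree
of 0.5, so coverage holds for a threshold of 0.5, while the low
similarity between adjacent sets indicates that they remain
distinguishable.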


14.2.3 Constraints and Criteria 
for Fuzzy Rules 


In most cases, a fuzzy system is defined over a multi- 
dimensional universe of discourse that can be split into 
many one-dimensional universes of discourse, each of 
them associated with a linguistic variable. A subset of 
these linguistic variables is used to represent the input 
of a system, while the remaining variables (usually only 
one variable) are used to represent the output. The in- 
put/output behavior is expressed in terms of rules. Each 
rule prescribes a linguistic output value when the input 
matches the rule condition (also called rule premise), 
usually expressed as a logical combination of soft con- 
straints. A soft constraint is a linguistic proposition 
(specification) that ties a linguistic variable to a linguis- 
tic term (e.g., Temperature is High). Furthermore, the 
soft constraints combined in a rule condition may in- 
volve different linguistic variables (e.g., Temperature is 
High AND Pressure is Low). 

A fuzzy rule is a unit of knowledge that has the 
twofold role of determining the system behavior and 
communicating this behavior in a linguistic form. The 
latter role calls for a number of interpretability
constraints, in addition to the constraints
required for fuzzy sets and fuzzy partitions. Some of the 
most general interpretability constraints and criteria for 
fuzzy rules are as follows: 


• Description length: The description length of
a fuzzy rule is the sum of the number of soft con- 
straints occurring in the condition and in the conse- 
quent of the rule (it is usually known as total rule 
length). In most cases, only one linguistic variable 
is represented in a rule consequent, therefore the de- 
scription length of a fuzzy rule is directly related 
to the complexity of the condition. A small number 
of soft constraints in a rule implies both high read- 
ability and semantic generality; hence, short rules 
should be preferred in fuzzy systems. 

• Granular outputs: The main strength of fuzzy systems
is their ability to represent and process imprecision
in both data and knowledge. Imprecision
is part of fuzzy inference, therefore the inferred
output of a fuzzy system should carry information
about the imprecision of its knowledge. This can be
accomplished by using fuzzy sets as outputs.
Defuzzification collapses fuzzy sets into single scalars;
it should therefore be used only when strictly necessary
and in those situations where outputs are not
the object of user interpretation. 
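
The loss of information caused by defuzzification can be seen
in a small example. The Python sketch below assumes
center-of-gravity defuzzification on a sampled output universe
(one common choice among several); the two output fuzzy sets
and all names are hypothetical.

```python
def centroid(ys, mu):
    """Center-of-gravity defuzzification of a sampled fuzzy set."""
    return sum(y * m for y, m in zip(ys, mu)) / sum(mu)


def support_width(ys, mu):
    """Width of the region with nonzero membership (a rough
    indicator of the imprecision carried by the fuzzy output)."""
    support = [y for y, m in zip(ys, mu) if m > 0.0]
    return max(support) - min(support)


# Output universe [0, 10] sampled every 0.01.
ys = [i / 100.0 for i in range(1001)]
narrow = [max(0.0, 1.0 - abs(y - 5.0) / 1.0) for y in ys]  # precise "about 5"
wide = [max(0.0, 1.0 - abs(y - 5.0) / 4.0) for y in ys]    # vague "about 5"

c_narrow, c_wide = centroid(ys, narrow), centroid(ys, wide)
w_narrow, w_wide = support_width(ys, narrow), support_width(ys, wide)
```

Both fuzzy outputs defuzzify to practically the same scalar,
although the second one is far more imprecise; this difference
is visible only when the fuzzy set itself is returned.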


14.2.4 Constraints and Criteria 
for Fuzzy Rule Bases 


As previously stated, the interpretability of a rule base 
taken as a whole has two facets: (1) a structural facet 
(readability), which is mainly related to the easiness 
of reading the rules; (2) a semantic facet (compre- 
hensibility), which is related to the information con- 
veyed to the users who are willing to understand the 
system behavior. The following interpretability con- 
straints and criteria are commonly defined to ensure 
the structural and semantic interpretability of fuzzy rule 
bases. 


• Compactness: A compact rule base is defined by
a small number of rules. This is a typical structural
constraint that advocates a simple representation
of knowledge in order to allow easy reading and
understanding. Nevertheless, a small number of rules
usually entails lower accuracy; it is therefore very
common to balance compactness and accuracy in
a trade-off that mainly depends on user needs.

• Average firing rules: When an input is applied to
a fuzzy system, the rules whose conditions are
verified to a degree greater than zero are firing, i.e.
they contribute to the inference of the output. On
average, the number of firing rules should be as
small as possible, so that users are able to understand
the contributions of the rules in determining
the output.

• Logical view: Fuzzy rules resemble logical propositions
when their linguistic description is considered.
Since linguistic description is the main means of
communicating knowledge, it is necessary that logical
laws are applicable to fuzzy rules; otherwise,
the system behavior may be counterintuitive.
Therefore, the validity of some basic laws of
propositional logic (like Modus Ponens) and of
truth-preserving operations (e.g., application of
distributivity, De Morgan laws, etc.) should also be
verified for fuzzy rules.

• Completeness: The behavior of a fuzzy system is
well defined for all inputs in the universe of
discourse; however, when the maximum firing strength
determined by an input is too small, it is not easy to
justify the behavior of the system in terms of the
activated rules. It is therefore required that for each
possible input at least one rule is activated with a firing
strength greater than a threshold value (usually
set to 0.5) [14.22].

• Locality: Each rule should define a local model,
i.e. a fuzzy region in the universe of discourse
where the behavior of the system is mainly due
to the rule and only marginally to other rules that
are simultaneously activated [14.28]. This requirement
is necessary to prevent the final output
of the system from being a consequence of an interpolative
behavior of different rules that are simultaneously
activated with high firing strengths. On the other
hand, a moderate overlapping of local models is
admissible in order to enable a smooth transition
from one local model to another when the input
values gradually shift from one fuzzy region to another.


In summary, a number of interpretability constraints
and criteria apply to all levels of a fuzzy system. This
section highlights only the constraints that are general
enough to be applied independently of the modeling
problem; however, several problem-specific constraints
are also reported in the literature (e.g., attribute
correlation). Sometimes interpretability constraints are
conflicting (as exemplified by the dichotomy
distinguishability versus coverage) and, in many cases, they
conflict with the overall accuracy of the system. A balance
is therefore required, which in turn calls for a way
to assess interpretability both qualitatively and
quantitatively. This is the main subject of the next section.


14.3 Interpretability Assessment


The interpretability constraints and criteria presented
in the previous section belong to two main classes: (1)
structural constraints and criteria referring to the static
description of a fuzzy model in terms of the elements
that compose it; (2) semantic constraints and criteria
quantifying interpretability by looking at the behavior
of the fuzzy system. Whilst structural constraints
address the readability of a fuzzy model, semantic
constraints focus on its comprehensibility.

Of course, interpretability assessment must regard
both global (description readability) and local (inference
comprehensibility) points of view. It must also
take into account both structural and semantic issues
when considering all components (fuzzy sets, fuzzy
partitions, linguistic partitions, linguistic propositions,
fuzzy rules, fuzzy operators, etc.) of the fuzzy system
under study.

Thus, assessing interpretability represents a challenging
task mainly because the analysis of interpretability
is extremely subjective. In fact, it clearly
depends on the feeling and background (knowledge,
experience, etc.) of the person who is in charge of making
the evaluation. Even though subjective indexes
would be valuable for personalization purposes,
the search for a widely accepted universal metric
also makes the definition of objective indexes mandatory.
Hence, it is necessary to consider both objective
and subjective indexes. On the one hand, objective
indexes are aimed at enabling fair comparisons
among different fuzzy models designed for solving
the same problem. On the other hand, subjective indexes
are intended to guide the design of customized
fuzzy models, thus making it easier to take into account
users’ preferences and expectations during the design
process.
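
As a concrete example of objective measurement, two of the
rule base criteria of Sect. 14.2.4, average firing rules and
completeness, can be computed from a table of firing degrees.
The Python sketch below is illustrative only: the table (one
row per input sample, one column per rule) is hypothetical, and
the completeness check assumes that a firing strength equal to
the threshold is acceptable.

```python
def average_firing_rules(firing):
    """Mean number of rules firing (degree > 0) per input sample."""
    fired_per_sample = [sum(1 for d in row if d > 0.0) for row in firing]
    return sum(fired_per_sample) / len(firing)


def is_complete(firing, threshold=0.5):
    """True if every sample activates at least one rule with a firing
    strength of at least `threshold` (assumed non-strict comparison)."""
    return all(max(row) >= threshold for row in firing)


# Hypothetical firing degrees: 3 input samples, 3 rules.
firing = [
    [0.8, 0.2, 0.0],
    [0.0, 0.6, 0.4],
    [0.0, 0.0, 0.3],
]

avg_fired = average_firing_rules(firing)      # (2 + 2 + 1) / 3
complete_at_05 = is_complete(firing)          # third sample peaks at 0.3
complete_at_03 = is_complete(firing, 0.3)
```

With the usual threshold of 0.5 this rule base is not complete,
because the third sample is only weakly covered.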

The rest of this section gives an overview of
the most popular interpretability indexes found in
the specialized literature. Firstly, Zhou and
Gan [14.29] established a two-level taxonomy regarding
interpretability issues. They distinguished between
low-level (also called fuzzy set level) and high-level
(or fuzzy rule level) interpretability. This taxonomy was
extended by Alonso et al. [14.30], who introduced a conceptual
framework for characterizing interpretability. They
considered both fuzzy partitions and fuzzy rules at several
abstraction levels. Moreover, in [14.31] Mencar et al.
pointed out the need to distinguish between readability
(related to structural issues) and comprehensibility
(related to semantic issues). Later, Gacto et al. [14.32]
proposed a double-axis taxonomy regarding semantic
and structural properties of fuzzy systems, at both
partition and rule base levels. Accordingly, they pointed
out four groups of indexes. Below, we briefly introduce
the two best-known indexes inside each group (they
are summarized in Fig. 14.3):


G1. Structural-based interpretability at fuzzy partition
level:
• Number of features.
• Number of membership functions.


Fig. 14.3 Interpretability indexes considered in this work.
The figure crosses two levels (fuzzy partition level, fuzzy
rule base level) with two kinds of interpretability
(structural-based, semantic-based):
G1 (structural, partition level): number of features; number
of membership functions.
G2 (structural, rule base level): number of rules; number
of conditions.
G3 (semantic, partition level): context-adaptation-based
index; GM3M index.
G4 (semantic, rule base level): semantic-cointension-based
index; co-firing-based comprehensibility index.


G2. Structural-based interpretability at fuzzy rule base
level:
• Number of rules. This index is the most widely
used [14.30].
• Number of conditions. This index corresponds
to the previously mentioned total rule length,
a term coined by Ishibuchi et al. [14.33].


G3. Semantic-based interpretability at fuzzy partition
level:
• Context-adaptation-based index [14.34]. This
index was introduced by Botta et al. with the
aim of guiding the so-called context adaptation
approach for multiobjective evolutionary design
of fuzzy rule-based systems. It is actually an
interpretability index based on fuzzy ordering
relations.

• GM3M index [14.35]. Gacto et al. proposed an
index defined as the geometric mean of three
single metrics. The first metric computes the
displacement of the tuned membership functions
with respect to the initial ones. The second
metric evaluates the changes in the shapes of
membership functions in terms of lateral
amplitude rate. The third metric measures the area
similarity. This index was used to preserve
the semantic interpretability of fuzzy partitions
along multiobjective evolutionary rule selection
and tuning processes aimed at designing fuzzy
models with a good interpretability-accuracy
trade-off.


G4. Semantic-based interpretability at fuzzy rule base
level:
• Semantic-cointension-based index [14.36]. This
index exploits the cointension concept coined
by Zadeh [14.3]. In short, two different concepts
referring almost to the same entities are taken as
cointensive. Thus, a fuzzy system is deemed
comprehensible only when the explicit semantics
(defined by fuzzy sets attached to linguistic
terms as well as fuzzy operators) embedded in
the fuzzy model is cointensive with the implicit
semantics inferred by the user while reading
the linguistic representation of the rules. In the
case of classification problems, semantic cointension
can be evaluated through a logical view
approach, which evaluates the degree of fulfillment
of a number of logical laws exhibited by
a given fuzzy rule base [14.31]. The idea mainly
relies on the assumption that linguistic propositions
resemble logical propositions, for which
a number of basic logical laws are expected to
hold.

• Co-firing-based comprehensibility index
[14.37]. It measures the complexity of understanding
the fuzzy inference process in terms of
information related to co-firing rules, i.e. rules
firing simultaneously with a given input vector.
This index emerges in relation with a novel
approach for fuzzy system comprehensibility
analysis, based on visual representations of the
fuzzy rule-based inference process. Such
representations are called fuzzy inference-grams
(fingrams) [14.38, 39]. Given a fuzzy rule
base, a fingram plots it graphically as a social
network made of nodes representing fuzzy rules
and edges connecting nodes in terms of rule
interaction at the inference level. Edge weights
are computed by paying attention to the number
of co-firing rules. Thus, by looking carefully at
all the information provided by a fingram, it
becomes easy and intuitive to understand the
structure and behavior of the fuzzy rule base it
represents. 
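
To give a flavor of the co-firing idea, the Python sketch below
counts, for every pair of rules, the number of input samples
that fire both of them, and turns the counts into normalized
edge weights. This is a minimal sketch under stated
assumptions: the firing table is hypothetical, and the
normalization used for the weights is an assumption made for
illustration, not necessarily the one adopted in [14.37–39].

```python
from math import sqrt


def cofiring_counts(firing):
    """counts[i][j] = number of samples firing both rule i and rule j;
    counts[i][i] is the number of samples firing rule i."""
    n_rules = len(firing[0])
    fired = [[d > 0.0 for d in row] for row in firing]
    return [
        [sum(1 for row in fired if row[i] and row[j]) for j in range(n_rules)]
        for i in range(n_rules)
    ]


def edge_weight(counts, i, j):
    """Assumed normalization: co-firing count divided by the geometric
    mean of the individual firing counts of the two rules."""
    denom = sqrt(counts[i][i] * counts[j][j])
    return counts[i][j] / denom if denom > 0 else 0.0


# Hypothetical firing degrees: 4 input samples, 3 rules.
firing = [
    [0.8, 0.2, 0.0],
    [0.0, 0.6, 0.4],
    [0.0, 0.0, 0.3],
    [0.5, 0.5, 0.0],
]

counts = cofiring_counts(firing)
w01 = edge_weight(counts, 0, 1)   # rules 0 and 1 often fire together
w02 = edge_weight(counts, 0, 2)   # rules 0 and 2 never fire together
```

Rules connected by heavy edges interact strongly at inference
time, whereas rules with no shared samples would not be linked
in the resulting graph.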


Notice that most published interpretability indexes
only deal with structural issues, so they correspond to
groups G1 and G2. Indexes belonging to these groups
are mainly quantitative. They essentially analyze the
structural complexity of a fuzzy model by counting the
number of elements (membership functions, rules, etc.) 
it contains. As a result, these indexes can be deemed 
as objective ones. Although these indexes are usually 
quite simple (that is the reason why we have just listed 
them above), they are by far the most popular ones. On 
the contrary, only a few interpretability indexes are able 
to assess the comprehensibility of a fuzzy model deal- 
ing with semantic issues (they belong to groups G3 and 
G4). This is mainly due to the fact that these indexes 
must take into account not only quantitative but also 
qualitative aspects of the modeled fuzzy system. They 
are inherently subjective and therefore not easy to
formalize (that is the reason why we have provided more
details above). Anyway, the interested reader is referred 
to the cited papers for further information. Moreover, 
a much more exhaustive list of indexes can be found 
in [14.32]. 
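
Because the structural indexes of groups G1 and G2 are simple
counts, they are easy to compute. The Python sketch below
assumes a hypothetical rule base stored as pairs of
dictionaries (condition, consequent); the representation, the
variable names, and the choice of counting only the soft
constraints in rule conditions are assumptions made for
illustration.

```python
# Each rule: (condition, consequent), mapping variables to linguistic terms.
rule_base = [
    ({"Temperature": "High", "Pressure": "Low"}, {"Valve": "Open"}),
    ({"Temperature": "Low"}, {"Valve": "Closed"}),
    ({"Temperature": "Medium", "Pressure": "High", "Flow": "Low"},
     {"Valve": "Half"}),
]


def number_of_rules(rules):
    return len(rules)


def number_of_conditions(rules):
    """Total number of soft constraints in the rule conditions."""
    return sum(len(condition) for condition, _ in rules)


def number_of_features(rules):
    """Number of distinct input variables actually used."""
    return len({var for condition, _ in rules for var in condition})


def number_of_membership_functions(rules):
    """Number of distinct (variable, term) pairs used in conditions."""
    return len({(var, term) for condition, _ in rules
                for var, term in condition.items()})


indexes = {
    "rules": number_of_rules(rule_base),
    "conditions": number_of_conditions(rule_base),
    "features": number_of_features(rule_base),
    "membership_functions": number_of_membership_functions(rule_base),
}
```

Such counts allow quick comparisons between alternative models
of the same problem, but they say nothing about whether the
linguistic terms are meaningful to the user.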

Even though there has been a great effort in recent
years to propose new interpretability indexes, a universal
index is still missing. Hence, defining such an index
remains a challenging task. In addition, we would like to
highlight another encouraging challenge, namely
the careful design of interpretable fuzzy
systems guided by one or more of the already existing
interpretability indexes.


14.4 Designing Interpretable Fuzzy Systems 


Linguistic (Mamdani-type) fuzzy systems are widely 
known as a powerful tool to develop linguistic mod- 
els [14.11]. They are made up of two main components: 


• The inference engine, that is the component of the
fuzzy system in charge of the fuzzy processing tasks. 

• The knowledge base (KB), that is the component of
the fuzzy system that stores the knowledge about 
the problem being solved. It is composed of: 

— The fuzzy partitions, describing the linguistic 
terms along with the corresponding membership 
functions defining their semantics, and 

— The fuzzy rule base, constituted by a collection 
of linguistic rules with the following structure 


IF $X_1$ is $A_1$ and ... and $X_n$ is $A_n$
THEN $Y_1$ is $B_1$ and ... and $Y_m$ is $B_m$


with $X_i$ and $Y_j$ being input and output linguistic
variables, respectively, and $A_i$ and $B_j$ being
linguistic terms defined by the corresponding 
fuzzy partitions. This structure provides a nat- 
ural framework to include expert knowledge 
in the form of linguistic fuzzy rules. In addi- 
tion to expert knowledge, induced knowledge 
automatically extracted from experimental data 
(describing the relation between system input 
and output) can also be easily formalized in the 
same rule base. Expert and induced knowledge 
are complementary. Furthermore, they are rep- 
resented in a highly interpretable structure. The 
fuzzy rules are composed of input and output 
linguistic variables which take values from their 
term sets having a meaning associated with each 


linguistic label. As a result, each rule is a de- 
scription of a condition-action statement that 
offers a clear interpretation to a human. 
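
The rule structure above can be sketched in Python as follows.
This is a minimal example under stated assumptions: the
conjunction "and" is modeled by the minimum and the rule
consequent is clipped at the firing strength, which is a common
choice for Mamdani-type systems but only one of the possible
operator choices; variable names, terms, and membership degrees
are hypothetical.

```python
def firing_strength(condition, degrees):
    """Degree to which a conjunctive rule condition holds,
    using the minimum for 'and' (assumed operator)."""
    return min(degrees[variable][term] for variable, term in condition.items())


def clip_consequent(consequent_mu, strength):
    """Cut the sampled output fuzzy set at the firing strength."""
    return [min(strength, m) for m in consequent_mu]


# Membership degrees of one crisp input in each linguistic term.
degrees = {
    "Temperature": {"Low": 0.0, "High": 0.7},
    "Pressure": {"Low": 0.4, "High": 0.1},
}

# IF Temperature is High and Pressure is Low THEN Valve is Open
rule = ({"Temperature": "High", "Pressure": "Low"}, {"Valve": "Open"})

# Sampled membership function of the output term "Open" (hypothetical).
open_mu = [0.0, 0.25, 0.5, 0.75, 1.0]

strength = firing_strength(rule[0], degrees)
clipped = clip_consequent(open_mu, strength)
```

Each step can be read back linguistically: the rule holds to
the degree of its weakest soft constraint, and the conclusion
"Valve is Open" is asserted only up to that degree.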


The accuracy of a fuzzy system directly depends 
on two aspects, the composition of the KB (fuzzy 
partitions and fuzzy rules) and the way in which it im- 
plements the fuzzy inference process. Therefore, the 
design process of a fuzzy system includes two main 
tasks which are going to be further explained in the fol- 
lowing subsections, regarding both interpretability and 
accuracy: 


© Generation of the KB in order to formulate and de- 
scribe the knowledge that is specific to the problem 
domain. 

© Conception of the inference engine, that is the 
choice of the different fuzzy operators that are em- 
ployed by the inference process. 


Mamdani-type fuzzy systems favor interpretability. 
Therefore, they are usually considered when looking for 
interpretable fuzzy systems. However, it is important 
to remark that they are not interpretable per se:
obtaining an interpretable fuzzy system is a matter of
careful design.


14.4.1 Design Strategies for the Generation 
of a KB Regarding 
the Interpretability-Accuracy 
Trade-Off 


The two main objectives to be addressed in the FM
field are interpretability and accuracy. Of course,
the ideal aim would be to satisfy both objectives to
a high degree but, since they represent conflicting
goals, this is generally not possible. Regardless of the
approach, a common scheme is found in the existing
literature:

• Firstly, the main objective (interpretability or accuracy)
is tackled by defining a specific model structure
to be used, thus setting the FM approach.
• Then, the modeling components (model structure
and/or modeling process) are improved by means
of different mechanisms to achieve the desired ratio
between interpretability and accuracy.

This procedure results in four different possibilities:

1. LFM with improved interpretability,
2. LFM with improved accuracy,
3. PFM with improved interpretability, and
4. PFM with improved accuracy.

Option (1) gives priority to interpretability. Although
a fuzzy system designed by LFM uses a model
structure with high descriptive power, it has some problems
(curse of dimensionality, excessive number of
input variables or fuzzy rules, garbled fuzzy sets, etc.)
that make it not as interpretable as desired. In consequence,
interpretability improvements are needed
to restore the pursued balance.

On the contrary, option (4) considers accuracy as
the main concern. However, obtaining more accuracy in
PFM does not pay attention to the interpretability of the
model. Thus, this approach departs from the aim of
this chapter. It acts close to black-box techniques, so it
does not follow the original objective of FM (it does not
benefit from the advantages that distinguish FM from other
modeling techniques).

Finally, the two remaining options, (2) and (3), propose
improvement mechanisms to compensate for the
initial imbalance in the quest for the best trade-off
between interpretability and accuracy. In summary, three
main approaches exist depending on how the two
objectives are optimized (sequentially or at once):


• First interpretability then accuracy (LFM with
improved accuracy).

• First accuracy then interpretability (PFM with
improved interpretability).

• Multiobjective design. Both objectives are optimized
at the same time.


The rest of this section provides additional details
related to each of these approaches.


First Interpretability Then Accuracy
LFM has some inflexibility due to the use of linguistic
variables with global semantics that establish a general
meaning of the used fuzzy sets [14.40]:

1. There is a lack of flexibility in the fuzzy system
because of the rigid partitioning of the input and
output spaces.
2. When the system input variables are dependent, it
is very hard to find suitable fuzzy partitions of the
input spaces.
3. The usual homogeneous partitioning of the input
and output spaces does not scale to high-dimensional
spaces. It leads to the well-known
curse of dimensionality problem that is characteristic
of fuzzy systems.
4. The size of the KB directly depends on the number
of variables and linguistic terms in the model. The
derivation of an accurate linguistic fuzzy system
usually requires a large number of linguistic terms.
Unfortunately, this fact causes the number of rules
to rise significantly, which may cause the system to
lose the capability of being readable by human beings.
Of course, in most cases it would be possible
to obtain an equivalent fuzzy system with a much
smaller number of rules by giving up that kind
of rigidly partitioned input space.
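
The curse of dimensionality mentioned above for homogeneous
(grid) partitioning can be quantified with a simple count: a
rule base that covers every combination of linguistic terms
needs as many rules as the product of the numbers of terms of
the input variables. The Python sketch below illustrates this
count; the function name and the example sizes are
hypothetical.

```python
from math import prod


def grid_rule_count(terms_per_input):
    """Number of rules in a complete grid rule base: one rule for
    each combination of linguistic terms of the input variables."""
    return prod(terms_per_input)


# Five linguistic terms per input, for 2, 4, and 8 input variables.
rules_2_inputs = grid_rule_count([5] * 2)
rules_4_inputs = grid_rule_count([5] * 4)
rules_8_inputs = grid_rule_count([5] * 8)
```

With five terms per variable the count grows from 25 rules for
two inputs to 390625 rules for eight inputs, which is clearly
beyond what a human can read.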


However, it is possible to make some considerations 
to face the disadvantages enumerated above. Basically, 
two ways of improving the accuracy in LFM can be 
considered by performing the improvement in: 


• The model structure, slightly changing the rule
structure to make it more flexible, or in

• The modeling process, extending the model design
to other components beyond the rule base, such as
the fuzzy partitions, or even considering more
sophisticated derivations of it.


Note that the so-called strong fuzzy partitions are
widely used because they satisfy most of the inter- 
pretability constraints introduced in Sect. 14.2.2. The 
design of fuzzy partitions may be integrated within the 
whole derivation process of a fuzzy system with differ- 
ent schemata: 


• Preliminary design. It involves extracting fuzzy
partitions automatically by induction (usually
performed by unsupervised clustering techniques)
from the available dataset.

• Embedded design. Following a meta-learning process,
this approach first derives different fuzzy
partitions and then evaluates their efficacy by running
an embedded basic learning method of the entire
KB [14.41].

• Simultaneous design. The process of designing
fuzzy partitions is developed together with the
derivation of other components such as the fuzzy
rule base [14.42].

• A posteriori design. This approach involves tuning
the previously defined fuzzy partitions once the
remaining components have been obtained. Usually,
the tuning process changes the membership
function shapes with the aim of improving the
accuracy of the linguistic model [14.43]. Nevertheless,
sometimes it also aims at improving
interpretability (e.g., merging membership func- 
tions [14.44]). 
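
As an illustration of a posteriori merging, the Python sketch
below merges adjacent triangular membership functions whose
similarity exceeds a threshold. It is only a sketch under
stated assumptions, not the method of [14.44]: the fuzzy sets
are triangles described by hypothetical (a, b, c) parameters
sorted by prototype, similarity is a Jaccard-type measure on a
sampled grid, the threshold of 0.6 is arbitrary, and merging
simply averages the parameters.

```python
def triangular(x, a, b, c):
    """Triangular membership: support (a, c), prototype b, with a < b < c."""
    return max(0.0, min((x - a) / (b - a), (c - x) / (c - b)))


def similarity(set_a, set_b, xs):
    """Jaccard-type similarity between two triangles on the grid xs."""
    inter = sum(min(triangular(x, *set_a), triangular(x, *set_b)) for x in xs)
    union = sum(max(triangular(x, *set_a), triangular(x, *set_b)) for x in xs)
    return inter / union if union > 0 else 0.0


def merge_similar(sets, xs, threshold=0.6):
    """Greedily merge neighbouring sets that are too similar by
    averaging their parameters (assumed merging rule)."""
    merged = [sets[0]]
    for current in sets[1:]:
        previous = merged[-1]
        if similarity(previous, current, xs) > threshold:
            merged[-1] = tuple((p + c) / 2.0 for p, c in zip(previous, current))
        else:
            merged.append(current)
    return merged


xs = [i / 100.0 for i in range(1001)]
# The first two sets almost coincide, e.g., after an accuracy-driven tuning.
sets = [(0.0, 2.0, 4.0), (0.2, 2.2, 4.2), (5.0, 7.0, 9.0)]
merged = merge_similar(sets, xs)
```

After merging, the two nearly identical sets are replaced by a
single one, so that distinct linguistic terms again refer to
distinguishable concepts.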


It is also possible to opt for using more sophisticated 
rule base learning methods while the fuzzy partitions
and the model structure are kept unaltered. Usually, all 
these improvements have the final goal of enhancing the 
interpolative reasoning the fuzzy system develops. For 
instance, the COR (cooperative rules) method follows 
the primary objective of inducing a better cooperation 
among linguistic rules [14.45]. 

As an alternative, other authors advocate the exten- 
sion of the usual linguistic model structure to make 
it more flexible. As Zadeh highlighted in [14.46], 
a way to do so without losing the description abil- 
ity to a high degree is to use linguistic hedges (also 
called linguistic modifiers in a wider sense). In ad- 
dition, the rule structure can be extended through 
the definition of double-consequent rules, weighted 
rules, rules with exceptions, hierarchical rule bases, 
etc. 
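
For instance, linguistic hedges are often modeled as simple
transformations of the membership degree. The Python sketch
below uses the classical concentration and dilation operators
("very" as the square and "more or less" as the square root of
the membership degree), which are a common textbook choice and
only one of several possible definitions; the example degree
is hypothetical.

```python
def very(mu):
    """Concentration: makes the term more restrictive."""
    return mu ** 2


def more_or_less(mu):
    """Dilation: makes the term more permissive."""
    return mu ** 0.5


# Degree to which a given temperature is High (hypothetical value).
mu_high = 0.64

mu_very_high = very(mu_high)
mu_more_or_less_high = more_or_less(mu_high)
```

The modified terms keep a direct linguistic reading ("very
High", "more or less High"), which is why hedges add
flexibility at a limited cost in terms of description ability.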


First Accuracy Then Interpretability 
The birth of more flexible fuzzy systems such as TSK
or approximate ones (allowing the FM to achieve higher
accuracy) led to the emergence of PFM. Nevertheless,
the modeling tasks with these kinds of fuzzy systems
increasingly resembled black-box processes.
Consequently, nowadays several researchers share the idea
of rescuing the seminal intent of FM, i.e. to preserve
the good interpretability advantages offered by fuzzy
systems. This is usually achieved by reducing the
complexity of the model [14.47]. Furthermore, there are
approaches aimed at improving the local description of
TSK-type fuzzy rules:


• Merging/removing fuzzy sets in precise fuzzy systems. The
interpretability of TSK-type fuzzy systems may be improved by
removing those fuzzy sets that, after an automatic adaptation
and/or acquisition, do not contribute significantly to the
model behavior. Two aspects must be considered:

— Redundancy. It refers to the coexistence of simi- 
lar fuzzy sets representing compatible concepts. 
In consequence, models become more complex 
and difficult to understand (the distinguishabil- 
ity constraint is not satisfied). 

— Irrelevancy. It arises when fuzzy sets with 
a constant membership degree equal to 1, or 
close to it, are used. These kinds of fuzzy sets 
do not furnish relevant information. 

The use of similarity measures between fuzzy sets has been
proposed to automatically detect these undesired fuzzy
sets [14.48]. By first merging/removing fuzzy sets and then
merging fuzzy rules, the precise fuzzy model goes through an
interpretability improvement process that makes it less complex
(more compact) and more easily interpretable (more transparent).

• Ordering/selecting TSK-type fuzzy rules. An efficient way to
improve the interpretability in FM is to select a subset of
significant fuzzy rules that represent the system to be modeled
in a more compact way. Moreover, as a side effect this selection
of important rules reduces the possible redundancy existing in
the fuzzy rule base, thus improving the generalization
capability of the system, i.e., its accuracy. For instance,
resorting to orthogonal transformations [14.49] is one of the
most successful approaches in this sense.

• Exploiting the local description of TSK-type fuzzy rules.
TSK-type fuzzy systems are usually considered as the combination
of simple models (the rules) that describe local behaviors of
the system to be modeled. Hence, insofar as each fuzzy rule is
either forced to have a smoother consequent polynomial function
or to develop an isolated action, the interpretability will be
improved:

– Smoothing the consequent polynomial function [14.50]. By
imposing several constraints on the weights involved in the
polynomial function of each rule consequent, a convex
combination of the input variables is performed. This
contributes to a better understanding of the model.

– Isolating the fuzzy rule actions [14.47]. The description of
each fuzzy rule is improved when the overlapping between
adjacent input fuzzy sets is reduced. Note that the performance
region of a rule is more clearly defined when other rules are
prevented from having a high firing degree in the same area.
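
The redundancy check mentioned in the first item of the list above can be sketched with a set-theoretic (Jaccard) similarity evaluated on a sampled universe of discourse. This is only an illustration: the triangular shapes, the sampling grid, and the merging threshold of 0.6 are assumptions chosen for the example, not values taken from [14.48].

```python
def triangular(a, b, c):
    """Return a triangular membership function with support (a, c) and peak at b."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        if x <= b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)
    return mu


def jaccard_similarity(mu1, mu2, lo, hi, steps=1000):
    """Approximate |A and B| / |A or B| using min/max on a uniform grid over [lo, hi]."""
    inter = union = 0.0
    for n in range(steps + 1):
        x = lo + (hi - lo) * n / steps
        m1, m2 = mu1(x), mu2(x)
        inter += min(m1, m2)
        union += max(m1, m2)
    return inter / union if union > 0 else 0.0


# Hypothetical fuzzy sets on the universe [0, 10]
low = triangular(0.0, 2.0, 4.0)
almost_low = triangular(0.2, 2.2, 4.2)   # nearly the same concept as `low`
high = triangular(6.0, 8.0, 10.0)

THRESHOLD = 0.6  # hypothetical merging threshold
redundant = jaccard_similarity(low, almost_low, 0.0, 10.0) > THRESHOLD
distinct = jaccard_similarity(low, high, 0.0, 10.0) > THRESHOLD
```

Under these assumptions, `low` and `almost_low` would be flagged as candidates for merging, while `low` and `high` would be kept apart.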


Multiobjective Design 

Since interpretability and accuracy are widely recog- 
nized as conflicting goals, the use of multiobjective 
evolutionary (MOE) strategies is becoming more and 
more popular in the quest for the best interpretability- 
accuracy trade-off [14.19,51]. Ducange and Marcel- 
loni [14.52] proposed the following taxonomy of mul- 
tiobjective evolutionary fuzzy systems: 


• MOE Tuning. Given an already defined fuzzy system, its main
parameters (typically membership function parameters but also
fuzzy inference parameters) are refined through MOE
strategies [14.53, 54].

• MOE Learning. The components of a fuzzy system KB, both the
fuzzy partitions forming the database (DB) and the fuzzy rules
forming the rule base (RB), are automatically generated from
experimental data.

— MOE DB Learning. The most relevant variables 
are identified and the optimum membership 
function parameters are defined from scratch. 
It usually wraps a RB heuristic-based learning 
process [14.55]. 

— MOE RB Selection. Starting from an initial RB, 
a set of nondominated RBs is generated by 
selecting subsets of rules exhibiting different 
trade-offs between interpretability and accu- 
racy [14.56]. In some works [14.35,57], MOE 
RB selection and MOE tuning are carried out 
together. 

— MOE RB Learning. The entire set of fuzzy rules 
is fully defined from scratch. In this approach, 
uniformly distributed fuzzy partitions are usu- 
ally considered [14.58]. 

— MOE KB Learning. Simultaneous evolution- 
ary learning of all KB components (DB and 
RB). Concurrent learning of fuzzy partitions 
and fuzzy rules proved to be a powerful tool 
in the quest for a good balance between inter- 
pretability and accuracy [14.59]. 
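
All the variants listed above end up with a set of nondominated solutions rather than a single model. The following minimal sketch shows only that final filtering step, assuming two objectives to be minimized (an error measure and the number of rules); the rule-base names and objective values are made up for illustration and do not come from the cited works.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def nondominated(candidates):
    """Keep the candidates whose objective vectors are not dominated by any other candidate."""
    return [
        cand for cand in candidates
        if not any(dominates(other["objectives"], cand["objectives"])
                   for other in candidates if other is not cand)
    ]


# Hypothetical rule bases described by (error, number of rules)
rule_bases = [
    {"name": "RB1", "objectives": (0.10, 30)},
    {"name": "RB2", "objectives": (0.15, 12)},
    {"name": "RB3", "objectives": (0.16, 20)},  # worse than RB2 in both objectives
    {"name": "RB4", "objectives": (0.30, 5)},
]
front = [rb["name"] for rb in nondominated(rule_bases)]
```

Here `front` contains RB1, RB2, and RB4, i.e., three different interpretability-accuracy trade-offs among which a designer would still have to choose.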


It is worth noting that, for the sake of clarity, we have only
cited some of the most relevant papers in the field of MOE fuzzy
systems. For further details, the interested reader is referred
to [14.51,52], where a much more exhaustive review of related
works is carried out.


14.4.2 Design Decisions at Fuzzy Processing Level


Although there are studies analyzing the behavior of the
existing fuzzy operators for different purposes, unfortunately
this question has not yet been considered as a whole from the
interpretability point of view. Keeping the interpretability
requirement in mind, the implementation of the inference engine
must address the following design choices:


• Select the right conjunctive operator to be used in the
antecedent of the rule. Different operators (belonging to the
t-norm family) are available to make this choice [14.60].

• Select the operator to be used in the fuzzy implication of
IF-THEN rules. Mamdani proposed to use the minimum operator as
the t-norm for implication. Since then, various other t-norms
have been suggested as implication operators [14.60], for
instance the algebraic product. Another important family of
implication operators is that of fuzzy implication
functions [14.61], one of the most usual being the Łukasiewicz
implication. Less common implication operators such as
force-implications [14.62], t-conorms, and operators not
belonging to any of the best-known implication operator
families [14.63, 64] have been considered too.

• Choose the right inference mechanism. Two main strategies are
available:

– FATI (First Aggregation Then Inference). All antecedents of
the rules are aggregated to form a multidimensional fuzzy
relation. The output fuzzy set is then derived via the
composition principle. This strategy is preferred when dealing
with implicative rules [14.65].

– FITA (First Inference Then Aggregation). The output of each
rule is first inferred, and then all individual fuzzy outputs
are aggregated. This is the common approach when working with
the usual conjunctive rules. This strategy has become by far the
most popular, especially in the case of real-time applications.
The choice of an output aggregation method (in some cases called
the also operator) is closely related to the considered
implication operator, since it has to be related to the
interpretation of the rules (which is connected to the kind of
implication).

• Choose the most suitable defuzzification interface operation
mode. There are different options, the most widely used being
the center of area (also called center of gravity) and the mean
of maxima. Even though most methods are based on geometrical or
statistical interpretations, there are also parametric methods,
adaptive methods including human knowledge, and even
evolutionary adaptive methods [14.66].
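
The four choices listed above can be made concrete with a small sketch of a FITA inference engine: a pluggable t-norm for the antecedent conjunction, Mamdani's minimum as implication, maximum as the aggregation ("also") operator, and center-of-area defuzzification on a sampled output universe. The linguistic terms, variable names, and universes below are hypothetical and serve only as an example.

```python
def tri(a, b, c):
    """Triangular membership function with support (a, c) and peak at b."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return mu


def infer(rules, inputs, out_lo, out_hi, t_norm=min, steps=200):
    """FITA inference: fire each rule, clip its consequent, aggregate with max, take the centroid."""
    num = den = 0.0
    for n in range(steps + 1):
        y = out_lo + (out_hi - out_lo) * n / steps
        aggregated = 0.0
        for antecedents, consequent in rules:
            # Firing degree: conjunction (t-norm) of the antecedent memberships
            firing = 1.0
            for var, mu in antecedents:
                firing = t_norm(firing, mu(inputs[var]))
            # Mamdani implication (min), then max aggregation of the rule outputs
            aggregated = max(aggregated, min(firing, consequent(y)))
        num += y * aggregated
        den += aggregated
    return num / den if den > 0 else None  # None when no rule fires


# Hypothetical linguistic terms
cold, hot = tri(-20.0, 0.0, 20.0), tri(10.0, 30.0, 50.0)
dry, humid = tri(-50.0, 0.0, 60.0), tri(40.0, 100.0, 150.0)
low, high = tri(0.0, 25.0, 50.0), tri(50.0, 75.0, 100.0)

rules = [
    ([("temp", cold), ("hum", humid)], high),  # IF temp is cold AND hum is humid THEN power is high
    ([("temp", hot), ("hum", dry)], low),      # IF temp is hot AND hum is dry THEN power is low
]
crisp_min = infer(rules, {"temp": 5.0, "hum": 80.0}, 0.0, 100.0)
crisp_prod = infer(rules, {"temp": 5.0, "hum": 80.0}, 0.0, 100.0, t_norm=lambda a, b: a * b)
```

Swapping `t_norm` changes the firing degree of each rule (and thus the shape of the clipped output set) without touching the rest of the engine, which is one way of keeping the operator choices explicit and inspectable.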


14.5 Interpretable Fuzzy Systems in the Real World 


Interpretable fuzzy systems have an immediate impact 
on real-world applications. In particular, their useful- 
ness is appreciable in all application areas that put 
humans at the center of computing. Interpretable fuzzy 
systems, in fact, conjugate knowledge acquisition capa- 
bilities with the ability of communicating knowledge in 
a human-understandable way. 

Several application areas can take advantage from 
the use of interpretable fuzzy systems. In the follow- 
ing, some of them are briefly outlined, along with a few 
notes on specific applications and potentialities. 


• Environment: Environmental issues are often challenging
because of the complex dynamics, the
high number of variables and the consequent un- 
certainty characterizing the behavior of subjects 
under study. Computational intelligence techniques 
come into play when tolerance for imprecision 
can be exploited to design convenient models 
that are suitable to understand phenomena and 
take decisions. Interpretable fuzzy systems show 
a clear advantage over black-box systems in pro- 
viding knowledge that is capable of explaining 
complex and nonlinear relationships by using lin- 
guistic models. Real-world environmental applica- 
tions of interpretable fuzzy systems include: harm- 
ful bioaerosol detection [14.67]; modeling habitat 
suitability in river management [14.68]; modeling 
pesticide loss caused by meteorological factors in 
agriculture [14.69], and so on. 

• Finance: This is a sector where human-computer cooperation is
very tight. Cooperation is carried out in different ways,
including the use of computers to provide business intelligence
for decision support in financial operations. In many cases
financial decisions are ultimately made by experts, who can
benefit from automated analyses of the large masses of data
flowing daily through markets. To this end, computational
intelligence approaches are spreading among the tools used by
financial experts in their decisions, including interpretable
fuzzy systems for stock return predictions [14.70], exchange
rate forecasting [14.71], portfolio risk monitoring [14.72],
etc.

• Industry: Industrial applications could take advantage
of interpretable fuzzy systems when there is
the need of explaining the behavior of complex sys- 
tems and phenomena, like in fault detection [14.73]. 
Also, control plans for systems and processes can 
be designed with the help of fuzzy systems. In such 
cases, a common practice is to start with an ini- 
tial expert knowledge (used to design rules which 
are usually highly interpretable) that is then tuned 
to increase the accuracy of the controller. However, 
any unconstrained tuning could destroy the origi- 
nal interpretability of the knowledge base, whilst, 
by taking into account interpretability, the possibil- 
ity of revising and modifying the controller (or the 
process manager) can be enhanced [14.74].

• Medicine and Health-care: As a matter of fact, in almost
all medical contexts intelligent systems can be
invaluable decision support tools, but people are the 
ultimate actors in any decision process. As a conse- 
quence, people need to rely on intelligent systems, 
whose reliability can be enhanced if their outcomes 
may be explained in terms that are comprehensible 
by human users. Interpretable fuzzy systems could 
play a key role in this area because of the possibility 
of acquiring knowledge from data and communicat- 
ing it to users. In the literature, several approaches 
have been proposed to apply interpretable fuzzy 
systems in different medical problems, like assisted 
diagnosis [14.75], prognosis prediction [14.76], pa- 
tient subgroup discovery [14.77], etc. 

• Robotics: The complexity of robot behavior modeling can be
tackled by an integrated approach where a first modeling stage
is carried out by combining human expert knowledge and empirical
knowledge acquired from experimental trials. This integrated
approach requires that the final knowledge base is provided to
experts for further maintenance: this task can be done
effectively only if the acquired knowledge is interpretable by
the user. Some concrete applications of this approach can be
found in robot localization systems [14.78] and motion
analysis [14.79, 80].

• Society: The focus on intelligent systems for social issues
has noticeably increased in recent years. For reasons that are
common to all the previous application areas, interpretable
fuzzy systems have been applied in a wide variety of scopes,
including quality of service improvement [14.81], data mining
with privacy preservation [14.82], social network
analysis [14.37], and so on.


14.6 Future Research Trends on Interpretable Fuzzy Systems 


Research on interpretable fuzzy systems is open in sev- 
eral directions. Future trends involve both theoretical 
and methodological aspects of interpretability. In the 
following, some trends are outlined amongst the pos- 
sible lines of research development [14.7]. 


• Interpretability definition: The blurred nature of
interpretability requires continuous investigations on
possible definitions that enable a computable treat- 
ment of this quality in fuzzy systems. This require- 
ment casts the research on interpretable fuzzy sys- 
tems toward cross-disciplinary investigations. For 
instance, this research line includes investigations 
on computable definitions of some conceptual qual- 
ities, like vagueness (which has to be distinguished 
from imprecision and fuzziness). Also, the problem 
of interpretability of fuzzy systems can be intended 
as a particular instance of the more general problem 
of communication between granular worlds [14.83], 
where many aspects of interpretability could be 
treated in a more abstract way. 

• Interpretability assessment: A prominent objective
is the adoption of a common framework for
characterizing and assessing interpretability with 
the aim of avoiding misleading notations. Within 
such a framework, novel metrics could be de- 
vised, especially for assessing subjective aspects 
of interpretability, and integrated with objective in- 
terpretability measures to define more significant 
interpretability indexes. 

• Design of interpretable fuzzy models: A current research trend
in designing interpretable fuzzy models makes use of
multiobjective genetic algorithms in order to deal with the
conflicting design objectives of accuracy and interpretability.
The effectiveness and usefulness of these approaches, especially
those concerning advanced schemes, have to be verified against a
number of indexes, including indexes that integrate subjective
measures. This verification process is particularly required
when tackling high-dimensional problems. In this case, the
combination of linguistic and graphical approaches could be a
promising approach for descriptive and exploratory analysis of
interpretable fuzzy systems.

• Representation of fuzzy systems: For very complex problems the
use of novel forms of representation (different from the
classical rule-based one) may help in representing complex
relationships in comprehensible ways, thus yielding a valid aid
in designing interpretable fuzzy systems. For instance, a
multilevel representation could enhance the interpretability of
fuzzy systems by providing different granularity levels for
knowledge representation. The highest granulation levels give a
coarse (yet immediately comprehensible) description of
knowledge, while lower levels provide more detailed knowledge.


As a final remark, it is worth observing that interpretability
is one aspect of the multifaceted problem of human-centered
design of fuzzy systems [14.84]. Other facets include
acceptability (e.g., according to ethical rules),
interestingness of fuzzy rules, applicability (e.g., with
respect to law), etc. Many of them are not yet in the research
mainstream but they clearly represent promising future trends.


14.7 Conclusions


Interpretability is an indispensable requirement for designing
fuzzy systems, yet it cannot be assumed to hold by the simple
fact of using fuzzy sets for modeling. Interpretability must be
encoded in some computational methods in order to drive the
design of fuzzy systems, as well as to assess the
interpretability of existing models. The study of
interpretability issues started about two decades ago and led to
a number of theoretical and methodological results of paramount
value in fuzzy modeling. Nevertheless, research is still open
both in depth, through new ways of encoding and assessing


15. Fuzzy Clustering – Basic Ideas and Overview


Sadaaki Miyamoto 


This chapter overviews basic formulations as well as recent
studies in fuzzy clustering. A major part is devoted to the
discussion of fuzzy c-means and its variations. Recent topics
such as kernel-based fuzzy c-means and clustering with
semi-supervision are mentioned. Moreover, fuzzy hierarchical
clustering is overviewed and a fundamental theorem is given.


15.1 Fuzzy Clustering


Data clustering is an old subject [15.1–4], but recently more
researchers are developing different techniques and application
fields are enlarging. Fuzzy clustering [15.5-10] is also popular
in a variety of fuzzy systems. This chapter reviews basic ideas
of fuzzy clustering and provides a brief overview of recent
studies. First, we consider the most popular method in fuzzy
clustering, i.e., fuzzy c-means. There are many variations,
extensions, and applications of fuzzy c-means, some of which are
described here. Recent studies on kernel-based methods and
clustering with semisupervision are also discussed in relation
to fuzzy c-means.

Moreover, another fuzzy clustering method is briefly mentioned,
which uses the transitive closure of fuzzy relations [15.10].
This method is shown to be equivalent to the well-known
single-linkage method of agglomerative hierarchical
clustering [15.11].


15.2 Fuzzy c-Means


We begin with basic notations and then introduce the method of
fuzzy c-means by Dunn [15.5,6] and Bezdek [15.7, 8].


15.2.1 Notations


The set of objects for clustering is denoted by
$X = \{x_1, \ldots, x_N\}$, where each object is a point of the
$p$-dimensional Euclidean space $\mathbb{R}^p$:
$x_k = (x_k^1, \ldots, x_k^p)$, $k = 1, \ldots, N$. Clusters are
denoted either by $G_i$ or simply by $i$ when no confusion
arises. Clustering uses a similarity or dissimilarity measure.
In this section, a dissimilarity measure denoted by $D(x, y)$,
$x, y \in \mathbb{R}^p$, is used.


Although we have different choices for the dissimilarity
measure, a standard measure is the squared Euclidean distance

$$D(x, y) = \|x - y\|^2 = \sum_{j=1}^{p} (x^j - y^j)^2 . \tag{15.1}$$


In fuzzy c-means and related methods, the number of clusters,
denoted by $c$, is assumed to be given beforehand. The
membership of object $x_k$ to cluster $i$ is assumed to be given
by $u_{ki}$. Moreover, the collection of all memberships is
denoted by the matrix $U = (u_{ki})$. It is natural to assume
that $u_{ki} \in [0, 1]$ for all $1 \le i \le c$ and
$1 \le k \le N$, and, moreover, $\sum_{i=1}^{c} u_{ki} = 1$ for
all $1 \le k \le N$.

The method of fuzzy c-means also uses a center for a cluster,
which is denoted by $v_i = (v_i^1, \ldots, v_i^p) \in \mathbb{R}^p$
for cluster $i$. For ease of reference, all cluster centers are
summarized into $V = (v_1, \ldots, v_c)$.


Basic K-Means Algorithm 
Many studies of clustering handle K-means [15.12] as a standard
method.


Algorithm 15.1 KM: Basic K-means algorithm


KM0: Generate randomly c cluster centers.

KM1: Allocate each object $x_k$ ($k = 1, \ldots, N$) to the
cluster of the nearest center.

KM2: Calculate new cluster centers as the centroids (the centers
of gravity). If all cluster centers are convergent, stop.
Otherwise, go to KM1.

End KM.

Note that the centroid of a cluster $G_i$ is given by
$v_i = \frac{1}{|G_i|} \sum_{x_k \in G_i} x_k$, where $|G_i|$ is
the number of objects in $G_i$.
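
The steps KM0 to KM2 can be sketched as follows. This is a minimal illustration rather than a reference implementation: drawing the initial centers from the objects themselves, keeping the old center of an empty cluster, and stopping when the centers no longer change are all assumptions of the sketch.

```python
import random


def sq_dist(x, y):
    """Squared Euclidean distance between two points, as in (15.1)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def k_means(points, c, max_iter=100, seed=0):
    """Basic K-means on a list of tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, c)  # KM0 (assumption: sample c objects as initial centers)
    labels = []
    for _ in range(max_iter):
        # KM1: allocate each object to the cluster of the nearest center
        labels = [min(range(c), key=lambda i: sq_dist(x, centers[i])) for x in points]
        # KM2: recompute each center as the centroid of its cluster
        new_centers = []
        for i in range(c):
            members = [x for x, lab in zip(points, labels) if lab == i]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[i])  # assumption: an empty cluster keeps its center
        if new_centers == centers:  # convergence: centers did not move
            break
        centers = new_centers
    return centers, labels


# Two hypothetical, well-separated groups of objects in the plane
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = k_means(data, c=2)
```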


15.2.2 Fuzzy c-Means Algorithm 


It should first be noted that the basic idea of fuzzy c-means is
an alternative optimization of an objective function proposed by
Dunn [15.5, 6] and Bezdek [15.7, 8]

$$J(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{N} (u_{ki})^m D(x_k, v_i) \quad (m > 1), \tag{15.2}$$

where $D(x_k, v_i)$ is the squared Euclidean distance (15.1).

Using this objective function, the following alternative
optimization is carried out.


Algorithm 15.2 FCM: Fuzzy c-means algorithm 


FCM0: Generate randomly initial fuzzy clusters. Let the
solutions be $(\bar{U}, \bar{V})$.

FCM1: Minimize $J(U, \bar{V})$ with respect to $U$. Let the
optimal solution be a new $\bar{U}$.

FCM2: Minimize $J(\bar{U}, V)$ with respect to $V$. Let the
optimal solution be a new $\bar{V}$.

FCM3: If the solution $(\bar{U}, \bar{V})$ is convergent, stop.
Else go to FCM1.


End FCM.


Note that optimization with respect to $U$ is with the
constraint

$$u_{ki} \in [0, 1], \ \forall\, 1 \le i \le c,\ 1 \le k \le N, \qquad \sum_{j=1}^{c} u_{kj} = 1, \ \forall\, 1 \le k \le N, \tag{15.3}$$

while optimization with respect to $V$ is without any
constraint.
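It is not difficult to have the optimal solutions as follows

$$\bar{u}_{ki} = \frac{\left( \dfrac{1}{D(x_k, \bar{v}_i)} \right)^{\frac{1}{m-1}}}{\sum_{j=1}^{c} \left( \dfrac{1}{D(x_k, \bar{v}_j)} \right)^{\frac{1}{m-1}}} , \tag{15.4}$$

$$\bar{v}_i = \frac{\sum_{k=1}^{N} (\bar{u}_{ki})^m x_k}{\sum_{k=1}^{N} (\bar{u}_{ki})^m} . \tag{15.5}$$

The derivations are omitted; the readers should refer to [15.8]
or other textbooks.

Note also that (15.4) appears ill-defined when $x_k = \bar{v}_i$.
In such a case, we use

$$\bar{u}_{ki} = \frac{1}{1 + \sum_{j \ne i} \left( \dfrac{D(x_k, \bar{v}_i)}{D(x_k, \bar{v}_j)} \right)^{\frac{1}{m-1}}} , \tag{15.6}$$

which has the same value as (15.4) without a singular point.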
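
A minimal sketch of Algorithm FCM built from the membership update (in the form (15.6)) and the center update (15.5) is given below. The initialization from sampled objects, the stopping test on the squared shift of the centers, and the equal sharing of membership among centers that coincide with an object are assumptions made for this illustration.

```python
import random


def sq_dist(x, y):
    """Squared Euclidean distance (15.1)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def update_memberships(points, centers, m):
    """Membership update; 1 / sum_j (d_i/d_j)^(1/(m-1)) equals (15.6) since the j = i term is 1."""
    c, e = len(centers), 1.0 / (m - 1.0)
    U = []
    for x in points:
        d = [sq_dist(x, v) for v in centers]
        if min(d) == 0.0:
            # Assumption: an object lying on one or more centers shares its membership among them
            zeros = [i for i in range(c) if d[i] == 0.0]
            U.append([1.0 / len(zeros) if i in zeros else 0.0 for i in range(c)])
        else:
            U.append([1.0 / sum((d[i] / d[j]) ** e for j in range(c)) for i in range(c)])
    return U


def update_centers(points, U, m):
    """Center update (15.5): mean of the objects weighted by (u_ki)^m."""
    p = len(points[0])
    centers = []
    for i in range(len(U[0])):
        w = [row[i] ** m for row in U]
        total = sum(w)
        centers.append(tuple(sum(wk * x[j] for wk, x in zip(w, points)) / total for j in range(p)))
    return centers


def fuzzy_c_means(points, c, m=2.0, tol=1e-10, max_iter=300, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, c)                      # FCM0 (assumption: sampled objects)
    U = update_memberships(points, centers, m)
    for _ in range(max_iter):
        new_centers = update_centers(points, U, m)       # minimize with respect to V
        U = update_memberships(points, new_centers, m)   # minimize with respect to U
        shift = max(sq_dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:                                  # FCM3: convergence test
            break
    return U, centers


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
U, V = fuzzy_c_means(data, c=2)
hard = [max(range(2), key=row.__getitem__) for row in U]  # crisp labels by maximum membership
```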


Moreover, we write these equations without the use of bars like

$$u_{ki} = \frac{\left( \dfrac{1}{D(x_k, v_i)} \right)^{\frac{1}{m-1}}}{\sum_{j=1}^{c} \left( \dfrac{1}{D(x_k, v_j)} \right)^{\frac{1}{m-1}}} , \qquad v_i = \frac{\sum_{k=1}^{N} (u_{ki})^m x_k}{\sum_{k=1}^{N} (u_{ki})^m}$$

for simplicity and without any confusion.


15.2.3 A Natural Classifier


These solutions lead us to the following natural fuzzy
classifier with a given set of cluster centers $V$

$$U_i(x; V) = \frac{\left( \dfrac{1}{D(x, v_i)} \right)^{\frac{1}{m-1}}}{\sum_{j=1}^{c} \left( \dfrac{1}{D(x, v_j)} \right)^{\frac{1}{m-1}}} . \tag{15.7}$$

There is nothing strange in (15.7), since $U_i(x; V)$ has been
derived from $u_{ki}$ simply by replacing the object $x_k$ by
the variable $x$.

This replacement appears rather trivial and it also appears that
$U_i(x; V)$ has no further information than $u_{ki}$. On the
contrary, this function is important if we wish to observe
theoretical properties of fuzzy c-means.

The following propositions are not difficult to prove 
and hence the proofs are omitted [15.13]. In particular, 
the first proposition is trivial. 


Proposition 15.1
$U_i(x_k; V) = u_{ki}$, i.e., the fuzzy classifier interpolates
the membership value $u_{ki}$.


Proposition 15.2
When $\|x\|$ goes to infinity, $U_i(x; V)$, $i = 1, \ldots, c$,
approaches the same value $1/c$:

$$\lim_{\|x\| \to \infty} U_i(x; V) = \frac{1}{c} .$$


Proposition 15.3
The maximum value of $U_i(x; V)$, $i = 1, \ldots, c$, is
attained at $x = v_i$:

$$\max_{x \in \mathbb{R}^p} U_i(x; V) = U_i(v_i; V) = 1 .$$


The significance of the function $U_i(x; V)$ is shown in these
propositions. An object $x_k$ is a fixed point, while $x$ is a
variable. Without such a variable, we cannot observe theoretical
properties of fuzzy c-means.
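
Propositions 15.2 and 15.3 can be checked numerically with a short sketch of the classifier (15.7), written in the singularity-free form of (15.6). The three centers and the test points are arbitrary values chosen for illustration, and the centers are assumed to be distinct.

```python
def sq_dist(x, y):
    """Squared Euclidean distance (15.1)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def classifier(x, centers, m=2.0):
    """Return [U_1(x; V), ..., U_c(x; V)] for a point x and a list of distinct centers."""
    d = [sq_dist(x, v) for v in centers]
    if 0.0 in d:
        # x coincides with a center: full membership there, none elsewhere
        return [1.0 if di == 0.0 else 0.0 for di in d]
    e = 1.0 / (m - 1.0)
    return [1.0 / sum((di / dj) ** e for dj in d) for di in d]


V = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]   # hypothetical cluster centers
at_center = classifier(V[1], V)            # Proposition 15.3: the value is 1 at x = v_i
far_away = classifier((1e6, -2e6), V)      # Proposition 15.2: every value is close to 1/c
```

The second call illustrates a practical consequence of Proposition 15.2: an outlier far from all centers still receives a membership of about $1/c$ in every cluster.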


15.2.4 Variations of Fuzzy c-Means 


Many variations of fuzzy c-means have been studied, 
among which we first mention fuzzy c-varieties [15.8], 
fuzzy c-regressions [15.14], and the method of 
Gustafson and Kessel [15.15] to take clusterwise co- 
variance into account. Note that these are relatively old 
variations and they all are based on variations of objec- 
tive functions including the change of D(x, v). 

In this section, we use the additional symbol
$\langle x, y \rangle = x^\top y = y^\top x$, which is the
standard scalar product of the Euclidean space $\mathbb{R}^p$.
Moreover, we introduce

$$D(x, v; S) = (x - v)^\top S^{-1} (x - v) ,$$

which is the squared Mahalanobis distance.


Fuzzy c-Varieties 
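Let us first consider a $q$-dimensional subspace

$$\mathrm{span}\{s_1, \ldots, s_q\} = \{ a_1 s_1 + \cdots + a_q s_q : -\infty < a_k < +\infty,\ k = 1, \ldots, q \} ,$$

where $s_1, \ldots, s_q$ is a set of orthonormal vectors with
$q < p$, and $s_0$ is a given vector of $\mathbb{R}^p$. A linear
variety $L$ in $\mathbb{R}^p$ is represented by

$$L = \{ l = s_0 + a_1 s_1 + \cdots + a_q s_q : -\infty < a_k < +\infty,\ k = 1, \ldots, q \} .$$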


Let

$$P(x, l) = \arg\max_{l \in L} \langle x - s_0, l - s_0 \rangle$$

be the projection of $x$ onto $L$. We then define

$$D(x_k, l_i) = \|x_k - s_0\|^2 - P(x_k, l_i)^2 .$$


We consider the objective function for fuzzy c-varieties

$$J(U, L) = \sum_{i=1}^{c} \sum_{k=1}^{N} (u_{ki})^m D(x_k, l_i) \quad (m > 1), \tag{15.8}$$

where $L = (l_1, \ldots, l_c)$.
The derivation of the solutions is omitted here, but the
solutions are as follows:

$$u_{ki} = \frac{\left( \dfrac{1}{D(x_k, l_i)} \right)^{\frac{1}{m-1}}}{\sum_{j=1}^{c} \left( \dfrac{1}{D(x_k, l_j)} \right)^{\frac{1}{m-1}}} , \tag{15.9}$$

$$s_0^{(i)} = \frac{\sum_{k=1}^{N} (u_{ki})^m x_k}{\sum_{k=1}^{N} (u_{ki})^m} , \tag{15.10}$$

while $s_j^{(i)}$ ($j = 1, \ldots, q$) are the normalized
eigenvectors corresponding to the $q$ maximum eigenvalues of the
matrix

$$A_i = \sum_{k=1}^{N} (u_{ki})^m \left( x_k - s_0^{(i)} \right) \left( x_k - s_0^{(i)} \right)^\top . \tag{15.11}$$

Note that the superscript $(i)$ indicates the vectors for
cluster $i$.

Therefore, alternative optimization of (15.8) is done by
calculating (15.9), (15.10), and the eigenvectors for (15.11)
repeatedly until convergence.




Fuzzy c-Regression Models 
In this section, we assume that $x = (x^1, \ldots, x^p)$ is a
$p$-dimensional independent variable, while $y$ is a
scalar-valued dependent variable. Hence a data set
$\{(x_1, y_1), \ldots, (x_N, y_N)\}$ is handled. We consider $c$
regression models

$$y = \sum_{j=1}^{p} \beta_i^j x^j + \beta_i^{p+1} , \quad i = 1, \ldots, c .$$

Hence, the squared error is taken to be the dissimilarity

$$D((x_k, y_k), B_i) = \left( y_k - \sum_{j=1}^{p} \beta_i^j x_k^j - \beta_i^{p+1} \right)^2 ,$$

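and an objective function

$$J(U, B) = \sum_{i=1}^{c} \sum_{k=1}^{N} (u_{ki})^m D((x_k, y_k), B_i) \quad (m > 1), \tag{15.12}$$

is considered, where $B_i = (\beta_i^1, \ldots, \beta_i^{p+1})$
and $B = (B_1, \ldots, B_c)$. To express the solutions in a
compact manner, we introduce two vectors

$$z_k = (x_k^1, \ldots, x_k^p, 1)^\top , \quad \beta_i = (\beta_i^1, \ldots, \beta_i^{p+1})^\top .$$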


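Then we have

$$u_{ki} = \frac{\left( \dfrac{1}{D((x_k, y_k), B_i)} \right)^{\frac{1}{m-1}}}{\sum_{j=1}^{c} \left( \dfrac{1}{D((x_k, y_k), B_j)} \right)^{\frac{1}{m-1}}} , \tag{15.13}$$

$$\beta_i = \left( \sum_{k=1}^{N} (u_{ki})^m z_k z_k^\top \right)^{-1} \sum_{k=1}^{N} (u_{ki})^m y_k z_k . \tag{15.14}$$

Thus the alternative optimization of $J(U, B)$ is to calculate
(15.13) and (15.14) iteratively until convergence.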
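
The coefficient update (15.14) is a weighted least-squares problem for each cluster. The sketch below solves the corresponding normal equations for a single regression model; the small Gaussian-elimination helper, the function names, and the sample data are assumptions made for illustration, and the weights stand for the values $(u_{ki})^m$ of one fixed cluster $i$.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (small, nonsingular A)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def regression_update(xs, ys, weights):
    """Solve (sum_k w_k z_k z_k^T) beta = sum_k w_k y_k z_k with z_k = (x_k, 1)."""
    zs = [list(x) + [1.0] for x in xs]
    q = len(zs[0])
    A = [[sum(w * z[r] * z[s] for w, z in zip(weights, zs)) for s in range(q)] for r in range(q)]
    rhs = [sum(w * y * z[r] for w, y, z in zip(weights, ys, zs)) for r in range(q)]
    return solve(A, rhs)


# Hypothetical data lying exactly on y = 2x + 1, with arbitrary positive weights
xs = [(0.0,), (1.0,), (2.0,), (3.0,)]
ys = [1.0, 3.0, 5.0, 7.0]
weights = [0.9, 0.5, 0.2, 0.7]
beta = regression_update(xs, ys, weights)   # slope and intercept of the fitted model
```

In a full fuzzy c-regression loop this update would be repeated for each cluster and alternated with the membership update (15.13).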


The Method of Gustafson and Kessel 
The method of Gustafson and Kessel enables us to incorporate
clusterwise covariance variables denoted by $S_1, \ldots, S_c$.
We consider

$$J(U, V, S) = \sum_{i=1}^{c} \sum_{k=1}^{N} (u_{ki})^m D(x_k, v_i; S_i) \quad (m > 1), \tag{15.15}$$

where a simplified symbol $S = (S_1, \ldots, S_c)$ and the
clusterwise squared Mahalanobis distance $D(x_k, v_i; S_i)$ are
used. Note also that $S_i$ is with the constraint

$$|S_i| = \rho_i \quad (\rho_i > 0), \tag{15.16}$$

where $\rho_i$ is a fixed parameter and $|S_i|$ is the
determinant of $S_i$. We assume, for simplicity,
$\rho_i = 1$ [15.16].

The solutions are as follows

$$u_{ki} = \left[ \sum_{j=1}^{c} \left( \frac{D(x_k, v_i; S_i)}{D(x_k, v_j; S_j)} \right)^{\frac{1}{m-1}} \right]^{-1} , \tag{15.17}$$

$$v_i = \frac{\sum_{k=1}^{N} (u_{ki})^m x_k}{\sum_{k=1}^{N} (u_{ki})^m} , \tag{15.18}$$

$$S_i = \frac{1}{|\hat{S}_i|^{\frac{1}{p}}} \sum_{k=1}^{N} (u_{ki})^m (x_k - v_i)(x_k - v_i)^\top , \tag{15.19}$$

where $\hat{S}_i = \sum_{k=1}^{N} (u_{ki})^m (x_k - v_i)(x_k - v_i)^\top$.

Since three types of variables are used for the method of
Gustafson and Kessel, the alternative optimization iteratively
calculates (15.17)–(15.19) until convergence.


15.2.5 Possibilistic Clustering 


The possibilistic clustering [15.17,18] proposed by
Krishnapuram and Keller does not use the constraint (15.3) in
the alternative optimization algorithm FCM. Rather, the
optimization with respect to $U$ is without any constraint.
Handling $\arg\min_U J(U, V)$ of (15.2) without a constraint
leads to the trivial solution $U = O$ (the zero matrix), and
hence they proposed a modified objective function

$$J_{\mathrm{pos}}(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{N} (u_{ki})^m D(x_k, v_i) + \sum_{i=1}^{c} \eta_i \sum_{k=1}^{N} (1 - u_{ki})^m \quad (m > 1), \tag{15.20}$$

where $\eta_i$ ($i = 1, \ldots, c$) is a positive constant.

We easily have the optimal $U$

$$u_{ki} = \frac{1}{1 + \left( \dfrac{D(x_k, v_i)}{\eta_i} \right)^{\frac{1}{m-1}}} , \tag{15.21}$$

while the optimal $V$ is given by (15.5).

The natural fuzzy classifier derived from the possibilistic
clustering is

$$U_i(x; v_i) = \frac{1}{1 + \left( \dfrac{D(x, v_i)}{\eta_i} \right)^{\frac{1}{m-1}}} . \tag{15.22}$$


This classifier has the following properties:


Proposition 15.4
$U_i(x_k; v_i) = u_{ki}$ by (15.21), i.e., the possibilistic
classifier interpolates the membership value $u_{ki}$.


Proposition 15.5
When $\|x\|$ goes to infinity, $U_i(x; v_i)$
($i = 1, \ldots, c$) approaches zero:

$$\lim_{\|x\| \to \infty} U_i(x; v_i) = 0 .$$


Proposition 15.6
The maximum value of $U_i(x; v_i)$ ($i = 1, \ldots, c$) is
attained at $x = v_i$:

$$\max_{x \in \mathbb{R}^p} U_i(x; v_i) = U_i(v_i; v_i) = 1 .$$
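
The behavior stated in Propositions 15.5 and 15.6 can be observed with a direct transcription of (15.22) for a single cluster. The center, the value of $\eta_i$, and the test points are hypothetical values chosen so that the results are easy to check by hand.

```python
def possibilistic_membership(x, v, eta, m=2.0):
    """U_i(x; v_i) of (15.22) for one cluster with center v and positive constant eta."""
    d = sum((a - b) ** 2 for a, b in zip(x, v))   # squared Euclidean distance (15.1)
    return 1.0 / (1.0 + (d / eta) ** (1.0 / (m - 1.0)))


v, eta = (1.0, 2.0), 4.0                                  # hypothetical center and eta_i
at_center = possibilistic_membership(v, v, eta)           # Proposition 15.6: maximum value 1
at_eta = possibilistic_membership((3.0, 2.0), v, eta)     # D equals eta here, so the value is 1/2
far = possibilistic_membership((1e6, 1e6), v, eta)        # Proposition 15.5: tends to zero
```

Unlike the classifier (15.7), each cluster is evaluated on its own, so the values over the clusters need not sum to one and an outlier receives a low membership everywhere.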


15.2.6 Kernel-Based Fuzzy c-Means 


Support vector machines [15.19, 20] are now among the most
popular methods of supervised classification. Since positive
definite kernels are frequently used in support vector machines,
kernels have also been studied by many researchers
(e.g., [15.21]). Positive definite kernels can also be used for
fuzzy c-means, as we see in this section.

The reason why we use kernels for clustering is that 
essentially the K-means and fuzzy c-means have linear 
boundaries between clusters. 

Note that the K-means classifier uses the nearest center
allocation rule for a given $x \in \mathbb{R}^p$

$$x \to G_i \iff i = \arg\min_{1 \le j \le c} D(x, v_j) ,$$

and hence the rule generates the Voronoi regions [15.22] with
the centers $v_1, \ldots, v_c$, which have piecewise linear
boundaries.


For fuzzy c-means, the classifiers are fuzzy, but if we
introduce a simplification of the rules by crisp
reallocation [15.13],

$$x \to G_i \iff i = \arg\max_{1 \le j \le c} U_j(x; V) ,$$

then we again have the Voronoi regions with the centers
$v_1, \ldots, v_c$.

The introduction of the covariance variables enables the cluster
boundaries to be quadratic, but more flexible nonlinear
boundaries cannot be obtained.

In order to have clusters with nonlinear boundaries, we can use
positive-definite kernels. Kernels are introduced by using a
high-dimensional mapping $\Phi: \mathbb{R}^p \to H$, where $H$
is a Hilbert space with the inner product
$\langle \cdot, \cdot \rangle_H$ and the norm $\|\cdot\|_H$.

Given objects $x_1, \ldots, x_N$, we consider their images under
the mapping $\Phi$: $\Phi(x_1), \ldots, \Phi(x_N)$. Note that
the method of kernels does not assume that an explicit form of
$\Phi(x_1), \ldots, \Phi(x_N)$ is known, but their inner product
$\langle \Phi(x_j), \Phi(x_k) \rangle_H$ is assumed to be known.
Specifically, a positive-definite function $K(x, y)$ is given
and we assume

$$K(x, y) = \langle \Phi(x), \Phi(y) \rangle_H .$$

This assumption seems abstract, but if we are given an actual
kernel function, the method becomes simple. For example, a
well-known kernel is the Gaussian kernel

$$K(x, y) = \exp(-C \|x - y\|^2) .$$

Then what we handle is
$\langle \Phi(x), \Phi(y) \rangle_H = \exp(-C \|x - y\|^2)$.

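We now proceed to consider kernel-based fuzzy c-means [15.23].
The objective function uses $\Phi(x_1), \ldots, \Phi(x_N)$ and
cluster centers $w_1, \ldots, w_c$ of $H$

$$J(U, W) = \sum_{i=1}^{c} \sum_{k=1}^{N} (u_{ki})^m \|\Phi(x_k) - w_i\|_H^2 \quad (m > 1), \tag{15.23}$$

where $W = (w_1, \ldots, w_c)$. We have

$$u_{ki} = \frac{\left( \dfrac{1}{\|\Phi(x_k) - w_i\|_H^2} \right)^{\frac{1}{m-1}}}{\sum_{j=1}^{c} \left( \dfrac{1}{\|\Phi(x_k) - w_j\|_H^2} \right)^{\frac{1}{m-1}}} , \tag{15.24}$$

$$w_i = \frac{\sum_{k=1}^{N} (u_{ki})^m \Phi(x_k)}{\sum_{k=1}^{N} (u_{ki})^m} . \tag{15.25}$$

Note, however, that the explicit form of $\Phi(x_k)$, and hence
of $w_i$, is not available. Therefore, we eliminate $w_i$ from
the iterative calculation. Thus, the update of $w_i$ is replaced
by the update of

$$D_H(x_k, w_i) = \|\Phi(x_k) - w_i\|_H^2 .$$

We then have

$$D_H(x_k, w_i) = K(x_k, x_k) - \frac{2}{\sum_{j=1}^{N} (u_{ji})^m} \sum_{j=1}^{N} (u_{ji})^m K(x_j, x_k) + \frac{1}{\left( \sum_{j=1}^{N} (u_{ji})^m \right)^2} \sum_{j=1}^{N} \sum_{\ell=1}^{N} (u_{ji} u_{\ell i})^m K(x_j, x_\ell) . \tag{15.26}$$

Using (15.26), we calculate

$$u_{ki} = \frac{\left( \dfrac{1}{D_H(x_k, w_i)} \right)^{\frac{1}{m-1}}}{\sum_{j=1}^{c} \left( \dfrac{1}{D_H(x_k, w_j)} \right)^{\frac{1}{m-1}}} . \tag{15.27}$$

Thus the alternative optimization repeats (15.26)
and (15.27) until convergence.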
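
The distance update (15.26) needs only the Gram matrix of the kernel and the current memberships. A minimal sketch is given below; the helper names and the parameter `C` of the Gaussian kernel are assumptions of this illustration. With the ordinary scalar product used as the kernel, the result should coincide with the squared Euclidean distance to the weighted mean (15.5), which gives a simple sanity check.

```python
import math


def gaussian_kernel(x, y, C=1.0):
    """Gaussian kernel K(x, y) = exp(-C ||x - y||^2)."""
    return math.exp(-C * sum((a - b) ** 2 for a, b in zip(x, y)))


def gram_matrix(points, kernel):
    """Matrix of kernel values K(x_j, x_k) for all pairs of objects."""
    return [[kernel(x, y) for y in points] for x in points]


def kernel_distances(K, U, m=2.0):
    """Return D[k][i] = D_H(x_k, w_i) of (15.26) from the Gram matrix K and memberships U."""
    N, c = len(K), len(U[0])
    D = [[0.0] * c for _ in range(N)]
    for i in range(c):
        w = [U[j][i] ** m for j in range(N)]
        total = sum(w)
        # Third term of (15.26): independent of k, so it is computed once per cluster
        quad = sum(w[j] * w[l] * K[j][l] for j in range(N) for l in range(N)) / total ** 2
        for k in range(N):
            cross = sum(w[j] * K[j][k] for j in range(N)) / total
            D[k][i] = K[k][k] - 2.0 * cross + quad
    return D


# Sanity check with the linear kernel (scalar product) and crisp memberships
pts = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
U = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
K_lin = gram_matrix(pts, lambda x, y: sum(a * b for a, b in zip(x, y)))
D_lin = kernel_distances(K_lin, U)

# The same routine with the Gaussian kernel, as used for nonlinear boundaries
D_gauss = kernel_distances(gram_matrix(pts, gaussian_kernel), U)
```

In the linear case the implicit centers are $(1, 0)$ and $(0, 2)$, so `D_lin` holds the squared Euclidean distances to those two points. Memberships would then be refreshed from the distances with (15.27), exactly as in ordinary fuzzy c-means.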

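Fuzzy classifiers of kernel-based fuzzy c-means can also be
derived [15.13]. We omit the details and show the functions in
the following

$$D(x; v_i) = K(x, x) - \frac{2}{\sum_{j=1}^{N} (u_{ji})^m} \sum_{j=1}^{N} (u_{ji})^m K(x, x_j) + \frac{1}{\left( \sum_{j=1}^{N} (u_{ji})^m \right)^2} \sum_{j=1}^{N} \sum_{\ell=1}^{N} (u_{ji} u_{\ell i})^m K(x_j, x_\ell) , \tag{15.28}$$

$$U_i(x) = \left[ \sum_{j=1}^{c} \left( \frac{D(x; v_i)}{D(x; v_j)} \right)^{\frac{1}{m-1}} \right]^{-1} . \tag{15.29}$$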


A Simple Numerical Example 
A well-known and simple example showing how the kernel method
works to produce clusters with nonlinear boundaries is given in
Fig. 15.1. There is a circular cluster inside another group of
objects of a ring shape. We call it ring around circle data.

Figure 15.2 shows the shape of a fuzzy classifier 
(15.29) with m = 2 and c = 2 obtained from the ring 
around circle data. Thus the ring and the circle inside 
the ring are perfectly separated. 


15.2.7 Clustering with Semi-Supervision


Recently many studies consider semisupervised learning
(e.g., [15.24,25]). In this section we briefly overview the
literature on fuzzy clustering with semi-supervision.

We begin with two classes of semisupervised learning after Zhu
and Goldberg [15.25]. They defined semisupervised classification
as a setting that has a set of labeled samples and another set
of unlabeled samples. The other class is called constrained
clustering, which has two sets of must-links
$ML = \{(x, y), \ldots\}$ and cannot-links
$CL = \{(z, w), \ldots\}$. Two objects $x$ and $y$ in the
must-link set have to be allocated to the same cluster, while
$z$ and $w$ in the cannot-link set have to be allocated to
different clusters.


Fig. 15.1 An example of a circle and a ring around the circle


Let us briefly mention two studies in the first class of semisupervised classification. Bouchachia and Pedrycz [15.26] used the following objective function that has an additional term

$$J(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{N} \left\{ (u_{ki})^m D(x_k, v_i) + \alpha\, (u_{ki} - \tilde{u}_{ki})^m D(x_k, v_i) \right\},$$

where $\tilde{u}_{ki}$ is a given membership showing semisupervision.

Miyamoto [15.27] proved that an objective func- 
tion with entropy term [15.13,28-31] can generalize 
the EM solution of the mixture of Gaussian densities 
with semisupervision [15.25, p. 27]. 

Another class of constrained clustering [15.32, 33] 
has also been studied using a modified objective func- 
tion [15.34] with additional terms of the must-link and 
cannot-link 


e N 
JU, V) =X) ui)’ Dar vi) 


i=1k=1 


> 5 Ugly + > 5 UkiUjk 


(xx ..43) EML il (k. )ECL i= 1 


N 
+a 
k=1 


Fig. 15.2 Two clusters and a fuzzy classifier from the ring around circle data; fuzzy c-means with m = 2 is used

To summarize, the method of fuzzy c-means with semisupervision, including constrained clustering, has not yet gained wide popularity, owing to the limited number of studies compared with those in machine learning, where many papers and a number of books have been published [15.24, 25, 32]. Hence more results can be expected in this area of study.

15.3 Hierarchical Fuzzy Clustering

There is still another method of fuzzy clustering that is very different from the above fuzzy c-means and is related to the single linkage in agglomerative hierarchical clustering.

In this section, we assume that objects $X = \{x_1, \ldots, x_N\}$ are not necessarily in a Euclidean space. Rather, a relation $S(x, y)$ satisfying reflexivity and symmetry

$$S(x, x) = 1, \quad \forall x \in X, \tag{15.30}$$
$$S(x, y) = S(y, x), \quad \forall x, y \in X \tag{15.31}$$

is assumed, where a larger value of $S(x, y)$ means that $x$ and $y$ are more similar.

We then describe the general algorithm of agglomerative hierarchical clustering as follows:

Algorithm 15.3 AHC: Algorithm of Agglomerative Hierarchical Clustering

AHC1: Let initial clusters be individual objects $G_i = \{x_i\}$, $i = 1, \ldots, N$, and put $K = N$.

AHC2: Find the pair of clusters of maximum similarity

$$(G_p, G_q) = \arg\max_{i,j} S(G_i, G_j).$$

Merge $G_r = G_p \cup G_q$. Put $K = K - 1$ and if $K = 1$, stop.

AHC3: Update $S(G_r, G')$ for all other clusters $G'$. Go to AHC2.

End AHC.


The updating step of AHC3 admits different choices of similarity between clusters, among which the single linkage method uses

$$S(G, G') = \max_{x \in G,\; y \in G'} S(x, y).$$

Although there are other choices, discussion in this section is focused upon the single linkage.
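The steps of the AHC algorithm with the single linkage can be sketched as follows. This is an illustrative sketch only: the function name, the dictionary-based bookkeeping, and the 4 x 4 similarity matrix are assumptions made here, and the update step uses the fact that the maximum over a merged cluster equals the larger of the two maxima over its parts.

```python
def single_linkage_ahc(S):
    """Run AHC1-AHC3 with the single linkage on a symmetric similarity matrix S.

    Returns the merges in the order performed, as (similarity level, members).
    """
    N = len(S)
    clusters = {i: [i] for i in range(N)}                    # AHC1: singletons
    sim = {(i, j): S[i][j] for i in range(N) for j in range(i + 1, N)}
    merges = []
    new_id = N
    while len(clusters) > 1:
        (p, q), level = max(sim.items(), key=lambda item: item[1])   # AHC2
        merged = sorted(clusters.pop(p) + clusters.pop(q))
        merges.append((level, merged))
        # AHC3: S(G_r, G') = max(S(G_p, G'), S(G_q, G')) for every other cluster G'
        updated = {pair: v for pair, v in sim.items()
                   if p not in pair and q not in pair}
        for g in clusters:
            updated[(g, new_id)] = max(sim[tuple(sorted((p, g)))],
                                       sim[tuple(sorted((q, g)))])
        sim = updated
        clusters[new_id] = merged
        new_id += 1
    return merges


# Hypothetical similarity matrix: reflexive (ones on the diagonal) and symmetric.
S = [[1.0, 0.9, 0.2, 0.1],
     [0.9, 1.0, 0.4, 0.3],
     [0.2, 0.4, 1.0, 0.8],
     [0.1, 0.3, 0.8, 1.0]]
merges = single_linkage_ahc(S)
# merges == [(0.9, [0, 1]), (0.8, [2, 3]), (0.4, [0, 1, 2, 3])]
```

The recorded levels are the similarity values at which clusters are joined, so reading the list from top to bottom gives the dendrogram from the finest to the coarsest partition.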

On the other hand, studies in the 1970s including Zadeh's [15.10] proposed hierarchical clustering using the transitive closure of fuzzy relations $S(x, y)$. To define the transitive closure, we introduce the max-min composition

$$(S \circ T)(x, z) = \max_{y \in X} \min\{S(x, y), T(y, z)\},$$

where $S$ and $T$ are fuzzy relations on $X$. Using the max-min composition, we can define the transitive closure $S^*$ of $S$

$$S^*(x, y) = \max\{S(x, y), S^2(x, y), S^3(x, y), \ldots\},$$

where $S^2 = S \circ S$ and $S^k = S \circ S^{k-1}$. It also is not difficult to see $S^* = S^{N-1}$ when $S$ is reflexive and symmetric.

When $S$ is reflexive and symmetric, the transitive closure $S^*$ is also reflexive and symmetric, and moreover transitive

$$S^*(x, y) \ge \min\{S^*(x, z), S^*(z, y)\}, \quad \forall z \in X.$$

If a fuzzy relation is reflexive, symmetric, and transitive, then it is called a fuzzy equivalence relation: it has the property that every $\alpha$-cut is a crisp equivalence relation

$$[S^*]_\alpha(x, x) = 1, \quad \forall x \in X, \tag{15.32}$$
$$[S^*]_\alpha(x, y) = [S^*]_\alpha(y, x), \quad \forall x, y \in X, \tag{15.33}$$
$$[S^*]_\alpha(x, y) = 1,\ [S^*]_\alpha(y, z) = 1 \;\Rightarrow\; [S^*]_\alpha(x, z) = 1, \tag{15.34}$$

where $[S^*]_\alpha(x, y)$ is the $\alpha$-cut of $S^*(x, y)$

$$[S^*]_\alpha(x, y) = 1 \iff S^*(x, y) \ge \alpha; \qquad [S^*]_\alpha(x, y) = 0 \iff S^*(x, y) < \alpha.$$
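The max-min composition, the transitive closure, and the $\alpha$-cut can be tried out on a small relation. The sketch below is an illustration under assumptions: the function names and the 4 x 4 relation are made up here, and the closure is computed by repeated squaring, which is valid for a reflexive relation because the powers $S^k$ are then nondecreasing in $k$.

```python
def maxmin_compose(S, T):
    """(S o T)(x, z) = max over y of min(S(x, y), T(y, z))."""
    n = len(S)
    return [[max(min(S[x][y], T[y][z]) for y in range(n)) for z in range(n)]
            for x in range(n)]


def transitive_closure(S):
    """Square the relation until it stops changing (assumes S is reflexive)."""
    closure = [row[:] for row in S]
    while True:
        squared = maxmin_compose(closure, closure)
        if squared == closure:
            return closure
        closure = squared


def alpha_cut_classes(S_star, alpha):
    """Equivalence classes of the crisp relation S*(x, y) >= alpha."""
    n = len(S_star)
    classes, seen = [], set()
    for x in range(n):
        if x in seen:
            continue
        members = [y for y in range(n) if S_star[x][y] >= alpha]
        seen.update(members)
        classes.append(members)
    return classes


# Hypothetical reflexive and symmetric relation on four objects.
S = [[1.0, 0.9, 0.2, 0.1],
     [0.9, 1.0, 0.4, 0.3],
     [0.2, 0.4, 1.0, 0.8],
     [0.1, 0.3, 0.8, 1.0]]
S_star = transitive_closure(S)
fine = alpha_cut_classes(S_star, 0.85)     # [[0, 1], [2], [3]]
middle = alpha_cut_classes(S_star, 0.5)    # [[0, 1], [2, 3]]
coarse = alpha_cut_classes(S_star, 0.4)    # [[0, 1, 2, 3]]
```

Lowering the threshold from 0.85 to 0.4 merges classes and never splits them, so the three partitions are nested.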


Thus each $\alpha$-cut of $S^*$ induces equivalence classes of $X$, and moreover if $\alpha$ decreases, the equivalence classes become coarser, and therefore $S^*$ defines hierarchical clusters.

We now can describe a fundamental theorem on fuzzy hierarchical clustering.

Theorem 15.1 Miyamoto [15.11]
Given a set of objects $X = \{x_1, \ldots, x_N\}$ and a similarity measure $S(x, y)$ for all $x, y \in X$, the following four methods give the same hierarchical clusters:

1. Clusters by the single linkage,
2. Clusters by the transitive closure $S^*$,
3. Clusters as vertices of connected components of the fuzzy graph with vertices $X$ and edges $X \times X$ with membership values $S(x, y)$, and
4. Clusters generated from the maximum spanning tree of the network with vertices $X$ and edges $X \times X$ with weight $S(x, y)$.

The above theorem needs some more explanation. Connected components of a fuzzy graph means the family of the connected components of all $\alpha$-cuts of the fuzzy graph. Since connected components grow with decreasing $\alpha$, those sets of vertices form hierarchical clusters. The minimum spanning tree is well known, but the maximum spanning tree is used here instead, since $S(x, y)$ is a similarity rather than a distance. The way in which hierarchical clusters are generated is the same as for the connected components of the fuzzy graph.

Although this theorem shows the importance of fuzzy hierarchical clustering, it appears that no new results that are useful in applications are included in this theorem. Miyamoto [15.35] showed, however, that other methods, DBSCAN [15.36] and Wishart's mode analysis [15.37], have close relations to the above results, and he discusses the possibility of further application of this theorem, e.g., to nonsymmetric similarity measures.

15.4 Conclusion

We overviewed fuzzy c-means and related studies. Kernel-based clustering algorithms and clustering with semisupervision were also discussed. Moreover, fuzzy hierarchical clustering, which is based on fuzzy graphs and is very different from fuzzy c-means, was briefly reviewed.


New methods and algorithms based on the idea of fuzzy c-means are still being developed, as the fundamental idea has enough potential to produce many new techniques. On the other hand, fuzzy hierarchical clustering is rarely mentioned in the literature. However, there are possibilities for new theory, methods, and applications in this area, as the fundamental mathematical structure is well established.

Many important studies were not mentioned in this overview; for example, Ruspini's method [15.38] and cluster validation measures are important. Readers may read books on fuzzy clustering [15.8, 9, 13, 16, 39] for details of fuzzy clustering. Also, Miyamoto [15.11] can still be used for studying the fundamental theorem on fuzzy hierarchical clustering.

16. An Algebraic Model of Reasoning to Support Zadeh's CWW


Enric Trillas 


In the very wide setting of a Basic Fuzzy Alge- 
bra, a formal algebraic model for Commonsense 
Reasoning is presented with fuzzy and crisp sets 
including, in particular, the usual case of the Stan- 
dard Algebras of Fuzzy Sets. The aim with which 
the model is constructed is that of, first, adding to 
Zadeh's Computing with Words a wide perspective 
of ordinary reasoning in agreement with some ba- 
sic characteristics of it, and second, presenting an 
operational ground on which linguistic terms can 
be represented, and schemes of inference posed. 
Additionally, the chapter tries to express the author's belief that reasoning deserves to be studied like an Experimental Science.

16.1 A View on Reasoning


Thinking is a natural and complex neurophysiological phenomenon, not yet well understood scientifically, that is shown by people thanks to their brains and is mostly and significantly externalized in some observable physical ways, as is the case of talking by means of uttered or written words. Only recently has the functioning of the brain's systems started to be studied with the current methods of experimental science, making some knowledge of its internal working possible.

Talking acquires full development with a typically social human manifestation called telling with, at least, its two modalities of discourse and narrative that, in oral, hand-signed, or written forms, not only support telling but, together with abstraction, could be considered among the highest expressions of the brain's capability of thinking, surely reinforced during evolution by the physical possibilities humans have to handle objects and to consider their possible usefulness. Telling can be roughly described as consisting of chains of sentences organized with some purpose.

Thinking and telling are but names for abstract concepts covering the totality of those human actions designated by to think, to tell, and to discuss, of which only the last two can be directly observed by a layperson. With telling and discussing not only is reasoning shown, but abstraction is also conveyed. In this sense, it can be said that telling and discussing cannot exist without reasoning, and that they are intermingled in some inextricable form that allows us to guess in order to foresee what will come in the future, to imagine what could happen in it, and to express it in words. At this point, the human capability of conjecturing appears as something fundamental [16.1].

Foreseeing and imagining proved essential for the growth and expansion of mankind on Earth, and are assisted by the human capabilities of guessing (or conjecturing) and refuting, to say nothing of those emotions that so often drive human reasoning toward creative thinking. Of course, as thinking comprises more features than reasoning, as is the case of feelings and of imagining with sounds, images, etc., both concepts should not be confused.

This Introduction refers only generically, and just in a collateral form, to telling as supporting discourse and narrative in natural language, for whose understanding the context-dependent and purpose-driven concept of the meaning of statements [16.2], their semantics, is essential. This chapter mainly deals with an algebraic analysis of the reasoning that, generated by the physical processes in the brain that thinking consists in, is externalized by means of the language of signs, oral or written expressions, figures, etc. It intends to be nothing more than a first attempt toward a possible algebra of reasoning more general than Boolean algebras, orthomodular lattices, and standard algebras of fuzzy sets, the algebraic structures in which classical propositional calculus, quantum physics' reasoning, and approximate reasoning are, respectively, represented and formally presented [16.3, 4].

Reasoning is considered the manifestation of rationality [16.5], a concept that comes from the Latin word ratio (namely, referred to comparing statements) and that, from very old times, allowed belief in a clear cut among the living species, under which the human one is the only one that is rational, that can reason. Reasoning is also, in turn, an abstract concept referring to several ways of obtaining conclusions from previous knowledge, information, or evidence, given by statements that are called the premises; it is sometimes said that the premises are the reasons for the conclusions, or the reasons that support them. Conclusions are also statements, and without previous knowledge neither reasoning nor understanding is possible.

Apparent processes of reasoning are what, to some extent, is observable and can be submitted to Menger's kind of exact thinking [16.6], by analyzing them in general enough algebraic terms. This is actually the final goal of this chapter, whose aim is to be placed at the ground of Zadeh's Computing with Words (CWW) [16.7], helping to adopt in it the point of view of ordinary reasoning and not only that of the deductive one, and to view CWW as close to the mathematical modeling of natural language and ordinary reasoning.

Note
Due, in large part, to the author's lack of knowledge, there are many topics appearing as manifestations of thinking and, up to some extent, matching with reasoning, that cannot even be slightly taken into account in a paper that is, in itself, of a very limited scope, and only contains generic reflections in its nontechnical Sects. 16.1-16.4. Among these topics, there are some that, as far as the author knows, have not yet been submitted to a systematic scientific study. This is the case, for instance, of what could be called sudden direct action, or action under pressure, as well as of those concerning thinking and reasoning in both the beaux arts and the creation of music [16.8].

Those topics can still deserve some scientific and subsequent philosophical reflection. In some of them, as is the case, for instance, in modern painting, there appears a still mysterious kind of play between actual and virtual situations, in which the same objects of reasoning can be seen as unfinished, but not as not-finished [16.9]; that is, the antonym seems to play a role different from that of the negation [16.10].

Since those topics are still open to more analysis, perhaps some of the methodologies of analogical or case-based reasoning could suggest some ways of exploring them in terms of fuzzy sets [16.11], and in view of a possible computational mechanization of some of their aspects.

16.2 Models

The pair telling-reasoning can be completed to a triplet of philosophically essential concepts with that of modeling that, facilitated by abstraction, is one among the best ways people have for capturing the basics of the phenomena appearing in some reality, not only by taking some perspective and distance from them but, after observation, by recognizing their more basic traits. Thanks to models, not only can the terms or words employed in the linguistic description of a phenomenon be understood well enough by bounding their meaning in a formal frame, but only thanks to mathematical models can the safest type of reasoning, formal deduction, be used for its study. In addition, models are currently often useful formalisms through which human reasoning can be dealt with by means of computers. This chapter essentially presents a new mathematical (algebraic) model for reasoning that is based not directly on truth but on contradiction, and that does not confuse reasoning with deducing, a confusion CWW should not fall into, since people mainly reason in nonformal and nondeductive ways.

If Experimental Science can be roughly described as an art of building plausible specific models of some reality with the aim of intellectually capturing it, in no case are they (namely, the algebraic ones) established forever; they change with time, something that shows they cannot always be confused with a time-cut of the reality being modeled. On the other hand, good models do guarantee, at least, to preserve what already holds in them once new evidence on the corresponding reality is known.

Even if models are not to be confused with what they represent, just as an architect's mockup should not be confused with a building following from it, mathematical and computational modeling are among the greatest human acquisitions coming from abstraction and safe reasoning. Actually, a good deal of what characterizes the current civilization derives from models, and mathematical ones are usually considered at the very top of rationality since they have proven to allow for a good comprehension of several important realities previously recognized as actually existing.

Models are a good help for a better understanding of the reality they model, and mathematical ones show the so-called unreasonable effectiveness of mathematics [16.12] for the understanding of reality thanks, in good part, to formal deduction, the safest form of reasoning they allow us to use. It can metaphorically be said that if reality is in color, a model of this reality is a simplification of it in black and white; of course, and up to some extent, modeling is an art whose simplifications should not leave out what is essential for the description of the corresponding reality.

16.3 Reasoning

To give a definition of reasoning is very difficult, if not impossible, since, in the end, it is a family of natural processes generated in the brain. A first operational question that can be posed is, What is reasoning for? To this, in a first approach, it can just be said that reasoning is actually intermingled with the will of people to ask and to answer questions, to influence, to inform, to teach, to convince, or just to communicate with other people, something usually managed by means of reciprocal telling, or dialogue, or conversation, between people.

In a second and perhaps complementary approach, it should also be said that reasoning serves to satisfy the human will to search for new ideas that are not immediately seen in the evidence, and that can help in further exploring the reality to which the premises refer. The human will for communicating and influencing, for foreseeing and for exploring, is made possible by means of reasoning that, in this perspective, seems to be a capability acquired thanks to the brain's complexity once it is, through the senses, in contact with the world and can try to understand what is and what happens in it by means of the neuronal/synaptic representations reached in the brain thanks, in part, to the external receptors of the human nervous system.


To answer a more scientifically sensitive and less psychological question, What does reasoning appear to be?, let us place ourselves in two different, although overlapping, points of view: that of the premises and that of the conclusions. From the first, reasoning can try to confirm, to explain, to enlarge, or to refute the information conveyed by the premises. From the second, and in confrontation with the context of the premises, conclusions can be classified as either necessary or contingent. In turn, the latter can be either explanations or speculations, that is, conclusions trying to foresee, either backward or forward, respectively, and perhaps by jumps from the premises and without clear rules for it, something new and currently unknown, but that eventually can be suggested by the premises and, very often, reached thanks to some additional background knowledge. All this is what leads to the typically human capability called creativity, many times obtained through either hypotheses (backward case) or speculations (forward jumping case). It even helps to take rational, pondered decisions, which are among the essential characteristics shown by the intelligence attributed to people.

As for necessary conclusions, they are not usually for capturing something radically new since, in general, what is not included, or hidden, in the premises, but is external to them, is changing. From necessary premises not only do necessary conclusions follow under some rules of inference, but also contingent ones that, in addition, cannot always be deployed from the premises under some well-known precise rules. Necessary conclusions are surely useful for arriving at a better understanding of what is strictly described by the premises, and for deploying what they contain, as happens in the formal sciences, and also for showing what cannot be the case if the given premises do not perfectly reflect the reality. In this sense, refutation of either a part of the evidence, or of hypotheses, etc., is actually important [16.13].


16.3.1 A Remark 
on the Mathematical Reasoning 


In the case of mathematics, where the only certified knowledge is furnished by the theorems proved by formal deduction, the former statement, that necessary conclusions are not for capturing something radically new, could be actually surprising, especially if it could mean that nothing new can be deductively deployed from some supposedly necessary premises. For instance, from Peano's axioms defining the set N of natural numbers, a large number of new and fertile concepts and theorems are deductively deployed and successfully applied to fields outside mathematics.

To quote a case: the abstract concept of the real number, not easily captured by nonmathematicians, is constructed after that of the rational number (whose set is Q), which comes from an equivalence between pairs of integers (set Z), after defining the concept of the integer number from another equivalence between pairs of natural numbers; at the end, with a jump from pairs to some infinite sequences of rational numbers, each equivalence class of such sequences is made a real number (set R). Not only are real numbers very useful in maths and outside it, but many more concepts arise after the real number is constructed in such a classificatory way, like, for instance, the two classifications of R into rational/irrational and into algebraic/transcendental, from which a large amount of useful certified knowledge arises. Some hints on how all that happened are the following.

The just described process of knowing through classifying came along a long period of time in which the very concept of number underwent changes. For instance, irrational numbers were not seen as actual numbers by some ancient Greek thinkers, for whom numbers were only the rational ones. Also, the natural number concept came from the counting of objects in the real world, the integer from indexing units in scales above and below some point, and the rational ones from systematically fractioning segments. For a long time the number 0 was not even known, nor later on considered an actual number, and the interest in irrational numbers grew from the necessity of managing expressions composed of roots of rational numbers, essentially in the solution of polynomial equations, as well as from the relevance of some rare numbers like π and e. Since along this process math was involved in many practical problems, those mathematicians who finally constructed the real numbers in purely mathematical terms and under the deductive procedures characterizing math, including in them integers, rationals, and irrationals, were strongly influenced by that long history. A hint of this influence is shown, for instance, by the names assigned to some classes of natural numbers: prime, quadratic, cubic, friend, etc., not to speak of the concept of a complex number coming from the so-called imaginary numbers.

The above-mentioned equivalences allowing passage from N to Z, Z to Q, and Q to R are classifications in some sets that, in each case, are derived from the former one and that, once formally well constructed, and hence being of an increasingly abstract character and named with more or less common names, are all based, in the end, on Peano's definition of N. This definition makes N a very intriguing set, whose curious and often surprising certified properties generated one of the more complex branches of maths, Number Theory, in whose study many concepts of mathematical analysis and probability theory are used. Let us just remember the sophisticated proof that changed Fermat's old conjecture into a theorem, that is, into certified mathematical knowledge. Before being deductively proven, mathematical conjectures, and in particular those in Number Theory, are but speculations well based on many positive instances.

Even accepting that math is freely created by mathematicians from some accepted minimal number of noncontradictory and independent axioms, and that the only way of certifying its knowledge is by deductive proof, it should not be forgotten that to imagine what in each case can be deduced from the premises, which is often done by analogy with a previously solved, or at least considered, case, is a sample of commonsense reasoning. In addition, the beauty mathematicians attribute to their results not only plays an important role in the development of math, but also shows how mathematicians reason just as cultured people do. A nice example of the fact that mathematicians also reason through commonsense reasoning is shown by the typical statement How beautiful this (supposed) concept or theorem is, claimed before a deductive proof fails and prevents accepting it as mathematical knowledge.

Mathematicians, like philosophers, detectives, writers, businessmen, scientists, physicians, etc., reason thanks to their brains and with the experiential background knowledge stored in them. They are moved by curiosity, supported by imagination and conjecturing, and driven by the will to reach new knowledge in their respective fields. In addition, a high level of creativity seems to be a remarkable characteristic of great mathematicians. All that originated Wigner's famous unreasonable effectiveness of mathematics in the natural sciences [16.12].

A positive confirmation of a reasoning against reality can be searched for through those conclusions showing a good level of agreement with the reality to which the information refers, and a negative one through the refutations, or conclusions contradicting either the premises or what necessarily follows from them. With respect to explanations, or hypotheses, from them should necessarily follow not only the premises, but also all that necessarily follows from the premises. Of course, all that can be qualified as necessary should not only be as safe as the premises could be, but also be obtained by means of precise rules allowing anybody to totally reproduce the processes going from the premises to the conclusions. In this sense, perhaps it could be better said that necessary conclusions are safe in the context of the premises, but that contingent conclusions are unsafe, and that the first should be obtained in a way showing that they are just reproducibly deployed from the premises. It seems clear enough that confirmation with reality should be searched for, in general, by means of some theoretical or experimental testing of the conclusions against the reality to which they refer.

With respect to the premises, and to ensure their safety, their set should be bound to show some internal consistency, for instance by containing neither contradictory pairs of premises nor self-contradictory ones, since otherwise it does not seem acceptable that the premises can jointly convey information that could be taken as admissible in the model. The same happens with the set of necessary conclusions directly deployed from the premises, since in this case the existence of contradictory conclusions would delete its necessity. The case of contingent conclusions is different, as everyday experience shows, since the existence of contradictory explanations or contradictory speculations is not surprising at all; it is sometimes the case that there are contradictory hypotheses or speculations for or from the same phenomenon.

All that, if only expressed by words or linguistic 
terms, cannot facilitate by itself a clear, distinct, and 
complete comprehension on the subject of reasoning 
with fertility enough to go further. To increase the com- 
prehension of the machinery of reasoning is for what 
a modeling of it in black and white [16.14] can be 
a good help toward a better understanding of what it 
is, or surrounds it, like mathematical models offer in 
experimental sciences from, at least, Newton’s time. Of 
course, to establish a mathematical model for reason- 
ing it is indeed necessary not only to be acquainted 
both with what reasoning is, and its different modali- 
ties, but to have a suitable frame of representation for 
all the involved linguistic terms, including the concept 
of contradiction. 

The concept of representation in a suitable formal 
frame is essential for establishing models and, jointly 
with the use of the deductive (safe) reasoning the for- 
mal frame makes possible, is what not only marks an 
important frontier between Science and Philosophy, but 
also helps us to show the unreasonable effectiveness of 
mathematics in Science and Technology. For instance, 
like the set of rational numbers is a good enough for- 
mal frame for the shop’s bill, the three-dimensional real 
space is a formal frame for 3D Euclidean geometry, the 
four-dimensional Riemann space is that for Relativity 
theory, and the infinite-dimensional Hilbert space is the 
frame for quantum physics. Mathematical models add 
to the study of reality the gift consisting in the possibil- 
ity of systematically applying to its analysis the safest 
form of reasoning, that is, formal deduction. 

When the linguistic terms translating the corre- 
sponding concepts are precise and, at least in principle, 
all the information that eventually can be needed is 
supposed to be available, like it is the case in the 
classical propositional calculus, the frame of Boolean 
algebras seems to be well suitable for representing these 
terms [16.3]. If the linguistic terms designating the 
involved basic concepts are precise but not all the infor- 
mation is always available, like it happens, for instance, 
in quantum physics, weaker structures than Boolean al- 
gebras, like orthomodular lattices are, could be taken 
into account. If there are involved essentially imprecise 
linguistic terms, like it happens in commonsense rea- 
soning, then the so-called algebras of fuzzy sets seem 
to be suitable once the linguistic terms are well enough 
designed by fuzzy sets [16.15, 16]. 




The problem of selecting a convenient frame of representation for ordinary reasoning, defined by a minimal number of axioms, is a crucial point that should be established in agreement with both the methodological principle that Occam's Razor states as Do not introduce more entities than those strictly necessary, and Menger's addendum, Nor fewer than those with which some interesting results can be obtained [16.17]. Such a methodological principle and addendum are, of course, taken into account in what follows since, otherwise, almost nothing could be added to what philosophers have said. Three important features that Menger's exact thinking [16.6] offers through mathematical models are that it is always clear under which presuppositions the obtained results hold, that deductive (safe) reasoning can be extensively used through a mathematical symbolism translating the basic traits of the subject, and that what is not yet included in it has a possibility of, at least, being clearly situated outside the model and, perhaps, later on included in a new and larger model. This is what, in the end, happened with the old Euclid's Geometry and Linear Algebra.


16.3.2 A Remark on Medical Reasoning 


The field of medicine [16.18], which is full of imprecise
technical concepts, is one in which the ordinary reasoning
used deserves careful consideration. In particular, it is
important to know whether a medical concept can or cannot be
specified by a classical set since, in the negative case, it
is not possible to conduct the corresponding reasoning in
the classical Boolean frame. For instance, if two concepts D
and B are interpreted as fuzzy sets, the statement ((D and B)
or (D and not B)) is only equivalent to D in some standard
algebras of fuzzy sets in which no law of duality
holds [16.19].
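
A small numerical check can make this last example concrete. The sketch below works pointwise on membership degrees and assumes the negation 1 - id together with two illustrative pairs of connectives: (min, max) and (product, bounded sum). These choices, the grid of degrees, and all function names are assumptions made for illustration; they are not taken from [16.19].

```python
# Illustrative pointwise connectives on membership degrees in [0, 1].
def t_min(a, b):
    return min(a, b)

def s_max(a, b):
    return max(a, b)

def t_prod(a, b):
    return a * b

def s_bounded(a, b):
    return min(1.0, a + b)

def neg(a):
    return 1.0 - a

GRID = [i / 10 for i in range(11)]  # sample degrees 0.0, 0.1, ..., 1.0
TOL = 1e-9                          # tolerance for floating-point noise

def law_holds(conj, disj):
    """True if ((D and B) or (D and not B)) equals D at every grid point."""
    return all(
        abs(disj(conj(d, b), conj(d, neg(b))) - d) <= TOL
        for d in GRID for b in GRID
    )

def is_dual(conj, disj):
    """True if not(a and b) equals (not a) or (not b) at every grid point."""
    return all(
        abs(neg(conj(a, b)) - disj(neg(a), neg(b))) <= TOL
        for a in GRID for b in GRID
    )

print(law_holds(t_min, s_max), is_dual(t_min, s_max))
print(law_holds(t_prod, s_bounded), is_dual(t_prod, s_bounded))
```

Under these assumptions the law fails for the dual pair (min, max), for example at D = 1 and B = 0.5, and holds for the pair (product, bounded sum), which is not dual with respect to 1 - id.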


In typical clinical reasoning for diagnosis, there are many
technical concepts that cannot be considered precise, since
they are clearly subject to degrees. Hence the concept of
observed diabetes can be submitted to an (empirical) Sorites
process [16.20] to conclude that it cannot be represented by
a classical set. For instance, a patient with 100 mg/dl of
glucose in blood does not suffer from diabetes, nor does one
with 101, 102, ..., and up to 120 mg/dl, at which point the
patient could be diagnosed with diabetes. Nevertheless, since
the crisp mark 120 is not reliable in all cases, being a
somewhat changing experimental threshold, it is better to
frame the diagnosis in the setting of fuzzy logic, also taking
into account the weight, the kind of job, the age, as well as
the usual alimentation of the patient. That is, physicians do
reason on the basis of the complex imprecise predicate that
could be written as: (Diabetes/Patient) (p) ~ glucose in blood
(p) & weight (p) & job (p) & age (p) & alimentation (p), in
its turn composed of elementary predicates, some of which are
also imprecise.

Physicians cannot reason by only taking into account the
amount of glucose in blood under the typical schemes of
classical reasoning. Once the current medical concepts are
translated into fuzzy terms, it is necessary to follow the
reasoning under those schemes allowed in a suitable algebra of
fuzzy sets [16.19], once it is designed in accordance with the
context in which the concepts are inscribed.

In addition, since what the processes leading to a diagnosis
try to find is a good enough hypothesis matching the symptoms
of the presumed illness, it is relevant for researchers on
medical reasoning to know about inductive abduction and
speculation [16.18] and, mainly, about what concerns applying
CWW. For all that, the laws holding in the framework chosen
for representing the involved fuzzy terms are relevant.


16.4 Reasoning and Logic


Let us stop for a while at the question Which is the
relationship between reasoning and logic? It requires first
stopping at the concept of logic, classically understood as
the formal study of the laws of reasoning and, in modern
times, basically understood by restricting reasoning to
deduction. Today’s logic is indeed the study of systems
allowing the safest type of reasoning under which, from a
consistent set of premises, a consistent set of necessary
conclusions, or logical consequences, is derived in a
step-by-step ruled process that can be fully reproduced by
another person who masters the use of the rules. Hence,
nowadays logic consists in the study of the so-called
deductive systems, and it mainly avoids other types of
reasoning like those by analogy, by abduction, and by
induction. After Tarski formalized deductive systems by means
of consequence operators [16.21], a logic is defined, in
mathematical terms, as a pair consisting of a set of
statements and a consequence operator that can be applied to
some (selected) subsets of these statements, once all that is
represented in a formal frame.

Nevertheless, it seems that in commonsense, everyday, or
ordinary reasoning, people only make deductions in, at most,
25% of the cases [16.22], and that many conclusions are
reached either with the help of analogy from a precedent and
similar case, or just by speculating in each case according to
some rules of thumb. In addition, some properties and schemes
of reasoning that were classically considered laws of
reasoning cannot be seen today as universally valid, as
happens with the distributive law in the reasoning of quantum
physics, and with the several schemes of reasoning with fuzzy
sets studied in [16.4] in the line of analyzing what is
sometimes known as the preservation of the Aristotelian form.

Consequently, the analysis of the more than 75% of reasoning
processes that are nondeductive should be considered of the
utmost importance for a more complete study of reasoning. In
some sense, such a study began with the work of Peirce [16.23]
for understanding scientific thinking, and later on was
continued with the studies on nonmonotonic reasoning in the
field of Artificial Intelligence for helping to mechanize some
ordinary ways of reasoning [16.24, 25]. Since Computing with
Words deals with ordinary reasoning in natural language, it
seems obvious that the nondeductive ways of reasoning should
be not only in the background of CWW, but also taken into
account in it.

This chapter is mainly devoted to a formal, algebraic study of
that 75%. Essentially, it can be seen as a trial for enlarging
logic from formal deductive to everyday reasoning with,
perhaps, a kind of return to the logic of the Middle Ages, as
can be considered from Occam’s saying [16.26] that
demonstration is the noblest part of logic, reflecting that
logic was seen at that time as more than the study of
deduction, even if deduction is considered a crucial form of
reasoning. In the model presented in this chapter, based on
conjectures (a term coming from [16.27]), deduction, as a
modality of reasoning, plays a central role both in a weak and
a strong form, corresponding, respectively, to the ordinary
and to the formal ways of reasoning.


16.5 A Possible Scheme for an Algebraic Model of Commonsense Reasoning


It does not seem that the term deduction can refer to the same
concept in formal and in ordinary reasoning, since in the
first it appears stricter than in the second where, for
instance, the conclusions (consequences) are not necessarily
admissible like the premises. Think, for instance, of what
philosophers consider to be a deduction, and of what
mathematicians refer to as a proof. Anyway, in both cases it
should try to reflect a safe enough kind of reasoning, in the
sense of attributing to the conclusions no less confidence
than that attributed to the items of initial information.
These items should also be admissible in the sense of being
knowledge as safe as possible on some subject. The good
quality of the initial information is actually important in
any process of reasoning.

Mathematics is considered the paradigm of deductive reasoning,
but this does not mean (as remarked in Sect. 16.3) that
mathematicians only reason deductively since, when they search
for something new, they reason like other people do, often by
jumping from the initial information to unwarranted
conclusions [16.22]. What is not accepted at all in
mathematics are contradictions and nondeductive proofs. If for
mathematicians the mathematical model, based on the
admissibility of its axioms, is their reality, then for
applied scientists or for engineers a model is just a
representation of some reality. For instance, no engineer
confuses the actual working of a machine with a dynamic model
of it and, when launching a rocket, it is well known that the
so-called nominal trajectory (computed from a mathematical
model) does not exactly coincide with the actual one, and the
performance of the rocket’s propulsion system is measured by
taking into account the difference between these trajectories.
In commonsense reasoning (CR) the situation still shows
notable differences with these cases.

Some characteristics separating commonsense from 
formal deductive reasoning are as follows: 


a) CR does not consist in a single type of reasoning, but in
   several. A reasoning in CR can be schematized by P ⊢ q,
   where P is a set of admissible items of information, q is a
   conclusion under the considered type of reasoning, and the
   symbol ⊢ reflects the corresponding reasoning’s process.
   Only in the case of a deductive reasoning are these
   processes done under a strict regulation.

   Alternatively, if AC(P) reflects the set of attainable
   conclusions, the scheme can be changed to q ∈ AC(P).

b) Often the items of information from which CR starts are
   expressed in natural language, with precise and imprecise
   linguistic terms, numbers, functions, pictures, etc. In
   addition, and also often, such items of information are
   partial and/or only partially reliable with respect to the
   reality they are concerned with. In what follows, it will be
   supposed that no ambiguous terms are contained in these
   items, and that they are expressed by linguistic statements.

c) Often CR lacks monotony. That is, when the number of initial
   information items increases, then either the number of
   conclusive items decreases (antimonotony), or there is no
   law for its variation (nonmonotony). Deduction is always
   monotonic, that is, no fewer conclusive items are obtained
   when the number of items of initial information grows.

d) It is typical of CR to jump from the initial information to
   some conclusions, that is, no step-by-step or
   element-after-element way can be followed. Jumping is never
   the case in formal deduction, where the conclusions should
   be deployed from the initial items of information in a
   strict step-by-step manner and under previously known rules,
   even if the current reasoner skips some trivial steps that,
   nevertheless, can always be easily recovered.

e) In CR, people try to obtain either explanations, or
   refutations, or what is hidden in the given information, or
   new ideas lucubrated from what is supposedly known (the
   initial information). Only the third of these kinds of
   conclusions is typical of deduction.

   A minimal limitation in CR is that of keeping some kind of
   consistency among the given items of initial information,
   such as not containing two contradictory items, and also
   between them and the conclusions.


Let us denote by P an accepted set of items of admissible
initial information (premises), and by AC(P) the attainable
conclusions under one of the last four types of CR in (e). It
will be supposed that P is in some designated family F of sets
able to consistently describe something, and that AC(P) is in
a larger family C such that F ⊂ C. Hence, AC can be seen as a
mapping AC: F → C. The sets in F are supposed to contain items
of admissible information, but this is not the case for those
sets in C − F. Then:


a) To do an analysis of CR in mathematical terms, an algebraic
   frame for representing all that is involved in CR should be
   selected in a way of not introducing more objects and laws
   than those strictly necessary at each case such as, for
   instance, a symbolic representation of the linguistic
   connectives and, or, not, and If/Then. The symbols that will
   be used in this chapter are ., +, ’, and ≤, of which the
   first two are binary operations, the third is a unary
   operation, and the fourth is a binary relation.

b) Basic in all kinds of reasoning is the concept of
   consistency, even if there is not a unique way of seeing it.
   Three possible definitions of consistency are as follows:

   • Consistency is identified with noncontradiction: If p,
     then it is never not-q, symbolically represented by
     p ≰ q’.

   • Consistency is identified with joint noncontradiction: If
     (p and q), then it is never not-(p and q), symbolically
     represented by p.q ≰ (p.q)’.

   • Consistency is identified with incompatibility: It is
     never (p and q), symbolically represented by p.q = 0,
     provided there exists a symbol 0 like the empty set ∅ in
     set theory.

   At each case a suitable definition of consistency should be
   chosen in accordance with the corresponding context but, in
   what follows, just the first one is taken. Of course, no
   other and less formal ways of looking at consistency should
   be excluded and, in any case, the concept of consistency
   between pairs of elements should be extended to sets of
   premises and sets of conclusions to make them consist,
   respectively, of admissible premises and attainable
   conclusions.

   Notice that in a Boolean algebra, since it is
   p.q = 0 ⇔ p ≤ q’ ⇔ p.q ≤ (p.q)’, the former definitions are
   equivalent and, for this reason, in the reasoning with
   precise linguistic terms there is no discussion about the
   concept of consistency.

c) p is contradictory with q, provided If p, then not-q, and p
   is self-contradictory when If p, then not-p. Notice that in
   ortholattices the only self-contradictory element is p = 0,
   and that with fuzzy sets endowed with the negation 1-id, the
   self-contradictory fuzzy sets are those such that
   A ≤ A’ = 1 − A ⇔ A(x) ≤ 1/2, for all x ∈ X.

d) If P ⊂ AC(P), for all P in F, it is said that AC is
   extensive, and AC(P) necessarily contains some items of
   admissible information.

e) If P ⊂ Q, both in F, then if AC(P) ⊂ AC(Q), it is said that
   AC is monotonic. If AC(Q) ⊂ AC(P), it is antimonotonic, and
   if AC is neither monotonic nor antimonotonic, it is said
   that AC is nonmonotonic, and there can exist cases in which,
   being P ⊂ Q, AC(P) and AC(Q) are not comparable under the
   set-inclusion ⊂.

f) Provided q represents a statement, and q’ represents its
   negation not-q: If q is in AC(P), then q’ is not in AC(P),
   or q’ is in AC(P)^c, it is said that AC is consistent in P.
   AC is just consistent if it is consistent in all P in F.

g) AC is said to be a closure if, for all P ∈ F, it is
   AC(P) ∈ F, and AC(AC(P)) = AC²(P) = AC(P).

h) AC(P) ∈ C − F means that not all the elements in AC(P) show
   the characteristics that make admissible those items in F.
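
The remark that the three forms p.q = 0, p ≤ q’, and p.q ≤ (p.q)’ coincide in the Boolean case, but need not coincide with fuzzy degrees, can be checked numerically. The following is a minimal sketch working on single membership degrees, assuming min as conjunction and 1 - id as negation; the function names are hypothetical.

```python
from itertools import product

def neg(a):
    return 1.0 - a

def contradiction_forms(p, q, conj=min):
    """Truth values of (p.q = 0, p <= q', p.q <= (p.q)') for two degrees."""
    pq = conj(p, q)
    return (pq == 0.0, p <= neg(q), pq <= neg(pq))

def is_self_contradictory(A):
    """A <= A' with A' = 1 - A, checked pointwise on a list of degrees."""
    return all(a <= neg(a) for a in A)

# On crisp degrees {0, 1} the three forms always agree.
crisp = [0.0, 1.0]
boolean_agree = all(
    len(set(contradiction_forms(p, q))) == 1
    for p, q in product(crisp, repeat=2)
)

# With the fuzzy degrees p = q = 0.5 they no longer agree.
fuzzy_case = contradiction_forms(0.5, 0.5)

print(boolean_agree, fuzzy_case)
print(is_self_contradictory([0.25, 0.5, 0.5]), is_self_contradictory([0.25, 0.75]))
```

The last line illustrates item (c): with the negation 1 - id, a fuzzy set is self-contradictory exactly when all its degrees are at most 1/2.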


Main Definitions 

1) A mapping AC: F → C is said to be a weak-deduction
   operator [16.28] if it is monotonic, and consistent under a
   suitable definition.

2) A mapping AC: F → F is said to be a strong-deduction
   operator, or a Tarski’s logical consequence operator
   [16.21], if it is a weak-deduction one that is also
   extensive, and is a closure.


If AC is a weak-deduction operator, the elements 
in AC(P) are called weak consequences of P. If AC 
is a Tarski’s operator, the elements in AC(P) are the 
strong or logical consequences of P. Since logicians 
universally consider that Tarski’s operators translate 
the characteristics of formal deductive systems, or for- 
mal deduction, it will be here considered that weak- 
deduction operators translate those of (some kind of) 
commonsense deduction. 


Remarks 16.1 


a) Notice that, in the model, F represents the family of those
   sets whose elements are accepted as items of admissible
   initial information, and that such admissibility, once
   translated into a suitable definition of consistency, should
   be defined at each case. Hence, each time F should be
   conveniently chosen. For instance, a possible definition is:
   F is the family of those P for which there are no p and q in
   it such that If p, then not-q.


b) At each case, it should be defined in which ground set W the
   sets in F and C are included, and W should be endowed with
   operations able to represent all that is necessary for the
   formalization of CR. For instance, it will be supposed that
   there is a binary relation ≤ in W such that p ≤ q translates
   into W the linguistic statement If p, then q. Analogously,
   and in the same vein in which ’ represents the linguistic
   not, there should be binary operations . and +,
   representing, respectively, the linguistic and, and or.

c) Basic in CR is the idea of conjecture, and that of
   refutation. Once a weak or strong consequence operator AC is
   adopted, the refutations of P could be defined as those
   elements r ∈ W such that r’ ∈ AC(P), that is, those whose
   negation is deducible from P. In its turn, the conjectures
   from P can be defined as those elements q ∈ W such that
   q’ ∉ AC(P), that is, those whose negation is not deducible
   from P. In this sense, the conjectures are the elements that
   are not (deductive) refutations of P. Both concepts could be
   more precisely named AC-refutations and AC-conjectures.


Since both precise and imprecise linguistic terms are usually
managed in CR, in what follows W will be the set of all fuzzy
sets in a universe of discourse X, that is, W = [0,1]^X, the
set of all functions A: X → [0,1]. This set will be endowed
with the algebraic structure of a Basic Fuzzy Algebra, where
the restriction of its operations to {0,1}^X makes this set a
Boolean algebra, isomorphic to the power set 2^X endowed with
the classical set-operations of intersection, union, and
complement of subsets. Crisp sets allow us to represent
precise linguistic terms, as stated by the axiom of
Specification in naive set theory [16.29], an axiom that
cannot be immediately extended to imprecise predicates since,
for instance, they are not always represented by a single
fuzzy set.


Definition 16.1 [16.1] 

If . and + are binary operations and ’ is a unary one, then
([0,1]^X, ., +, ’) is a Basic Fuzzy Algebra (BFA) provided it
holds,

1. A_0.A = A.A_0 = A_0, A_1.A = A.A_1 = A,
   A_0 + A = A + A_0 = A, A + A_1 = A_1 + A = A_1,

2. If A ≤ B, then C.A ≤ C.B, A.C ≤ B.C, C + A ≤ C + B,
   A + C ≤ B + C, and B’ ≤ A’,

3. A_0’ = A_1, and A_1’ = A_0,

4. If A, B ∈ {0,1}^X, then A.B = min(A, B), A + B = max(A, B),
   and A’ = 1 − A,

where A_0 is the function A_0(x) = 0, A_1 is the function
A_1(x) = 1, and 1 − A is (1 − A)(x) = 1 − A(x), for all x ∈ X.
Obviously, in 2^X, A_0 represents the empty set ∅, A_1 the
ground set X, and 1 − A the complement A^c of A.

Of course, it is A ≤ B if and only if A(x) ≤ B(x), for all x
in X, a partial order that with crisp sets reduces to A ⊂ B.


Notice that:


a) The formal connectives ., +, and ’ in a BFA are neither
   presumed to be functionally expressible, nor associative,
   nor commutative, nor distributive, nor dual, etc.

b) Only if . = min, and + = max, the BFA is a lattice that, if
   the negation ’ is a strong one (A’’ = A, for all A), is a
   De Morgan-Kleene algebra. Hence, no BFA is a Boolean
   algebra, and not even an ortholattice.

c) It is not difficult to prove [16.28] that it is always
   A.B ≤ min(A, B) ≤ max(A, B) ≤ A + B. Of course, the standard
   algebras of fuzzy sets are particular BFAs.

d) It is also easy to prove that:

   d.1) In a BFA with + = max, it holds the first law of
        semiduality: A’ + B’ ≤ (A.B)’, regardless of which are
        . and ’.

   d.2) In a BFA with . = min, it holds the second law of
        semiduality: (A + B)’ ≤ A’.B’, regardless of which are
        + and ’.

   d.3) Regardless of ’, in a BFA with . = min, and + = max,
        both semiduality laws hold.

e) Obviously, all standard algebras of fuzzy sets [16.30]
   (those in which . is decomposed by a continuous t-norm, + by
   a continuous t-conorm, and ’ by a strong negation function)
   are BFAs.
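
The chain A.B ≤ min(A, B) ≤ max(A, B) ≤ A + B and the two semiduality laws can be spot-checked numerically. The sketch below works pointwise on a grid of exact rational degrees and assumes three familiar t-norms, three t-conorms, and the negation 1 - id as sample connectives; it is a finite check under these assumptions, not a proof, and all names are illustrative.

```python
from fractions import Fraction

GRID = [Fraction(i, 20) for i in range(21)]  # exact degrees 0, 1/20, ..., 1

def neg(a):
    return 1 - a

CONJUNCTIONS = {
    "min": lambda a, b: min(a, b),
    "prod": lambda a, b: a * b,
    "lukasiewicz": lambda a, b: max(Fraction(0), a + b - 1),
}
DISJUNCTIONS = {
    "max": lambda a, b: max(a, b),
    "prob_sum": lambda a, b: a + b - a * b,
    "bounded_sum": lambda a, b: min(Fraction(1), a + b),
}

def chain_holds(conj, disj):
    """A.B <= min(A, B) <= max(A, B) <= A + B at every grid point."""
    return all(
        conj(a, b) <= min(a, b) <= max(a, b) <= disj(a, b)
        for a in GRID for b in GRID
    )

def first_semiduality(conj):
    """With + = max: A' + B' <= (A.B)'."""
    return all(
        max(neg(a), neg(b)) <= neg(conj(a, b))
        for a in GRID for b in GRID
    )

def second_semiduality(disj):
    """With . = min: (A + B)' <= A'.B'."""
    return all(
        neg(disj(a, b)) <= min(neg(a), neg(b))
        for a in GRID for b in GRID
    )

for c_name, conj in CONJUNCTIONS.items():
    for d_name, disj in DISJUNCTIONS.items():
        print(c_name, d_name, chain_holds(conj, disj))
```

Exact fractions are used instead of floats so that the comparisons are not affected by rounding.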


Since BFAs are defined by just a few axioms, in principle only
allowing very simple calculations, what can be proven in their
framework has a very general validity that is not modified by
the addition of new independent axioms. One of the weaknesses
of Boolean and De Morgan algebras, as well as of orthomodular
lattices, for representing CR lies precisely in the large
number of laws they enjoy, which makes them too rigid to
afford the flexibility that natural language and CR show in
comparison with any artificially constructed language and with
formal reasoning.


Remarks 16.2 
1. For what concerns the representation of a linguistic
   predicate L in a universe of discourse X by a fuzzy set, it
   is of actual interest to reflect on what the values A_L(x)
   can mean, for x ∈ X and A_L: X → [0,1], the membership
   function of a fuzzy set labeled L. Just the expression fuzzy
   set labeled L forces that the membership function A_L should
   translate something closely related to L, namely, the
   meaning of L in X. It is not clear at all that all
   predicates can be represented by a function taking its
   values in the totally ordered unit interval of the real
   line: the possibility of some numerical quantification of
   its meaning should be added to the involved predicate.

   To manage linguistically well a numerically quantifiable
   predicate L in X it should, at least, be recognized when it
   is, or is not, the case that x is less L than y, a
   linguistic (empirical and perceptively captured)
   relationship that can be translated into a binary relation
   x ≤_L y, with ≤_L ⊂ X × X. This relation reflects how the
   amount of L varies on X, and once the pair (X, ≤_L) is
   known, a measure of the extent up to which each x ∈ X is L
   is a mapping M_L: X → [0,1] such that
   x ≤_L y ⇒ M_L(x) ≤ M_L(y). Those elements x such that
   M_L(x) = 1, if existing, can be called the prototypes of L
   in X. Analogously, those y such that M_L(y) = 0 can be
   called the antiprototypes of L in X. Obviously, and in the
   same vein that there is not a single probability measuring a
   random event, there is not always a single measure M_L.

   If the use of the predicate is precise in X, all the
   elements x should be prototypes or antiprototypes, that is,
   M_L(x) is in {0,1}. When for some x it is 0 < M_L(x) < 1, it
   is said that the use of L is imprecise in X. Once a triplet
   (X, ≤_L, M_L) is known, it is a quantity that can be
   understood as reflecting the meaning of L in X [16.2].
   Calling M_L an ideal membership function of the fuzzy set
   labeled L, it can be said that it exists when the meaning of
   L in X is a quantity.

   Notice that each measure M_L defines a new binary relation
   given by x ≤_{M_L} y ⇔ M_L(x) ≤ M_L(y), obviously verifying
   ≤_L ⊂ ≤_{M_L}, that is, the new relation is larger than the
   former one, which is directly drawn from the perceived
   linguistic behavior of L in X.

   It is said that M_L perfectly reflects L whenever
   ≤_L = ≤_{M_L} but, since the second is always a linear
   relation (for all x, y in X, it is either M_L(x) ≤ M_L(y),
   or M_L(x) > M_L(y)), and ≤_L is not usually so, it cannot
   always be the case that M_L perfectly reflects L. This is
   one of the reasons for which, ≤_L being often difficult to
   know completely (for instance, if X is not finite), the
   designer just arrives at a function A_L (the membership
   function of the fuzzy set labeled L) that is not usually the
   ideal membership function M_L but an approximation of it,
   obtained through the data on L that are available to the
   designer. Of course, a good design is reached when it can be
   supposed that the value Sup{|M_L(x) − A_L(x)|; x ∈ X} is
   minimized.

   From all that comes the importance of carefully designing
   [16.15, 16] the membership functions with which a fuzzy
   system is represented. Analogous comments can be made for
   what concerns the connectives ., +, ’, and the axioms they
   verify, to reach a choice of the BFA ([0,1]^X, ., +, ’) well
   linked to the currently considered problem.
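
The notions of measure, prototype, antiprototype, and design error can be illustrated on a toy case. The sketch below assumes a hypothetical predicate tall on a five-point universe of heights in centimeters, with "x is less tall than y" taken as x ≤ y; the piecewise-linear functions M_tall (playing the role of M_L) and A_tall (playing the role of the designer's A_L) are invented for illustration and are not taken from the chapter.

```python
# Hypothetical universe of discourse: heights in centimeters.
X = [150, 160, 170, 180, 190]

def less_tall(x, y):
    """Assumed relation 'x is less tall than y' on X."""
    return x <= y

def M_tall(x):
    """Hypothetical ideal measure: 0 up to 160 cm, 1 from 180 cm on."""
    return min(1.0, max(0.0, (x - 160) / 20))

def A_tall(x):
    """Hypothetical designer's membership function approximating M_tall."""
    return min(1.0, max(0.0, (x - 150) / 40))

def is_measure(universe, rel, M):
    """x rel y implies M(x) <= M(y) for all pairs in the universe."""
    return all(M(x) <= M(y) for x in universe for y in universe if rel(x, y))

def prototypes(universe, M):
    return [x for x in universe if M(x) == 1.0]

def antiprototypes(universe, M):
    return [x for x in universe if M(x) == 0.0]

def sup_distance(universe, M, A):
    """Sup over the (finite) universe of |M(x) - A(x)|."""
    return max(abs(M(x) - A(x)) for x in universe)

print(is_measure(X, less_tall, M_tall))
print(prototypes(X, M_tall), antiprototypes(X, M_tall))
print(sup_distance(X, M_tall, A_tall))
```

On a finite universe the Sup is simply a maximum; a designer comparing several candidate functions A_L would prefer the one with the smallest such value.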


2. The suitability of the wide structure of BFAs for
   representing CR comes from, for instance, the fact that the
   linguistic conjunction and is not always commutative,
   especially when time intervenes, the laws of duality are not
   always valid when dealing with statements in natural
   language, the connectives’ decomposability (or functional
   expressibility) is not always guaranteed, the distributive
   laws between . and + do not always hold, etc.


3. When the linguistic terms are represented in a set W endowed
   with a partial order ≤, and with operations . of
   conjunction, + of disjunction, and ’ of negation,
   respectively representing the linguistic connectives
   If/Then, and, or, not, if (W, ., +) is a lattice (for all
   that concerns lattices, see [16.31]) at least the following
   five points do hold:

   I.   It is A ≤ B ⇔ A.B = A ⇔ A + B = B: The conditional
        statement If A, then B should be equivalent to the
        statements A and B coincides with A, and A or B
        coincides with B.

   II.  If and is represented by the lattice’s conjunction .,
        A.B is the greatest lower bound of both A and B: It
        should be known the set of all that is below A (C is
        below A means C ≤ A), the set of all that is below B,
        the intersection of these two sets, and that A.B is the
        greatest element in this last set (with respect to the
        partial order ≤).

   III. Analogously, for the case of or, if represented by the
        lattice’s disjunction +, there should be known the sets
        of elements in W that are greater than A, those that
        are greater than B, their intersection, and that A + B
        is the lowest element in this last set.

   IV.  A.B = B.A: The meaning of the statements A and B, and
        B and A cannot be different.

   V.   A + B = B + A: The meaning of the statements A or B,
        and B or A cannot be different.

   All this shows that, contrary to what usually happens in
   both CR and the applications, where the previous information
   is not only costly to search for, also in money, but almost
   impossible to collect completely, a lot of structural
   information on the reasoning’s context must necessarily be
   known for establishing the model. This is typical in formal
   sciences but, in the case of CR and also in many
   applications, it produces some scepticism about the
   possibility of always taking (W, ., +) as a lattice.

   To count on a lattice of representation for CR is something
   to be considered rare or, at least, limited to some cases,
   such as that of representing a formal-like type of reasoning
   with precise linguistic terms, where the former five points
   are usually accepted. These are, for instance, the cases of
   Boolean algebras with A → B = A’ + B = 1 ⇔ A ≤ B, and
   orthomodular lattices with A → B = A’ + A.B = 1 ⇔ A ≤ B,
   with the respective implication operators → translating the
   corresponding linguistic If/Then.

   With the standard algebras of fuzzy sets, a lattice is only
   reached when the connectives are given by the greatest
   t-norm min, and the lowest t-conorm max [16.30]. In this
   case, the implication functions with which it is
   A → B = A_1 ⇔ A ≤ B are the T-residuated ones [16.30],
   functionally expressible through the numerical functions
   J_T(a, b) = Sup{r ∈ [0,1]; T(a, r) ≤ b}, where T is a
   left-continuous t-norm, and that generalize the Boolean
   material conditional A → B = A’ + B since, in a complete
   Boolean algebra, it is A’ + B = Sup{C; A.C ≤ B}. The fuzzy
   implications given by (A → B)(x, y) = J_T(A(x), B(y)) enjoy
   many of the typical properties of Boolean algebras with the
   material conditional, and with them the standard algebra
   with min, max, and the strong negation 1-id enjoys, among
   these algebras, the biggest amount of Boolean laws, which
   makes it a very particular algebra for extensive use in CR.

   Nevertheless, it should be remembered that, in what concerns
   CR, when time intervenes the meanings of the statements A
   and B, and B and A cannot always coincide, as is the case
   with He sneezed and came to bed, and He came to bed and
   sneezed.
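
The T-residuated implications J_T(a, b) = Sup{r ∈ [0,1]; T(a, r) ≤ b} have well-known closed forms for three common continuous t-norms. The sketch below assumes those closed forms (usually attributed to Gödel, Goguen, and Łukasiewicz), and checks on a grid of exact rational degrees that J_T(a, b) = 1 exactly when a ≤ b and that, on {0, 1}, J_T reduces to the material conditional max(1 − a, b). The grid and the function names are illustrative assumptions.

```python
from fractions import Fraction

GRID = [Fraction(i, 10) for i in range(11)]  # exact degrees 0, 1/10, ..., 1
ONE = Fraction(1)

# Three continuous t-norms.
def t_min(a, b):
    return min(a, b)

def t_prod(a, b):
    return a * b

def t_luk(a, b):
    return max(Fraction(0), a + b - 1)

# Closed forms of their residuated implications J_T.
def j_min(a, b):
    return ONE if a <= b else b

def j_prod(a, b):
    return ONE if a <= b else b / a  # a > b >= 0 here, so a != 0

def j_luk(a, b):
    return min(ONE, 1 - a + b)

PAIRS = [(t_min, j_min), (t_prod, j_prod), (t_luk, j_luk)]

def sup_on_grid(t, a, b):
    """Largest grid value r with T(a, r) <= b: a lower bound of the Sup."""
    return max(r for r in GRID if t(a, r) <= b)

def one_iff_leq(j):
    """J_T(a, b) = 1 exactly when a <= b, on the grid."""
    return all((j(a, b) == 1) == (a <= b) for a in GRID for b in GRID)

def boolean_restriction(j):
    """On {0, 1}, J_T coincides with the material conditional max(1 - a, b)."""
    return all(j(a, b) == max(1 - a, b) for a in (0, 1) for b in (0, 1))

for t, j in PAIRS:
    print(j.__name__, one_iff_leq(j), boolean_restriction(j))
```

The helper sup_on_grid only approximates the Sup from below, since the true value (for instance b/a for the product) need not lie on the grid.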

4. The BFA’s structure, based on [0,1]^X, can yet be made more
   abstract. It simply requires considering, instead of
   [0,1]^X, once pointwise ordered by A ≤ B ⇔ A(x) ≤ B(x), for
   all x in X, a poset (L, ≤), with minimum 0 and maximum 1,
   endowed with two binary operations . and +, and a unary one
   ’, containing a subset L_0 ({0, 1} ⊂ L_0 ⊂ L) that, with the
   restrictions of the three operations ., +, and ’, is a
   Boolean algebra, and verifying laws analogous to the former
   ones in 1 to 4. These algebraic structures are called
   [16.32] Formal Basic Flexible Algebras, and they are a shell
   comprising ortholattices, De Morgan algebras, BFAs and, of
   course, orthomodular lattices, Boolean algebras, and
   standard algebras of fuzzy sets, as particular cases. By
   taking (W, ., +, ’) as a Formal Flexible Algebra, what
   follows can be generalized, with a few restrictions, to such
   an abstract and general shell.


16.6 Weak and Strong Deduction: Refutations and Conjectures in a BFA (with a Few Restrictions)


Let ([0,1]^X, ., +, ’) be a BFA whose negation ’ is a weak one,
that is, restricted to verify the law A ≤ A’’ for all
A ∈ [0,1]^X, and whose conjunction . is associative and
commutative. No other properties are presumed and, hence, what
follows largely contains the case of a standard algebra of
fuzzy sets, and can obviously be restricted to the Boolean
algebra of crisp sets in {0,1}^X.

Let us consider as the former family F of admissible premises,
the F(.) comprising the finite sets (of premises)
P = {A_1, ..., A_n}, such that their conjunction
A_P = A_1. ... .A_n is not self-contradictory, that is,
A_P ≰ A_P’. Of course, A_P ≰ A_P’ implies A_i ≰ A_j’, for all
A_i, A_j in P, and, obviously, it should be A_P ≠ A_0. Hence,
sets P ∈ F(.) neither contain contradictory premises, nor the
empty set A_0. Notice that associativity and commutativity of
the conjunction are presumed just to warrant a nonambiguous
definition of A_P, and the restriction on the negation ’ is
just to allow some step in a proof. Under these conditions,
the operator defined [16.28] by


C.(P) = {B ∈ [0,1]^X; A_P ≤ B} ,


translating into the BFA the statement If A_1 and A_2 and ...
and A_n, then B, or B follows from A_P in the order ≤,
verifies the following:


a) Usually, C.(P) is not in F(.); for instance, it is not
   always finite. Hence, in general it makes no sense to
   reapply the operator C. to C.(P). That is, the operator C.²
   cannot usually be defined, and still less can C. be made a
   closure.

b) Since A_P ≤ A_i, 1 ≤ i ≤ n, it is P ⊂ C.(P): C. is
   extensive.

c) If P ⊂ Q, with Q = P ∪ {A_{n+1}, ..., A_m}, and since
   A_Q = A_1. ... .A_n.A_{n+1}. ... .A_m ≤ A_P, then it follows
   C.(P) ⊂ C.(Q): C. is monotonic.

d) If B ∈ C.(P), it is not B’ ∈ C.(P): A_P ≤ B and
   A_P ≤ B’ ⇒ B ≤ B’’ ≤ A_P’, and it follows the absurd
   A_P ≤ A_P’: C. is consistent in all P ∈ F(.).

e) Obviously, A_1 ∈ C.(P), but A_0 ∉ C.(P). Hence, C.(P) ≠ ∅.
   Analogously, C.(P) cannot coincide with the full set
   [0,1]^X since it would imply the absurd A_P = A_0.

Consequently, all operators C. are consistent and extensive
weak-deduction operators.


In addition,


f) If C.(P) is not finite, then it is obviously
   Inf C.(P) = A_P.

g) No contradictory elements are in C.(P): If B, C ∈ C.(P) and
   it were B ≤ C’, from A_P ≤ B, A_P ≤ C, it follows
   B’ ≤ A_P’ and C’ ≤ A_P’, and the absurd
   A_P ≤ B ≤ C’ ≤ A_P’.

h) If .1 ≤ .2, that is, the operation .1 is weaker than the
   operation .2, it is obviously
   A_1 .1 ... .1 A_n ≤ A_1 .2 ... .2 A_n, and hence
   C.2(P) ⊂ C.1(P). That is, the bigger the operation ., the
   smaller the set C.(P). Consequently, if the operation min
   can be selected, it is Cmin(P) ⊂ C.(P), for all operations
   ., and all P ∈ F(min): Cmin is the smallest among the
   operators C..

Notice, that it is always F(min) C F(.): the family 
with min is the smallest among those of admissible 
premises. 

Provided C.(P) is a finite set, and since C. is 
extensive, it is C.(P) = PU{A,41,...,Am}, and 
then Ac.p) =A... An An+1 apg Ap.An+1 
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i) 


...Am <Ap. Thus, if .= min, Ac py = min(Ap, 
min(A;,...,Am)) =Ap, that means Cmin(P) € 
F(min), and has sense to reapply Cmin. Since 
P C Cmin(P) > Cmin(P) C Cmin(Cmin(P), and, 
if A € Cmin(Cmin(P)), then Acmincp) = Ap < A, 
it is A € Cmin(P), and from Cmin(Cmin(P)) C 
Cmin(P), it finally follows Cmin?(P) = Cmin(P). 
In conclusion, provided all the involved sets 
Cmin (P), with P € F(min), were finite, it is 
Cmin: F(min) > F(min), and Cmin will be 
a strong, or Tarski’s, consequence operator. 
j) Provided the family $F$ of sets of premises is freed from containing only
finite sets, since Inf always exists and, restricting $P$ to verify
$\mathrm{Inf}\, P \nleq (\mathrm{Inf}\, P)'$, for all $P \in F$, the operator

$C_\infty(P) = \{B \in [0,1]^X ; \mathrm{Inf}\, P \leq B\}$,

obviously verifying $\mathrm{Inf}\, C_\infty(P) = \mathrm{Inf}\, P$, is not only
extensive, consistent, and monotonic, but it is also a closure: From
$C_\infty(P) \in F$, and $P \subset C_\infty(P)$, it follows
$C_\infty(P) \subset C_\infty(C_\infty(P))$, but if
$B \in C_\infty(C_\infty(P))$, that is, $\mathrm{Inf}\, C_\infty(P) \leq B$,
from $\mathrm{Inf}\, C_\infty(P) = \mathrm{Inf}\, P$ follows
$B \in C_\infty(P)$. Finally, $C_\infty(C_\infty(P)) = C_\infty(P)$.

$C_\infty$, restricted to finite sets, is just $C_{\min}$, and restricted to
all the crisp sets in $\{0,1\}^X$, is the consequence operator on which
classical propositional calculus is developed [16.3].
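The properties of the weak-deduction operator can be checked numerically on a small example. The following is only a minimal sketch under stated assumptions: the universe X has three elements, the conjunction is min, the negation is the standard 1 - a, and all names (`leq`, `neg`, `resume`, `is_admissible`, `is_consequence`) are illustrative and do not come from the chapter. Since the set of consequences is infinite, membership is tested instead of enumerating it.

```python
from typing import Dict, List

# A fuzzy set on a finite universe X, stored as element -> membership degree.
FuzzySet = Dict[str, float]


def leq(a: FuzzySet, b: FuzzySet) -> bool:
    """Pointwise order: a <= b iff a(x) <= b(x) for every x in X."""
    return all(a[x] <= b[x] for x in a)


def neg(a: FuzzySet) -> FuzzySet:
    """Strong negation a'(x) = 1 - a(x) (an assumption of this sketch)."""
    return {x: 1.0 - v for x, v in a.items()}


def resume(premises: List[FuzzySet]) -> FuzzySet:
    """A_P: the conjunction of all premises, here taken with min."""
    return {x: min(p[x] for p in premises) for x in premises[0]}


def is_admissible(premises: List[FuzzySet]) -> bool:
    """P is admissible when A_P is not self-contradictory: not (A_P <= A_P')."""
    a_p = resume(premises)
    return not leq(a_p, neg(a_p))


def is_consequence(premises: List[FuzzySet], b: FuzzySet) -> bool:
    """B is a weak consequence of P iff A_P <= B."""
    return leq(resume(premises), b)


A1 = {"x1": 0.75, "x2": 1.0, "x3": 0.5}
A2 = {"x1": 1.0, "x2": 0.75, "x3": 0.25}
P = [A1, A2]
B = {"x1": 1.0, "x2": 0.75, "x3": 0.5}
A0 = {"x1": 0.0, "x2": 0.0, "x3": 0.0}  # the empty set

print("admissible:", is_admissible(P))
print("extensive:", all(is_consequence(P, a) for a in P))
print("B follows, B' does not:", is_consequence(P, B), is_consequence(P, neg(B)))
print("empty set follows:", is_consequence(P, A0))
```

The example only illustrates extensivity and consistency for one set of premises; it is not a proof of the general statements.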


Definition 16.2
Given a weak-deduction operator $C_\cdot$ [16.28]:

1. The set of $C_\cdot$-refutations of $P$ is
$\mathrm{Ref}_\cdot(P) = \{B \in [0,1]^X ; B' \in C_\cdot(P)\}$.

2. The set of $C_\cdot$-conjectures of $P$ is
$\mathrm{Conj}_\cdot(P) = \{B \in [0,1]^X ; B' \in C_\cdot(P)^c\}$.

Namely, refutations are those fuzzy sets whose negation is weakly deducible
from the premises, and conjectures those whose negation is not weakly
deducible from them. Obviously, $\mathrm{Conj}_\cdot(P) = \mathrm{Ref}_\cdot(P)^c$.

Notice that it immediately follows that all operators $\mathrm{Ref}_\cdot$ are
consistent and monotonic, but not extensive, and that all operators
$\mathrm{Conj}_\cdot$ are extensive, antimonotonic, not consistent, and
consequently it cannot be stated that $\mathrm{Conj}_\cdot(P)$ is always in
$F(\cdot)$. It is $\mathrm{Conj}_\cdot : F(\cdot) \to [0,1]^X$, and not all
conjectures can be taken as items of admissible information.

It is also immediate that
$\mathrm{Ref}_\cdot(P) \cup \mathrm{Conj}_\cdot(P) = [0,1]^X$, and
$\mathrm{Ref}_\cdot(P) \cap \mathrm{Conj}_\cdot(P) = \emptyset$, that is, both
sets constitute a partition of the set of all fuzzy sets in $X$, and
$\mathrm{Conj}_\cdot(P) = \mathrm{Ref}_\cdot(P)^c$. Hence, the conjectures are
those fuzzy sets that are nonrefutable in front of the information furnished
by $P$.

Since $C_{\min}(P) \subset C_\cdot(P)$, it follows that:

• $\mathrm{Ref}_{\min}(P) \subset \mathrm{Ref}_\cdot(P)$: $\mathrm{Ref}_{\min}$
is the smallest among refutation operators.

• $\mathrm{Conj}_\cdot(P) \subset \mathrm{Conj}_{\min}(P)$:
$\mathrm{Conj}_{\min}$ is the biggest among conjecture operators.

Namely,

$P \subset C_{\min}(P) \subset C_\cdot(P) \subset \mathrm{Conj}_\cdot(P) \subset \mathrm{Conj}_{\min}(P)$,

a chain of inclusions showing that both weak and strong consequences are but
a particular type of conjectures. Consequently, in the model deducing is but
one of the forms of conjecturing, as it is asked for in [16.27].


Remarks 16.3
1. Only if it is $\cdot = \min$, it holds:
$\mathrm{Ref}_{\min}(C_{\min}(P)) = \mathrm{Ref}_{\min}(P)$, and
$\mathrm{Conj}_{\min}(C_{\min}(P)) = \mathrm{Conj}_{\min}(P)$, showing that
strong consequences neither allow to obtain more refutations, nor more
conjectures.

2. Since $A_0 \in \mathrm{Ref}_\cdot(P)$, it is
$\mathrm{Ref}_\cdot(P) \neq \emptyset$. Nevertheless, $\mathrm{Ref}_\cdot(P)$
cannot coincide with the full set $[0,1]^X$, since it would imply
$A_P = A_0$.

3. The sets $\mathrm{Conj}_\cdot(P)$ cannot be empty, since it would imply
$C_\cdot(P) = [0,1]^X$. On the other side, it is
$\mathrm{Conj}_\cdot(P) = [0,1]^X \Leftrightarrow C_\cdot(P) = \emptyset$.
Hence, it is always

$\emptyset \subsetneq \mathrm{Conj}_\cdot(P) \subsetneq [0,1]^X$.

In this model, the empty set $A_0$ cannot be taken for either conjecturing,
or deducing, or refuting.

4. The particularization of the concept of conjecture to crisp sets, that is,
to the fuzzy sets in $\{0,1\}^X \subset [0,1]^X$, reduces to taking as
$F(\min)$ the set of those crisp sets that are nonempty, since with crisp
sets it is $A \subset A^c \Leftrightarrow A = \emptyset$, and thus
$\mathrm{Conj}_\cdot(P)$ is the set of those $B \subset X$ such that
$A \cap B \neq \emptyset$, since with crisp sets it is
$A \subset B^c \Leftrightarrow A \cap B = \emptyset$.
With classical sets, that is, in Boolean algebras, there is no distinction
between contradiction and incompatibility.

5. After the classical definition: $B$ is decidable $\Leftrightarrow$ either
$B$ is provable, or not $B$ is provable, the set of $\cdot$-weakly decidable
elements for $P$ can be defined by
$C_\cdot(P) \cup \mathrm{Ref}_\cdot(P)$, and the set of strongly decidable
elements for $P$ by $C_{\min}(P) \cup \mathrm{Ref}_{\min}(P)$. Obviously, and
for all operations $\cdot$, strongly decidable elements are $\cdot$-weakly
decidable ones, but not reciprocally [16.32].
The nonstrongly decidable elements are those fuzzy sets in the set

$(C_{\min}(P) \cup \mathrm{Ref}_{\min}(P))^c = C_{\min}(P)^c \cap \mathrm{Ref}_{\min}(P)^c = C_{\min}(P)^c \cap \mathrm{Conj}_{\min}(P)$,

that is, they are the conjectures that are not strong consequences.
Analogously, given $P$, the $\cdot$-weakly nondecidable elements are those
fuzzy sets that are $\cdot$-weak conjectures, but not $\cdot$-weak
consequences. Consequently, to obtain a classification of the difference
sets $\mathrm{Conj}_\cdot(P) - C_\cdot(P)$, and
$\mathrm{Conj}_{\min}(P) - C_{\min}(P)$, is actually important.

6. Since $\mathrm{Ref}_\cdot$ is monotonic and consistent, it could be
alternatively taken for defining conjectures [16.28] in the parallel form

$\mathrm{Conj}^*_\cdot(P) = \{B \in [0,1]^X ; B' \notin \mathrm{Ref}_\cdot(P)\} = C_\cdot(P)^c$,

where conjectures appear as just those fuzzy sets that are not weak
consequences instead of those that are not refutations. Under this new
definition, it is $\mathrm{Ref}_\cdot(P) \subset \mathrm{Conj}^*_\cdot(P)$.
Notice that refutations $B$ can be defined without directly referring to
$C_\cdot$, by

$\mathrm{Ref}_\cdot(P) = \{B \in [0,1]^X ; A_P \leq B'\}$.

7. With all that, a tentative formal definition for a model of (Commonsense)
Reasoning (CR) could be essayed by posing, à la Popper [16.13],

$\mathrm{CR}_\cdot(P) = \mathrm{Ref}_\cdot(P) \cup \mathrm{Conj}_\cdot(P)$,

once the operations $\cdot$ and ' are selected, and $C_\cdot(P)$ defined
with, at least, $A_P \neq A_0$. Anyway, this definition gives nothing else
than
$\mathrm{CR}_\cdot(P) = \mathrm{Conj}_\cdot(P) \cup \mathrm{Conj}_\cdot(P)^c = \mathrm{Ref}_\cdot(P) \cup \mathrm{Ref}_\cdot(P)^c$,
with which the formal model for CR appears as nothing else than either
conjecturing or refuting, once $\cdot$-weak consequences are taken as the
basic concept toward formalizing CR.
Thus, a new concept of strong-CR can be introduced by

$\mathrm{CR}_{\min}(P) = \mathrm{Ref}_{\min}(P) \cup \mathrm{Conj}_{\min}(P)$,

and in both cases deduction (weak and strong, respectively) is a type of
conjecturing [16.27].

8. The imposed commutativity and associativity of the conjunction can be
avoided by previously fixing an algorithm to define $A_P$. For instance, if
$P$ contains four premises, then the algorithm's steps can be the following:
1) Select an order for the premises and call them $A_1, \ldots, A_4$.
2) Define $A_P$ by $A_1 \cdot (A_2 \cdot (A_3 \cdot A_4))$. Then, of course,
all that has been formerly said depends on the way chosen to define $A_P$.
For what refers to the restriction on the negation ', notice that it is
already verified in the cases (usual in fuzzy logic) where it is strong:
$A'' = A$, for all $A \in [0,1]^X$, as it is the case in the standard
algebras of fuzzy sets.
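The fixed-order algorithm described in the remarks above (select an order, then nest the conjunction from the right) can be sketched as a fold. This is only an illustration under assumptions: the pointwise operation `weak_conj(a, b) = a * b * b` is a hypothetical conjunction chosen because it is neither commutative nor associative, and the helper names are not from the chapter.

```python
from typing import Callable, Dict, List

FuzzySet = Dict[str, float]


def weak_conj(a: float, b: float) -> float:
    """Hypothetical non-associative, non-commutative conjunction (<= min(a, b))."""
    return a * b * b


def pointwise(op: Callable[[float, float], float], a: FuzzySet, b: FuzzySet) -> FuzzySet:
    """Apply a binary operation on [0, 1] element by element."""
    return {x: op(a[x], b[x]) for x in a}


def right_nested(premises: List[FuzzySet], op) -> FuzzySet:
    """A_P = A_1 . (A_2 . (... . A_n)) for the chosen order of the premises."""
    result = premises[-1]
    for p in reversed(premises[:-1]):
        result = pointwise(op, p, result)
    return result


def left_nested(premises: List[FuzzySet], op) -> FuzzySet:
    """((A_1 . A_2) . ...) . A_n, shown only to compare with the rule above."""
    result = premises[0]
    for p in premises[1:]:
        result = pointwise(op, result, p)
    return result


premises = [
    {"a": 1.0, "b": 0.5},
    {"a": 0.5, "b": 0.5},
    {"a": 1.0, "b": 0.5},
    {"a": 0.5, "b": 1.0},
]
print(right_nested(premises, weak_conj))
print(left_nested(premises, weak_conj))
```

With such an operation the two nestings give different fuzzy sets, which is why everything said about the model depends on the rule chosen to define the conjunction of the premises.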
16.7 Toward a Classification of Conjectures 
Provided it is $\mathrm{Conj}_\cdot(P) - C_\cdot(P) \neq \emptyset$, it is
clear that the left-hand difference set is equal to

$\{B \in \mathrm{Conj}_\cdot(P) : B < A_P\} \cup \{B \in \mathrm{Conj}_\cdot(P) : A_P \text{ is not } \leq\text{-comparable with } B\}$.

That is,

$\mathrm{Conj}_\cdot(P) = C_\cdot(P) \cup \{B \in [0,1]^X ; A_P \nleq B' \;\&\; A_0 < B < A_P\} \cup \{B \in [0,1]^X ; A_P \nleq B' \;\&\; A_P \;\mathrm{nc}\; B\}$,

with the symbol nc shortening "not $\leq$-comparable with".
Notice that the second set in this union contains the fuzzy sets $B$ being
neither empty, nor contradictory with $A_P$, and for which "If $B$, then
$A_P$". Consequently, it can be said that these fuzzy sets $B$ explain $A_P$,
or $P$, and, of course, they also explain any $\cdot$-weak consequence of
$P$: If $A_P \leq C$, it follows $B \leq C$. Let us denote this set of
conjectures by $\mathrm{Hyp}_\cdot(P)$, and call it the set of explicative
conjectures or, for short, hypotheses for $P$. If
$C_\cdot(P) \in F(\cdot)$, it is clear that
$\mathrm{Hyp}_\cdot(C_\cdot(P)) = \mathrm{Hyp}_\cdot(P)$, as it also happens
with the strongest conjunction $\cdot = \min$.


For what concerns the third set in the last union, call it
$\mathrm{Sp}_\cdot(P)$, it is decomposable in the disjoint union

$\{B \in [0,1]^X ; A_P > B' \;\&\; A_P \;\mathrm{nc}\; B\} \cup \{B \in [0,1]^X ; A_P \;\mathrm{nc}\; B' \;\&\; A_P \;\mathrm{nc}\; B\}$,

whose elements will be called speculative conjectures or, for short,
speculations. Let us, respectively, denote by $\mathrm{Sp}_{\cdot 1}(P)$ and
$\mathrm{Sp}_{\cdot 2}(P)$ the first and the second set in the
decomposition. The elements in $\mathrm{Sp}_{\cdot i}(P)$ will be called
type-$i$ speculations ($i = 1, 2$).

It should be pointed out that the symbol nc shows the jumps cited before,
and that these jumps affect both types of speculations but, especially,
those of type-2. In the case in which $B \in \mathrm{Sp}_{\cdot 1}(P)$,
since $A_P > B'$ is equivalent to $A_P' < B$, $B$ could be captured by going
forward from $A_P'$, but if $B \in \mathrm{Sp}_{\cdot 2}(P)$ a jump from
either $A_P$, or from $A_P'$, is necessary to reach $B$.

It should also be pointed out that weak consequences are reached by moving
forward from $A_P$, that hypotheses are reached by moving backward from
$A_P$, but that for speculations a jump forward from either $A_P$, or
$A_P'$, is required. It is clear, in addition, that
$\mathrm{Conj}_\cdot(P) = \mathrm{Conj}_\cdot(\{A_P\})$,
$C_\cdot(P) = C_\cdot(\{A_P\})$,
$\mathrm{Hyp}_\cdot(P) = \mathrm{Hyp}_\cdot(\{A_P\})$, and
$\mathrm{Sp}_\cdot(P) = \mathrm{Sp}_\cdot(\{A_P\})$, since
$\{A_P\} \in F(\cdot)$, and in this sense, $A_P$ can be seen as the résumé of
the information conveyed by $P$.

With all that, the set of conjectures from $P$ is completely classified by
the disjoint union

$\mathrm{Conj}_\cdot(P) = C_\cdot(P) \cup \mathrm{Hyp}_\cdot(P) \cup \mathrm{Sp}_{\cdot 1}(P) \cup \mathrm{Sp}_{\cdot 2}(P)$,

since all the intersections $C_\cdot(P) \cap \mathrm{Hyp}_\cdot(P), \ldots,
\mathrm{Sp}_{\cdot 1}(P) \cap \mathrm{Sp}_{\cdot 2}(P)$ are empty. Hence,
conjectures are either weak consequences, or hypotheses, or type-1
speculations, or type-2 speculations. Consequently, and once given
$P \in F(\cdot)$, all the fuzzy sets in $[0,1]^X$ are classified in
refutations, $\cdot$-weak consequences, hypotheses, speculations of type-1
and of type-2, with strong consequences being a part of the weak ones. It
should be pointed out that, in some particular cases, the sets
$\mathrm{Hyp}_\cdot(P)$, or $\mathrm{Sp}_\cdot(P)$, can be empty.
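The classification above can be turned into a small decision procedure. The following sketch assumes a finite universe, the conjunction min, and the negation 1 - a; the function name `classify` and the string labels are illustrative choices, not notation from the chapter.

```python
from typing import Dict, List

FuzzySet = Dict[str, float]


def leq(a: FuzzySet, b: FuzzySet) -> bool:
    """Pointwise order a <= b."""
    return all(a[x] <= b[x] for x in a)


def lt(a: FuzzySet, b: FuzzySet) -> bool:
    """Strict order: a <= b and a != b."""
    return leq(a, b) and a != b


def neg(a: FuzzySet) -> FuzzySet:
    """Strong negation 1 - a (an assumption of this sketch)."""
    return {x: 1.0 - v for x, v in a.items()}


def resume(premises: List[FuzzySet]) -> FuzzySet:
    """A_P computed with min (an assumption of this sketch)."""
    return {x: min(p[x] for p in premises) for x in premises[0]}


def classify(premises: List[FuzzySet], b: FuzzySet) -> str:
    a_p = resume(premises)
    if leq(a_p, neg(b)):      # A_P <= B': B is refuted
        return "refutation"
    if leq(a_p, b):           # A_P <= B
        return "consequence"
    if lt(b, a_p):            # A_0 < B < A_P (B = A_0 was already refuted)
        return "hypothesis"
    # From here on A_P and B are not comparable.
    if lt(neg(b), a_p):       # B' < A_P
        return "speculation type-1"
    return "speculation type-2"   # A_P nc B' and A_P nc B


P = [{"x1": 0.75, "x2": 0.75, "x3": 0.25}]
examples = {
    "B1": {"x1": 1.0, "x2": 0.75, "x3": 0.5},
    "B2": {"x1": 0.75, "x2": 0.5, "x3": 0.25},
    "B3": {"x1": 0.25, "x2": 0.0, "x3": 0.5},
    "B4": {"x1": 0.5, "x2": 0.5, "x3": 1.0},
    "B5": {"x1": 1.0, "x2": 0.5, "x3": 0.0},
}
for name, b in examples.items():
    print(name, classify(P, b))
```

The order of the tests matters: refutations are removed first, so every remaining fuzzy set is a conjecture and falls into exactly one of the four classes.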


Note 

It can be said that the above algebraic model for CR 
contains strong and weak deduction, abduction (the 
search for hypotheses), and also speculative reasoning. 


What happens with hypotheses and speculations for what relates to monotony?
Since given two enchained sets of premises $P \subset Q$, both in
$F(\cdot)$, the conjunctions of their premises, call them respectively $A_P$
and $A_Q$, obviously verify $A_Q \leq A_P$, it is
$\mathrm{Hyp}_\cdot(Q) \subset \mathrm{Hyp}_\cdot(P)$: the operator
$\mathrm{Hyp}_\cdot : F(\cdot) \to [0,1]^X$ is antimonotonic. Notice that
$\mathrm{Hyp}_\cdot$ is not a consistent operator and, consequently, it
cannot be supposed that $\mathrm{Hyp}_\cdot(P)$ always contains admissible
information. It is risky to take a hypothesis as a new premise.

Analogously, $\mathrm{Sp}_\cdot$ is also a mapping
$F(\cdot) \to [0,1]^X$ that is nonconsistent and, as it is easy to see by
means of simple examples with crisp sets, it is neither monotonic nor
antimonotonic, nor can elements in $\mathrm{Sp}_\cdot(P)$ always be taken as
admissible information. That is, since there is no law for the growing of
$\mathrm{Sp}_\cdot$ with the growing of the premises, it can be said that
speculations are peculiar among conjectures. Such peculiarity is somewhat
clarified by what follows. If $S \in \mathrm{Sp}_\cdot(P)$:

a) If $S$ is such that $A_P \cdot S \neq A_0$, it is $A_P \cdot S < A_P$,
since if it were $A_P \cdot S = A_P$, then follows $A_P \leq S$, or the
absurd $S \in C_\cdot(P)$. From $A_0 < A_P \cdot S < A_P$, and provided
$A_P \cdot S$ is a conjecture, it follows
$A_P \cdot S \in \mathrm{Hyp}_\cdot(P)$.

b) Since $A_P \leq A_P + S$, it is always $A_P + S \in C_\cdot(P)$.

c) Provided the law of semiduality $B' + C' \leq (B \cdot C)'$ holds in the
BFA $([0,1]^X, \cdot, +, ')$, as it happens with $+ = \max$, and since
$A_P \leq A_P'' + S' \leq (A_P' \cdot S)'$, it follows that $A_P' \cdot S$ is
a refutation.

Hence, by means of speculations $S$, hypotheses $A_P \cdot S$ can be
obtained provided $A_P \cdot S \neq A_0$ is a conjecture, $A_P + S$ is
always a weak consequence, and with semiduality $A_P' \cdot S$ is a
refutation. In this sense, speculations can serve as a tool for deducing,
for abducing, and for refuting: They are auxiliary conjectures for either
deploying what is hidden in the premises, or for refuting the premises, or
for explaining them. For this reason, to speculate is an important type of
nonruled reasoning, whose mastery should be encouraged.


A Remark on Heuristics 
Although the concept of heuristics is not yet formalized, from the last
paragraphs, and under a few soft constraints, speculations can be seen as
auxiliary conjectures that intermediate for advancing reasoning. Since to
reach a speculation $S$ a jump from $A_P$ should be taken, because there are
no direct and step-by-step links between $A_P$ and $S$, the process to
arrive by the intermediary of $S$ to either a consequence $A_P + S$, or
a hypothesis $A_P \cdot S$, or a refutation $A_P' \cdot S$, is a typically
heuristic one, perhaps obtained in each case by some nonstep-by-step rule of
thumb. For instance, a hypothesis like $H = A_P \cdot S$ can be reached
after some heuristic path conducting to $S$ by some jumping from $A_P$ that,
additionally, can be done in several different steps.

In this respect, the formal characterization of those hypotheses that are
reducible to the form $A_P \cdot C$ for some fuzzy set $C \in [0,1]^X$ is
still an open problem. Since in the case of orthomodular lattices and, of
course, of Boolean algebras, it was proven that all hypotheses are reducible
[16.1], but that in the case of nonorthomodular ortholattices nonreducible
ones should exist, given $P$ the existence in $[0,1]^X$ of nonreducible
hypotheses can be analogously supposed. Consequently, if existing, such
hypotheses can be seen as isolated ones that cannot be reached by the
intermediary of a speculation, but only directly by a particular heuristic
backward-track from $A_P$.

For what concerns hypotheses, and except in formal 
deductive reasoning where, in principle, they can be ei- 
ther safely accepted, or refused through deduction, in 
CR a crucial point is to know how a hypothesis can be 
deductively or inductively refuted [16.28, 33]. As it is 
well known, the idea of refuting a hypothesis is central 
in scientific research [16.13], and it can be formalized 
in the current model as follows [16.32]. 


Let us suppose that there is a doubt about which one of the two statements,
$H \in \mathrm{Hyp}_\cdot(P)$ and $H \notin \mathrm{Hyp}_\cdot(P)$, is valid,
but knowing that $H \nleq H'$, that is, that the presumed hypothesis is not
self-contradictory:

a) Provided the first statement is valid, since $H < A_P$ and $A_P \leq C$
imply $H \leq C$, it is $C_\cdot(P) \subset C_\cdot(\{H\})$. Thus, if there is
$D \in C_\cdot(P)$ such that $D \notin C_\cdot(\{H\})$, $H$ cannot be
a hypothesis for $P$: To weak-deductively refute $H$ as a hypothesis for $P$,
it suffices to find a weak consequence of $P$ that is not a weak consequence
of $\{H\}$.
Of course, classical (strong) deductive refutation corresponds to the case in
which it is possible to take $\cdot = \min$.

b) In addition, it is $C_\cdot(P) \subset \mathrm{Conj}_\cdot(\{H\})$. Indeed,
$A_P \leq B$ and $H < A_P$ do not allow $B \leq H'$ (that is,
$B \notin \mathrm{Conj}_\cdot(\{H\})$), since in this case it follows the
absurd $H \leq H'$. Thus, it is $B \nleq H'$, or
$B \in \mathrm{Conj}_\cdot(\{H\})$. Consequently, to weak-inductively refute
$H$ as a hypothesis for $P$ it suffices to find a weak consequence of $P$
that is not conjecturable from $\{H\}$.
All that can be, mutatis mutandis, repeated with the strongest conjunction
$\cdot = \min$, and the concept of the strong-inductive refutation of $H$ is
reached.


16.8 Last Remarks


In former papers of the author [16.28, 32, 34], the concept of a conjecture
was formalized in the settings of ortholattices and De Morgan algebras. In
the first, and since $A_P \leq A_P'$ implies $A_P = 0$, it suffices to take
$A_P \neq 0$. What lacked was the case of the standard algebras of fuzzy sets
with a t-norm and a t-conorm different, respectively, from min and max,
a case that is subsumed in what is presented in [16.28], and now is completed
in this chapter.

With only crisp sets, the résumé $A_P$ of the information conveyed by the
premises in $P$ obviously verifies
$A_P \leq B' \Leftrightarrow A_P \cdot B = A_0 \Leftrightarrow A_P \cdot B \leq (A_P \cdot B)'$,
but this chain of equivalences fails to hold if $A_P$ or $B$ are proper fuzzy
sets. Consequently, and in addition to
$C_\cdot(P) = \{B ; A_P \leq B\}$, the two operators [16.32]

$C_\cdot^1(P) = \{B ; A_P \cdot B' = A_0\}$,
$C_\cdot^2(P) = \{B ; A_P \cdot B' \leq (A_P \cdot B')'\}$,

can also be considered, with which the corresponding conjecture operators
could also be defined by

$\mathrm{Conj}_\cdot^i(P) = \{B \in [0,1]^X ; B' \notin C_\cdot^i(P)\}$, $i = 1, 2$,

by taking consistency in the other two forms cited in Sect. 16.5.

Both operators $C_\cdot^i$ are not extensive, but monotonic; at least
$C_\cdot^1$ is consistent if the negation is functionally expressible; it is
unknown which other $C_\cdot^i$ ($i = 1, 2$) is consistent;
$C_\cdot^1(P) \subset C_\cdot^2(P)$; and it is not actually clear if in both
cases it is, or it is not,
$C_\cdot^i(P) \subset \mathrm{Conj}_\cdot^i(P)$, except when
$\cdot = \min$ [16.28].

Consequently, the door is open to consider alternative definitions for the
concept of a conjecture, depending on the way of defining when an element
$B \in [0,1]^X$ is consistent with the résumé $A_P$. Anyway, what seems
actually difficult is to imagine a kind of nonconsistent deduction.


It is well known that in CR conclusions are often obtained by some kind of
analogy or similitude with a previously considered case. Without trying to
completely formalize analogical reasoning, let us introduce some ideas that,
eventually, could lead toward such formalization.

Define $B$ as analogous to $A$ if there exists a family $K$ of mappings

$\sigma : [0,1]^X \to [0,1]^X$,

such that $B = \sigma \circ A$, for $\sigma \in K$. Namely, $B$ is
$K$-analogous to $A$. In turn, the set of fuzzy sets that are analogous or
similar to those in $P$ can be defined by

$K(A_P) = \{B \in [0,1]^X ; B = \sigma \circ A_P, \sigma \in K\}$.


Then:

1) If $\mathrm{id} \leq \sigma$, then $A_P \leq \sigma \circ A_P$, and
$\sigma \circ A_P \in C_\cdot(P)$.

2) If $\sigma < \mathrm{id}$, then $\sigma \circ A_P < A_P$, and
$\sigma \circ A_P \in \mathrm{Hyp}_\cdot(P)$.

3) If $\mathrm{id} \;\mathrm{nc}\; \sigma$, then
$\sigma \circ A_P \;\mathrm{nc}\; A_P$, and
$\sigma \circ A_P \in \mathrm{Sp}_\cdot(P)$.

4) If $\mathrm{id} \leq \sigma'$, then
$A_P \leq \sigma' \circ A_P = (\sigma \circ A_P)'$, and
$\sigma \circ A_P \in \mathrm{Ref}_\cdot(P)$.

Depending on the possible ordering of the pairs $(\mathrm{id}, \sigma)$ and
$(\mathrm{id}, \sigma')$, with id the identity mapping in $[0,1]^X$, that is,
$\mathrm{id}(A) = A$ for all $A \in [0,1]^X$, the fuzzy sets analogous to
$A_P$ are either consequences, or hypotheses, or speculations, or
refutations. Notice that in point 4) $\mathrm{id} \leq \sigma'$ is equivalent
to $\sigma \leq \mathrm{id}'$, that is, $\sigma \circ A \leq A'$ for all
fuzzy sets $A$.

Of course, for a further study of analogical types of reasoning it remains to
submit the transformations $\sigma$ to some restricting properties, surely
depending on the concrete case under consideration.


16.9 Conclusions


The establishment of a general framework for the several types of reasoning
comprised in Commonsense Reasoning seems to be of the utmost importance. At
least, it should be so for fuzzy researchers in the new field of Computing
with Words (CWW) that, with a calculus able to simulate reasoning, tries to
deal with sentences and arguments in natural language more complex than those
considered in today's fuzzy logic.

On the way toward a full development of CWW covering as many scenarios as
possible of people's ways of reasoning, it seems relevant not to forget the
large amount of nondeductive reasoning people commonly do. This chapter just
offers a wide framework to jointly consider the four modalities of deduction,
abduction, speculation, and refutation, typical of both ordinary and
specialized reasoning, where deduction and refutation are deductive
modalities of reasoning, but where abduction and speculation can be
considered its inductive modalities [16.35, 36].

There remain some open questions that concern the proposed model and, among
them, the two following ones can be posed:

• The finding of rules in Mill's style [16.37] for obtaining hypotheses and
speculations from the premises. These rules could lead to computer programs
or algorithms able, in some cases, to find either hypotheses or speculations,
and can also be useful for clarifying the concept of a heuristics.

• The study of what happens with the conjectures once new consistent
information, supplied by new premises, is added to the initial set $P$
[16.38]. It should not be forgotten that ordinary reasoning is rarely made
from a static initial set of items of information, but that the information
comes in a kind of flux under which conjectures can vary in number and in
character.


Just as speculations facilitate heuristics for finding consequences,
hypotheses, and refutations, analogical reasoning could constitute a good
trick for obtaining conjectures and refutations, on the basis of some earlier
solved similar problems. It should be pointed out that what is not yet clear
enough is how to compute the degree of reliability an analogical conclusion
could deserve.

What is not addressed in this chapter is the theoretically and practically
important problem of which is the best hypothesis or speculation to be
selected in each particular case. This question seems linked with the
translation into the conclusions of some numerical weights of confidence
previously attributed to the premises, and that depend on the case under
consideration. Because in this chapter such weights are considered neither
for premises nor for conclusions, the presented model can be qualified as
a crisp one, but not yet as a fuzzy one, since it fails to take into account
the level of reliability of the sentences represented in fuzzy terms. The
semantics of the linguistic terms is here confined, through its most careful
possible design [16.15, 16], to the contextual and purpose-driven meaning of
the involved fuzzy sets and fuzzy connectives, as well as to the possible
linguistic interpretation of the accepted conclusions, but what is not yet
taken into account is their degree of reliability.

By viewing formal theories as abstract construc- 
tions that could help us to reach a better understanding 
of a subject inscribed in some reality, this chapter rep- 
resents a formal theory of reasoning with precise and 
imprecise linguistic terms represented by fuzzy sets 
and already presented in [16.28]. It consists in a way 
of formalizing conjectures and refutations in the wide 
mathematical setting of BFAs whose axioms comprise 
the particular instances of ortholattices, De Morgan al- 
gebras, standard algebras of fuzzy sets, and, of course, 
Boolean algebras. As is shown by two of the forms 
allowing to define what weak deduction could be, de- 
pending on the kind of consistency chosen, and that, 
with crisp sets, also collapse in the classical case, this 
formalization cannot be yet considered as definitive, but 
open to further study. 

Nevertheless, this new theory should not be con- 
fused with the actual human reasoning, and only 
through some work of an experimental character on CR 
it will be possible to clarify which degree of agreement 
with the reality of reasoning either the selected way for 
defining weak deduction, or a different one, does show. 
Provided this kind of work could be done, the observa- 
tional appearance of some observed regularities in CR, 
or observed laws of the actual ordinary reasoning, and 
that can be predicted by some invariants in the model, 
is a very important topic for future research. 

Anyway, to advance toward a kind of Experimen- 
tal Science of CR, there are today no answers to some 
crucial questions like, for instance: 


• Which regularities exist in natural language and in CR that, reflected by
some invariants in the model, can be submitted to experimentation?

Fuzzy control is by far the most successful field of
applied fuzzy logic. This chapter discusses human-
inspired concepts of fuzzy control. After a short
introduction to classical control engineering, three
types of very well known fuzzy control concepts
are presented: Mamdani-Assilian, Takagi-Sugeno
and fuzzy logic-based controllers. Then three real-
world fuzzy control applications are discussed.
The chapter ends with a conclusion and a future
perspective.

17.1 Knowledge-Driven Control


Without doubt, the biggest achievement of fuzzy logic with respect to
industrial and commercial applications has been obtained by fuzzy control.
From its first practical use for a simple dynamic plant by Mamdani [17.1] to
attention-getting applications such as the automatic train operation in
Sendai, Japan [17.2], fuzzy control systems have become indispensable in
industry today (to date, more than 60000 patents using the words fuzzy and
control have been filed worldwide according to [17.3]). A wide range of
real-world applications have also been described by Hirota [17.4], Terano and
Sugeno [17.5], Precup and Hellendoorn [17.6].

Simply speaking, fuzzy control is a way of defining a nonlinear table-based
controller. Every entry in such a table can be seen as partial knowledge
about the specified input-output behavior [17.7]. However, knowledge does not
have to exist for every input-output combination. Thus, the transition
function of a fuzzy controller is typically a nonlinear interpolation between
defined regions of this knowledge. The knowledge is commonly stored as
imprecise rules consisting of imprecise terms such as small, big, cold, or
warm. Consequently, these rules lead to an imprecisely defined transition
function that is eventually defuzzified if a crisp decision is needed.

This procedure is sometimes advantageous when 
compared to classical control systems — especially for 
control problems that are usually solved intuitively by 
human beings, but not by computing machines, e.g., 
parking a car, riding the bike, boiling an egg [17.8]. 
This might be also a reason why fuzzy control did not 
originate from control engineering researchers. It had 
rather been inspired by Zadeh [17.9] who proposed 
rule-based systems for handling complex systems us- 
ing fuzzy sets — a concept which he introduced 8 years 
before [17.10]. 

The focus of this chapter is a profound discussion of such human-inspired
concepts of fuzzy control. Other fuzzy control approaches based on
a fuzzification of well-known methods of classical control theory (fuzzy PID
control, fuzzy adaptive control, stability of fuzzy systems, fuzzy sliding
mode, fuzzy observer, etc.) are only described briefly. For these topics, the
reader is referred to other textbooks, e.g., Tanaka and Wang [17.11].

But before we formally present the concepts of fuzzy control, let us give
a brief introduction to classical control engineering in Sect. 17.2. Then, in
detail, we cover fuzzy control in Sect. 17.3, including the most well-known
approaches of Mamdani and Assilian, Takagi and Sugeno, and truly
fuzzy-logic-based controllers in Sects. 17.3.1, 17.3.2, and 17.3.3,
respectively. We also talk about their advantages and limitations. We discuss
some more recent industrial applications in Sect. 17.4 and automatic learning
strategies in Sect. 17.5. Finally, we conclude our presentation of fuzzy
control in Sect. 17.6.


17.2 Classical Control Engineering


To introduce the problem of controlling a process, let us consider
a technical system for which we dictate a desired behavior [17.12]. Generally
speaking, we wish to reach a desired set value for a time-dependent output
variable of the process. This output is influenced by a variable that we can
influence, i.e., the control variable. Last but not least, to deal with
unexpected influences, let a time-dependent disturbance variable be given
that manipulates the output, too.

Then the current control value is typically specified by mainly two
components, i.e., the present measurement values of the output variable
$\xi$ and the variation of the output $\Delta\xi = \frac{d\xi}{dt}$, and by
further variables which we do not specify here. We refer to $n$ input
variables $\xi_1 \in X_1, \ldots, \xi_n \in X_n$ of the controller (e.g.,
computed from the measured output variable of the process and its desired
values) and one control variable $\eta \in Y$. Formally, the solution of
a control problem is a desired control function
$\varphi : X_1 \times \cdots \times X_n \to Y$ which sets a suitable control
value $y = \varphi(\mathbf{x})$ for every input tuple
$\mathbf{x} = (x^{(1)}, \ldots, x^{(n)}) \in X_1 \times \cdots \times X_n$.
Controllers with multiple outputs are often handled as independent
controllers with one output each.

In classical control, we can determine $\varphi$ using different techniques.
The most popular one for practical applications is the use of simple standard
controllers such as the so-called PID controller. This controller uses three
parameters to compute a weighted sum of proportional, integral, and
derivative (PID) components of the error between the output variable and
desired values to compute the control variable. In many relevant cases,
a good control performance of the closed-loop feedback system with controller
and process can be reached by simple tuning heuristics for these parameters.
This strategy is successful for many processes that can be described by
linear differential equations for the overall behavior or at least near all
relevant setpoints. More advanced controllers are used in cases with
nonlinear process behavior, time-variant process changes, or complicated
process dynamics. They require both a mathematical process model based on
a set of differential or difference equations and a fitness function to
quantify the performance. Here, many different strategies exist, starting
from a setpoint-dependent adaptation of the PID parameters, additional
feedforward components to react to setpoint changes or known disturbances,
the estimation of internal process states by observers in state-space
controllers, the online estimation of unknown process parameters in adaptive
controllers, the use of inverted process models as controllers, or robust
controllers that can handle bounded parameter changes. For all these
controllers, many elaborate design and analysis techniques exist that
guarantee an optimal behavior based on a known process model, see e.g.,
Åström and Wittenmark [17.13], Goodwin et al. [17.14].
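The weighted sum of proportional, integral, and derivative error components can be illustrated with a short discrete-time sketch. Everything specific here is an assumption made for illustration: the class name `PID`, the gains, the sampling time, and the first-order process dy/dt = -y + u used as a stand-in plant; none of it is taken from the chapter.

```python
class PID:
    """Discrete-time PID controller (illustrative sketch)."""

    def __init__(self, kp: float, ki: float, kd: float, dt: float):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0       # running integral of the error
        self.prev_error = None    # no derivative is available on the first call

    def update(self, setpoint: float, measurement: float) -> float:
        error = setpoint - measurement
        self.integral += error * self.dt
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        # Weighted sum of the P, I, and D components of the error.
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(steps: int = 2000, dt: float = 0.01, setpoint: float = 1.0) -> float:
    """Closed loop with a hypothetical first-order process dy/dt = -y + u."""
    pid = PID(kp=2.0, ki=1.0, kd=0.05, dt=dt)
    y = 0.0
    for _ in range(steps):
        u = pid.update(setpoint, y)
        y += dt * (-y + u)        # explicit Euler step of the process model
    return y


if __name__ == "__main__":
    print("output after 20 s:", simulate())
```

For this linear stand-in process the output approaches the setpoint; the gains were picked by hand, in the spirit of the simple tuning heuristics mentioned above, and are not claimed to be optimal.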

However, it might be mathematically intractable or even impossible to define
the exact differential equations for the process and the controller
$\varphi$. For such cases, classical control theory cannot be applied at all.
For instance, consider the decision process of human beings and compare it to
formal mathematical equations. Many of us have the great ability to control
diverse processes without knowing about higher mathematics at all: just think
of a preschool child riding a bike or juggling a football with its foot or
even its head.

Probably the simplest way to obtain a human control behavior for a given
process is to find out, for instance by asking direct questions, how a person
would react in a specific situation. One alternative is to observe the
process to be controlled, e.g., using sensors, and then discover substantive
information in these signals. Both approaches can be seen as knowledge-based
analysis which eventually provides us with a set of linguistic rules. We
assume that these if-then rules are able to properly control the given
process.

Let us briefly outline the operating principle of a fuzzy controller based on
such if-then rules. Each rule consists of an antecedent and a consequent
part. The former relates to an imprecise description of the crisp measured
input, whereas the latter defines a suitable fuzzy output for the input. In
order to enable a computing machine to use such linguistic rules,
mathematical terms of the linguistic expressions used in the rules need to be
properly defined. Once a control input is present, more than one rule might
(partly) fulfill the present concepts. Thus, there is a need for suitable
accumulation methods for these instantiated rules to eventually compute one
fuzzy output value. From this value, a crisp output value can be obtained if
necessary.

This knowledge-based model of a fuzzy controller is shown conceptually in Fig. 17.1. The fuzzification interface operates on the current input value $x_0$. Here too, $x_0$ might be mapped to a desired space if necessary; one may want to normalize the input to the unit interval first. The fuzzification interface eventually translates $x_0$ into a linguistic term described by a fuzzy set. The knowledge base is the head of the controller and serves as its database. Here, every essential piece of information is stored, i.e., the variable ranges, domain transformations, and the definition of the fuzzy sets with their corresponding linguistic terms. Furthermore, it comprises the rule base that is required to linguistically describe and control the process. The decision logic computes the fuzzy output value of the given measurement value by taking into account the knowledge base. Last but not least, the defuzzification interface computes a crisp output value from the fuzzy one.
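The fuzzification step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three linguistic terms, their triangular shapes, and the helper names `normalize`, `tri`, and `fuzzify` are hypothetical and not taken from the chapter.

```python
def normalize(x0, lo, hi):
    """Clamp x0 to [lo, hi] and map it linearly to the unit interval."""
    return (min(max(x0, lo), hi) - lo) / (hi - lo)


def tri(x, a, b, c):
    """Triangular membership function with support (a, c) and peak at b."""
    return max(0.0, min((x - a) / (b - a), (c - x) / (c - b)))


# Hypothetical linguistic terms on the normalized domain [0, 1].
TERMS = {
    "small": (-0.5, 0.0, 0.5),
    "medium": (0.0, 0.5, 1.0),
    "large": (0.5, 1.0, 1.5),
}


def fuzzify(x0, lo, hi):
    """Return the membership degree of x0 in every linguistic term."""
    z = normalize(x0, lo, hi)
    return {name: tri(z, *abc) for name, abc in TERMS.items()}


degrees = fuzzify(25.0, 0.0, 100.0)
```

For the input 25 on the assumed range [0, 100], the normalized value 0.25 belongs to "small" and "medium" with degree 0.5 each and to "large" with degree 0, which is exactly the situation in which more than one rule may fire.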

Two well-known and similar approaches have led to the tremendous use and success of fuzzy control. The Mamdani-Assilian and the Takagi-Sugeno approaches are motivated intuitively in Sects. 17.3.1 and 17.3.2, respectively. They have in common that their interpretation of a linguistic rule diverges from mathematical implications. Both types of controllers rather associate an input, specified in the antecedent part, with an output, given in the consequent part. A mathematically formal approach to fuzzy control, as discussed in Sect. 17.3.3, leads to completely different computations. As it turns out in practice, controllers based on any kind of logical implication are usually too restrictive to suitably control a given process.


17.3.1 Mamdani-Assilian Control 


Just one year after the publication of Zadeh [17.9], Ebrahim "Abe" Mamdani and his student Sedrak Assilian were the first to successfully control a simple process using fuzzy rules [17.1, 15]. They developed a fuzzy algorithm to control a steam engine based on human expert knowledge in an application-driven way. That is why we nowadays refer to their approach as Mamdani-Assilian control.

The expert knowledge of a Mamdani-Assilian controller needs to be expressed by linguistic rules. Therefore, for every set $X_j$ of given values for an input, we define suitable linguistic terms that summarize or partition this input by fuzzy sets. Let us consider the first set $X_1$, for which we define $p_1$ fuzzy sets $\mu_1^{(1)}, \ldots, \mu_1^{(p_1)} \in \mathcal{F}(X_1)$. Each of these fuzzy sets is mapped to one preferable linguistic term. Thus, $X_1$ is partitioned by these fuzzy sets. To ensure a better interpretability of each fuzzy set, we recommend using only unimodal membership functions. In doing so, every fuzzy set can be seen as an imprecisely defined value or interval. We furthermore urge choosing disjoint fuzzy sets for every partition, i.e., the fuzzy sets shall satisfy

$$ i \neq j \;\Longrightarrow\; \sup_{x \in X_1} \min\left\{\mu_1^{(i)}(x), \mu_1^{(j)}(x)\right\} \le 0.5 . $$

When $X_1$ is partitioned into $p_1$ fuzzy sets $\mu_1^{(1)}, \ldots, \mu_1^{(p_1)}$, we can continue to partition the remaining sets $X_2, \ldots, X_n$ and $Y$ in the same way. The eventual outcome of this procedure (i.e., the linguistic terms associated with the fuzzy sets for each variable) establishes the database in our knowledge base.

Fig. 17.1 Architecture of a fuzzy controller (blocks: knowledge base, fuzzification interface, decision logic, defuzzification interface; the controlled system delivers the measured values and receives the controller output)

The rule base of any Mamdani-Assilian controller is specified by rules of the form

$$ \text{if } \xi_1 \text{ is } A^{(1)} \text{ and } \ldots \text{ and } \xi_n \text{ is } A^{(n)} \text{ then } \eta \text{ is } B \tag{17.1} $$

where $A^{(1)}, \ldots, A^{(n)}$ and $B$ symbolize linguistic terms which correspond to the fuzzy sets $\mu^{(1)}, \ldots, \mu^{(n)}$ and $\mu$, respectively, according to the fuzzy partitions of $X_1 \times \cdots \times X_n$ and $Y$. Thus, the rule base consists of $k$ control rules

$$ R_r: \text{ if } \xi_1 \text{ is } A^{(1)}_{i_{1,r}} \text{ and } \ldots \text{ and } \xi_n \text{ is } A^{(n)}_{i_{n,r}} \text{ then } \eta \text{ is } B_{i_r}, \qquad r = 1, \ldots, k . $$

We again underline that these rules are not interpreted logically as mathematical implications. They rather specify the function $\eta = \varphi(\xi_1, \ldots, \xi_n)$ piecewise using the existing associations between the known input-output tuples, i.e.,

$$ \eta \approx \begin{cases} B_{i_1} & \text{if } \xi_1 \approx A^{(1)}_{i_{1,1}} \text{ and } \ldots \text{ and } \xi_n \approx A^{(n)}_{i_{n,1}}, \\ \;\vdots & \\ B_{i_k} & \text{if } \xi_1 \approx A^{(1)}_{i_{1,k}} \text{ and } \ldots \text{ and } \xi_n \approx A^{(n)}_{i_{n,k}} . \end{cases} $$

The control function $\varphi$ is thus composed of partial knowledge that we connect in a disjunctive manner. Thus, Mamdani-Assilian control can be regarded as knowledge-based interpolation.

Now assume that we observe a measurement $x \in X_1 \times \cdots \times X_n$. Naturally, the decision logic applies each rule $R_r$ separately to the measured input. It then computes the degree to which the input $x$ fulfills the antecedent of $R_r$, i.e., the degree of applicability

$$ \alpha_r = \min\left\{\mu^{(1)}_{i_{1,r}}(x^{(1)}), \ldots, \mu^{(n)}_{i_{n,r}}(x^{(n)})\right\} . \tag{17.2} $$

This degree of applicability literally cuts off the output fuzzy set $\mu_{i_r}$ of the rule $R_r$ at the level $\alpha_r$, which leads to the rule's output fuzzy set

$$ \mu_x^{\text{output}(R_r)}(y) = \min\left\{\alpha_r, \mu_{i_r}(y)\right\} . \tag{17.3} $$

When the decision logic has done that for all $\alpha_r$ with $r = 1, \ldots, k$, it unifies all output fuzzy sets by a t-conorm. The standard Mamdani-Assilian controller uses the maximum as t-conorm. Thus, the ultimate output fuzzy set is

$$ \mu_x^{\text{output}}(y) = \max_{r = 1, \ldots, k} \min\left\{\alpha_r, \mu_{i_r}(y)\right\} . \tag{17.4} $$

The whole process of evaluating Mamdani-Assilian rules is depicted in Fig. 17.2.

Of course, from a fuzzy set-theoretic and interpretational point of view, it suffices to keep $\mu_x^{\text{output}}(y)$ as the final output value. However, fuzzy controllers are used in real control applications where a crisp control value is definitely needed, e.g., to increase the electric current of a hotplate when boiling an egg. That is why the fuzzy control output $\mu_x^{\text{output}}$ is processed in the defuzzification interface. Depending on the implemented method of defuzzifying $\mu_x^{\text{output}}$, a real value is ultimately obtained. Three methods are used extensively in the literature, i.e., the max criterion method, the mean of maxima (MOM) method, and the center of gravity (COG) method.

The max criterion method simply returns an arbitrary value $y \in Y$ for which $\mu_x^{\text{output}}(y)$ attains its maximum membership degree. However, this arbitrary value picked at random typically results in a nondeterministic control behavior. That is usually undesired as the interpretability and repeatability of already produced outcomes get lost. The MOM method returns the mean value of the set of elements $y \in Y$ that have maximal membership degrees in the output fuzzy set. Using this approach, it might happen that the defuzzified control value $\eta$ does not even belong to the points with maximal membership degree. Just consider, for instance, a bimodal normal fuzzy set as a fuzzy output. That is why MOM may lead to control actions that one would not expect. Finally, the COG method returns the value at the center of gravity of the area under the fuzzy output $\mu_x^{\text{output}}$, i.e.,

$$ \eta = \frac{\int_{y \in Y} \mu_x^{\text{output}}(y)\, y \,\mathrm{d}y}{\int_{y \in Y} \mu_x^{\text{output}}(y) \,\mathrm{d}y} . \tag{17.5} $$

Usually, the COG method is taken to defuzzify the fuzzy output as it typically leads to a smooth control function. Nevertheless, it is possible to obtain an unreasonable output, too. We refer the gentle reader to Kruse et al. [17.16] for a richer treatment of defuzzification methods.
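The evaluation scheme of (17.2) to (17.5) can be sketched numerically as follows. The fuzzy partitions, the two rules, and the sampling of the output domain are hypothetical choices made only for this illustration; the integrals of the COG method are approximated by sums over a grid.

```python
import numpy as np


def tri(y, a, b, c):
    """Triangular membership function with support [a, c] and peak at b."""
    y = np.asarray(y, dtype=float)
    return np.maximum(0.0, np.minimum((y - a) / (b - a), (c - y) / (c - b)))


# Hypothetical fuzzy partitions (not taken from the chapter).
IN_SETS = {"low": (-1.0, 0.0, 1.0), "high": (0.0, 1.0, 2.0)}
OUT_SETS = {"small": (-5.0, 0.0, 5.0), "large": (5.0, 10.0, 15.0)}
# Each rule: (term for xi_1, term for xi_2, term for eta).
RULES = [("low", "low", "small"), ("high", "high", "large")]

ys = np.linspace(-5.0, 15.0, 2001)  # sampled output domain Y


def mamdani_output(x1, x2):
    """Sampled output fuzzy set according to (17.2)-(17.4)."""
    result = np.zeros_like(ys)
    for t1, t2, t_out in RULES:
        # (17.2): degree of applicability via the minimum.
        alpha = min(float(tri(x1, *IN_SETS[t1])), float(tri(x2, *IN_SETS[t2])))
        # (17.3): cut the output fuzzy set at level alpha.
        clipped = np.minimum(alpha, tri(ys, *OUT_SETS[t_out]))
        # (17.4): combine all rule outputs with the maximum.
        result = np.maximum(result, clipped)
    return result


def defuzz_cog(mu):
    """Center of gravity (17.5), approximated on the grid ys."""
    total = np.sum(mu)
    if total == 0.0:
        raise ValueError("empty output fuzzy set: no rule fires")
    return float(np.sum(mu * ys) / total)


def defuzz_mom(mu):
    """Mean of maxima: mean of all grid points with maximal membership."""
    return float(np.mean(ys[np.isclose(mu, mu.max())]))


mu_out = mamdani_output(0.25, 0.5)
eta_cog = defuzz_cog(mu_out)
eta_mom = defuzz_mom(mu_out)
```

In this example the first rule fires with degree 0.5 and the second with degree 0.25. MOM only looks at the plateau of the first rule and returns a value near 0, whereas COG also accounts for the weaker second rule and returns roughly 3.68.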

Coming back to the theoretical background of Mamdani-Assilian control, let us analyze its type of linguistic rules. Having a look at (17.3) again, we see that the minimum is used to serve as fuzzy implication. But the minimum does not reproduce all truth value combinations of the implication in propositional logic. To see this, let us consider $p \to q$ and assume that $p$ is false. Then $p \to q$ is true, regardless of the truth value of $q$. The minimum of 0 and $q$, however, is 0. This logical flaw could be seen as an inconsistency of the standard Mamdani-Assilian controller. On the contrary, it actually turns it into a very powerful technique. Since it was created to solve a simple practical problem, we might speak of a heuristic method instead. When we do not see Mamdani-Assilian rules as logical implications but rather as associations [17.17], then the controller is even theoretically sound: Every rule $R_r$ associates an output fuzzy set $B_{i_r}$ with $n$ input fuzzy sets $A^{(j)}_{i_{j,r}}$ for $j = 1, \ldots, n$. Consequently, we must use a fuzzy conjunction, e.g., the minimum t-norm.
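The mismatch between the minimum and the classical implication can be checked directly on the crisp truth values. The following short sketch (function and variable names are chosen only for this illustration) lists all combinations where the two operators disagree.

```python
from itertools import product


def implies(p, q):
    """Classical material implication on the truth values 0 and 1."""
    return int((not p) or q)


# Rows: (p, q, p -> q, min(p, q)).
table = [(p, q, implies(p, q), min(p, q)) for p, q in product((0, 1), repeat=2)]
mismatches = [(p, q) for p, q, imp, conj in table if imp != conj]
```

The two operators disagree exactly when $p$ is false, which is the case discussed above; on those rows the minimum behaves like a conjunction, not like an implication.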

Mamdani and Assilian's heuristics can be obtained by the extension principle, too [17.18, 19]. If the fuzzy relation $R$ that relates the inputs $x^{(j)}$ and the output $y$ satisfies a couple of extensionality properties, then Mamdani and Assilian's approach can also be obtained. To this end, let $E$ and $E'$ be two similarity relations defined on the domains $X$ and $Y$ of $x$ and $y$, respectively. The extensionality of $R$ on $X \times Y$ indicates that

$$ \begin{aligned} \forall x \in X: \forall y, y' \in Y: &\quad R(x, y) \otimes E'(y, y') \le R(x, y'), \\ \forall x, x' \in X: \forall y \in Y: &\quad R(x, y) \otimes E(x, x') \le R(x', y). \end{aligned} \tag{17.6} $$

Thus, if $(x, y) \in R$, then $x$ is also related to the neighborhood of $y$. The same holds for $y$ in relation to $x$. Then $A_r^{(j)}(x) = E_j(x, a_r^{(j)})$ and $B_r(y) = E'(y, b_r)$ can be regarded as fuzzy sets of values in the proximity of $a_r^{(j)}$ and $b_r$, respectively. Hence $R(a_r^{(1)}, \ldots, a_r^{(n)}, b_r) = 1$ for all $r = 1, \ldots, k$. Applying this type of control definition to real-world problems, a practitioner must specify sensible similarity relations $E_j$ and $E'$ for each input $\xi_j$ and the output $\eta$, respectively. Eventually, using the extension principle for $R$, we obtain

$$ R(x^{(1)}, \ldots, x^{(n)}, y) \ge \max_{r = 1, \ldots, k} \otimes\left(A_r^{(1)}(x^{(1)}), \ldots, A_r^{(n)}(x^{(n)}), B_r(y)\right) . $$

In addition, if we use the minimum t-norm for $\otimes$, then we get exactly the approach of Mamdani-Assilian. Boixader and Jacas [17.20] and Klawonn and Castro [17.21] show that indistinguishability or similarity is the connection between the extensionality property and fuzzy equivalence relations.

Fig. 17.2 The Mamdani-Assilian rule evaluation. Here we assume that the control process is described by two input variables $\theta$, $\dot{\theta}$ and one output $F$. Let the controller be specified by two Mamdani-Assilian rules. Here the input tuple $(25, -4)$ leads to the fuzzy output shown on the lower right side. If a crisp output is needed, any defuzzification method can eventually be applied

In practical applications, different t-norms and t-conorms, such as the product instead of the minimum and the bounded sum instead of the maximum, play an important role. The reason is a stepwise multilinear interpolation behavior between the rules, which causes a smoother function $y = \varphi(x)$ compared to the minimum and maximum operators.

Depending on the input and output variables, Mamdani-Assilian controllers can intentionally or unintentionally copy concepts from classical control theory. As an example, a controller whose inputs are the proportional, integral, and derivative errors between the output variable and the desired values, in combination with a symmetrical rule base, behaves similarly to a PID controller. Such a reinvention of established concepts with a more complicated implementation should be avoided, though.


17.3.2 Takagi-Sugeno Control 


Partitioning both the input and output domains seems to be reasonable from an interpretational point of view. Nevertheless, one might face control applications where a sufficient approximation quality can only be achieved using many linguistic terms for each dimension. This, in turn, will increase the potential number of rules, which most probably worsens the ability to interpret the rule base of the fuzzy controller. For such control processes, we recommend neglecting the concept of partitioning the output domain and instead defining functions that locally approximate the control behavior.

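A controller that uses rules like this is called a Takagi-Sugeno controller [17.22]. The Takagi-Sugeno rules $R_r$ for $r = 1, \ldots, k$ are typically defined as

$$ R_r: \text{ if } \xi_1 \text{ is } A^{(1)}_{i_{1,r}} \text{ and } \ldots \text{ and } \xi_n \text{ is } A^{(n)}_{i_{n,r}} \text{ then } \eta_r = f_r(\xi_1, \ldots, \xi_n) . $$

Most commonly, linear functions $f_r$ can be found in many controllers, i.e.,

$$ f_r(x) = a_r^{(0)} + \sum_{i=1}^{n} a_r^{(i)} x^{(i)} . $$

The rules of a Takagi-Sugeno controller share the same antecedent parts with the Mamdani-Assilian controller, so the decision logic computes the same degree of applicability $\alpha_r$, i.e., using (17.2). An example of such a controller for two inputs is shown in Fig. 17.3. Eventually, all degrees are used to compute the crisp control value

$$ \eta(x) = \frac{\sum_{r=1}^{k} \alpha_r \cdot f_r(x)}{\sum_{r=1}^{k} \alpha_r} , $$

which is the weighted sum over all rule outputs. Obviously, a Takagi-Sugeno controller does not need any defuzzification method.

Rules shown in Fig. 17.3 (the antecedent fuzzy sets are given graphically in the figure; the recoverable axis ticks are 3, 9, 11, 18 for $\xi_1$ and 4, 13 for $\xi_2$):

$R_1$: if $\xi_1$ is (fuzzy set) then $\eta_1 = 1 \cdot \xi_1 + 0.5 \cdot \xi_2 + 1$
$R_2$: if $\xi_1$ is (fuzzy set) and $\xi_2$ is (fuzzy set) then $\eta_2 = -0.1 \cdot \xi_1 + 4 \cdot \xi_2 + 1.2$
$R_3$: if $\xi_1$ is (fuzzy set) and $\xi_2$ is (fuzzy set) then $\eta_3 = 0.9 \cdot \xi_1 + 0.7 \cdot \xi_2 + 9$
$R_4$: if $\xi_1$ is (fuzzy set) and $\xi_2$ is (fuzzy set) then $\eta_4 = 0.2 \cdot \xi_1 + 0.1 \cdot \xi_2 + 0.2$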
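A minimal sketch of the Takagi-Sugeno computation is given below. It reuses the consequents of rules $R_1$ and $R_4$ from Fig. 17.3, but the trapezoidal antecedent fuzzy sets (`small`, `large`, `mid`) are hypothetical, since their exact shapes are only given graphically. A missing clause is modeled by a membership function that is constantly 1.

```python
def trap(x, a, b, c, d):
    """Trapezoidal membership: 0 outside (a, d), 1 on [b, c], linear between."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x > c:
        return (d - x) / (d - c)
    return 1.0


def always(x):
    """Membership function of a missing clause: constantly 1."""
    return 1.0


# Hypothetical antecedent fuzzy sets.
def small(x):
    return trap(x, 2.0, 3.0, 9.0, 11.0)


def large(x):
    return trap(x, 9.0, 11.0, 18.0, 19.0)


def mid(x):
    return trap(x, 3.0, 4.0, 13.0, 14.0)


# Each rule: (membership for xi_1, membership for xi_2, (a0, a1, a2)).
RULES = [
    (small, always, (1.0, 1.0, 0.5)),  # eta_1 = 1*xi_1 + 0.5*xi_2 + 1
    (large, mid, (0.2, 0.2, 0.1)),     # eta_4 = 0.2*xi_1 + 0.1*xi_2 + 0.2
]


def ts_control(x1, x2, rules=RULES):
    """Crisp output: applicability-weighted mean of the linear rule outputs."""
    num = den = 0.0
    for mu1, mu2, (a0, a1, a2) in rules:
        alpha = min(mu1(x1), mu2(x2))  # degree of applicability, as in (17.2)
        num += alpha * (a0 + a1 * x1 + a2 * x2)
        den += alpha
    if den == 0.0:
        raise ValueError("no rule is applicable to this input")
    return num / den


eta = ts_control(10.0, 5.0)
```

At the input (10, 5) both rules fire with degree 0.5, so the result is the plain average of the two local linear models, 13.5 and 2.7.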

Takagi-Sugeno controllers are not only popular for translating human strategies into formal descriptions, but they can also be combined with many well-known methods from classical control theory. For instance, many locally valid linear models of the process can be aggregated into one nonlinear model in the form of Takagi-Sugeno rules. In a next step, the desired behavior of the resulting closed-loop system is formulated. Stability can be defined in the strictest form as guaranteed convergence to the desired setpoint from any initial state, including robustness against bounded parameter uncertainties of the process model. Mathematical methods such as Lyapunov functions in the form of linear matrix equations are then applied to design the Takagi-Sugeno fuzzy controller. For details about these concepts, we refer to Feng [17.23]. The main advantage of this strategy is the guaranteed performance, whereas several disadvantages come into play, too. This approach requires a model and sophisticated mathematical methods. It also usually leads to a limited performance due to conservative design results and to limited interpretability. That is also why many recent papers propose iterative improvements, e.g., by proposing ways to handle other types of uncertainties such as varying time delays [17.24], and by reducing the conservativeness of the solutions [17.25].


Fig. 17.3 A Takagi-Sugeno controller for two inputs and one output described by four rules. If a certain clause "$x_j$ is $A^{(j)}_{i_{j,r}}$" is missing in any rule $R_r$, then the corresponding membership function is $\mu_{i_{j,r}}(x_j) \equiv 1$ for all linguistic values $i_{j,r}$. In this example, consider for instance $x_2$ in rule $R_1$. Thus, $\mu_{i_{2,1}}(x_2) \equiv 1$ for all $i_{2,1}$

17.3.3 Fuzzy Logic-Based Controller


Both controllers that have been introduced so far interpret every linguistic rule as an association of an n-dimensional fuzzy input point with a fuzzy output. Thus, we can interpret the set of fuzzy rules as setpoints of the control system. Recall, however, that this has nothing to do with logical inference since not all rules need to be activated by a given input.

When all rules are evaluated in a conjunctive manner, we can regard each fuzzy rule as a fuzzy constraint on a fuzzy input-output relation. The inference operation of such a controller is identical to approximate reasoning. Note that classical reasoning uses inference rules (so-called tautologies) to deductively infer crisp conclusions from crisp propositions. Approximate reasoning generalizes classical reasoning to fuzzy propositions. Zadeh [17.9] proposed the first approaches to handle fuzzy sets in approximate reasoning. The gentle reader is referred to Zadeh [17.26, 27] for further details. The basic idea is to represent incomplete knowledge as possibility distributions.

Possibility theory [17.28] has been proposed to study and model imperfect descriptions of an existing element $x_0$ in a set $A \subseteq X$. It can be seen as a counterpart to probability theory. To formally define a possibility measure $\Pi : 2^X \to [0, 1]$ we need the following axioms, which resemble the well-known Kolmogorov axioms:

$$ \begin{aligned} &\Pi(\emptyset) = 0, \\ &\Pi(A) \le \Pi(B) \text{ if } A \subseteq B, \text{ and} \\ &\Pi(A \cup B) = \max\{\Pi(A), \Pi(B)\} \text{ for all } A, B \subseteq X. \end{aligned} $$

The expression $\Pi(A) = 1$ means that $x_0 \in A$ is unconditionally possible. If $\Pi(A) = 0$, then it is impossible that $x_0 \in A$. Zadeh [17.29] models uncertainty about $x_0$ by the possibility measure $\Pi : 2^X \to [0, 1]$, $\Pi(A) = \sup\{\mu(x) \mid x \in A\}$, when a fuzzy set $\mu : X \to [0, 1]$ is the only known description of $x_0$. Then the possibility measure is given by the possibility degrees of singletons, i.e., $\Pi(\{x\}) = \mu(x)$.
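On a finite domain, the possibility measure induced by a fuzzy set can be computed directly, and the axioms can be checked on examples. The domain and the membership degrees below are hypothetical values chosen for illustration.

```python
def possibility(mu, subset):
    """Pi(A) = sup of mu(x) over x in A; the supremum over the empty set is 0."""
    return max((mu[x] for x in subset), default=0.0)


# Hypothetical fuzzy set on the finite domain {"a", "b", "c"}.
mu = {"a": 0.2, "b": 1.0, "c": 0.6}

A = {"a", "c"}
B = {"c"}
C = {"b"}
```

Here `possibility(mu, A)` is 0.6 and `possibility(mu, A | C)` equals the larger of the two individual possibilities, in line with the third axiom.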

Now, consider only one-dimensional input and output spaces. Then we must specify a suitable two-dimensional possibility distribution. Let the rule

$$ R: \text{ if } \xi \text{ is } A \text{ then } \eta \text{ is } B $$

associate the input fuzzy set $\mu_A$ with the output fuzzy set $\mu_B$. We can express this rule by a possibility distribution

$$ \pi_{X,Y}(x, y) = I(\mu_A(x), \mu_B(y)) $$

where $I$ is an implication of any multivalued logic. So, we can compute the output by the composition of the input and the rule base, i.e., $\mu_B = \mu_A \circ \pi_{X,Y}$, where the fuzzy rules are expressed by the fuzzy relation $\pi_{X,Y}$ defined on $X \times Y$. The composition of a fuzzy set $\mu$ with a fuzzy relation $\pi$ is defined by

$$ \mu \circ \pi : Y \to [0, 1], \quad y \mapsto \sup_{x \in X} \min\{\mu(x), \pi(x, y)\} . $$

We can easily see that this is a fuzzification of the standard composition $\circ$ of two crisp sets $M \subseteq X$ and $R \subseteq X \times Y$, i.e.,

$$ M \circ R \overset{\text{def}}{=} \{y \in Y \mid \exists x \in X : (x \in M \wedge (x, y) \in R)\} \subseteq Y . $$
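For finite domains, the sup-min composition reduces to a maximum over minima and can be written in one line with NumPy. The membership vector and the relation matrix below are hypothetical; rows of `pi` index $x$ and columns index $y$.

```python
import numpy as np


def compose(mu, pi):
    """(mu o pi)(y) = max over x of min(mu(x), pi(x, y)) on finite domains."""
    return np.max(np.minimum(mu[:, None], pi), axis=0)


# Hypothetical fuzzy input and fuzzy relation (3 input points, 2 output points).
mu_A = np.array([1.0, 0.5, 0.0])
pi = np.array([[0.2, 1.0],
               [0.9, 0.4],
               [1.0, 1.0]])
mu_B = compose(mu_A, pi)

# Crisp special case: M = {x_0}, R = {(x_0, y_1), (x_1, y_0), (x_2, y_0), (x_2, y_1)}.
crisp_M = np.array([1.0, 0.0, 0.0])
crisp_R = np.array([[0.0, 1.0],
                    [1.0, 0.0],
                    [1.0, 1.0]])
crisp_out = compose(crisp_M, crisp_R)
```

With crisp inputs (membership degrees 0 and 1 only) the function returns the indicator vector of the classical composition $M \circ R$, which illustrates that the fuzzy definition extends the crisp one.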


The challenge in fuzzy control applications using relational equations is to search for a fuzzy relation $\pi$ that satisfies all equations $\mu_{B_r} = \mu_{A_r} \circ \pi$ for every rule $R_r$ with $r = 1, \ldots, k$. If multiple inputs $X_1, \ldots, X_n$ are used, then $\mu_A$ is defined on the product space $X = X_1 \times \cdots \times X_n$ as in (17.2). The fuzzy relation $\pi$ can be found by determining the Gödel relation for every given relational equation, i.e.,

$$ (x, y) \in \pi^{\mathrm{G}} \;\Longleftrightarrow\; (x \in \mu_A \to y \in \mu_B) . $$

Here the implication arrow $\to$ represents the Gödel implication

$$ a \to b = \begin{cases} 1 & \text{if } a \le b, \\ b & \text{if } a > b. \end{cases} $$

So, a linguistic rule actually expresses a gradual rule of the form "the more $\mu_A$, the more $\mu_B$." Hence it constrains the fuzzy relation $\pi$ by the inequality

$$ \min(\mu_A(x), \pi(x, y)) \le \mu_B(y) $$

for all $(x, y) \in X \times Y$. The Gödel implication is theoretically not the only way to represent $\pi$. Dubois and Prade [17.30, 31], however, give a variety of good reasons not to take any implication other than the Gödel implication.

If the system of relational equations $\mu_{B_r} = \mu_{A_r} \circ \pi$ for $r = 1, \ldots, k$ is solvable, then the intersection of all rules' Gödel relations

$$ \pi^{\mathrm{G}}(x, y) = \bigcap_{r=1}^{k} \left(\mu_{A_r}(x) \to \mu_{B_r}(y)\right) $$

is a solution, with $\cap$ being the minimum t-norm. Due to the mathematical properties of the Gödel implication, the Gödel relation $\pi^{\mathrm{G}}$ is the greatest solution in terms of elementwise membership degrees.

To conclude this type of controller, we recall that the relation

$$ \Pi(\{(x, y)\}) = \pi(x, y) $$

expresses the degree to which it is possible to assign the output value $y$ to the input tuple $x$. Moreover, the overall conjunctive nature of the rules softly constrains the control function $\varphi$. It might thus happen in practical applications that these constraints lead to contradictions if very narrow output fuzzy sets are assigned to overlapping input fuzzy sets. In such a case, the controller's output will be the empty fuzzy set, which corresponds to no solution. One way to overcome this problem is to specify both narrow input fuzzy sets and broader output fuzzy sets. This procedure, however, limits the expressiveness and thus the applicability of fuzzy logic-based controllers.
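The construction of the Gödel relation and the contradiction effect can be reproduced on small finite domains. The two rule bases below are hypothetical: in the first one the fuzzy sets are compatible, in the second one two crisp, overlapping inputs are mapped to disjoint, narrow outputs.

```python
import numpy as np


def goedel(a, b):
    """Goedel implication, elementwise: 1 if a <= b, else b."""
    return np.where(a <= b, 1.0, b)


def goedel_relation(rules):
    """Intersection (minimum) of the Goedel relations of all rules."""
    rels = [goedel(mu_a[:, None], mu_b[None, :]) for mu_a, mu_b in rules]
    return np.minimum.reduce(rels)


def compose(mu, pi):
    """Sup-min composition on finite domains."""
    return np.max(np.minimum(mu[:, None], pi), axis=0)


# Hypothetical compatible rule base on 3 input and 3 output points.
rules_ok = [
    (np.array([1.0, 0.5, 0.0]), np.array([1.0, 0.5, 0.0])),
    (np.array([0.0, 0.5, 1.0]), np.array([0.0, 0.5, 1.0])),
]
pi_ok = goedel_relation(rules_ok)
solves_all = all(np.allclose(compose(a, pi_ok), b) for a, b in rules_ok)

# Hypothetical conflicting rule base: inputs overlap at x_1, outputs are disjoint.
rules_conflict = [
    (np.array([1.0, 1.0, 0.0]), np.array([1.0, 0.0, 0.0])),
    (np.array([0.0, 1.0, 1.0]), np.array([0.0, 0.0, 1.0])),
]
pi_conflict = goedel_relation(rules_conflict)
out_at_overlap = compose(np.array([0.0, 1.0, 0.0]), pi_conflict)
```

For the compatible rule base, the intersected Gödel relation reproduces every rule output. For the conflicting one, the crisp input located in the overlap of both antecedents yields an output that is zero everywhere, i.e., the empty fuzzy set mentioned above.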


17.4 A Glance at Some Industrial Applications 


Shortly after the big success stories of the 1980s, mainly in Japan [17.2], many real-world control applications were successfully solved using the Mamdani-Assilian approach all around the world. The research group of this chapter's third author likewise initiated the development of some automobile controllers.

We want to discuss two of these control processes that have been developed with Volkswagen AG (VW), i.e., the engine idle speed control [17.18] and the shift-point determination of an automatic transmission [17.32]. Both of these very successful Mamdani-Assilian controllers can still be found in VW automobiles today. The idle speed controller is based on similarity relations, which makes it possible to interpret the control function as the interpolation of a point-wise imprecisely known function. The shift-point determination continuously adapts the gearshift schedule between two extremes, i.e., economic and sporting. This controller determines a so-called sport factor and individually adapts the gearshift behavior to the driver.

Fig. 17.4 Principle of the engine idle speed control (labels: from the air cleaner, air bypass to the throttle, auxiliary air regulator, air flow sensor)

Fig. 17.5 Structure of the fuzzy controller (label: pilot value for air conditioning system)


17.4.1 Engine Idle Speed Control 


This controller shall adjust the idle speed of a spark ignition engine. Usually, a volumetric control is used to control the spark ignition engine. The principle is shown in Fig. 17.4. Here an auxiliary air regulator varies the cross-section of a bypass to the throttle.

The controller's task is to adjust the auxiliary air regulator's pulse width. In the case of a rapid fall of the number of revolutions, the controller shall drive the auxiliary air regulator to broaden the bypass cross-section. This increase of the air flow rate is measured by an air flow sensor, which serves as controller signal. Then a new amount for the fuel injection has to be determined, and with a higher air flow rate the engine yields more torque. This, in turn, leads to a higher number of revolutions, which could be decreased correspondingly by narrowing the bypass cross-section.

The ultimate goal is to reduce both fuel consumption and pollutant emissions. It is straightforward to achieve this goal by slowing down the idle speed. On the other hand, some automobile facilities, e.g., the air-conditioning system, are very often switched on and off, which forces the number of revolutions to drop. So a very flexible controller is needed to adjust this process properly. Schröder et al. [17.32] point out further problems of this control application.

As it turned out, the engineers who defined the similarity relations to model indistinguishability/similarity of two control states did not experience any big difficulties. Remember that the control expert must define a set of $k$ input-output tuples $((x_r^{(1)}, \ldots, x_r^{(n)}), y_r)$. So, for each $r = 1, \ldots, k$ the output value $y_r$ seems appropriate for the input $(x_r^{(1)}, \ldots, x_r^{(n)})$. In this way the control expert specifies a partial control function $\varphi_0$. According to (17.6), we directly obtain a Mamdani-Assilian controller by determining the extensional hull of $\varphi_0$ given the similarity relations. We thus obtain the rules from the partial control function $\varphi_0$ as

$$ R_r: \text{ if } \xi_1 \text{ is approximately } x_r^{(1)} \text{ and } \ldots \text{ and } \xi_n \text{ is approximately } x_r^{(n)} \text{ then } \eta \text{ is approximately } y_r . $$

Klawonn et al. [17.18] explain the theory of this approach in more detail.
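To make the idea of "approximately $x_r$" concrete, the following sketch uses a similarity relation derived from a scaled distance, $E(u, v) = 1 - \min(|u - v| / s, 1)$. This particular form, the scaling constants, and the three points of the partial control function are assumptions made for illustration only; they are not the relations or values used in the VW controller.

```python
def make_similarity(scale):
    """Return E(u, v) = 1 - min(|u - v| / scale, 1); 'scale' is a hypothetical constant."""
    def similarity(u, v):
        return 1.0 - min(abs(u - v) / scale, 1.0)
    return similarity


E_in = make_similarity(20.0)   # similarity on the (one-dimensional) input domain
E_out = make_similarity(10.0)  # similarity on the output domain

# Hypothetical partial control function phi_0 as (input, output) points.
PHI_0 = [(-40.0, 15.0), (0.0, 0.0), (40.0, -15.0)]


def output_membership(x, y):
    """Extensional hull with the minimum t-norm: max_r min(E(x, x_r), E'(y, y_r))."""
    return max(min(E_in(x, x_r), E_out(y, y_r)) for x_r, y_r in PHI_0)
```

Each point of `PHI_0` acts as one rule "if the input is approximately $x_r$ then the output is approximately $y_r$"; the fuzzy sets are the triangular functions $E(\cdot, x_r)$ and $E'(\cdot, y_r)$, and the resulting expression has the same max-min form as (17.4).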

Eventually, only two input variables are needed for the engine idle speed controller, i.e.:

1. The deviation dREV [rpm] of the number of revolutions to the set value.

2. The gradient gREV [rpm] of the number of revolutions between two ignitions.

There exists just one output variable, which is the change of current dAARCUR for the auxiliary air regulator. The final controller is shown in Fig. 17.5.

Fig. 17.6 Performance characteristics (plotted quantity: dAARCUR)

Fig. 17.7 Deviation dREV of the number of revolutions

The control rules of the engine idle speed controller have been found from idle speed experiments. The partial control function $\varphi_0 : X_{(\text{dREV})} \times X_{(\text{gREV})} \to Y_{(\text{dAARCUR})}$ is depicted in the upper half of Tab. 17.1.

The fuzzy controller has been defined by a similarity relation and the partial control mapping $\varphi_0$. With the center of area (COA) method, it yields a control surface as shown in Fig. 17.6. The function values here are evaluated on a grid of equally sampled input points. The respective Mamdani-Assilian controller has been found by relating each point of $\varphi_0$ to a linguistic term, e.g., negative big (nb), negative medium (nm), negative small (ns), and approximately zero (az). The resulting fuzzy partitions of dREV, gREV, and dAARCUR are displayed in Figs. 17.7, 17.8, and 17.9, respectively. So, we obtain linguistic rules from $\varphi_0$ like

if dREV is A and gREV is B then dAARCUR is C.

The complete set of rules is given in the lower part of Tab. 17.1.

Klawonn et al. [17.18] and Schröder et al. [17.32] show that this Mamdani-Assilian controller leads to a very smooth and thus better control behavior when compared to classical controllers. Moreover, for this application, it has been much simpler to define a fuzzy controller than one based on higher mathematics. They also found that this fuzzy controller reaches the desired setpoint precisely and fast. Last but not least, even increasing the load slowly does not change the control behavior significantly. So it is nearly impossible to experience any vibration, even after drastic load changes.

Fig. 17.8 Gradient gREV of the number of revolutions

Fig. 17.9 Change of current dAARCUR for the auxiliary air regulator

Fig. 17.10 Flowing shift-point determination with fuzzy logic (blocks: classification of driver/driving situation by fuzzy logic with fuzzification, inference machine, and defuzzification; gearshift computation with interpolation, determination of speed limits for shifting into a higher or lower gear depending on the sport factor, and gear selection; inputs: accelerator pedal, filtered speed of accelerator pedal, number of changes in pedal direction, sport factor (t-1); output: sport factor (t))


17.4.2 Flowing Shift-Point Determination 


Conventional automatic transmissions select gears 
based on the so-called gearshift diagrams. Here, the 
gearshift simply depends on the accelerator position 


Table 17.1 The partial control mapping $\varphi_0$ (upper table) and its corresponding fuzzy rule base (lower table)


gREV 


0) a ON a LO 3 6 40 
=W 15 15 10 10 5 5 
—50 20 15 10 LO FS 5) 0 
=0 | 15 w ES 5 3 0 0 
dREV 0 5 5 0 0 0 =O] = 
300 E0 0 0 = [=> | 10] —5 
W | © =) |= | H10] =15 | =15 | —20 
70 =) | =S | =10] =| =15] =15 | =15 


ns pb pm ps ps az az 
dREV az ps ps az az az nm ns 
ps az az az ns ns nm nb 
pai faz ns ns nm nb nb nh 
pb ns ns nm nb nb nb nb 




and the velocity. A lag between up- and downshifts avoids oscillating gearshifts when the velocity varies slightly, e.g., during stop-and-go traffic. For instance, if the driver accelerates with half throttle, the gearshift will start with the first gear. For a standardized behavior, a fixed diagram works well. Until 1994, the VW gear box had two different types of gearshift diagrams, i.e., economic (ECO) and sporting (SPORT). An economic gearshift diagram switches gears at a low number of revolutions to reduce the fuel consumption. A sporting one leads to gearshifts at a higher number of revolutions. Since 1991, VW had been investigating an individual adaptation of the shift-points. No additional sensors were to be used to observe the driver.

The idea was that the car observes the driver [17.32] and classifies him or her as calm, normal, sportive (assigning a sport factor $\in [0, 1]$), or nervous (in order to calm down the driver). A test car from VW was operated by many different drivers. These people were classified by a human expert (a passenger). Simultaneously, 14 attributes were continuously measured during the test drives. Among them were variables such as the velocity of the car, the position of the accelerator pedal, the speed of the accelerator pedal, the kick down, and the steering wheel angle.

The final Mamdani controller was based on four input variables and one output. The basic structure of the controller is shown in Fig. 17.10. In total, seven rules could be identified in which the antecedent consists of up to four clauses. The program was highly optimized: It used 24 bytes of RAM and 702 bytes of ROM, i.e., less than 1 KB. The runtime was 80 ms, which means that a new sport factor was assigned about 12 times per second. The controller has been in series production since January 1995. It shows an excellent performance.




17.5 Automatic Learning of Fuzzy Controllers 


The automatic generation of linguistic rules plays an important role in many applications, e.g., classification [17.33-36], regression [17.37-39], and image processing [17.40, 41]. Since fuzzy controllers are based on linguistic rules, automatic ways to tune and learn them have been developed for control applications as well [17.18, 19, 42-44].

How can a computer learn fuzzy rules from data to 
explain or support decisions like people do? We think 
that the fuzzy analysis of data can answer this ques- 
tion sufficiently [17.45]. The easiest and most common 
way is to use fuzzy clustering which automatically de- 
termines fuzzy sets from data. 

Before we talk about the generation of linguistic rules from fuzzy clustering, however, let us briefly list some of the very diverse methods of fuzzy data analysis. Grid-based approaches define fixed fuzzy partitions for every variable. Every cell in that multidimensional grid may correspond to one rule [17.39]. The best known are hybrid methods to induce fuzzy rules. Here, a fuzzy system is combined with computational intelligence techniques. For instance, evolutionary algorithms are used to guide the search in the space of possible rule bases [17.46] or to fuzzify and thus summarize a crisp set of rules [17.47]. Neuro-fuzzy systems use learning methods of artificial neural networks (e.g., backpropagation) to tune the parameters of a network that can be directly understood as a fuzzy system [17.48]. Standard rule generation methods have been fuzzified as well (e.g., separate-and-conquer rule learning [17.49], decision trees [17.50], and support vector machines [17.51, 52]).

To learn fuzzy rules from data by fuzzy clustering,
we refer only to the standard fuzzy c-means algorithm
(FCM) [17.53,54]. Consider the input space
$X \subseteq \mathbb{R}^n$ and the output space $Y \subseteq \mathbb{R}$. We observe $m$ patterns
$(x_j, y_j) \in S \subseteq X \times Y$ where $j = 1, \dots, m$. Running
FCM on that dataset $S$ leads to $c$ cluster prototypes

$$\mathbf{c}_i = \left(c_i^{(1)}, \dots, c_i^{(n)}, c_i^{(y)}\right)$$

with $i = 1, \dots, c$ that can be seen as concatenation of
both the input values $c_i^{(k)}$, $k = 1, \dots, n$, and the output
value $c_i^{(y)}$. Thus, every prototype represents one linguistic
rule


$$R_i : \text{if } x \text{ is close to } \left(c_i^{(1)}, \dots, c_i^{(n)}\right) \text{ then } y \text{ is close to } c_i^{(y)}.$$


Using the membership degrees $u$, we can rewrite these
rules as

$$R_i : \text{if } u_i^{(x)}(x) \text{ then } u_i^{(y)}(y). \qquad (17.7)$$


The only problem is that FCM returns the membership
degrees $u_i(x, y)$ of the product space $X \times Y$. To obtain
rules like (17.7), we must project $u_i$ onto $u_i^{(x)}$ and $u_i^{(y)}$. If
$x$ and $y$ are restricted to $[x_{\min}, x_{\max}]$ and $[y_{\min}, y_{\max}]$,
respectively, the projections are given by

$$u_i^{(x)}(x) = \sup_{y \in [y_{\min}, y_{\max}]} u_i(x, y), \qquad u_i^{(y)}(y) = \sup_{x \in [x_{\min}, x_{\max}]} u_i(x, y).$$


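We can also project $u_i^{(x)}$ onto each single input variable
$x^{(1)}, \dots, x^{(n)}$ by

$$u_i^{(k)}\left(x^{(k)}\right) = \sup_{\bar{x}^{(k)} \in \left[\bar{x}^{(k)}_{\min}, \bar{x}^{(k)}_{\max}\right]} u_i^{(x)}(x)$$

for $k = 1, \dots, n$ where $\bar{x}^{(k)} \overset{\text{def}}{=} \left(x^{(1)}, \dots, x^{(k-1)}, x^{(k+1)}, \dots, x^{(n)}\right)$.
We may thus write (17.7) in the form
of a Mamdani-Assilian rule (17.1) as

$$R_i : \text{if } \bigwedge_{k=1}^{n} u_i^{(k)}\left(x^{(k)}\right) \text{ then } u_i^{(y)}(y). \qquad (17.8)$$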


For one rule, the output value of an unseen input $x \in \mathbb{R}^n$
will be equivalent to (17.2) if the minimum t-norm is
used as conjunction $\wedge$. The overall output of the complete
rule base is given by a disjunction $\vee$ of all rule
outputs (cf. (17.4) if $\vee$ is the t-conorm maximum).

A crisp output can then again be computed by defuzzification,
e.g., using the COG method (17.5). Since
this computation is rather costly, the output membership
functions $u_i^{(y)}$ are commonly replaced by singletons,
i.e.,

$$u_i^{(y)}(y) = \begin{cases} 1 & \text{if } y = c_i^{(y)}, \\ 0 & \text{otherwise.} \end{cases}$$

Since each rule consequently comprises the component
$c_i^{(y)}$ of the cluster prototype, we can rewrite (17.8) as
the Sugeno-Yasukawa rule [17.55]

$$R_i : \text{if } \bigwedge_{k=1}^{n} u_i^{(k)}\left(x^{(k)}\right) \text{ then } y = c_i^{(y)}.$$

Fig. 17.11 Fuzzy rules and induced imprecise areas


These rules strongly resemble the neurons of a radial
basis function (RBF) network. This becomes clear
if every membership function is Gaussian, i.e.,

$$u_i(x) = \exp\left(-\frac{\|x - c_i\|^2}{\sigma_i^2}\right),$$

and if they are normalized, i.e.,

$$\sum_{i=1}^{c} u_i(x) = 1 \quad \text{for all } x \in \mathbb{R}^n.$$

Note that this link is used in neuro-fuzzy systems
for both training fuzzy rules with backpropagation and
initializing RBF networks with fuzzy rules [17.34].
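
The route from clustering to rules can be illustrated with a small sketch. It is not the authors' implementation: the deterministic initialization, the toy data, the width parameter sigma, and the function names fcm and rule_output are assumptions made for illustration. Each prototype is read as one rule with a singleton consequent, and the rules are combined with normalized Gaussian firing strengths as in the RBF analogy.

```python
import math

def dist(p, q):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def fcm(points, c, m=2.0, iters=50):
    """Plain fuzzy c-means; returns (prototypes, U) with U[i][j] = membership
    of point j in cluster i."""
    n, dim = len(points), len(points[0])
    # Assumption: deterministic start, prototypes spread evenly over the data order.
    step = (n - 1) / max(c - 1, 1)
    protos = [list(points[round(i * step)]) for i in range(c)]
    U = [[0.0] * n for _ in range(c)]
    for _ in range(iters):
        # Membership update: u_ij = 1 / sum_k (d_ij / d_kj)^(2 / (m - 1)).
        for j, p in enumerate(points):
            d = [max(dist(p, v), 1e-12) for v in protos]
            for i in range(c):
                U[i][j] = 1.0 / sum((d[i] / dk) ** (2.0 / (m - 1.0)) for dk in d)
        # Prototype update: mean of the data weighted by u_ij^m.
        for i in range(c):
            w = [u ** m for u in U[i]]
            protos[i] = [sum(wj * p[k] for wj, p in zip(w, points)) / sum(w)
                         for k in range(dim)]
    return protos, U

def rule_output(x, protos, sigma=1.0):
    """Singleton-consequent rules: the last prototype coordinate is the consequent,
    the others are the rule centre. Gaussian firing strengths are an assumed shape."""
    w = [math.exp(-dist(x, v[:-1]) ** 2 / sigma ** 2) for v in protos]
    return sum(wi * v[-1] for wi, v in zip(w, protos)) / sum(w)

# Toy data (x, y), invented for illustration: two well-separated groups.
points = [(0.1 * k, 1.0 + 0.01 * k) for k in range(5)] + \
         [(5.0 + 0.1 * k, 9.0 + 0.01 * k) for k in range(5)]
protos, U = fcm(points, c=2)
y_hat = rule_output((0.2,), protos)  # dominated by the rule of the first group
```

Note that rule_output divides by the sum of firing strengths, so an input far away from every prototype would need extra handling.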


17.5.1 Transfer Passenger Analysis 
Based on FCM 


In this section, we present another real-world control
problem that deals with the control of passenger
movements and flows in terminal areas on an airport's
land side. Especially during mass events such as
world championships and concerts, or in the peak season
in tourist areas, the capacities of passenger airports
reach their upper limits. Conflict-free allocation,
e.g., using intelligent destination boards and signposts,
can increase the safety and security of passengers and
airport employees. Thus, it is rational to study intelligent
control approaches to allocate passengers from
their arrival terminal to their departure terminal.

To evaluate different controllers, the German
Aerospace Center (DLR) implemented a macroscopic
passenger flow model that simulates passenger movements.
Here, probabilistic distributions are used to
describe passenger movements in terminal areas. The
approach of Keller and Kruse [17.56] constructs a fuzzy
rule base using FCM to describe the transfer passenger
amount between aircraft. These rules can be used as
control feedback for the macroscopic simulation.

The following attributes of passengers are used for
the analysis:


• The maximal amount of passengers in a certain aircraft
  (depending on the type of the aircraft)

• The distance between the airport of departure and
  the airport of destination (in three categories: short-,
  medium-, and long-haul)

• The time of departure

• The percentage of transfer passengers in the aircraft.


The number of clusters is determined by validity 
measures [17.41,57] evaluating the whole partition of 
all data. The clustering algorithm is run for a varying 
number of clusters. The validity of the resulting parti- 
tions is then compared by different measures. 
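
The text does not name the validity measures that were used. As one well-known example only, the following sketch computes Bezdek's partition coefficient from a membership matrix and picks the cluster count with the highest value. The function names and the example matrices are assumptions made for illustration.

```python
def partition_coefficient(U):
    """Bezdek's partition coefficient: average over data points of sum_i u_ij^2.
    U[i][j] is the membership of point j in cluster i; the value is 1.0 for a crisp
    partition and 1/c for a completely fuzzy one."""
    n = len(U[0])
    return sum(u * u for row in U for u in row) / n

def best_cluster_count(partitions):
    """partitions maps a cluster count c to the membership matrix obtained for it."""
    return max(partitions, key=lambda c: partition_coefficient(partitions[c]))

# Invented membership matrices for 4 data points.
crisp_2 = [[1.0, 1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 1.0]]
vague_3 = [[0.4] * 4, [0.3] * 4, [0.3] * 4]
chosen = best_cluster_count({2: crisp_2, 3: vague_3})
```

The partition coefficient tends to decrease as the number of clusters grows, which is one reason why several measures are usually compared rather than a single one.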

An example of resulting fuzzy clusters is given in 
Fig. 17.11. Here, every fuzzy cluster corresponds to 
one fuzzy rule. The color intensity indicates the firing 
strength of a specific rule. The imprecise areas are the 
fuzzy clusters where the color intensity indicates the 
membership degree. The tips of the fuzzy partitions are 
obtained in every domain by projections of the mul- 
tidimensional cluster centers (as explained before in 
Sect. 17.5). 

The fuzzy rules obtained by FCM are simplified 
through several steps. First, similar fuzzy sets are com- 
bined to one fuzzy set. Fuzzy sets similar to the univer- 
sal fuzzy set are removed. Fuzzy rules with the same 
input clauses are either combined if they also share 
the same output clauses or else they are removed from 
the rule base. Eventually, FCM and the rule-simplifying 
process yield five rules. 

Among them are the two following rules. If an 
aircraft with a relatively small amount of maximal 
passengers (80—200) has a short- or medium-haul des- 
tination departing late at night, then usually this flight 
has a high amount of transfer passengers (80—90%). 
If a flight with a medium-haul destination and a small 
aircraft (about 150 passengers) starts about noon, then 
it carries a relatively high amount of transfer passen- 
gers (ca. 70%). We refer the gentle reader to Keller and 
Kruse [17.56] for further details about this real-world
control application.

17.6 Conclusions


In this chapter, we introduced fuzzy control, a human-inspired
way to control a nonlinear process as an imprecisely
defined function. We discussed classical
control engineering and its limitations, which motivate
the need for a human knowledge-based approach
to control. This knowledge is typically represented
as either Mamdani-Assilian rules or Takagi-Sugeno
rules. We presented both types of fuzzy controllers
and also discussed the shortcomings of logic-based
controllers, although they are mathematically well defined.
We thoroughly presented two successful industrial
applications of fuzzy control. We also stressed
the necessity for automatic learning and tuning algorithms.
We mentioned the best-known approaches
briefly and rule induction from fuzzy clustering in detail.

18. Interval Type-2 Fuzzy PID Controllers


Tufan Kumbasar, Hani Hagras 


The aim of this chapter is to present a general
overview of interval type-2 fuzzy PID (proportional-integral-derivative)
controller structures.
We will focus on the standard double input direct
action type fuzzy PID controller structures and their
present design methods. It has been shown in
various works that type-1 fuzzy PID controllers,
which use crisp type-1 fuzzy sets, might not be able to
fully handle the high levels of uncertainty associated
with control applications, while type-2
fuzzy PID controllers, which use type-2 fuzzy sets, might
be able to handle such uncertainties and produce
a better control performance. Thus, we will classify
and examine the handled fuzzy PID controllers
within two groups with respect to the fuzzy sets
they employ, namely type-1 and interval type-2
fuzzy sets. We will present and examine the controller
structures of the direct action type-1 fuzzy
PID and interval type-2 fuzzy PID controllers on
a generic, symmetrical 3 × 3 rule base. We will
present general information about the tuning
parameters and design strategies of the type-1 fuzzy
PID and interval type-2 fuzzy PID controllers. Finally, we will
present a simulation study to evaluate the control
performance of the type-1 fuzzy PID and interval
type-2 fuzzy PID controllers on a first-order plus time-delay
benchmark process.


18.1 Fuzzy Control Background


It is well known that conventional PID controllers
are the most popular controllers used in industry due
to their simple structure and cost efficiency [18.1, 2].
However, the PID controller, being linear, is not suited
for strongly nonlinear and uncertain systems. Thus,
fuzzy logic controllers (FLCs) are extensively used as
an alternative to PID control in processes where the system
dynamics are either very complex or highly
nonlinear. FLCs have achieved great
success in real-world control applications since they do
not require a process model and the controller can be
constructed based on the human operator's control expertise.

In the fuzzy control literature, fuzzy PID controllers
(FPID) are often mentioned as an alternative to
conventional PID controllers since they are analogous to
them from the input-output
relationship point of view [18.3-6]. FPID controllers
can be classified into three major categories:
direct action type, fuzzy gain scheduling type, and hybrid
type fuzzy PID controllers [18.6]. The direct action
type can in turn be classified into three categories according
to the number of inputs: single input, double input,
and triple input direct action FPID controllers [18.6].
In the literature, researchers have mainly focused on and
analyzed double input direct action FPID controllers [18.6-10].
Numerous techniques have been developed
for analyzing and designing a wide variety of
FPID control systems. After the pioneering study by Qiao
and Mizumoto [18.7], research mainly focused
on type-1 fuzzy PID controllers (T1-FPID); recently, however,
a growing number of techniques have been developed
for interval type-2 fuzzy PID controllers (IT2-FPID).

It has been demonstrated that type-2 fuzzy logic
systems are much more powerful tools than ordinary
(type-1) fuzzy logic systems for representing highly nonlinear
and/or uncertain systems. As a consequence, type-2
fuzzy logic systems have been applied in various areas,
especially in control system design. The internal
structure of the interval type-2 fuzzy logic controller
(IT2-FLC) is similar to that of its type-1 counterpart. However,
the major difference is that at least one of the input
fuzzy sets (FSs) is an interval type-2 fuzzy set (IT2-FS) [18.11].
Thus, a type reducer is needed to convert
type-2 sets into a type-1 fuzzy set before a defuzzification
procedure can be performed [18.12]. Generally, interval
type-2 fuzzy logic systems achieve better control
performance because of the additional degree of freedom
provided by the footprint of uncertainty (FOU) in
their membership functions [18.13]. Consequently, IT2-FLCs
have attracted much research interest, especially
in control applications, since they are much more powerful
in handling uncertainties and nonlinearities [18.14].
Thus, IT2-FLCs have been employed successfully in several
applications such as pH control [18.13], liquid-level process
control [18.15], autonomous mobile robots [18.14, 16,
17], and bioreactor control [18.18].

In this chapter, we will focus on the most commonly 
used double input direct action type FPID controller 
structures. We will first present the general structure 
of the FPID controller and then classify the FPID con- 
trollers within two groups with respect to the fuzzy sets 
they employ, namely type-1 and interval type-2 fuzzy 
sets. Thus, we will present and examine the structures
of the T1-FPID and IT2-FPID controllers on a generic,
symmetrical 3 × 3 rule base. We will present detailed
information about their internal structures and design 
strategies presented in the literature. Finally, we will 
evaluate the control performance of the T1-FPID and 
IT2-FPID on a first-order plus time delay benchmark 
process. 


18.2 The General Fuzzy PID Controller Structure 


In this section, we present the general structure of the
two input direct action type FPID controllers formed
using a fuzzy PD controller with an integrator and
a summation unit at the output [18.7-10]. The standard
FPID controller is constructed by choosing the inputs
to be the error ($e$) and the derivative of the error ($\Delta e$),
and the output to be the control signal ($u$), as illustrated in
Fig. 18.1. The output of the FPID is defined as

$$u = \alpha U + \beta \int U \, dt, \qquad (18.1)$$

where $U$ is the output of the fuzzy inference system.
The design parameters of the FPID controller structure
can be summarized within two groups, structural
parameters and tuning parameters [18.6]. The structural
parameters include input/output variables to fuzzy inference,
fuzzy linguistic sets, type of membership functions,
fuzzy rules, and the inference mechanism, i.e.,
the fuzzy logic controller. In the handled FPID structure,
the FLC is constructed as a set of heuristic control
rules, and the control signal is directly deduced from
the knowledge base and the fuzzy inference as done
in diagonal rule base generation approaches [18.7-10].
More detailed information about the internal structure
of the FPID will be presented in the following subsections.
Usually the structural parameters of the FPID
controller structure are determined during an off-line
design.

Fig. 18.1 Illustration of the FPID controller structure

The tuning parameters include input/output scaling
factors (SFs) and parameters of membership functions
(MFs). As can be seen from Fig. 18.1, the handled
FPID controller structure has two input and two output
scaling factors [18.7-10]. The input SFs $K_e$ (for
the error $e$) and $K_d$ (for the change of error $\Delta e$) normalize
the inputs to the common interval $[-1, 1]$ in which
the membership functions of the inputs are defined
(thus $e(t)$ and $\Delta e(t)$ are converted after normalization
into $E$ and $\Delta E$), while the FLC output ($U$) is mapped
onto the respective actual output ($u$) domain by the output
SFs $\alpha$ and $\beta$. Usually, the tuning parameters can
be calculated during the offline design process as well as
adjusted online to enhance the process
performance [18.9, 10].
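
The signal flow around the fuzzy inference system can be sketched as follows. This is a minimal illustration under stated assumptions, not the structure used in the simulations of this chapter: the class name FuzzyPID, the backward-difference derivative, the rectangular integration, and the saturation of the normalized inputs to [-1, 1] are all choices made here, and flc stands for any function mapping (E, ΔE) to U.

```python
def clamp(v, lo=-1.0, hi=1.0):
    """Saturate a normalized input to the interval on which the MFs are defined."""
    return max(lo, min(hi, v))

class FuzzyPID:
    """u = alpha * U + beta * integral(U dt), with U = flc(Ke * e, Kd * de/dt)."""

    def __init__(self, flc, Ke, Kd, alpha, beta, ts):
        self.flc, self.Ke, self.Kd = flc, Ke, Kd
        self.alpha, self.beta, self.ts = alpha, beta, ts
        self.prev_e = 0.0      # assumption: the error is zero before the first sample
        self.integral = 0.0

    def step(self, e):
        de = (e - self.prev_e) / self.ts   # backward difference (assumed)
        self.prev_e = e
        E = clamp(self.Ke * e)             # normalized error
        dE = clamp(self.Kd * de)           # normalized change of error
        U = self.flc(E, dE)
        self.integral += U * self.ts       # rectangular rule (assumed)
        return self.alpha * U + self.beta * self.integral

# Stand-in inference system for illustration only: it simply returns E.
controller = FuzzyPID(lambda E, dE: E, Ke=1.0, Kd=0.1, alpha=0.1, beta=0.5, ts=0.1)
u1 = controller.step(1.0)
u2 = controller.step(1.0)
```

With a constant error, the proportional-like part alpha * U stays fixed while the integral part grows by beta * U * ts per sample, which is the PI-like behavior expected from (18.1).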




In the following subsections, we will examine the 
structures of the T1-FPID and IT2-FPID controllers. 
We will present detailed information about the T1-FPID 
and IT2-FPID internal structures, design parameters, 
and tuning strategies. Finally, we will present comparative
simulation results to show the superiority of
the interval type-2 fuzzy PID controller over its
type-1 counterpart.


18.2.1 Type-1 Fuzzy PID Controllers 


In this subsection, we will start by presenting the in- 
ternal structure of the handled T1-FPID controller and 
then we will present brief information about the design 
strategies for the T1-FPID controller structure in the lit- 
erature. 


The Internal Structure of the Type-1 Fuzzy PID Controller
In the handled T1-FPID structure, a symmetrical 3 × 3
rule base is used as shown in Table 18.1. The rule
structure of the type-1 fuzzy logic controller (T1-FLC)
is as follows

$$R_m: \text{If } E \text{ is } A_{1k} \text{ and } \Delta E \text{ is } A_{2l} \text{ then } U \text{ is } G_m, \qquad (18.2)$$

where $E$ (normalized error) and $\Delta E$ (normalized change
of error) are the inputs, $U$ is the output of the FLC, $G_m$ is
the consequent crisp set ($m = 1, \dots, M = 9$), and $M$ is the
number of rules. Here, $A_{1k}$ and $A_{2l}$ represent the type-1
membership functions (T1-MFs) ($k = 1, \dots, K = 3$;
$l = 1, \dots, L = 3$), where $K$ and $L$ are the numbers of MFs that
cover the universe of discourse of the inputs $E$ and $\Delta E$,
respectively. In this chapter, we will employ three triangular
T1-MFs for each input domain ($E$ and
$\Delta E$) and denote them as N (negative), Z (zero), and P
(positive). The T1-MFs of the T1-FLC are defined with
the three parameters ($l_{ij}, c_{ij}, r_{ij}$; $i = 1, \dots, I = 2$; $j = 1, \dots,
J = 3$), as shown in Fig. 18.2a. Here, $I$ is the total
number of the inputs ($I = 2$) and $J$ ($J = K = L = 3$) is
the total number of MFs per input.


Table 18.1 The rule base of the FPID controller

E/ΔE    N     Z     P
N       N     NM    Z
Z       NM    Z     PM
P       Z     PM    P


The outputs of the FLC are
defined with five crisp singleton consequents (negative
(N) = $y_N$, negative medium (NM) = $y_{NM}$, zero (Z) =
$y_Z$, positive medium (PM) = $y_{PM}$, positive (P) = $y_P$) as
illustrated in Fig. 18.2b. The implemented T1-FLCs use
the product implication and the center of sets defuzzification
method. Thus, the output ($U$) of the T1-FLC is
defined as


$$U = \frac{\sum_{m=1}^{M} f_m G_m}{\sum_{m=1}^{M} f_m}, \qquad (18.3)$$

where $f_m$ is the total firing strength of the $m$th rule,
defined as

$$f_m = \mu_{A_{1k}} * \mu_{A_{2l}}. \qquad (18.4)$$

Here, $*$ represents the product implication (the t-norm)
and $\mu_{A_{1k}}$ and $\mu_{A_{2l}}$ are the membership grades of the $A_{1k}$
and $A_{2l}$ T1-MFs, respectively.
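
As an illustration of (18.3) and (18.4), the following sketch evaluates the rule base of Table 18.1 with triangular MFs and singleton consequents. The numerical MF parameters and consequent values are the ones used later in the simulation study of this chapter; the function names and the handling of the shoulder-type N and P sets are assumptions, and the inputs are assumed to be already normalized to [-1, 1].

```python
def tri(x, l, c, r):
    """Triangular MF with feet l and r and peak c; l == c or c == r gives a one-sided set."""
    if x == c:
        return 1.0
    if l < x < c:
        return (x - l) / (c - l)
    if c < x < r:
        return (r - x) / (r - c)
    return 0.0

# Antecedent MFs on [-1, 1]: N peaks at -1, Z at 0, P at 1.
MFS = {"N": (-1.0, -1.0, 0.0), "Z": (-1.0, 0.0, 1.0), "P": (0.0, 1.0, 1.0)}
# Singleton consequents y_N, y_NM, y_Z, y_PM, y_P.
Y = {"N": -1.0, "NM": -0.75, "Z": 0.0, "PM": 0.75, "P": 1.0}
# Rule base of Table 18.1, keyed by (label of E, label of dE).
RULES = {("N", "N"): "N",  ("N", "Z"): "NM", ("N", "P"): "Z",
         ("Z", "N"): "NM", ("Z", "Z"): "Z",  ("Z", "P"): "PM",
         ("P", "N"): "Z",  ("P", "Z"): "PM", ("P", "P"): "P"}

def t1_flc(E, dE):
    """Weighted average of the singleton consequents, eq. (18.3),
    with product firing strengths, eq. (18.4)."""
    num = den = 0.0
    for (a, b), out in RULES.items():
        f = tri(E, *MFS[a]) * tri(dE, *MFS[b])
        num += f * Y[out]
        den += f
    return num / den

U = t1_flc(0.5, 0.0)  # rules (Z, Z) and (P, Z) fire with strength 0.5 each
```

For E = 0.5 and ΔE = 0, the two firing rules have consequents 0 and 0.75, so the output is their equally weighted average.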


Type-1 Fuzzy PID Design Strategies
In the design of the handled T1-FPID controller structure
with a 3 × 3 rule base, the parameters to be determined
are the scaling factors and the parameters of
the antecedent and consequent membership functions.
The antecedent MFs of the T1-FPID controller that
are labeled as N and P are defined for each input
with two parameters each, which are $c_{i1}, r_{i1}$ (for
N) and $l_{i3}, c_{i3}$ (for P) ($i = 1, 2$), respectively, while
the linguistic label Z is defined with three parameters,
which are $l_{i2}, c_{i2}, r_{i2}$ ($i = 1, 2$). Hence, for two inputs,
the total number of antecedent membership function
parameters to be designed for the T1-FPID is
$2 \times 7 = 14$. Moreover, five output consequent parameters
($y_N, y_{NM}, y_Z, y_{PM}, y_P$) have to be determined. Thus,
in total 19 MF parameters have to be tuned. Besides, the
input and output scaling factors ($K_e$, $K_d$, $\alpha$, and $\beta$) of
the T1-FPID controller must also be determined. Thus,
there are 19 MF and four SF parameters (19 + 4 = 23)
that have to be tuned for the handled T1-FPID controller.

Fig. 18.2a,b Illustration of the

In the fuzzy control literature, one method for
T1-FPID controller design is to employ evolutionary
algorithms [18.19,20]. In addition, Ahn and
Truong [18.21] used a robust extended Kalman filter
to tune the membership functions of the fuzzy
controller during system operation in order to improve
the control performance in an online manner.
Moreover, various heuristic and nonheuristic scaling
factor tuning algorithms have been presented for
systems that exhibit nonlinearities, parameter
changes, modeling errors, and disturbances [18.7, 9, 10, 22-25].


18.2.2 Interval Type-2 Fuzzy PID Controllers 


In this subsection, we will start by presenting the in- 
ternal structure of the handled IT2-FPID controller and 
then we will present brief information about the design 
strategies for the IT2-FPID controller structure in the 
literature. 


The Internal Structure of the Interval Type-2 Fuzzy PID Controller
In the handled IT2-FPID structure, the same 3 × 3 rule
base presented for the T1-FPID controller in Table 18.1
is used. The internal structure of the
IT2-FPID is similar to that of its type-1 counterpart. However,
the major differences are that IT2-FLCs employ and
process interval type-2 fuzzy sets (IT2-FSs) rather than
type-1 fuzzy sets, and
thus the IT2-FLC has the extra type-reduction process
[18.12, 14].

Type-2 fuzzy sets are the generalized forms of type-1
fuzzy sets. A type-2 fuzzy set ($\tilde{A}$) is characterized by
a type-2 membership function $\mu_{\tilde{A}}(x, u)$, i.e.,

$$\tilde{A} = \left\{ \left((x, u), \mu_{\tilde{A}}(x, u)\right) \mid \forall x \in X, \ \forall u \in J_x \subseteq [0, 1] \right\}, \qquad (18.5)$$

in which $0 \le \mu_{\tilde{A}}(x, u) \le 1$.
For a continuous universe of discourse, $\tilde{A}$ can
also be expressed as

$$\tilde{A} = \int_{x \in X} \int_{u \in J_x} \mu_{\tilde{A}}(x, u) / (x, u), \qquad J_x \subseteq [0, 1], \qquad (18.6)$$

where $\iint$ denotes union over all admissible $x$ and
$u$ [18.12, 14]. $J_x$ is referred to as the primary membership
of $x$, while $\mu_{\tilde{A}}(x, u)$ is a type-1 fuzzy set known
as the secondary set. The uncertainty in the primary
membership of a type-2 fuzzy set $\tilde{A}$ is defined by a region
named the footprint of uncertainty (FOU). The FOU
can be described in terms of an upper membership
function ($\overline{\mu}_{\tilde{A}}$) and a lower membership function ($\underline{\mu}_{\tilde{A}}$).
The primary membership is called $J_x$, and its associated
possible secondary membership functions can be
trapezoidal, interval, etc. When the interval secondary
membership function is employed, an IT2-FS (such as
the ones shown in Fig. 18.4a) is obtained [18.12, 14].
In other words, when $\mu_{\tilde{A}}(x, u) = 1$ for $\forall u \in J_x \subseteq [0, 1]$,
an IT2-FS is constructed.

Fig. 18.3 Block diagram of the IT2-FLC (crisp input, fuzzifier, rules and inference, type-reducer, defuzzifier, crisp output)

The internal structure of the IT2-FLC is given in
Fig. 18.3. Similar to a T1-FLC, an IT2-FLC includes
a fuzzifier, a rule base, and an inference engine, but it replaces
the defuzzifier with an output processor comprising
a type reducer and a defuzzifier. The IT2-FLC uses
interval type-2 fuzzy sets (such as the ones shown in 
Fig. 18.4a) to represent the inputs and/or outputs of the 
FLC. In the interval type-2 fuzzy sets all the third di- 
mension values are equal to one. The use of IT2-FLC 
helps to simplify the computation (as opposed to the 
general type-2 FLC where the third dimension of the 
type-2 fuzzy sets can take any shape). 

The IT2-FLC works as follows: the crisp inputs are 
first fuzzified into input type-2 fuzzy sets; singleton 
fuzzification is usually used in IT2-FLC applications 
due to its simplicity and suitability for embedded pro- 
cessors and real-time applications. The input type-2 
fuzzy sets then activate the inference engine and the 
rule base to produce output type-2 fuzzy sets. The IT2- 
FLC rule base remains the same as for the T1-FLC but 
its MFs are represented by interval type-2 fuzzy sets 
instead of type-1 fuzzy sets. The inference engine combines
the fired rules and gives a mapping from input
type-2 fuzzy sets to output type-2 fuzzy sets. The type-2
fuzzy output sets of the inference engine are then pro- 
cessed by the type reducer which combines the output 
sets and performs a centroid calculation which leads to 
type-1 fuzzy sets called the type-reduced sets. There are 
different types of type-reduction methods. In this pa- 
per, we will be using the center of sets type reduction 
as it has reasonable computational complexity that lies 
between the computationally expensive centroid type- 
reduction and the simple height and modified height 
type-reductions which have problems when only one 
rule fires [18.12]. After the type-reduction process, the 
type-reduced sets are defuzzified (by taking the average 
of the type-reduced sets) to obtain crisp outputs. More 
information about the type-2 fuzzy logic systems and 
their benefits can be found in [18.11, 12, 14]. 

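The rule structure of the interval type-2 fuzzy logic
controller is as follows

$$R_m: \text{If } E \text{ is } \tilde{A}_{1k} \text{ and } \Delta E \text{ is } \tilde{A}_{2l} \text{ then } U \text{ is } G_m, \qquad (18.7)$$

where $E$ (normalized error) and $\Delta E$ (normalized change
of error) are the inputs, $U$ is the output of the IT2-FLC,
$G_m$ is the consequent interval set ($G_m = [\underline{g}_m, \overline{g}_m]$; $m =
1, \dots, M = 9$), and $M$ is the number of rules. The antecedents
of the IT2-FLC are defined by interval type-2
membership functions (IT2-MFs) ($\tilde{A}_{1k}$, $\tilde{A}_{2l}$) for the inputs
$E$ and $\Delta E$, respectively, which can be simply obtained
by extending/blurring the T1-MFs ($A_{1k}$, $A_{2l}$) of
the T1-FLC. Here, the IT2-MF is defined with four parameters
($l_{ij}, c_{ij}, r_{ij}, \delta_{ij}$; $i = 1, 2$; $j = 1, 2, 3$), as shown in
Fig. 18.4a. Since the input IT2-FS is described in terms
of an upper membership function ($\overline{\mu}_{\tilde{A}}$) and a lower
membership function ($\underline{\mu}_{\tilde{A}}$), the total firing strength for
the $m$th rule is

$$\tilde{f}_m = \left[\underline{f}_m, \overline{f}_m\right], \qquad (18.8)$$

where $\tilde{f}_m$ is the total firing interval and is defined as

$$\underline{f}_m = \underline{\mu}_{\tilde{A}_{1k}} * \underline{\mu}_{\tilde{A}_{2l}}, \qquad (18.9)$$

$$\overline{f}_m = \overline{\mu}_{\tilde{A}_{1k}} * \overline{\mu}_{\tilde{A}_{2l}}. \qquad (18.10)$$

Here, $*$ represents the product implication (the t-norm)
and $\underline{\mu}_{\tilde{A}_{1k}}$, $\underline{\mu}_{\tilde{A}_{2l}}$ and $\overline{\mu}_{\tilde{A}_{1k}}$, $\overline{\mu}_{\tilde{A}_{2l}}$ are the lower and upper
membership grades of the $\tilde{A}_{1k}$ and $\tilde{A}_{2l}$ IT2-MFs, respectively.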
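
A firing interval as in (18.8) to (18.10) can be sketched as follows. The sketch assumes that the fourth parameter δ is the height of the lower membership function, so that the lower MF is the upper (triangular) MF scaled by δ; this reading of δ is an assumption made here, as are the function names.

```python
def tri(x, l, c, r):
    """Triangular MF with feet l and r and peak c; l == c or c == r gives a one-sided set."""
    if x == c:
        return 1.0
    if l < x < c:
        return (x - l) / (c - l)
    if c < x < r:
        return (r - x) / (r - c)
    return 0.0

def it2_tri(x, l, c, r, delta):
    """(lower, upper) membership grades of an interval type-2 triangular MF.
    Assumption: the lower MF is the upper MF scaled to height delta."""
    upper = tri(x, l, c, r)
    return delta * upper, upper

def firing_interval(E, dE, mf_E, mf_dE):
    """Lower and upper firing strengths of one rule with the product t-norm,
    as in eqs. (18.9) and (18.10)."""
    lo_1, up_1 = it2_tri(E, *mf_E)
    lo_2, up_2 = it2_tri(dE, *mf_dE)
    return lo_1 * lo_2, up_1 * up_2

# Example sets: P with delta = 0.2 and Z with delta = 0.9.
P = (0.0, 1.0, 1.0, 0.2)
Z = (-1.0, 0.0, 1.0, 0.9)
f_lower, f_upper = firing_interval(0.5, 0.0, P, Z)
```

With δ = 1 the lower and upper grades coincide and the firing interval collapses to the type-1 firing strength of (18.4).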

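The consequent membership functions of the
IT2-FLC are defined with five interval consequents,
labeled as negative (N) = $[\underline{y}_N, \overline{y}_N]$, negative
medium (NM) = $[\underline{y}_{NM}, \overline{y}_{NM}]$, zero (Z) = $[\underline{y}_Z, \overline{y}_Z]$, positive
medium (PM) = $[\underline{y}_{PM}, \overline{y}_{PM}]$, and positive (P) =
$[\underline{y}_P, \overline{y}_P]$, as shown in Fig. 18.4b.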

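The implemented IT2-FLC uses the center of sets
type reduction method [18.12]. It has been demonstrated
that the defuzzified output of an IT2-FLC can
be calculated as

$$U = \frac{U_l + U_r}{2}, \qquad (18.11)$$

where $U_l$ and $U_r$, the left and right end points,
respectively, of the type-reduced set, are defined as follows

$$U_l = \frac{\sum_{m=1}^{L} \overline{f}_m \underline{g}_m + \sum_{m=L+1}^{M} \underline{f}_m \underline{g}_m}{\sum_{m=1}^{L} \overline{f}_m + \sum_{m=L+1}^{M} \underline{f}_m}, \qquad (18.12)$$

$$U_r = \frac{\sum_{m=1}^{R} \underline{f}_m \overline{g}_m + \sum_{m=R+1}^{M} \overline{f}_m \overline{g}_m}{\sum_{m=1}^{R} \underline{f}_m + \sum_{m=R+1}^{M} \overline{f}_m}. \qquad (18.13)$$

The type-reduced set can be calculated by using the
iterative Karnik and Mendel (KM) method, which is
given in Table 18.2 [18.26].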
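
The Karnik-Mendel iteration of Table 18.2 can be sketched as follows. This is a minimal illustration, not an optimized implementation: the function names, the convergence tolerance, and the iteration cap are assumptions, and the firing strengths are assumed not to vanish all at once.

```python
def km_endpoint(g, f_lo, f_hi, left=True, max_iter=100):
    """One end point of the type-reduced set.
    g: consequent end points (lower ones for U_l, upper ones for U_r);
    f_lo, f_hi: lower and upper firing strengths of the same rules."""
    order = sorted(range(len(g)), key=lambda i: g[i])       # step 1: sort
    g = [g[i] for i in order]
    f_lo = [f_lo[i] for i in order]
    f_hi = [f_hi[i] for i in order]
    M = len(g)
    f = [(a + b) / 2.0 for a, b in zip(f_lo, f_hi)]         # step 2: initialize
    u = sum(fi * gi for fi, gi in zip(f, g)) / sum(f)
    for _ in range(max_iter):
        # Step 3: switch point k (0-based) with g[k] <= u <= g[k + 1].
        k = max((i for i in range(M - 1) if g[i] <= u), default=0)
        # Step 4: upper strengths up to the switch point for U_l, the reverse for U_r.
        if left:
            f = [f_hi[i] if i <= k else f_lo[i] for i in range(M)]
        else:
            f = [f_lo[i] if i <= k else f_hi[i] for i in range(M)]
        u_new = sum(fi * gi for fi, gi in zip(f, g)) / sum(f)
        if abs(u_new - u) < 1e-12:                          # step 5: converged
            break
        u = u_new
    return u_new

def it2_output(g_lo, g_hi, f_lo, f_hi):
    """Defuzzified output U = (U_l + U_r) / 2, eq. (18.11)."""
    u_l = km_endpoint(g_lo, f_lo, f_hi, left=True)
    u_r = km_endpoint(g_hi, f_lo, f_hi, left=False)
    return (u_l + u_r) / 2.0

# Invented example with three rules and crisp consequents.
g = [-1.0, 0.0, 1.0]
U_l = km_endpoint(g, [0.2, 0.5, 0.1], [0.6, 0.9, 0.4], left=True)
U_r = km_endpoint(g, [0.2, 0.5, 0.1], [0.6, 0.9, 0.4], left=False)
```

When the lower and upper firing strengths coincide, both end points reduce to the ordinary weighted average of (18.3).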


Table 18.2 Calculation of the two end points of the type reduced set

Step 1
  Computing $U_l$: Sort $\underline{g}_m$ ($m = 1, \dots, M$) in increasing order such that
  $\underline{g}_1 \le \underline{g}_2 \le \dots \le \underline{g}_M$. Match the corresponding weights $\underline{f}_m, \overline{f}_m$
  (with their index corresponding to the renumbered $\underline{g}_m$).
  Computing $U_r$: Sort $\overline{g}_m$ ($m = 1, 2, \dots, M$) in increasing order such that
  $\overline{g}_1 \le \overline{g}_2 \le \dots \le \overline{g}_M$. Match the corresponding weights $\underline{f}_m, \overline{f}_m$
  (with their index corresponding to the renumbered $\overline{g}_m$).

Step 2
  Computing $U_l$: Initialize $f_m$ by setting $f_m = (\underline{f}_m + \overline{f}_m)/2$, $m = 1, 2, \dots, M$,
  and compute $U = \sum_{m=1}^{M} f_m \underline{g}_m \big/ \sum_{m=1}^{M} f_m$.
  Computing $U_r$: Initialize $f_m$ by setting $f_m = (\underline{f}_m + \overline{f}_m)/2$, $m = 1, 2, \dots, M$,
  and compute $U = \sum_{m=1}^{M} f_m \overline{g}_m \big/ \sum_{m=1}^{M} f_m$.

Step 3
  Computing $U_l$: Find the switch point $L$ ($1 \le L \le M - 1$) such that $\underline{g}_L \le U \le \underline{g}_{L+1}$.
  Computing $U_r$: Find the switch point $R$ ($1 \le R \le M - 1$) such that $\overline{g}_R \le U \le \overline{g}_{R+1}$.

Step 4
  Computing $U_l$: Set $f_m = \overline{f}_m$ for $m \le L$ and $f_m = \underline{f}_m$ for $m > L$,
  and compute $U' = \sum_{m=1}^{M} f_m \underline{g}_m \big/ \sum_{m=1}^{M} f_m$.
  Computing $U_r$: Set $f_m = \underline{f}_m$ for $m \le R$ and $f_m = \overline{f}_m$ for $m > R$,
  and compute $U' = \sum_{m=1}^{M} f_m \overline{g}_m \big/ \sum_{m=1}^{M} f_m$.

Step 5
  Computing $U_l$: Check if $U = U'$. If not, set $U = U'$ and go to step 3. If yes, stop and set $U_l = U'$.
  Computing $U_r$: Check if $U = U'$. If not, set $U = U'$ and go to step 3. If yes, stop and set $U_r = U'$.


Interval Type-2 Fuzzy PID Design Strategies
In the interval type-2 fuzzy PID control design strategy,
the scaling factors and the parameters of the antecedent
and consequent membership functions of the IT2-FPIDs
have to be determined. The antecedent IT2-FSs
of the IT2-FPID controller that are labeled as N and P
are defined for each input with three parameters each,
which are ($c_{i1}, r_{i1}, \delta_{i1}$) and ($l_{i3}, c_{i3}, \delta_{i3}$) ($i = 1, 2$), respectively,
while the IT2-FS labeled as Z is defined with
four parameters ($l_{i2}, c_{i2}, r_{i2}, \delta_{i2}$; $i = 1, 2$). Consequently,
for the two inputs the total number of antecedent MF
parameters to be designed is
$2 \times 10 = 20$. Moreover, 10 output consequent parameters
($\underline{y}_N, \overline{y}_N, \underline{y}_{NM}, \overline{y}_{NM}, \underline{y}_Z, \overline{y}_Z, \underline{y}_{PM}, \overline{y}_{PM}, \underline{y}_P, \overline{y}_P$) have to
be determined. Hence, in total 30 MF parameters have
to be tuned. Besides, the input and output scaling factors
($K_e$, $K_d$, $\alpha$, and $\beta$) of the controller must also be
determined. Thus, there are 30 MF and four SF parameters
(30 + 4 = 34) that have to be tuned for the handled
IT2-FPID controller. The IT2-FPID thus
has 11 more tuning parameters, i.e., extra degrees of
freedom, than the T1-FPID controller structure (23
parameters).

The systematic design of type-2 fuzzy controllers is
a challenging problem since the output cannot be presented
in a closed form due to the KM type-reduction
method. To overcome this bottleneck, alternative type-reduction
algorithms which are closed-form approximations
to the original KM type-reduction algorithm
have been proposed and employed in controller design
[18.27,28]. However, the main difficulty is to
tune the relatively large number of parameters of the
IT2-FPID controller structure. Thus, several studies
have employed various techniques for the design problem
including genetic algorithms [18.15, 29], particle
swarm optimization [18.30], and ant colony optimization
[18.31].

From a practical point of view, the IT2-FPID controller
design problem can be simply solved by blurring/extending
the MFs of an existing T1-FPID controller
[18.16, 28, 32]. Moreover, each rule consequent
can also be chosen as a crisp number ($\underline{g}_m = \overline{g}_m$) to
reduce the number of parameters to be determined.
It is also common to set the consequent parameters
and the scaling factors to the same values as those of a predesigned
T1-FPID controller. Thus, in the IT2-FPID
design only the antecedent membership parameters
have to be tuned [18.28]. This design approach
reduces the number of parameters to be tuned from 34 to 20.
The design of the antecedent MFs can be carried out
by extensive trial-and-error procedures or by employing
evolutionary algorithms [18.33].

In this section, we will compare the performance
of the IT2-FPID controller with that of the T1-FPID controller
for the following first-order plus time-delay process

$$G(s) = \frac{K}{\tau s + 1} e^{-Ls}, \qquad (18.14)$$

where $K$ is the gain, $L$ is the time delay, and $\tau$ is the time
constant of the process. The nominal system parameters
are $K = 1$, $L = 1$, and $\tau = 1$.
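
For readers who want to reproduce the benchmark process, the open-loop unit step response of (18.14) can be sketched in discrete time as follows. The zero-order-hold discretization, the step size of 0.1 s, and the function name are assumptions made for this illustration; they are not taken from the simulation setup of the chapter.

```python
import math
from collections import deque

def fopdt_step_response(K=1.0, tau=1.0, L=1.0, dt=0.1, t_end=10.0):
    """Samples y(0), y(dt), ..., y(t_end) of tau * dy/dt = -y + K * u(t - L)
    for a unit step input applied at t = 0."""
    a = math.exp(-dt / tau)               # exact pole of the sampled first-order lag
    delay = round(L / dt)                 # dead time in samples (L assumed a multiple of dt)
    in_transit = deque([0.0] * delay)     # inputs that have not reached the lag yet
    y, out = 0.0, []
    for _ in range(round(t_end / dt) + 1):
        out.append(y)
        in_transit.append(1.0)            # unit step input
        u_delayed = in_transit.popleft()
        y = a * y + K * (1.0 - a) * u_delayed
    return out

y = fopdt_step_response()                 # nominal parameters K = 1, tau = 1, L = 1
```

The output stays at zero for the duration of the dead time and then follows K(1 - exp(-(t - L)/τ)), which is the behavior any closed-loop controller for this process has to cope with.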

We will first design a T1-FPID controller and then
extend the type-1 fuzzy controller to design an IT2-FPID
controller structure, since type-2 fuzzy logic theory
is a generalization and extension of type-1 fuzzy
logic theory. We will characterize each input domain
($E$ and $\Delta E$) of the T1-FPID controller with three uniformly
distributed symmetrical triangular MFs. The parameters
of the MFs are tabulated in Table 18.3. We will
set the consequent parameters as $y_N = -1.0$ (negative),
$y_{NM} = -0.75$ (negative medium), $y_Z = 0.0$ (zero),
$y_{PM} = 0.75$ (positive medium), $y_P = 1.0$ (positive) to
obtain a standard diagonal rule base. Then, the scaling
factors of the T1-FPID have been chosen so as
to obtain a fast and satisfactory output response for
a unit step input. The scaling factors are set as $K_e = 1$,
$K_d = 0.1$, $\alpha = 0.1$, and $\beta = 0.5$.


Table 18.3 The antecedent MF parameters of the T1-FPID
and IT2-FPID controllers

               T1-FPID                IT2-FPID
Input  MF      l      c      r       l      c      r      δ
E      N       n/a    -1.0   0.0     n/a    -1.0   0.0    (illegible)
E      Z       -1.0   0.0    1.0     -1.0   0.0    1.0    (illegible)
E      P       0.0    1.0    n/a     0.0    1.0    n/a    0.2
ΔE     N       n/a    -1.0   0.0     n/a    -1.0   0.0    (illegible)
ΔE     Z       -1.0   0.0    1.0     -1.0   0.0    1.0    0.9
ΔE     P       0.0    1.0    n/a     0.0    1.0    n/a    0.2


Fig. 18.5a-d Illustration of the step responses: (a) nominal process, (b) perturbed process-1, (c) perturbed process-2, (d) perturbed process-3


Table 18.4 Control performance comparison of the FPID controllers

Process                              Controller   OS (%)        Ts      ITAE
Nominal (K = 1, τ = 1, L = 1)        T1-FPID      (illegible)   9.8     27.87
                                     IT2-FPID     4             6.0     26.63
Perturbed-1 (K = 1, τ = 2, L = 1)    T1-FPID      25            18.6    53.38
                                     IT2-FPID     16            14.8    47.83
Perturbed-2 (K = 1, τ = 1, L = 2)    T1-FPID      43            23.8    80.83
                                     IT2-FPID     30            18.2    (illegible)
Perturbed-3 (K = 2, τ = 1, L = 1)    T1-FPID      47            15.5    43.74
                                     IT2-FPID     36            11.9    32.17

As stated above, the IT2-FPID controller design
will be accomplished by only blurring/extending the
antecedent MFs of the T1-FPID controller. Thus, we
will set the consequent parameters and the scaling
factors to the same values as those of its type-1 counterpart.
The antecedent MF parameters of the IT2-FPID are
presented in Table 18.3. This setting gives the opportunity
to illustrate how the extra degrees of freedom
provided by the FOUs affect the control system
performance.

In the simulation studies, both FPID controllers
are implemented as discrete-time versions obtained
with the bilinear transform with the sampling time
t_s = 0.1 s. The simulations were done on a personal
computer with an Intel Pentium Dual Core T2370 1.73 GHz
processor, 2.99 GB RAM, and the software package
MATLAB/Simulink 7.4.0. Note that the simulation solver
option is chosen as ode5 (Dormand-Prince) and the step
size is fixed at a value of 0.1 s.
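
Because the responses are sampled on a fixed 0.1 s grid, step-response measures such as the percent overshoot, the settling time, and the ITAE reduce to simple operations on the sampled output. The following is a minimal sketch, not the chapter's implementation: the function name `step_metrics`, the 2% settling band, and the synthetic first-order response are all assumptions made for illustration (the chapter does not state which settling band was used).

```python
import math

def step_metrics(t, y, ref=1.0, band=0.02):
    """Overshoot (%), settling time, and ITAE of a uniformly sampled step response."""
    dt = t[1] - t[0]
    overshoot = max(0.0, (max(y) - ref) / ref * 100.0)
    # Settling time: first sample after the last excursion outside the band
    # (the 2% band is an assumption, not taken from the chapter).
    outside = [i for i, yi in enumerate(y) if abs(yi - ref) > band * abs(ref)]
    if not outside:
        settling = t[0]
    elif outside[-1] + 1 < len(t):
        settling = t[outside[-1] + 1]
    else:
        settling = float("nan")  # never settles within the simulated horizon
    # ITAE: integral of t * |e(t)|, approximated by a rectangle rule.
    itae = sum(ti * abs(ref - yi) for ti, yi in zip(t, y)) * dt
    return overshoot, settling, itae

ts = 0.1                                   # sampling time in seconds
t = [k * ts for k in range(401)]           # 0 ... 40 s
y = [1.0 - math.exp(-tk) for tk in t]      # synthetic response, no controller involved
os_pct, t_settle, itae = step_metrics(t, y)
print(os_pct, t_settle, itae)
```

Replacing the synthetic `y` with a logged controller output would give figures comparable in kind to those reported for the two controllers, provided the same band and integration rule are used for both.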

The unit step response performances of the type-1
and type-2 fuzzy PID control systems are investigated
for the nominal parameter set K = 1, τ = 1, L = 1
(nominal process) and for three perturbed parameter
sets, namely K = 1, τ = 2, L = 1 (perturbed process-1),
K = 1, τ = 1, L = 2 (perturbed process-2), and
K = 2, τ = 1, L = 1 (perturbed process-3), to examine
their robustness against parameter variations. In this
context, we consider three performance measures, namely
the settling time (T_s), the overshoot (OS%), and the
integral time absolute error (ITAE).

The system performances of the nominal and perturbed
systems are illustrated in Fig. 18.5 and the
performance measures are given in Table 18.4. As can
be clearly seen in Fig. 18.5, the IT2-FPID controller
produces superior control performance in comparison
to its type-1 counterpart. For instance, for the
nominal process, the IT2-FPID control structure reduces
the overshoot by about 66% compared to the T1-FPID;
it also decreases the settling time by about 39% and
the total ITAE value by about 8%. Moreover, for
perturbed process-2 (the time delay L has been
increased by 100%), it can be clearly seen that the
T1-FPID control system response is oscillatory, while
the IT2-FPID reduces the overshoot by about 30%, the
settling time by about 24%, and the total ITAE value
by about 27%. Similar comments can be made for the
other two perturbed systems.

It can be concluded that the transient-state
performance of the IT2-FPID control structure is better
than that of the T1-FPID controller, and that it
appears to be more robust against parameter variations
than its type-1 counterpart.


18.4 Conclusion


The aim of this chapter is to present a general overview
of the FPID controller structures in the literature,
since fuzzy sets are recognized as a powerful tool for
handling the uncertainties faced in control applications.
We mainly focused on the double-input direct-action
type fuzzy PID controller structures and their
state-of-the-art design methods. We classified the fuzzy
PID controllers in the literature into two groups,
namely T1-FPID and IT2-FPID controllers. We examined
the internal structures of the T1-FPID and IT2-FPID
on a generic, diagonal 3 × 3 rule base. We presented
detailed information about their internal structures,
design parameters, and the tuning strategies presented
in the literature. Finally, we evaluated the control
performance of the T1-FPID and IT2-FPID on a
first-order plus time delay benchmark process. We
illustrated that the T1-FPID controller using crisp
type-1 fuzzy sets might not be able to fully handle
high levels of uncertainty, while the IT2-FPID using
type-2 fuzzy sets might be able to handle such
uncertainties and produce a better control performance.
incompleteness, and ambiguity. Soft computing
techniques allow for coping more efficiently with 
such kinds of imperfection when handling data in 
information systems. In this chapter, we give an 
overview of selected soft computing techniques for 
database management. The chapter is subdivided 
in two parts which deal with the soft computing 
techniques, respectively, for information mod- 
elling and querying. A considerable part of the 
19.1 Challenges for Modern Information Systems


Database systems nowadays form a basic component of
almost every information system and their role is getting
more and more important. Almost every person or company
keeps track of a large amount of digital data and
this amount is still growing every day. Many ICT managers
declare that big data is a point of attention for the
coming years. Although the concept of big data is not
clearly defined, one can generally agree that it refers
to the increasing need for efficiently storing and
handling large amounts of information.

However, it is easy to observe that information is
not always available in a perfect form. Just consider the
fact that human beings communicate most of the time
using vague terms, thereby reflecting the fact that they
do not know exact, precise values with certainty. In
general, imperfection of information may be due to
imprecision, uncertainty, incompleteness, or ambiguity.


Our life and society have changed in such a way that we 
simply cannot neglect or discard imperfect information 
anymore. To be competitive, companies need to cope 
with all information that is available. Efficiently storing 
and handling imperfect information without introduc- 
ing errors or causing data loss is therefore considered 
as one of the main challenges for information manage- 
ment in this century [19.1]. 

Soft computing offers formalisms and techniques
for coping with imperfect data in a mathematically
sound way [19.2]. The earliest research activities in this
area date back to the early 1980s. In this chapter, we
present an overview of selected results of the research
on soft computing techniques aimed at improving database
modelling and database access in the presence of
imperfect information. The scope of the chapter is
further limited to database access techniques that are
based on querying and on specifying and handling user
preferences in query formulations.
Other techniques, not dealt with in this chapter, include:


• Self-query auto-completion systems that help users
  in formulating queries by exploiting past queries, as
  used in recommendation systems [19.3].

• Navigational querying systems that allow intelligent
  navigation through the database [19.4].

• Cooperative querying systems that support indirect
  answers such as summaries, conditional answers,
  and contextual background information for (empty)
  results [19.5].


The remainder of the chapter is organized as follows.
In Sect. 19.2, we give some preliminaries on
(relational) databases, which are used as a basis for
illustrating the described techniques. The next two
sections, Sects. 19.3 and 19.4, form the core of the
chapter. In Sect. 19.3, an overview of soft computing
techniques for the modelling and handling of imperfect
data in databases is presented. In Sect. 19.4, the main
trends in soft computing techniques for flexible
database querying are discussed. Both querying of
regular databases and querying of databases containing
imperfect data are handled. The conclusions of the
chapter are stated in Sect. 19.5.


19.2 Some Preliminaries


In order to review and discuss the main contributions to
the research area of soft computing in database and
information management, we have to introduce the
terminology and notation related to the basics of
database management and fuzzy logic.


19.2.1 Relational Databases


The techniques presented in this work will be described
as generally as possible, so that they are in fact
applicable to multiple database models. However, due to
its popularity and mathematical foundations, the
relational database model has been used as the original
formal framework for many of these techniques. For that
reason, we opt for using the relational model as the
underlying database model throughout the chapter.

A relational database can in an abstract sense be
seen as a collection of relations or, informally, of
tables which represent them. Informally speaking, the
columns of a table represent its characteristics, whereas
its rows reflect its content [19.6]. From a formal point
of view, each relation R is defined via its relation
schema [19.7]

$$R(A_1 : D_1, \ldots, A_n : D_n), \tag{19.1}$$

where $A_i : D_i$, $i = 1, \ldots, n$, are the attributes
(columns) of the relation. Each attribute $A_i : D_i$ is
characterized by its name $A_i$ and its associated data
type (domain) $D_i$, also denoted as $\mathrm{dom}_{A_i}$.
The data type $D_i$ determines the allowed values for the
attribute and the basic operators that can be applied to
them. Each relation (table) represents a set of real-world
entities, each of them being modeled by a tuple (row) of
the relation. Relation schemas are the basic components
of a database schema. In this way, a table contains data
describing a part of the real world being modeled by the
database schema.

The most interesting operation on a database, from
this chapter's perspective, is the retrieval of data
satisfying certain conditions. Usually, to retrieve data,
a user forms a query specifying these conditions
(criteria). The conditions then reflect the user's
preferences with respect to the information he or she is
looking for. The retrieval process may be understood as
the calculation of a matching degree for each tuple of
the relevant relation(s). Classically, a row either
matches the query or not, i.e., the concept of matching
is binary. In the context of soft computing, flexible
criteria, soft aggregation, and soft ranking techniques
can be used, so that tuple matching becomes a matter of
degree.
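
The difference between binary and graded matching can be sketched on a toy table. Everything in this example is hypothetical: the `books` rows, the fuzzy criterion "cheap", and its breakpoints (fully satisfied up to 100, not satisfied at all from 200) are assumptions chosen only to make the contrast visible.

```python
def cheap(price, full=100.0, zero=200.0):
    """Membership of `price` in the (hypothetical) fuzzy set 'cheap'."""
    if price <= full:
        return 1.0
    if price >= zero:
        return 0.0
    return (zero - price) / (zero - full)

books = [
    {"title": "A", "price": 80.0},
    {"title": "B", "price": 120.0},
    {"title": "C", "price": 150.0},
    {"title": "D", "price": 250.0},
]

# Classical matching: a row either satisfies price <= 100 or it does not.
crisp = [b["title"] for b in books if b["price"] <= 100.0]

# Flexible matching: each row gets a degree in [0, 1]; rows are ranked by it
# and rows with degree 0 are dropped from the answer.
graded = sorted(
    ((b["title"], cheap(b["price"])) for b in books),
    key=lambda pair: pair[1],
    reverse=True,
)
graded = [(title, degree) for title, degree in graded if degree > 0.0]

print(crisp)
print(graded)
```

The crisp query returns only the rows under the threshold, whereas the flexible query also returns near misses, ordered by how well they satisfy the criterion.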

Usually, two general formal approaches to querying
are assumed: the relational algebra and the relational
calculus. The former has a procedural character: a query
consists of a sequence of operations on relations that
finally yield the requested data. These operations
comprise five basic ones: union (∪), difference (\),
projection (π), selection (σ), and cross product (×),
which may be combined to obtain derived operations such
as intersection (∩), division (÷), and join (⋈). The
latter approach, known in two flavours as the tuple
relational calculus (TRC) and the domain relational
calculus (DRC), is of a more declarative nature. Here,
a query just describes what kind of data is requested;
how it is to be retrieved from the database is left to
the database management system. The exact form of
queries is not of utmost importance for our
considerations. However, some reported research in this
area directly employs the de facto standard querying
language for relational databases, i.e., SQL (structured
query language) [19.7,8]. Thus, we will also sometimes
refer to the SELECT-FROM-WHERE instruction of this
language and, more specifically, consider its WHERE
clause, where query conditions are specified.
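
To make the procedural character of the relational algebra concrete, the sketch below models relations as Python sets of tuples and builds two derived operations (intersection and an equi-join) from the basic ones. The helper names `select`, `project`, and `cross`, as well as the toy relations `painting` and `artist`, are illustrative assumptions and not part of any database API.

```python
def select(rel, pred):
    """Selection: keep the rows that satisfy the predicate."""
    return {row for row in rel if pred(row)}

def project(rel, idx):
    """Projection onto the column positions in idx (duplicates collapse)."""
    return {tuple(row[i] for i in idx) for row in rel}

def cross(r, s):
    """Cross product: concatenate every row of r with every row of s."""
    return {a + b for a in r for b in s}

# Hypothetical relations: painting(title, artist_id), artist(artist_id, name).
painting = {("P1", 1), ("P2", 2), ("P3", 1)}
artist = {(1, "A"), (2, "B")}

# Union and difference are the built-in set operators | and -.
r, s = {(1,), (2,), (3,)}, {(2,), (3,), (4,)}
intersection = r - (r - s)          # derived: R ∩ S = R \ (R \ S)

# Derived join: selection over the cross product on equal artist ids.
joined = select(cross(painting, artist), lambda row: row[1] == row[2])
titles_by_a = project(select(joined, lambda row: row[3] == "A"), [0])

print(intersection)
print(titles_by_a)
```

The last query corresponds to a SELECT-FROM-WHERE statement whose WHERE clause contains both the join condition and the condition on the artist name.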


19.2.2 Fuzzy Set Theory and Possibility Theory


We will use the following concepts and notation
concerning fuzzy set theory [19.9]. A fuzzy set F in the
universe U is characterized by a membership function

$$\mu_F : U \to [0, 1] : u \mapsto \mu_F(u). \tag{19.2}$$

For each element $u \in U$, $\mu_F(u)$ denotes the
membership grade or extent to which u belongs to F. The
origins of membership functions may differ and, depending
on their origin, they have different semantics [19.10].
With their traditional interpretation as degrees of
similarity, membership grades make it possible to
appropriately represent vague concepts, like tall man,
expensive book, and large garden, taking into account the
gradual characteristics of such a concept. Membership
grades can also express degrees of preference, thereby
expressing that several values apply to a different
extent. For example, the languages one speaks can be
expressed by a fuzzy set
{(English, 1), (French, 0.7), (Spanish, 0.2)},
where the membership degrees represent the skill levels
attained in a given language. A fuzzy set can also be
interpreted as a possibility distribution, in which case
its membership grades denote degrees of uncertainty.
Then, it can be used to represent, e.g., the uncertainty
about the actual value of a variable, like the height
of a man, the price of a book, and the size of a
garden [19.11,12]. This interpretation is related to the
concept of the disjunctive fuzzy set.
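
As a small illustration, a fuzzy set over a finite universe can be stored as a mapping from elements to membership grades, and a fuzzy set over a numeric universe as a function. The `languages` set is the example from the text; the shape and breakpoints of `tall` (170 cm and 190 cm) and the helper names `support` and `alpha_cut` are assumptions introduced only for this sketch.

```python
# Finite universe: element -> membership grade (the example from the text).
languages = {"English": 1.0, "French": 0.7, "Spanish": 0.2}

def tall(height_cm, low=170.0, high=190.0):
    """Membership in a hypothetical fuzzy set 'tall' (breakpoints are assumed)."""
    if height_cm <= low:
        return 0.0
    if height_cm >= high:
        return 1.0
    return (height_cm - low) / (high - low)

def support(fuzzy_set):
    """Elements with a non-zero membership grade."""
    return {u for u, grade in fuzzy_set.items() if grade > 0.0}

def alpha_cut(fuzzy_set, alpha):
    """Elements whose membership grade is at least alpha."""
    return {u for u, grade in fuzzy_set.items() if grade >= alpha}

print(support(languages))
print(alpha_cut(languages, 0.5))
print(tall(180.0))
```

Under the preference reading, `alpha_cut(languages, 0.5)` lists the languages spoken at least reasonably well; under the possibilistic reading, the same mapping would instead rank candidate values of a single unknown.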

Possibility distributions are denoted by $\pi$. The
notation $\pi_X$ is often used to indicate that the
distribution concerns the value of a variable X,

$$\pi_X : U \to [0, 1] : u \mapsto \pi_X(u), \tag{19.3}$$

where X takes its value from a universe U.
Possibility and necessity measures can provide for
the quantification of such an uncertainty. These
measures are denoted by $\Pi$ and $N$, respectively, i.e.,

$$\Pi : \tilde{\wp}(U) \to [0, 1] : A \mapsto \Pi(A) \quad \text{and} \quad
N : \tilde{\wp}(U) \to [0, 1] : A \mapsto N(A), \tag{19.4}$$

where the fuzzy power set $\tilde{\wp}(U)$ stands for the
family of fuzzy sets defined over U. Assuming that all we
know about the value of a variable X is a possibility
distribution $\pi_X$, these measures, for a given fuzzy
set F, assess to what extent, respectively, this set is
consistent ($\Pi$) and its complement is inconsistent
($N$) with our knowledge on the value of X. More
precisely, if $\pi_X$ is the underlying possibility
distribution, then

$$\Pi_X(F) = \sup_{u \in U} \min(\pi_X(u), \mu_F(u)), \tag{19.5}$$

$$N_X(F) = \inf_{u \in U} \max(1 - \pi_X(u), \mu_F(u)). \tag{19.6}$$
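
For a finite universe, the supremum and infimum in (19.5) and (19.6) become a maximum and a minimum, so both measures can be computed directly. The sketch below assumes that the distribution and the fuzzy set are given as dictionaries over the same finite universe; the function names and the height example are illustrative assumptions.

```python
def possibility(pi_x, mu_f):
    """Pi_X(F) = max over u of min(pi_X(u), mu_F(u)), cf. (19.5)."""
    return max(min(pi_x[u], mu_f[u]) for u in pi_x)

def necessity(pi_x, mu_f):
    """N_X(F) = min over u of max(1 - pi_X(u), mu_F(u)), cf. (19.6)."""
    return min(max(1.0 - pi_x[u], mu_f[u]) for u in pi_x)

# Hypothetical knowledge about a person's height (cm) and a fuzzy set 'tall'.
pi_height = {170: 0.3, 180: 1.0, 190: 0.6}
mu_tall = {170: 0.0, 180: 0.5, 190: 1.0}

poss = possibility(pi_height, mu_tall)
nec = necessity(pi_height, mu_tall)
print(poss, nec)
```

Here it is possible to degree 0.6 and necessary (certain) to degree 0.5 that the person is tall, given what is known about the height.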


Sometimes the interval $[N_X(F), \Pi_X(F)]$ is used as an
estimate of the consistency of F with the actual value of
X. The possibility (necessity) that two variables X and
Y, whose values are given by possibility distributions
$\pi_X$ and $\pi_Y$, are in relation $\theta$ (e.g.,
equality) is computed as follows. The joint possibility
distribution, $\pi_{XY}$, of X and Y on $U \times U$
(assuming noninteractivity of the variables) is given by

$$\pi_{XY}(u, w) = \min(\pi_X(u), \pi_Y(w)). \tag{19.7}$$

The relation $\theta$ can be fuzzy and represented by a
fuzzy set $F \in \tilde{\wp}(U \times U)$ such that
$\mu_F(u, w)$ expresses to what extent u is in relation
$\theta$ with w. The possibility (resp. necessity) measure
associated with $\pi_{XY}$ will be denoted by $\Pi_{XY}$
(resp. $N_{XY}$). Then, we calculate the measures of
possibility and necessity that the values of the variables
are in relation $\theta$ as follows:

$$\Pi(X \,\theta\, Y) = \Pi_{XY}(F)
= \sup_{u, w \in U} \min(\pi_X(u), \pi_Y(w), \mu_F(u, w)), \tag{19.8}$$

$$N(X \,\theta\, Y) = N_{XY}(F)
= \inf_{u, w \in U} \max(1 - \pi_X(u), 1 - \pi_Y(w), \mu_F(u, w)). \tag{19.9}$$
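
The comparison of two ill-known values in (19.8) and (19.9) can be sketched in the same finite setting. The code assumes that each distribution is a dictionary listing only the elements with non-zero possibility (unlisted elements would contribute 0 to the maximum and 1 to the minimum, so they can be skipped), and that the relation is given as a function returning a degree in [0, 1]. All names are illustrative.

```python
from itertools import product

def poss_rel(pi_x, pi_y, mu_rel):
    """Pi(X theta Y): max over pairs of min(pi_X(u), pi_Y(w), mu_F(u, w))."""
    return max(
        min(pi_x[u], pi_y[w], mu_rel(u, w)) for u, w in product(pi_x, pi_y)
    )

def nec_rel(pi_x, pi_y, mu_rel):
    """N(X theta Y): min over pairs of max(1 - pi_X(u), 1 - pi_Y(w), mu_F(u, w))."""
    return min(
        max(1.0 - pi_x[u], 1.0 - pi_y[w], mu_rel(u, w))
        for u, w in product(pi_x, pi_y)
    )

def equal(u, w):
    """Crisp equality as a special case of a fuzzy relation."""
    return 1.0 if u == w else 0.0

# Two hypothetical ill-known values.
pi_x = {1: 1.0, 2: 0.5}
pi_y = {2: 1.0, 3: 0.4}

p_eq = poss_rel(pi_x, pi_y, equal)
n_eq = nec_rel(pi_x, pi_y, equal)
print(p_eq, n_eq)
```

In this example the two values may be equal (possibility 0.5, reached at the shared candidate 2), but it is not at all certain that they are (necessity 0), because the fully possible pair (1, 2) violates equality.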


Knowing the possibility distributions of two variables
X and Y, one may also be interested in how similar these
distributions are to each other. Obviously, (19.8) and
(19.9) provide some assessment of this similarity,
but other indices of similarity are also applicable.
This leads to a distinction between representation-based
and value-based comparisons of possibility
distributions [19.13]. We will discuss this later on in
Sect. 19.4.2.

Table 19.1 Special EPTVs

$\tilde{t}^*(p)$                 Interpretation
{(T, 1)}                     p is true
{(F, 1)}                     p is false
{(T, 1), (F, 1)}             p is unknown
{(⊥, 1)}                     p is inapplicable
{(T, 1), (F, 1), (⊥, 1)}     Information about p is not available

An important class of possibility distributions are
extended possibilistic truth values (EPTVs) [19.14]. An
EPTV is defined as a possibility distribution (a
disjunctive fuzzy set) in the universe
$I^* = \{T, F, \bot\}$ that consists of the three truth
values T (true), F (false), and $\bot$ (undefined). The
set of all EPTVs is denoted as $\tilde{\wp}(I^*)$. They
are meant to represent uncertainty as to the truth value
of a proposition $p \in P$, where P denotes the universe
of all propositions, in particular in the context of
database querying. Thus, a valuation $\tilde{t}^*$ is
assumed such that

$$\tilde{t}^* : P \to \tilde{\wp}(I^*) : p \mapsto \tilde{t}^*(p). \tag{19.10}$$


In general, the EPTV $\tilde{t}^*(p)$ representing (the
knowledge of) the truth of a proposition $p \in P$ has
the following format:

$$\tilde{t}^*(p) = \{(T, \mu_{\tilde{t}^*(p)}(T)), (F, \mu_{\tilde{t}^*(p)}(F)),
(\bot, \mu_{\tilde{t}^*(p)}(\bot))\}. \tag{19.11}$$
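
An EPTV in the format (19.11) is simply a possibility distribution over three truth values, so it can be stored as a three-entry mapping. The sketch below encodes the special cases of Table 19.1 in that form; the constructor `eptv`, the lookup `describe`, and the short names of the special cases are hypothetical helpers, not notation from the chapter.

```python
T, F, BOTTOM = "T", "F", "⊥"

def eptv(t=0.0, f=0.0, bottom=0.0):
    """Possibility degrees for 'true', 'false', and 'undefined'."""
    return {T: t, F: f, BOTTOM: bottom}

# The special cases of Table 19.1.
SPECIAL = {
    "true": eptv(t=1.0),
    "false": eptv(f=1.0),
    "unknown": eptv(t=1.0, f=1.0),
    "inapplicable": eptv(bottom=1.0),
    "unavailable": eptv(t=1.0, f=1.0, bottom=1.0),
}

def describe(value):
    """Name of the matching special case, or 'other' for a general EPTV."""
    for name, special in SPECIAL.items():
        if value == special:
            return name
    return "other"

print(describe(eptv(t=1.0, f=1.0)))
print(describe(eptv(t=1.0, f=0.3)))
```

A general EPTV such as `eptv(t=1.0, f=0.3)` expresses that the proposition is fully possibly true and only somewhat possibly false, which none of the five special cases captures.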


In (19.11), $\mu_{\tilde{t}^*(p)}(T)$,
$\mu_{\tilde{t}^*(p)}(F)$, and $\mu_{\tilde{t}^*(p)}(\bot)$
denote, respectively, the possibility that p is true,
false, or undefined. The latter value also covers cases
where p is not applicable or not supplied. EPTVs extend
the approach based on possibility distributions defined
on just the set {T, F} with an explicit facility to deal
with the inapplicability of information, as can occur,
for example, in the evaluation of query conditions.
Table 19.1 presents some special cases of EPTVs.
These cases can be verified as follows:


• If it is completely possible that the proposition is
  true and no other truth values are possible, then it
  means that the proposition is known to be true.

• If it is completely possible that the proposition is
  false and no other truth values are possible, then it
  means that the proposition is known to be false.

• If it is completely possible that the proposition is
  true, it is completely possible that the proposition
  is false, and it is not possible that the proposition
  is inapplicable, then it means that the proposition
  is applicable, but its truth value is unknown. This
  EPTV will be called unknown for short.

• If it is completely possible that the proposition is
  inapplicable and no other truth values are possible,
  then it means that the proposition is inapplicable.

• If all truth values are completely possible, then
  this means that no information about the truth of
  the proposition or its applicability is available. The
  proposition might be inapplicable, but might also be
  true or false. This EPTV will be called unavailable
  for short.


19.3 Soft Computing in Information Modeling


Soft computing techniques make it possible to grasp
imperfect information about a modeled part of the world
and represent it directly in a database. If fuzzy set
theory [19.9] or possibility theory [19.12] is used to
model imperfect data in a database, the database is
called a fuzzy database. Other approaches include those
that are based on rough set theory [19.15] and on
probability theory [19.16]; the resulting databases are
then called rough databases and probabilistic databases,
respectively.

In what follows, we give an overview of the most
important soft computing techniques for modeling
imperfect information which are based on fuzzy set theory
and possibility theory. We distinguish between
basic techniques (Sect. 19.3.1) and more advanced
techniques (Sect. 19.3.2). As explained in the
preliminaries, we use the relational database
model [19.6] as the framework for our descriptions. This
is also motivated by the fact that the initial research
in this area was done on the relational database model
and that this model is still the standard for database
modeling. Soft-computing-related research on other
database models, like the (E)ER model, the
object-relational model, the XML model, and
object-oriented database models, exists. Overviews can,
among others, be found in [19.17-22].


19.3.1 Modeling of Imperfect Information - 
Basic Approaches 


In view of the correct handling of information, it is
of utmost importance that the available information
that has to be stored in a database is modeled as
adequately as possible so as to avoid information loss.
The most straightforward application of fuzzy logic
to the classical relational data model is by assuming
that the relations in a database themselves are also
fuzzy [19.23]. Each tuple of a relation (table) is
associated with a membership degree. This approach is
often neglected because the interpretation of the
membership degree is unclear. On the other hand, it is
worth noticing that fuzzy queries in fact produce fuzzy
relations. So, we will come back to this issue when
discussing fuzzy queries in Sect. 19.4.

Most of the research on modeling imperfect information
in databases using soft computing techniques
is devoted to a proper representation and processing of
an attribute value. Such a value, in general, may not be
known perfectly for many different reasons [19.24]:
for example, due to imprecision, as when a painting
is dated to the beginning of the fourteenth century;
due to unreliability, as when the source of information
is not fully reliable; due to ambiguity, as when the
provided value may have different meanings; due to
inconsistency, as when there are multiple different
values provided by different sources; or due to
incompleteness, as when the value is completely missing
or given as a set of possible alternatives (e.g., the
picture was painted by Rubens or van Dyck). These
various forms of information imperfection are not
totally unconnected and may occur together. From the
viewpoint of data representation, they may be primarily
seen as yielding uncertainty as to the actual value of
an attribute and, as such, may be properly accounted for
by a possibility distribution.

It is worth noticing that the assignment of the value
to an attribute may then be identified with Zadeh's
[19.25] linguistic expression X is A, where
X is a linguistic variable corresponding to the attribute
while A is a (disjunctive) fuzzy set representing
imperfect information on its value. Then, various
combined forms of information imperfection may be
represented by appropriate qualified linguistic
expressions such as, e.g., X is A with certainty at
least α [19.26]. Such qualified linguistic expressions
may in turn be transformed into an expression X is B,
where the fuzzy set B is a function of A and other
possible parameters of the qualified expression, like α
in the previous example. Thus, the basic linguistic
expression X is A indeed plays a fundamental role in the
representation of imperfect information.
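
The text does not spell out how B is obtained from A and α. One transformation commonly used in possibility theory for certainty qualification, assumed here purely for illustration, raises every membership grade to at least 1 − α, i.e., $\mu_B(u) = \max(\mu_A(u), 1 - \alpha)$: the less certain the statement, the more possible the values outside A become. The function name and the dating example are hypothetical.

```python
def certainty_qualified(mu_a, alpha):
    """Turn 'X is A with certainty at least alpha' into 'X is B'.

    Assumed rule (not taken from the chapter): mu_B(u) = max(mu_A(u), 1 - alpha).
    """
    return {u: max(grade, 1.0 - alpha) for u, grade in mu_a.items()}

# Hypothetical fuzzy set for 'beginning of the fourteenth century' (by year).
early_14th = {1300: 1.0, 1310: 1.0, 1320: 0.5, 1350: 0.0}

# A source that is certain only to degree 0.8 about this dating.
b = certainty_qualified(early_14th, alpha=0.8)
print(b)
```

With α = 1 the set is unchanged, and with α = 0 every year becomes fully possible, which corresponds to total ignorance.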

The work of Prade and Testemale [19.27] is the
most representative of the approaches to imperfect
information modeling in a database based on possibility
theory. Other works in this vein include [19.27-32].
On the other hand, Buckles and Petry [19.33] as well as 
Anvari and Rose [19.34] assume the representation of 
attributes’ values using sets of alternatives which may 
be treated as a simple binary possibility distribution. 
However, their motivation is different as they assume 
that domain elements are similar/indistinguishable to 
some extent and due to that it may be difficult to deter- 
mine a precise value of an attribute. We will first briefly 
describe the approach of Prade and Testemale and, then, 
the model of Buckles and Petry. 


Possibilistic Approach 
In the possibilistic approach, disjunctive fuzzy sets are
used to represent the imprecisely known value of an
attribute A. Hence, such a fuzzy set is interpreted as
a possibility distribution $\pi_A$ and is defined on the
domain $\mathrm{dom}_A$ of the attribute. The (degree of)
possibility that the actual value of A is a particular
element x of the domain of this attribute,
$x \in \mathrm{dom}_A$, equals $\pi_A(x)$. Every domain
value $x \in \mathrm{dom}_A$ with $\pi_A(x) \neq 0$ is
thus a candidate for being the actual value of A.
Together, all candidate values and their associated
possibility degrees reflect what is actually known about
the attribute value. Thus, if the value of an attribute
is not known precisely, then a set of values may be
specified (represented by the support of the fuzzy set
used) and, moreover, particular elements of this set may
be indicated as more or less plausible values of the
attribute in question.

A typical scenario in which such an imprecise value
has to be stored in a database is when the value of an
attribute is expressed using a linguistic term. For
example, assume that the value of a painting is not known
precisely, but the painting is known to be very valuable.
Then, this information might be represented by the
possibility distribution

$$\pi_{\mathrm{value}}(x) = \mu_{\mathrm{very\_valuable}}(x) =
\begin{cases}
0 & \text{if } x < 10\mathrm{M} \\
\dfrac{x - 10\mathrm{M}}{10\mathrm{M}} & \text{if } 10\mathrm{M} \le x \le 20\mathrm{M} \\
1 & \text{if } x > 20\mathrm{M} .
\end{cases}$$
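
The distribution for very valuable translates directly into a piecewise function. In this sketch the value is expressed in millions, and the function name `pi_value` is an illustrative choice.

```python
def pi_value(x_millions):
    """Possibility that the painting is worth x (in millions), given 'very valuable'."""
    if x_millions < 10:
        return 0.0
    if x_millions <= 20:
        return (x_millions - 10) / 10
    return 1.0

for candidate in (5, 12, 15, 20, 30):
    print(candidate, pi_value(candidate))
```

Values below 10M are ruled out, values above 20M are fully possible, and the possibility increases linearly in between.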
In information management, the term uncertainty is often
used to refer to situations where one has to cope
with several (distinct) candidate attribute values coming
from different information sources. For example,
one information source can specify that the phone number
of a person is X, while another source can specify
that it is Y. This kind of uncertainty can also be handled
with the possibilistic approach as described above. In
that case, a possibility distribution $\pi_A$ over the
domain of the attribute is used to model the different
options for the attribute's actual value. However, in
general, possibility distributions are less informative
than probability distributions. They only inform the user
about the relative likeliness of different options.
Probability distributions provide more information and
have led to the so-called probabilistic
databases [19.35-39]; cf. also [19.40].

Imprecision on the one hand and uncertainty on the
other hand are orthogonal concepts: they can occur at
the same time, as already mentioned earlier. For example,
it might be uncertain whether the value of a painting
is 3M, around 2M, or much cheaper, where the latter
two options are imprecise descriptions. Using the regular
possibilistic modeling approach in such a case would
yield a single possibility distribution over the domain
of values and would result in a loss of information on
how the original three options were specified. Level-2
fuzzy sets, which are fuzzy sets defined over a domain
of fuzzy sets [19.41, 42], can help to avoid this
information loss [19.43].

In traditional databases, missing information is
mostly handled by means of a pseudovalue, called a null
value [19.44,45]. In fact, information may be missing
for many different reasons: the data may exist but
be unknown (e.g., the salary of an employee may be
unknown), or the data may not exist or not apply (e.g.,
an unemployed person earns no salary) [19.46]. For the
handling of nonapplicability, a special pseudovalue is
still required, but the case of unknown information can
be adequately handled by using possibility theory.
Indeed, as studied in [19.47], in the so-called extended
possibilistic approach, the domain $\mathrm{dom}_A$ of an
attribute A can be extended with an extra value
$\bot_A$ that is interpreted as the regular value
"not applicable". Missing information can then be
adequately modeled by considering the following three
special possibility distributions $\pi_{\mathrm{UNK}}$,
$\pi_{\mathrm{N/A}}$, and $\pi_{\mathrm{UNA}}$:


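• Unknown value

$$\pi_{\mathrm{UNK}}(x) =
\begin{cases}
1, & \text{if } x \in \mathrm{dom}_A \setminus \{\bot_A\} \\
0, & \text{if } x = \bot_A
\end{cases}$$

• Value not applicable

$$\pi_{\mathrm{N/A}}(x) =
\begin{cases}
0, & \text{if } x \in \mathrm{dom}_A \setminus \{\bot_A\} \\
1, & \text{if } x = \bot_A
\end{cases}$$

• No information available

$$\pi_{\mathrm{UNA}}(x) = 1, \quad \forall x \in \mathrm{dom}_A .$$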
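
The three special distributions differ only in how they treat the extra domain element. The sketch below builds them over a small, finite, hypothetical salary domain that already contains the extra element; the marker `NA` and the function names are assumptions made for illustration.

```python
NA = "⊥"  # the extra domain element meaning 'not applicable'

def unk(domain):
    """Unknown but applicable: every regular value is fully possible."""
    return {x: (0.0 if x == NA else 1.0) for x in domain}

def not_applicable(domain):
    """Only the extra element is possible."""
    return {x: (1.0 if x == NA else 0.0) for x in domain}

def una(domain):
    """No information: regular values and the extra element are all possible."""
    return {x: 1.0 for x in domain}

# Hypothetical extended domain of a salary attribute.
salary_domain = [1000, 2000, 3000, NA]

print(unk(salary_domain))
print(not_applicable(salary_domain))
print(una(salary_domain))
```

An employee whose salary is simply not recorded would get `unk`, an unemployed person would get `not_applicable`, and a person whose employment status is itself unknown would get `una`.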


Similarity-Based Approach 
The basic idea behind this approach [19.33] is that,
while specifying the value of a database attribute, one
may consider similar values as also being applicable.
Thus, in general, the value of an attribute A is assumed
to be a subset of its domain $\mathrm{dom}_A$. Moreover,
the domain $\mathrm{dom}_A$ is associated with a
similarity relation $S_A$ quantifying this similarity for
each pair of elements $x, y \in \mathrm{dom}_A$. The
values $S_A(x, y)$ taken by $S_A$ are in the unit interval
[0, 1], where 0 corresponds to totally different and 1 to
totally similar. Hence, $S_A$ is a fuzzy binary relation
that associates a membership grade with each pair of
domain values. This relation is assumed to be reflexive,
symmetric, and to satisfy some form of transitivity.
These requirements have been found too restrictive,
and some approaches based on a weaker structure have
been proposed in [19.48], where a proximity relation
is used and all attractive properties of the original
approach are preserved. Among these properties, the most
important is the proper adaptation of the redundancy
concept and of the relational algebra operations.
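
The properties required of a similarity relation can be checked mechanically on a finite domain. The text only asks for some form of transitivity; the sketch below assumes max-min transitivity, $S_A(x, z) \ge \max_y \min(S_A(x, y), S_A(y, z))$, which is one common choice. The helper names and the two example relations are hypothetical.

```python
from itertools import product

def make_relation(domain, pairs):
    """Build a reflexive, symmetric relation from degrees given for unordered pairs."""
    rel = {(x, x): 1.0 for x in domain}
    for (x, y), degree in pairs.items():
        rel[(x, y)] = rel[(y, x)] = degree
    return rel

def is_reflexive(rel, domain):
    return all(rel[(x, x)] == 1.0 for x in domain)

def is_symmetric(rel, domain):
    return all(rel[(x, y)] == rel[(y, x)] for x, y in product(domain, repeat=2))

def is_max_min_transitive(rel, domain):
    """Assumed form of transitivity: S(x, z) >= max_y min(S(x, y), S(y, z))."""
    return all(
        rel[(x, z)] >= max(min(rel[(x, y)], rel[(y, z)]) for y in domain)
        for x, z in product(domain, repeat=2)
    )

colours = ["red", "crimson", "blue"]
similarity = make_relation(
    colours,
    {("red", "crimson"): 0.8, ("red", "blue"): 0.1, ("crimson", "blue"): 0.1},
)
# A proximity relation: reflexive and symmetric, but not max-min transitive.
proximity = make_relation(
    colours,
    {("red", "crimson"): 0.8, ("crimson", "blue"): 0.8, ("red", "blue"): 0.2},
)

print(is_max_min_transitive(similarity, colours))
print(is_max_min_transitive(proximity, colours))
```

The second relation illustrates why the transitivity requirement is restrictive: two chains of fairly similar values force the end points to be fairly similar as well, which a domain expert may not want to assert.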

It was quickly recognized that rough set
theory [19.15] offers effective tools to deal with and
analyze the indistinguishability/equivalence relation, and
the similarity-based approaches evolved into the
rough-sets-based database model [19.49, 50].

There are also a number of hybrid models proposed
in the literature. Takahashi [19.51] has proposed
a model for a fuzzy relational database assuming
possibility distributions as attribute values. Moreover,
in his model fuzzy sets are used as tuples' truth values.
For example, a tuple t, accompanied by such a truth
qualification, may express that it is quite true that the
painting's origin is the beginning of the fifteenth
century.

Medina et al. [19.52] proposed a fuzzy database
model called GEFRED (generalized fuzzy relational
database) in an attempt to integrate both approaches,
the possibilistic and the similarity-based one. The data
are stored as generalized fuzzy relations that extend the
relations of the relational model by allowing imprecise
information and a compatibility degree associated with
each attribute value.
19.3.2 Modeling of Imperfect Information -
Selected Advanced Approaches 


In this section, we will focus on some extensions to the
possibility-based approach described in Sect. 19.3.1. As
argued earlier, a disjunctive fuzzy set may very well
represent the situation when an attribute A of a tuple
for sure takes exactly one value from the domain
$\mathrm{dom}_A$ (as also required by the classical
relational model), but we do not know exactly which one.
Complete ignorance is then modeled by the set
$\mathrm{dom}_A$. However, very often we can distinguish
the elements of $\mathrm{dom}_A$ with respect to their
plausibility as the actual value of the attribute. This
information, based on some evidence, is represented by
the membership function of a disjunctive fuzzy set which
is further identified with a possibility distribution.
However, the characteristics of the available evidence
may be difficult to express using regular fuzzy sets. In
the literature dealing with data representation, there
are first attempts to cover such cases using some
extensions to the concept of the fuzzy set. We will now
briefly review them.


Imprecise Membership Degrees 
Prade and Testemale in their original approach [19.31] 
assume that the membership degrees of a mentioned 
disjunctive fuzzy set are known precisely. On the other 
hand, one can argue that they may be also known only 
in an imprecise way. It may be the case, in particular, 
when the value of an attribute is originally specified us- 
ing a linguistic term. For example, if a painting is dated 
to the beginning of the fifteenth century then assigning, 
e.g., the degree of 0.6 to the year 1440 may be chal- 
lenging for an expert who is to define the representation 
of this linguistic term. He or she may be much more 
comfortable stating that it is something, e.g., between 
0.5 and 0.7. Some precision is lost and a second level 
uncertainty is then implied but it may better reflect the 
evidence actually available. 

Type-2 fuzzy sets [19.25] make it possible to model
the data in the case described earlier. In particular,
their simplest form, the interval-valued fuzzy sets
(IVFSs), may be of interest here. In the case of
interval-valued fuzzy sets [19.25], a membership degree
is represented as an interval, as in the example given
earlier. Thus, an interval-valued fuzzy set X over a
universe of discourse U is defined by two functions

$$\mu_X^l, \mu_X^u : U \to [0, 1],$$

such that

$$0 \le \mu_X^l(x) \le \mu_X^u(x) \le 1, \quad \forall x \in U, \tag{19.12}$$

and may be denoted by

$$X = \{ \langle x, \mu_X^l(x), \mu_X^u(x) \rangle \mid (x \in U) \wedge
(0 \le \mu_X^l(x) \le \mu_X^u(x) \le 1) \}. \tag{19.13}$$

Constraint (19.12) reflects that $\mu_X^l(x)$ and
$\mu_X^u(x)$ are, respectively, interpreted as a lower and
an upper bound on the actual degree of membership of x
in X.
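
An interval-valued fuzzy set over a finite universe can be stored as a mapping from elements to (lower, upper) pairs, with constraint (19.12) checked on construction. The constructor name `make_ivfs` and all membership intervals except the one for the year 1440 (taken from the example in the text) are assumptions made for this sketch.

```python
def make_ivfs(bounds):
    """Validate constraint (19.12) for a mapping x -> (lower, upper)."""
    for x, (lower, upper) in bounds.items():
        if not 0.0 <= lower <= upper <= 1.0:
            raise ValueError(f"invalid membership interval for {x!r}: {(lower, upper)}")
    return dict(bounds)

# 'Beginning of the fifteenth century' with imprecise membership degrees.
# Only the interval for 1440 comes from the text; the others are made up.
early_15th = make_ivfs({
    1400: (1.0, 1.0),
    1420: (0.8, 0.9),
    1440: (0.5, 0.7),
    1460: (0.0, 0.1),
})

lower, upper = early_15th[1440]
print(lower, upper)
```

An ordinary fuzzy set is recovered as the special case in which the lower and upper bounds coincide for every element.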

Basically, the representation of information using 
(disjunctive) IVFS is conceptually identical with the 
original approach of Prade and Testemale while it pro- 
vides some more flexibility in defining the meaning of 
linguistic terms. Some preliminary discussion on their 
use may be found in [19.53]. 
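
As an illustration only, an interval-valued fuzzy set over a finite universe can be sketched as a mapping from elements to (lower, upper) pairs that enforces constraint (19.12). The class name `IntervalDegree`, the helper `make_ivfs`, and all membership values below are hypothetical and chosen to mirror the painting-dating example; they do not come from the cited works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntervalDegree:
    """Membership degree given as an interval [lower, upper]."""
    lower: float
    upper: float

    def __post_init__(self):
        # Constraint (19.12): 0 <= lower <= upper <= 1.
        if not 0.0 <= self.lower <= self.upper <= 1.0:
            raise ValueError(
                f"invalid membership interval: [{self.lower}, {self.upper}]"
            )


def make_ivfs(bounds):
    """Build an interval-valued fuzzy set from {element: (lower, upper)}."""
    return {x: IntervalDegree(lo, hi) for x, (lo, hi) in bounds.items()}


# Hypothetical meaning of "beginning of the fifteenth century".
beginning_of_15th = make_ivfs({
    1400: (1.0, 1.0),
    1420: (0.8, 1.0),
    1440: (0.5, 0.7),  # the expert only commits to "between 0.5 and 0.7"
    1460: (0.0, 0.1),
})
```

A regular fuzzy set is recovered as the special case in which the lower and upper bounds coincide for every element.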


Bipolarity of Information 
Bipolarity is related to the existence of the positive 
and negative information [19.54-58]. It manifests itself,
in particular, when people are making judgments
about some alternatives and take into account their 
positive and negative sides. From this point of view, 
bipolarity of information may play an important role 
in database querying and is discussed from this per- 
spective in Sect. 19.4.1. Here we will briefly discuss 
the role of bipolarity in data representation. We will 
mostly follow in this respect the work of Dubois and 
Prade [19.58]. 

The value of an attribute may not be known precisely,
but some information on it may be available in
the form of both positive and negative statements. In 
some situations positive information is provided, stat- 
ing what values are possible, satisfactory, permitted, 
desired, or considered as being acceptable. In other sit- 
uations, negative statements express what values are 
impossible, rejected or forbidden. 

Different types of bipolarity can be distin- 
guished [19.58, 59]: 


• Type I, symmetric univariate bipolarity: positive
and negative information are considered as being
exact complements of each other as in, e.g.,
probability theory; for instance, if the probability
that a given painting was painted in the eighteenth
century is stated to be 0.7 (positive information),
then the probability that this painting was not
painted in the eighteenth century equals 0.3 (negative
information); this simple form of bipolarity is
well supported by traditional information systems;
this bipolarity is quantified on a bipolar univariate
scale such as the intervals [0, 1] or [-1, 1];

• Type II, symmetric bivariate bipolarity: another,
more flexible approach is to consider positive and
negative information as being dual concepts, measured
along two different scales but based on the
same piece of evidence. The dependency between
them is modeled by means of some duality relation.
This kind of bipolarity is, among others, used in
Atanassov's intuitionistic fuzzy sets [19.60] where
each element of a set is assigned both a membership
and a nonmembership degree which do not have
to sum up to 1 but whose sum cannot exceed 1. For
example, it could be stated that Rubens is a good
candidate to be an author of a given painting to a degree
0.6 (due to some positive information) while at
the same time he is not a good candidate to a degree
0.2 (due to some negative information); this
bipolarity is quantified on a unipolar bivariate scale
composed of two unipolar scales such as, e.g., two
intervals [0, 1];

• Type III, asymmetric/heterogeneous bipolarity: in
the most general case, positive and negative information
is provided by two separate bodies of
evidence, which are to some extent independent of
each other and are of a different nature. A constraint
to guarantee that the information does not
contain contradictions can exist, but besides that
both statements are independent of each other,
hence giving rise to the notion of heterogeneous
bipolarity; this bipolarity is quantified as in the case
of Type II bipolarity.


Type III bipolarity is of special interest from the 
point of view of data representation. Dubois and 
Prade [19.58] argue that in this context the heterogeneity
of bipolarity is related to the different nature of
two bodies of evidence available. Namely, the negative
information corresponds to the knowledge which puts 
some general constraints on the feasible values of an 
attribute. On the other hand, the positive information 
corresponds to data, i.e., observed cases which justify 
plausibility of a given element as the candidate for the 
value of an attribute. 


This type of bipolarity is proposed to be represented
for a tuple t by two separate possibility distributions,
$\delta_{A(t)}$ and $\pi_{A(t)}$, defined on the domain of an attribute,
$\mathrm{dom}_A$ [19.58] (and earlier works cited therein). A possibility
distribution $\pi_{A(t)}$, as previously, represents the
compatibility of particular elements $x \in \mathrm{dom}_A$ with the
available information on the value of the attribute A for
a tuple t. This compatibility is quantified on a unipolar
negative scale identified with the interval [0, 1]. The extreme
values of $\pi_{A(t)}$, i.e., $\pi_{A(t)}(x) = 1$ and $\pi_{A(t)}(x) = 0$,
are meant to represent, respectively, that x is potentially
fully possible to be the value of A at t (1 is a neutral
element on this unipolar scale) and that x is totally
impossible to be the value of A at t (0 is an extreme negative
element on this unipolar scale). On the other hand,
a possibility distribution $\delta_{A(t)}$ expresses the degree of
support for an element $x \in \mathrm{dom}_A$ to be the value of A at
t provided by some evidence. In this case, $\delta_{A(t)}(x) = 0$
denotes the lack of such support but is meant as just
a neutral assessment, while $\delta_{A(t)}(x) = 1$ denotes full support
(1 is an extreme positive element on this scale).

For example, when the exact dating of a painting is
unknown, one can be convinced it has been painted in
some time range (e.g., related to the time period its author
lived in) and also there may be some evidence supporting
a particular period of time (e.g., due to the fact that
other very similar paintings of a given author are known
from this period). Thus, the former is negative information,
excluding some period of time, while the latter is
positive information supporting a given period.

Thus, $\delta_{A(t)}$ and $\pi_{A(t)}$ are said to represent, respectively,
the set of guaranteed/actually possible and the
set of potentially possible values of an attribute A for
the tuple t. These possibility distributions are related by
a consistency constraint: $\pi_{A(t)}(x) \ge \delta_{A(t)}(x)$, as x has to
be first nonexcluded before it may be somehow supported
by the evidence.

For a given attribute A and tuple t, $\pi_{A(t)}$ is based on
the set of nonimpossible values $N_{A(t)}$ while $\delta_{A(t)}$ relates
to the set of actually possible values $G_{A(t)}$. The querying
of such a bipolar database may be defined in terms of
these sets [19.58].


19.4 Soft Computing in Querying


The research on soft computing in querying already has
a long history. It has been inspired by the success
of fuzzy logic in modeling natural language propositions.
The use of such propositions in queries, in turn,
seems to be very natural for human users of any
information system, notably the database management
system. Later on, the interest in fuzzy querying has
been reinforced by the omnipresence of network-based
applications, related to buzzwords of modern information
technology, such as e-commerce, e-government,
etc. These applications evidently call for a flexible 
querying capability when users are looking for some 
goods, hotel accommodations, etc., that may be best 
described using natural language terms such as cheap, 
large, close to the airport, etc. Another amplification 
of the interest in fuzzy querying comes from develop- 
ments in the area of data warehousing and data mining 
related applications. For example, a combination of 
fuzzy querying and data mining interfaces [19.61, 62] 
or fuzzy logic and the OLAP (online analytical pro- 
cessing) technology [19.63] may lead to new, effective 
and more efficient solutions in this area. More recently, 
big data challenges can be seen as driving forces for 
research in soft querying. Indeed, efficiently querying 
huge quantities of heterogeneous structured and un- 
structured data is one of the prerequisites for efficiently 
handling big data. 


19.4.1 Flexible Querying 
of Regular Databases 


As a starting point, we consider a simplified form of 
database queries on a classical crisp relational database. 
Hereby a query is assumed to consist of a combina- 
tion of conditions that are to be met by the data sought. 
Flexibility is introduced by specifying fuzzy preferences.
This can be done inside the query conditions
via flexible search criteria, which make it possible to express
that some values are more desirable than others in a gradual
way. Query conditions are allowed to contain natural
language terms. Another option is to specify fuzzy
preferences at the level of the aggregation. By assign- 
ing grades of importance to (groups of) conditions it 
can be indicated that the satisfaction of some query 
conditions is more desirable than the satisfaction of 
others. 


Basic Approaches 

One of the pioneering approaches in recognizing the 
power of fuzzy set theory for information retrieval pur- 
poses in general is [19.64]. Research on the application
of soft computing to database querying
proper dates back to an early work of Tahani [19.65],
proposing the modeling of linguistic terms in queries 
using elements of fuzzy logic. An important enhance- 
ment of this basic approach consisted in considering 
flexible aggregation operators [19.10, 66-68]. Another 
line of research focused on embedding fuzzy constructs 
in the syntax of the standard SQL [19.21, 69-74]. 


Fuzzy Preferences Inside Query Conditions. 
Tahani [19.65] proposed to use imprecise terms typical 
for natural language such as, e.g., high, young etc., 
to form conditions of an SQL-like querying language 
for relational databases. These imprecise linguistic 
terms are modeled using fuzzy sets defined in attribute
domains. The binary satisfaction of a classical query is
replaced with the matching degree defined in a straightforward
way. Namely, a tuple t matches a simple
(elementary) condition A = l, where A is an attribute
(e.g., price) and l is a linguistic term (e.g., high), to
a degree $\gamma(A = l, t)$ such that

$$\gamma(A = l, t) = \mu_l(A(t)), \qquad (19.14)$$

where A(t) is the value of the attribute A at the tuple t
and $\mu_l(\cdot)$ is the membership function of the fuzzy set
representing the linguistic term l. The matching degree
for compound conditions, e.g., price = high AND (date
= beginning-of-17-century OR origin = south-europe),
is obtained by a proper interpretation of the fuzzy logical
connectives. For example,

$$\gamma((A_1 = l_1)\ \text{AND}\ (A_2 = l_2), t) = \min(\mu_{l_1}(A_1(t)), \mu_{l_2}(A_2(t))). \qquad (19.15)$$
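
A minimal sketch of how (19.14) and (19.15) might be evaluated is given below. The membership functions `high` and `beginning_of_17th`, their breakpoints, and the sample tuple are illustrative assumptions rather than definitions from [19.65]; disjunction is interpreted with the maximum, as is usual alongside the minimum for conjunction.

```python
def ramp_up(x, a, b):
    """Piecewise-linear function: 0 below a, 1 above b, linear in between."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)


def high(price):
    # Hypothetical meaning of the linguistic term "high" for prices.
    return ramp_up(price, 1000.0, 5000.0)


def beginning_of_17th(year):
    # Hypothetical term: full match for 1600-1620, fading out until 1650.
    if year < 1600:
        return 0.0
    return 1.0 - ramp_up(year, 1620.0, 1650.0)


def match(t, attribute, term):
    """Matching degree of the elementary condition 'attribute = term', cf. (19.14)."""
    return term(t[attribute])


def match_and(*degrees):
    """Conjunction interpreted with the minimum, cf. (19.15)."""
    return min(degrees)


def match_or(*degrees):
    """Disjunction interpreted with the maximum."""
    return max(degrees)


painting = {"price": 3000.0, "date": 1625}
degree = match_and(
    match(painting, "price", high),
    match(painting, "date", beginning_of_17th),
)
```

Here the price matches `high` to degree 0.5 and the date matches to roughly 0.83, so the conjunction is satisfied to degree 0.5.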


The relational algebra has been very early adapted 
for the purposes of fuzzy flexible querying of regular
relational databases. The division operator attracted
special attention and many fuzzy variants of it have
been proposed, among others by Yager [19.75], Dubois
and Prade [19.76], Galindo et al. [19.77], and Bosc
et al. [19.78, 79]. Takahashi [19.80] was among the first
authors to propose a fuzzy version of the relational 
calculus. His fuzzy query language (FQL) was meant 
as a fuzzy extension of the domain relational calculus 
(DRC). 


Fuzzy Preferences Between Query Conditions. Of- 
ten, a query is composed of several conditions of vary- 
ing importance for the user. For example, a customer 
of a real-estate agency may be looking for a cheap 
apartment in a specific district of a city and located 
not higher than a given floor. However, for him or her
the low price may be much more important than the
two other features. It may be difficult to express such 
preferences in a traditional query language. On the 
other hand, it is very natural for flexible fuzzy query- 
ing approaches due to the assumed gradual character 
of the matching degree as well as due to the existence 
of sophisticated preference modeling techniques developed
by the fuzzy logic community. Thus, most approaches
make it possible to assign to a condition an importance 
weight, usually represented by a number from the [0, 1] 
interval. 

The impact of a weight can be modeled by first 
matching the condition as if there is no weight and only 
then modifying the resulting matching degree in ac- 
cordance with the weight. A modification function that 
strengthens the match of more important conditions and 
weakens the match of less important conditions is used 
for this purpose. 

The evaluation of a whole query against a tuple may 
be seen as an aggregation of the matching degrees of el- 
ementary conditions comprising the query against this 
tuple. Thus, an aggregation operator is involved which, 
in the case of a simple conjunction or disjunction of 
the elementary conditions is usually assumed to take 
the form of the minimum and maximum operator, re- 
spectively. If weights are assigned to the elementary 
conditions connected using conjunction or disjunction 
then, first, the matching degrees of these conditions are 
modified using the weights [19.81] and then they are 
aggregated as usual. 
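
The text does not fix a particular modification function, so the following sketch assumes one common choice from the fuzzy querying literature: for a weighted conjunction the matching degree d of a condition with importance w is replaced by max(1 - w, d), and for a weighted disjunction by min(w, d), before the usual minimum or maximum is taken. Function names and the sample degrees are hypothetical.

```python
def weighted_and(degrees, weights):
    """Weighted conjunction: an unimportant condition (w near 0) is lifted
    toward 1, so it can hardly lower the overall minimum."""
    return min(max(1.0 - w, d) for d, w in zip(degrees, weights))


def weighted_or(degrees, weights):
    """Weighted disjunction: an unimportant condition (w near 0) is pushed
    toward 0, so it can hardly raise the overall maximum."""
    return max(min(w, d) for d, w in zip(degrees, weights))


# Matching degrees of two conditions, e.g., "cheap" and "low floor".
degrees = [0.3, 0.9]

price_matters = weighted_and(degrees, [1.0, 0.2])   # price fully important
floor_matters = weighted_and(degrees, [0.2, 1.0])   # floor fully important
```

With these numbers the poorly satisfied first condition dominates the result (about 0.3) when it is fully important, and is largely ignored (about 0.8) when its weight drops to 0.2. With all weights equal to 1 both functions reduce to the plain minimum and maximum.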

On the other hand, some special aggregation oper- 
ators may be explicitly used in the query and then they 
guide the aggregation process. Kacprzyk et al. [19.66, 
67] were the first to propose the use in queries of an
aggregation operator in the form of a linguistic quantifier
[19.82]. Thus, the user may require, e.g., most
of the elementary conditions to be fulfilled instead
of all of them (which is required when the conjunction
of the conditions is used) or instead of just
one of them (which is required in the case of the
disjunction). For example, the user may define paintings
of interest as those meeting most of the
following conditions: not expensive, painted in Italy,
painted not later than in the seventeenth century,
accompanied by an attractive insurance offer, etc. The
overall matching degree of a query involving a lin- 
guistic quantifier may be computed using any of the 
approaches used to model these quantifiers. In [19.66, 
67], Zadeh’s original approach is used [19.82] while 
in [19.83] Yager’s approach based on the OWA oper- 
ators is adopted [19.84]. Further studies on modeling 
sophisticated aggregation operators, notably linguistic 
quantifiers, in the flexible fuzzy queries include the pa- 
pers by Bosc et al. [19.85, 86], Galindo et al. [19.21] 
and Vila et al. [19.87]. 
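
The two ways of evaluating a query with a linguistic quantifier mentioned above can be sketched as follows. The piecewise-linear definition of `most` is an assumption made for illustration, the Zadeh-style evaluation applies the quantifier to the average matching degree, and the OWA-style evaluation derives the weights from the quantifier as w_i = Q(i/n) - Q((i-1)/n). All names and numbers are hypothetical.

```python
def most(x):
    """Hypothetical membership function of the quantifier 'most'."""
    if x <= 0.3:
        return 0.0
    if x >= 0.8:
        return 1.0
    return 2.0 * x - 0.6


def zadeh_quantified(degrees, quantifier):
    """Truth of 'Q of the conditions are satisfied' via the mean degree."""
    return quantifier(sum(degrees) / len(degrees))


def owa_quantified(degrees, quantifier):
    """OWA aggregation with weights generated by the quantifier."""
    n = len(degrees)
    ordered = sorted(degrees, reverse=True)
    weights = [quantifier(i / n) - quantifier((i - 1) / n)
               for i in range(1, n + 1)]
    return sum(w * d for w, d in zip(weights, ordered))


# Matching degrees of four elementary conditions for one painting.
degrees = [1.0, 0.8, 0.6, 0.0]

by_zadeh = zadeh_quantified(degrees, most)
by_owa = owa_quantified(degrees, most)
```

For these degrees the mean is 0.6, so the Zadeh-style result is about 0.6, while the OWA weights (0, 0.4, 0.5, 0.1) give about 0.62. Both stay well above the plain conjunction, which would yield 0 because one condition is not met at all.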

A recent book by Bosc and Pivert [19.88] contains 
a comprehensive survey of the sophisticated flexible 
database querying techniques. 


Bipolar Queries 
An important novel line of research concerning ad- 
vanced querying of databases addresses the issue of 
the bipolar nature of users' preferences. Some psychological
studies (e.g., sources cited by [19.56]) show
that while expressing his or her preferences a human
being is separately considering positive and negative
aspects of a given option. Thus, to account for this
phenomenon, a query should be seen as a combination
of two types of conditions: the satisfaction of one
of them makes a piece of data desired while the
satisfaction of the second makes it rejected. Such
a query will be referred to as the bipolar query and will
be denoted as a pair of conditions (C, P), where C, for
convenience, denotes the complement of the negative
condition and P denotes the positive condition. The
relations between these two types of conditions may be
analyzed from different viewpoints, and the conditions
themselves may be expressed in various ways. In Sect. 19.3.2,
we have already introduced the concept of bipolarity in
the context of data representation. Now, we will briefly
survey different approaches to modeling the bipolarity
with a special emphasis on the context of database
querying.


Models of Bipolarity. Various scales may be used to 
express bipolarity of preferences. Basically, two models
are usually considered: one based on a bipolar univariate
scale and one based on a unipolar bivariate scale [19.89]. The former
assumes one scale with three main levels of negative, 
neutral, and positive preference degrees, respectively. 
These degrees are gradually changing from one end of 
the scale to another accounting for some intermediate 
levels. In the second model, two scales are used which 
separately account for the positive and negative preference
degrees. Often, the intervals [-1, 1] and [0, 1] are
used to represent the scales in the respective models of 
bipolarity. 

From the point of view of database querying, the 
first model may be seen as assuming that the user as- 
sesses both positive and negative aspects of a given 
piece of data (an attribute value or a tuple) and is in 
a position to come up with an overall scalar evaluation. 
This is convenient with respect to the ordering of the 
tuples in the answer to a query. 

The second model is more general and makes it 
possible for the user to separately express his or her 
evaluation of positive and negative aspects of a given 
piece of data. This may be convenient if the user can- 
not, or is not willing to, combine his or her evaluations 
of positive and negative features of data. Obviously it 
requires some special means to order the query answer
dataset with respect to a pair of evaluations. 


Levels at Which the Bipolarity May be Expressed. 
Bipolar evaluations may concern the domain of an at- 
tribute or the whole set of tuples. This is a distinction of 
a practical importance, in particular if the elicitation of 
user preferences is considered. 

In the former case, the user is supposed to be will- 
ing and in a position to partition the domains of selected 
attributes into (fuzzy) subsets of elements with posi- 
tive, negative, and neutral evaluations. For example, the 
domain of the price attribute, characterizing paintings 
offered during an auction at a gallery, may be in the 
context of a given query subjectively partitioned us- 
ing fuzzy sets representing the terms cheap (positive 
evaluation), expensive (negative evaluation), and some 
elements with a neutral evaluation. 

In the case of bipolar evaluations at the tuples level 
a similar partitioning is assumed but concerning the 
whole set of tuples (here, representing the paintings). 
Usually, this partition will be defined again by (fuzzy) 
sets, this time defined with reference to possibly many 
attributes, i.e., defined on the cross product of the 
domains of several attributes. For example, the user 
may identify as negative these paintings which satisfy 
a compound condition expensive and modern. Thus, the 
evaluations in this case have a comprehensive charac- 
ter and concern the whole tuples, taking implicitly into 
account a possibly complex weighting scheme of par- 
ticular attributes and their interrelations. 

Referring to the models of bipolarity, it seems 
slightly more natural for the bipolar evaluations ex- 
pressed on the level of the domain of an attribute to 
use a bipolar univariate scale while the evaluations on 
the level of the whole set of tuples would rather adopt 
a unipolar bivariate scale. 


A General Interpretation of Bipolarity in the Con- 
text of Database Querying. In the most general in- 
terpretation, we do not assume anything more about 
the relation between positive and negative conditions. 
Thus, we have two conditions and each tuple is evalu- 
ated against them yielding a pair of matching degrees. 
Then an important question is how to order data in 
an answer to such a query. Basically, while doing that 
we should take into account the very nature of both 
matching degrees, i.e., the fact that they correspond 
to the positive and negative conditions. The situation 
here is somewhat similar to that of decision making
under risk. Namely, in the latter context a decision
maker who is risk-averse may not accept actions leading
with some nonzero probability to a loss. On the
other hand, a risk-prone decision maker may ignore 
the risk of an even serious loss as long as there are 
prospects for a high gain. Similar considerations may 
apply in the case of bipolar queries. Some users may be 
more concerned about negative aspects and will reject 
a tuple with a nonzero matching degree of the negative 
condition. Some other users may be more oriented toward
the satisfaction of the positive conditions and may be
ready to accept the fact that a given piece of data satisfies
to some extent the negative conditions too. Thus,
the bipolar query should be evaluated in a database in 
a way strongly dependent on the attitude of the user. 
In the extreme cases, the above-mentioned risk-averse 
and risk-prone attitudes would be represented by lexi- 
cographic orders. In the former case, the lexicographic 
ordering would be first nondecreasing with respect to 
the negative condition matching degree and then nonin- 
creasing with respect to the positive condition matching 
degree. The less extreme attitudes of the users may be 
represented by various aggregation operators producing 
a scalar overall matching degree of a bipolar query. 
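
The two extreme attitudes can be sketched with ordinary lexicographic sorting. The risk-averse ordering follows the description above (first nondecreasing in the negative degree, then nonincreasing in the positive degree); the risk-prone ordering shown here is the symmetric counterpart and is an assumption, as are the field names and the sample data.

```python
# Each answer carries a pair of matching degrees: "pos" for the positive
# condition and "neg" for the negative condition (hypothetical values).
answers = [
    {"id": "a", "pos": 0.9, "neg": 0.4},
    {"id": "b", "pos": 0.5, "neg": 0.0},
    {"id": "c", "pos": 0.7, "neg": 0.0},
]


def rank_risk_averse(rows):
    """Smallest negative degree first; ties broken by larger positive degree."""
    return sorted(rows, key=lambda r: (r["neg"], -r["pos"]))


def rank_risk_prone(rows):
    """Largest positive degree first; ties broken by smaller negative degree."""
    return sorted(rows, key=lambda r: (-r["pos"], r["neg"]))


averse_order = [r["id"] for r in rank_risk_averse(answers)]
prone_order = [r["id"] for r in rank_risk_prone(answers)]
```

The risk-averse user sees the tuples in the order c, b, a, because a partly satisfies the negative condition, while the risk-prone user sees a, c, b, because a best satisfies the positive condition.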

A comprehensive treatment of bipolar queries
understood in this general sense has been proposed
by Matthé and De Tré [19.90], and further developed
in [19.91]. In this approach, a pair of matching de- 
grees of the positive and negative conditions is referred 
to as a bipolar satisfaction degree (BSD). The respec- 
tive matching degrees are denoted as s and d, and 
called the satisfaction degree and the dissatisfaction de- 
gree, respectively. The ranking of data retrieved against 
a bipolar query in this approach may be obtained in var- 
ious ways. One of the options is based on the difference 
$s - d$ of the two matching degrees. In this case, a risk-neutral
attitude of the user is modeled: he or she
favors neither the positive nor the negative evaluation.
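
Ranking by the difference of the two degrees can be sketched as follows. The `BSD` tuple type, the function name, and the sample degrees are hypothetical; only the ranking criterion s - d is taken from the text.

```python
from typing import NamedTuple


class BSD(NamedTuple):
    """Bipolar satisfaction degree: satisfaction s and dissatisfaction d."""
    s: float
    d: float


def rank_by_difference(bsds):
    """Order tuple identifiers by s - d, best first (risk-neutral attitude)."""
    return sorted(bsds, key=lambda key: bsds[key].s - bsds[key].d, reverse=True)


results = {
    "a": BSD(s=0.9, d=0.4),
    "b": BSD(s=0.5, d=0.1),
    "c": BSD(s=0.7, d=0.0),
}

ranking = rank_by_difference(results)
```

Here the differences are 0.5, 0.4, and 0.7, so the ranking is c, a, b: a high satisfaction degree can compensate for some dissatisfaction, which neither of the lexicographic orders allows.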


The Required/Desired Semantics. Most of the re- 
search on bipolar queries has been so far focused on 
a special interpretation of the positive and negative con- 
ditions. Namely, the data items sought have to satisfy 
the complement of the latter condition, i.e., the condition
denoted earlier as C, unconditionally, while the
former condition, i.e., the condition denoted as P, is of
somewhat secondary importance. For example, a painting
one is looking for should be from the seventeenth
century and, if possible, should be painted by one of
the famous Flemish painters. The C condition is here
painted in the seventeenth century (the original negative
condition is, of course, painted not in the seventeenth
century) while the positive condition P is painted by one
of the famous Flemish painters. Thus, the condition C is
a required condition while the condition P
may be referred to as a desired condition. Anyway, we
still have two matching degrees, of conditions C and 
P, and the assumed relation between them determines 
the way the tuples should be ordered in the answer to 
a query. 

The simplest approach is to use the desired con- 
dition’s matching degree just to order the data items 
which satisfy the required condition. However, if the 
required condition is fuzzy, i.e., may be satisfied to 
a degree, it is not obvious what it should mean that it is
satisfied. Some authors [19.56, 92] propose to adopt the
risk-averse model of the user and use the correspond- 
ing lexicographic order with the primary account for 
the satisfaction of the condition C. This interpretation 
is predominant in the literature. 

Another approach consists in employing an aggregation
operator, which combines the degrees of matching
of conditions C and P in such a way that the
possibility of satisfying both conditions C and P is
explicitly taken into account, i.e., the focus is on a proper
interpretation of the following expression which is
identified with the bipolar query (C, P)


C and possibly P . (19.16) 


Aggregation operators of this type have been studied 
in the literature under different names and in various
contexts. In the framework of database querying,
such an operator was first proposed by Lacroix and
Lavency [19.93]. It was proposed independently in the context
of default reasoning by Yager [19.94] and by Dubois
and Prade [19.95]. The concept of this operator was
also used by Bordogna and Pasi [19.96] in the con- 
text of textual information retrieval. Recently, a more 
general concept of a query with preferences and a corre- 
sponding new relational algebra operator, winnow, were 
introduced by Chomicki [19.97]. 

Zadrozny and Kacprzyk [19.98,99] proposed a di- 
rect fuzzification of the concept of the and possi- 
bly operator, implicit in the work of Lacroix and 
Lavency [19.93]. In their approach, the essence of the 
and possibly operator modeling consists in taking into 
account the whole database (set of tuples) while com- 
bining the required and desired conditions matching 
degrees. Namely: 


• If there is a tuple which satisfies both conditions,
then and only then it is actually possible to satisfy
both of them and each tuple has to meet both of
them, i.e., the and possibly turns into a regular
conjunction $C \wedge P$;

• If there is no such tuple, then it is not possible to
satisfy both conditions and the desired one can be
disregarded, i.e., the query reduces to C.


These are however two extreme cases and actually 
it may be the case that the two conditions may be simul- 
taneously satisfied to a degree. Then, the (C, P) query 
may be also matched to a degree which is identified 
with the truth of the following formula 


$$C(t)\ \text{and possibly}\ P(t) \equiv C(t) \wedge \bigl(\exists s\,(C(s) \wedge P(s)) \Rightarrow P(t)\bigr). \qquad (19.17)$$


This formula has been proposed by Lacroix and
Lavency [19.93] for the crisp case. Its fuzzy counterpart
[19.98-100] requires choosing a proper interpretation
of the logical connectives, and may take the
following form

$$C(t)\ \text{and possibly}\ P(t) \equiv \min\Bigl(C(t),\ \max\bigl(1 - \max_{s \in T} \min(C(s), P(s)),\ P(t)\bigr)\Bigr), \qquad (19.18)$$

where $T$ denotes the whole set of tuples being queried.

Formula (19.18) is derived from (19.17) using the
classical fuzzy interpretation of the logical connec- 
tives via the max and min operators. Zadrozny and 
Kacprzyk [19.100-102] studied the properties of the 
counterparts of (19.18) obtained using a broader class 
of the operators modeling logical connectives. 
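
A minimal sketch of (19.18) is shown below, assuming that the matching degrees of the required condition C and the desired condition P have already been computed for every tuple. The function name and the sample degrees are hypothetical.

```python
def and_possibly(c_degrees, p_degrees):
    """Evaluate 'C and possibly P' for every tuple, following (19.18).

    c_degrees[i] and p_degrees[i] are the matching degrees of the required
    and desired conditions for the i-th tuple of the queried relation.
    """
    # Degree to which some tuple satisfies both conditions at once.
    feasibility = max(min(c, p) for c, p in zip(c_degrees, p_degrees))
    # Implication "feasible => P(t)" read as max(1 - feasibility, P(t)).
    return [min(c, max(1.0 - feasibility, p))
            for c, p in zip(c_degrees, p_degrees)]


# Three tuples; no tuple satisfies both conditions to a high degree.
fuzzy_result = and_possibly([1.0, 0.8, 0.6], [0.0, 0.3, 0.2])

# Crisp extremes: no tuple satisfies both, so the query reduces to C ...
reduces_to_c = and_possibly([1.0, 1.0, 0.0], [0.0, 0.0, 1.0])
# ... and some tuple satisfies both, so the query becomes the conjunction.
conjunction = and_possibly([1.0, 1.0], [1.0, 0.0])
```

In the fuzzy example both conditions can be met jointly only to degree 0.3, so the desired condition is discounted and the tuples obtain the degrees 0.7, 0.7, and 0.6. The two crisp examples reproduce the extreme cases discussed above.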

It is worth noting that if the required/desired semantics
is assumed and the bipolar evaluations are
expressed at the level of an attribute domain then it 
is reasonable to impose some consistency conditions 
on the form of both fuzzy sets representing condition 
C and P. Namely, it may be argued that a domain el- 
ement should be first acceptable, i.e., should satisfy 
the required condition C, before it may be desired, 
i.e., satisfy the condition P. Such consistency condi- 
tions between fuzzy sets C and P may be conveniently 
expressed using the concepts of twofold fuzzy sets 
or Atanassov intuitionistic fuzzy sets/interval-valued
fuzzy sets, referred to earlier in Sect. 19.3.2. For an 
in-depth discussion of such consistency conditions the 
reader is referred to [19.56, 92, 103]. 

The growing interest in modeling bipolarity of user 
preferences in queries resulted recently in some further 
studies and interpretations of the and possibly operator
as well as in the concept of new similar operators such 
as the or at least operator. For more details, the reader 
is referred to [19.104, 105]. 


19.4.2 Flexible Querying of Fuzzy Databases 


Possibilistic Approach 

The possibilistic approach to data modeling is based 
on the sound foundations of a well-developed theory, 
i.e., the possibility theory. Thus, the standard rela- 
tional algebra operations have their counterparts in an 
algebra for retrieving information from a fuzzy pos- 
sibilistic relational database, proposed by Prade and 
Testemale [19.31]. Let us consider the selection operation
$\sigma$. In the classical relational algebra it is a
unary operation which for a given relation R returns another
relation $\sigma(R)$, comprising those tuples of R which
satisfy a condition c (such a condition is a kind of parameter
of the selection operation). In the possibilistic
approach, the selection operation has to be redefined
so as to make it compatible with the assumed data
representation. To this end, two types of elementary
conditions are considered:


(i) $A\ \theta\ a$, where A is an attribute, $\theta$ is a comparison
operator (fuzzy or not) and a is a constant (fuzzy or
not);

(ii) $A_i\ \theta\ A_j$, where $A_i$ and $A_j$ are attributes.


In general, an exact value of an attribute is un- 
known and, thus, the matching degree is defined as 
the possibility and necessity of the match between this 
value and the constant (case (i) above) or the value of 
another attribute (case (ii) above). Hence, the formu- 
las (19.17)-(19.18) are used to compute the possibility 
and necessity in the following way. 

In case (i), the possibility distribution $\pi_{A(t)}(\cdot)$ representing
the value of the attribute A at a tuple t is used
to compute the possibility measure of a set F, crisp or
fuzzy, of elements of $\mathrm{dom}_A$ being in the relation $\theta$ with
elements representing the constant a. The membership
function of the set F is

$$\mu_F(x) = \sup_{y \in \mathrm{dom}_A} \min(\mu_\theta(x, y), \mu_a(y)), \quad x \in \mathrm{dom}_A, \qquad (19.19)$$

where $\mu_a(\cdot)$ is the membership function of the constant
a and $\mu_\theta(\cdot)$ represents the fuzzy comparison operator
(fuzzy relation) $\theta$. Then, the pair $(\Pi_{A(t)}(F), N_{A(t)}(F))$
represents the membership of a tuple t to the relation
being the result of the selection operator.
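
For a finite domain, case (i) can be sketched as below. The sketch assumes the standard possibility-theory definitions, namely that the possibility of F is the supremum of min(π(x), μ_F(x)) and the necessity of F is the infimum of max(1 - π(x), μ_F(x)); the function names, the crisp equality used as θ, and all degrees are illustrative assumptions.

```python
def induced_set(domain, mu_theta, mu_a):
    """Membership function of F as in (19.19), over a finite domain."""
    return {x: max(min(mu_theta(x, y), mu_a.get(y, 0.0)) for y in domain)
            for x in domain}


def possibility(pi, mu_f):
    """Possibility of F: sup over x of min(pi(x), mu_F(x))."""
    return max(min(pi[x], mu_f[x]) for x in pi)


def necessity(pi, mu_f):
    """Necessity of F: inf over x of max(1 - pi(x), mu_F(x))."""
    return min(max(1.0 - pi[x], mu_f[x]) for x in pi)


def equals(x, y):
    # Crisp comparison operator used as theta in this example.
    return 1.0 if x == y else 0.0


domain = [1600, 1610, 1620, 1630]
about_1610 = {1610: 1.0, 1620: 0.5}                       # fuzzy constant a
pi_date = {1600: 0.2, 1610: 1.0, 1620: 0.7, 1630: 0.0}    # value of A at t

mu_f = induced_set(domain, equals, about_1610)
answer = (possibility(pi_date, mu_f), necessity(pi_date, mu_f))
```

For this tuple the pair is (1.0, 0.5): it is fully possible that the imprecisely known date matches the fuzzy constant, but this is certain only to degree 0.5.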

In case (ii), the joint possibility distribution
$\pi_{(A_i(t), A_j(t))}(\cdot)$ is used to compute the possibility measure
of a subset F of the Cartesian product of domains
of $A_i$ and $A_j$ comprising the pairs of elements being in
relation $\theta$. The membership function of the set F is
defined as follows

$$\mu_F(x, y) = \mu_\theta(x, y), \quad x \in \mathrm{dom}_{A_i},\ y \in \mathrm{dom}_{A_j}. \qquad (19.20)$$

Then, the pair $(\Pi_{(A_i(t), A_j(t))}(F), N_{(A_i(t), A_j(t))}(F))$ represents
the membership degree of a tuple t to the relation
being the result of the selection operator. If the attributes
$A_i$ and $A_j$ are noninteractive [19.27] then the
computing of the possibility measure is simplified.
Namely

$$\pi_{(A_i(t), A_j(t))}(x, y) = \min(\pi_{A_i(t)}(x), \pi_{A_j(t)}(y)). \qquad (19.21)$$
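
Case (ii) with noninteractive attributes can be sketched in the same spirit. The joint distribution is built with (19.21), and the possibility and necessity are again computed with the standard sup-min and inf-max definitions, which are assumed here; the attribute meanings, the crisp relation "not later than", and all degrees are hypothetical.

```python
from itertools import product


def joint_noninteractive(pi_i, pi_j):
    """Joint possibility distribution of two noninteractive attributes, cf. (19.21)."""
    return {(x, y): min(pi_i[x], pi_j[y]) for x, y in product(pi_i, pi_j)}


def possibility_pair(joint, mu_theta):
    """Possibility that the two attribute values are in the relation theta."""
    return max(min(p, mu_theta(x, y)) for (x, y), p in joint.items())


def necessity_pair(joint, mu_theta):
    """Necessity that the two attribute values are in the relation theta."""
    return min(max(1.0 - p, mu_theta(x, y)) for (x, y), p in joint.items())


def not_later_than(x, y):
    # Crisp comparison operator used as theta in this example.
    return 1.0 if x <= y else 0.0


pi_painted = {1600: 1.0, 1650: 0.4}    # imprecise year of painting
pi_acquired = {1620: 1.0, 1640: 0.6}   # imprecise year of acquisition

joint = joint_noninteractive(pi_painted, pi_acquired)
pair = (possibility_pair(joint, not_later_than),
        necessity_pair(joint, not_later_than))
```

The resulting pair is approximately (1.0, 0.6): the condition "painted not later than acquired" is fully possible for this tuple, and the only doubt comes from the weakly possible painting year 1650.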


It is worth noting that Prade and Testemale [19.27], in 
fact, consider the answer to a query as composed of two 
fuzzy sets of tuples: 


• Those which necessarily match the query; the membership
degree for each tuple of this set is
defined by $N_{A(t)}(F)$;

• Those which possibly match the query; the membership
degree for each tuple of this set is
defined by $\Pi_{A(t)}(F)$.


Prade and Testemale [19.27] consider also the case 
when the selection operation is used with a compound 
condition C, i.e., $C = C_1 \wedge C_2$ or $C = C_1 \vee C_2$ or
$C = \neg C_1$. Due to the fact that, in general, a calculus of
uncertainty, exemplified by the possibility theory, cannot
be truth functional [19.106], it is not enough to compute 
the possibility and necessity measures for elementary 
conditions using possibility distributions representing 
values of particular attributes and then combine them 
using an appropriate operator. In order to secure effec- 
tive computing of the result of the selection operator 
it is thus assumed that the attributes referred to in ele- 
mentary conditions are noninteractive. In such a case, 
truth-functional combination of the obtained possibility 
and necessity measures is justified. 

Dubois and Prade [19.58] propose a technique of 
querying bipolar data using bipolar queries. A tuple 
may be classified to many categories with respect to an 
answer to such a query. 
In the extended possibilistic approach [19.107] the
matching degree of an elementary condition against
a tuple t is expressed by an EPTV (cf., page 298). This
EPTV represents the extent to which it is (un)certain
that t belongs to the result of a flexible query. Let us
consider a query condition of the form A is l, where A
denotes an attribute and l denotes a fuzzy set
representing a linguistic term used in a query, e.g., low
in the condition Price is low. Then, the EPTV representing
the matching degree will be computed as

$$\mu_{\tilde{t}^{*}(A \text{ is } l)}(T) = \sup_{x \in \mathrm{dom}_A} \min\bigl(\pi_A(x), \mu_l(x)\bigr), \quad (19.22)$$

$$\mu_{\tilde{t}^{*}(A \text{ is } l)}(F) = \sup_{x \in \mathrm{dom}_A - \{\bot\}} \min\bigl(\pi_A(x), 1 - \mu_l(x)\bigr), \quad (19.23)$$

$$\mu_{\tilde{t}^{*}(A \text{ is } l)}(\bot) = \min\bigl(\pi_A(\bot), 1 - \mu_l(\bot)\bigr), \quad (19.24)$$


where $\pi_A(\cdot)$ denotes the possibility distribution
representing the value of the attribute A (to simplify the
notation we omit here a reference to a tuple t). In the case
of a compound query condition, the resulting EPTV can
be obtained by aggregating the EPTVs computed for the
elementary conditions. Hereby, generalizations of the
logical connectives of the conjunction ($\wedge$), disjunction
($\vee$), negation ($\neg$), implication ($\Rightarrow$), and equivalence
($\Leftrightarrow$) can be applied according to [19.14, 108].
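As an illustration of (19.22)-(19.24), the following minimal sketch computes the three components of an EPTV for one elementary condition over a finite domain. The marker `BOTTOM` stands for the special domain element $\bot$; the attribute values, the possibility degrees, and the membership function of the term low are hypothetical, and the aggregation of several EPTVs for compound conditions is not covered.

```python
BOTTOM = "undefined"  # stands for the special domain element (bottom)


def eptv(pi, mu):
    """EPTV of the condition 'A is l' for one tuple.

    pi -- possibility distribution of the attribute value (may include BOTTOM)
    mu -- membership function of the fuzzy set l (missing values count as 0)
    """
    regular = [x for x in pi if x != BOTTOM]
    # (19.22): degree to which it is possible that the condition is true.
    true_deg = max(min(pi[x], mu.get(x, 0.0)) for x in pi)
    # (19.23): degree to which it is possible that the condition is false.
    false_deg = max((min(pi[x], 1.0 - mu.get(x, 0.0)) for x in regular),
                    default=0.0)
    # (19.24): degree to which it is possible that the condition is undefined.
    undef_deg = min(pi.get(BOTTOM, 0.0), 1.0 - mu.get(BOTTOM, 0.0))
    return {"T": true_deg, "F": false_deg, "undefined": undef_deg}


# Hypothetical price that is most plausibly 10, possibly 50, possibly inapplicable.
pi_price = {10: 1.0, 50: 0.6, BOTTOM: 0.3}
mu_low = {10: 1.0, 50: 0.2}
result = eptv(pi_price, mu_low)
print(result)
```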

Baldwin et al. [19.109] have implemented a system 
for querying a possibilistic relational database using 
semantic unification and the evidential support logic 
rule to combine matching degrees of the elementary 
conditions. The queries are composed of one or more 
conditions, the importance of each condition, a filter- 
ing function (similar to the notion of quantifier) and 
a threshold. The particularity of their work is the pro- 
cess, semantic unification, used for matching the fuzzy 
values of the criteria with the possibility distributions 
representing the values of the attributes. As a result, one 
obtains an interval [n, p], where, similarly to the previ- 
ous case, n (necessity) is the certain degree of matching 
and p (possibility) is the maximum possible degree of 
matching. However, this time the calculations are based 
on the mass assignments theory developed by Baldwin. 
In this approach, an interactive iterative process of the 
querying is postulated. 

Bosc and Pivert [19.110] proposed another type of 
a query against a possibilistic database. Namely, the 
user may be interested in finding tuples which have
specific features of the possibility distribution
representing the value of an attribute. Thus, the condition of
such a query does not refer to the value of an attribute 


itself but to the characteristics of its possibility
distribution. This new type of query may be illustrated with
the following examples:


I. Find tuples such that all the values $a_1, a_2, \ldots, a_n$ are
possible for an attribute A.

II. Find tuples such that more than n values are possible
to a degree higher than $\lambda$ for an attribute A.

III. Find tuples where for attribute A the value $a_1$ is
more possible than the value $a_2$.

IV. Find tuples where for attribute A only one value is
completely possible.


The matching degree for such queries is computed
in a fairly straightforward way. For a query of type I, it
may be computed as $\min(\pi_A(a_1), \pi_A(a_2), \ldots, \pi_A(a_n))$.
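The type I matching degree can be sketched in a few lines. The function name and the sample distribution are hypothetical; values absent from the distribution are assumed to have possibility 0, and the list of queried values is assumed to be non-empty.

```python
def match_all_possible(pi, values):
    """Type I query: degree to which all the listed values are possible."""
    return min(pi.get(a, 0.0) for a in values)


# Hypothetical possibility distribution of the attribute Color for one tuple.
pi_color = {"red": 1.0, "blue": 0.7, "green": 0.2}

deg_red_blue = match_all_possible(pi_color, ["red", "blue"])
deg_red_black = match_all_possible(pi_color, ["red", "black"])
print(deg_red_blue, deg_red_black)
```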

The reader is referred for more details to the fol- 
lowing sources on fuzzy querying in the possibilistic 
setting [19.28, 29, 32, 111]. 


Similarity-Based Approach 
The research on querying in similarity-based fuzzy 
databases is best presented in a series of papers by 
Petry et al. [19.112-114]. A complete set of
operations of the relational algebra has been defined for
the similarity relation-based model. These operations 
result from their classical counterparts by the replace- 
ment of the concept of equality of two domain values 
with the concept of their similarity. The conditions of 
queries are composed of crisp predicates as in a regular 
query language. Additionally, a set of level thresholds 
may be submitted as a part of the query. A threshold
may be specified for each attribute appearing in the query's
condition. Such a threshold indicates what degree of
similarity of two values from the domain of a given
attribute justifies considering them equal. The concept of
the threshold level also plays a central role in the defini- 
tion of the redundancy in this database model and thus 
is important for the relational algebra operations as they 
are usually assumed to be followed by the reduction 
of redundant tuples. In this model, two tuples are re- 
dundant if the values of all corresponding attributes are 
similar (to a level higher than a selected degree) rather 
than equal, as it is the case in the traditional relational 
data model. 
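The threshold-based notion of redundancy can be sketched as follows. This is a simplified illustration, not the model of [19.112-114]: attribute values are assumed to be atomic, the similarity relations are given as hypothetical lookup tables, and the comparison with the threshold is taken to be inclusive, which is an assumption.

```python
def similar(sim, a, b, level):
    """True if values a and b are similar to at least the given level."""
    if a == b:
        return True  # similarity relations are reflexive
    degree = sim.get((a, b), sim.get((b, a), 0.0))  # symmetric lookup
    return degree >= level


def redundant(t1, t2, sims, levels):
    """Two tuples are redundant if all corresponding values are similar enough."""
    return all(similar(sims[attr], t1[attr], t2[attr], levels[attr])
               for attr in levels)


# Hypothetical similarity relations and per-attribute level thresholds.
sims = {
    "color": {("red", "crimson"): 0.9, ("red", "blue"): 0.1},
    "size": {("big", "large"): 0.8},
}
levels = {"color": 0.8, "size": 0.7}

t1 = {"color": "red", "size": "big"}
t2 = {"color": "crimson", "size": "large"}
t3 = {"color": "blue", "size": "big"}
print(redundant(t1, t2, sims, levels), redundant(t1, t3, sims, levels))
```

Raising the threshold for an attribute makes fewer value pairs count as equal, so fewer tuples are merged when redundancy is removed after an algebra operation.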

Hybrid models, mentioned earlier, are usually ac- 
companied by their own querying schemes. For exam- 
ple, the GEFRED model is equipped with a generalized 
fuzzy relational algebra. Galindo et al. [19.115] have 
extended the GEFRED model with a fuzzy domain 
relational calculus (FDRC) and in [19.116] the fuzzy 
quantifiers have been included. 
19.5 Conclusions


In this chapter, we have presented an overview of se- 
lected contributions in the areas of data representation 
and querying. We have focused on approaches rooted 
in the relational data model. In the literature, many ap- 
proaches have been also proposed for, e.g., fuzzy object 


20. Application of Fuzzy Techniques to Autonomous Robots


Ismael Rodriguez Fdez, Manuel Mucientes, Alberto Bugarin Diz 


The application of fuzzy techniques in robotics has
become widespread in recent years and in different
fields of robotics, such as behavior design,
coordination of behavior, perception, localization,
etc. The significance of the contributions was high
until the end of the 1990s, when the main aim in
robotics was the implementation of basic behaviors.
In recent years, the focus in robotics has moved
to building robots that operate autonomously in
real environments; the actual impact of fuzzy
techniques in the robotics community is not as deep
as it was in the early stages of robotics or as it is in
other application areas (e.g., medicine, process
industry). In spite of this, new emerging areas
in robotics such as human-robot interaction,
or well-established ones, such as perception, are
good examples of new potential realms of application
where (hybridized) fuzzy approaches will
surely be capable of exhibiting their capacity to
deal with such complex and dynamic scenarios.


20.1 Robotics and Fuzzy Logic 


Although many other classical definitions could be 
stated, an autonomous robot may be defined as a ma- 
chine that collects data from the environment through 
sensors, processes these data taking into account its pre- 
vious knowledge of the world, and acts according to 
a goal. This definition is general and covers the different 
types of robots available today: indoor wheeled robots, 
autonomous cars, unmanned aerial vehicles (UAVs), 
autonomous underwater vehicles, robotic arms, hu- 
manoid robots, robotic heads, etc. 

Between the early 1990s and today robotics has 
evolved a lot. From our point of view, it is possible 
to distinguish three stages in robotics research in the 
last years. At the beginning, the focus was on endow- 
ing robots with a number of simple behaviors to solve 
basic tasks like wall-following, obstacle avoidance, 
entering rooms, etc. Later, the objective in robotics
moved to building truly autonomous robots, which re- 
quired the mapping of the environment, the localization 
of the robot, and navigation or motion planning. Al- 
though these topics are still open, the focus of current 
robotics is starting to move to a third stage where much 
higher level and integrated capabilities, such as ad- 
vanced perception, learning of complex behaviors, or 
human-robot interaction, are involved. 

Within this context, fuzzy logic has been widely
used in robotics for several purposes. The main
advantage of using fuzzy logic in robotics is its ability
to manage the uncertainty due to sensors, actuators,
and also in the knowledge about the world. Until the
end of the 1990s, the contributions of fuzzy logic were
mainly in three fields [20.1]: behaviors, coordination
of behaviors, and perception. The design of behaviors
for solving specific and simple tasks in robotics has
been undoubtedly the most successful application of
fuzzy logic in robotics. Examples of these behaviors
are wall-following, navigation, trajectory tracking,
moving objects tracking, etc. Also, the selection
and/or combination of these basic behaviors has been
solved with fuzzy logic [20.2-4]. Finally, fuzzy
techniques have contributed to perception in two lines: i)
for the preprocessing of sensor data, prior to their use
as input to the behavior and ii) for modeling the
uncertainty both in occupancy and feature-based maps.


Table 20.1 Distribution of the 98 publications considered
in the respective rankings: Thomson-Reuters Web of Science
(JCR 2012-WoK; after [20.6]) for journals and Microsoft
Academic Search (MAS 2013; after [20.5]) for conferences

Quartile      Journals      Journals    Conferences   Conferences
              no. papers    % papers    no. papers    % papers
Q1            68            81          10            72
Q2            10            12          1             7
Q3            2             2           1             7
Q4            1             1           0             0
No ranking    3             4           2             14
Total         84            100         14            100

In this chapter we describe and analyze the
contributions of fuzzy techniques to robotics in the period
2003-2013. A total of 98 references related to the
topic of fuzzy techniques and robotics were finally
selected for being categorized and described, with the aim
of focusing on recent papers describing uses of fuzzy-based
or hybrid methods in relevant tasks. Both basic
behaviors and high-level tasks of the robotics area
as it is understood nowadays were considered.
Table 20.1 describes the relevance of the papers
considered in terms of their position in the well-known
rankings Thomson-Reuters Web of Knowledge for journals
and Microsoft Academic Search (MAS 2013) [20.5]
for conferences. We decided to consider the MAS
2013 ranking since other conference rankings such
as CORE-ERA (Computing Research and Education-Excellence
in Research for Australia) were not up
to date at the moment and do not extensively cover
the robotics area. It can be seen that a vast majority
of the papers are ranked in the Q1 of the respective
lists (81% for journals and 72% for conferences).

All the references were reviewed and classified
according to the robotics area they mainly addressed and
were included in one of the 12 categories described
in Sects. 20.2-20.13. The fuzzy technique they use
(together with its hybridization with other soft
computing techniques, if any) was also annotated in order
to assess which are the most active areas of soft
computing in the field of robotics and to evaluate their
actual impact in the field.

The results of this review, classification, and
analysis are presented in the sections that follow.


20.2 Wall-Following


When autonomous robots navigate within indoor
environments (e.g., industrial or civil buildings), they have
to be endowed with a number of basic capabilities that
allow the robot to perform specific tasks during
operation. These basic capabilities are usually referred to
in the robotics literature as behaviors. Some examples
of usual robot behaviors are the ability to move along
corridors, to follow walls at a given distance, to turn
corners, and to cross open areas in rooms. Wall-following
behavior is one of the most relevant ones and has been
very widely dealt with in the robotics literature, since it
is one of the basic behaviors to be executed when the
robot is exploring an unknown area, or when it is
moving between two points in a map.

The characteristic that makes a fuzzy controller
useful for the implementation of this and other behaviors is
the ability that fuzzy controllers have to cope with noisy
inputs. This noise appears when the sensors of the robot
detect the surrounding environment and is an inherent
feature of the whole field of robotic sensors.

The importance of wall-following behavior for 
a car-like mobile robot was pointed out in [20.7]. In this 
work, a fuzzy logic control system was used in order to 
implement human-like driving skills by an autonomous 
mobile robot. Four different sensor-based behaviors 
were merged in order to synthesize the concepts of 
the maneuvers needed. These behaviors were: wall- 
following, corner control, garage-parking, and parallel- 
parking. A description of the design and implementa- 
tion of a velocity controller for wall-following can be 
found in [20.8, 9]. In [20.8] fuzzy temporal rules were 
used in order to filter the sensorial noise of a Nomad 
200 mobile robot and to endow the rule base with high
expressiveness. The use of these types of rules has no- 
ticeably improved the robustness and reliability of the 
system. 
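To make the idea of a fuzzy wall-following controller concrete, the following minimal sketch implements a zero-order Takagi-Sugeno controller with three triangular fuzzy sets on the distance error. It is an illustration only and not the controller of [20.7-9]: the fuzzy sets, the rule consequents, the sign convention (positive output turns away from the wall), and all names are assumptions.

```python
def tri(x, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


def steer(distance_error, max_error=1.0, max_turn=0.5):
    """Angular velocity (rad/s) from the error between measured and desired
    wall distance (m). Negative error means the robot is too close."""
    # Normalize and saturate the input to [-1, 1].
    e = max(-1.0, min(1.0, distance_error / max_error))
    firing = {
        "too_close": tri(e, -2.0, -1.0, 0.0),
        "ok": tri(e, -1.0, 0.0, 1.0),
        "too_far": tri(e, 0.0, 1.0, 2.0),
    }
    # One rule per fuzzy set, each with a constant consequent.
    consequent = {"too_close": max_turn, "ok": 0.0, "too_far": -max_turn}
    total = sum(firing.values())
    # Weighted average of the rule consequents.
    return sum(firing[r] * consequent[r] for r in firing) / total


for err in (-5.0, -0.2, 0.0, 0.5):
    print(err, steer(err))
```

Because neighboring sets overlap, a small perturbation of the measured distance changes the output only slightly, which is the property that makes such controllers tolerant of noisy range readings.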

Wall-following behavior has also been used as 
a testing benchmark for automatic learning of fuzzy 
controllers. In [20.10] an evolutionary algorithm was 
used to automatically learn the fuzzy controller, taking 
into account the tradeoff between complexity and accu- 
racy. Continuing this work, in [20.11] the focus was on 
reducing the expert knowledge demanded for designing 
the controller. No restrictions are placed either on the 
number of linguistic labels or on the values that define 
the membership functions. Finally, in [20.12] a simple 
but effective learning methodology was presented. This 
methodology was proposed in order to not only gener- 
ate fuzzy controllers with good behavior in simulated 
experiments but also to use them directly in the real 
robot, with no further post-processing or
tuning stage.

More recent studies have addressed the use of type- 
2 fuzzy logic. These proposals are mostly motivated by 
the claim that type-1 fuzzy sets cannot handle the high 
levels of uncertainty that are usually present in real- 
world applications to the same extent that type-2 fuzzy 
sets can. 

As with type-1 fuzzy logic controllers (FLCs), this
behavior has been used for testing the automatic
learning of type-2 FLCs. In [20.13] a genetic algorithm
was used for tuning the type-2 fuzzy membership
functions of a previously defined controller. A more
complex learning scheme was presented in [20.14].
The antecedent is learned using type-2 fuzzy clustering
based on examples and without expert knowledge.
The actions in the consequent part of the rules are selected
from a set of defined control actions through a hybrid
method composed of a reinforcement learning
algorithm and ant colony optimization. Both works used
a real Pioneer robot to demonstrate the viability of the
controllers.

In order to show the advantages of using type-2
fuzzy logic, a comparative analysis of type-1 and
interval type-2 fuzzy controllers was presented in [20.15].
A particle swarm optimization algorithm was used to
optimize a type-1 FLC. Next, the interval type-2 fuzzy
controller was constructed, blurring the membership
functions. The results obtained by a real mobile robot
showed that the interval type-2 fuzzy controller can
cope better with dynamic uncertainties in the sensory
inputs due to the softening and smoothing of the output
control surface.

However, these works only focus on interval type-2
fuzzy logic. The high computational complexities
associated with general type-2 fuzzy logic systems
(FLSs) have, until recently, prevented their application
to real-world control problems. In [20.16] this problem
was addressed by introducing a complete representation
framework, which is referred to as zSlices-based
general type-2 fuzzy systems. As a proof-of-concept
application, this framework was implemented for a mobile
robot, which operates in a real-world outdoor
environment using the wall-following behavior. In this case, the
proposed approach outperformed type-1 and interval
type-2 fuzzy controllers in terms of errors in the
distance to the wall. For this behavior, type-2 approaches
exhibit a better performance when compared to type-1
approaches. Nevertheless, from a robotics point of
view, type-2 proposals have not yet outperformed other
fuzzy approaches in general and have a very limited
impact on this area.


20.3 Navigation


Robot navigation consists of a series of actions, which
are summarized in the ability of the robot to go from
a starting point to a goal without a planned route.
Navigation is one of the main issues that a mobile robot must
solve in order to operate.

In this behavior the ability to work in dynamic
environments whose structure is unknown, with great
uncertainty, with moving objects or objects that
may change their position is of great importance.
The capacity to work under these conditions is the
best motivation to use fuzzy logic in this particular
task.


Some work in this area has been done in simu- 
lated environments. In [20.17] a multi-sensor fusion 
technique was used to integrate all types of sensors 
and combine them to obtain information about the 
environment. In this way, the environment can be per- 
ceived comprehensively, and the ability of the FLC is 
improved. In [20.18, 19] the navigation behavior was 
studied for a set of robots that navigate in the same en- 
vironment. Both works used a Petri net to negotiate the 
priority of the robots. Also, in [20.19] different FLCs 
were compared. Each controller used a different num- 
ber of labels for each variable as well as a different 
shape of the fuzzy sets. It was concluded that utilizing
a Gaussian membership function is better for naviga- 
tion in environments with a high number of moving 
objects. 

Moreover, there are several works where this behav- 
ior has been successfully implemented in a real robot. 
A hardware implementation of a FLC for navigation 
in mobile robots was presented in [20.20]. The design
methodology allows a FLC to be transformed into a system
that is suitable for easy implementation on a digital 
signal processor (DSP). This methodology was tested 
with good results in a ROMEO 4R car-like vehicle 
for a parking problem. Moreover, in [20.21] the de- 
sign of a new fuzzy logic-based navigation algorithm 
for autonomous robots was illustrated. It effectively 
achieves correct environment modeling, and processes 
noisy and uncertain sensory data on a low-cost Khepera 
robot. 

The ability to avoid dead-end paths was studied 
in [20.22,23]. In these works the minimum risk ap- 
proach was used to avoid local minima. A novel path- 
searching behavior was developed to recommend the 
local direction with minimum risk, where the risk was 
modeled using fuzzy logic. Another approach to
solving this problem was presented in [20.24]. While the
fuzzy logic body of the algorithm performs the main 
tasks of obstacle avoidance and target seeking, an ac- 
tual/virtual target switching strategy solves the problem 
of dead-ends on the way to the target. 

In [20.25] a new approach was proposed that em- 
ploys a fuzzy discrete event system to implement the 
behavior coordinator mechanism that selects relevant 
behaviors at a particular moment to produce an appro- 
priate system response. The possible transition from 
one state to another when an event occurs was modeled 
using fuzzy sets. 

In addition to the use of conventional FLCs, dif- 
ferent approaches have been used in the last decade. 
In [20.26] a novel reactive type-2 fuzzy logic architec- 
ture was used. Type-2 fuzzy logic was used for both 
implementing the basic navigation behaviors and also 
the strategies for their coordination. The proposed ar- 
chitecture was implemented in a robot and successfully 
tested in indoor and outdoor environments. 

In the same way as several learning approaches 
have been used for wall-following behavior, different 
works focused on automatically learning navigation 
skills. In [20.27] a novel fuzzy Q-learning approach was 
presented, where the weights of the fuzzy rules of the 
controller were learned through a reinforcement algo- 
rithm. 
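The general idea of learning the rule consequents of a fuzzy controller by reinforcement can be sketched as follows. This is one common formulation of fuzzy Q-learning, given only as an illustration and not as the method of [20.27]: each rule keeps a q-value per candidate action, the command is the firing-strength-weighted blend of the actions chosen by the rules, and the temporal-difference error is shared among the rules in proportion to their firing strengths. The class name, parameters, and example values are hypothetical.

```python
import random


class FuzzyQLearner:
    def __init__(self, n_rules, actions, alpha=0.1, gamma=0.9,
                 epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)
        # q[i][j]: quality of candidate action j in rule i.
        self.q = [[0.0] * len(self.actions) for _ in range(n_rules)]

    def act(self, phi):
        """phi: firing strengths of the rules for the current state."""
        chosen = []
        for row in self.q:
            if self.rng.random() < self.epsilon:  # explore
                chosen.append(self.rng.randrange(len(row)))
            else:  # exploit: best action of this rule
                chosen.append(max(range(len(row)), key=row.__getitem__))
        total = sum(phi)
        command = sum(p * self.actions[c] for p, c in zip(phi, chosen)) / total
        q_value = sum(p * self.q[i][c]
                      for i, (p, c) in enumerate(zip(phi, chosen))) / total
        return command, chosen, q_value

    def value(self, phi):
        """Greedy value of a state described by firing strengths phi."""
        return sum(p * max(row) for p, row in zip(phi, self.q)) / sum(phi)

    def update(self, phi, chosen, q_value, reward, phi_next):
        """One temporal-difference step, shared by firing strength."""
        delta = reward + self.gamma * self.value(phi_next) - q_value
        total = sum(phi)
        for i, (p, c) in enumerate(zip(phi, chosen)):
            self.q[i][c] += self.alpha * delta * p / total
        return delta


learner = FuzzyQLearner(n_rules=2, actions=[-1.0, 1.0], epsilon=0.0)
phi = [1.0, 0.0]  # only the first rule fires
command, chosen, q_value = learner.act(phi)
delta = learner.update(phi, chosen, q_value, reward=1.0, phi_next=[0.0, 1.0])
print(command, delta, learner.q)
```

Only the rule that fired receives credit for the reward, so the q-values of rules covering unvisited regions of the state space remain untouched.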


In [20.28] a neuro-fuzzy network that is able to 
add rules to the rule base was presented. The criterion
for adding rules was based on a performance
evaluation within a genetic algorithm that explores the
new situations to add. A comparison of three different 
neuro-fuzzy approaches with classical fuzzy controllers 
can be found in [20.29]. It is shown that neuro-fuzzy 
approaches perform better and that the best results 
were obtained by an optimization made by a genetic 
algorithm for both Mamdani and Takagi-Sugeno ap- 
proaches. 

Although mobile wheeled robots are the most com- 
mon area of application for navigation, other types of 
robots also use this behavior. One of the most
impressive types of robot implementing the navigation
behavior is the unmanned aerial vehicle (UAV).
In [20.30] two fuzzy controllers (one for altitude and
the other for latitude-longitude) were combined in
order to control the navigation of a small UAV. In [20.31]
the design of a Takagi-Sugeno controller for an un- 
manned helicopter was presented. The controller pro- 
posed is a fuzzy gain-scheduler used for stable and 
robust altitude, roll, pitch, and yaw control. Testing 
in both papers was performed in simulated environ- 
ments to show the results obtained by the controllers 
and, therefore, real testing on real UAVs was not 
reported. 

Another type of robot that demands navigation
capabilities is the robotic manipulator. Simulated
manipulators were used in [20.32, 33]. The strategy followed
in [20.32] was to use a fuzzy inference process to tune 
the gain of a sliding-mode control and the weights of 
a neural network controller in the presence of distur- 
bance or big tracking errors. It was shown that the 
combination of these controllers can guarantee stability. 
In [20.33] a genetic algorithm was presented in order 
to optimize the controllers of two robotic manipulators 
working in the same environment.

Other examples of applications of navigation are: 
a robotic fish motion control algorithm [20.34] en- 
dowed with an orientation control system based on 
a FLC; the data transmission latency or data loss 
considered in [20.35] for internet-based teleopera- 
tion of robots (when data transmission fails, the 
robot automatically moves and protects itself using 
a fuzzy controller optimized using a co-evolutionary 
algorithm); and the stabilization of a unicycle mo- 
bile robot described in [20.36], where a type- 
2 FLC was used, and computer simulations con- 
firmed its good performance in different navigation 
problems. 
20.4 Trajectory Tracking


Tracking refers to the ability of a robot to follow a pre- 
determined series of movements or a predefined path. 
Tracking is a relevant behavior in the industrial field, 
where robots usually have to perform a repeated pattern 
or trajectory with high precision. 

Developing accurate analytical models for such 
systems and hence reliable controllers based on such 
models is extremely difficult and in general infeasible
even for not very complex trajectories. Applying
fuzzy control strategies for such systems seems appro- 
priate, since with these systems the nonlinear system 
identification methodologies can be exploited with the 
help of inherent knowledge. Furthermore, suitable sta- 
bility conditions for such controllers to guarantee global 
asymptotic stability can be determined. 

In [20.37] a control structure that makes possible 
the integration of a kinematic controller and an adap- 
tive fuzzy controller was developed for mobile robots. 
A highly robust and flexible system that automati- 
cally follows a sequence of discrete way-points was 
presented in [20.38]. In [20.39] the combination of a ve- 
locity controller with a simple fuzzy system that limits 
on-line the advancing speed of the vehicle to allow it 
to follow an assigned path in compliance with the kine- 
matic constraints was presented. 

More recent studies in tracking for mobile robots 
have focused on more advanced and complex sys- 
tems. The work in [20.40] focused on the design of 
a dynamic Petri recurrent fuzzy neural network. This 
network structure was applied to the path-tracking con- 
trol of a non-holonomic mobile robot for verifying 
its validity. Also, in [20.41] the tracking control of 
a mobile robot with uncertainties in the robot kine- 
matics, the robot dynamics, and the wheel actuator 
dynamics was investigated. A robust adaptive con- 
troller was proposed for back-stepping a FLC. Fi- 
nally, [20.42] proposed a complete control law com- 
prising an evolutionary programming-based kinematic 
control and an adaptive fuzzy sliding-mode dynamic 
control. 

Although mobile robots have a great utility in indus- 
try, the robotic manipulators are the most used type of 
robot in this sector. One main issue is to develop con- 
trollers that deal with uncertainties. [20.43] addressed 
trajectory tracking problems of robotic manipulators 
using a fuzzy rule controller designed to deal with un- 
certainty. In [20.44] two adaptive fuzzy systems are 
employed to approximate the nonlinear and uncertain 
terms occurring in the robot arm and the joint motor
dynamics. Other adaptive controllers for motion control of
multi-link robot manipulators can be found in [20.45-47].

Other works focused on the ability to adapt the 
fuzzy control to different demands or requisites over 
time with different approaches. In [20.48] a fuzzy 
sliding mode controller was proposed for robotic ma- 
nipulators. The membership functions of the control 
gain are updated on-line and, therefore, the controller 
is not a conventional fuzzy controller but an adap- 
tive one. In [20.49] a design method that constructs 
the fuzzy rule base from a conventional proportional- 
integral-derivative (PID) controller in an incremental 
way using recursive feedback was proposed. A direct 
fuzzy control system for the regulation of robot ma- 
nipulators was presented in [20.50]. The bounds of the 
applied torques in this case are adjusted by means of the 
output membership functions parameters in such a way 
that the maximum torque demanded by the controller 
always ranges between the limits given by the manu- 
facturer. 

A different perspective of robotic manipulators, 
but also a very common one, is the scheme of dif- 
ferent systems that need to communicate. In [20.51] 
a new observer-controller structure for robot manipu- 
lators with model uncertainty using only the position 
measurements is proposed. In this method, adaptive 
fuzzy logic is used to approximate the nonlinear and 
uncertain robot dynamics in both the observer and the 
controller. 

The effects of network-induced delay and data 
packet dropout for a class of nonlinear networked 
control systems for a flexible arm were investigated
in [20.52]. The non-linear networked control systems 
were approximated by linear networked Takagi-Sugeno 
fuzzy models. An iterative algorithm for constructing 
the fuzzy model was proposed. Also, in [20.53], the 
delay transmission of a signal through an internet and 
wireless module was studied. 

As in the case of mobile robots, recent research
in robotic manipulators has addressed both
the automatic design and the learning of different
parts of the controllers. For example,
in [20.54] a novel tracking control design for
robotic systems using fuzzy wavelet networks was pre- 
sented. Fuzzy wavelet networks were used to estimate 
unknown functions and, therefore, to solve the prob- 
lem of demanding prior knowledge of the controlled 
plant. 
Intelligent control approaches such as neural net-
works for the approximation of nonlinear systems have 
also received considerable attention. They are very ef- 
fective in coping with structured parametric uncertainty 
and unstructured disturbance by using their powerful 
learning capability. Thus, neuro-fuzzy network con- 
trollers are the usual choice for robot manipulators. 

In [20.55] the position control of modular and re- 
configurable robots was addressed. A neuro-fuzzy con- 
trol architecture was used for tuning the gains inside 
the PID controller. An improvement was achieved with 
respect to classic controllers in terms of error of the 
trajectory that is tracked. Another neuro-fuzzy robust 
tracking control law was implemented in [20.56]. In this 
work, the controller guaranteed transient and asymp- 
totic performance. 

Other different approaches of neuro-fuzzy con- 
trollers have also been developed. In [20.57] a sta- 
ble discrete-time adaptive tracking controller using 
a neuro-fuzzy dynamic-inversion for a robotic manip- 
ulator was presented. The dynamics of the manipulator 
were approximated by a dynamic Takagi-Sugeno fuzzy
model. With the aim of improving the robustness of 
the controller, in [20.58] a novel parameter adjustment 
scheme using a neuro-fuzzy inference system architec- 
ture was presented. 

More real environment applications of neuro-fuzzy 
networks approaches have also been discussed in the 
literature [20.59]. In [20.60, 61] a robust neural-fuzzy- 
network control was investigated for the joint posi- 
tion control of an n-link robot manipulator for peri- 
odic motion, with the aim of achieving high-precision 
position tracking. In [20.62] an approximate
Takagi-Sugeno type neuro-fuzzy state-space model for
a flexible robotic arm was presented. The model was trained
using a particle swarm optimization technique.

In the last years, more advanced learning algorithms
have been used. In [20.63] an algorithm to learn the
path-following behavior for multi-link mobile robots
was presented. In this approach the learning complexity
of the path-following behavior is reduced, as long
paths are divided into a set of small motion primitives
that can reach almost every point in the neighborhood.
In [20.64], a robot manipulator was controlled
by a FLC, where the parameters of the Gaussian
membership functions were optimized with particle swarm
optimization. Also, in [20.65] the authors described
the application of ant colony optimization and particle
swarm optimization to the optimization of the
membership function parameters of a FLC. The aim in this case
was to find the optimal trajectory tracking controller for
an autonomous wheeled mobile robot.

Interval type-2 FLCs have been described in several
case studies to handle uncertainties. However, one of
the main issues in adopting such systems on a larger
scale is the lack of a systematic design methodology.
[20.66] presented a novel design methodology
of interval type-2 Takagi-Sugeno-Kang (TSK) FLCs
for modular and reconfigurable robot manipulators for
tracking purposes with uncertain dynamic parameters.
Moreover, [20.67] provided a problem-driven design
methodology together with a systematic assessment
of the performance quality and uncertainty robustness
of interval type-2 FLCs. The method was evaluated
on the problem of position control of a delta parallel
robot.


20.5 Moving Target Tracking


Service robots have to be endowed with the capacity
of working in dynamic environments with high
uncertainty. Typical environments with these characteristics
are airports, hallways of buildings, corridors of
hospitals, domestic environments, etc. One of the most
important factors when working in real environments
is the presence of moving objects in the surroundings of
the robot. The knowledge of the position, speed, and
heading of the moving objects is fundamental for the
execution of tasks like localization, route planning,
interaction with humans, or obstacle avoidance.

In [20.68] a module that allows the mobile robot to
localize the target precisely in the environment was
presented. Both direction and distance to the target were
measured using infrared sensors. Also, a fuzzy target 
tracking control unit was proposed. This control unit 
comprises a behavior network for each action of the 
tracking control and a gate network for combining the 
information of the infrared sensors. It was shown that 
the proposed control scheme is, indeed, effective and 
feasible through some simulated and real examples of 
the behavior. 
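
As a rough illustration of the tuning idea reported for [20.64] and [20.65], the following Python sketch lets a basic particle swarm search the centers and widths of two Gaussian membership functions. Everything in it is an assumption made for illustration: the two-rule controller, the fixed consequents of -1 and +1, the target surface tanh(2e), the bounds, and the swarm coefficients are hypothetical and are not taken from the cited works.

```python
import math
import random

# Hypothetical search bounds: [center_neg, sigma_neg, center_pos, sigma_pos]
BOUNDS = [(-2.0, 2.0), (0.1, 2.0), (-2.0, 2.0), (0.1, 2.0)]


def gaussian(x, center, sigma):
    """Gaussian membership degree of x."""
    return math.exp(-0.5 * ((x - center) / sigma) ** 2)


def flc_output(error, params):
    """Two-rule, zero-order Sugeno-style controller (illustrative only)."""
    c_neg, s_neg, c_pos, s_pos = params
    w_neg = gaussian(error, c_neg, s_neg)
    w_pos = gaussian(error, c_pos, s_pos)
    total = w_neg + w_pos
    if total < 1e-12:
        return 0.0
    # Rule consequents are fixed at -1 and +1; only the sets are tuned.
    return (-w_neg + w_pos) / total


def cost(params):
    """Mean squared deviation from an assumed target surface tanh(2e)."""
    samples = [i / 10.0 for i in range(-10, 11)]
    return sum((flc_output(e, params) - math.tanh(2 * e)) ** 2
               for e in samples) / len(samples)


def pso(cost_fn, bounds, n_particles=20, n_iters=60, seed=0):
    """Basic global-best particle swarm with inertia 0.7 and gains 1.5."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_cost = [cost_fn(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_cost[i])
    gbest, gbest_cost = pbest[g][:], pbest_cost[g]
    history = [gbest_cost]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * r1 * (pbest[i][d] - pos[i][d])
                             + 1.5 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                # Keep every parameter inside its bounds (sigma stays > 0).
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            c = cost_fn(pos[i])
            if c < pbest_cost[i]:
                pbest[i], pbest_cost[i] = pos[i][:], c
                if c < gbest_cost:
                    gbest, gbest_cost = pos[i][:], c
        history.append(gbest_cost)
    return gbest, history


best_params, cost_history = pso(cost, BOUNDS)
```

The list `cost_history` records the best cost found so far after each iteration, so it can only stay level or decrease. In a real application the cost function would be a tracking error measured on the robot or on its simulator rather than a fixed target curve.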

In [20.69, 70] a pattern classifier system for the
detection of moving objects using laser range finder data
was presented. An evolutionary algorithm was used to
learn the classifier system based on the quantified fuzzy
temporal rules model. These quantified fuzzy temporal
rules are able to analyze the persistence of the
fulfillment of a condition in a temporal reference by using
fuzzy quantifiers. Moreover, in [20.71] the authors
presented an in-depth experimental study on the performance of
different evolutionary fuzzy systems for moving object
following. Several environments with different degrees
of complexity and a real environment were used in
order to show the applicability of the methodologies
presented.


20.6 Perception


Perception is an essential part of any robotic system,
since it is the functionality through which the robot
incorporates information from the environment. Several
sensors have been typically used in mobile robots
over the years. Ultrasound, laser range finders, acoustic
signals, and cameras are some examples of the most
used sensors. The information obtained by these sensors
can be used in order to solve various problems such as
object recognition, collision avoidance, navigation, or
some particular objects.

The perception of landmarks is a very useful and
meaningful strategy for helping in localization or
navigation. Furthermore, it is quite a common strategy to
detect known landmarks that are present in the
environment instead of adding artificial landmarks (e.g.,
visual beacons, radio frequency identification (RFID)
labels, ...) in order to preserve environments with
the least possible external intervention or manipulation.
Within this context, doors are among the most commonly
used landmarks in indoor environments,
since they indicate relevant points of interaction and
are static in nature. In [20.72] fuzzy temporal
rules were used for detecting doors using the information
obtained from ultrasound sensors. This paradigm
was used to model the temporal variations of the
sensor signals together with the model of the necessary
knowledge for detection. A different approach
for door detection using computer vision was presented
in [20.73]. Doors are found in gray-level images
by detecting the borders and are distinguished from
other similar shapes using a fuzzy system designed
using expert knowledge. Also, a tuning mechanism
based on a genetic algorithm was used to improve the
performance of the system according to the particularities
of the environment in which it is going to be
employed.

In robotic soccer games perception also plays
a principal role. The work presented in [20.74] describes
a type-2 FLC to accurately track a mobile
object, in this case a ball, from a robot agent. Both
player and ball positions must be tracked using a
low-computational-cost image processing algorithm. The
fuzzy controller aims to overcome the uncertainty
added by this image processing.


20.7 Planning


For more complex behaviors in robotics, systems must
be able to achieve certain higher-level goals. In order to
do that, the robot needs to make choices that maximize
the utility or value of the available alternatives. The
process by which this objective is solved is called planning.
When the robot is not the only actor (as is usual), it must
check periodically whether the environment matches the
predictions made and change its plan accordingly.

Path planning is a typical task that is needed in most
mobile robots that work in unknown environments. The
problem consists in determining the path to be followed
by the robot in order to reach a goal. In the case when
not only the path, but also the movements of the robot
at each instant are determined, planning is referred
to as motion planning. In [20.75] a two-layered
goal-oriented motion planning strategy using fuzzy logic was
developed for a Koala mobile robot navigating in an
unknown environment. The information about the global
goal and the long-range sensorial data are used by the
first layer of the planner to produce an intermediate goal
in a favorable direction. The second layer of the planner
takes this sub-goal and guides the robot to reach it while
avoiding collisions using short-range sensorial data.
Other path planning approaches that use fuzzy logic
can be found in more recent years. A cooperative
control in a multi-agent architecture was applied in [20.76]
in order to implement high cognitive capabilities like
planning. The agents provided basic behaviors (such
as moving to a point) sharing the robot resources and
negotiating when conflicts arose. A new approach to
path planning was also proposed in [20.77]. It
was based on ant colony optimization to find the
best route with a fuzzy cost function. In [20.78] fuzzy
logic was used in order to discretize the environment in
relation to a soccer robot. Then, a multi-objective
evolutionary algorithm was designed in order to optimize
the actions needed in order to reach the ball.

Some research has also been reported in the field
of robot soccer, for high-level planning of team
behavior. In [20.79] an extensive fuzzy behavior-based
architecture was proposed. The behavior-based
architecture decomposes the complex multi-robotic
system into smaller modules of roles, behaviors, and
actions. Each individual behavior was implemented
using a FLC. The same approach was used for
coordinating the various behaviors and selecting the most
appropriate role for each robot. Continuing this work,
in [20.80] an evolutionary algorithm approach was
used to optimize each FLC for each layer of the
architecture.


20.8 SLAM


Simultaneous localization and mapping (SLAM) is
a field of robotics that has as its main objective the
construction of a map of the environment while at the
same time keeping track of the current location of the
robot inside the map that is being built. Mapping
consists of integrating the information gathered with the
robot's sensors into a given representation. In contrast
to this, localization is the problem of estimating where
the robot is placed on a map. In practice, these two
problems cannot be solved independently of each other.
Before a robot can answer the question of what the
environment looks like given a set of observations, it needs
to know from which locations these observations have
been made. At the same time, it is hard to estimate
the current position of a robot (or any vehicle) without
a map.

Different fuzzy approaches have been used in
order to help to solve the SLAM problem. In [20.81]
the development of a new neuro-fuzzy-based adaptive
Kalman filtering algorithm was proposed. The
neuro-fuzzy-based supervision for the Kalman filtering
algorithm is carried out with the aim of reducing the
mismatch between the theoretical and the actual
covariance of the innovation sequences. To do that, it attempts
to estimate the elements of the covariance matrix at
each sampling instant when a measurement update step
is carried out. Also, a fuzzy adaptive extended
information filtering scheme was used in [20.82] for ultrasonic
localization and pose tracking of an autonomous mobile
robot. The scheme was presented in order to improve
the estimation accuracy and robustness of the proposed
localization system in the presence of incomplete
information and noise.

A novel hybrid method for integrating fuzzy logic
and genetic algorithms (genetic fuzzy systems, GFSs)
to solve the SLAM problem was presented in [20.83].
The core of the proposed SLAM algorithm searches for
the most probable map such that the associated poses
provide the robot with the best localization
information. Prior knowledge about the problem domain was
transferred to the genetic algorithm in order to speed
up convergence. Fuzzy logic is employed to serve this
purpose and allows the algorithm to conduct the search
starting from a potential region of the pose space. The
underlying fuzzy mapping rules infer the uncertainty
in the location of the robot after executing a motion
command and generate a sample-based prediction of its
current position. The robustness of the proposed
algorithm has been shown in different indoor experiments
using a Pioneer 3AT mobile robot.


20.9 Cooperation


A multi-agent system is a system composed of
multiple interacting intelligent agents within an environment.
When the agents are robots, these systems lead to
a more challenging task because of their implicit
real-world environment, which is presumably difficult to
model. There are two fundamental needs for
multi-agent approaches. On one hand, some problems can
be naturally too complex or impossible to be
accomplished by a single robot. On the other hand, there
can be benefits in using several simple robots
because they are cheaper and more fault tolerant than
a single complex robot. Also, multi-agent
systems can be helpful for social and life science
problems.

An illustrative example of this type of system is
shown in [20.84], where a mobile sensor network
approach composed of cooperating robots was
presented. The objective of the sensor network was the
localization of hazardous contaminants in an unknown
large-scale area. The robots have a swarm controller
that controls the localization behavior of each
robot, whose actions are based on a fuzzy logic control
system that is identical for all robots.

Control cooperation of robots has been of great
interest in recent years. Fuzzy logic controllers have
obtained good performance in some simulation
experiments. The idea of applying fuzzy controllers comes
from the fact that soft computing techniques have
proved to be efficient for poorly defined system
optimization and multi-agent coordination. In [20.85]
a multi-agent control system was proposed, based on
a fuzzy inference system for a group of two wheeled
mobile robots executing a common task. An application
of this control system is the control of robotic
formations moving on the plane such as a group of guard
robots taking care of an area and dealing with
potential intruders. The use of fuzzy logic in this work allows
easy expression of rules, and the multi-agent structure
supports separation of team and individual knowledge.
Another example can be found in [20.86]. In this work,
a collision-free target tracking problem of a multi-agent
robot system was presented. Game theory provides an
effective tool to solve this problem. To enhance
robustness, a fuzzy controller tunes the cost function weights
directly for the game theoretic solution and helps to
achieve a prescribed value of cost function components.


20.10 Legged Robots


Wheeled robots have dominated the state of the art
of mobile robots. However, in the last decade, there
has been an interest in finding alternatives for those
environments in which wheeled robots are not able to
operate. When the terrain is variable and unprepared,
adding legs to robots might be a solution. Legged
robots can navigate on and adapt to any kind of
surface (such as rough, rocky, sandy, and steep terrains)
and step over obstacles. Moreover,
legged robots help the exploration of human and
animal locomotion.

One of the main differences between legged and
wheeled robots is that legged robots require the system
to generate an appropriate gait to move, whereas wheels
just need to roll. To clarify this, gait is the movement
pattern of limbs in animals and humans used for
locomotion over a variety of surfaces. The same concept is
used to design the pattern of movement of robots on
different surfaces. In [20.87], the learning of a biped gait
was solved using reinforcement learning. The aim of
this work was to improve the learning rate through the
incorporation of expert knowledge using fuzzy logic.
This fuzzy logic was incorporated in the reinforcement
system through neuro-fuzzy architectures. Moreover,
fuzzy rule-based feedback is incorporated instead of
numerical reinforcement signals. A different approach
was taken in [20.88], where two different
genetic algorithm approaches were used in order to
improve the performance of a FLC designed to model the
gait generation problem of a biped robot. In both works,
computer simulations were done in order to compare
the different approaches of the control systems in terms
of stability.

The work in [20.89] focused on the design of a leg
for a quadrupedal galloping machine. For that, two
intelligent strategies, a fuzzy and a heuristic controller,
were developed for verification on a one-legged system.
The fuzzy controller consists of a fuzzy rule base with
an adaptation mechanism that modifies the rule output
centers to correct velocity errors. These techniques were
successfully implemented for operating one leg at speeds
necessary for a dynamic gallop. It was shown that the 
fuzzy controller outperformed the heuristic controller 
without relying on a model of the system. 
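
To make the idea of adapting rule output centers more tangible, the following Python sketch shifts each center in proportion to how strongly its rule fired and to the current velocity error. It is only a sketch under stated assumptions: the three triangular sets, the update rule `center += rate * firing * error`, and the first-order toy plant are hypothetical choices and do not reproduce the controller of [20.89].

```python
def tri(x, a, b, c):
    """Triangular membership with feet a and c and peak b (a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


class AdaptiveFuzzyController:
    """Three-rule fuzzy controller whose rule output centers adapt online."""

    def __init__(self, rate=0.5):
        # Hypothetical sets over the velocity error: negative, zero, positive.
        self.sets = [(-2.0, -1.0, 0.0), (-1.0, 0.0, 1.0), (0.0, 1.0, 2.0)]
        self.centers = [0.0, 0.0, 0.0]  # one output center per rule
        self.rate = rate

    def firing(self, error):
        # Saturate the error so that at least one rule always fires.
        e = max(-1.0, min(1.0, error))
        return [tri(e, *s) for s in self.sets]

    def output(self, error):
        # Weighted average of the output centers (zero-order Sugeno style).
        w = self.firing(error)
        return sum(wi * ci for wi, ci in zip(w, self.centers)) / sum(w)

    def adapt(self, error):
        # Assumed update: move each center in the direction that reduces
        # the error, scaled by how much its rule contributed.
        for i, wi in enumerate(self.firing(error)):
            self.centers[i] += self.rate * wi * error


def simulate(steps=200, target=1.0, dt=0.1):
    """Toy first-order plant dv/dt = u - v, used only for illustration."""
    ctrl, v = AdaptiveFuzzyController(rate=0.1), 0.0
    for _ in range(steps):
        error = target - v
        u = ctrl.output(error)
        ctrl.adapt(error)
        v += dt * (u - v)
    return v, ctrl.centers
```

In this toy loop the adapted centers behave much like an integral term: they keep changing while a velocity error persists, and no model of the plant enters the controller itself.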

Finally, regarding the gait problem, in [20.90] a fuzzy
logic vertical ground reaction force controller was de- 
veloped for a robotic cadaveric gait, which altered 
tendon forces in real time and iteratively adjusted the 
robotic trajectory in order to track a target reaction. 
This controller was validated using a novel dynamic 
cadaveric gait simulator. The fuzzy logic rule-based 
controller was able to track the target with a very low 
tracking error, demonstrating its ability to accurately 
control this type of robot. 

Besides the gait problem, the biped robotic system
involves a great deal of uncertainty associated
with the mechanism dynamics and environment
parameters. In [20.91] it was suggested that type- 
2 fuzzy logic control systems could be a better 
way to deal with the uncertainty in a robotic sys- 
tem. A novel type-2 fuzzy switching control sys- 
tem was proposed for biped robots, which includes 
a type-2 fuzzy modeling algorithm. As in the pre- 
vious work, simulated experiments were used in or- 
der to compare the performance of the controller 
proposed with other dynamical intelligent control 
methods. 


In [20.92] a fuzzy controller, consisting of a fuzzy 
prefilter (designed by a genetic algorithm) in the feed- 
forward loop and a PID-like fuzzy controller in the 
feed-back loop, was proposed for foot trajectory track- 
ing control of a hydraulically actuated hexapod robot. 
A COMET-III real robot was used in this work, and
the experimental results show that the proposed
controller achieves better foot trajectory tracking
performance than an optimal classical controller
like the state feedback linear-quadratic regulator (LQR) 
controller. 


20.11 Exoskeletons and Rehabilitation Robots 


The latest advances in assistive robotics have had a great
impact in different fields. For instance, in military
applications, these technologies can allow soldiers to carry
a higher payload and walk further without requiring 
more effort or producing fatigue. However, the field 
where the impact of this type of robotics is the greatest 
is healthcare. The aging of the population will be one of 
the main problems in the near future, since more peo- 
ple are going to need some type of assistance on a daily 
basis. This dependency suffered by the elderly can be 
partially resolved with the use of robotic systems that 
help people with a lack of mobility or strength. Robotic 
systems for assistance can provide total or partial move- 
ment to these people. Moreover, rehabilitation using 
these systems can make regaining movement-related 
functions easier and faster. 

Several studies in recent years have applied
fuzzy techniques to the rehabilitation of upper-limb
motion (shoulder joint motion and elbow joint
motion). The principal reason to use fuzzy logic is to
deal with complicated, ill-defined, and dynamic
processes, which are intrinsically difficult to model
mathematically. Moreover, fuzzy logic control
incorporates human knowledge and experience directly
without relying on a detailed model of the control
system.

In [20.93] an exoskeleton and its fuzzy control
system to assist the human upper-limb motion of
physically weak persons were presented. The proposed robot
automatically assists human motion mainly based on 
electromyogram signals on the skin surface. In a later 
work [20.94], the authors introduced a hierarchical 
neuro-fuzzy controller for a robotic exoskeleton where 


the angles of the elbow and shoulder are modeled us- 
ing fuzzy sets. Additionally, fuzzy sets are used in 
order to set a trigger in the activity of the muscles. 
In order to solve the same problem, a hybrid posi- 
tion/force fuzzy logic control system was presented 
in [20.95]. The objective of this work was to assist the 
subject in performing both passive and active move- 
ments along the designed trajectories with specified 
loads. 

More recent studies have been published under the 
paradigm of fuzzy sliding mode control. In [20.96] 
a novel adaptive self-organizing fuzzy sliding mode 
control for the control of a 3-degree-of-freedom (DOF) 
rehabilitation robot was presented. An interesting char- 
acteristic of this approach is the ability to establish 
and regulate the fuzzy rule base dynamically. Going
a step further along that same line, in [20.97]
an adaptive self-organizing fuzzy sliding mode
controller was proposed for a 2-DOF rehabilitation
robot.

For comparison and performance measurement pur- 
poses, one common practice in order to examine the 
effectiveness of the proposed exoskeleton in motion as- 
sistance is to use human subjects who perform different 
cooperative motions of the elbow and shoulder [20.93]. 
Different performance measures can be used to show 
the correctness of the proposed systems. In [20.94] 
the angles obtained by the exoskeleton were compared, 
while in [20.95] the results were shown in terms of force 
and stability. Finally, in [20.96, 97], the performance of 
the robotic rehabilitation system was measured in terms 
of response of the system to movements and tracking 
errors. 


20.12 Emotional Robots


Future robots need a transparent interface that regular 
people can interpret, such as an emotional human-like 
face. Moreover, such robots must exhibit behaviors that 
are perceived as believable and lifelike. In general
terms, the use of fuzzy techniques in this field has not
had a great impact. However, the fuzzy approach can
not only simplify the design task, but also enrich the 
interaction between humans and robots. 

An application of fuzzy logic for this type of robot 
can be found in [20.98]. In this research it was proposed 
to use fuzzy logic for effectively building the whole be- 
havior system of face emotion expression robots. It was 
shown how these behaviors could be constructed by 
a fuzzy architecture that not only seems more realistic 
but can also be easily implemented. 

In [20.73] a fuzzy system that establishes a level of
possibility about the degree of interest that the people
around the robot may have in interacting with it was
presented. Firstly, a method to detect and track persons
using stereo vision was proposed. Then, the interest of
each person was computed using fuzzy logic by
analyzing their position and level of attention to the robot.
The level of attention is estimated by analyzing whether
or not the person is looking at the robot.

A more recent work on video-based emotion
recognition was presented in [20.99]. In this work, a fuzzy
rule-based approach was used for emotion
recognition from facial expressions. The fuzzy
classification itself analyzes the deformation of a face
separately in each image. In contrast to most existing
approaches, blended emotions with varying
intensities, as proposed by psychologists, can also be
handled. Other work based on physiological
measures was presented in [20.100]. In this
work, a fuzzy inference engine was developed to
estimate human responses. The authors demonstrated
in a later work [20.101] that a hidden Markov
model is able to achieve better classification
results than the previously reported fuzzy inference
engine.


20.13 Fuzzy Modeling


Some extensions to fuzzy logic have been developed
in the last decade in the field of mobile robotics.
In [20.102] a probabilistic type-2 FLS was proposed for
modeling and control. This proposal aims to solve the
lack of capability of FLSs to handle various uncertainties
identified by this work in practical applications. Two examples
were used to validate the probabilistic fuzzy model:
a function approximation and a robotic application. The
robotic application was successfully implemented for
the control of a simulated biped robot.

A novel representation of robot kinematics was
proposed in [20.103, 104] in order to merge qualitative
and quantitative reasoning. Fuzzy reasoning is good
at communicating with sensing and control level
subsystems by means of fuzzification and defuzzification
methods. It has powerful reasoning strategies utilizing
compiled knowledge through conditional statements
so as to easily handle mathematical and engineering
systems. Fuzzy reasoning also provides a means for
handling uncertainty in a natural way, making it robust
in significantly noisy environments. However, in this
work a lack of ability in fuzzy reasoning alone to deal
with qualitative inference about complex systems was
pointed out.

It is argued that qualitative reasoning can
compensate for this drawback. Qualitative reasoning has the
advantage of operating at the conceptual modeling
level, reasoning symbolically with models that retain
the mathematical structure of the problem rather than
the input/output representation of rule bases. Moreover,
the computational cause-effect relations contained in
qualitative models facilitate analyzing and explaining
the behavior of a structural model. The kinematics
of a PUMA 560 robot was modelled for the trajectory
tracking task, demonstrating the ability of
fuzzy reasoning. Simulation results demonstrated that
the proposed method effectively provides a two-way
connection for robot representations used for both
numerical and symbolic robotic tasks.
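
The fuzzification and defuzzification steps mentioned above as the link between fuzzy reasoning and the sensing and control levels can be sketched in a few lines of Python. This is a generic Mamdani-style example and not the representation of [20.103, 104]; the distance and speed sets, the three rules, and the function names are hypothetical.

```python
def tri(x, a, b, c):
    """Triangular membership with feet a and c and peak b (a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


# Hypothetical linguistic terms for an obstacle distance in [0, 2] (m)
# and a normalized speed command in [0, 1].
DISTANCE = {"near": (-1.0, 0.0, 1.0),
            "medium": (0.0, 1.0, 2.0),
            "far": (1.0, 2.0, 3.0)}
SPEED = {"slow": (-0.5, 0.0, 0.5),
         "cruise": (0.0, 0.5, 1.0),
         "fast": (0.5, 1.0, 1.5)}
RULES = {"near": "slow", "medium": "cruise", "far": "fast"}


def fuzzify(x, sets):
    """Map a crisp sensor reading to membership degrees per label."""
    return {label: tri(x, *abc) for label, abc in sets.items()}


def defuzzify(strengths, lo=0.0, hi=1.0, n=201):
    """Centroid of the max-aggregated, clipped output sets on a grid."""
    num = den = 0.0
    for i in range(n):
        y = lo + (hi - lo) * i / (n - 1)
        mu = max(min(strengths[label], tri(y, *SPEED[label]))
                 for label in strengths)
        num += y * mu
        den += mu
    return num / den if den > 0.0 else (lo + hi) / 2.0


def speed_command(distance):
    """Crisp reading -> fuzzy degrees -> rule firing -> crisp command."""
    degrees = fuzzify(distance, DISTANCE)
    strengths = {RULES[label]: d for label, d in degrees.items()}
    return defuzzify(strengths)
```

For example, `fuzzify(0.5, DISTANCE)` gives a degree of 0.5 to both "near" and "medium" and 0.0 to "far", and `speed_command` returns lower values for short distances than for long ones.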


20.14 Comments and Conclusions


The significance of the contributions of fuzzy logic to 
robotics is quite different in the three stages that we 
have identified. In the first stage, the objective was 
to endow the robot with a set of simple behaviors to 
solve basic tasks like wall-following, obstacle avoid- 
ance, moving object tracking, trajectory tracking, etc. 
Fuzzy logic significantly contributed in this stage, not 
only with the design of behaviors, but also with the co- 
ordination or fusion among them (more than 75% of 
the papers considered in this chapter deal with these 
topics). In the second stage, the focus moved to im- 
plementing autonomous robots that are able to operate 
in real environments (museums, hospitals, homes, . . .), 
which should, therefore, be able to generate a map of 
the environment, localize in the map, and also navigate 
between different positions (motion planning). The first 
two tasks were joined under the SLAM field, which has 
been one of the most important topics in robotics in
recent years. Motion planning has been another relevant
field that has experienced great improvement with the
use of heuristic search algorithms and the inclusion of
kinematic constraints in the planning. The contributions
of fuzzy logic to this second wave have been marginal; 
SLAM techniques are dominated by probabilistic and 
optimization approaches, while the best motion plan- 
ning proposals rely on heuristic search processes and 
probabilistic approaches to manage the uncertainty. 

We have assessed the actual impact of the
recently reported research on fuzzy-based approaches
to robotics research and applications from a
quantitative/qualitative point of view. In order to have
an estimate, we have considered the two journals
with the highest impact factor in the 2012
Thomson-Reuters Web of Knowledge (the International Journal
of Robotics Research, IJRR, and IEEE Transactions on
Robotics, IEEE-TR) and looked for papers that
included the term fuzzy in the Abstract in the period
considered (2003-2013). We found that only one paper
fulfilled such conditions in IJRR and only seven papers
in IEEE-TR. These results indicate that the actual
impact and diffusion of fuzzy approaches in the relevant
robotics arena is very limited. A vast majority of
research and application results of fuzzy approaches in
robotics are, therefore, published and presented in
papers and conferences related to soft computing. In fact,


only 22 out of the 98 papers considered in this chapter
(i.e., 22%) were published in robotics-related forums.
Among these, almost all papers (20 out of 22, 91%)
described FLC of the Mamdani type for different tasks,
which suggests that FLC is without a doubt the area
with the highest impact in papers and conferences of the
most genuine robotics area (i.e., outside the soft
computing related publications).

Furthermore, FLC is the most active area of re- 
search and applications, since 66% of the papers con- 
sidered in this chapter describe Mamdani-based fuzzy 
controllers for all the behaviors and high-level tasks 
considered. Other hybrid methodologies such as neuro- 
fuzzy networks or fuzzy-based ones such as type-2 
fuzzy sets and Takagi-Sugeno rules follow at a large 
distance (12%, 9%, and 7%, respectively), but with al- 
most no impact in robotic-centered publications. 

Although these topics are still open, nowadays 
the focus in robotics is moving to other higher-level 
fields (third stage), like perception, learning of complex 
behaviors, or human-robot interaction. Perception re- 
quirements are not just the construction of occupancy 
or feature-based maps, but the recognition of objects, 
the classification of objects, the identification of ac- 
tions, etc.; in summary, scene understanding. Moreover, 
perception has to combine different sources of infor- 
mation, with visual and volumetric data being the two 
main sources. Contributions from fuzzy techniques are 
still few in number, but from our point of view, they can 
contribute to this topic and will surely do so in those 
cases where high-level reasoning is required, and also 
in the description of scenes. 

From the point of view of human-robot interaction, 
there are two directions in which fuzzy logic may also 
contribute significantly. The first one is the interpre- 
tation of the emotional state of the people interacting 
with the robot. This interpretation uses several infor- 
mation sources (visual, acoustic, etc.), requires expert 
knowledge to build the classification rules, and the kind 
of uncertainty of the data could be adequately modeled 
with fuzzy sets. The second direction is the expressive- 
ness of the robot, which will be fundamental for social 
robotics. Again, and for the same reasons as for the 
previous topic, fuzzy logic approaches may generate 
significant contributions in the field. 


The rough set (RS) approach was proposed by
Pawlak as a tool to deal with imperfect knowl- 
edge. Over the years the approach has attracted 
attention of many researchers and practitioners all 
over the world, who have contributed essentially 
to its development and applications. This chapter 
discusses the RS foundations from rudiments to 
challenges. 

21.1 Rough Sets: Comments on Development


The rough set (RS) approach was proposed by Zdzisław
Pawlak in 1982 [21.1,2] as a tool for dealing with im- 
perfect knowledge, in particular with vague concepts. 
Many applications of methods based on rough set the- 
ory alone or in combination with other approaches have 
been developed. This chapter discusses the RS founda- 
tions from rudiments to challenges. 

In the development of rough set theory and appli- 
cations, one can distinguish three main stages. While 
the first period was based on the assumption that ob- 
jects are perceived by means of partial information 
represented by attributes, in the second period it was 
assumed that information about the approximated con- 
cepts is also partial. Approximation spaces and search- 
ing strategies for relevant approximation spaces were 
recognized as the basic tools for rough sets. Impor- 
tant achievements both in theory and applications were 
obtained. Nowadays, a new period for rough sets is 
emerging, which is also briefly characterized in this 
chapter. 
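
The first-stage assumption, that objects are perceived only through the values of their attributes, leads to the standard lower and upper approximations of a concept. The following Python sketch illustrates them; the toy table, the attribute names, and the function names are hypothetical and serve only as an example.

```python
from collections import defaultdict


def indiscernibility_classes(table, attributes):
    """Group objects that have identical values on the chosen attributes."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attributes)].add(obj)
    return list(classes.values())


def approximations(table, attributes, concept):
    """Return the (lower, upper) approximations of a set of objects."""
    lower, upper = set(), set()
    for block in indiscernibility_classes(table, attributes):
        if block <= concept:   # block lies entirely inside the concept
            lower |= block
        if block & concept:    # block overlaps the concept
            upper |= block
    return lower, upper


# Hypothetical information table: six patients, two attributes.
TABLE = {
    "p1": {"headache": "no", "temp": "high"},
    "p2": {"headache": "yes", "temp": "high"},
    "p3": {"headache": "yes", "temp": "very_high"},
    "p4": {"headache": "no", "temp": "normal"},
    "p5": {"headache": "yes", "temp": "high"},
    "p6": {"headache": "no", "temp": "very_high"},
}
FLU = {"p1", "p2", "p3", "p6"}  # the concept to be approximated

lower, upper = approximations(TABLE, ["headache", "temp"], FLU)
boundary = upper - lower
```

Here p2 and p5 have the same attribute values although only p2 belongs to the concept, so both fall into the boundary region: the lower approximation is {p1, p3, p6} and the upper approximation is {p1, p2, p3, p5, p6}. A concept with a non-empty boundary is rough with respect to the chosen attributes.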

The rough set approach seems to be of fundamental
importance in artificial intelligence (AI) and cognitive
sciences, especially in machine learning, data mining,
knowledge discovery from databases, pattern
recognition, decision support systems, expert systems,
intelligent systems, multiagent systems, adaptive systems,
autonomous systems, inductive reasoning, commonsense
reasoning, adaptive judgment, and conflict analysis.

Rough sets have established relationships with 
many other approaches such as fuzzy set theory, gran- 
ular computing (GC), evidence theory, formal concept 
analysis, (approximate) Boolean reasoning, multicri- 
teria decision analysis, statistical methods, decision 
theory, and matroids. Despite the overlap with many
other theories, rough set theory may be considered
an independent discipline in its own right. There are
reports on many hybrid methods obtained by combining
rough sets with other approaches such as soft comput- 
ing (fuzzy sets, neural networks, genetic algorithms), 
statistics, natural computing, mereology, principal com- 
ponent analysis, singular value decomposition, or sup- 
port vector machines. 

The main advantage of rough set theory in data analysis
is that it does not necessarily need any preliminary
or additional information about data, such as probability
distributions in statistics, basic probability assignments
in evidence theory, or a grade of membership or the value
of possibility in fuzzy set theory.

One can observe the following advantages of the
rough set approach:


i) Introduction of efficient algorithms for finding hid- 
den patterns in data. 

ii) Determination of optimal sets of data (data reduc- 
tion); evaluation of the significance of data. 

iii) Generation of sets of decision rules from data. 

iv) Easy-to-understand formulation. 

v) Straightforward interpretation of results obtained. 

vi) Suitability of many of its algorithms for parallel 
processing. 


Due to space limitations, many important research 
topics in rough set theory such as various logics related 
to rough sets and many advanced algebraic properties 
of rough sets are only mentioned briefly in this chapter. 

For the same reason, we herein restrict the references
on rough sets to the basic papers by Zdzisław
Pawlak (such as [21.1, 2]), some survey papers [21.3-5],
and some books including long lists of references
to papers on rough sets. The basic ideas of rough set
theory and its extensions as well as many interesting
applications can be found in a number of books, issues
of Transactions on Rough Sets, special issues of other
journals, numerous proceedings of international conferences,
and tutorials [21.3, 6, 7]. The reader is referred to
the cited books and papers, references therein, as well
as to web pages [21.8, 9].

The chapter is structured as follows. In Sect. 21.2
we discuss some basic issues related to vagueness and
vague concepts. The rough set philosophy is outlined
in Sect. 21.3. The basic concepts for rough sets such
as indiscernibility and approximation are presented in
Sect. 21.4. Decision systems and rules are covered in
Sect. 21.5. The basic information about dependencies
is included in Sect. 21.6. Attribute reduction, one
of the basic problems of rough sets, is discussed
in Sect. 21.7. The rough membership function as a tool for
measuring degrees of inclusion of sets is presented in
Sect. 21.8. The role of discernibility and Boolean reasoning
for solving problems related to rough sets is
briefly explained in Sect. 21.9. In Sect. 21.10 a short
discussion on rough sets and induction is included.
Several generalizations of the approach proposed by
Pawlak are discussed in Sect. 21.11. In this section
some emerging research directions related to rough sets
are also outlined. In Sect. 21.12 some comments about
logics based on rough sets are included. The role of
adaptive judgment is emphasized.


21.2 Vague Concepts


Mathematics requires that all mathematical notions
(including sets) must be exact, otherwise precise reasoning
would be impossible. However, philosophers
[21.10], and recently computer scientists as well as
other researchers, have become interested in vague
(imprecise) concepts. Moreover, in the twentieth century
one can observe a paradigm shift in modern science
from dealing with precise concepts to dealing with vague
concepts, especially in the case of complex systems (e.g., in
economy, biology, psychology, sociology, and quantum
mechanics).

In classical set theory, a set is uniquely determined
by its elements. In other words, this means that every
element must be uniquely classified as belonging
to the set or not. That is to say, the notion of a set
is a crisp (precise) one. For example, the set of odd
numbers is crisp because every integer is either odd or
even.

In contrast to odd numbers, the notion of a beautiful
painting is vague, because we are unable to classify
uniquely all paintings into two classes: beautiful and
not beautiful. With some paintings it cannot be decided
whether they are beautiful or not and thus they remain
in the doubtful area. Thus, beauty is not a precise but
a vague concept.

Almost all concepts that we use in natural lan- 
guage are vague. Therefore, common sense reasoning 
based on natural language must be based on vague 
concepts and not on classical logic. An interesting dis- 
cussion of this issue can be found in [21.11]. The idea 
of vagueness can be traced back to the ancient Greek 
philosopher Eubulides of Megara (ca. 400 BC) who 
first formulated the so-called sorites (heap) and falakros 
(bald man) paradoxes [21.10]. There is a huge literature 
on issues related to vagueness and vague concepts in 
philosophy [21.10]. 

Vagueness is often associated with the boundary region
approach (i.e., the existence of objects which cannot
be uniquely classified relative to a set or its complement),
which was first formulated in 1893 by the father
of modern logic, the German logician Gottlob Frege
(1848-1925) [21.12]. According to Frege, the concept
must have a sharp boundary. To the concept without
a sharp boundary there would correspond an area that
would not have any sharp boundary line all around.
This means that mathematics must use crisp, not vague
concepts, otherwise it would be impossible to reason
precisely.

One should also note that vagueness also relates to
insufficient specificity, as the result of a lack of feasible
searching methods for sets of features adequately
describing concepts. A discussion on vague (imprecise)
concepts in philosophy includes the following characteristic
features [21.10]: (i) the presence of borderline
cases, (ii) boundary regions of vague concepts are not
crisp, (iii) vague concepts are susceptible to sorites
paradoxes. In the sequel we discuss the first two issues
in the rough set (RS) framework. The reader can find a
discussion on the application of the RS approach to the
third item in [21.11].


21.3 Rough Set Philosophy


Rough set philosophy is founded on the assumption that
with every object of the universe of discourse we associate
some information (data, knowledge). For example,
if objects are patients suffering from a certain disease,
symptoms of the disease form information about the
patients. Objects characterized by the same information
are indiscernible (similar) in view of the available
information about them.

The indiscernibility relation generated in this way
is the mathematical basis of rough set theory. This
understanding of indiscernibility is related to the idea
of Gottfried Wilhelm Leibniz that objects are indiscernible
if and only if all available functionals take
identical values on them (Leibniz’s law of indiscernibility:
the identity of indiscernibles) [21.13]. However,
in the rough set approach indiscernibility is defined
relative to a given set of functionals (attributes).

Any set of all indiscernible (similar) objects is
called an elementary set and forms a basic granule
(atom) of knowledge about the universe. Any union
of some elementary sets is referred to as a crisp
(precise) set. If a set is not crisp, then it is called rough
(imprecise, vague). Consequently, each rough set has
borderline (boundary-line) cases, i.e., objects which
cannot be classified with certainty as members of either
the set or its complement. Obviously, crisp sets have no
borderline elements at all. This means that borderline
cases cannot be properly classified by employing
available knowledge.

Thus, the assumption that objects can be seen only
through the information available about them leads to
the view that knowledge has granular structure. Due
to the granularity of knowledge, some objects of interest
cannot be discerned and appear as the same (or
similar). As a consequence, vague concepts, in contrast
to precise concepts, cannot be characterized in terms
of information about their elements. Therefore, in the
proposed approach, we assume that any vague concept
is replaced by a pair of precise concepts, called the
lower and the upper approximation of the vague concept.
The lower approximation consists of all objects
which definitely belong to the concept and the upper
approximation contains all objects which possibly belong
to the concept. The difference between the upper
and the lower approximation constitutes the boundary
region of the vague concept. Approximation operations
are the basic operations in rough set theory. Properties
of the boundary region (expressed, e.g., by the rough
membership function) are important in the rough set
methods.

Hence, rough set theory expresses vagueness not by 
means of membership, but by employing a boundary re- 
gion of a set. If the boundary region of a set is empty it 
means that the set is crisp, otherwise the set is rough 
(inexact). A nonempty boundary region of a set means 
that our knowledge about the set is not sufficient to de- 
fine the set precisely. 

Rough set theory is not an alternative to classical
set theory but is embedded in it. Rough set theory
can be viewed as a specific implementation of Frege’s
idea of vagueness, i.e., imprecision in this approach is
expressed by a boundary region of a set.


21.4 Indiscernibility and Approximation 


The starting point of rough set theory is the indiscernibility
relation, which is generated by information about
objects of interest (Sect. 21.1). The indiscernibility
relation expresses the fact that due to a lack of information
(or knowledge) we are unable to discern some objects
by employing available information (or knowledge).
This means that, in general, we are unable to deal with
each particular object but we have to consider granules
(clusters) of indiscernible objects as a fundamental
basis for our theory.

From a practical point of view, it is better to define 
basic concepts of this theory in terms of data. There- 
fore, we will start our considerations from a data set 
called an information system. An information system
can be represented by a data table containing rows
labeled by objects of interest and columns labeled by
attributes; the entries of the table are attribute values.
For example, a data table can describe a set of pa- 
tients in a hospital. The patients can be characterized 
by some attributes, like age, sex, blood pressure, body 
temperature, etc. With every attribute a set of its val- 
ues is associated, e.g., values of the attribute age can 
be young, middle, and old. Attribute values can also be 
numerical. In data analysis the basic problem that we 
are interested in is to find patterns in data, i.e., to find 
a relationship between some set of attributes, e.g., we 
might be interested whether blood pressure depends on 
age and sex. 

More formally, suppose we are given a pair
$\mathcal{A} = (U, A)$ of nonempty, finite sets $U$ and $A$,
where $U$ is the universe of objects and $A$ is a set
consisting of attributes, i.e., functions $a\colon U \to V_a$,
where $V_a$ is the set of values of attribute $a$, called the
domain of $a$. The pair $\mathcal{A} = (U, A)$ is called an
information system. Any information system can be
represented by a data table with rows labeled by objects
and columns labeled by attributes. Any pair $(x, a)$,
where $x \in U$ and $a \in A$, defines the table entry
consisting of the value $a(x)$. Note that in statistics or
machine learning such a data table is called a sample
[21.14].

Any subset $B$ of $A$ determines a binary relation
$\mathrm{IND}(B)$ on $U$, called an indiscernibility relation,
defined by

$$x\,\mathrm{IND}(B)\,y \text{ if and only if } a(x) = a(y) \text{ for every } a \in B, \tag{21.1}$$

where $a(x)$ denotes the value of attribute $a$ for the
object $x$.

Obviously, $\mathrm{IND}(B)$ is an equivalence relation. The
family of all equivalence classes of $\mathrm{IND}(B)$, i.e.,
the partition determined by $B$, will be denoted by
$U/\mathrm{IND}(B)$, or simply $U/B$; the equivalence class of
$\mathrm{IND}(B)$, i.e., the block of the partition $U/B$,
containing $x$ will be denoted by $B(x)$ (other notation used:
$[x]_B$ or $[x]_{\mathrm{IND}(B)}$). Thus in view of the data we are
unable, in general, to observe individual objects but we are
forced to reason only about the accessible granules of
knowledge.

If $(x, y) \in \mathrm{IND}(B)$ we will say that $x$ and $y$ are
$B$-indiscernible. Equivalence classes of the relation
$\mathrm{IND}(B)$ (or blocks of the partition $U/B$) are referred
to as $B$-elementary sets or $B$-elementary granules.
In the rough set approach the elementary sets are
the basic building blocks (concepts) of our knowledge
about reality. The unions of $B$-elementary sets
are called $B$-definable sets. Let us note that in applications
we consider only some subsets of the family
of definable sets, e.g., defined by conjunction
of descriptors only. This is due to the computational
complexity of the searching problem for relevant
definable sets in the whole family of definable
sets.

For $B \subseteq A$, we denote by $\mathrm{Inf}_B(x)$ the $B$-signature
of $x \in U$, i.e., the set $\{(a, a(x)) : a \in B\}$. Let
$\mathrm{INF}(B) = \{\mathrm{Inf}_B(x) : x \in U\}$. Then for any objects
$x, y \in U$ the following equivalence holds:
$x\,\mathrm{IND}(B)\,y$ if and only if $\mathrm{Inf}_B(x) = \mathrm{Inf}_B(y)$.
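The equivalence between indiscernibility and equal signatures suggests a direct way to compute the partition U/B: group objects by their B-signatures. The following is a minimal Python sketch under stated assumptions; the table contents and the names `TABLE`, `signature`, and `partition` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical toy information system: each object maps to its attribute values.
TABLE = {
    "p1": {"age": "young", "sex": "f", "bp": "normal"},
    "p2": {"age": "young", "sex": "f", "bp": "high"},
    "p3": {"age": "old", "sex": "m", "bp": "high"},
    "p4": {"age": "old", "sex": "m", "bp": "high"},
}


def signature(table, x, B):
    """Inf_B(x): the set of (attribute, value) pairs of object x for a in B."""
    return frozenset((a, table[x][a]) for a in B)


def partition(table, B):
    """U/B: objects with equal B-signatures fall into the same elementary set."""
    blocks = defaultdict(set)
    for x in table:
        blocks[signature(table, x, B)].add(x)
    return [frozenset(block) for block in blocks.values()]


blocks = partition(TABLE, ["age", "sex"])
print(sorted(sorted(b) for b in blocks))  # [['p1', 'p2'], ['p3', 'p4']]
```

Grouping by signature needs a single pass over the table, so the partition is obtained without comparing every pair of objects.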

The indiscernibility relation is used to define the
approximations of concepts. We define the following two
operations on sets $X \subseteq U$

$$B_*(X) = \{x \in U : B(x) \subseteq X\}, \tag{21.2}$$

$$B^*(X) = \{x \in U : B(x) \cap X \neq \emptyset\}, \tag{21.3}$$

assigning to every subset $X$ of the universe $U$ two sets
$B_*(X)$ and $B^*(X)$ called the $B$-lower and the $B$-upper
approximation of $X$, respectively. The set

$$\mathrm{BN}_B(X) = B^*(X) - B_*(X) \tag{21.4}$$

will be referred to as the $B$-boundary region of $X$.

From the definition we obtain the following in- 
terpretation: (i) the lower approximation of a set X 
with respect to B is the set of all objects, which can 
for certain be classified to X using B (are certainly 
in X in view of B), (ii) the upper approximation of 
a set X with respect to B is the set of all objects which 
can possibly be classified to X using B (are possi- 
bly in X in view of B), (iii) the boundary region of 
a set X with respect to B is the set of all objects, 
which can be classified neither to X nor to not-X us- 
ing B. 
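The three regions can be computed directly from their definitions. The sketch below is only an illustration: the table, the concept `X`, and the function names are hypothetical, and the elementary set B(x) is obtained by grouping objects with identical values on B.

```python
# Hypothetical toy information system with two attributes.
TABLE = {
    "x1": {"headache": "yes", "temp": "high"},
    "x2": {"headache": "yes", "temp": "high"},
    "x3": {"headache": "no", "temp": "high"},
    "x4": {"headache": "no", "temp": "normal"},
    "x5": {"headache": "no", "temp": "normal"},
}


def elementary_sets(table, B):
    """Map every object x to its elementary set B(x)."""
    groups = {}
    for x, row in table.items():
        groups.setdefault(tuple(row[a] for a in B), set()).add(x)
    return {x: frozenset(groups[tuple(row[a] for a in B)]) for x, row in table.items()}


def lower(table, B, X):
    """B-lower approximation: objects whose elementary set is included in X."""
    block = elementary_sets(table, B)
    return {x for x in table if block[x] <= X}


def upper(table, B, X):
    """B-upper approximation: objects whose elementary set intersects X."""
    block = elementary_sets(table, B)
    return {x for x in table if block[x] & X}


def boundary(table, B, X):
    """B-boundary region: upper approximation minus lower approximation."""
    return upper(table, B, X) - lower(table, B, X)


B = ["headache", "temp"]
X = {"x1", "x3", "x4"}  # a concept given only by its examples
print(sorted(lower(table=TABLE, B=B, X=X)))     # ['x3']
print(sorted(boundary(table=TABLE, B=B, X=X)))  # ['x1', 'x2', 'x4', 'x5']
```

Here only x3 can be classified to X with certainty, while x1, x2, x4, and x5 lie in the boundary because each shares its elementary set with an object on the other side of the concept.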

Due to the granularity of knowledge, rough sets 
cannot be characterized by using available knowledge. 
The definition of approximations is clearly depicted in
Fig. 21.1.

The approximations have the following properties

$$\begin{aligned}
& B_*(X) \subseteq X \subseteq B^*(X), \\
& B_*(\emptyset) = B^*(\emptyset) = \emptyset, \quad B_*(U) = B^*(U) = U, \\
& B^*(X \cup Y) = B^*(X) \cup B^*(Y), \\
& B_*(X \cap Y) = B_*(X) \cap B_*(Y), \\
& X \subseteq Y \text{ implies } B_*(X) \subseteq B_*(Y) \text{ and } B^*(X) \subseteq B^*(Y), \\
& B_*(X \cup Y) \supseteq B_*(X) \cup B_*(Y), \\
& B^*(X \cap Y) \subseteq B^*(X) \cap B^*(Y), \\
& B_*(-X) = -B^*(X), \\
& B^*(-X) = -B_*(X), \\
& B_*(B_*(X)) = B^*(B_*(X)) = B_*(X), \\
& B^*(B^*(X)) = B_*(B^*(X)) = B^*(X).
\end{aligned} \tag{21.5}$$


Let us note that the inclusions (for union and inter- 
section) in (21.5) cannot, in general, be substituted by 
the equalities. This has some important algorithmic and 
logical consequences. 

Now we are ready to give the definition of rough
sets. If the boundary region of $X$ is the empty set, i.e.,
$\mathrm{BN}_B(X) = \emptyset$, then the set $X$ is crisp (exact) with
respect to $B$; in the opposite case, i.e., if
$\mathrm{BN}_B(X) \neq \emptyset$, the set $X$ is referred to as rough
(inexact) with respect to $B$. Thus any rough set, in contrast
to a crisp set, has a nonempty boundary region. This is the
idea of vagueness proposed by Frege.


Fig. 21.1 A rough set (labels in the figure: granules of knowledge, the universe of objects, the set, the lower approximation, the upper approximation)


Let us observe that the definition of rough sets refers 
to data (knowledge), and is subjective, in contrast to the 
definition of classical sets, which is in some sense an 
objective one. 

A rough set can also be characterized numerically
by the following coefficient

$$\alpha_B(X) = \frac{\mathrm{card}(B_*(X))}{\mathrm{card}(B^*(X))}, \tag{21.6}$$

called the accuracy of approximation, where $X \neq \emptyset$ and
$\mathrm{card}(X)$ denotes the cardinality of $X$. Obviously
$0 \leq \alpha_B(X) \leq 1$. If $\alpha_B(X) = 1$ then $X$ is crisp with
respect to $B$ ($X$ is precise with respect to $B$), and otherwise,
if $\alpha_B(X) < 1$ then $X$ is rough with respect to $B$ ($X$ is
vague with respect to $B$). The accuracy of approximation can
be used to measure the quality of approximation of decision
classes on the universe $U$. One can use another
measure of accuracy defined by $1 - \alpha_B(X)$ or by

$$1 - \frac{\mathrm{card}(\mathrm{BN}_B(X))}{\mathrm{card}(U)}.$$

Some other measures of approximation accuracy are 
also used, e.g., based on entropy or some more specific 
properties of boundary regions. The choice of a relevant 
accuracy of approximation depends on a particular data 
set. Observe that the accuracy of approximation of X 
can be tuned by B. Another approach to the accuracy of 
approximation can be based on the variable precision 
rough set model (VPRSM). 
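To make the remark that accuracy can be tuned by B concrete, the following sketch evaluates the coefficient (21.6) and the boundary-based measure for two partitions of the same universe. It is an illustration under stated assumptions: the partitions are given directly as lists of elementary sets, and all names and data are hypothetical.

```python
from fractions import Fraction


def approximations(blocks, X):
    """Lower and upper approximation of X for a partition given as a list of sets."""
    lower, upper = set(), set()
    for block in blocks:
        if block <= X:
            lower |= block
        if block & X:
            upper |= block
    return lower, upper


def accuracy(blocks, X):
    """Accuracy of approximation (21.6); X is assumed to be nonempty."""
    lower, upper = approximations(blocks, X)
    return Fraction(len(lower), len(upper))


def boundary_measure(blocks, X):
    """The alternative measure 1 - card(BN_B(X)) / card(U)."""
    universe = set().union(*blocks)
    lower, upper = approximations(blocks, X)
    return 1 - Fraction(len(upper - lower), len(universe))


coarse = [{"x1", "x2"}, {"x3"}, {"x4", "x5"}]       # partition for a small B
fine = [{"x1"}, {"x2"}, {"x3"}, {"x4"}, {"x5"}]     # partition for a larger B
X = {"x1", "x2", "x4"}

print(accuracy(coarse, X), boundary_measure(coarse, X))  # 1/2 3/5
print(accuracy(fine, X), boundary_measure(fine, X))      # 1 1
```

With the coarse partition X is rough, whereas the finer partition makes the same set crisp.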

In [21.10], it is stressed that boundaries of vague
concepts are not crisp. In the definition presented in
this chapter, the notion of boundary region is defined
as a crisp set $\mathrm{BN}_B(X)$. However, let us observe
that this definition is relative to the subjective
knowledge expressed by attributes from $B$. Different
sources of information may use different sets of attributes
for concept approximation. Hence, the boundary
region can change when we consider these different
views. Another reason for boundary change may
be related to incomplete information about concepts.
They are known only on samples of objects. Hence,
when new objects appear, the boundary region
may change again. From the discussion in the literature it
follows that vague concepts cannot be approximated
with satisfactory quality by static constructs such as
induced membership inclusion functions, approximations,
or models derived, e.g., from a sample. An
understanding of vague concepts can be only realized
in a process in which the induced models adaptively
match the concepts in dynamically changing environments.
This conclusion seems to have important consequences
for the further development of rough set
theory in combination with fuzzy sets and other soft
computing paradigms for adaptive approximate reasoning.


21.5 Decision Systems and Decision Rules 


In this section, we discuss the decision rules (con- 
structed over a selected set B of features or a family 
of sets of features), which are used in inducing clas- 
sification algorithms (classifiers), making it possible to 
classify unseen objects to decision classes. Parameters 
which are tuned in searching for a classifier with high 
quality are its description size (defined, e.g., by used 
decision rules) and its quality of classification (mea- 
sured, e.g., by the number of misclassified objects on 
a given set of objects). By selecting a proper balance 
between the accuracy of classification and the description
size, one can search for a classifier with a high quality
of classification also on testing objects. This approach 
is based on the minimum description length principle 
(MDL) [21.15]. 

In an information system $\mathcal{A} = (U, A)$ we sometimes
distinguish a partition of $A$ into two disjoint
classes $C, D \subseteq A$ of attributes, called condition and
decision (action) attributes, respectively. The tuple
$\mathcal{A} = (U, C, D)$ is called a decision system (or a decision
table).

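Let $V = \bigcup \{V_a : a \in C\} \cup \bigcup \{V_d : d \in D\}$. Atomic
formulae over $B \subseteq C \cup D$ and $V$ are expressions
$a = v$ called descriptors (selectors) over $B$ and $V$, where
$a \in B$ and $v \in V_a$. The set of formulae over $B$
and $V$, denoted by $\mathcal{F}(B, V)$, is the least set
containing all atomic formulae over $B$ and $V$ and
closed under the propositional connectives $\wedge$
(conjunction), $\vee$ (disjunction) and $\neg$ (negation). By
$\|\varphi\|_{\mathcal{A}}$ we denote the meaning of
$\varphi \in \mathcal{F}(B, V)$ in the decision system $\mathcal{A}$,
which is the set of all objects in $U$ with the property
$\varphi$. These sets are defined by

$$\begin{aligned}
\|a = v\|_{\mathcal{A}} &= \{x \in U : a(x) = v\}, \\
\|\varphi \wedge \varphi'\|_{\mathcal{A}} &= \|\varphi\|_{\mathcal{A}} \cap \|\varphi'\|_{\mathcal{A}}, \\
\|\varphi \vee \varphi'\|_{\mathcal{A}} &= \|\varphi\|_{\mathcal{A}} \cup \|\varphi'\|_{\mathcal{A}}, \\
\|\neg\varphi\|_{\mathcal{A}} &= U - \|\varphi\|_{\mathcal{A}}.
\end{aligned}$$

The formulae from $\mathcal{F}(C, V)$, $\mathcal{F}(D, V)$ are called
condition formulae of $\mathcal{A}$ and decision formulae of
$\mathcal{A}$, respectively.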
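The recursive definition of the meaning of a formula translates almost literally into code. In the sketch below, formulae are encoded as nested tuples such as `("eq", a, v)` or `("and", f, g)`; this encoding, the table, and the function name `meaning` are assumptions made for illustration only.

```python
# Hypothetical decision table; "flu" plays the role of a decision attribute.
TABLE = {
    "x1": {"headache": "yes", "temp": "high", "flu": "yes"},
    "x2": {"headache": "yes", "temp": "high", "flu": "no"},
    "x3": {"headache": "no", "temp": "high", "flu": "yes"},
    "x4": {"headache": "no", "temp": "normal", "flu": "no"},
    "x5": {"headache": "no", "temp": "normal", "flu": "no"},
}


def meaning(table, formula):
    """Return the set of objects of the table that satisfy the formula."""
    universe = set(table)
    op = formula[0]
    if op == "eq":  # descriptor a = v
        _, a, v = formula
        return {x for x in universe if table[x][a] == v}
    if op == "and":  # conjunction: intersection of meanings
        return meaning(table, formula[1]) & meaning(table, formula[2])
    if op == "or":  # disjunction: union of meanings
        return meaning(table, formula[1]) | meaning(table, formula[2])
    if op == "not":  # negation: complement in U
        return universe - meaning(table, formula[1])
    raise ValueError(f"unknown connective: {op!r}")


phi = ("and", ("eq", "headache", "no"), ("not", ("eq", "temp", "normal")))
print(sorted(meaning(TABLE, phi)))  # ['x3']
```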


Any object $x \in U$ belongs to the decision class
$\|\bigwedge_{d \in D} d = d(x)\|_{\mathcal{A}}$ of $\mathcal{A}$. All decision
classes of $\mathcal{A}$ create a partition $U/D$ of the universe $U$.

A decision rule for $\mathcal{A}$ is any expression of the
form $\varphi \Rightarrow \psi$, where $\varphi \in \mathcal{F}(C, V)$,
$\psi \in \mathcal{F}(D, V)$, and $\|\varphi\|_{\mathcal{A}} \neq \emptyset$.
Formulae $\varphi$ and $\psi$ are referred to as the
predecessor and the successor of decision rule
$\varphi \Rightarrow \psi$. Decision rules are often called
IF ... THEN ... rules. Such rules are used in machine learning.

Decision rule $\varphi \Rightarrow \psi$ is true in $\mathcal{A}$ if and only if
$\|\varphi\|_{\mathcal{A}} \subseteq \|\psi\|_{\mathcal{A}}$. Otherwise, one can
measure its truth degree by introducing some inclusion measure
of $\|\varphi\|_{\mathcal{A}}$ in $\|\psi\|_{\mathcal{A}}$. Let us denote by
$\mathrm{card}_{\mathcal{A}}(\varphi)$ (or $\mathrm{card}(\varphi)$, if this
does not lead to confusion) the number of objects from
$U$ that satisfy formula $\varphi$, i.e., the cardinality of
$\|\varphi\|_{\mathcal{A}}$. According to Łukasiewicz [21.16], one can
assign to formula $\varphi$ the value

$$\frac{\mathrm{card}(\varphi)}{\mathrm{card}(U)},$$

and to the implication $\varphi \Rightarrow \psi$ the fractional value

$$\frac{\mathrm{card}(\varphi \wedge \psi)}{\mathrm{card}(\varphi)},$$

under the assumption that $\|\varphi\|_{\mathcal{A}} \neq \emptyset$. The
fractional value proposed by Łukasiewicz was adapted much later
by the machine learning and data mining community, e.g.,
in the definitions of the accuracy of decision rules or
confidence of association rules.
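The two values above are easy to compute for a concrete rule. The following sketch restricts formulae to conjunctions of descriptors, written as dictionaries from attributes to values; this restriction, the table, and the names `matching` and `rule_values` are assumptions made for the example.

```python
from fractions import Fraction

# Hypothetical decision table with condition attributes and decision "flu".
TABLE = {
    "x1": {"headache": "yes", "temp": "high", "flu": "yes"},
    "x2": {"headache": "yes", "temp": "high", "flu": "no"},
    "x3": {"headache": "no", "temp": "high", "flu": "yes"},
    "x4": {"headache": "no", "temp": "normal", "flu": "no"},
    "x5": {"headache": "no", "temp": "normal", "flu": "no"},
}


def matching(table, descriptors):
    """Meaning of a conjunction of descriptors {attribute: value}."""
    return {x for x, row in table.items()
            if all(row[a] == v for a, v in descriptors.items())}


def rule_values(table, phi, psi):
    """Return (value of phi, value of phi => psi, truth of the rule in the table)."""
    m_phi, m_psi = matching(table, phi), matching(table, psi)
    if not m_phi:
        raise ValueError("the predecessor must be satisfied by some object")
    formula_value = Fraction(len(m_phi), len(table))            # card(phi) / card(U)
    implication_value = Fraction(len(m_phi & m_psi), len(m_phi))  # card(phi and psi) / card(phi)
    return formula_value, implication_value, m_phi <= m_psi


print(rule_values(TABLE, {"temp": "high"}, {"flu": "yes"}))
# (Fraction(3, 5), Fraction(2, 3), False)
print(rule_values(TABLE, {"temp": "normal"}, {"flu": "no"}))
# (Fraction(2, 5), Fraction(1, 1), True)
```

The second rule is true in the table because every object with normal temperature has the decision value no; the first rule holds only to the degree 2/3.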

For any decision system $\mathcal{A} = (U, C, D)$ one can
consider a generalized decision function
$\partial_A\colon U \to \mathrm{POW}(\mathrm{INF}(D))$, where for any $x \in U$,
$\partial_A(x)$ is the set of all $D$-signatures of objects from
$U$ which are $C$-indiscernible with $x$, $A = C \cup D$, and
$\mathrm{POW}(\mathrm{INF}(D))$ is the powerset of the set
$\mathrm{INF}(D)$ of all possible decision signatures.

The decision system $\mathcal{A}$ is called consistent
(deterministic), if $\mathrm{card}(\partial_A(x)) = 1$, for any $x \in U$.
Otherwise $\mathcal{A}$ is said to be inconsistent (nondeterministic).
Hence, a decision system is inconsistent if it contains
objects with different decisions that are
indiscernible with respect to condition attributes. Any
set consisting of all objects with the same generalized
decision value is called a generalized decision
class.
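A short sketch may help to see how the generalized decision is obtained from a table and how consistency is checked. The data and the names `generalized_decision` and `is_consistent` are hypothetical; D-signatures are represented as frozen sets of (attribute, value) pairs.

```python
# Hypothetical decision table: x1 and x2 agree on the conditions but not on "flu".
TABLE = {
    "x1": {"headache": "yes", "temp": "high", "flu": "yes"},
    "x2": {"headache": "yes", "temp": "high", "flu": "no"},
    "x3": {"headache": "no", "temp": "high", "flu": "yes"},
    "x4": {"headache": "no", "temp": "normal", "flu": "no"},
    "x5": {"headache": "no", "temp": "normal", "flu": "no"},
}


def generalized_decision(table, C, D):
    """Map each object x to the set of D-signatures found in its C-elementary set."""
    seen = {}
    for row in table.values():
        key = tuple(row[a] for a in C)
        seen.setdefault(key, set()).add(frozenset((d, row[d]) for d in D))
    return {x: frozenset(seen[tuple(row[a] for a in C)]) for x, row in table.items()}


def is_consistent(table, C, D):
    """A decision system is consistent if every generalized decision is a singleton."""
    return all(len(sigs) == 1 for sigs in generalized_decision(table, C, D).values())


C, D = ["headache", "temp"], ["flu"]
print(len(generalized_decision(TABLE, C, D)["x1"]))  # 2
print(is_consistent(TABLE, C, D))                    # False
```

Removing x2 from this table would make it consistent, because every remaining C-elementary set then carries a single decision signature.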

Now, one can consider certain (possible) rules for
decision classes defined by the lower (upper) approximations
of such generalized decision classes of $\mathcal{A}$. This
approach can be extended by using the relationships of
rough sets with the evidence theory (Dempster-Shafer
theory) by considering rules relative to decision classes
defined by the lower approximations of unions of decision
classes of $\mathcal{A}$.

Numerous methods have been developed for the
generation of different types of decision rules, and the
reader is referred to the literature on rough sets for
details. Usually, one is searching for decision rules that are
(semi) optimal with respect to some optimization criteria
describing the quality of decision rules in concept
approximations.


21.6 Dependencies


Another important issue in data analysis is discovering
dependencies between attributes in a given decision
system $\mathcal{A} = (U, C, D)$. Intuitively, a set of attributes $D$
depends totally on a set of attributes $C$, denoted
$C \Rightarrow D$, if the values of attributes from $C$ uniquely
determine the values of attributes from $D$. In other words,
$D$ depends totally on $C$, if there exists a functional
dependency between values of $C$ and $D$.

$D$ can depend partially on $C$. Formally such a
dependency can be defined in the following way. We will
say that $D$ depends on $C$ to a degree $k$ ($0 \leq k \leq 1$),
denoted by $C \Rightarrow_k D$, if

$$k = \gamma(C, D) = \frac{\mathrm{card}(\mathrm{POS}_C(D))}{\mathrm{card}(U)}, \tag{21.7}$$

where

$$\mathrm{POS}_C(D) = \bigcup_{X \in U/D} C_*(X), \tag{21.8}$$

which is called a positive region of the partition $U/D$
with respect to $C$, is the set of all elements of $U$ that can
be uniquely classified to blocks of the partition $U/D$,
by means of $C$.

If $k = 1$, we say that $D$ depends totally on $C$, and
if $k < 1$, we say that $D$ depends partially (to degree $k$)
on $C$. If $k = 0$, then the positive region of the partition
$U/D$ with respect to $C$ is empty.

The coefficient $k$ expresses the ratio of all elements
of the universe, which can be properly classified to
blocks of the partition $U/D$, employing attributes $C$ and
is called the degree of the dependency.

It can be easily seen that if $D$ depends totally on $C$,
then $\mathrm{IND}(C) \subseteq \mathrm{IND}(D)$. This means that the partition
generated by $C$ is finer than the partition generated
by $D$.

Summing up: $D$ is totally (partially) dependent on
$C$, if all (some) elements of the universe $U$ can be
uniquely classified to blocks of the partition $U/D$,
employing $C$. Observe that (21.7) defines only one of the
possible measures of dependency between attributes.
Note that one can consider dependencies between
arbitrary subsets of attributes in the same way. One also
can compare the dependency discussed in this section
with dependencies considered in databases.


21.7 Reduction of Attributes


We often face the question as to whether we can
remove some data from a data table and still preserve
its basic properties, that is, whether a table contains
some superfluous data. Let us express this idea more
precisely.

Let $C, D \subseteq A$ be sets of condition and decision
attributes, respectively. We will say that $C' \subseteq C$ is a
$D$-reduct (reduct with respect to $D$) of $C$, if $C'$ is a minimal
subset of $C$ such that

$$\gamma(C, D) = \gamma(C', D). \tag{21.9}$$
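The dependency degree that appears in (21.7) and (21.9) can be computed from a data table by collecting the C-elementary sets that fit inside a single decision class. The sketch below is an illustration under stated assumptions; the table and the names `partition`, `positive_region`, and `gamma` are hypothetical.

```python
from fractions import Fraction

# Hypothetical decision table with condition attributes and decision "flu".
TABLE = {
    "x1": {"headache": "yes", "temp": "high", "flu": "yes"},
    "x2": {"headache": "yes", "temp": "high", "flu": "no"},
    "x3": {"headache": "no", "temp": "high", "flu": "yes"},
    "x4": {"headache": "no", "temp": "normal", "flu": "no"},
    "x5": {"headache": "no", "temp": "normal", "flu": "no"},
}


def partition(table, attrs):
    """Blocks of objects that are indiscernible with respect to attrs."""
    groups = {}
    for x, row in table.items():
        groups.setdefault(tuple(row[a] for a in attrs), set()).add(x)
    return list(groups.values())


def positive_region(table, C, D):
    """POS_C(D): union of the C-lower approximations of the decision classes."""
    decision_classes = partition(table, D)
    pos = set()
    for block in partition(table, C):
        if any(block <= X for X in decision_classes):
            pos |= block
    return pos


def gamma(table, C, D):
    """Degree of dependency of D on C, as in (21.7)."""
    return Fraction(len(positive_region(table, C, D)), len(table))


print(gamma(TABLE, ["headache", "temp"], ["flu"]))  # 3/5
print(gamma(TABLE, ["temp"], ["flu"]))              # 2/5
print(gamma(TABLE, ["headache"], ["flu"]))          # 0
```

In this table the decision depends only partially on the conditions, because x1 and x2 are indiscernible yet carry different decisions.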


The intersection of all $D$-reducts is called a $D$-core
(core with respect to $D$). Because the core is the
intersection of all reducts, it is included in every reduct,
i.e., each element of the core belongs to every reduct.
Thus, in a sense, the core is the most important subset
of attributes, since none of its elements can be removed
without affecting the classification power of attributes.
Certainly, the geometry of reducts can be more complex.
For example, the core can be empty but there can exist
a partition of reducts into a few sets with nonempty
intersection.
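For very small tables, reducts and the core can be found by exhaustive search over subsets of condition attributes. The following sketch does exactly that and is meant only as an illustration: it is exponential in the number of attributes, and the table and all function names are hypothetical. The example table was chosen so that the core is empty.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical decision table: d equals c, and d is also determined by (a, b).
TABLE = {
    "o1": {"a": 0, "b": 0, "c": 0, "d": 0},
    "o2": {"a": 0, "b": 1, "c": 1, "d": 1},
    "o3": {"a": 1, "b": 0, "c": 1, "d": 1},
    "o4": {"a": 1, "b": 1, "c": 0, "d": 0},
}


def partition(table, attrs):
    groups = {}
    for x, row in table.items():
        groups.setdefault(tuple(row[a] for a in attrs), set()).add(x)
    return list(groups.values())


def gamma(table, C, D):
    """Degree of dependency of D on C (size of the positive region over card(U))."""
    classes = partition(table, D)
    pos = sum(len(b) for b in partition(table, C) if any(b <= X for X in classes))
    return Fraction(pos, len(table))


def reducts(table, C, D):
    """All minimal subsets of C that preserve gamma(C, D); brute force by size."""
    target = gamma(table, C, D)
    found = []
    for size in range(len(C) + 1):
        for subset in combinations(C, size):
            candidate = set(subset)
            # Skip supersets of reducts already found: they are not minimal.
            if any(r <= candidate for r in found):
                continue
            if gamma(table, subset, D) == target:
                found.append(candidate)
    return found


all_reducts = reducts(TABLE, ["a", "b", "c"], ["d"])
core = set.intersection(*all_reducts)
print([sorted(r) for r in all_reducts])  # [['c'], ['a', 'b']]
print(core)                              # set()
```

Searching subsets in order of increasing size guarantees that every preserved subset without a smaller preserved subset is reported, which is the minimality required of a reduct.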

Many other kinds of reducts and their approximations
have been discussed in the literature. They are
defined relative to different quality measures. For example,
if one changes the condition (21.9) to
$\partial_A(x) = \partial_B(x)$ (where $A = C \cup D$ and
$B = C' \cup D$), then the defined reducts preserve the
generalized decision. Other kinds of reducts preserve, e.g.,
(i) the distance between attribute value vectors for any two
objects, if this distance is greater than a given threshold,
(ii) the distance between entropy distributions between any
two objects, if this distance exceeds a given threshold. Yet
another kind is given by the so-called reducts relative to
objects, used for the generation of decision rules.

Reducts are used for building data models. Choosing
a particular reduct or a set of reducts has an
impact on the model size as well as on its quality
in describing a given data set. The model size
together with the model quality are two basic components
tuned in selecting relevant data models. This
is known as the minimum description length principle.
Selection of relevant kinds of reducts is an important step in
building data models. It turns out that the different
kinds of reducts can be efficiently computed using
heuristics based, e.g., on the Boolean reasoning approach.

Let us note that analogously to the information
flow [21.17] one can consider different theories over
information or decision systems representing different
views on knowledge encoded in the systems. In particular,
this approach was used for inducing concurrent
models from data tables. For more details the reader
is referred to the books cited at the beginning of the
chapter.


21.8 Rough Membership


Let us observe that rough sets can also be defined
employing the rough membership function (21.10) instead
of approximation. That is, consider

$$\mu_X^B\colon U \to [0, 1],$$

defined by

$$\mu_X^B(x) = \frac{\mathrm{card}(B(x) \cap X)}{\mathrm{card}(B(x))}, \tag{21.10}$$

where $x \in U$ and $X \subseteq U$. The value $\mu_X^B(x)$ can be
interpreted as the degree that $x$ belongs to $X$ in view of
knowledge about $x$ expressed by $B$ or the degree to which the
elementary granule $B(x)$ is included in the set $X$. This
means that the definition reflects a subjective knowledge
about elements of the universe, in contrast to the
classical definition of a set related to objective knowledge.

Rough membership function can also be interpreted
as the conditional probability that $x$ belongs to $X$
given $B$. One may refer to Bayes’ theorem as the origin
of this function. This interpretation was used by several
researchers in the rough set community.

One can observe that the rough membership function
has the following properties:

1) $\mu_X^B(x) = 1$ iff $x \in B_*(X)$,

2) $\mu_X^B(x) = 0$ iff $x \in U - B^*(X)$,

3) $0 < \mu_X^B(x) < 1$ iff $x \in \mathrm{BN}_B(X)$,

4) $\mu_{U-X}^B(x) = 1 - \mu_X^B(x)$ for any $x \in U$,

5) $\mu_{X \cup Y}^B(x) \geq \max(\mu_X^B(x), \mu_Y^B(x))$ for any $x \in U$,

6) $\mu_{X \cap Y}^B(x) \leq \min(\mu_X^B(x), \mu_Y^B(x))$ for any $x \in U$.
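The rough membership function and some of its properties can be checked on a small example. In the sketch below the partition U/B is given directly as a list of elementary sets; the data and the name `rough_membership` are hypothetical and serve only as an illustration.

```python
from fractions import Fraction


def rough_membership(blocks, X, x):
    """card(B(x) & X) / card(B(x)), where B(x) is the block containing x."""
    block = next(b for b in blocks if x in b)
    return Fraction(len(block & X), len(block))


# Hypothetical partition U/B and concept X.
BLOCKS = [{"x1", "x2"}, {"x3"}, {"x4", "x5", "x6"}, {"x7"}]
X = {"x1", "x3", "x4", "x5"}

print(rough_membership(BLOCKS, X, "x3"))  # 1   (x3 is in the lower approximation)
print(rough_membership(BLOCKS, X, "x7"))  # 0   (x7 is outside the upper approximation)
print(rough_membership(BLOCKS, X, "x6"))  # 2/3 (x6 is in the boundary region)

# Property 5 can hold with strict inequality: the membership of a union
# is not determined by the memberships of its parts.
Y1, Y2 = {"x1"}, {"x2"}
print(rough_membership(BLOCKS, Y1 | Y2, "x1"))                           # 1
print(max(rough_membership(BLOCKS, Y1, "x1"),
          rough_membership(BLOCKS, Y2, "x1")))                           # 1/2
```

Note that x6 receives the value 2/3 although it does not belong to X: the value depends only on the elementary set containing the object.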


From the properties it follows that the rough membership
differs essentially from the fuzzy membership [21.18],
for properties 5) and 6) show that the
membership for union and intersection of sets, in general,
cannot be computed from their constituents’
membership, as it can in the case of fuzzy sets. Thus formally
rough membership is different from fuzzy membership.
Moreover, the rough membership function depends on 
available knowledge (represented by attributes from B). 
Besides, the rough membership function, in contrast 
to the fuzzy membership function, has a probabilistic 
flavor. 

Let us also mention that rough set theory, in contrast
to fuzzy set theory, clearly distinguishes two very
important concepts, vagueness and uncertainty, very
often confused in the AI literature. Vagueness is the
property of concepts. Vague concepts can be approximated
using the rough set approach. Uncertainty is
the property of elements of a set or a set itself (e.g.,
only examples and/or counterexamples of elements of
a considered set are given). Uncertainty of elements of
a set can be expressed by the rough membership function.

Fuzzy and rough set theory represent two
different approaches to vagueness. Fuzzy set theory
addresses gradualness of knowledge, expressed by the
fuzzy membership, whereas rough set theory addresses
granularity of knowledge, expressed by the indiscernibility
relation. One can also cope with knowledge
gradualness using the rough membership. A nice illustration
of this difference was given by Didier Dubois and
Henri Prade in their example related to image processing,
where fuzzy set theory refers to gradualness of gray
level, whereas rough set theory is about the size of pixels.

Consequently, these theories do not compete with 
each other but are rather complementary. In particular, 
the rough set approach provides tools for approxi- 
mate construction of fuzzy membership functions. The 
rough-fuzzy hybridization approach has proved to be 
successful in many applications. An interesting discus- 
sion of fuzzy and rough set theory in the approach to 
vagueness can be found in [21.11]. 


21.9 Discernibility and Boolean Reasoning 


The discernibility relations are closely related to indis- 
cernibility relations and belong to the most important 
relations considered in rough set theory. Tools for dis- 
covering and classifying patterns are based on reason- 
ing schemes rooted in various paradigms. Such patterns 
can be extracted from data by means of methods based, 
e.g., on discernibility and Boolean reasoning. 

The ability to discern between perceived objects is
important for constructing many entities like reducts,
decision rules, or decision algorithms. In the standard
approach the discernibility relation
$\mathrm{DIS}(B) \subseteq U \times U$ is defined by $x\,\mathrm{DIS}(B)\,y$
if and only if non($x\,\mathrm{IND}(B)\,y$), i.e.,
$B(x) \cap B(y) = \emptyset$. However, this is, in general, not the
case for generalized approximation spaces.

The idea of Boolean reasoning is based on the con- 
struction for a given problem P of a corresponding 
Boolean function fp with the following property: the so- 
lutions for the problem P can be decoded from prime 
implicants of the Boolean function fp [21.19-21]. Let 
us mention that to solve real-life problems it is neces- 
sary to deal with very large Boolean functions. 
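
As a minimal sketch of this scheme for one concrete problem P (finding the reducts of an information system), the following Python fragment builds the clauses of the discernibility function and decodes the reducts as its prime implicants. The toy table, the attribute names, and the helper functions are assumptions made for illustration only. The exhaustive search is feasible only for very small tables.

```python
from itertools import combinations

# Hypothetical toy information system: object -> attribute values.
TABLE = {
    "x1": {"a": 0, "b": 0, "c": 1},
    "x2": {"a": 0, "b": 1, "c": 1},
    "x3": {"a": 1, "b": 0, "c": 0},
    "x4": {"a": 1, "b": 1, "c": 0},
}
ATTRS = ("a", "b", "c")


def discernibility_clauses(table, attrs):
    """One clause per object pair: the set of attributes that discern the pair."""
    clauses = set()
    for x, y in combinations(sorted(table), 2):
        clause = frozenset(a for a in attrs if table[x][a] != table[y][a])
        if clause:  # indiscernible pairs impose no constraint
            clauses.add(clause)
    return clauses


def reducts(table, attrs):
    """Prime implicants of the monotone discernibility function.

    For a monotone CNF these are the minimal attribute sets that
    intersect every clause. Candidates are tried by increasing size,
    so any superset of an already found reduct is rejected as non-minimal.
    """
    clauses = discernibility_clauses(table, attrs)
    found = []
    for size in range(len(attrs) + 1):
        for cand in combinations(attrs, size):
            s = set(cand)
            hits_all = all(s & c for c in clauses)
            minimal = not any(set(r) <= s for r in found)
            if hits_all and minimal:
                found.append(cand)
    return found


REDUCTS = reducts(TABLE, ATTRS)
```

In this toy table the discernibility function is b ∧ (a ∨ c) ∧ (a ∨ b ∨ c), so the search returns the two reducts {a, b} and {b, c}.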

A successful methodology based on the discerni- 
bility of objects and Boolean reasoning has been de- 
veloped for computing many important ingredients for 
applications. These applications include generation of 
reducts and their approximations, decision rules, as- 
sociation rules, discretization of real-valued attributes, 
symbolic value grouping, searching for new features de- 
fined by oblique hyperplanes or higher-order surfaces, 
pattern extraction from data, as well as conflict resolu- 
tion or negotiation [21.4, 6]. 

Most of the problems related to the generation of the
above-mentioned entities are NP-complete or NP-hard.
However, it was possible to develop efficient heuristics
returning suboptimal solutions of the problems.
The results of experiments on many data sets are very
promising. They show very good quality of solutions
generated by the heuristics in comparison with other
methods reported in the literature (e.g., with respect to
the classification quality of unseen objects). Moreover,
the heuristics are very efficient from the point of view of
the time necessary to compute the solution. Many of these
methods are based on discernibility matrices. However, it
is possible to compute the necessary information about
these matrices without their explicit construction (i.e.,
by sorting or hashing original data).
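
The following Python sketch illustrates the last remark under stated assumptions. It counts, by hashing objects on their attribute signatures, how many pairs of objects with different decisions a set of attributes fails to discern, and uses this count in a simple greedy search for a short decision-relative reduct. The data, the function names, and the particular greedy criterion are hypothetical and are not taken from the cited literature.

```python
from collections import Counter

# Hypothetical decision table: (condition attribute values, decision).
ROWS = [
    ({"a": 0, "b": 0, "c": 0}, "no"),
    ({"a": 0, "b": 1, "c": 0}, "yes"),
    ({"a": 1, "b": 0, "c": 1}, "no"),
    ({"a": 1, "b": 1, "c": 1}, "yes"),
    ({"a": 1, "b": 1, "c": 0}, "yes"),
]


def undiscerned_pairs(rows, attrs):
    """Pairs with different decisions that `attrs` does not discern.

    Objects are hashed on their signature over `attrs`; the n x n
    discernibility matrix is never built.
    """
    groups = {}
    for cond, dec in rows:
        key = tuple(cond[a] for a in attrs)
        groups.setdefault(key, Counter())[dec] += 1
    total = 0
    for counts in groups.values():
        n = sum(counts.values())
        same_decision = sum(k * (k - 1) // 2 for k in counts.values())
        total += n * (n - 1) // 2 - same_decision
    return total


def greedy_reduct(rows, attrs):
    """Add, one at a time, the attribute leaving the fewest undiscerned pairs."""
    target = undiscerned_pairs(rows, attrs)  # best achievable with all attributes
    chosen = []
    while undiscerned_pairs(rows, chosen) > target:
        best = min(
            (a for a in attrs if a not in chosen),
            key=lambda a: undiscerned_pairs(rows, chosen + [a]),
        )
        chosen.append(best)
    return chosen


SHORT_REDUCT = greedy_reduct(ROWS, ["a", "b", "c"])
```

Each evaluation takes one pass over the data, so the cost grows linearly with the number of objects rather than quadratically, at the price of returning a possibly suboptimal attribute set.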

It is important to note that the methodology makes 
it possible to construct heuristics with a very impor- 
tant approximation property, which can be formulated 
as follows: expressions, called approximate implicants, 
generated by heuristics that are close to prime impli- 
cants define approximate solutions for the problem. 

Mining large data sets is one of the biggest challenges
in knowledge discovery and data mining (KDD).
In many practical applications, there is a need for data
mining algorithms running on terminals of a client-server
database system where the only access to the
database (located on the server) is enabled by queries
in structured query language (SQL).

Let us consider two illustrative examples of prob- 
lems for large data sets: (i) searching for short reducts, 
(ii) searching for best partitions defined by cuts on con- 
tinuous attributes. In both cases, the traditional imple- 
mentations of rough sets and Boolean reasoning-based 
methods are characterized by a high computational cost. 
The critical factor for the time complexity of algorithms 
solving the discussed problems is the number of data 
access operations. Fortunately, some efficient modifications
of the original algorithms were proposed by
relying on concurrent retrieval of higher-level statistics,
which are sufficient for the heuristic search of reducts
and partitions [21.4, 6]. The rough set approach was
also applied in the development of other scalable big
data processing techniques (e.g., [21.22]).


21.10 Rough Sets and Induction


The rough set approach is strongly related to inductive
reasoning (e.g., in rough set-based methods for inducing
classifiers or clusters [21.6]). The general idea for
inducing classifiers is as follows. From a given decision
table a set of granules in the form of decision rules is
induced together with arguments for and against each
decision rule and decision class. For any new object
with known signature one can select rules matching
this object. Note that the left-hand sides of decision
rules are described by formulae that make it possible
to check for new objects if they satisfy them assuming
that the signatures of these objects are known. In this
way, one can consider two semantics of formulae: on
a sample of objects U and on its extension U* ⊇ U.
Definitely, one should consider a risk related to such
generalization, e.g., in the decision rule induction. Next,
a conflict resolution should be applied to resolve conflicts
between the rules matched by the new object that vote for
different decisions. In the rough set approach, the process
of inducing classifiers can be considered as the
process of inducing approximations of concepts over
extensions of approximation spaces (defined over samples
of objects represented by decision systems). The
whole procedure can be generalized for the case of
approximation of more complex information granules. It
is worthwhile mentioning that approaches for inducing
approximate reasoning schemes have also been developed.
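
The matching and conflict resolution steps can be sketched as follows. This is only an illustration under stated assumptions: the rules, their support values, the attribute names, and the choice of support-weighted voting as the conflict resolution strategy are all hypothetical.

```python
from collections import Counter

# Hypothetical induced rules: (left-hand side, decision, support in the sample).
RULES = [
    ({"outlook": "sunny", "windy": False}, "play", 3),
    ({"outlook": "sunny"}, "stay", 1),
    ({"humidity": "high"}, "stay", 2),
]


def matches(lhs, signature):
    """A rule matches when every descriptor of its left-hand side holds."""
    return all(signature.get(attr) == value for attr, value in lhs.items())


def classify(rules, signature):
    """Collect votes of matching rules and resolve conflicts by total support."""
    votes = Counter()
    for lhs, decision, support in rules:
        if matches(lhs, signature):
            votes[decision] += support
    if not votes:
        return None, votes  # no rule matches: the object is not recognized
    return votes.most_common(1)[0][0], votes


NEW_OBJECT = {"outlook": "sunny", "windy": False, "humidity": "normal"}
DECISION, VOTES = classify(RULES, NEW_OBJECT)
```

Here two rules with conflicting decisions match the new object, and the decision supported by the larger total support is returned.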

A typical approach in machine learning is based
on inducing classifiers from samples of objects. These
classifiers are used to predict decisions for objects
unseen so far, provided that the signatures of these objects
are available. This approach can be called global, i.e.,
leading to decision extension from a given sample of
objects to the whole universe of objects. This global
approach has some drawbacks (see the Epilog in [21.23]).
Instead, one can try to use transduction [21.23],
semi-supervised learning, induced local models relative
to new objects, or adaptive learning strategies.
However, we are still far away from fully understanding
the discovery processes behind such generalization
strategies [21.24].


21.11 Rough Set-Based Generalizations 


The original approach by Pawlak was based on indis- 
cernibility defined by equivalence relations. Any such 
indiscernibility relation defines a partition of the uni- 
verse of objects. Over the years many generalizations 
of this approach were introduced, many of which are 
based on coverings rather than partitions. In particu- 
lar, one can consider the similarity (tolerance)-based 
rough set approach, binary relation based rough sets, 
neighborhood and covering rough sets, the dominance- 
based rough set approach, hybridization of rough sets 
and fuzzy sets, and many others. 

One should note that dealing with coverings requires
solving several new algorithmic problems such
as the selection of a family of definable sets or
the selection of the relevant definition of
the approximation of sets among many possible ones.
One should also note that for a given problem (e.g., a
classification problem) one should discover the relevant
covering for the target classification task. In the literature
there are numerous papers dedicated to theoretical
aspects of the covering rough set approach. However,
much more work remains to be done on rather hard
algorithmic issues, e.g., relevant covering discovery.

Another issue to be solved is related to inclusion
measures. Parameters of such measures are tuned when
inducing high quality approximations. Usually,
this is done on the basis of the minimum description
length principle. In particular, approximation spaces
with rough inclusion measures have been investigated.
This approach was further extended to the rough
mereological approach. More general cases of approximation
spaces with rough inclusion have also been discussed
in the literature, including approximation spaces in
granular computing (GC).
Finally, it is worthwhile mentioning the approach for
ontology approximation used in hierarchical learning of
complex vague concepts [21.6].

In this section, we discuss in more detail some is- 
sues related to the above-mentioned generalizations. 
Several generalizations of the classical rough set ap- 
proach based on approximation spaces defined as pairs 
of the form (U, R), where R is the equivalence relation 
(called indiscernibility relation) on the set U, have been 
reported in the literature. They are related to different 
views on important components used in the definition 
of rough sets. In the definition of rough sets, different
kinds of structural sets that are examples of information
granules are used. From a mathematical point of view,
one may treat them as sets defined over the hierarchy
of the powerset of objects. Among them are the following
ones:


• Elementary granules (neighborhoods) of objects
  (e.g., similarity, tolerance, dominance neighborhoods,
  fuzzy neighborhoods, rough-fuzzy neighborhoods,
  fuzzy rough neighborhoods, families of
  neighborhoods).

• Granules defined by accessible information about
  objects (e.g., only partial information on the
  signature of objects may be accessible).

• Methods for the definition of higher-order information
  granules (e.g., defined by the left-hand sides of
  induced decision rules or clusters of similar
  information granules).

• Inclusion measures making it possible to define the
  degrees of inclusion and/or closeness between
  information granules (e.g., the degrees of inclusion
  of granules defined by accessible information about
  objects into elementary granules).

• Aggregation methods of inclusion and/or closeness
  degrees.

• Methods for the definition of approximation operations,
  including strategies for extension of approximations
  from samples of objects to larger sets of
  objects.

• Algebraic structures of approximation spaces.


Let us consider some examples of generalizations 
of the rough set approach proposed by Pawlak in 1982. 

A generalized approximation space [21.25] can be
defined by a tuple AS = (U, I, ν), where I is the
uncertainty function defined on U with values in the powerset
POW(U) of U (I(x) is the neighborhood of x) and ν is
the inclusion function defined on the Cartesian product
POW(U) × POW(U) with values in the interval [0, 1]
measuring the degree of inclusion of sets. The lower
and upper approximation operations can be defined
in AS by


\[
\mathrm{LOW}(AS, X) = \{x \in U : \nu(I(x), X) = 1\}, \tag{21.11}
\]
\[
\mathrm{UPP}(AS, X) = \{x \in U : \nu(I(x), X) > 0\}. \tag{21.12}
\]


In the case considered by Pawlak [21.2], I(x) is equal to
the equivalence class B(x) of the indiscernibility
relation IND(B); in the case of the tolerance (or similarity)
relation T ⊆ U × U we take I(x) = [x]_T = {y ∈ U : x T y},
i.e., I(x) is equal to the tolerance class of T defined
by x.

The standard rough inclusion relation ν_SRI is defined
for X, Y ⊆ U by


\[
\nu_{SRI}(X, Y) =
\begin{cases}
\dfrac{\mathrm{card}(X \cap Y)}{\mathrm{card}(X)}, & \text{if } X \neq \emptyset, \\
1, & \text{otherwise.}
\end{cases}
\tag{21.13}
\]


For applications it is important to have some constructive
definitions of I and ν.
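
One constructive choice can be sketched in Python as follows, assuming a toy universe of integers and a hypothetical tolerance relation x T y iff |x - y| ≤ 1. The uncertainty function returns tolerance classes, the inclusion function is the standard rough inclusion, and the approximations follow (21.11) and (21.12). All names and data are illustrative assumptions.

```python
from fractions import Fraction


def nu_sri(X, Y):
    """Standard rough inclusion: card(X & Y) / card(X), or 1 when X is empty."""
    return Fraction(len(X & Y), len(X)) if X else Fraction(1)


def lower(U, I, nu, X):
    """LOW(AS, X): objects whose neighborhood is fully included in X."""
    return {x for x in U if nu(I(x), X) == 1}


def upper(U, I, nu, X):
    """UPP(AS, X): objects whose neighborhood overlaps X to a positive degree."""
    return {x for x in U if nu(I(x), X) > 0}


# Hypothetical universe and tolerance relation: x T y iff |x - y| <= 1.
U = set(range(1, 9))


def tolerance_class(x):
    return {y for y in U if abs(x - y) <= 1}


X = {3, 4, 5, 6}
LOW = lower(U, tolerance_class, nu_sri, X)
UPP = upper(U, tolerance_class, nu_sri, X)
```

Exact rational arithmetic is used so that the test ν(I(x), X) = 1 does not depend on floating-point rounding. For this data LOW is {4, 5} and UPP is {2, ..., 7}.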

One can consider another way to define I(x). Usually,
together with AS we consider some set F of
formulae describing sets of objects in the universe U
of AS defined by semantics ||·||_AS, i.e., ||α||_AS ⊆ U
for any α ∈ F. If AS = (U, A) then we will also write
||α||_U instead of ||α||_AS. Now, one can take the set


\[
N_F(x) = \{\alpha \in F : x \in \|\alpha\|_{AS}\}, \tag{21.14}
\]


and I(x) = {||α||_AS : α ∈ N_F(x)}. Hence, more general
uncertainty functions with values in POW(POW(U))
can be defined and in consequence different definitions
of approximations are considered. For example, one can
consider the following definitions of approximation
operations in this approximation space AS


\[
\mathrm{LOW}(AS_{\circ}, X) = \{x \in U : \nu(Y, X) = 1 \text{ for some } Y \in I(x)\}, \tag{21.15}
\]
\[
\mathrm{UPP}(AS_{\circ}, X) = \{x \in U : \nu(Y, X) > 0 \text{ for any } Y \in I(x)\}. \tag{21.16}
\]
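
A small Python sketch of (21.14) to (21.16) follows. The formulae are represented by hypothetical names mapped directly to their semantics, and "for any" in (21.16) is read as "for every". Under this reading an object satisfying no formula belongs vacuously to the upper approximation. The universe, the formulae, and the function names are assumptions made for illustration.

```python
from fractions import Fraction

# Hypothetical universe and formulae, each given directly by its semantics.
U = {1, 2, 3, 4, 5, 6}
SEMANTICS = {
    "small": frozenset({1, 2, 3}),
    "even": frozenset({2, 4, 6}),
    "mid": frozenset({3, 4}),
    "large": frozenset({5, 6}),
}


def nu_sri(X, Y):
    """Standard rough inclusion (21.13)."""
    return Fraction(len(X & Y), len(X)) if X else Fraction(1)


def N_F(x):
    """Formulae satisfied by x, as in (21.14)."""
    return {alpha for alpha, ext in SEMANTICS.items() if x in ext}


def I(x):
    """Uncertainty function with values in POW(POW(U))."""
    return {SEMANTICS[alpha] for alpha in N_F(x)}


def lower(X):
    """(21.15): some granule containing x is included in X."""
    return {x for x in U if any(nu_sri(Y, X) == 1 for Y in I(x))}


def upper(X):
    """(21.16): every granule containing x overlaps X."""
    return {x for x in U if all(nu_sri(Y, X) > 0 for Y in I(x))}


X = {2, 3, 4}
LOW, UPP = lower(X), upper(X)
```

With X = {2, 3, 4}, only the granule of "mid" is included in X, so LOW is {3, 4}, while the granule of "large" is disjoint from X and removes 5 and 6 from UPP.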


There are also different forms of rough inclusion
functions. Let us consider two examples. In the first
example of a rough inclusion function, a threshold t ∈
(0, 0.5) is used to relax the degree of inclusion of sets.
The rough inclusion function ν_t is defined by


\[
\nu_t(X, Y) =
\begin{cases}
1 & \text{if } \nu_{SRI}(X, Y) \geq 1 - t, \\
\dfrac{\nu_{SRI}(X, Y) - t}{1 - 2t} & \text{if } t \leq \nu_{SRI}(X, Y) < 1 - t, \\
0 & \text{if } \nu_{SRI}(X, Y) \leq t.
\end{cases}
\tag{21.17}
\]


One can obtain approximations considered in the
variable precision rough set approach (VPRSM) by
substituting in (21.11) and (21.12) the rough inclusion
function ν_t defined by (21.17) instead of ν, assuming
that Y is a decision class and I(x) = B(x) for any object
x, where B is a given set of attributes. Another example
of application of the standard inclusion was developed
by using probabilistic decision functions. The rough
inclusion relation can be also used for function
approximation and relation approximation [21.25].
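
The effect of the threshold can be sketched in Python as follows. The partition into indiscernibility classes, the decision class Y, and the value t = 1/4 are hypothetical, and the middle branch of ν_t uses the denominator 1 - 2t, which is an assumption of this sketch.

```python
from fractions import Fraction


def nu_sri(X, Y):
    """Standard rough inclusion (21.13)."""
    return Fraction(len(X & Y), len(X)) if X else Fraction(1)


def nu_t(X, Y, t):
    """Thresholded rough inclusion for 0 < t < 1/2."""
    v = nu_sri(X, Y)
    if v >= 1 - t:
        return Fraction(1)
    if v <= t:
        return Fraction(0)
    return (v - t) / (1 - 2 * t)


# Hypothetical indiscernibility classes B(x) and decision class Y.
CLASSES = [{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10}, {11, 12, 13, 14}]
U = set().union(*CLASSES)
Y = {1, 2, 3, 5, 6, 9, 10, 11}
T = Fraction(1, 4)


def B(x):
    """Indiscernibility class of x."""
    return next(c for c in CLASSES if x in c)


LOWER_T = {x for x in U if nu_t(B(x), Y, T) == 1}
UPPER_T = {x for x in U if nu_t(B(x), Y, T) > 0}
LOWER_STD = {x for x in U if nu_sri(B(x), Y) == 1}
UPPER_STD = {x for x in U if nu_sri(B(x), Y) > 0}
```

With t = 1/4 the class {1, 2, 3, 4}, of which three quarters lies in Y, enters the lower approximation, and the class {11, 12, 13, 14}, of which only one quarter lies in Y, leaves the upper approximation. Under the standard inclusion, the lower approximation is just {9, 10} and the upper approximation is the whole universe.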

The approach based on inclusion functions has
been generalized to the rough mereological
approach [21.26]. The inclusion relation x μ_r y with the
intended meaning "x is a part of y to a degree at
least r" has been taken as the basic notion of rough
mereology, which is a generalization of the Leśniewski
mereology [21.27].

Usually families of approximation spaces labeled 
by some parameters are considered. By tuning such 
parameters according to chosen criteria (e.g., minimal 
description length) one can search for the optimal ap- 
proximation space for concept approximation. 

Our knowledge about the approximated concepts 
is often partial and uncertain. For example, con- 
cept approximation should be constructed from ex- 
amples and counterexamples of objects for the con- 
cepts [21.14]. Hence, concept approximations con- 
structed from a given sample of objects are extended, 
using inductive reasoning, on objects not yet observed. 
The rough set approach for dealing with concept ap- 
proximation under such partial knowledge is now well 
developed. 

Searching strategies for relevant approximation 
spaces are crucial for real-life applications. They in- 
clude the discovery of uncertainty functions, inclusion 
measures, as well as selection of methods for approxi- 
mations of decision classes and strategies for inductive 
extension of approximations from samples on larger 
sets of objects. 

Approximations of concepts should be constructed
under dynamically changing environments. This leads
to a more complex situation where the boundary regions
are not crisp sets, which is consistent with the
postulate of the higher-order vagueness considered by
philosophers [21.10]. Different aspects of vagueness in
the rough set framework have been discussed.

It is worthwhile mentioning that a rough set ap- 
proach to the approximation of compound concepts has 
been developed. For such concepts, one can hardly
expect that they can be approximated with
high quality by the traditional methods [21.23,
28]. The approach is based on hierarchical learning
and ontology approximation. Approximation methods 
of concepts in distributed environments have been de- 
veloped. The reader may find surveys of algorithmic 
methods for concept approximation based on rough sets 
and Boolean reasoning in the literature. 

In several papers, the problem of ontology approx- 
imation was discussed together with possible applica- 
tions to approximation of compound concepts or to 
knowledge transfer. In any ontology [21.29] (vague) 
concepts and local dependencies between them are 
specified. Global dependencies can be derived from lo- 
cal dependencies. Such derivations can be used as hints 
in searching for relevant compound patterns (informa- 
tion granules) in approximation of more compound 
concepts from the ontology. The ontology approxi- 
mation problem is one of the fundamental problems 
related to approximate reasoning in distributed environ- 
ments. One should construct (in a given language that 
is different from the language in which the ontology 
is specified) not only approximations of concepts from 
ontology, but also vague dependencies specified in the 
ontology. It is worthwhile mentioning that an ontology 
approximation should be induced on the basis of in- 
complete information about concepts and dependencies 
specified in the ontology. Information granule calculi 
based on rough sets have been proposed as tools making 
it possible to solve this problem. Vague dependencies 
have vague concepts in premisses and conclusions. 

The approach to approximation of vague dependencies
based only on degrees of closeness of concepts
from dependencies and their approximations (classifiers)
is not satisfactory for approximate reasoning.
Hence, a more advanced approach should be developed.
Approximation of any vague dependency is a method
which for any object allows us to compute the arguments
for and against its membership in the dependency
conclusion on the basis of analogous arguments
relative to the dependency premisses. Any argument
is a compound information granule (compound pattern).
Arguments are fused by local schemes (production
rules) discovered from data. Further fusions are
possible through composition of local schemes, called
approximate reasoning schemes (AR) [21.30]. To estimate
the degree to which (at least) an object belongs to
concepts from the ontology, the arguments for and against
those concepts are collected and next a conflict resolution
strategy is applied to them to predict the degree.

Several generalizations of the rough set approach 
introduced by Pawlak in 1982 are discussed in this 
handbook in more detail: 


• The similarity (tolerance)-based rough set approach
  (Chap. 25)

• Binary relation based rough sets (Chap. 25)

• Neighborhood and covering rough sets (Chap. 25)

• The dominance-based rough set approach
  (Chap. 22)

• The probabilistic rough set approach and its
  probabilistic extension called the variable consistency
  dominance-based rough set approaches (Chap. 24)

• Parameterized rough sets based on Bayesian
  confirmation measures (Chap. 22)

• Stochastic rough set approaches (Chap. 22)

• Generalizations of rough set approximation
  operations (Chap. 25)

• Hybridization of rough sets and fuzzy sets
  (Chap. 26)

• Rough sets on abstract algebraic structures (e.g.,
  lattices) (Chap. 25).


There are some other well-established or emerging 
domains not covered in the chapter where some gener- 
alizations of rough sets are proposed as the basic tools, 
often in combination with other existing approaches. 
Among them are rough sets based on [21.6]: 


i) Incomplete information and/or decision systems 

ii) Nondeterministic information and/or decision sys- 
tems 

iii) The rough set model on two universes 

iv) Dynamic information and/or decision systems 

v) Dynamic networks of information and/or decision 
systems. 


Moreover, rough sets play a crucial role in the
development of granular computing (GC) [21.31]. The
extension to interactive granular computing (IGC) [21.32]
requires generalization of basic concepts such as
information and decision systems, as well as methods of
inducing hierarchical structures of information and
decision systems.

Let us note that making progress in understanding
interactive computations is one of the key problems
in developing high quality intelligent systems working
in complex environments [21.33]. The current research
projects aim at developing foundations of IGC based on
the rough set approach in combination with other soft
computing approaches, in particular with fuzzy sets.
The approach is called interactive rough granular
computing (IRGC). In IRGC computations are based on
interactions of complex granules (c-granules, for short).
Any c-granule consists of a physical part and a mental
part that are linked in a special way [21.32]. IRGC is
treated as the basis for (see [21.6] and references in this
book):


i) Wistech Technology, in particular for approximate
reasoning, called adaptive judgment about properties
of interactive computations

ii) Context induction

iii) Reasoning about changes

iv) Process mining (this research was inspired
by [21.34])

v) Perception-based computing (PBC)

vi) Risk management in computational systems
[21.32].


Interactive computations based on c-granules seem
to create a good background, e.g., for modeling
computations in Active Media Technology (AMT) and
Wisdom Web of Things (W2T). We plan to investigate
their role for foundations of natural computing too. Let
us also mention that the interactive computations based
on c-granules are quite different in nature than Turing
computations. Hence, we plan to investigate
relationships of interactive computability based on c-granules
and Turing computability.


21.12 Rough Sets and Logic


The father of contemporary logic was the German
mathematician Gottlob Frege (1848-1925). He thought
that mathematics should not be based on the notion
of set but on the notions of logic. He created the first
axiomatized logical system but it was not understood
by the logicians of those days. During the first three
decades of the twentieth century there was a rapid
development in logic, bolstered to a great extent by Polish
logicians, in particular by Alfred Tarski, Stanisław
Leśniewski, Jan Łukasiewicz, and next by Andrzej
Mostowski and Helena Rasiowa. The development of
computers and their applications stimulated logical
research and widened its scope.

When we speak about logic, we generally mean 
deductive logic. It gives us tools designed for de- 
riving true propositions from other true propositions. 
Deductive reasoning always leads to true conclusions 
from true premises. The theory of deduction has well- 
established, generally accepted theoretical foundations. 
Deductive reasoning is the main tool used in mathemat- 
ical reasoning. 

Rough set theory has contributed to some extent to 
various kinds of deductive reasoning. Particularly, var- 
ious kinds of logics based on the rough set approach 
have been investigated; rough set methodology has con- 
tributed essentially to modal logics, many valued logic, 
intuitionistic logic, and others (see, e.g., references in 
the book [21.6] and in articles [21.3, 4]). A summary of 
this research can be found in [21.35, 36] and the inter- 
ested reader is advised to consult these volumes. 

In natural sciences (e.g., in physics) inductive rea- 
soning is of primary importance. The characteristic 
feature of such reasoning is that it does not begin 
from axioms (expressing general knowledge about the 
reality) like in deductive logic, but some partial knowl- 
edge (examples) about the universe of interest are the 
starting point of this type of reasoning, which are gen- 
eralized next and they constitute the knowledge about 
a wider reality than the initial one. In contrast to de- 
ductive reasoning, inductive reasoning does not lead to 
true conclusions but only to probable (possible) ones. 
Also, in contrast to the logic of deduction, the logic of 
induction does not have uniform, generally accepted, 
theoretical foundations as yet, although many impor- 
tant and interesting results have been obtained, e.g., 
concerning statistical and computational learning and 
others. 

Verification of the validity of hypotheses in the logic
of induction is based on experiment rather than the formal
reasoning of the logic of deduction. Physics is the
best illustration of this fact. The research on modern
inductive logic has a history several centuries long. It
is worthwhile mentioning here the outstanding English
philosophers Francis Bacon (1561-1626) and John Stuart
Mill (1806-1873) [21.37].

The creation of computers and their innovative applications
essentially contributed to the rapid growth
of interest in inductive reasoning. This domain is
developing very dynamically thanks to computer science.
Machine learning, knowledge discovery, reasoning
from data, expert systems, and others are examples
of new directions in inductive reasoning. Rough
set theory is very well suited as a theoretical basis
for inductive reasoning. Basic concepts of this theory
are well suited to represent and analyze knowledge
acquired from examples, which can next be used as
a starting point for generalization. Moreover,
rough set theory has been successfully applied in many
domains to find patterns in data (data mining) and
acquire knowledge from examples (learning from examples).
Thus, rough set theory seems to be another
candidate as a mathematical foundation of inductive
reasoning.

From a computer science point of view, the most
interesting kind of reasoning is common sense reasoning.
We use this kind of reasoning in our everyday lives, and
we face examples of it in newspapers, on radio and TV,
in politics and economics, and in debates and
discussions.

The starting point for such reasoning is the knowledge
possessed by a specific group of people (common
knowledge) concerning some subject and intuitive
methods of deriving conclusions from it. Here we
do not have the possibility to resolve a dispute by
means of methods given by deductive logic (reasoning)
or by inductive logic (experiment). So the best
known methods for solving the dilemma are voting,
negotiations, or even war. See, e.g., Gulliver's Travels
[21.38], where the hatred between Tramecksan
(High-Heels) and Slamecksan (Low-Heels) or disputes
between Big-Endians and Little-Endians could not be
resolved without a war. These methods do not reveal
the truth or falsity of the thesis under consideration at
all. Of course, such methods are not acceptable in
mathematics or physics. Nobody is going to settle the truth
of Fermat's theorem or Newton's laws by voting,
negotiating, or declaring war.

Reasoning of this kind is the least studied from
the theoretical point of view and its structure is not
sufficiently understood, in spite of much interesting
theoretical research in this domain [21.39]. The importance
of commonsense reasoning, considering its scope and
significance for some domains, is fundamental, and rough
set theory can also play an important role in it, but more
fundamental research must be done to this end. In
particular, the rough truth introduced and studied in [21.40]
seems to be important for investigating commonsense
reasoning in the rough set framework.




Let us consider a simple example. In the decision
system considered we assume U = Birds is a set of
birds that are described by some condition attributes
from a set A. The decision attribute is a binary attribute
Flies with possible values yes if the given bird flies
and no, otherwise. Then, we define the set of abnormal
birds by Ab_A(Birds) = A_*({x ∈ Birds : Flies(x) = no}),
where A_* and A^* denote the A-lower and A-upper
approximation, respectively.
Hence, we have Ab_A(Birds) = Birds − A^*({x ∈
Birds : Flies(x) = yes}) and Birds − Ab_A(Birds) =
A^*({x ∈ Birds : Flies(x) = yes}). This means that for
normal birds it is consistent, with knowledge
represented by A, to assume that they can fly, i.e., it is
possible that they can fly. One can optimize Ab_A(Birds)
using A to obtain a minimal boundary region in the
approximation of {x ∈ Birds : Flies(x) = no}.
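
The example can be made concrete with a small Python sketch. The birds, the two condition attributes, and their values are hypothetical data chosen only to illustrate the definitions.

```python
# Hypothetical decision system: bird -> (condition attribute values, Flies).
BIRDS = {
    "sparrow": ({"wings": "full", "size": "small"}, "yes"),
    "robin": ({"wings": "full", "size": "small"}, "yes"),
    "kiwi": ({"wings": "vestigial", "size": "small"}, "no"),
    "ostrich": ({"wings": "full", "size": "large"}, "no"),
    "eagle": ({"wings": "full", "size": "large"}, "yes"),
}
A = ("wings", "size")


def ind_class(x, attrs):
    """Objects indiscernible from x with respect to attrs."""
    signature = tuple(BIRDS[x][0][a] for a in attrs)
    return {y for y in BIRDS if tuple(BIRDS[y][0][a] for a in attrs) == signature}


def lower_approx(X, attrs):
    """Objects whose indiscernibility class is contained in X."""
    return {x for x in BIRDS if ind_class(x, attrs) <= X}


def upper_approx(X, attrs):
    """Objects whose indiscernibility class intersects X."""
    return {x for x in BIRDS if ind_class(x, attrs) & X}


FLIERS = {x for x, (_, flies) in BIRDS.items() if flies == "yes"}
NON_FLIERS = set(BIRDS) - FLIERS

ABNORMAL = lower_approx(NON_FLIERS, A)   # Ab_A(Birds)
NORMAL = upper_approx(FLIERS, A)         # Birds - Ab_A(Birds)
```

In this data only the kiwi is certainly a non-flier, so it is the only abnormal bird. The ostrich is indiscernible from the eagle with respect to A, so it remains consistent with the knowledge represented by A to assume that it can fly.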

It is worthwhile mentioning that in [21.41] an ap- 
proach was presented that combines the rough sets with 
nonmonotonic reasoning. Some basic concepts are dis- 
tinguished, which can be approximated on the basis of 
sensor measurements and more complex concepts that 
are approximated using so-called transducers defined 
by first-order theories constructed over approximated 
concepts. Another approach to commonsense reason- 
ing was developed in a number of papers. The approach 
is based on an ontological framework for approxima- 
tion. In this approach, approximations are constructed 
for concepts and dependencies between the concepts 
represented in a given ontology, expressed, e.g., in nat- 
ural language. Still another approach combining rough 
sets with logic programming has been developed. Let us 
also note that Pawlak proposed a new approach to con- 
flict analysis [21.42]. The approach was next extended 
in the rough set framework. 

To recapitulate, let us consider the following char- 
acteristics of the three above-mentioned kinds of rea- 
soning: 


a) Deductive
   i) Reasoning methods: axioms and rules of inference
   ii) Applications: mathematics
   iii) Theoretical foundations: complete theory
   iv) Conclusions: true conclusions from true premisses
   v) Hypotheses verification: formal proof
b) Inductive
   i) Reasoning methods: generalization from examples
   ii) Applications: natural sciences (physics)
   iii) Theoretical foundations: lack of generally
      accepted theory
   iv) Conclusions: not true but probable (possible)
   v) Hypotheses verification: empirical experiment
c) Common sense
   i) Reasoning methods: reasoning method based on
      common sense knowledge with intuitive rules of
      inference expressed in natural language
   ii) Applications: everyday life, humanities
   iii) Theoretical foundations: lack of generally
      accepted theory
   iv) Conclusions: obtained by mixture of deductive
      and inductive reasoning based on concepts
      expressed in natural language, e.g., with
      application of different inductive strategies for
      conflict resolution (such as voting, negotiations,
      cooperation, war) based on human behavioral
      patterns
   v) Hypotheses verification: human behavior.


There are numerous issues related to approximate 
reasoning under uncertainty. These issues are discussed 
in books on granular computing, rough mereology, and 
the computational complexity of algorithmic problems 
related to these issues. For more details, the reader is 
referred to the following books [21.26, 31, 43, 44]. 

Finally, we would like to stress that much more
work remains to be done to develop approximate reasoning
about complex vague concepts to make progress in the
development of intelligent systems. According to Leslie
Valiant [21.45] (the 2010 winner of the ACM
Turing Award, for his fundamental contributions to the
development of computational learning theory and to
the broader theory of computer science):


A fundamental question for artificial intelligence is 
to characterize the computational building blocks 
that are necessary for cognition. A specific chal- 
lenge is to build on the success of machine learning 
so as to cover broader issues in intelligence ... This 
requires, in particular a reconciliation between two 
contradictory characteristics — the apparent logi- 
cal nature of reasoning and the statistical nature of 
learning. 


It is worthwhile presenting two more views. The 
first one by Lotfi A. Zadeh, the founder of fuzzy sets 
and the computing with words (CW) paradigm [21.46]: 


Manipulation of perceptions plays a key role in hu- 
man recognition, decision and execution processes. 
As a methodology, computing with words provides 
a foundation for a computational theory of perceptions –
a theory which may have an important
bearing on how humans make and machines might
make – perception-based rational decisions in an
environment of imprecision, uncertainty and partial 
truth. ... computing with words, or CW for short, is 
a methodology in which the objects of computation 
are words and propositions drawn from a natural 
language. 


The other view is that of Judea Pearl [21.47] (the 
2011 winner of the ACM Turing Award, the high- 
est distinction in computer science, for fundamental 
contributions to artificial intelligence through the de- 
velopment of a calculus for probabilistic and causal 
reasoning): 


Traditional statistics is strong in devising ways of 
describing data and inferring distributional param- 
eters from sample. Causal inference requires two 
additional ingredients: a science-friendly language 
for articulating causal knowledge, and a mathe- 
matical machinery for processing that knowledge, 
combining it with data and drawing new causal 
conclusions about a phenomenon. 


The question arises about the logic relevant for the 
above-mentioned tasks. First, let us observe that the 
satisfiability relations in the IRGC framework can be 
treated as tools for constructing new granules. In fact, 
for a given satisfiability relation one can define the se- 
mantics of formulae related to this relation, i. e., which 
are the candidates for the new relevant granules. We 
would like to emphasize one very important feature.
The relevant satisfiability relation for the considered 
problems is not given but it should be induced (discov- 
ered) from partial information given by information or 
decision systems. For real-life problems it is often nec- 
essary to discover a hierarchy of satisfiability relations 
before we obtain the relevant target one. Granules con- 
structed on different levels of this hierarchy finally lead 
to relevant ones for approximation of complex vague 
concepts related to complex granules expressed using 
natural language. 

The reasoning making it possible to derive rel- 
evant c-granules for solutions of the target tasks is 
called adaptive judgment. Intuitive judgment and rational
judgment are distinguished as different kinds
of judgment [21.48]. Deduction and induction as well 
as abduction or analogy-based reasoning are involved 
in adaptive judgment. Among the tasks for adaptive 
judgment are the following ones, which support rea- 
soning under uncertainty toward: searching for relevant 
approximation spaces, discovery of new features, se- 
lection of relevant features, rule induction, discovery 
of inclusion measures, strategies for conflict resolu- 
tion, adaptation of measures based on the minimum 
description length principle, reasoning about changes, 
perception (action and sensory) attributes’ selection 
by agent control, adaptation of quality measures over 
computations relative to agents, adaptation of object 
structures, discovery of relevant contexts, strategies for 
knowledge representation and interaction with knowl- 
edge bases, ontology acquisition and approximation, 
learning in dialog of inclusion measures between gran- 
ules from different languages (e.g., the formal language 
of the system and the user’s natural language), strate- 
gies for adaptation of existing models, strategies for 
development and evolution of communication language 
among agents in distributed environments, strategies for 
risk management in distributed computational systems. 
Definitely, in the language used by agents for deal- 
ing with adaptive judgment (i. e., intuitive and rational) 
some deductive systems known from logic may be ap- 
plied for reasoning about knowledge relative to closed 
worlds. This may happen, e.g., if the agent languages 
are based on classical mathematical logic. However, if 
we move to interactions in open worlds, then new spe- 
cific rules or patterns relative to a given agent or group 
of agents in such worlds should be discovered. The pro- 
cess of inducing such rules or patterns is influenced 
by uncertainty because they are induced by agents un- 
der uncertain and/or imperfect knowledge about the 
environment. 

The concepts discussed, such as interactive com- 
putation and adaptive judgment, are among the basic 
concepts in Wisdom Technology (WisTech) [21.49, 50]. 
Let us mention here the WisTech meta-equation 


\[
\text{WISDOM} = \text{INTERACTIONS} + \text{ADAPTIVE JUDGMENT} + \text{KNOWLEDGE} . \tag{21.18}
\]

21.13 Conclusions


In the chapter, we have discussed some basic issues and 
methods related to rough sets together with some gen- 
eralizations, including those related to relationships of 
rough sets with inductive reasoning. We have also listed
some current research directions based on interactive
rough granular computing. For more details, the reader 
is referred to the literature cited at the beginning of this 
chapter (see also [21.9]).

22. Rough Set Methodology for Decision Aiding


Roman Słowiński, Salvatore Greco, Benedetto Matarazzo


Since its conception, the dominance-based rough 
set approach (DRSA) has been adapted to a large 
variety of decision problems. In this chapter we 
outline the rough set methodology designed for 
multi-attribute decision aiding. DRSA was pro- 
posed as an extension of the Pawlak concept of 
rough sets in order to deal with ordinal data. 
We focus on decision problems where all attributes 
describing objects of a decision problem have 
ordered value sets (scales). Such attributes are 
called criteria, and thus the problems are called 
multi-criteria decision problems. Criteria are real- 
valued functions of gain or cost type, depending 
on whether a greater value is better or worse, re- 
spectively. In these problems, we also assume the 
presence of a well-defined decision maker (DM)
(single or group DM) concerned with multi-criteria
classification, choice, and ranking.

Ordinal data are typically encountered in multi-attribute
decision problems, where a set of objects (also called
actions, acts, solutions, etc.) evaluated by a set of
attributes raises one of the following questions:
(i) how to assign the objects to ordered classes (ordinal
classification), (ii) how to choose the best subset of
objects (choice or, as its particular case, optimization), or
(iii) how to rank the objects from the best to the worst
(ranking). The answer to all of these questions in- 
volves an aggregation of the multi-attribute evaluation 
of objects, which takes into account a law relating the 
evaluation with the classification, or choice, or ranking 
decision. This law has to be discovered by inductive 
learning from data describing the considered decision 
situation. In the case of decision problems that correspond
to some physical phenomena, this law is a model
of cause-effect relationships, and in the case of hu- 
man decision making, this law is a decision maker’s 
preference model. In DRSA, these models have the 
form of a set of if..., then... decision rules. In the 
case of multi-attribute classification the syntax of rules 
is: if evaluation of object a is better (or worse) than 
given values of some attributes, then a belongs to at
least (at most) a given class, and in the case of multi-attribute
choice or ranking: if object a is preferred to
object b in at least (at most) given degrees with respect
to some attributes, then a is preferred to b in
at least (at most) a given degree. These models are
used to work out a recommendation concerning unseen
objects in the context of one of the three problem
statements.


22.1 Data Inconsistency as a Reason for Using Rough Sets


The data describing a given decision situation include
either observations of the DM’s past decisions in the same
decision context, or examples of decisions consciously
elicited by the DM on the demand of an analyst. These
data hide the value system of the DM, and thus they are
called preference information. This way of eliciting
preference information is called indirect, in opposition
to direct elicitation, when the DM is supposed to provide
information leading directly to the definition of all
preference model parameters, like weights and discrimination
thresholds of criteria, trade-off rates, etc. [22.1].

Past decisions or decision examples may, however,
be inconsistent with the dominance principle commonly
accepted for multi-criteria decision problems.
Decisions are inconsistent with the dominance principle
if:


• In the case of ordinal classification: object a has been
assigned to a worse decision class than object b, although
a is at least as good as b on all the considered
criteria, i. e., a dominates b.

• In the case of choice and ranking: a pair of objects
(a, b) has been assigned a degree of preference
worse than pair (c, d), although the differences of
evaluations between a and b on all the considered criteria
are at least as strong as the respective differences
of evaluations between c and d, i. e., pair (a, b) dominates
pair (c, d).


Thus, in order to build a preference model from
partly inconsistent preference information, we had the
idea to structure this data using the concept of a rough
set introduced by Pawlak [22.2, 3]. Originally, however,
Pawlak’s understanding of inconsistency was different
from the above inconsistency with the dominance principle.
The original rough set philosophy (Chap. 21)
is based on the assumption that with every object of
the universe U there is associated a certain amount of
information (data, knowledge). This information can
be expressed by means of a number of attributes that 
describe the objects. Objects which have the same de- 
scription are said to be indiscernible (or similar) with 
respect to the available information. The indiscernibil- 
ity relation thus generated constitutes the mathematical 
basis of rough set theory. It induces a partition of the 
universe into blocks of indiscernible objects, called el- 
ementary sets, which can be used to build knowledge 
about a real or abstract world. The use of the indiscerni- 
bility relation results in information granulation. 

Any subset X of the universe may be expressed in 
terms of these blocks either precisely (as a union of ele- 
mentary sets) or approximately. In the latter case, the 
subset X may be characterized by two ordinary sets, 
called the lower and upper approximations. A rough 
set is defined by means of these two approximations, 
which coincide in the case of an ordinary set. The lower 
approximation of X is composed of all the elementary 
sets included in X (whose elements, therefore, certainly 
belong to X), while the upper approximation of X con- 
sists of all the elementary sets which have a non-empty 
intersection with X (whose elements, therefore, may be- 
long to X). The difference between the upper and lower 
approximations constitutes the boundary region of the 
rough set, whose elements cannot be characterized with 
certainty as belonging or not to X (by using the avail- 
able information). The information about objects from 
the boundary region is, therefore, inconsistent or am- 
biguous. The cardinality of the boundary region states, 
moreover, the extent to which it is possible to express X 
in exact terms, on the basis of the available information. 
For this reason, this cardinality may be used as a mea- 
sure of vagueness of the information about X. 
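
The construction of elementary sets and of the lower and upper approximations described above can be illustrated with a minimal Python sketch. The toy objects, the attribute names (`colour`, `size`), and the function name `irsa_approximations` are assumptions introduced here for illustration only; they are not taken from the chapter.

```python
from collections import defaultdict


def irsa_approximations(objects, attributes, target):
    """Return (lower, upper, boundary) of `target` under indiscernibility.

    objects    : dict mapping object name -> dict of attribute values
    attributes : attributes used to describe (and discern) the objects
    target     : the subset X of object names to be approximated
    """
    # Elementary sets: blocks of objects with identical descriptions.
    blocks = defaultdict(set)
    for name, description in objects.items():
        blocks[tuple(description[a] for a in attributes)].add(name)

    lower, upper = set(), set()
    for block in blocks.values():
        if block <= target:   # block entirely included in X
            lower |= block
        if block & target:    # block has a non-empty intersection with X
            upper |= block
    return lower, upper, upper - lower


# Hypothetical data: o1 and o2 are indiscernible.
objects = {
    "o1": {"colour": "red", "size": "small"},
    "o2": {"colour": "red", "size": "small"},
    "o3": {"colour": "blue", "size": "large"},
    "o4": {"colour": "blue", "size": "small"},
}
X = {"o1", "o3"}
lower, upper, boundary = irsa_approximations(objects, ["colour", "size"], X)
```

In this toy case only o3 certainly belongs to X, while o1 and o2 form the boundary region because they share the same description but only o1 is in X.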

Some important characteristics of the rough set ap- 
proach make it a particularly interesting tool in a variety 
of problems and concrete applications. For example, it 
is possible to deal with both quantitative and qualitative
input data, and inconsistencies need not be removed
prior to the analysis. In terms of the output informa- 
tion, it is possible to acquire a posteriori information 
regarding the relevance of particular attributes and their 
subsets to the quality of approximation considered 
within the problem at hand. Moreover, the lower and 
upper approximations of a partition of U into decision 
classes prepare the ground for inducing certain and pos- 
sible knowledge patterns in the form of if... then... 
decision rules. 

Several attempts have been made to employ 
rough set theory for decision aiding [22.4,5]. The 
Indiscernibility-based Rough Set Approach (IRSA) is 
not able, however, to handle inconsistencies with re- 
spect to the dominance principle. 


22.1.1 From Indiscernibility-Based Rough 
Sets to Dominance-Based Rough Sets 


An extension of IRSA which deals with inconsisten- 
cies with respect to the dominance principle, which 
are typical for preference data, was proposed by Greco 
et al. in [22.6-8]. This extension is the dominance- 
based rough set approach (DRSA), which is mainly 
based on the substitution of the indiscernibility relation 
by a dominance relation in the rough approximation 
of decision classes. An important consequence of this 
fact is the possibility of inferring (from observations of 
past decisions or from exemplary decisions) the DM’s
preference model in terms of decision rules which are
logical statements of the type if..., then .... The sep- 
aration of certain and uncertain knowledge about the 
DM’s preferences is carried out by the distinction of dif- 
ferent kinds of decision rules, depending upon whether 
they are induced from lower approximations of de- 
cision classes or from the difference between upper 
and lower approximations (composed of inconsistent 
examples). Such a preference model is more general 
than the classical functional models considered within 
multi-attribute utility theory or the relational models 
considered, for example, in outranking methods [22.9-11].

This chapter is based on previous publications of 
the authors, in particular, on [22.12-14]. In the next 
section, we explain the need for replacing the indis- 
cernibility relation by the dominance relation in the 
definition of rough sets when reasoning about ordi- 
nal data. This leads us to Sect. 22.3, where DRSA is 
presented with respect to multi-criteria ordinal classifi- 
cation. This section also includes two special versions 
of DRSA: variable consistency DRSA (VC DRSA) and 
stochastic DRSA. Section 22.4 presents DRSA with re- 
spect to multi-criteria choice and ranking. Section 22.5 
characterizes some relevant extensions of DRSA, and 
Sect. 22.6 presents applications of DRSA to some op- 
erational research problems. Section 22.7 summarizes 
the features of DRSA applied to multi-criteria decision 
problems and concludes the chapter. 


22.2 The Need for Replacing the Indiscernibility Relation 
by the Dominance Relation when Reasoning About Ordinal Data 


When trying to apply the rough set concept based on 
indiscernibility to reasoning about preference ordered 
data, it has been noted that IRSA ignores not only 
the preference order in the value sets of attributes but 
also the monotonic relationship between evaluations of 
objects on such attributes (called criteria) and the pref- 
erence ordered value of decision (classification decision 
or degree of preference) [22.6, 15-17]. 

In order to explain the importance of the above 
monotonic relationship for data describing multi- 
criteria decision problems, let us consider the example 
of a data set concerning pupils’ achievements in a high 
school. Suppose that among the criteria used for eval- 
uation of the pupils there are results in Mathematics 
(Math) and Physics (Ph). There is also a General
Achievement (GA) result, which is considered as a classification
decision. The value sets of all three criteria
are composed of three values: bad, medium, and good. 
The preference order of these values is obvious: good 
is better than medium and bad, and medium is better 
than bad. The three values bad, medium, and good can 
be number-coded as 1, 2, and 3, respectively, making 
a gain-type criterion scale. One can also notice a seman- 
tic correlation between the two criteria and the classi- 
fication decision, which means that an improvement in 
one criterion should not worsen the classification de- 
cision, while the other criterion value is unchanged. 
Precisely, an improvement of a pupil’s score in Math
or Ph, with the other criterion value unchanged, should not
worsen the pupil’s general achievement (GA), but rather
improve it. In general terms, this requirement is concordant
with the dominance principle defined in Sect. 22.1.

This semantic correlation is also called a mono- 
tonicity constraint, and thus, an alternative name of 
the classification problem with semantic correlation be- 
tween evaluation criteria and classification decision is 
ordinal classification with monotonicity constraints. 

Two questions naturally follow the consideration of 
this example: 


• What classification rules can be drawn from the
pupils’ data set?

• How does the semantic correlation influence the
classification rules?


The answer to the first question is: monotonic if 
..., then... decision rules. Each decision rule is char- 
acterized by a condition profile and a decision profile, 
corresponding to vectors of threshold values on evalua- 
tion criteria and on classification decision, respectively. 
The answer to the second question is that condition 
and decision profiles of a decision rule should observe 
the dominance principle (monotonicity constraint) if 
the rule has at least one pair of semantically correlated 
criteria spanned over the condition and decision part. 
We say that one profile dominates another if the values 
of criteria of the first profile are not worse than the val- 
ues of criteria of the second profile. 

Let us explain the dominance principle with respect 
to decision rules on the pupils’ example. Suppose that 
two rules induced from the pupils’ data set relate Math 
and Ph on the condition side, with GA on the decision 
side: 


• rule #1: if Math = medium and Ph = medium, then
GA = good,

• rule #2: if Math = good and Ph = medium, then
GA = medium.


The two rules do not observe the dominance princi- 
ple because the condition profile of rule #2 dominates 
the condition profile of rule #1, while the decision pro- 
file of rule #2 is dominated by the decision profile of 
rule #1. Thus, in the sense of the dominance principle, 
the two rules are inconsistent, i. e., they are wrong. 
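
The check performed on rules #1 and #2 can be sketched in a few lines of Python. The number coding 1, 2, 3 follows the gain-type scale mentioned earlier; the names `SCALE`, `dominates`, and `violates_dominance` are hypothetical helpers introduced only for this illustration.

```python
# Gain-type number coding of the ordinal scale used in the pupils' example.
SCALE = {"bad": 1, "medium": 2, "good": 3}


def dominates(a, b, criteria):
    """True if profile a is at least as good as profile b on every criterion."""
    return all(SCALE[a[c]] >= SCALE[b[c]] for c in criteria)


def violates_dominance(a, b, criteria=("Math", "Ph"), decision="GA"):
    """True if a dominates b on the conditions but has a strictly worse decision."""
    return dominates(a, b, criteria) and SCALE[a[decision]] < SCALE[b[decision]]


rule1 = {"Math": "medium", "Ph": "medium", "GA": "good"}
rule2 = {"Math": "good", "Ph": "medium", "GA": "medium"}

# Rule #2 has the better condition profile but the worse decision profile.
inconsistent = violates_dominance(rule2, rule1)
```

Swapping the arguments gives no violation, because the condition profile of rule #1 does not dominate that of rule #2.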

One could say that the above rules are true because 
they are supported by examples of pupils from the an- 
alyzed data set, but this would mean that the examples 
are also inconsistent. The inconsistency may come from 
many sources. Examples include: 


• Missing attributes (regular ones or criteria) in the
description of objects. Maybe the data set does not
include such attributes as the opinion of the pupil’s
tutor expressed only verbally during an assessment
of the pupil’s GA by a school assessment committee.

• Unstable preferences of decision makers. Maybe
the members of the school assessment committee
changed their view on the influence of Math on GA
during the assessment.


Handling these inconsistencies is of crucial impor- 
tance for data structuring prior to induction of decision 
rules. They cannot be simply considered as noise or er- 
ror to be eliminated from data, or amalgamated with 
consistent data by some averaging operators. They 
should be identified and presented as uncertain rules. 

If the semantic correlation was ignored in prior 
knowledge, then the handling of the above-mentioned 
inconsistencies would be impossible. Indeed, there 
would be nothing wrong with rules #1 and #2. They 
would be supported by different examples discerned by 
the attributes considered. 

It has been acknowledged by many authors that 
rough set theory provides an excellent framework 
for dealing with inconsistencies in knowledge dis- 
covery [22.3, 18-24]. These authors show that the 
paradigm of rough set theory is that of granular com- 
puting, because the main concept of the theory (rough 
approximation of a set) is built up of blocks of ob- 
jects which are indiscernible by a given set of attributes, 
called granules of knowledge. In the space of regu- 
lar attributes, the indiscernibility granules are bounded 
sets. Decision rules induced from indiscernibility-based 
rough approximations are also built up of such granules. 

It appears, however, as demonstrated by the above 
pupils’ example, that rough sets and decision rules built 
up of indiscernibility granules are not able to handle 
inconsistency with respect to the dominance principle. 
For this reason, we have proposed an extension of the 
granular computing paradigm that enables us to take 
into account prior knowledge about multi-criteria eval- 
uation with monotonicity constraints. The combination 
of the new granules with the idea of rough approxima- 
tion is the DRSA approach [22.6, 8, 12-16, 25-27]. 

In the following, we present the concept of granules, 
which permit us to handle prior knowledge about multi- 
criteria evaluation with monotonicity constraints when 
inducing decision rules. 

Let $U$ be a finite set of objects (universe) and let $Q$
be a finite set of attributes divided into a set $C$ of
condition attributes and a set $D$ of decision attributes,
where $C \cap D = \emptyset$. Also, let $X_q$ be the set of possible
evaluations of considered objects with respect to attribute
$q \in Q$, so that

\[
X_C = \prod_{q=1}^{|C|} X_q \quad \text{and} \quad X_D = \prod_{q=1}^{|D|} X_q
\]

are attribute spaces corresponding to sets of condition
and decision attributes, respectively. The elements
of $X_C$ and $X_D$ can be interpreted as possible evaluations
of objects on attributes from set $C = \{1,\dots,|C|\}$ and
from set $D = \{1,\dots,|D|\}$, respectively. In the following,
with a slight abuse of notation, we shall denote the
value of object $x \in U$ on attribute $q \in Q$ by $x_q$.

Suppose, for simplicity, that all condition attributes 
in C and all decision attributes in D are criteria, and 
that C and D are semantically correlated. 

Let $\succeq_q$ be a weak preference relation on $U$, representing
a preference on the set of objects with respect
to criterion $q \in C \cup D$. Now, $x_q \succeq_q y_q$ means that $x_q$ is at
least as good as $y_q$ with respect to criterion $q$. On the
one hand, we say that $x$ dominates $y$ with respect to $P \subseteq C$
(shortly, $x$ $P$-dominates $y$) in the condition attribute
space $X_P$ (denoted by $x D_P y$) if $x_q \succeq_q y_q$ for all $q \in P$.
Assuming, without loss of generality, that the domains
of the criteria are number-coded (i. e., $X_q \subseteq \mathbb{R}$ for any
$q \in C$) and that they are ordered so that the preference
increases with the value (gain-type), we can say that
$x D_P y$ is equivalent to $x_q \ge y_q$ for all $q \in P$, $P \subseteq C$. Observe
that for each $x \in X_P$, $x D_P x$, i. e., $P$-dominance $D_P$
is reflexive. Moreover, for any $x, y, z \in X_P$, $x D_P y$ and
$y D_P z$ imply $x D_P z$, i. e., $P$-dominance $D_P$ is a transitive
relation. Being a reflexive and transitive relation, $P$-dominance
$D_P$ is a partial preorder. On the other hand,
the analogous definition holds in the decision attribute
space $X_R$, $R \subseteq D$, where $x_q \ge y_q$ for all $q \in R$ will be denoted
by $x D_R y$.

The dominance relations $x D_P y$ and $x D_R y$ ($P \subseteq C$
and $R \subseteq D$) are directional statements where $x$ is a subject
and $y$ is a referent.

If $x \in X_P$ is the referent, then one can define a set of
objects $y \in X_P$ dominating $x$, called the $P$-dominating
set (denoted by $D_P^+(x)$) and defined as
$D_P^+(x) = \{y \in U : y D_P x\}$. If $x \in X_P$ is the subject, then one can
define a set of objects $y \in X_P$ dominated by $x$, called the
$P$-dominated set (denoted by $D_P^-(x)$) and defined as
$D_P^-(x) = \{y \in U : x D_P y\}$.

$P$-dominating sets $D_P^+(x)$ and $P$-dominated sets
$D_P^-(x)$ correspond to positive and negative dominance
cones in $X_P$, with the origin $x$.
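
As a minimal sketch of these definitions, the following Python fragment computes the $P$-dominating and $P$-dominated sets for number-coded gain-type criteria. The four pupils and the function names `dominating_set` and `dominated_set` are assumptions made for illustration.

```python
def dominates(x, y, P):
    """x D_P y: x is at least as good as y on every criterion in P."""
    return all(x[q] >= y[q] for q in P)


def dominating_set(U, x, P):
    """D_P^+(x): names of the objects y in U such that y D_P x."""
    return {name for name, y in U.items() if dominates(y, U[x], P)}


def dominated_set(U, x, P):
    """D_P^-(x): names of the objects y in U such that x D_P y."""
    return {name for name, y in U.items() if dominates(U[x], y, P)}


# Hypothetical pupils evaluated on two gain-type criteria (1 = bad, 3 = good).
U = {
    "s1": {"Math": 3, "Ph": 2},
    "s2": {"Math": 2, "Ph": 2},
    "s3": {"Math": 2, "Ph": 3},
    "s4": {"Math": 1, "Ph": 1},
}
P = ("Math", "Ph")

up_cone_s2 = dominating_set(U, "s2", P)
down_cone_s2 = dominated_set(U, "s2", P)
```

Because dominance is reflexive, every object belongs to both of its own cones; s1 and s3 are incomparable, so neither appears in the other's cones.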

With respect to the decision attribute space $X_R$
(where $R \subseteq D$), the $R$-dominance relation enables us to
define the following sets

\[
Cl_R^{\ge x} = \{y \in U : y D_R x\} , \qquad Cl_R^{\le x} = \{y \in U : x D_R y\} .
\]

$Cl_{t_q} = \{x \in X_D : x_q = t_q\}$ is a decision class with respect
to $q \in D$. $Cl_R^{\ge x}$ is called the upward union of classes,
and $Cl_R^{\le x}$ is the downward union of classes. If $y \in Cl_R^{\ge x}$,
then $y$ belongs to class $Cl_{t_q}$, $x_q = t_q$, or better, on each
decision attribute $q \in R$. On the other hand, if $y \in Cl_R^{\le x}$,
then $y$ belongs to class $Cl_{t_q}$, $x_q = t_q$, or worse, on each
decision attribute $q \in R$. The upward and downward
unions of classes correspond to the positive and negative
dominance cones in $X_R$, respectively.

In this case, the granules of knowledge are open
sets in $X_P$ and $X_R$ defined by dominance cones $D_P^+(x)$,
$D_P^-(x)$ ($P \subseteq C$) and $Cl_R^{\ge x}$, $Cl_R^{\le x}$ ($R \subseteq D$), respectively.
Then, classification rules to be induced from data are
functions representing granules $Cl_R^{\ge x}$, $Cl_R^{\le x}$ by granules
$D_P^+(x)$, $D_P^-(x)$, respectively, in the condition attribute
space $X_P$, for any $P \subseteq C$ and $R \subseteq D$ and for any
$x \in X_P$.


22.3 The Dominance-based Rough Set Approach to Multi-Criteria Classification


22.3.1 Granular Computing 
with Dominance Cones 


When inducing classification rules, the set $D$ of decision
attributes is, usually, a singleton, $D = \{d\}$. Let
us make this assumption for further presentation, although
it is not necessary for DRSA. The decision
attribute $d$ makes a partition of $U$ into a finite number
of classes, $\mathbf{Cl} = \{Cl_t, t = 1,\dots,n\}$. Each object $x \in U$
belongs to one and only one class, $Cl_t \in \mathbf{Cl}$. The upward
and downward unions of classes boil down to, respectively,

\[
Cl_t^{\ge} = \bigcup_{s \ge t} Cl_s , \qquad Cl_t^{\le} = \bigcup_{s \le t} Cl_s ,
\]

where $t = 1,\dots,n$. Notice that for $t = 2,\dots,n$ we have
$Cl_t^{\ge} = U - Cl_{t-1}^{\le}$, i. e., all the objects not belonging to
class $Cl_t$ or better, belong to class $Cl_{t-1}$ or worse.
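
The upward and downward unions, and the identity relating them, can be checked on a small example. The sketch below assumes three preference-ordered classes indexed 1 (worst) to 3 (best); the class contents and the function names are hypothetical.

```python
def upward_union(classes, t):
    """Cl_t^>=: union of all classes with index s >= t."""
    return set().union(*(objs for s, objs in classes.items() if s >= t))


def downward_union(classes, t):
    """Cl_t^<=: union of all classes with index s <= t."""
    return set().union(*(objs for s, objs in classes.items() if s <= t))


# Hypothetical partition of five pupils into ordered classes 1 < 2 < 3.
classes = {1: {"s4"}, 2: {"s1"}, 3: {"s2", "s3", "s5"}}
universe = set().union(*classes.values())

at_least_medium = upward_union(classes, 2)
at_most_bad = downward_union(classes, 1)
```

For every t = 2, ..., n the upward union at t is the complement of the downward union at t - 1, which is what the test below verifies.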

Let us explain how the rough set concept has been 
generalized in DRSA, so as to enable granular comput- 
ing with dominance cones. 

Given a set of criteria $P \subseteq C$, the inclusion of an
object $x \in U$ to the upward union of classes $Cl_t^{\ge}$,
$t = 2,\dots,n$, is inconsistent with the dominance principle if
one of the following conditions holds:


• $x$ belongs to class $Cl_t$ or better but it is $P$-dominated
by an object $y$ belonging to a class worse than $Cl_t$,
i. e., $x \in Cl_t^{\ge}$ but $D_P^+(x) \cap Cl_{t-1}^{\le} \ne \emptyset$;

• $x$ belongs to a worse class than $Cl_t$ but it $P$-dominates
an object $y$ belonging to class $Cl_t$ or
better, i. e., $x \notin Cl_t^{\ge}$ but $D_P^-(x) \cap Cl_t^{\ge} \ne \emptyset$.


If, given a set of criteria $P \subseteq C$, the inclusion of $x \in U$
to $Cl_t^{\ge}$, where $t = 2,\dots,n$, is inconsistent with the
dominance principle, we say that $x$ belongs to $Cl_t^{\ge}$ with
some ambiguity. Thus, $x$ belongs to $Cl_t^{\ge}$ without any
ambiguity with respect to $P \subseteq C$, if $x \in Cl_t^{\ge}$ and there
is no inconsistency with the dominance principle. This
means that all objects $P$-dominating $x$ belong to $Cl_t^{\ge}$,
i. e., $D_P^+(x) \subseteq Cl_t^{\ge}$. Geometrically, this corresponds to
the inclusion of the complete set of objects contained
in the positive dominance cone originating in $x$, in the
positive dominance cone $Cl_t^{\ge}$ originating in $Cl_t$.

Furthermore, $x$ possibly belongs to $Cl_t^{\ge}$ with respect
to $P \subseteq C$ if one of the following conditions holds:


• According to decision attribute $d$, $x$ belongs to $Cl_t^{\ge}$.
• According to decision attribute $d$, $x$ does not belong
to $Cl_t^{\ge}$, but it is inconsistent in the sense of the dominance
principle with an object $y$ belonging to $Cl_t^{\ge}$.


In terms of ambiguity, $x$ possibly belongs to $Cl_t^{\ge}$
with respect to $P \subseteq C$, if $x$ belongs to $Cl_t^{\ge}$ with or
without any ambiguity. Due to the reflexivity of the $P$-dominance
relation $D_P$, the above conditions can be
summarized as follows: $x$ possibly belongs to class $Cl_t$
or better, with respect to $P \subseteq C$, if among the objects
$P$-dominated by $x$ there is an object $y$ belonging to
class $Cl_t$ or better, i. e.,

\[
D_P^-(x) \cap Cl_t^{\ge} \ne \emptyset .
\]

Geometrically, this corresponds to the non-empty intersection
of the set of objects contained in the negative
dominance cone originating in $x$, with the positive dominance
cone $Cl_t^{\ge}$ originating in $Cl_t$.


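For $P \subseteq C$, the set of all objects belonging to $Cl_t^{\ge}$
without any ambiguity constitutes the $P$-lower approximation
of $Cl_t^{\ge}$, denoted by $\underline{P}(Cl_t^{\ge})$, and the set of all
objects that possibly belong to $Cl_t^{\ge}$ constitutes the $P$-upper
approximation of $Cl_t^{\ge}$, denoted by $\overline{P}(Cl_t^{\ge})$. More
formally

\[
\begin{aligned}
\underline{P}(Cl_t^{\ge}) &= \{x \in U : D_P^+(x) \subseteq Cl_t^{\ge}\} , \\
\overline{P}(Cl_t^{\ge}) &= \{x \in U : D_P^-(x) \cap Cl_t^{\ge} \ne \emptyset\} ,
\end{aligned}
\]

where $t = 1,\dots,n$. Analogously, one can define the $P$-lower
approximation and the $P$-upper approximation of
$Cl_t^{\le}$

\[
\begin{aligned}
\underline{P}(Cl_t^{\le}) &= \{x \in U : D_P^-(x) \subseteq Cl_t^{\le}\} , \\
\overline{P}(Cl_t^{\le}) &= \{x \in U : D_P^+(x) \cap Cl_t^{\le} \ne \emptyset\} ,
\end{aligned}
\]

where $t = 1,\dots,n$.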
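
The four approximations can be computed directly from their definitions. The following is a minimal sketch, assuming number-coded gain-type criteria and a hypothetical data set of five pupils in which s1 dominates s2 but is assigned to a worse class; all function and variable names are illustrative.

```python
def dominates(x, y, P):
    """x D_P y on number-coded gain-type criteria."""
    return all(x[q] >= y[q] for q in P)


def d_plus(U, x, P):
    """D_P^+(x): objects dominating x."""
    return {n for n, y in U.items() if dominates(y, U[x], P)}


def d_minus(U, x, P):
    """D_P^-(x): objects dominated by x."""
    return {n for n, y in U.items() if dominates(U[x], y, P)}


def lower_upward(U, P, cl_up):
    return {x for x in U if d_plus(U, x, P) <= cl_up}


def upper_upward(U, P, cl_up):
    return {x for x in U if d_minus(U, x, P) & cl_up}


def lower_downward(U, P, cl_down):
    return {x for x in U if d_minus(U, x, P) <= cl_down}


def upper_downward(U, P, cl_down):
    return {x for x in U if d_plus(U, x, P) & cl_down}


# Hypothetical pupils; GA is the decision (1 = bad, 2 = medium, 3 = good).
U = {
    "s1": {"Math": 3, "Ph": 2, "GA": 2},
    "s2": {"Math": 2, "Ph": 2, "GA": 3},
    "s3": {"Math": 2, "Ph": 3, "GA": 3},
    "s4": {"Math": 1, "Ph": 1, "GA": 1},
    "s5": {"Math": 3, "Ph": 3, "GA": 3},
}
P = ("Math", "Ph")
cl3_up = {x for x in U if U[x]["GA"] >= 3}    # class "good" or better
cl2_down = {x for x in U if U[x]["GA"] <= 2}  # class "medium" or worse

low_up = lower_upward(U, P, cl3_up)
upp_up = upper_upward(U, P, cl3_up)
low_down = lower_downward(U, P, cl2_down)
upp_down = upper_downward(U, P, cl2_down)
```

The inconsistent pair s1, s2 is excluded from both lower approximations and included in both upper approximations.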

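The $P$-lower and $P$-upper approximations of $Cl_t^{\ge}$,
$t = 1,\dots,n$, can also be expressed in terms of unions
of positive dominance cones as follows

\[
\underline{P}(Cl_t^{\ge}) = \bigcup_{D_P^+(x) \subseteq Cl_t^{\ge}} D_P^+(x) , \qquad
\overline{P}(Cl_t^{\ge}) = \bigcup_{x \in Cl_t^{\ge}} D_P^+(x) .
\]

Analogously, the $P$-lower and $P$-upper approximations
of $Cl_t^{\le}$, $t = 1,\dots,n$, can be expressed in terms of
unions of negative dominance cones as follows

\[
\underline{P}(Cl_t^{\le}) = \bigcup_{D_P^-(x) \subseteq Cl_t^{\le}} D_P^-(x) , \qquad
\overline{P}(Cl_t^{\le}) = \bigcup_{x \in Cl_t^{\le}} D_P^-(x) .
\]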


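The $P$-lower and $P$-upper approximations so defined
satisfy the following inclusion properties for each
$t \in \{1,\dots,n\}$ and for all $P \subseteq C$

\[
\begin{aligned}
\underline{P}(Cl_t^{\ge}) &\subseteq Cl_t^{\ge} \subseteq \overline{P}(Cl_t^{\ge}) , \\
\underline{P}(Cl_t^{\le}) &\subseteq Cl_t^{\le} \subseteq \overline{P}(Cl_t^{\le}) .
\end{aligned}
\]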


All the objects belonging to $Cl_t^{\ge}$ and $Cl_t^{\le}$ with some
ambiguity constitute the $P$-boundary of $Cl_t^{\ge}$ and $Cl_t^{\le}$,
denoted by $Bn_P(Cl_t^{\ge})$ and $Bn_P(Cl_t^{\le})$, respectively. They
can be represented, in terms of upper and lower approximations,
as follows

\[
\begin{aligned}
Bn_P(Cl_t^{\ge}) &= \overline{P}(Cl_t^{\ge}) - \underline{P}(Cl_t^{\ge}) , \\
Bn_P(Cl_t^{\le}) &= \overline{P}(Cl_t^{\le}) - \underline{P}(Cl_t^{\le}) ,
\end{aligned}
\]

where $t = 1,\dots,n$. The $P$-lower and $P$-upper approximations
of the unions of classes $Cl_t^{\ge}$ and $Cl_t^{\le}$ have
an important complementarity property. It says that if
object $x$ belongs without any ambiguity to class $Cl_t$
or better, then it is impossible that it could belong to
class $Cl_{t-1}$ or worse, i. e.,

\[
\underline{P}(Cl_t^{\ge}) = U - \overline{P}(Cl_{t-1}^{\le}) , \quad t = 2,\dots,n .
\]

Due to the complementarity property, $Bn_P(Cl_t^{\ge}) =
Bn_P(Cl_{t-1}^{\le})$, for $t = 2,\dots,n$, which means that if $x$ belongs
with ambiguity to class $Cl_t$ or better, then it also
belongs with ambiguity to class $Cl_{t-1}$ or worse.
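
The boundaries and the complementarity property can be verified numerically. The sketch below is self-contained and uses a hypothetical five-pupil data set containing one inconsistent pair (s1 dominates s2 but has a worse GA); the helpers `cone` and `approximations` are names introduced here, not the authors'.

```python
def cone(U, x, P, positive):
    """D_P^+(x) if positive is True, otherwise D_P^-(x)."""
    if positive:
        return {n for n, y in U.items() if all(y[q] >= U[x][q] for q in P)}
    return {n for n, y in U.items() if all(U[x][q] >= y[q] for q in P)}


def approximations(U, P, union, upward):
    """(lower, upper) approximation of an upward or downward union of classes."""
    # Upward unions use D+ for the lower and D- for the upper approximation;
    # downward unions use the cones the other way round.
    lower = {x for x in U if cone(U, x, P, upward) <= union}
    upper = {x for x in U if cone(U, x, P, not upward) & union}
    return lower, upper


U = {
    "s1": {"Math": 3, "Ph": 2, "GA": 2},
    "s2": {"Math": 2, "Ph": 2, "GA": 3},
    "s3": {"Math": 2, "Ph": 3, "GA": 3},
    "s4": {"Math": 1, "Ph": 1, "GA": 1},
    "s5": {"Math": 3, "Ph": 3, "GA": 3},
}
P = ("Math", "Ph")
cl3_up = {x for x in U if U[x]["GA"] >= 3}
cl2_down = {x for x in U if U[x]["GA"] <= 2}

low_up, upp_up = approximations(U, P, cl3_up, upward=True)
low_down, upp_down = approximations(U, P, cl2_down, upward=False)

boundary_up = upp_up - low_up
boundary_down = upp_down - low_down
```

Both boundaries consist of exactly the inconsistent pair, and the lower approximation of the upward union equals the complement of the upper approximation of the adjacent downward union.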

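Considering application of the lower and the upper
approximations based on dominance $D_P$, $P \subseteq C$, to any
set $X \subseteq U$, instead of the unions of classes $Cl_t^{\ge}$ and
$Cl_t^{\le}$, one obtains upward lower and upper approximations
$\underline{P}^{\ge}(X)$ and $\overline{P}^{\ge}(X)$, as well as downward lower
and upper approximations $\underline{P}^{\le}(X)$ and $\overline{P}^{\le}(X)$, as follows

\[
\begin{aligned}
\underline{P}^{\ge}(X) &= \{x \in U : D_P^+(x) \subseteq X\} , \\
\overline{P}^{\ge}(X) &= \{x \in U : D_P^-(x) \cap X \ne \emptyset\} , \\
\underline{P}^{\le}(X) &= \{x \in U : D_P^-(x) \subseteq X\} , \\
\overline{P}^{\le}(X) &= \{x \in U : D_P^+(x) \cap X \ne \emptyset\} .
\end{aligned}
\]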


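From the definition of rough approximations
$\underline{P}^{\ge}(X)$, $\overline{P}^{\ge}(X)$, $\underline{P}^{\le}(X)$ and $\overline{P}^{\le}(X)$, we can also obtain
the following properties of the $P$-lower and $P$-upper approximations
[22.28, 29]:


1. $\underline{P}^{\ge}(\emptyset) = \overline{P}^{\ge}(\emptyset) = \underline{P}^{\le}(\emptyset) = \overline{P}^{\le}(\emptyset) = \emptyset$,
   $\underline{P}^{\ge}(U) = \overline{P}^{\ge}(U) = \underline{P}^{\le}(U) = \overline{P}^{\le}(U) = U$,
2. $\overline{P}^{\ge}(X \cup Y) = \overline{P}^{\ge}(X) \cup \overline{P}^{\ge}(Y)$,
   $\overline{P}^{\le}(X \cup Y) = \overline{P}^{\le}(X) \cup \overline{P}^{\le}(Y)$,
3. $\underline{P}^{\ge}(X \cap Y) = \underline{P}^{\ge}(X) \cap \underline{P}^{\ge}(Y)$,
   $\underline{P}^{\le}(X \cap Y) = \underline{P}^{\le}(X) \cap \underline{P}^{\le}(Y)$,
4. $X \subseteq Y \Rightarrow \underline{P}^{\ge}(X) \subseteq \underline{P}^{\ge}(Y)$,
   $X \subseteq Y \Rightarrow \underline{P}^{\le}(X) \subseteq \underline{P}^{\le}(Y)$,
5. $X \subseteq Y \Rightarrow \overline{P}^{\ge}(X) \subseteq \overline{P}^{\ge}(Y)$,
   $X \subseteq Y \Rightarrow \overline{P}^{\le}(X) \subseteq \overline{P}^{\le}(Y)$,
6. $\underline{P}^{\ge}(X \cup Y) \supseteq \underline{P}^{\ge}(X) \cup \underline{P}^{\ge}(Y)$,
   $\underline{P}^{\le}(X \cup Y) \supseteq \underline{P}^{\le}(X) \cup \underline{P}^{\le}(Y)$,
7. $\overline{P}^{\ge}(X \cap Y) \subseteq \overline{P}^{\ge}(X) \cap \overline{P}^{\ge}(Y)$,
   $\overline{P}^{\le}(X \cap Y) \subseteq \overline{P}^{\le}(X) \cap \overline{P}^{\le}(Y)$,
8. $\underline{P}^{\ge}(\underline{P}^{\ge}(X)) = \overline{P}^{\ge}(\underline{P}^{\ge}(X)) = \underline{P}^{\ge}(X)$,
   $\underline{P}^{\le}(\underline{P}^{\le}(X)) = \overline{P}^{\le}(\underline{P}^{\le}(X)) = \underline{P}^{\le}(X)$,
9. $\overline{P}^{\ge}(\overline{P}^{\ge}(X)) = \underline{P}^{\ge}(\overline{P}^{\ge}(X)) = \overline{P}^{\ge}(X)$,
   $\overline{P}^{\le}(\overline{P}^{\le}(X)) = \underline{P}^{\le}(\overline{P}^{\le}(X)) = \overline{P}^{\le}(X)$.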


From the knowledge discovery point of view,
$P$-lower approximations of unions of classes represent
certain knowledge provided by criteria from
$P \subseteq C$, while $P$-upper approximations represent possible
knowledge and the $P$-boundaries contain doubtful
knowledge provided by the criteria from $P \subseteq C$.


22.3.2 Variable Consistency 
Dominance-Based Rough Set 
Approach (VC-DRSA) 


The above definitions of rough approximations are 
based on a strict application of the dominance princi- 
ple. However, when defining non-ambiguous objects, 
it is reasonable to accept a limited proportion of neg- 
ative examples, particularly for large data tables. This 
relaxed version of DRSA is called the variable con- 
sistency dominance-based rough set approach (VC- 
DRSA) model [22.30]. 

For any $P \subseteq C$, we say that $x \in U$ belongs to $Cl_t^{\ge}$
with no ambiguity at consistency level $l \in (0, 1]$, if $x \in Cl_t^{\ge}$
and at least $l \cdot 100\%$ of all objects $y \in U$ dominating
$x$ with respect to $P$ also belong to $Cl_t^{\ge}$, i. e.,

\[
\frac{|D_P^+(x) \cap Cl_t^{\ge}|}{|D_P^+(x)|} \ge l .
\]

The term $|D_P^+(x) \cap Cl_t^{\ge}| / |D_P^+(x)|$ is called rough
membership and can be interpreted as conditional probability
$\Pr(y \in Cl_t^{\ge} \mid y \in D_P^+(x))$. The level $l$ is called the
consistency level because it controls the degree of consistency
between objects qualified as belonging to $Cl_t^{\ge}$
without any ambiguity. In other words, if $l < 1$, then at
most $(1 - l) \cdot 100\%$ of all objects $y \in U$ dominating $x$
with respect to $P$ do not belong to $Cl_t^{\ge}$ and thus contradict
the inclusion of $x$ in $Cl_t^{\ge}$.

Analogously, for any $P \subseteq C$ we say that $x \in U$ belongs
to $Cl_t^{\le}$ with no ambiguity at consistency level
$l \in (0, 1]$, if $x \in Cl_t^{\le}$ and at least $l \cdot 100\%$ of all the objects
$y \in U$ dominated by $x$ with respect to $P$ also belong
to $Cl_t^{\le}$, i. e.,

\[
\frac{|D_P^-(x) \cap Cl_t^{\le}|}{|D_P^-(x)|} \ge l .
\]

The rough membership $|D_P^-(x) \cap Cl_t^{\le}| / |D_P^-(x)|$ can
be interpreted as conditional probability $\Pr(y \in Cl_t^{\le} \mid
y \in D_P^-(x))$. Thus, for any $P \subseteq C$, each object $x \in U$
is either ambiguous or non-ambiguous at consistency
level $l$ with respect to the upward union $Cl_t^{\ge}$
($t = 2,\dots,n$) or with respect to the downward union $Cl_t^{\le}$
($t = 1,\dots,n-1$).

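The concept of non-ambiguous objects at some
consistency level $l$ naturally leads to the definition of
P-lower approximations of the unions of classes $Cl_t^{\geq}$
and $Cl_t^{\leq}$, which can be formally presented as follows

$$\underline{P}^{l}(Cl_t^{\geq}) = \left\{ x \in Cl_t^{\geq} : \frac{|D_P^{+}(x) \cap Cl_t^{\geq}|}{|D_P^{+}(x)|} \geq l \right\},$$

$$\underline{P}^{l}(Cl_t^{\leq}) = \left\{ x \in Cl_t^{\leq} : \frac{|D_P^{-}(x) \cap Cl_t^{\leq}|}{|D_P^{-}(x)|} \geq l \right\}.$$

Given $P \subseteq C$ and consistency level $l$, we can define
the P-upper approximations of $Cl_t^{\geq}$ and $Cl_t^{\leq}$, denoted
by $\overline{P}^{l}(Cl_t^{\geq})$ and $\overline{P}^{l}(Cl_t^{\leq})$, respectively, by complementation
of $\underline{P}^{l}(Cl_{t-1}^{\leq})$ and $\underline{P}^{l}(Cl_{t+1}^{\geq})$ with respect to $U$ as
follows

$$\overline{P}^{l}(Cl_t^{\geq}) = U - \underline{P}^{l}(Cl_{t-1}^{\leq}), \quad t = 2, \ldots, n,$$

$$\overline{P}^{l}(Cl_t^{\leq}) = U - \underline{P}^{l}(Cl_{t+1}^{\geq}), \quad t = 1, \ldots, n-1 .$$

$\overline{P}^{l}(Cl_t^{\geq})$ can be interpreted as the set of all the objects
belonging to $Cl_t^{\geq}$, which are possibly ambiguous
at consistency level $l$. Analogously, $\overline{P}^{l}(Cl_t^{\leq})$ can be
interpreted as the set of all the objects belonging to $Cl_t^{\leq}$,
which are possibly ambiguous at consistency level $l$.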
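
The definitions above can be illustrated with a minimal Python sketch. The toy decision table, the function names (`lower_up`, `upper_up`, and so on), and the assumption that both criteria are of gain type are hypothetical and serve only to show how the consistency level $l$ changes the approximations.

```python
from typing import Dict, Set, Tuple

# Hypothetical toy decision table: object -> (evaluations on two gain-type
# criteria, class index). A higher class index means a better class.
DATA: Dict[str, Tuple[Tuple[int, ...], int]] = {
    "a": ((1, 1), 1),
    "b": ((2, 1), 1),
    "c": ((2, 2), 2),
    "d": ((3, 2), 1),  # dominates c but is assigned to a worse class
    "e": ((3, 3), 2),
    "f": ((4, 3), 2),
}
U: Set[str] = set(DATA)


def dominates(x: str, y: str) -> bool:
    """True if x is at least as good as y on every criterion."""
    return all(a >= b for a, b in zip(DATA[x][0], DATA[y][0]))


def d_plus(x: str) -> Set[str]:
    """P-dominating set: objects that dominate x (always contains x)."""
    return {y for y in U if dominates(y, x)}


def d_minus(x: str) -> Set[str]:
    """P-dominated set: objects dominated by x (always contains x)."""
    return {y for y in U if dominates(x, y)}


def upward_union(t: int) -> Set[str]:
    return {x for x in U if DATA[x][1] >= t}


def downward_union(t: int) -> Set[str]:
    return {x for x in U if DATA[x][1] <= t}


def lower_up(t: int, l: float) -> Set[str]:
    """Lower approximation of the upward union at consistency level l."""
    cl = upward_union(t)
    return {x for x in cl if len(d_plus(x) & cl) / len(d_plus(x)) >= l}


def lower_down(t: int, l: float) -> Set[str]:
    """Lower approximation of the downward union at consistency level l."""
    cl = downward_union(t)
    return {x for x in cl if len(d_minus(x) & cl) / len(d_minus(x)) >= l}


def upper_up(t: int, l: float) -> Set[str]:
    """Upper approximation of the upward union, by complementation."""
    return U - lower_down(t - 1, l)


def upper_down(t: int, l: float) -> Set[str]:
    """Upper approximation of the downward union, by complementation."""
    return U - lower_up(t + 1, l)
```

In this toy table, object c has rough membership 3/4 in the upward union of class 2 because d dominates it from a worse class. Calling `lower_up(2, 1.0)` therefore excludes c, while `lower_up(2, 0.75)` admits it.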


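The P-boundaries (P-doubtful regions) of $Cl_t^{\geq}$ and $Cl_t^{\leq}$
are defined as

$$Bn_P^{l}(Cl_t^{\geq}) = \overline{P}^{l}(Cl_t^{\geq}) - \underline{P}^{l}(Cl_t^{\geq}),$$

$$Bn_P^{l}(Cl_t^{\leq}) = \overline{P}^{l}(Cl_t^{\leq}) - \underline{P}^{l}(Cl_t^{\leq}),$$

where $t = 1, \ldots, n$. The VC-DRSA model provides
some degree of flexibility in assigning objects to lower
and upper approximations of the unions of decision
classes. It can easily be demonstrated that for
$0 < l' < l \leq 1$ and $t = 2, \ldots, n$,

$$\underline{P}^{l}(Cl_t^{\geq}) \subseteq \underline{P}^{l'}(Cl_t^{\geq}) \quad \text{and} \quad \overline{P}^{l'}(Cl_t^{\geq}) \subseteq \overline{P}^{l}(Cl_t^{\geq}) .$$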


The VC-DRSA model was inspired by Ziarko's
model of the variable precision rough set approach [22.31].
However, there is a significant difference
in the definition of rough approximations because
$\underline{P}^{l}(Cl_t^{\geq})$ and $\overline{P}^{l}(Cl_t^{\geq})$ are composed of non-ambiguous
and ambiguous objects at the consistency level $l$, respectively,
while Ziarko's $\underline{P}^{l}(Cl_t)$ and $\overline{P}^{l}(Cl_t)$ are composed
of P-indiscernibility sets such that at least
$l \times 100\%$ of these sets are included in $Cl_t$ or have a
non-empty intersection with $Cl_t$, respectively. If one
would like to use Ziarko's definition of variable precision
rough approximations in the context of multiple-criteria
classification, then the P-indiscernibility sets
should be substituted by P-dominating sets $D_P^{+}(x)$.
However, then the notion of ambiguity that naturally
leads to the general definition of rough approximations
[22.21] loses its meaning. Moreover, a bad
side effect of the direct use of Ziarko's definition
is that a lower approximation $\underline{P}^{l}(Cl_t^{\geq})$ may include
objects $y$ assigned to $Cl_h$, where $h$ is much less
than $t$, if $y$ belongs to $D_P^{+}(x)$, which was included
in $\underline{P}^{l}(Cl_t^{\geq})$. When the decision classes are preference
ordered, it is reasonable to expect that objects assigned
to far worse classes than the considered union
are not counted to the lower approximation of this
union.

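The VC-DRSA model presented above has been
generalized in [22.32, 33]. The generalized model applies
two types of consistency measures in the definition
of lower approximations:

• Gain-type consistency measures $f_{\geq t}^{P}(x)$, $f_{\leq t}^{P}(x)$

$$\underline{P}^{\alpha_{\geq t}}(Cl_t^{\geq}) = \{ x \in Cl_t^{\geq} : f_{\geq t}^{P}(x) \geq \alpha_{\geq t} \},$$
$$\underline{P}^{\alpha_{\leq t}}(Cl_t^{\leq}) = \{ x \in Cl_t^{\leq} : f_{\leq t}^{P}(x) \geq \alpha_{\leq t} \},$$

• Cost-type consistency measures $g_{\geq t}^{P}(x)$, $g_{\leq t}^{P}(x)$

$$\underline{P}^{\beta_{\geq t}}(Cl_t^{\geq}) = \{ x \in Cl_t^{\geq} : g_{\geq t}^{P}(x) \leq \beta_{\geq t} \},$$
$$\underline{P}^{\beta_{\leq t}}(Cl_t^{\leq}) = \{ x \in Cl_t^{\leq} : g_{\leq t}^{P}(x) \leq \beta_{\leq t} \},$$

where $\alpha_{\geq t}$, $\alpha_{\leq t}$, $\beta_{\geq t}$, $\beta_{\leq t}$ are threshold values on
the consistency measures that condition the inclusion
of object $x$ in the P-lower approximation of $Cl_t^{\geq}$ or
$Cl_t^{\leq}$. Here are the consistency measures considered
in [22.33]: for all $x \in U$ and $P \subseteq C$

$$\mu_{\geq t}^{P}(x) = \frac{|D_P^{+}(x) \cap Cl_t^{\geq}|}{|D_P^{+}(x)|}, \qquad
\mu_{\leq t}^{P}(x) = \frac{|D_P^{-}(x) \cap Cl_t^{\leq}|}{|D_P^{-}(x)|},$$

$$\overline{\mu}_{\geq t}^{P}(x) = \max_{R \subseteq P,\; z \in D_R^{-}(x) \cap Cl_t^{\geq}} \frac{|D_R^{+}(z) \cap Cl_t^{\geq}|}{|D_R^{+}(z)|},$$

$$\overline{\mu}_{\leq t}^{P}(x) = \max_{R \subseteq P,\; z \in D_R^{+}(x) \cap Cl_t^{\leq}} \frac{|D_R^{-}(z) \cap Cl_t^{\leq}|}{|D_R^{-}(z)|},$$

$$B_{\geq t}^{P}(x) = \frac{|D_P^{+}(x) \cap Cl_t^{\geq}| \, |Cl_{t-1}^{\leq}|}{|D_P^{+}(x) \cap Cl_{t-1}^{\leq}| \, |Cl_t^{\geq}|}, \quad t = 2, \ldots, m,$$

$$B_{\leq t}^{P}(x) = \frac{|D_P^{-}(x) \cap Cl_t^{\leq}| \, |Cl_{t+1}^{\geq}|}{|D_P^{-}(x) \cap Cl_{t+1}^{\geq}| \, |Cl_t^{\leq}|}, \quad t = 1, \ldots, m-1,$$

$$\varepsilon_{\geq t}^{P}(x) = \frac{|D_P^{+}(x) \cap Cl_{t-1}^{\leq}|}{|Cl_{t-1}^{\leq}|}, \quad t = 2, \ldots, m,$$

$$\varepsilon_{\leq t}^{P}(x) = \frac{|D_P^{-}(x) \cap Cl_{t+1}^{\geq}|}{|Cl_{t+1}^{\geq}|}, \quad t = 1, \ldots, m-1,$$

$$\varepsilon_{\geq t}^{\prime P}(x) = \frac{|D_P^{+}(x) \cap Cl_{t-1}^{\leq}|}{|Cl_t^{\geq}|}, \quad t = 2, \ldots, m,$$

$$\varepsilon_{\leq t}^{\prime P}(x) = \frac{|D_P^{-}(x) \cap Cl_{t+1}^{\geq}|}{|Cl_t^{\leq}|}, \quad t = 1, \ldots, m-1,$$

$$\varepsilon_{\geq t}^{* P}(x) = \max_{r \leq t} \varepsilon_{\geq r}^{P}(x), \qquad
\varepsilon_{\leq t}^{* P}(x) = \max_{r \geq t} \varepsilon_{\leq r}^{P}(x),$$

with

$$\mu_{\geq t}^{P}(x),\ \mu_{\leq t}^{P}(x),\ \overline{\mu}_{\geq t}^{P}(x),\ \overline{\mu}_{\leq t}^{P}(x),\ B_{\geq t}^{P}(x),\ B_{\leq t}^{P}(x)$$

being gain-type consistency measures and

$$\varepsilon_{\geq t}^{P}(x),\ \varepsilon_{\leq t}^{P}(x),\ \varepsilon_{\geq t}^{* P}(x),\ \varepsilon_{\leq t}^{* P}(x),\ \varepsilon_{\geq t}^{\prime P}(x),\ \varepsilon_{\leq t}^{\prime P}(x)$$

being cost-type consistency measures.

To be concordant with the rough set philosophy,
consistency measures should enjoy some monotonicity
properties (Table 22.1). A consistency measure is
monotonic if it does not decrease (or does not increase)
when:

(m1) The set of attributes is growing.

(m2) The set of objects is growing.

(m3) The union of ordered classes is growing.

(m4) x improves its evaluation, so that it dominates
more objects.

Table 22.1 Monotonicity properties of consistency measures (after [22.33])

Consistency measure                                                    (m1)  (m2)  (m3)  (m4)
$\mu_{\geq t}^{P}(x)$, $\mu_{\leq t}^{P}(x)$ (rough membership)                 no    yes   yes   no
$\overline{\mu}_{\geq t}^{P}(x)$, $\overline{\mu}_{\leq t}^{P}(x)$              yes   yes   yes   yes
$B_{\geq t}^{P}(x)$, $B_{\leq t}^{P}(x)$ (Bayesian)                             no    no    no    no
$\varepsilon_{\geq t}^{P}(x)$, $\varepsilon_{\leq t}^{P}(x)$                    yes   yes   no    yes
$\varepsilon_{\geq t}^{* P}(x)$, $\varepsilon_{\leq t}^{* P}(x)$                yes   yes   yes   yes
$\varepsilon_{\geq t}^{\prime P}(x)$, $\varepsilon_{\leq t}^{\prime P}(x)$      yes   yes   yes   yes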

For every $P \subseteq C$, the objects being consistent in the
sense of the dominance principle with all upward and
downward unions of classes are called P-correctly classified.
For every $P \subseteq C$, the quality of approximation of
classification Cl by the set of criteria P is defined as the
ratio between the number of P-correctly classified objects
and the number of all the objects in the decision
table. Since the objects which are P-correctly classified
are those that do not belong to any P-boundary
of unions $Cl_t^{\geq}$ and $Cl_t^{\leq}$, $t = 1, \ldots, n$, the quality of
approximation of classification Cl by the set of criteria P
can be written as

$$\gamma_P(Cl) = \frac{\left| U - \left( \left( \bigcup_{t \in \{1,\ldots,n\}} Bn_P(Cl_t^{\geq}) \right) \cup \left( \bigcup_{t \in \{1,\ldots,n\}} Bn_P(Cl_t^{\leq}) \right) \right) \right|}{|U|}$$

$$= \frac{\left| U - \left( \bigcup_{t \in \{1,\ldots,n\}} Bn_P(Cl_t^{\geq}) \right) \right|}{|U|}
= \frac{\left| U - \left( \bigcup_{t \in \{1,\ldots,n\}} Bn_P(Cl_t^{\leq}) \right) \right|}{|U|} .$$

$\gamma_P(Cl)$ can be seen as a measure of the quality of
knowledge that can be extracted from the decision table,
where P is the set of criteria and Cl is the classification
considered.

Each minimal subset $P \subseteq C$, such that $\gamma_P(Cl) = \gamma_C(Cl)$,
is called a reduct of Cl and is denoted by
$RED_{Cl}$. Note that a decision table can have more than
one reduct. The intersection of all reducts is called the
core and is denoted by $CORE_{Cl}$. Criteria from $CORE_{Cl}$
cannot be removed from the decision table without deteriorating
the knowledge to be discovered. This means
that in set C there are three categories of criteria:

• Indispensable criteria included in the core.

• Exchangeable criteria included in some reducts but
not in the core.

• Redundant criteria that are neither indispensable
nor exchangeable, thus not included in any reduct.


Note that reducts are minimal subsets of crite- 
ria conveying the relevant knowledge contained in the 
decision table. This knowledge is relevant for the ex- 
planation of patterns in a given decision table but not 
necessarily for prediction. 
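
As an illustration, the quality of approximation, the reducts, and the core can be computed by brute force on a small table. The sketch below is in Python. The data, the function names, and the exhaustive search over subsets are assumptions made for illustration only, and the search is exponential in the number of criteria. An object is treated as P-correctly classified when no other object contradicts it under the dominance principle.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

# Hypothetical toy table: object -> (evaluations on criteria 0, 1, 2; class).
DATA: Dict[str, Tuple[Tuple[int, ...], int]] = {
    "a": ((1, 2, 2), 1),
    "b": ((2, 2, 2), 2),
    "c": ((2, 1, 1), 1),
}
CRITERIA: Tuple[int, ...] = (0, 1, 2)


def dominates(x: str, y: str, P: Tuple[int, ...]) -> bool:
    """True if x is at least as good as y on every criterion in P."""
    ex, ey = DATA[x][0], DATA[y][0]
    return all(ex[i] >= ey[i] for i in P)


def is_consistent(x: str, P: Tuple[int, ...]) -> bool:
    """x is P-correctly classified if no object contradicts the dominance
    principle together with x."""
    cx = DATA[x][1]
    for y in DATA:
        cy = DATA[y][1]
        if dominates(y, x, P) and cy < cx:
            return False
        if dominates(x, y, P) and cx < cy:
            return False
    return True


def quality(P: Tuple[int, ...]) -> float:
    """Share of P-correctly classified objects in the table."""
    return sum(is_consistent(x, P) for x in DATA) / len(DATA)


def reducts() -> List[Tuple[int, ...]]:
    """Minimal subsets of criteria that preserve the quality of the full set."""
    full = quality(CRITERIA)
    found: List[Tuple[int, ...]] = []
    for size in range(1, len(CRITERIA) + 1):
        for P in combinations(CRITERIA, size):
            is_minimal = not any(set(r) <= set(P) for r in found)
            if quality(P) == full and is_minimal:
                found.append(P)
    return found


def core(reds: List[Tuple[int, ...]]) -> Set[int]:
    """Intersection of all reducts."""
    return set.intersection(*(set(r) for r in reds))
```

In this toy table criterion 0 is indispensable, while criteria 1 and 2 are exchangeable: either one, together with criterion 0, restores the full quality of approximation.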

It has been shown in [22.34] that the quality of 
classification satisfies properties of set functions called 
fuzzy measures. For this reason, we can use the quality 
of classification for the calculation of indices that mea- 
sure the relevance of particular attributes and/or criteria, 
in addition to the strength of interactions between them. 
The useful indices are: the value index and interaction
indices of Shapley and Banzhaf; the interaction indices
of Murofushi-Soneda and Roubens; and the Möbius
representation [22.15]. All these indices can help to
sess the interaction between the criteria considered and 
can help us to choose the best reduct. 


22.3.3 Stochastic Dominance-based Rough 
Set Approach 


From a probabilistic point of view, the assignment of
object $x_i$ to at least class $t$ can be made with probability
$\Pr(y_i \geq t \mid x_i)$, where $y_i$ is the classification decision
for $x_i$, $t = 1, \ldots, n$. This probability is supposed to
satisfy the usual axioms of probability

$$\Pr(y_i \geq 1 \mid x_i) = 1,$$

$$\Pr(y_i \leq t \mid x_i) = 1 - \Pr(y_i \geq t+1 \mid x_i), \text{ and}$$

$$\Pr(y_i \geq t \mid x_i) \leq \Pr(y_i \geq t' \mid x_i) \quad \text{for } t \geq t' .$$

These probabilities are unknown but can be estimated
from data.

For each class $t = 2, \ldots, n$, we have a binary problem
of estimating the conditional probabilities
$\Pr(y_i \geq t \mid x_i) = 1 - \Pr(y_i < t \mid x_i)$. It can be solved by isotonic
regression [22.35]. Let $y_{it} = 1$ if $y_i \geq t$, otherwise $y_{it} = 0$.
Let also $p_{it}$ be the estimate of the probability
$\Pr(y_i \geq t \mid x_i)$. Then, choose estimates $p_{it}^{*}$ which minimize the
squared distance to the class assignment $y_{it}$, subject to
the monotonicity constraints

$$\text{Minimize } \sum_{i=1}^{|U|} (y_{it} - p_{it})^2$$
$$\text{subject to } p_{it} \geq p_{jt} \text{ for all } x_i, x_j \in U \text{ such that } x_i \succeq x_j,$$

where $x_i \succeq x_j$ means that $x_i$ dominates $x_j$.


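Then, stochastic $\alpha$-lower approximations for classes
at least $t$ and at most $t-1$ can be defined as

$$\underline{P}^{\alpha}(Cl_t^{\geq}) = \{ x_i \in U : \Pr(y_i \geq t \mid x_i) \geq \alpha \},$$

$$\underline{P}^{\alpha}(Cl_{t-1}^{\leq}) = \{ x_i \in U : \Pr(y_i < t \mid x_i) \geq \alpha \}.$$

Replacing the unknown probabilities $\Pr(y_i \geq t \mid x_i)$
and $\Pr(y_i < t \mid x_i)$ by their estimates $p_{it}^{*}$ and $1 - p_{it}^{*}$
obtained from isotonic regression, we obtain

$$\underline{P}^{\alpha}(Cl_t^{\geq}) = \{ x_i \in U : p_{it}^{*} \geq \alpha \},$$
$$\underline{P}^{\alpha}(Cl_{t-1}^{\leq}) = \{ x_i \in U : p_{it}^{*} \leq 1 - \alpha \},$$

where parameter $\alpha \in [0.5, 1]$ controls the allowed
amount of inconsistency.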

Solving isotonic regression requires $O(|U|^4)$ time,
but a good heuristic needs only $O(|U|^2)$. In fact, as
shown in [22.35], we do not really need to know the
probability estimates to obtain stochastic lower approximations.
We only need to know for which objects $x_i$,
$p_{it}^{*} \geq \alpha$ and for which $x_i$, $p_{it}^{*} \leq 1 - \alpha$. This can be found
by solving a linear programming (reassignment) problem.

As before, $y_{it} = 1$ if $y_i \geq t$, otherwise $y_{it} = 0$. Let $d_{it}$
be the decision variable which determines a new class
assignment for object $x_i$. Then, reassign objects from
the union of classes indicated by $y_{it}$ to the union of classes
indicated by $d_{it}^{*}$, such that the new class assignments
are consistent with the dominance principle, where $d_{it}^{*}$
results from solving the following linear programming
problem

$$\text{Minimize } \sum_{i=1}^{|U|} w_{y_{it}} \, |y_{it} - d_{it}|$$
$$\text{subject to } d_{it} \geq d_{jt} \text{ for all } x_i, x_j \in U \text{ such that } x_i \succeq x_j,$$

where $w_{y_{it}}$ are some positive weights and $x_i \succeq x_j$ means
that $x_i$ dominates $x_j$.

Due to unimodularity of the constraint matrix, the
optimal solution of this linear programming problem is
always integer, i.e., $d_{it}^{*} \in \{0, 1\}$. For all objects consistent
with the dominance principle, $d_{it}^{*} = y_{it}$. If we set
$w_0 = \alpha$ and $w_1 = 1 - \alpha$ (so that both weights are positive), then the optimal solution $d_{it}^{*}$
satisfies: $d_{it}^{*} = 1 \Leftrightarrow p_{it}^{*} \geq \alpha$. If we set $w_0 = 1 - \alpha$ and
$w_1 = \alpha$, then the optimal solution $d_{it}^{*}$ satisfies:
$d_{it}^{*} = 0 \Leftrightarrow p_{it}^{*} \leq 1 - \alpha$.
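
The reassignment problem can be sketched in Python for a very small data set. This sketch does not use a linear programming solver: it enumerates all binary assignments, which is feasible only for a handful of objects, and it returns the first optimum found if several exist. The function names, the toy data, and the weight choice $w_0 = \alpha$, $w_1 = 1 - \alpha$ (which needs $\alpha < 1$ to keep both weights positive) are assumptions of the example.

```python
from itertools import product
from typing import List, Sequence, Set, Tuple


def dominates(u: Sequence[int], v: Sequence[int]) -> bool:
    """True if evaluation vector u is at least as good as v everywhere."""
    return all(a >= b for a, b in zip(u, v))


def reassign(evals: List[Tuple[int, ...]], y: List[int],
             w0: float, w1: float) -> Tuple[int, ...]:
    """Brute-force solution of the reassignment problem for one threshold t.

    y[i] is 1 if object i belongs to 'class at least t', else 0.
    w0 / w1 are the costs of changing a 0 / a 1, respectively.
    """
    n = len(y)
    best, best_cost = None, float("inf")
    for d in product((0, 1), repeat=n):
        # Monotonicity: if x_i dominates x_j then d_i >= d_j.
        violated = any(
            dominates(evals[i], evals[j]) and d[i] < d[j]
            for i in range(n) for j in range(n)
        )
        if violated:
            continue
        cost = sum((w1 if y[i] == 1 else w0) * abs(y[i] - d[i])
                   for i in range(n))
        if cost < best_cost:
            best, best_cost = d, cost
    return best


def stochastic_lower(evals: List[Tuple[int, ...]], y: List[int],
                     alpha: float) -> Tuple[Set[int], Set[int]]:
    """Indices in the alpha-lower approximations of 'at least t' and
    'at most t-1', obtained by solving the reassignment problem twice."""
    d_up = reassign(evals, y, alpha, 1 - alpha)
    d_down = reassign(evals, y, 1 - alpha, alpha)
    at_least = {i for i, d in enumerate(d_up) if d == 1}
    at_most = {i for i, d in enumerate(d_down) if d == 0}
    return at_least, at_most


# Hypothetical data: one criterion; objects 1, 2, 3 tie and disagree on y.
EVALS = [(1,), (2,), (2,), (2,), (3,)]
Y = [0, 1, 1, 0, 1]
```

For this data the tied objects share the isotonic estimate 2/3, so they enter the lower approximation of the upward union at $\alpha = 0.6$ but not at $\alpha = 0.7$, which is what the two calls `stochastic_lower(EVALS, Y, 0.6)` and `stochastic_lower(EVALS, Y, 0.7)` reproduce.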

For each $t = 2, \ldots, n$, solving the reassignment
problem twice, we can obtain the lower approximations
$\underline{P}^{\alpha}(Cl_t^{\geq})$, $\underline{P}^{\alpha}(Cl_{t-1}^{\leq})$, without knowing the probability
estimates.


22.3.4 Induction of Decision Rules


Using the terms of knowledge discovery, the
dominance-based rough approximations of upward and
downward unions of classes are applied on the data set
in the pre-processing stage. As a result of this stage, the
data are structured in a way that facilitates induction
of if ..., then ... decision rules with a guaranteed
consistency level. For a given upward or downward
union of classes, $Cl_t^{\geq}$ or $Cl_s^{\leq}$, the decision rules
induced under the hypothesis that objects belonging to
$\underline{P}(Cl_t^{\geq})$ or $\underline{P}(Cl_s^{\leq})$ are positive and all the others are
negative suggest an assignment to class $Cl_t$ or better,
or to class $Cl_s$ or worse, respectively. On the other
hand, the decision rules induced under a hypothesis that
objects belonging to the intersection $\overline{P}(Cl_s^{\leq}) \cap \overline{P}(Cl_t^{\geq})$
are positive and all the others are negative, suggest an
assignment to some classes between $Cl_s$ and $Cl_t$ ($s < t$).

In the case of preference ordered data it is meaningful
to consider the following five types of decision
rules:


1. Certain $D_{\geq}$-decision rules. These provide lower
profile descriptions for objects belonging to $Cl_t^{\geq}$
without ambiguity:
if $x_{q1} \succeq_{q1} r_{q1}$ and $x_{q2} \succeq_{q2} r_{q2}$ and ... $x_{qp} \succeq_{qp} r_{qp}$,
then $x \in Cl_t^{\geq}$, where for each $w_q, z_q \in X_q$, $w_q \succeq_q z_q$
means $w_q$ is at least as good as $z_q$.

2. Possible $D_{\geq}$-decision rules. Such rules provide
lower profile descriptions for objects belonging to
$Cl_t^{\geq}$ with or without any ambiguity:
if $x_{q1} \succeq_{q1} r_{q1}$ and $x_{q2} \succeq_{q2} r_{q2}$ and ... $x_{qp} \succeq_{qp} r_{qp}$,
then $x$ possibly belongs to $Cl_t^{\geq}$.

3. Certain $D_{\leq}$-decision rules. These give upper profile
descriptions for objects belonging to $Cl_t^{\leq}$ without
ambiguity:
if $x_{q1} \preceq_{q1} r_{q1}$ and $x_{q2} \preceq_{q2} r_{q2}$ and ... $x_{qp} \preceq_{qp} r_{qp}$,
then $x \in Cl_t^{\leq}$, where for each $w_q, z_q \in X_q$, $w_q \preceq_q z_q$
means $w_q$ is at most as good as $z_q$.

4. Possible $D_{\leq}$-decision rules. These provide upper
profile descriptions for objects belonging to $Cl_t^{\leq}$
with or without any ambiguity:
if $x_{q1} \preceq_{q1} r_{q1}$ and $x_{q2} \preceq_{q2} r_{q2}$ and ... $x_{qp} \preceq_{qp} r_{qp}$,
then $x$ possibly belongs to $Cl_t^{\leq}$.

5. Approximate $D_{\geq\leq}$-decision rules. These represent
simultaneously lower and upper profile descriptions
for objects belonging to $Cl_s \cup Cl_{s+1} \cup \cdots \cup Cl_t$ without
the possibility of discerning the actual class:
if $x_{q1} \succeq_{q1} r_{q1}$ and ... $x_{qk} \succeq_{qk} r_{qk}$ and $x_{q(k+1)} \preceq_{q(k+1)} r_{q(k+1)}$
and ... $x_{qp} \preceq_{qp} r_{qp}$, then $x \in Cl_s \cup Cl_{s+1} \cup \cdots \cup Cl_t$.


In the left-hand side of a $D_{\geq\leq}$-decision rule we can
have $x_q \succeq_q r_q$ and $x_q \preceq_q r'_q$, where $r_q \leq r'_q$ for the same
$q \in C$. Moreover, if $r_q = r'_q$ the two conditions boil
down to $x_q \sim_q r_q$, where for each $w_q, z_q \in X_q$, $w_q \sim_q z_q$
means $w_q$ is indifferent to $z_q$.

A rule is minimal if there is no other rule with a left-hand
side that has at least the same weakness (which
means that it uses a subset of elementary conditions
and/or weaker elementary conditions) and which has
a right-hand side that has at least the same strength
(which means a $D_{\geq}$- or a $D_{\leq}$-decision rule assigning
objects to the same union or sub-union of classes, or
a $D_{\geq\leq}$-decision rule assigning objects to the same or
larger set of classes).

Rules of type 1) and 3) represent certain knowl- 
edge extracted from the decision table, while the rules 
of type 2) and 4) represent possible knowledge. Rules 
of type 5) represent doubtful knowledge. 

Rules of type 1) and 3) are exact if they do not cover 
negative examples; they are probabilistic, otherwise. In 
the latter case, each rule is characterized by a confi- 
dence ratio, representing the probability that an object 
matching the left-hand side of the rule also matches its 
right-hand side. 
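
The confidence ratio of a rule can be illustrated with a short Python sketch. The `UpwardRule` class, its field names, and the toy table are hypothetical; the rule form assumed here is an upward rule on gain-type criteria with integer evaluations.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class UpwardRule:
    """Hypothetical encoding of a rule 'if g_q(x) >= r_q for all listed q,
    then x belongs to class t or better'."""
    thresholds: Dict[int, int]  # criterion index -> required minimum value
    t: int                      # class index on the right-hand side

    def matches(self, evaluation: Tuple[int, ...]) -> bool:
        return all(evaluation[q] >= r for q, r in self.thresholds.items())


def confidence(rule: UpwardRule,
               table: List[Tuple[Tuple[int, ...], int]]) -> Optional[float]:
    """Share of covered objects that also satisfy the right-hand side.

    Returns None when the rule covers no object."""
    covered = [cls for evaluation, cls in table if rule.matches(evaluation)]
    if not covered:
        return None
    return sum(cls >= rule.t for cls in covered) / len(covered)


# Hypothetical decision table: (evaluations on two criteria, class index).
TABLE = [
    ((1, 1), 1), ((2, 1), 1), ((2, 2), 2),
    ((3, 2), 1), ((3, 3), 2), ((4, 3), 2),
]

exact_rule = UpwardRule({1: 3}, t=2)               # covers no negative example
probabilistic_rule = UpwardRule({0: 2, 1: 2}, t=2)  # covers one negative example
```

Here `confidence(exact_rule, TABLE)` is 1.0, so the rule is exact, while `confidence(probabilistic_rule, TABLE)` is 0.75 because one of the four covered objects belongs to class 1.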

A set of decision rules is complete if it is able to 
cover all objects from the decision table in such a way 
that consistent objects are re-classified to their original 
classes and inconsistent objects are classified to clus- 
ters of classes that refer to this inconsistency. Each set 
of decision rules that is complete and non-redundant is 
called minimal. Note that an exclusion of any rule from 
this set makes it non-complete. 

In the case of VC-DRSA, the decision rules are
probabilistic because they are induced from the P-lower
approximations whose composition is controlled
by the user-specified consistency level $l$. Consequently,
the value of confidence $\alpha$ for the rule should be
constrained from the bottom. It is reasonable to require
that the smallest accepted confidence level of the rule
should not be lower than the currently used consistency
level $l$. Indeed, in the worst case, some objects
from the P-lower approximation may create a rule using
all the criteria from P, thus giving a confidence
$\alpha \geq l$.

Observe that the syntax of decision rules induced
from dominance-based rough approximations uses the
concept of dominance cones: each condition profile
is a dominance cone in $X_C$, and each decision profile
is a dominance cone in $X_D$. In both cases the
cone is positive for $D_{\geq}$-rules and negative for
$D_{\leq}$-rules.

Also note that dominance cones that correspond to
condition profiles can originate in any point of $X_C$, without
the risk of being too specific. Thus, in contrast to
granular computing based on the indiscernibility (or
similarity) relation, in the case of granular computing
based on dominance, the condition attribute space $X_C$
need not be discretized [22.28, 36, 37].

Procedures for induction of rules from dominance- 
based rough approximations have been proposed 
in [22.38, 39]. A publicly available computer imple- 
mentation of one of these procedures is called JMAF 
(java multi-criteria and multi-attribute analysis frame- 
work) [22.40, 41]. 

The utility of decision rules is threefold: they ex- 
plain (summarize) decisions made on objects from the 
dataset, they can be used to make decisions with respect 
to new (unseen) objects which are matching conditions 
of some rules, and they permit to build up a strategy 
of intervention [22.42]. The attractiveness of particular 
decision rules can be measured in many different ways; 
however, the most convincing measures are Bayesian 
confirmation measures enjoying a special monotonicity 
property, as reported in [22.43, 44]. 

In [22.45], a new methodology for the induction of 
monotonic decision trees from dominance-based rough 
approximations of preference ordered decision classes 
was proposed. 

It is finally worth noting that several algebraic mod- 
els have been proposed for DRSA [22.29, 46,47] — 
the algebraic structures are based on bipolar disjoint 
representation (positive and negative) of the interior 
and the exterior of a concept. These algebra mod- 
els give elegant representations of the basic properties 
of dominance-based rough sets. Moreover, a topol- 
ogy for DRSA in a bitopological space was proposed 
in [22.48]. 


22.3.5 Rule-based Classification Algorithms 


We will now comment upon the application of decision
rules to some objects described by criteria from
C. When applying $D_{\geq}$-decision rules to an object $x$, it
is possible that $x$ either matches the left-hand side of
at least one decision rule or it does not. In the case
of at least one such match, it is reasonable to conclude
that $x$ belongs to class $Cl_t$, because it is the
lowest class of the upward union $Cl_t^{\geq}$ which results
from intersection of all the right-hand sides of the rules
covering $x$. More precisely, if $x$ matches the left-hand
side of rules $\rho_1, \rho_2, \ldots, \rho_m$, having right-hand sides
$x \in Cl_{t1}^{\geq}, x \in Cl_{t2}^{\geq}, \ldots, x \in Cl_{tm}^{\geq}$, then $x$ is assigned to
class $Cl_t$, where $t = \max\{t1, t2, \ldots, tm\}$. In the case of
no matching, we can conclude that $x$ belongs to $Cl_1$,
i.e., to the worst class, since no rule with a right-hand
side suggesting a better classification of $x$ covers this
object.

Analogously, when applying $D_{\leq}$-decision rules to
the object $x$, we can conclude that $x$ belongs either to
class $Cl_t$ (because it is the highest class of the downward
union $Cl_t^{\leq}$ resulting from the intersection of all
the right-hand sides of the rules covering $x$), or to class
$Cl_n$, i.e., to the best class, when $x$ is not covered by
any rule. More precisely, if $x$ matches the left-hand side
of rules $\rho_1, \rho_2, \ldots, \rho_m$, having right-hand sides
$x \in Cl_{t1}^{\leq}, x \in Cl_{t2}^{\leq}, \ldots, x \in Cl_{tm}^{\leq}$, then $x$ is assigned to class $Cl_t$,
where $t = \min\{t1, t2, \ldots, tm\}$. In the case of no matching,
it is concluded that $x$ belongs to the best class
$Cl_n$, because no rule with a right-hand side suggesting
a worse classification of $x$ covers this object. Finally,
when applying $D_{\geq\leq}$-decision rules to $x$, it is possible
to conclude that $x$ belongs to the union of all the
classes suggested in the right-hand side of the rules covering
$x$.
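
The max/min scheme described above can be sketched in a few lines of Python. The rule encoding (a dictionary of thresholds per criterion paired with a class index) and the example rules are hypothetical, and gain-type criteria are assumed.

```python
from typing import Dict, List, Tuple

# Hypothetical rule encoding: (thresholds per criterion index, class index t).
Rule = Tuple[Dict[int, int], int]


def classify_upward(x: Tuple[int, ...], rules: List[Rule], worst: int = 1) -> int:
    """Apply 'at least' rules: take the best suggested class among the
    matching rules, or the worst class when no rule matches."""
    ts = [t for cond, t in rules if all(x[q] >= r for q, r in cond.items())]
    return max(ts) if ts else worst


def classify_downward(x: Tuple[int, ...], rules: List[Rule], best: int) -> int:
    """Apply 'at most' rules: take the worst suggested class among the
    matching rules, or the best class when no rule matches."""
    ts = [t for cond, t in rules if all(x[q] <= r for q, r in cond.items())]
    return min(ts) if ts else best


# Hypothetical rule sets for three preference ordered classes (1 < 2 < 3).
UP_RULES: List[Rule] = [({0: 2}, 2), ({0: 3, 1: 3}, 3)]
DOWN_RULES: List[Rule] = [({0: 2}, 2), ({1: 1}, 1)]
```

For instance, `classify_upward((3, 3), UP_RULES)` returns 3 because both rules match and the intersection of their right-hand sides is the union starting at class 3, whereas `classify_upward((1, 5), UP_RULES)` falls back to the worst class.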

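A new classification algorithm was proposed
in [22.49]. Let $\varphi_1 \to \psi_1, \ldots, \varphi_k \to \psi_k$ be the rules
matching object $x$. Then, $R_t(x) = \{ j : Cl_t \in \psi_j,\ j = 1, \ldots, k \}$
denotes the set of rules matching $x$, which
recommend assignment of object $x$ to a union including
class $Cl_t$, and $R_{\neg t}(x) = \{ j : Cl_t \notin \psi_j,\ j = 1, \ldots, k \}$
denotes the set of rules matching $x$, which do not
recommend assignment of object $x$ to a union including
class $Cl_t$. $\|\varphi_j\|$, $\|\psi_j\|$ are sets of objects with property
$\varphi_j$ and $\psi_j$, respectively, $j = 1, \ldots, k$. For a classified
object $x$, one has to calculate the score for each candidate
class

$$score(Cl_t, x) = score^{+}(Cl_t, x) - score^{-}(Cl_t, x),$$

where

$$score^{+}(Cl_t, x) = \frac{\left| \bigcap_{j \in R_t(x)} \|\varphi_j\| \cap Cl_t \right|^2}{\left| \bigcap_{j \in R_t(x)} \|\varphi_j\| \right| \times |Cl_t|}$$

and

$$score^{-}(Cl_t, x) = \frac{\left| \bigcup_{j \in R_{\neg t}(x)} \|\varphi_j\| \cap \bigcup_{j \in R_{\neg t}(x)} \|\psi_j\| \right|^2}{\left| \bigcup_{j \in R_{\neg t}(x)} \|\varphi_j\| \right| \times \left| \bigcup_{j \in R_{\neg t}(x)} \|\psi_j\| \right|} .$$

$score^{+}(Cl_t, x)$ and $score^{-}(Cl_t, x)$ can be interpreted in
terms of conditional probability as a product of confidence
and coverage of the matching rules

$$score^{+}(Cl_t, x) = \Pr(\{\varphi_j : j \in R_t(x)\} \mid Cl_t) \times \Pr(Cl_t \mid \{\varphi_j : j \in R_t(x)\}),$$

$$score^{-}(Cl_t, x) = \Pr(\{\varphi_j : j \in R_{\neg t}(x)\} \mid \neg Cl_t) \times \Pr(\neg Cl_t \mid \{\varphi_j : j \in R_{\neg t}(x)\}).$$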


The recommendation of the univocal classification
$x \to Cl_t$ is such that

$$Cl_t = \arg\max_{t \in \{1, \ldots, n\}} \left[ score(Cl_t, x) \right] .$$


Examples illustrating the application of DRSA to 
multi-criteria classification in a didactic way can be 
found in [22.12—-14, 50]. 


22.4 The Dominance-based Rough Set Approach 
to Multi-Criteria Choice and Ranking 


22.4.1 Differences with Respect 
to Multi-Criteria Classification 


One of the very first extensions of DRSA concerned 
preference ordered data representing pairwise compar- 
isons (i.e., binary relations) between objects on both, 
condition and decision attributes [22.7, 8, 25,51]. Note 
that while classification is based on the absolute eval- 
uation of objects, choice and ranking refer to pairwise 
comparisons of objects. In this case, the decision rules 
to be discovered from the data characterize a com- 
prehensive binary relation on the set of objects. If 
this relation is a preference relation and if, among the 
condition attributes, there are some criteria which are 
semantically correlated with the comprehensive prefer- 
ence relation, then the data set (serving as the learning 
sample) can be considered as preference information 
provided by a DM in a multi-criteria choice or ranking 
problem. In consequence, the comprehensive prefer- 
ence relation characterized by the decision rules discov- 
ered from this data set can be considered as a preference 
model of the DM. It may be used to explain the decision 
policy of the DM and to recommend an optimal choice 
or preference ranking with respect to new objects. 

Let us consider a finite set A of objects evaluated 
by a finite set C of criteria. The optimal choice (or the 
preference ranking) in set A is semantically correlated 
with the criteria from set C. The preference information
concerning the multi-criteria choice or ranking
problem is a data set in the form of a pairwise comparison
table which includes pairs of some reference
objects from a subset $B \subseteq A \times A$. The pairs are described by
preference relations on particular criteria and a comprehensive
preference relation. One such example is
a weak preference relation called the outranking rela- 
tion. By using DRSA for the analysis of the pairwise 


comparison table, we can obtain a rough approxima- 
tion of the outranking relation by a dominance relation. 
The decision rules induced from the rough approxima- 
tion are then applied to the complete set A of the objects 
associated with the choice or ranking. As a result, one 
obtains a four-valued outranking relation on this set. 
In order to obtain a recommendation, it is advisable 
to use an exploitation procedure based on the net flow 
score of the objects. We present this methodology in 
more detail below. 


22.4.2 The Pairwise Comparison Table 
as Input Preference Information 


Given a multi-criteria choice or ranking problem, a DM
can express the preferences by pairwise comparisons
of the reference objects. In the following, $xSy$ denotes
the presence, while $xS^{c}y$ denotes the absence of the
outranking relation for a pair of objects $(x, y) \in A \times A$.
Relation $xSy$ reads object x is at least as good as object y.

For each pair of reference objects $(x, y) \in B \subseteq A \times A$,
the DM can select one of the three following possibilities:

1. Object x is as good as y, i.e., $xSy$.

2. Object x is worse than y, i.e., $xS^{c}y$.

3. The two objects are incomparable at the present
stage.


A pairwise comparison table, denoted by $S_{PCT}$, is
then created on the basis of this information. The first
m columns correspond to the criteria from set C. The
last, i.e., the (m+1)-th column, represents the comprehensive
binary preference relation $S$ or $S^{c}$. The rows
correspond to the pairs from B. For each pair in $S_{PCT}$,
a difference between criterion values is put in the corresponding
column. If the DM judges that two objects
are incomparable, then the corresponding pair does not
appear in $S_{PCT}$.

We will define $S_{PCT}$ more formally. For any criterion
$g_i \in C$, let $T_i$ be a finite set of binary relations
defined on A on the basis of the evaluations of objects
from A with respect to the considered criterion $g_i$, such
that for every $(x, y) \in A \times A$ exactly one binary relation
$t \in T_i$ is verified. More precisely, given the domain $V_i$
of $g_i \in C$, if $v_i', v_i'' \in V_i$ are the respective evaluations of
$x, y \in A$ by means of $g_i$ and $(x, y) \in t$, with $t \in T_i$, then
for each $w, z \in A$ having the same evaluations $v_i', v_i''$ by
means of $g_i$, $(w, z) \in t$. Furthermore, let $T_d$ be a set of
binary relations defined on set A (comprehensive pairwise
comparisons) such that at most one binary relation
$t \in T_d$ is verified for every $(x, y) \in A \times A$.

The pairwise comparison table is defined as the data
table $S_{PCT} = \langle B, C \cup \{d\}, T_G \cup T_d, f \rangle$, where $B \subseteq A \times A$
is a non-empty set of exemplary pairwise comparisons
of reference objects, $T_G = \bigcup_{g_i \in C} T_i$, $d$ is a decision corresponding
to the comprehensive pairwise comparison
(comprehensive preference relation), and
$f : B \times (C \cup \{d\}) \to T_G \cup T_d$ is a total function such that $f[(x, y), g_i] \in T_i$
for every $(x, y) \in A \times A$ and for each $g_i \in C$, and
$f[(x, y), d] \in T_d$ for every $(x, y) \in B$. It follows that for
any pair of reference objects $(x, y) \in B$ there is verified
one and only one binary relation $t \in T_d$. Thus, $T_d$ induces
a partition of B. In fact, the data table $S_{PCT}$ can be
seen as a decision table, since the set C of considered
criteria and the decision d are distinguished.

We consider a pairwise comparison table where the
set $T_d$ is composed of two binary relations defined on A:

• x outranks y (denoted by $xSy$ or $(x, y) \in S$), where
$(x, y) \in B$,

• x does not outrank y (denoted by $xS^{c}y$ or $(x, y) \in S^{c}$),
where $(x, y) \in B$, and $S \cup S^{c} = B$.

Observe that the binary relation S is reflexive, but
not necessarily transitive or complete.


22.4.3 Rough Approximation 
of Preference Relations 


In the following, we will distinguish between two types
of evaluation scales of criteria: cardinal and ordinal.
Let $C^{N}$ be the set of criteria expressing preferences
on a cardinal scale, and let $C^{O}$ be the set of criteria
expressing preferences on an ordinal scale, such that
$C^{N} \cup C^{O} = C$ and $C^{N} \cap C^{O} = \emptyset$. Moreover, for each
$P \subseteq C$, we denote by $P^{O}$ the subset of P composed of
criteria expressing preferences on an ordinal scale, i.e.,
$P^{O} = P \cap C^{O}$, and by $P^{N}$ we denote the subset of P composed
of criteria expressing preferences on a cardinal
scale, i.e., $P^{N} = P \cap C^{N}$. Of course, for each $P \subseteq C$, we
have $P = P^{N} \cup P^{O}$ and $P^{N} \cap P^{O} = \emptyset$.

The meaning of the two scales is such that in the
case of the cardinal scale we can specify the intensity
of the preference for a given difference of evaluations,
while in the case of the ordinal scale, this is not possible
and we can only establish an order of evaluations.


Multi-Graded Dominance 

We assume that the pairwise comparisons of reference
objects on cardinal criteria from set $C^{N}$ can be represented
in terms of graded preference relations (for
example, very weak preference, weak preference, strict
preference, strong preference, and very strong preference),
denoted by $P_i^{h}$: for each $g_i \in C^{N}$ and for every
$(x, y) \in A \times A$, $T_i = \{ P_i^{h},\ h \in H_i \}$, where $H_i$ is a particular
subset of the relative integers and:

• $x P_i^{h} y$, $h > 0$, means that object x is preferred to object
y by degree h with respect to criterion $g_i$.

• $x P_i^{h} y$, $h < 0$, means that object x is not preferred to
object y by degree h with respect to criterion $g_i$.

• $x P_i^{0} y$ means that object x is similar (asymmetrically
indifferent) to object y with respect to criterion $g_i$.

Within the preference context, the similarity relation
$P_i^{0}$, even if not symmetric, resembles the indifference
relation. Thus, in this case, we call this similarity
relation asymmetric indifference. Of course, for each
$g_i \in C$ and for every $(x, y) \in A \times A$,

$$[x P_i^{h} y,\ h > 0] \Rightarrow [y P_i^{k} x,\ k \leq 0],$$
$$[x P_i^{h} y,\ h < 0] \Rightarrow [y P_i^{k} x,\ k \geq 0].$$

Let $P = P^{N}$ and $P^{O} = \emptyset$. Given $P \subseteq C$ ($P \neq \emptyset$),
$(x, y), (w, z) \in A \times A$, the pair of objects $(x, y)$ is said
to dominate $(w, z)$ with respect to criteria from P (denoted
by $(x, y) D_P (w, z)$), if x is preferred to y at least
as strongly as w is preferred to z with respect to each
$g_i \in P$. More precisely, at least as strongly as means by
at least the same degree, i.e., $h \geq k$, where $h, k \in H_i$,
$x P_i^{h} y$, and $w P_i^{k} z$, for each $g_i \in P$.

Let $D_{\{i\}}$ be the dominance relation confined to
the single criterion $g_i \in P$. The binary relation $D_{\{i\}}$
is reflexive ($(x, y) D_{\{i\}} (x, y)$ for every $(x, y) \in A \times A$),
transitive ($(x, y) D_{\{i\}} (w, z)$ and $(w, z) D_{\{i\}} (u, v)$ imply
$(x, y) D_{\{i\}} (u, v)$ for every $(x, y), (w, z), (u, v) \in A \times A$),
and complete ($(x, y) D_{\{i\}} (w, z)$ and/or $(w, z) D_{\{i\}} (x, y)$
for all $(x, y), (w, z) \in A \times A$). Therefore, $D_{\{i\}}$ is a complete
preorder on $A \times A$. Since the intersection of
complete preorders is a partial preorder, and
$D_P = \bigcap_{g_i \in P} D_{\{i\}}$, $P \subseteq C$, the dominance relation $D_P$ is a partial
preorder on $A \times A$.

Let $R \subseteq P \subseteq C$ and $(x, y), (u, v) \in A \times A$; then the
following implication holds

$$(x, y) D_P (u, v) \Rightarrow (x, y) D_R (u, v).$$
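Given $P \subseteq C$ and $(x, y) \in A \times A$, we define the following:

• A set of pairs of objects dominating $(x, y)$, called the
P-dominating set, denoted by $D_P^{+}(x, y)$ and defined
as $\{ (w, z) \in A \times A : (w, z) D_P (x, y) \}$.

• A set of pairs of objects dominated by $(x, y)$, called
the P-dominated set, denoted by $D_P^{-}(x, y)$ and defined
as $\{ (w, z) \in A \times A : (x, y) D_P (w, z) \}$.

The P-dominating sets and the P-dominated sets
defined on B for all pairs of reference objects from B
are granules of knowledge that can be used to express
P-lower and P-upper approximations of the comprehensive
outranking relations $S$ and $S^{c}$, respectively,

$$\underline{P}(S) = \{ (x, y) \in B : D_P^{+}(x, y) \subseteq S \},$$

$$\overline{P}(S) = \bigcup_{(x, y) \in S} D_P^{+}(x, y),$$

$$\underline{P}(S^{c}) = \{ (x, y) \in B : D_P^{-}(x, y) \subseteq S^{c} \},$$

$$\overline{P}(S^{c}) = \bigcup_{(x, y) \in S^{c}} D_P^{-}(x, y).$$

It was proved in [22.7] that

$$\underline{P}(S) \subseteq S \subseteq \overline{P}(S), \qquad \underline{P}(S^{c}) \subseteq S^{c} \subseteq \overline{P}(S^{c}).$$

Furthermore, the following complementarity properties
hold

$$\underline{P}(S) = B - \overline{P}(S^{c}), \qquad \overline{P}(S) = B - \underline{P}(S^{c}),$$
$$\underline{P}(S^{c}) = B - \overline{P}(S), \qquad \overline{P}(S^{c}) = B - \underline{P}(S).$$

The P-boundaries (P-doubtful regions) of $S$ and $S^{c}$
are defined as

$$Bn_P(S) = \overline{P}(S) - \underline{P}(S),$$
$$Bn_P(S^{c}) = \overline{P}(S^{c}) - \underline{P}(S^{c}).$$

From the above it follows that $Bn_P(S) = Bn_P(S^{c})$.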


The concepts of the quality of approximation,
reducts, and core can also be extended to the approximation
of the outranking relation by the multi-graded
dominance relations.

In particular, the coefficient

$$\gamma_P = \frac{|\underline{P}(S) \cup \underline{P}(S^{c})|}{|B|}$$

defines the quality of approximation of $S$ and $S^{c}$ by
$P \subseteq C$. It expresses the ratio of all pairs of reference
objects $(x, y) \in B$ correctly assigned to $S$ and $S^{c}$ by the
set P of criteria to all the pairs of objects contained in
B. Each minimal subset $P \subseteq C$, such that $\gamma_P = \gamma_C$, is
called a reduct of C (denoted by $RED_{S_{PCT}}$). Note that
$S_{PCT}$ can have more than one reduct. The intersection of
all reducts is called the core (denoted by $CORE_{S_{PCT}}$).
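
A small Python sketch can make the approximations of $S$ and $S^{c}$ concrete. The pairwise comparison table below is hypothetical: each row stores the preference degrees of a pair on two cardinal criteria, and dominance between pairs is checked degree by degree. Dominating and dominated sets are restricted to B, as in the text.

```python
from typing import Dict, Set, Tuple

# Hypothetical pairwise comparison table B: pair label -> (preference degrees
# h on two cardinal criteria, comprehensive relation "S" or "Sc").
PCT: Dict[str, Tuple[Tuple[int, ...], str]] = {
    "p1": ((2, 1), "S"),
    "p2": ((1, 1), "Sc"),
    "p3": ((1, 0), "S"),   # dominated by p2, which is in Sc: inconsistent
    "p4": ((-1, 0), "Sc"),
    "p5": ((3, 2), "S"),
}
B: Set[str] = set(PCT)
S: Set[str] = {p for p in B if PCT[p][1] == "S"}
SC: Set[str] = B - S


def dominates(p: str, q: str) -> bool:
    """Pair p dominates pair q if its degree is at least as high on each criterion."""
    return all(h >= k for h, k in zip(PCT[p][0], PCT[q][0]))


def d_plus(p: str) -> Set[str]:
    """Pairs from B that dominate p."""
    return {q for q in B if dominates(q, p)}


def d_minus(p: str) -> Set[str]:
    """Pairs from B that are dominated by p."""
    return {q for q in B if dominates(p, q)}


lower_S = {p for p in B if d_plus(p) <= S}
upper_S = set().union(*(d_plus(p) for p in S))
lower_SC = {p for p in B if d_minus(p) <= SC}
upper_SC = set().union(*(d_minus(p) for p in SC))

boundary_S = upper_S - lower_S
boundary_SC = upper_SC - lower_SC
quality = len(lower_S | lower_SC) / len(B)
```

In this example the pairs p2 and p3 contradict each other, so both fall into the common boundary and the quality of approximation is 3/5.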

It is also possible to use the variable consistency
model on $S_{PCT}$ [22.52], if one is aware that some of
the pairs in the positive or negative dominance sets belong
to the opposite relation, while at least $l \times 100\%$ of
pairs belong to the correct one. Then the definition of
the lower approximations of $S$ and $S^{c}$ boils down to

$$\underline{P}^{l}(S) = \left\{ (x, y) \in B : \frac{|D_P^{+}(x, y) \cap S|}{|D_P^{+}(x, y)|} \geq l \right\},$$

$$\underline{P}^{l}(S^{c}) = \left\{ (x, y) \in B : \frac{|D_P^{-}(x, y) \cap S^{c}|}{|D_P^{-}(x, y)|} \geq l \right\}.$$

Dominance Without Degrees of Preference 
The degree of graded preference considered above is
defined on a cardinal scale of the strength of preference.
However, in many real-world problems, the existence of
such a quantitative scale is rather questionable. This is
the case with ordinal scales of criteria. In this case, the
dominance relation is defined directly on evaluations
$g_i(x)$ for all objects $x \in A$. Let us explain this latter case
in more detail.

Let $P = P^O$ and $P^N = \emptyset$; then, given $(x,y), (w,z) \in A \times A$,
the pair $(x,y)$ is said to dominate the pair
$(w,z)$ with respect to criteria from $P$ (denoted by
$(x,y) D_P (w,z)$) if, for each $g_i \in P$, $g_i(x) \ge g_i(w)$ and
$g_i(z) \ge g_i(y)$.

Let $D_{\{i\}}$ be the dominance relation confined to the
single criterion $g_i \in P^O$. The binary relation $D_{\{i\}}$ is
reflexive and transitive, but not complete (it is possible that
not $(x,y) D_{\{i\}} (w,z)$ and not $(w,z) D_{\{i\}} (x,y)$ for some
$(x,y), (w,z) \in A \times A$). Therefore, $D_{\{i\}}$ is a partial preorder.
Since the intersection of partial preorders is also
a partial preorder and $D_P = \bigcap_{g_i \in P} D_{\{i\}}$, $P = P^O$, the
dominance relation $D_P$ is a partial preorder.

If some criteria from $P \subseteq C$ express preferences on
a quantitative or a numerical non-quantitative scale and
others on an ordinal scale, i.e., if $P^N \neq \emptyset$ and $P^O \neq \emptyset$,
then, given $(x,y), (w,z) \in A \times A$, the pair $(x,y)$ is
said to dominate the pair $(w,z)$ with respect to criteria
from $P$ if $(x,y)$ dominates $(w,z)$ with respect to both
$P^N$ and $P^O$. Since the dominance relation with respect
to $P^N$ is a partial preorder on $A \times A$ (because it is a
multi-graded dominance) and the dominance with respect to
$P^O$ is also a partial preorder on $A \times A$ (as explained
above), the dominance $D_P$, being the intersection
of these two dominance relations, is a partial preorder.
In consequence, all the concepts introduced in the previous
section can be restored using this specific definition
of dominance.
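
The ordinal definition of dominance between pairs can be checked mechanically. The following Python sketch is only an illustration: the objects a, b, c, the criteria names g1 and g2, and all evaluations are hypothetical and do not come from the chapter. Under these assumptions it tests the dominance of one pair over another and verifies, on this toy data, that the relation is reflexive and transitive but not complete.

```python
from itertools import product

def pair_dominates(xy, wz, evals, criteria):
    """True if pair (x, y) dominates pair (w, z) on every ordinal criterion,
    i.e., g_i(x) >= g_i(w) and g_i(z) >= g_i(y) for each g_i."""
    (x, y), (w, z) = xy, wz
    return all(
        evals[x][g] >= evals[w][g] and evals[z][g] >= evals[y][g]
        for g in criteria
    )

# Hypothetical evaluations g_i(x) on two gain-type ordinal criteria.
evals = {
    "a": {"g1": 3, "g2": 2},
    "b": {"g1": 1, "g2": 1},
    "c": {"g1": 2, "g2": 2},
}
criteria = ["g1", "g2"]
pairs = list(product(evals, repeat=2))  # all ordered pairs of objects

# Reflexivity: every pair dominates itself.
reflexive = all(pair_dominates(p, p, evals, criteria) for p in pairs)

# Transitivity, checked over all triples of pairs.
transitive = all(
    pair_dominates(p, r, evals, criteria)
    for p, q, r in product(pairs, repeat=3)
    if pair_dominates(p, q, evals, criteria)
    and pair_dominates(q, r, evals, criteria)
)

# Non-completeness: pairs that are incomparable in both directions.
incomparable = [
    (p, q)
    for p, q in product(pairs, repeat=2)
    if not pair_dominates(p, q, evals, criteria)
    and not pair_dominates(q, p, evals, criteria)
]

print(reflexive, transitive, len(incomparable) > 0)
```

For instance, the pairs (a, a) and (b, b) are incomparable here, which is exactly the situation that makes the relation a partial rather than a complete preorder.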


22.4.4 Induction of Decision Rules from Rough Approximations of Preference Relations


Using the rough approximations of preference relations
$S$ and $S^c$ defined in Sect. 22.4.3, it is possible
to induce a generalized description of the preference
information contained in a given $S_{PCT}$ in terms of suitable
decision rules. The syntax of these rules involves
the concept of upward cumulated preferences (denoted
by $P_i^{\ge h}$) and downward cumulated preferences (denoted
by $P_i^{\le h}$), with the following interpretation:

• $x P_i^{\ge h} y$ means
  x is preferred to y with respect to $g_i$ by at least degree h,

• $x P_i^{\le h} y$ means
  x is preferred to y with respect to $g_i$ by at most degree h.

The exact definition of the cumulated preferences,
for each $(x,y) \in A \times A$, $g_i \in C^N$, and $h \in H_i$, can be
represented as follows:

• $x P_i^{\ge h} y$ if $x P_i^k y$, where $k \in H_i$ and $k \ge h$;
• $x P_i^{\le h} y$ if $x P_i^k y$, where $k \in H_i$ and $k \le h$.

Let also $G_i = \{g_i(x), x \in A\}$, $g_i \in C^O$. The decision
rules have then the following syntax:


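1. $D_{\ge}$-decision rules:

   If $x P_{i1}^{\ge h(i1)} y$ and ... $x P_{ie}^{\ge h(ie)} y$ and
   $g_{ie+1}(x) \ge r_{ie+1}$ and $g_{ie+1}(y) \le s_{ie+1}$ and ...
   $g_{ip}(x) \ge r_{ip}$ and $g_{ip}(y) \le s_{ip}$, then $x S y$,

   where
   $P = \{g_{i1}, \ldots, g_{ip}\} \subseteq C$,
   $P^N = \{g_{i1}, \ldots, g_{ie}\}$,
   $P^O = \{g_{ie+1}, \ldots, g_{ip}\}$,
   $(h(i1), \ldots, h(ie)) \in H_{i1} \times \cdots \times H_{ie}$
   and
   $(r_{ie+1}, \ldots, r_{ip}), (s_{ie+1}, \ldots, s_{ip}) \in G_{ie+1} \times \cdots \times G_{ip}$.

   These rules are supported by pairs of objects from
   the P-lower approximation of $S$ only.

2. $D_{\le}$-decision rules:

   If $x P_{i1}^{\le h(i1)} y$ and ... $x P_{ie}^{\le h(ie)} y$ and
   $g_{ie+1}(x) \le r_{ie+1}$ and $g_{ie+1}(y) \ge s_{ie+1}$ and ...
   $g_{ip}(x) \le r_{ip}$ and $g_{ip}(y) \ge s_{ip}$, then $x S^c y$,

   where
   $P = \{g_{i1}, \ldots, g_{ip}\} \subseteq C$,
   $P^N = \{g_{i1}, \ldots, g_{ie}\}$,
   $P^O = \{g_{ie+1}, \ldots, g_{ip}\}$,
   $(h(i1), \ldots, h(ie)) \in H_{i1} \times \cdots \times H_{ie}$
   and
   $(r_{ie+1}, \ldots, r_{ip}), (s_{ie+1}, \ldots, s_{ip}) \in G_{ie+1} \times \cdots \times G_{ip}$.

   These rules are supported by pairs of objects from
   the P-lower approximation of $S^c$ only.

3. $D_{\ge\le}$-decision rules:

   If $x P_{i1}^{\ge h(i1)} y$ and ... $x P_{ie}^{\ge h(ie)} y$ and
   $x P_{ie+1}^{\le h(ie+1)} y$ and ... $x P_{if}^{\le h(if)} y$ and
   $g_{if+1}(x) \ge r_{if+1}$ and $g_{if+1}(y) \le s_{if+1}$ and ...
   $g_{ig}(x) \ge r_{ig}$ and $g_{ig}(y) \le s_{ig}$ and
   $g_{ig+1}(x) \le r_{ig+1}$ and $g_{ig+1}(y) \ge s_{ig+1}$ and ...
   $g_{ip}(x) \le r_{ip}$ and $g_{ip}(y) \ge s_{ip}$, then $x S y$ or $x S^c y$,

   where
   $O' = \{g_{i1}, \ldots, g_{ie}\} \subseteq C$,
   $O'' = \{g_{ie+1}, \ldots, g_{if}\} \subseteq C$,
   $P^N = O' \cup O''$,
   $O'$ and $O''$ are not necessarily disjoint,
   $P^O = \{g_{if+1}, \ldots, g_{ip}\}$,
   $(h(i1), \ldots, h(if)) \in H_{i1} \times \cdots \times H_{if}$,
   $(r_{if+1}, \ldots, r_{ip}), (s_{if+1}, \ldots, s_{ip}) \in G_{if+1} \times \cdots \times G_{ip}$.

   These rules are supported by pairs of objects from
   the P-boundary of $S$ and $S^c$ only.


22.4.5 Application of Decision Rules to Multi-Criteria Choice and Ranking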


The decision rules induced from a given $S_{PCT}$ describe
the comprehensive preference relations $S$ and $S^c$ either
exactly ($D_{\ge}$- and $D_{\le}$-decision rules) or approximately
($D_{\ge\le}$-decision rules). A set of these rules covering
all pairs of $S_{PCT}$ represents a preference model of the
DM who gave the preference information in terms of
pairwise comparison of reference objects. The application
of these decision rules on a new subset $M \subseteq A$
of objects induces a specific preference structure
on $M$.

In fact, any pair of objects $(u,v) \in M \times M$ can match
the decision rules in one of four ways:

• At least one $D_{\ge}$-decision rule and neither $D_{\le}$- nor
  $D_{\ge\le}$-decision rules.

• At least one $D_{\le}$-decision rule and neither $D_{\ge}$- nor
  $D_{\ge\le}$-decision rules.

• At least one $D_{\ge}$-decision rule and at least one
  $D_{\le}$-decision rule, or at least one $D_{\ge\le}$-decision rule, or
  at least one $D_{\ge\le}$-decision rule and at least one $D_{\ge}$-
  and/or at least one $D_{\le}$-decision rule.

• No decision rule.

These four ways correspond to the following four
situations of outranking, respectively:

• $u S v$ and not $u S^c v$, i.e., true outranking (denoted by
  $u S^T v$).

• $u S^c v$ and not $u S v$, i.e., false outranking (denoted by
  $u S^F v$).

• $u S v$ and $u S^c v$, i.e., contradictory outranking
  (denoted by $u S^K v$).

• not $u S v$ and not $u S^c v$, i.e., unknown outranking
  (denoted by $u S^U v$).

The above four situations, which together constitute
the so-called four-valued outranking [22.53], have been
introduced to underline the presence and absence of
positive and negative reasons for the outranking. Moreover,
they make it possible to distinguish contradictory
situations from unknown ones.
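
The mapping from matched rule types to the four outranking situations can be written as a small function. This is only a sketch: the string labels ">=", "<=" and ">=<=" are hypothetical encodings of the three rule types, chosen here for illustration.

```python
def four_valued_outranking(matched_rule_types):
    """Map the kinds of rules matched by a pair (u, v) to an outranking situation.

    matched_rule_types: iterable over the labels ">=" (exact rules concluding S),
    "<=" (exact rules concluding S^c) and ">=<=" (approximate rules).
    The labels are an assumption of this sketch.
    """
    matched = set(matched_rule_types)
    if not matched:
        return "unknown"        # no rule matches: neither S nor S^c is affirmed
    if matched == {">="}:
        return "true"           # only S is affirmed
    if matched == {"<="}:
        return "false"          # only S^c is affirmed
    return "contradictory"      # both relations are affirmed in some way

print(four_valued_outranking({">=", "<="}))
```

Any match that involves an approximate rule, alone or together with exact rules, falls into the contradictory case, as in the third way of matching listed above.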


A final recommendation (optimal choice or ranking)
can be obtained upon suitable exploitation of this
structure, i.e., of the presence and the absence of
outranking $S$ and $S^c$ on $M$. A possible exploitation
procedure consists of calculating a specific score, called the
net flow score, for each object $x \in M$

$$S_{nf}(x) = S^{++}(x) - S^{+-}(x) + S^{-+}(x) - S^{--}(x),$$

where

• $S^{++}(x) = |\{y \in M$: there is at least one decision
  rule which affirms $x S y\}|$;

• $S^{+-}(x) = |\{y \in M$: there is at least one decision
  rule which affirms $y S x\}|$;

• $S^{-+}(x) = |\{y \in M$: there is at least one decision
  rule which affirms $y S^c x\}|$;

• $S^{--}(x) = |\{y \in M$: there is at least one decision rule
  which affirms $x S^c y\}|$.

The recommendation in ranking problems consists
of the total preorder determined by $S_{nf}(x)$ on $M$. In
choice problems, it consists of the object(s) $x^* \in M$
such that

$$S_{nf}(x^*) = \max_{x \in M} \{S_{nf}(x)\}.$$
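
A short Python sketch may help to see how the net flow score is computed. It assumes that rule matching has already been done and is summarized by two hypothetical sets of ordered pairs, `affirms_S` and `affirms_Sc`; the objects and pairs below are invented for illustration only.

```python
def net_flow_scores(objects, affirms_S, affirms_Sc):
    """Net flow score S_nf(x) = S++(x) - S+-(x) + S-+(x) - S--(x) for each x.

    affirms_S  : set of ordered pairs (x, y) for which some rule affirms x S y
    affirms_Sc : set of ordered pairs (x, y) for which some rule affirms x S^c y
    """
    scores = {}
    for x in objects:
        s_pp = sum((x, y) in affirms_S for y in objects)    # x S y
        s_pm = sum((y, x) in affirms_S for y in objects)    # y S x
        s_mp = sum((y, x) in affirms_Sc for y in objects)   # y S^c x
        s_mm = sum((x, y) in affirms_Sc for y in objects)   # x S^c y
        scores[x] = s_pp - s_pm + s_mp - s_mm
    return scores

# Hypothetical rule matches on M = {a, b, c}.
M = ["a", "b", "c"]
affirms_S = {("a", "b"), ("a", "c"), ("b", "c")}
affirms_Sc = {("c", "a"), ("b", "a")}

scores = net_flow_scores(M, affirms_S, affirms_Sc)
ranking = sorted(M, key=lambda x: scores[x], reverse=True)   # ranking problem
best = [x for x in M if scores[x] == max(scores.values())]   # choice problem
print(scores, ranking, best)
```

Objects with equal scores are tied in the resulting preorder; the `sorted` call above simply lists tied objects in their input order.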


The above procedure has been characterized with
reference to a number of desirable properties in [22.53,
54]. A computer implementation of the whole approach,
called jRank (ranking generator using DRSA),
is publicly available [22.55].

Recently, Fortemps et al. [22.56] extended DRSA
to multi-criteria choice and ranking on multi-graded
preference relations, instead of uni-graded relations $S$
and $S^c$.

It is also worth mentioning that there is a machine
learning approach to multi-criteria choice and ranking
using ensembles of decision rules. The approach presented
by Dembczyński et al. [22.57] makes a bridge
between stochastic methods of preference learning and
DRSA for choice and ranking. Examples illustrating the
application of DRSA to multi-criteria choice and ranking
in a didactic way can be found in [22.12–14, 50,
54].


22.5 Important Extensions of DRSA


The existing literature describes many extensions of
DRSA that make it a useful tool for other practical
applications. These extensions are:

• DRSA to decision under risk and uncertainty [22.58];
• DRSA to decision under uncertainty and time preference [22.59];
• DRSA handling missing data [22.60, 61];
• DRSA for imprecise object evaluations and assignments [22.62];
• Dominance-based approach to induction of association rules [22.63];
• Fuzzy-rough hybridization of DRSA [22.8, 64–67];
• DRSA as a way of operator-free fuzzy-rough hybridization [22.28, 67, 68];
• DRSA to granular computing [22.36, 37];
• DRSA to case-based reasoning [22.69, 70];
• DRSA for hierarchical structure of evaluation criteria [22.71];
• DRSA to decision involving multiple decision makers [22.72, 73];
• DRSA to interactive multi-objective optimization [22.74];
• DRSA to interactive evolutionary multi-objective
  optimization under risk and uncertainty [22.75, 76].


It is worth stressing that dealing with ordinal data
and monotonicity constraints also makes sense in general
classification problems, where the notion of preference
has no meaning. Even when the ordering seems
irrelevant, the presence or the absence of a property
has an ordinal interpretation. If two properties are related,
then either the presence or the absence of
one property should make the presence of the other
property more (or less) probable. A formal proof showing
that the IRSA is a particular case of the DRSA was
given in [22.28]. With this in mind, DRSA can be seen
as a general framework for analysis of classification
data. Although it was designed for ordinal classification
problems with monotonicity constraints, DRSA can be
used to solve a general classification problem where no
additional information about ordering is taken into account.

The idea behind this claim is the following [22.77].
We assume, without loss of generality, that the value
sets of all regular attributes are number-coded. While
this is natural for numerical attributes, categorical attributes
must get numerical codes for categories. In
this way, the value sets of all regular attributes become
ordered (as all sets of numbers are ordered).
Now, to analyze a non-ordinal classification problem
using DRSA, we transform the decision table such that
each regular attribute is cloned (doubled). It is assumed
that the value set of each original attribute is ordered
with respect to increasing preference (gain-type criterion),
and the value set of its clone is ordered with
respect to decreasing preference (cost-type criterion).
Using DRSA, for each $t \in \{1, \ldots, n\}$, we approximate
two sets of objects from the decision table: class $Cl_t$
and its complement $\neg Cl_t$. Obviously, we can calculate
dominance-based rough approximations of the two
sets. Moreover, they can serve to induce "if ..., then
..." decision rules recommending assignment to class
$Cl_t$ or to its complement $\neg Cl_t$. In this way, we reformulate
the original non-ordinal classification problem
as an ordinal classification problem with monotonicity
constraints. Due to cloning of attributes with opposite
preference orders, we can have rules that cover a subspace
in the condition space that is bounded from the
top and from the bottom; this leads (without discretization)
to more synthetic rules than those resulting from
the IRSA.
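
The cloning transformation can be sketched in a few lines of Python. Everything named here is an assumption made for illustration: the suffixes `_gain` and `_cost`, the toy table, and the helper `dominates` are not taken from [22.77]. The sketch shows that, once every attribute appears both as a gain and as a cost criterion, one object dominates another on all criteria only if the two objects are equal on all original attributes, while a rule that uses both copies of an attribute bounds its value from below and from above.

```python
def clone_attributes(table, attributes):
    """Duplicate every number-coded attribute: the original copy is treated as
    a gain-type criterion and the clone as a cost-type criterion."""
    cloned_rows = []
    for row in table:
        new_row = {}
        for a in attributes:
            new_row[a + "_gain"] = row[a]
            new_row[a + "_cost"] = row[a]
        cloned_rows.append(new_row)
    preference = {}
    for a in attributes:
        preference[a + "_gain"] = "gain"
        preference[a + "_cost"] = "cost"
    return cloned_rows, preference

def dominates(x, y, preference):
    """x dominates y if x is at least as good as y on every criterion."""
    return all(
        x[c] >= y[c] if kind == "gain" else x[c] <= y[c]
        for c, kind in preference.items()
    )

# Hypothetical number-coded decision table (condition attributes only).
table = [{"a": 1, "b": 2}, {"a": 1, "b": 2}, {"a": 2, "b": 2}]
cloned, preference = clone_attributes(table, ["a", "b"])

# With both copies present, dominance holds only between identical rows.
print(dominates(cloned[0], cloned[1], preference))   # identical rows
print(dominates(cloned[2], cloned[0], preference))   # rows differ on attribute a

# A rule condition using both copies selects an interval of values of a.
in_interval = [r for r in cloned if r["a_gain"] >= 1 and r["a_cost"] <= 1]
```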


22.6 DRSA to Operational Research Problems 


DRSA is also a useful instrument in the toolbox of operational
research (OR). DRSA has been adapted to solve
the following OR problems:

• Interactive multi-objective optimization [22.74] applied
  to OR problems, such as portfolio management,
  project scheduling, and production planning.

• Interactive evolutionary multi-objective optimization
  under risk and uncertainty [22.75, 76].

• Decision under uncertainty and time preference [22.59],
  which is useful for dealing with many
  OR problems where uncertainty of outcomes and
  their distribution over time play a fundamental role,
  such as portfolio selection [22.78], scheduling with
  time-resource interactions, and inventory management.

• Global investment risk analysis on partially missing
  data [22.79].

• Explanation of recommendations following from
  robust ordinal regression applied to multi-criteria
  ranking problems in terms of rules [22.80].


22.7 Concluding Remarks on DRSA Applied to Multi-Criteria Decision Problems


Let us point out the main features of the methodology
described:

• The input data set describing a given decision situation
  is the preference information elicited by the
  DM in terms of exemplary decisions (class assignments
  or pairwise comparisons of some objects).

• The rough set analysis of preference information
  using DRSA supplies some useful elements
  of knowledge about the decision situation. These
  are: the relevance of particular criteria, information
  about their interaction, minimal subsets of criteria
  (reducts) conveying important knowledge contained
  in the exemplary decisions, and the set of the
  non-reducible criteria (core).

• The methodology presented is based on elementary
  concepts and mathematical tools (sets and set operations,
  binary relations), without recourse to any
  complex algebraic or analytical structures; the main
  idea is very natural and the key concept of dominance
  is rational and objective.

• DRSA structures the input data prior to induction
  of decision rules. The structuring takes into account
  inconsistencies of the preference information
  with respect to the dominance principle. Due to the
  structuring, the induced decision rules are certain or
  possible, depending on whether they are induced from
  lower or upper approximations (of unions of classes
  or preference relations), respectively.

• The preference model induced from the rough approximations
  defined on the preference information
  is expressed in the natural and comprehensible lan-


23. Rule Induction from Rough Approximations


Jerzy W. Grzymala-Busse


Rule induction is an important technique in data
mining or machine learning. Knowledge is frequently
expressed by rules in many areas of
artificial intelligence (AI), including rule-based
expert systems. In this chapter we discuss only supervised
learning in which all cases of the input
data set are pre-classified by an expert.


23.1 Complete and Consistent Data


Our basic assumption is that the data sets are presented
as decision tables. An example of a decision
table is presented in Table 23.1. Rows of the decision
table represent cases and columns represent
variables. The set of all cases is denoted by U. In Table 23.1,
U = {1, 2, 3, 4, 5, 6, 7, 8}. Some variables are
called attributes while one selected variable is called
a decision and is denoted by d. The set of all attributes
will be denoted by A. In Table 23.1,
A = {Wind, Humidity, Temperature} and d = Trip. For an
attribute a and case x, a(x) denotes the value of the
attribute a for case x. For example, Wind(1) = low.

Let B be a subset of the set A of all attributes. Complete
data sets are characterized by the indiscernibility
relation IND(B) [23.1, 2] defined as follows: for any
$x, y \in U$,

$$(x, y) \in IND(B) \text{ if and only if } a(x) = a(y) \text{ for any } a \in B. \qquad (23.1)$$

Obviously, IND(B) is an equivalence relation. The
equivalence class of IND(B) containing $x \in U$ will be
denoted by $[x]_B$ and called a B-elementary set.
A-elementary sets will be called elementary. Any union
of B-elementary sets will be called a B-definable set.
By analogy, an A-definable set will be called definable.
The elementary sets of the partition $\{d\}^*$ are called concepts.
In Table 23.1, the concepts are {1, 2, 3}, {4, 5},
and {6, 7, 8}. The set of all equivalence classes $[x]_B$,
where $x \in U$, is a partition on U denoted by $B^*$. For
Table 23.1, $A^*$ = {{1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}}.
All members of $A^*$ are elementary sets.

We will quote some definitions from [23.3]. A rule
r is an expression of the following form

$$(a_1, v_1) \,\&\, (a_2, v_2) \,\&\, \ldots \,\&\, (a_k, v_k) \rightarrow (d, w), \qquad (23.2)$$


where $a_1, a_2, \ldots, a_k$ are distinct attributes, d is a decision,
$v_1, v_2, \ldots, v_k$ are respective attribute values, and w
is a decision value.


Table 23.1 A complete and consistent decision table

        Attributes                            Decision
Case    Wind      Humidity    Temperature     Trip
1       low       low         medium          yes
2       low       low         low             yes
3       low       medium      medium          yes
4       low       medium      high            maybe
5       medium    low         medium          maybe
6       medium    high        low             no
7       high      high        high            no
8       medium    high        high            no

A case x is covered by a rule r if and only if any
attribute-value pair of r is satisfied by the corresponding
value of x. For example, case 1 from Table 23.1 is
covered by the following rule r:

(Wind, low) & (Humidity, low) → (Trip, yes).

The concept C defined by rule r is indicated by r. The
above rule r indicates concept {1, 2, 3}.

A rule r is consistent with the data set if and only
if for any case x covered by r, x is a member of the
concept indicated by r. The above rule is consistent with
the data set represented by Table 23.1. A rule set R is
consistent with the data set if and only if for any $r \in R$,
r is consistent with the data set. The rule set containing
the above rule is consistent with the data set represented
by Table 23.1.

We say that a concept C is completely covered by
a rule set R if and only if for every case x from C there
exists a rule r from R such that r covers x. For example,
the single rule

(Wind, low) → (Trip, yes)

completely covers the concept {1, 2, 3}. On the other
hand, this rule is not consistent with the data set represented
by Table 23.1. A rule set R is complete for a data
set if and only if every concept from the data set is completely
covered by R.

In this chapter we will discuss how to induce rule
sets that are complete and consistent with the data set.
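
The notions of covering, consistency, and complete covering can be checked directly on Table 23.1. The Python sketch below is only an illustration; the encoding of a rule as a pair (list of conditions, decision pair) and the function names are choices made here, not part of LERS.

```python
COLS = ("Wind", "Humidity", "Temperature", "Trip")
ROWS = [  # Table 23.1
    (1, "low", "low", "medium", "yes"),
    (2, "low", "low", "low", "yes"),
    (3, "low", "medium", "medium", "yes"),
    (4, "low", "medium", "high", "maybe"),
    (5, "medium", "low", "medium", "maybe"),
    (6, "medium", "high", "low", "no"),
    (7, "high", "high", "high", "no"),
    (8, "medium", "high", "high", "no"),
]
TABLE = {case: dict(zip(COLS, values)) for case, *values in ROWS}

def covers(rule, row):
    """A rule covers a case if every attribute-value pair of the rule matches."""
    conditions, _ = rule
    return all(row[a] == v for a, v in conditions)

def covered_cases(rule, table):
    return {x for x, row in table.items() if covers(rule, row)}

def is_consistent(rule, table):
    """Every covered case must belong to the concept indicated by the rule."""
    _, (d, w) = rule
    return all(table[x][d] == w for x in covered_cases(rule, table))

def completely_covers(rules, concept, table):
    """Every case of the concept is covered by at least one rule."""
    return all(any(covers(r, table[x]) for r in rules) for x in concept)

r1 = ([("Wind", "low"), ("Humidity", "low")], ("Trip", "yes"))
r2 = ([("Wind", "low")], ("Trip", "yes"))

print(covered_cases(r1, TABLE), is_consistent(r1, TABLE))
print(covered_cases(r2, TABLE), is_consistent(r2, TABLE))
print(completely_covers([r2], {1, 2, 3}, TABLE))
```

The first rule is consistent but covers only cases 1 and 2; the second covers the whole concept {1, 2, 3} but also case 4, so it is not consistent.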


23.1.1 Global Coverings 


The simplest approach to rule induction is based on
finding the smallest subset B of the set A of all attributes
that is sufficient to be used in a rule set. Such
a reduction of the attribute set is one of the main and
frequently used techniques in rough set theory [23.1, 2, 4].
This approach is also called feature selection. In Table 23.1
the attribute Humidity is redundant (irrelevant).
The remaining two attributes (Wind and Temperature)
are sufficient to separate the three concepts: cases 1 and 3
agree on both attributes, but they belong to the same
concept. Let us make it more precise
using the fundamental definitions of rough set theory [23.1, 2, 4].

For a decision d we say that {d} depends on B if
and only if $B^* \le \{d\}^*$, i.e., for any B-elementary set X
there exists a concept C from $\{d\}^*$ such that $X \subseteq C$.
Note that for partitions $\pi$ and $\tau$ on U, if for any $X \in \pi$
there exists $Y \in \tau$ such that $X \subseteq Y$, then we say that $\pi$
is smaller than or equal to $\tau$ and denote it by $\pi \le \tau$.
A global covering (or relative reduct) of {d} is a subset
B of A such that {d} depends on B and B is minimal in
A. The algorithm to compute a single global covering is
presented below.


Algorithm 23.1 Algorithm to compute a single global covering

(input: the set A of all attributes, partition {d}* on U;
 output: a single global covering R);
begin
    compute partition A*;
    P := A;
    R := ∅;
    if A* ≤ {d}*
    then
    begin
        for each attribute a in A do
        begin
            Q := P − {a};
            compute partition Q*;
            if Q* ≤ {d}*
                then P := Q
        end {for}
        R := P
    end {then}
end {algorithm}.

Let us use this algorithm for Table 23.1. First,

$$A^* = \{\{1\}, \{2\}, \{3\}, \{4\}, \{5\}, \{6\}, \{7\}, \{8\}\} \le \{\text{Trip}\}^*.$$

Initially, P = A and Q = P − {Wind}, so
Q = {Humidity, Temperature}, and then we compute $Q^*$, where

$$Q^* = \{\{1, 5\}, \{2\}, \{3\}, \{4\}, \{6\}, \{7, 8\}\}.$$

We find that $Q^* \not\le \{\text{Trip}\}^*$. Thus, P = A. Next,
we try to delete Humidity from P. We obtain
Q = {Wind, Temperature} and then we compute $Q^*$,
where $Q^* = \{\{1, 3\}, \{2\}, \{4\}, \{5\}, \{6\}, \{7\}, \{8\}\}$. This
time $Q^* \le \{\text{Trip}\}^*$, so P = {Wind, Temperature}.
We still need to check Q = P − {Temperature}, Q =
{Wind} and $Q^* = \{\{1, 2, 3, 4\}, \{5, 6, 8\}, \{7\}\}$, and
$Q^* \not\le \{\text{Trip}\}^*$. Thus R = {Wind, Temperature} is a global
covering.
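
The computation of a single global covering can be reproduced with a short Python sketch. It follows the steps of the algorithm on Table 23.1; the function names are chosen here for illustration and are not taken from the LERS implementation.

```python
COLS = ("Wind", "Humidity", "Temperature", "Trip")
ROWS = [  # Table 23.1
    (1, "low", "low", "medium", "yes"),
    (2, "low", "low", "low", "yes"),
    (3, "low", "medium", "medium", "yes"),
    (4, "low", "medium", "high", "maybe"),
    (5, "medium", "low", "medium", "maybe"),
    (6, "medium", "high", "low", "no"),
    (7, "high", "high", "high", "no"),
    (8, "medium", "high", "high", "no"),
]
TABLE = {case: dict(zip(COLS, values)) for case, *values in ROWS}
ATTRS = ("Wind", "Humidity", "Temperature")

def partition(table, attrs):
    """Partition of the cases induced by equal values on attrs (B* in the text)."""
    blocks = {}
    for x, row in table.items():
        blocks.setdefault(tuple(row[a] for a in attrs), set()).add(x)
    return list(blocks.values())

def finer_or_equal(p, q):
    """p <= q: every block of p is contained in some block of q."""
    return all(any(X <= Y for Y in q) for X in p)

def global_covering(table, attrs, decision):
    """Try to remove attributes one at a time, keeping {d} dependent on P."""
    d_star = partition(table, [decision])
    if not finer_or_equal(partition(table, attrs), d_star):
        return []   # mirrors R := empty set for inconsistent data
    P = list(attrs)
    for a in attrs:
        Q = [b for b in P if b != a]
        if finer_or_equal(partition(table, Q), d_star):
            P = Q
    return P

print(global_covering(TABLE, ATTRS, "Trip"))
```

The result depends on the order in which attributes are tried, so a different attribute order may yield a different global covering on other data sets.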

For a given global covering, rules are induced by
examining cases of the data set. Initially, such a rule
contains all attributes from the global covering with
the corresponding attribute values; then a dropping
conditions technique is used: we try to drop one condition
(attribute-value pair) at a time, starting from the
leftmost condition, checking whether the rule is still
consistent with the data set, then we try to drop the next
condition, and so on. For example,

(Wind, low) & (Temperature, medium) → (Trip, yes)

is our first candidate for a rule. If we drop the first
condition, the above rule will be reduced to

(Temperature, medium) → (Trip, yes).

However, this rule covers case 5, so it is not consistent
with the data set represented by Table 23.1. By
dropping the second condition from the initial rule we
obtain

(Wind, low) → (Trip, yes),

but this rule is not consistent with the data represented
by Table 23.1 either, since it covers case 4, so we conclude
that the initial rule is the simplest possible. This
rule covers two cases: 1 and 3.

It is not difficult to check that the rule

(Wind, low) & (Temperature, low) → (Trip, yes)

is as simple as possible and that it covers only case 2.
Thus, the above two rules consistently and completely
cover the concept {1, 2, 3}.
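
The dropping conditions technique can be sketched as follows. This is a minimal illustration on Table 23.1, not the LEM1 code; the function name and the decision never to return a rule with an empty left-hand side are assumptions of the sketch.

```python
COLS = ("Wind", "Humidity", "Temperature", "Trip")
ROWS = [  # Table 23.1
    (1, "low", "low", "medium", "yes"),
    (2, "low", "low", "low", "yes"),
    (3, "low", "medium", "medium", "yes"),
    (4, "low", "medium", "high", "maybe"),
    (5, "medium", "low", "medium", "maybe"),
    (6, "medium", "high", "low", "no"),
    (7, "high", "high", "high", "no"),
    (8, "medium", "high", "high", "no"),
]
TABLE = {case: dict(zip(COLS, values)) for case, *values in ROWS}

def drop_conditions(conditions, decision, table):
    """Try to drop conditions from left to right while the rule stays consistent."""
    d, w = decision
    kept = list(conditions)
    for cond in list(conditions):
        trial = [c for c in kept if c != cond]
        covered = [row for row in table.values()
                   if all(row[a] == v for a, v in trial)]
        # Assumption: keep at least one condition on the left-hand side.
        if trial and all(row[d] == w for row in covered):
            kept = trial
    return kept

# Candidate rule built from case 1 and the global covering {Wind, Temperature}.
rule_case_1 = drop_conditions(
    [("Wind", "low"), ("Temperature", "medium")], ("Trip", "yes"), TABLE)
# Candidate rule built from case 2.
rule_case_2 = drop_conditions(
    [("Wind", "low"), ("Temperature", "low")], ("Trip", "yes"), TABLE)
print(rule_case_1, rule_case_2)
```

Neither condition can be dropped from either candidate, in agreement with the discussion above.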

The above algorithm is implemented as LEM1 
(Learning from Examples Module, version 1). It is 
a component of the data mining system LERS (Learn- 
ing from Examples Using Rough Sets). A similar sys- 
tem was described in [23.5]. 


23.1.2 Local Coverings


The LEM1 algorithm is based on calculus on partitions
on the entire universe U. Another approach to rule induction,
based on attribute-value pairs, is presented in
the LEM2 algorithm (Learning from Examples Module,
version 2), another component of LERS. We will quote
a few definitions from [23.6, 7].

For an attribute-value pair (a, v) = t, a block of t,
denoted by [t], is the set of all cases from U that
have value v for attribute a, i.e.,

$$[(a, v)] = \{x \mid a(x) = v\}. \qquad (23.3)$$

Let T be a set of attribute-value pairs. The block of
T, denoted by [T], is the following set

$$\bigcap_{t \in T} [t]. \qquad (23.4)$$

Let B be a subset of U. Set B depends on a set T
of attribute-value pairs t = (a, v) if and only if [T] is
nonempty and

$$[T] \subseteq B. \qquad (23.5)$$

Set T is a minimal complex of B if and only if B
depends on T and no proper subset T' of T exists such
that B depends on T'. Let $\mathcal{T}$ be a nonempty collection of
nonempty sets of attribute-value pairs. Then $\mathcal{T}$ is a local
covering of B if and only if the following conditions
are satisfied:

1. each member T of $\mathcal{T}$ is a minimal complex of B,
2. $\bigcup_{T \in \mathcal{T}} [T] = B$, and
3. $\mathcal{T}$ is minimal, i.e., $\mathcal{T}$ has the
   smallest possible number of members.

An algorithm for finding a single local covering,
called LEM2, is presented below. For a set X, |X| denotes
the cardinality of X.


Algorithm 23.2 LEM2

(input: a set B,
 output: a single local covering 𝒯 of set B);
begin
    G := B;
    𝒯 := ∅;
    while G ≠ ∅
    begin
        T := ∅;
        T(G) := {t | [t] ∩ G ≠ ∅};
        while T = ∅ or [T] ⊄ B
        begin
            select a pair t ∈ T(G) such that |[t] ∩ G| is
                maximum; if a tie occurs, select a pair
                t ∈ T(G) with the smallest cardinality of [t];
                if another tie occurs, select first pair;
            T := T ∪ {t};
            G := [t] ∩ G;
            T(G) := {t | [t] ∩ G ≠ ∅};
            T(G) := T(G) − T;
        end {while}
        for each t ∈ T do
            if [T − {t}] ⊆ B
                then T := T − {t};
        𝒯 := 𝒯 ∪ {T};
        G := B − ⋃_{T ∈ 𝒯} [T];
    end {while};
    for each T ∈ 𝒯 do
        if ⋃_{S ∈ 𝒯 − {T}} [S] = B
            then 𝒯 := 𝒯 − {T};
end {procedure}.


We will trace the LEM2 algorithm applied to the
following input set {1, 2, 3} = [(Trip, yes)]. The tracing
of LEM2 is presented in Tables 23.2 and 23.3. The
corresponding comments are:

1. The set G = {1, 2, 3}. The best attribute-value pair
   t, with the largest cardinality of the intersection
   of [t] and G (presented in the third column of
   Table 23.2), is (Wind, low). The corresponding entry
   in the third column of Table 23.2 is bulleted.
   However, [(Wind, low)] = {1, 2, 3, 4} ⊄ {1, 2, 3} =
   B, hence we need to look for the next t.

2. The set G is the same, G = {1, 2, 3}. There are
   two attribute-value pairs with |[t] ∩ G| = 2,
   (Humidity, low) and (Temperature, medium). Both
   have the same cardinality of [t], so we
   select the first (top) pair, (Humidity, low). This
   time {1, 2, 3, 4} ∩ {1, 2, 5} = {1, 2} ⊆ {1, 2, 3}, so
   {(Wind, low), (Humidity, low)} is the first element T
   of 𝒯.

3. The new set G = B − [T] = {1, 2, 3} − {1, 2} =
   {3}. The pair (Humidity, medium) has the
   smallest cardinality of [t], so it is the best
   choice. However, [(Humidity, medium)] = {3, 4} ⊄
   {1, 2, 3}, hence we need to look for the next t.

4. The pair (Temperature, medium) is the best
   choice, and {3, 4} ∩ {1, 3, 5} = {3} ⊆ {1, 2, 3}, so
   {(Humidity, medium), (Temperature, medium)} is
   the second element T of 𝒯.


Table 23.2 Computing a local covering for the concept [(Trip, yes)], part I

(a, v) = t               [(a, v)]        {1, 2, 3}      {1, 2, 3}
(Wind, low)              {1, 2, 3, 4}    {1, 2, 3} •    –
(Wind, medium)           {5, 6, 8}       –              –
(Wind, high)             {7}             –              –
(Humidity, low)          {1, 2, 5}       {1, 2}         {1, 2} •
(Humidity, medium)       {3, 4}          {3}            {3}
(Humidity, high)         {6, 7, 8}       –              –
(Temperature, low)       {2, 6}          {2}            {2}
(Temperature, medium)    {1, 3, 5}       {1, 3}         {1, 3}
(Temperature, high)      {4, 7, 8}       –              –
Comments                                 1              2


Thus,

𝒯 = {{(Wind, low), (Humidity, low)},
     {(Humidity, medium), (Temperature, medium)}}.

Therefore, the LEM2 algorithm induces the following
rule set

(Wind, low) & (Humidity, low) → (Trip, yes)
(Humidity, medium) & (Temperature, medium) → (Trip, yes).
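
The trace can be reproduced with a compact Python sketch of LEM2. It is an illustration under stated assumptions rather than the LERS implementation: attribute-value pairs are ordered by attribute and then by first occurrence in the table, which is what "select first pair" means here, and the input set B is assumed to be definable (otherwise no candidate pair may remain and `min` raises an error).

```python
COLS = ("Wind", "Humidity", "Temperature", "Trip")
ROWS = [  # Table 23.1
    (1, "low", "low", "medium", "yes"),
    (2, "low", "low", "low", "yes"),
    (3, "low", "medium", "medium", "yes"),
    (4, "low", "medium", "high", "maybe"),
    (5, "medium", "low", "medium", "maybe"),
    (6, "medium", "high", "low", "no"),
    (7, "high", "high", "high", "no"),
    (8, "medium", "high", "high", "no"),
]
TABLE = {case: dict(zip(COLS, values)) for case, *values in ROWS}
ATTRS = ("Wind", "Humidity", "Temperature")

def lem2(table, attrs, B):
    """Return a single local covering of B as a list of minimal complexes."""
    B = set(B)
    pairs = []                                  # attribute-value pairs, in order
    for a in attrs:
        for row in table.values():
            if (a, row[a]) not in pairs:
                pairs.append((a, row[a]))
    block = {t: {x for x, row in table.items() if row[t[0]] == t[1]}
             for t in pairs}

    def block_of(T):                            # [T] = intersection of [t]
        result = set(table)
        for t in T:
            result &= block[t]
        return result

    covering, G = [], set(B)
    while G:
        T = []
        candidates = [t for t in pairs if block[t] & G]
        while not T or not block_of(T) <= B:
            # largest |[t] & G|, then smallest |[t]|, then first in order
            t = min(candidates,
                    key=lambda t: (-len(block[t] & G), len(block[t])))
            T.append(t)
            G = block[t] & G
            candidates = [s for s in pairs if block[s] & G and s not in T]
        for t in list(T):                       # make T a minimal complex
            rest = [s for s in T if s != t]
            if rest and block_of(rest) <= B:
                T = rest
        covering.append(T)
        G = B - set().union(*(block_of(S) for S in covering))
    for T in list(covering):                    # drop redundant complexes
        others = [S for S in covering if S is not T]
        if others and set().union(*(block_of(S) for S in others)) == B:
            covering = others
    return covering

print(lem2(TABLE, ATTRS, {1, 2, 3}))
```

For the concept {1, 2, 3} the sketch returns the two minimal complexes derived in the trace.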


Rules induced from local coverings differ from
rules induced from global coverings. In many cases
the former are simpler than the latter. For example, for
Table 23.1 and the concept [(Trip, no)], the LEM2
algorithm would induce just one rule that covers all three
cases

(Humidity, high) → (Trip, no).


Table 23.3 Computing a local covering for the concept [(Trip, yes)], part II

(a, v) = t               [(a, v)]        {3}            {3}
(Wind, low)              {1, 2, 3, 4}    {3}            {3}
(Wind, medium)           {5, 6, 8}       –              –
(Wind, high)             {7}             –              –
(Humidity, low)          {1, 2, 5}       –              –
(Humidity, medium)       {3, 4}          {3} •          –
(Humidity, high)         {6, 7, 8}       –              –
(Temperature, low)       {2, 6}          –              –
(Temperature, medium)    {1, 3, 5}       {3}            {3} •
(Temperature, high)      {4, 7, 8}       –              –
Comments                                 3              4




On the other hand, the attribute Humidity is not included
in the global covering. The rules induced from
the global covering are

(Wind, medium) & (Temperature, low) → (Trip, no),
(Wind, high) → (Trip, no),
(Wind, medium) & (Temperature, high) → (Trip, no).


23.1.3 Classification


Rule sets, induced from data sets, are used most frequently
to classify new, unseen cases. A classification
system has two inputs: a rule set and a data set containing
new cases, and it classifies every case as being
a member of some concept. A classification system
used in LERS is a modification of the well-known
bucket brigade algorithm [23.7–9].

The decision to which concept a case belongs
is made on the basis of three factors: strength, specificity,
and support. These factors are defined as follows:
strength is the total number of cases correctly classified
by the rule during training. Specificity is the total
number of attribute-value pairs on the left-hand side of
the rule. The matching rules with a larger number of
attribute-value pairs are considered more specific. The
third factor, support, is defined as the sum of products
of strength and specificity for all matching rules indicating
the same concept. The concept C for which the
support, i.e., the following expression

$$\sum_{\text{matching rules } r \text{ describing } C} \text{Strength}(r) * \text{Specificity}(r) \qquad (23.6)$$

is the largest is the winner, and the case is classified as
being a member of C.

In the classification system of LERS, if complete
matching is impossible, all partially matching rules are
identified. These are rules with at least one attribute-value
pair matching the corresponding attribute-value
pair of a case. For any partially matching rule r, the additional
factor, called Matching_factor(r), is computed.
Matching_factor(r) is defined as the ratio of the number
of matched attribute-value pairs of r with a case to
the total number of attribute-value pairs of r. In partial
matching, the concept C for which the following
expression

$$\sum_{\text{partially matching rules } r \text{ describing } C} \text{Matching\_factor}(r) * \text{Strength}(r) * \text{Specificity}(r) \qquad (23.7)$$

is the largest is the winner, and the case is classified as
being a member of C.

Since the classification system is a part of the LERS
data mining system, rules induced by any component of
LERS, such as LEM1 or LEM2, are presented in the
LERS format, in which every rule is associated with
three numbers: the total number of attribute-value pairs
on the left-hand side of the rule (i.e., specificity), the
total number of cases correctly classified by the rule
during training (i.e., strength), and the total number of
training cases matching the left-hand side of the rule,
i.e., the rule domain size.


23.2 Inconsistent Data


Frequently data sets contain conflicting cases, i.e.,
cases with the same attribute values but from different
concepts. An example of such a data set is presented
in Table 23.4. Cases 4 and 5 have the same values for
all three attributes, yet their decision values are different
(they belong to different concepts). Similarly,
cases 7 and 8 also conflict. Rough set theory handles
inconsistent data by introducing lower and upper
approximations for every concept [23.1, 2].

There exists a very simple test for consistency:
$A^* \le \{d\}^*$. If this condition is false, the corresponding
data set is not consistent. For Table 23.4, $A^*$ =
{{1}, {2}, {3}, {4, 5}, {6, 7, 8}, {9}, {10}}, and $\{d\}^*$ =
{{1, 2, 3, 4}, {5, 6, 7}, {8, 9, 10}}, so $A^* \not\le \{d\}^*$.
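
The consistency test $A^* \le \{d\}^*$ for Table 23.4 can be checked with a few lines of Python. This is a sketch for illustration; the function names are chosen here and the table is typed in from Table 23.4.

```python
COLS = ("Wind", "Humidity", "Temperature", "Trip")
ROWS = [  # Table 23.4
    (1, "low", "low", "medium", "yes"),
    (2, "low", "low", "low", "yes"),
    (3, "low", "medium", "medium", "yes"),
    (4, "low", "medium", "high", "yes"),
    (5, "low", "medium", "high", "maybe"),
    (6, "medium", "low", "medium", "maybe"),
    (7, "medium", "low", "medium", "maybe"),
    (8, "medium", "low", "medium", "no"),
    (9, "high", "high", "high", "no"),
    (10, "medium", "high", "high", "no"),
]
TABLE = {case: dict(zip(COLS, values)) for case, *values in ROWS}
ATTRS = ("Wind", "Humidity", "Temperature")

def partition(table, attrs):
    """Blocks of cases that agree on all attributes in attrs."""
    blocks = {}
    for x, row in table.items():
        blocks.setdefault(tuple(row[a] for a in attrs), set()).add(x)
    return list(blocks.values())

a_star = partition(TABLE, ATTRS)
d_star = partition(TABLE, ["Trip"])

# The data set is consistent iff every elementary set lies inside one concept.
consistent = all(any(X <= Y for Y in d_star) for X in a_star)

# Elementary sets that contain conflicting cases.
conflicts = [X for X in a_star if not any(X <= Y for Y in d_star)]
print(consistent, conflicts)
```

The two conflicting elementary sets, {4, 5} and {6, 7, 8}, are exactly the ones that straddle two concepts.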


Let B be a subset of the set A of all attributes. For
inconsistent data sets, in general, a concept X is not
a definable set. However, set X may be approximated
by two B-definable sets; the first one is called a B-lower
approximation of X, denoted by $\underline{B}X$ and defined as
follows

$$\{x \in U \mid [x]_B \subseteq X\}. \qquad (23.8)$$

The second set is called a B-upper approximation of X,
denoted by $\overline{B}X$ and defined as follows

$$\{x \in U \mid [x]_B \cap X \neq \emptyset\}. \qquad (23.9)$$

In (23.8) and (23.9) lower and upper approximations
are constructed from singletons x; we say that we
are using the so-called first method. The B-lower
approximation of X is the largest B-definable set contained
in X. The B-upper approximation of X is the smallest
B-definable set containing X.

As was observed in [23.2], for complete decision
tables we may use a second method to define the B-lower
approximation of X, by the following formula

$$\bigcup \{[x]_B \mid x \in U, [x]_B \subseteq X\}, \qquad (23.10)$$

while the B-upper approximation of X may be defined,
using the second method, by

$$\bigcup \{[x]_B \mid x \in U, [x]_B \cap X \neq \emptyset\}. \qquad (23.11)$$

Obviously, both (23.8) and (23.10) define the same
set. Similarly, (23.9) and (23.11) also define the same
set. For Table 23.4,

$$\underline{A}\{1, 2, 3, 4\} = \{1, 2, 3\}$$

and

$$\overline{A}\{1, 2, 3, 4\} = \{1, 2, 3, 4, 5\}.$$
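
The two approximations for Table 23.4 can be verified with a short Python sketch that builds them as unions of elementary sets, i.e., by the second method of (23.10) and (23.11). The function name `lower_upper` is chosen here for illustration.

```python
COLS = ("Wind", "Humidity", "Temperature", "Trip")
ROWS = [  # Table 23.4
    (1, "low", "low", "medium", "yes"),
    (2, "low", "low", "low", "yes"),
    (3, "low", "medium", "medium", "yes"),
    (4, "low", "medium", "high", "yes"),
    (5, "low", "medium", "high", "maybe"),
    (6, "medium", "low", "medium", "maybe"),
    (7, "medium", "low", "medium", "maybe"),
    (8, "medium", "low", "medium", "no"),
    (9, "high", "high", "high", "no"),
    (10, "medium", "high", "high", "no"),
]
TABLE = {case: dict(zip(COLS, values)) for case, *values in ROWS}
ATTRS = ("Wind", "Humidity", "Temperature")

def lower_upper(table, attrs, X):
    """B-lower and B-upper approximations of X as unions of B-elementary sets."""
    blocks = {}
    for x, row in table.items():
        blocks.setdefault(tuple(row[a] for a in attrs), set()).add(x)
    lower, upper = set(), set()
    for block in blocks.values():
        if block <= X:      # elementary set entirely inside X
            lower |= block
        if block & X:       # elementary set that intersects X
            upper |= block
    return lower, upper

concept_yes = {x for x, row in TABLE.items() if row["Trip"] == "yes"}
lower, upper = lower_upper(TABLE, ATTRS, concept_yes)
print(lower, upper)
```

The difference between the two sets, here {4, 5}, contains the cases that cannot be assigned to the concept with certainty.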
It is well known that for any B C A and X C U, 


BX CX CBX, (23.12) 
hence any case x from BX is certainly a member of X, 
while any member x of BX is possibly a member of X. 
This observation is used in the LERS data mining sys- 
tem. If an input data set is inconsistent, LERS computes 
lower and upper approximations for any concept and 
then induces certain rules from the lower approxima- 
tion and possible rules from the upper approximation. 
For example, if we want to induce certain and possible 


Table 23.4 An inconsistent decision table

        Attributes                           Decision
Case    Wind      Humidity    Temperature    Trip
1       low       low         medium         yes
2       low       low         low            yes
3       low       medium      medium         yes
4       low       medium      high           yes
5       low       medium      high           maybe
6       medium    low         medium         maybe
7       medium    low         medium         maybe
8       medium    low         medium         no
9       high      high        high           no
10      medium    high        high           no


rule sets for the concept [(Trip, yes)] from Table 23.4, 
we need to consider the following two data sets, pre- 
sented in Tables 23.5 and 23.6. 
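
The two approximations of the concept [(Trip, yes)] used here can be checked mechanically. The following Python fragment is only a sketch of the second method, (23.10) and (23.11), applied to Table 23.4; the names TABLE, elementary_sets, and approximations are our own and do not come from LERS.

```python
from collections import defaultdict

# Table 23.4: case -> (Wind, Humidity, Temperature, Trip)
TABLE = {
    1: ("low", "low", "medium", "yes"),
    2: ("low", "low", "low", "yes"),
    3: ("low", "medium", "medium", "yes"),
    4: ("low", "medium", "high", "yes"),
    5: ("low", "medium", "high", "maybe"),
    6: ("medium", "low", "medium", "maybe"),
    7: ("medium", "low", "medium", "maybe"),
    8: ("medium", "low", "medium", "no"),
    9: ("high", "high", "high", "no"),
    10: ("medium", "high", "high", "no"),
}


def elementary_sets(table):
    """Group cases that agree on all attribute values (decision excluded)."""
    groups = defaultdict(set)
    for case, row in table.items():
        groups[row[:-1]].add(case)
    return list(groups.values())


def approximations(table, concept):
    """Lower and upper approximations as unions of elementary sets."""
    lower, upper = set(), set()
    for block in elementary_sets(table):
        if block <= concept:   # elementary set contained in the concept
            lower |= block
        if block & concept:    # elementary set overlapping the concept
            upper |= block
    return lower, upper


concept_yes = {case for case, row in TABLE.items() if row[-1] == "yes"}
lower, upper = approximations(TABLE, concept_yes)
print(sorted(lower), sorted(upper))
```

Cases in the lower approximation keep the value yes in Table 23.5, and cases in the upper approximation keep it in Table 23.6.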

Table 23.5 was obtained from Table 23.4 by assign- 
ing the value yes of the decision Trip to all cases from 
the lower approximation of [(Trip, yes)] and by replac- 
ing all remaining values of Trip by a special value, say 
SPECIAL. Similarly, Table 23.6 was obtained from Ta- 
ble 23.4 by assigning the value yes of the decision Trip 
to all cases from the upper approximation of [(Trip, 
yes)] and by replacing all remaining values of Trip 
by the value SPECIAL. Obviously, both tables 23.5 
and 23.6 are consistent. Therefore, we may use the 
LEM1 or LEM2 algorithms to induce rules from Ta- 
bles 23.5 and 23.6. The rule set induced by the LEM2 
algorithm from Table 23.5 is: 


2, 2, 2
(Wind, low) & (Humidity, low) → (Trip, yes) ,


Table 23.5 A new data set for inducing certain rules for
the concept [(Trip, yes)]

        Attributes                           Decision
Case    Wind      Humidity    Temperature    Trip
1       low       low         medium         yes
2       low       low         low            yes
3       low       medium      medium         yes
4       low       medium      high           SPECIAL
5       low       medium      high           SPECIAL
6       medium    low         medium         SPECIAL
7       medium    low         medium         SPECIAL
8       medium    low         medium         SPECIAL
9       high      high        high           SPECIAL
10      medium    high        high           SPECIAL


Table 23.6 A new data set for inducing possible rules for
the concept [(Trip, yes)]

        Attributes                           Decision
Case    Wind      Humidity    Temperature    Trip
1       low       low         medium         yes
2       low       low         low            yes
3       low       medium      medium         yes
4       low       medium      high           yes
5       low       medium      high           yes
6       medium    low         medium         SPECIAL
7       medium    low         medium         SPECIAL
8       medium    low         medium         SPECIAL
9       high      high        high           SPECIAL
10      medium    high        high           SPECIAL



2, 1, 1
(Humidity, medium) & (Temperature, medium) → (Trip, yes) ,
1, 4, 4
(Temperature, high) → (Trip, SPECIAL) ,
1, 4, 4
(Wind, medium) → (Trip, SPECIAL) ,


where all rules are presented in the LERS format, see 
Sect. 23.1.3. 


Obviously, only rules with (Trip, yes) on the right- 
hand side are informative; the remaining rules, with 
(Trip, SPECIAL) on the right-hand side should be ig- 
nored. These two rules are certain. The only infor- 
mative rule induced by the LEM2 algorithm from Ta- 
ble 23.6 is: 


1, 4, 5
(Wind, low) → (Trip, yes) .


This rule is possible. 
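
The three numbers that precede each rule in the LERS format can be recomputed from the original Table 23.4. The fragment below is a minimal sketch, not the LERS implementation; the function name lers_numbers and the data layout are assumptions made for illustration.

```python
ATTRS = ("Wind", "Humidity", "Temperature")

# Table 23.4: case -> (Wind, Humidity, Temperature, Trip)
TABLE = {
    1: ("low", "low", "medium", "yes"),
    2: ("low", "low", "low", "yes"),
    3: ("low", "medium", "medium", "yes"),
    4: ("low", "medium", "high", "yes"),
    5: ("low", "medium", "high", "maybe"),
    6: ("medium", "low", "medium", "maybe"),
    7: ("medium", "low", "medium", "maybe"),
    8: ("medium", "low", "medium", "no"),
    9: ("high", "high", "high", "no"),
    10: ("medium", "high", "high", "no"),
}


def lers_numbers(conditions, decision_value):
    """Return (specificity, strength, rule domain size) for one rule.

    conditions is a list of (attribute, value) pairs from the left-hand side.
    """
    matching = [
        row for row in TABLE.values()
        if all(row[ATTRS.index(a)] == v for a, v in conditions)
    ]
    # Strength: matching cases that the rule classifies correctly.
    strength = sum(1 for row in matching if row[-1] == decision_value)
    return len(conditions), strength, len(matching)


certain_1 = lers_numbers([("Wind", "low"), ("Humidity", "low")], "yes")
certain_2 = lers_numbers([("Humidity", "medium"), ("Temperature", "medium")], "yes")
possible = lers_numbers([("Wind", "low")], "yes")
print(certain_1, certain_2, possible)
```

For the possible rule the strength (4) is smaller than the rule domain size (5), because case 5 matches the left-hand side but has the decision value maybe.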


23.3 Decision Table with Numerical Attributes 


An example of a data set with numerical attributes is 
presented in Table 23.7. 

In rule induction from numerical data a prelim- 
inary step called discretization [23.10-12] is usually 
conducted. During discretization a domain of the nu- 
merical attribute is divided into intervals defined by 
cut-points (left and right delimiters of intervals). Such 
an interval, delimited by two cut-points, c and d, will
be denoted by c..d. In this chapter we will discuss
how to do both processes concurrently: rule induction 
and discretization. First we need to check whether our 
data set is consistent. Note that numerical data are, in 
general, consistent, but inconsistent numerical data are 
possible. For inconsistent numerical data we need to
compute lower and upper approximations and then
induce certain and possible rule sets. In the data set from
Table 23.7, A* = {{1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}},
{d}* = {{1, 2, 3}, {4, 5}, {6, 7, 8}}, so A* ≤ {d}*, and
the data set is consistent.

A modified LEM2 algorithm for rule induction, 
called MLEM2 [23.13], does not need any preliminary 
discretization of numerical attributes. The domain of 


Table 23.7 A data set with numerical attributes

        Attributes                           Decision
Case    Wind      Humidity    Temperature    Trip
1       4         low         medium         yes
2       8         low         low            yes
3       4         medium      medium         yes
4       8         medium      high           maybe
5       12        low         medium         maybe
6       16        high        low            no
7       30        high        high           no
8       12        high        high           no


any numerical attribute is sorted first. Then potential 
cut-points are selected as averages of any two consec- 
utive values of the sorted list. For each cut-point c the 
MLEM2 algorithm creates two blocks, the first block 
contains all cases for which values of the numerical 
attribute are smaller than c, the second block contains 
the remaining cases (with values of the numerical at- 
tribute larger than c). Once all such blocks have been 
computed, rule induction in MLEM2 is conducted the 
same way as in LEM2. We will illustrate rule induction 


Table 23.8 Computing a local covering for the concept
[(Trip, yes)], part I

(a, v) = t               [(a, v)]                 {1, 2, 3}      {1, 2, 3}
(Wind, 4..6)             {1, 3}                   {1, 3}         {1, 3} •
(Wind, 6..30)            {2, 4, 5, 6, 7, 8}       {2}            {2}
(Wind, 4..10)            {1, 2, 3, 4}             {1, 2, 3} •    -
(Wind, 10..30)           {5, 6, 7, 8}             -              -
(Wind, 4..14)            {1, 2, 3, 4, 5, 8}       {1, 2, 3}      -
(Wind, 14..30)           {6, 7}                   -              -
(Wind, 4..23)            {1, 2, 3, 4, 5, 6, 8}    {1, 2, 3}      -
(Wind, 23..30)           {7}                      -              -
(Humidity, low)          {1, 2, 5}                {1, 2}         {1, 2}
(Humidity, medium)       {3, 4}                   {3}            {3}
(Humidity, high)         {6, 7, 8}                -              -
(Temperature, low)       {2, 6}                   {2}            {2}
(Temperature, medium)    {1, 3, 5}                {1, 3}         {1, 3}
(Temperature, high)      {4, 7, 8}                -              -
Comments                                          1              2


Table 23.9 Computing a local covering for the concept
[(Trip, yes)], part II

(a, v) = t               [(a, v)]                 {2}            {2}
(Wind, 4..6)             {1, 3}                   -              -
(Wind, 6..30)            {2, 4, 5, 6, 7, 8}       {2}            {2}
(Wind, 4..10)            {1, 2, 3, 4}             {2}            {2}
(Wind, 10..30)           {5, 6, 7, 8}             -              -
(Wind, 4..14)            {1, 2, 3, 4, 5, 8}       {2}            {2}
(Wind, 14..30)           {6, 7}                   -              -
(Wind, 4..23)            {1, 2, 3, 4, 5, 6, 8}    {2}            {2}
(Wind, 23..30)           {7}                      -              -
(Humidity, low)          {1, 2, 5}                {2}            {2} •
(Humidity, medium)       {3, 4}                   -              -
(Humidity, high)         {6, 7, 8}                -              -
(Temperature, low)       {2, 6}                   {2} •          -
(Temperature, medium)    {1, 3, 5}                -              -
(Temperature, high)      {4, 7, 8}                -              -
Comments                                          3              4
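
The cut-points and the blocks of the numerical attribute Wind listed in the first rows of Tables 23.8 and 23.9 can be reproduced with a few lines of Python. This is only an illustrative sketch of the construction described for MLEM2; the names WIND, cutpoints, and numeric_blocks are our own, and we assume that cut-points are taken between consecutive distinct values of the sorted list.

```python
# Wind column of Table 23.7: case -> value
WIND = {1: 4, 2: 8, 3: 4, 4: 8, 5: 12, 6: 16, 7: 30, 8: 12}


def cutpoints(values):
    """Averages of consecutive distinct values of the sorted list."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def numeric_blocks(column):
    """Map each interval (left, right) to the block of cases it covers."""
    lo, hi = min(column.values()), max(column.values())
    blocks = {}
    for c in cutpoints(column.values()):
        # Cases with values smaller than the cut-point c.
        blocks[(lo, c)] = {case for case, v in column.items() if v < c}
        # The remaining cases, with values larger than c.
        blocks[(c, hi)] = {case for case, v in column.items() if v > c}
    return blocks


cuts = cutpoints(WIND.values())
blocks = numeric_blocks(WIND)
for (left, right), cases in blocks.items():
    print(f"(Wind, {left:g}..{right:g})", sorted(cases))
```

Each cut-point yields two intervals, so the four cut-points of Wind give the eight attribute-value pairs of the form (Wind, c..d) used in the tables.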


from Table 23.7 using the MLEM2 rule induction algorithm.
The MLEM2 algorithm is shown in Tables 23.8
and 23.9. The corresponding comments are:

1. The set G = {1, 2, 3}. The best attribute-value pair
   t, with the largest cardinality of the intersection of
   [t] and G (presented in the third column of Table 23.8),
   is (Wind, 4..10). The corresponding entry
   in the third column of Table 23.8 is bulleted. However,

   $[(Wind, 4..10)] = \{1, 2, 3, 4\} \not\subseteq \{1, 2, 3\} = B$ ,

   hence we need to look for the next t.

2. Set G is the same, G = {1, 2, 3}. There are dashes
   for rows (Wind, 4..14) and (Wind, 4..23) since the
   corresponding intervals contain 4..10. There are
   three attribute-value pairs with $|[t] \cap G| = 2$. The best
   attribute-value pair, with the smallest cardinality of
   [t], is (Wind, 4..6). This time

   $\{1, 2, 3, 4\} \cap \{1, 3\} = \{1, 3\} \subseteq \{1, 2, 3\}$ .

   Obviously, the common part of both intervals is 4..6,
   so {(Wind, 4..6)} is the first element T of $\mathcal{T}$.

3. The new set G = B − [T] = {1, 2, 3} − {1, 3} = {2}.
   The pair (Temperature, low) has the smallest cardinality
   of [t], so it is the best choice. However,
   $[(Temperature, low)] = \{2, 6\} \not\subseteq \{1, 2, 3\}$, hence we
   need to look for the next t.

4. The pair (Humidity, low) is the best choice, and

   $\{2, 6\} \cap \{1, 2, 5\} = \{2\} \subseteq \{1, 2, 3\}$ ,

   so {(Temperature, low), (Humidity, low)} is the
   second element T of $\mathcal{T}$.

As a result,

$\mathcal{T}$ = {{(Wind, 4..6)}, {(Temperature, low), (Humidity, low)}} .

In other words, the MLEM2 algorithm induces the
following rule set for Table 23.7:

1, 2, 2
(Wind, 4..6) → (Trip, yes) ,
2, 1, 1
(Temperature, low) & (Humidity, low) → (Trip, yes) .


23.4 Incomplete Data


Real-life data are frequently incomplete. In this section
we will consider incompleteness in the form of missing
attribute values. We will distinguish three types of
missing attribute values:

• Lost values, denoted by ?, where the original values
  existed, but are currently unavailable, since these
  values have been, for example, erased or the operator
  forgot to input them. In rule induction we
  will induce rules from existing, specified attribute
  values.

• Do not care conditions, denoted by *, where the
  original values are mysterious. For example, data
  were collected in the form of an interview, and some
  questions were considered to be irrelevant or were
  embarrassing. Let us say that in an interview associated
  with the diagnosis of a disease, there is
  a question about eye color. For some people such
  a question is irrelevant. In rule induction we are
  assuming that the attribute value is any value from the
  attribute domain.

• Attribute-concept value, denoted by −. This
  interpretation is a special case of the do not care
  condition: it is restricted to attribute values typical
  for the concept to which the case belongs.
  For example, typical values of temperature for patients
  sick with flu are high and very-high. If for
  a patient the temperature value is missing, but we
  know that this patient is sick with flu, then, using
  the attribute-concept interpretation, we will assume
  that possible temperature values are high and
  very-high.


We will assume that for any case at least one at- 
tribute value is specified (i. e., is not missing) and that 
all decision values are specified. An example of a deci- 
sion table with missing attribute values is presented in 
Table 23.10. 

The definition of consistent data from Sect. 23.2 
cannot be applied to data with missing attribute val- 
ues, since for such data the standard definition of 
the indiscernibility relation must be extended. More- 
over, it is well known that the standard defini- 
tions of lower and upper approximations are not 
applicable to data with missing attribute values. In 
Sect. 23.4.1 we will discuss three generalizations of 
the standard approximations: singleton, subset, and 
concept. 


23.4.1 Singleton, Subset, 
and Concept Approximations 


For incomplete data the definition of a block of an
attribute-value pair is modified [23.14]:

• If for an attribute a there exists a case x such that
  a(x) = ?, i.e., the corresponding value is lost, then
  the case x should not be included in any blocks
  [(a, v)] for all values v of attribute a.

• If for an attribute a there exists a case x such that
  the corresponding value is a do not care condition,
  i.e., a(x) = *, then the case x should be included
  in blocks [(a, v)] for all specified values v of
  attribute a.

• If for an attribute a there exists a case x such
  that the corresponding value is an attribute-concept
  value, i.e., a(x) = −, then the corresponding
  case x should be included in blocks [(a, v)]
  for all specified values $v \in V(x, a)$ of attribute a,
  where

  $V(x, a) = \{a(y) \mid a(y) \text{ is specified}, \; y \in U, \; d(y) = d(x)\}$ . (23.13)


Table 23.10 An incomplete decision table

        Attributes                           Decision
Case    Wind      Humidity    Temperature    Trip
1       low       low         medium         yes
2       ?         low         *              yes
3       *         medium      medium         yes
4       low       ?           high           maybe
5       medium    −           medium         maybe
6       *         high        low            no
7       −         high        *              no
8       medium    high        high           no


For Table 23.10,

V(5, Humidity) = ∅ and
V(7, Wind) = {medium} ,

so the blocks of attribute-value pairs are

[(Wind, low)] = {1, 3, 4, 6},
[(Wind, medium)] = {3, 5, 6, 7, 8},
[(Humidity, low)] = {1, 2},
[(Humidity, medium)] = {3},
[(Humidity, high)] = {6, 7, 8},
[(Temperature, low)] = {2, 6, 7},
[(Temperature, medium)] = {1, 2, 3, 5, 7},
[(Temperature, high)] = {2, 4, 7, 8}.


For a case $x \in U$, the characteristic set $K_B(x)$ is
defined as the intersection of the sets K(x, a), for all
$a \in B$, where the set K(x, a) is defined in the following
way:

• If a(x) is specified, then K(x, a) is the block
  [(a, a(x))] of attribute a and its value a(x).

• If a(x) = ? or a(x) = *, then the set K(x, a) = U.

• If a(x) = −, then the corresponding set
  K(x, a) is equal to the union of all blocks of
  attribute-value pairs (a, v), where $v \in V(x, a)$,
  if V(x, a) is nonempty. If V(x, a) is empty,
  K(x, a) = U.

For Table 23.10 and B = A,

$K_A(1) = \{1, 3, 4, 6\} \cap \{1, 2\} \cap \{1, 2, 3, 5, 7\} = \{1\}$ ,
$K_A(2) = U \cap \{1, 2\} \cap U = \{1, 2\}$ ,
$K_A(3) = U \cap \{3\} \cap \{1, 2, 3, 5, 7\} = \{3\}$ ,
$K_A(4) = \{1, 3, 4, 6\} \cap U \cap \{2, 4, 7, 8\} = \{4\}$ ,
$K_A(5) = \{3, 5, 6, 7, 8\} \cap U \cap \{1, 2, 3, 5, 7\} = \{3, 5, 7\}$ ,
$K_A(6) = U \cap \{6, 7, 8\} \cap \{2, 6, 7\} = \{6, 7\}$ ,
$K_A(7) = \{3, 5, 6, 7, 8\} \cap \{6, 7, 8\} \cap U = \{6, 7, 8\}$ ,
$K_A(8) = \{3, 5, 6, 7, 8\} \cap \{6, 7, 8\} \cap \{2, 4, 7, 8\} = \{7, 8\}$ .
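
The blocks and characteristic sets for Table 23.10 follow mechanically from the three interpretations of missing values. The Python sketch below recomputes them under the assumption that lost values are encoded as "?", do not care conditions as "*", and attribute-concept values as "-"; the helper names V, blocks, and characteristic_set are ours and are chosen only for illustration.

```python
ATTRS = ("Wind", "Humidity", "Temperature")
MISSING = {"?", "*", "-"}

# Table 23.10: case -> (Wind, Humidity, Temperature, Trip)
TABLE = {
    1: ("low", "low", "medium", "yes"),
    2: ("?", "low", "*", "yes"),
    3: ("*", "medium", "medium", "yes"),
    4: ("low", "?", "high", "maybe"),
    5: ("medium", "-", "medium", "maybe"),
    6: ("*", "high", "low", "no"),
    7: ("-", "high", "*", "no"),
    8: ("medium", "high", "high", "no"),
}
U = set(TABLE)


def V(x, i):
    """Specified values of attribute i among cases with the decision of x (23.13)."""
    d = TABLE[x][-1]
    return {row[i] for row in TABLE.values() if row[-1] == d and row[i] not in MISSING}


def blocks():
    """Blocks [(a, v)] under the lost / do not care / attribute-concept rules."""
    result = {}
    for i, a in enumerate(ATTRS):
        domain = {row[i] for row in TABLE.values()} - MISSING
        for v in domain:
            result[(a, v)] = {
                x for x, row in TABLE.items()
                if row[i] == v or row[i] == "*" or (row[i] == "-" and v in V(x, i))
            }
    return result


def characteristic_set(x, blk):
    """Intersection of K(x, a) over all attributes a."""
    k = set(U)
    for i, a in enumerate(ATTRS):
        value = TABLE[x][i]
        if value in ("?", "*"):
            continue                      # K(x, a) = U
        if value == "-":
            values = V(x, i)
            if values:                    # union of blocks over V(x, a)
                k &= set().union(*(blk[(a, v)] for v in values))
            continue                      # empty V(x, a): K(x, a) = U
        k &= blk[(a, value)]
    return k


BLOCKS = blocks()
K = {x: characteristic_set(x, BLOCKS) for x in sorted(U)}
for x, k in K.items():
    print(x, sorted(k))
```

Cases with lost values are simply absent from every block of the attribute, which is why no special branch for "?" is needed in blocks.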


The characteristic set $K_B(x)$ may be interpreted as
the set of cases that are indistinguishable from x using
all attributes from B and using a given interpretation of
missing attribute values. For completely specified data
sets (i.e., data sets with no missing attribute values),
characteristic sets are reduced to elementary sets. The
characteristic relation R(B) is a relation on U defined
for $x, y \in U$ as follows


$(x, y) \in R(B)$ if and only if $y \in K_B(x)$ . (23.14)


The characteristic relation R(B) is reflexive but,
in general, does not need to be symmetric or transitive.
Obviously, the characteristic relation R(B)
is known if we know characteristic sets $K_B(x)$ for
all $x \in U$ and vice versa. In our example, R(A) =
{(1, 1), (2, 1), (2, 2), (3, 3), (4, 4), (5, 3), (5, 5), (5, 7),
(6, 6), (6, 7), (7, 6), (7, 7), (7, 8), (8, 7), (8, 8)}. For a complete
decision table, the characteristic relation R(B) is
reduced to the indiscernibility relation [23.2].

Definability for completely specified decision tables
should be modified to fit into incomplete decision tables.
For incomplete decision tables, a union of some
intersections of attribute-value pair blocks, where such
attributes are members of B and are distinct, will be
called a B-locally definable set. A union of characteristic
sets $K_B(x)$, where $x \in X \subseteq U$, will be called a B-globally
definable set. Any set X that is B-globally definable is
B-locally definable; the converse is not true.

For example, the set {2} is A-locally definable since
$\{2\} = [(Humidity, low)] \cap [(Temperature, high)]$. However,
the set {2} is not A-globally definable. On the other
hand, the set {5} is not even locally definable since all
blocks of attribute-value pairs containing case 5 also
contain case 7. Obviously, if a set is not B-locally
definable then it cannot be expressed by rule sets
using attributes from B. Thus we should induce rules
from sets that are at least A-locally definable.

For incomplete decision tables lower and upper
approximations may be defined in a few different ways.
We suggest three different definitions of lower and
upper approximations for incomplete decision tables,
following [23.14–16]. Let X be a concept, a subset of
U, let B be a subset of the set A of all attributes, and
let R(B) be the characteristic relation of the incomplete
decision table. Our first definition uses an idea similar to the
first method in Sect. 23.2, and is based on constructing
both approximations from single elements of the set U.
We will call these approximations singleton. A singleton
B-lower approximation of X is defined as follows


$\underline{B}X = \{x \in U \mid K_B(x) \subseteq X\}$ . (23.15)


A singleton B-upper approximation of X is

$\overline{B}X = \{x \in U \mid K_B(x) \cap X \neq \emptyset\}$ . (23.16)


In our example of the decision table presented in
Table 23.10, the singleton A-lower and A-upper
approximations of the concept {1, 2, 3} are:


$\underline{A}\{1, 2, 3\} = \{1, 2, 3\}$ , (23.17)
$\overline{A}\{1, 2, 3\} = \{1, 2, 3, 5\}$ . (23.18)


We may easily observe that the set
$\{1, 2, 3, 5\} = \overline{A}\{1, 2, 3\}$ is not A-locally definable since in all blocks
of attribute-value pairs cases 5 and 7 are inseparable.
Thus, as it was observed in, e.g., [23.14–16], singleton
approximations should not be used, theoretically,
for rule induction.

The second method of defining lower and upper 
approximations for complete decision tables uses an- 
other idea: lower and upper approximations are unions 
of elementary sets, subsets of U. Therefore, we may 
define lower and upper approximations for incomplete 
decision tables by analogy with the second method in 
Sect. 23.2, using characteristic sets instead of elementary
sets. There are two ways to do this. Using the first
way, a subset B-lower approximation of X is defined as
follows


$\underline{B}X = \bigcup \{K_B(x) \mid x \in U, K_B(x) \subseteq X\}$ . (23.19)

A subset B-upper approximation of X is

$\overline{B}X = \bigcup \{K_B(x) \mid x \in U, K_B(x) \cap X \neq \emptyset\}$ . (23.20)

For any concept X, singleton B-lower and B-upper
approximations of X are subsets of the subset B-lower
and B-upper approximations of X, respectively [23.16],
because the characteristic relation R(B) is reflexive. For
the decision table presented in Table 23.10, the subset
A-lower and A-upper approximations are


$\underline{A}\{1, 2, 3\} = \{1, 2, 3\}$ ,
$\overline{A}\{1, 2, 3\} = \{1, 2, 3, 5, 7\}$ .


The second possibility is to modify the subset definition
of lower and upper approximation by replacing the
universe U from the subset definition by a concept X.
A concept B-lower approximation of the concept X is
defined as follows


$\underline{B}X = \bigcup \{K_B(x) \mid x \in X, K_B(x) \subseteq X\}$ . (23.21)


Obviously, the subset B-lower approximation of X
is the same set as the concept B-lower approximation of
X. A concept B-upper approximation of the concept X
is defined as follows


$\overline{B}X = \bigcup \{K_B(x) \mid x \in X, K_B(x) \cap X \neq \emptyset\} = \bigcup \{K_B(x) \mid x \in X\}$ . (23.22)


The concept upper approximations were defined
in [23.17] and [23.18] as well. The concept B-upper
approximation of X is a subset of the subset B-upper
approximation of X [23.16]. For the decision table
presented in Table 23.10, the concept A-upper
approximation is


$\overline{A}\{1, 2, 3\} = \{1, 2, 3\}$ .


Note that for complete decision tables, all three defi- 
nitions of lower and upper approximations, singleton, 
subset, and concept, are reduced to the same standard 
definition of lower and upper approximations, respec- 
tively. 
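
The three pairs of approximations differ only in which cases are tested and in what is collected. The following sketch restates (23.15), (23.16) and (23.19) to (23.22) in Python for the concept {1, 2, 3}, taking the characteristic sets of Table 23.10 as given; the function names singleton, subset, and concept are our own.

```python
# Characteristic sets K_A(x) for Table 23.10, as listed in the text.
K = {
    1: {1}, 2: {1, 2}, 3: {3}, 4: {4},
    5: {3, 5, 7}, 6: {6, 7}, 7: {6, 7, 8}, 8: {7, 8},
}
U = set(K)


def union(sets):
    """Union of an iterable of sets (empty iterable gives the empty set)."""
    out = set()
    for s in sets:
        out |= s
    return out


def singleton(X):
    """Collect cases x whose characteristic set passes the test."""
    lower = {x for x in U if K[x] <= X}
    upper = {x for x in U if K[x] & X}
    return lower, upper


def subset(X):
    """Collect whole characteristic sets, testing every x in U."""
    lower = union(K[x] for x in U if K[x] <= X)
    upper = union(K[x] for x in U if K[x] & X)
    return lower, upper


def concept(X):
    """Collect whole characteristic sets, testing only x in X."""
    lower = union(K[x] for x in X if K[x] <= X)
    upper = union(K[x] for x in X)
    return lower, upper


X = {1, 2, 3}
for name, approx in (("singleton", singleton), ("subset", subset), ("concept", concept)):
    lower, upper = approx(X)
    print(name, sorted(lower), sorted(upper))
```

Because the subset and concept approximations are unions of characteristic sets, they are globally definable by construction, which is not the case for the singleton upper approximation {1, 2, 3, 5}.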


23.4.2 Modified LEM2 Algorithm 


The same MLEM2 rule induction algorithm from Sect. 23.3 may
be used for rule induction from incomplete data; the
only difference is a different definition of blocks of
attribute-value pairs. Let us apply the MLEM2 algorithm
to the data set from Table 23.10. First, we need to
make a decision as to what kind of approximations we
are going to use: singleton, subset, or concept. In our
example, we use concept approximations. Since, for Table 23.10,


$\underline{A}\{1, 2, 3\} = \overline{A}\{1, 2, 3\} = \{1, 2, 3\}$ ,


we will trace the MLEM2 algorithm applied to the set
{1, 2, 3}; this way our certain rule set, for the concept
[(Trip, yes)], is at the same time certain and possible.
The trace of MLEM2 is presented in Table 23.11.
The corresponding comments are:


1. The set G = {1, 2, 3}. The best attribute-value pair
   t, with the largest cardinality of the intersection
   of [t] and G (presented in the third column of
   Table 23.11), is (Temperature, medium). The corresponding
   entry in the third column of Table 23.11 is
   bulleted. However,

   $[(Temperature, medium)] = \{1, 2, 3, 5, 7\} \not\subseteq \{1, 2, 3\} = B$ ,

   hence we need to look for the next t.

2. Set G is the same, G = {1, 2, 3}. There are two
   attribute-value pairs with $|[t] \cap G| = 2$. One of them,
   (Humidity, low), has the smallest cardinality of [t],
   so we select it. This time

   $\{1, 2, 3, 5, 7\} \cap \{1, 2\} = \{1, 2\} \subseteq \{1, 2, 3\}$ .

   However, (Temperature, medium) is redundant,
   since $[(Humidity, low)] \subseteq \{1, 2, 3\}$, hence
   {(Humidity, low)} is the first element T of $\mathcal{T}$.

3. The new set G = B − [T] = {1, 2, 3} − {1, 2} = {3}.
   The pair (Humidity, medium) has the smallest
   cardinality of [t], so it is the best choice. Additionally,
   $[(Humidity, medium)] = \{3\} \subseteq \{1, 2, 3\}$,
   hence we are done, the set T = {(Humidity,
   medium)}.


Table 23.11 Computing a rule set for the concept [(Trip,
yes)], Table 23.10

(a, v) = t               [(a, v)]            {1, 2, 3}      {1, 2, 3}    {3}
(Wind, low)              {1, 3, 4, 6}        {1, 3}         {1, 3}       {3}
(Wind, medium)           {3, 5, 6, 7, 8}     {3}            {3}          {3}
(Humidity, low)          {1, 2}              {1, 2}         {1, 2} •     -
(Humidity, medium)       {3}                 {3}            {3}          {3} •
(Humidity, high)         {6, 7, 8}           -              -            -
(Temperature, low)       {2, 6, 7}           {2}            -            -
(Temperature, medium)    {1, 2, 3, 5, 7}     {1, 2, 3} •    -            {3}
(Temperature, high)      {2, 4, 7, 8}        {2}            -            -
Comments                                     1              2            3


Therefore, $\mathcal{T}$ = {{(Humidity, low)}, {(Humidity,
medium)}}. The MLEM2 algorithm induces the
following rule set for Table 23.10:

1, 2, 2
(Humidity, low) → (Trip, yes) ,
1, 1, 1
(Humidity, medium) → (Trip, yes) .

23.4.3 Probabilistic Approximations 


In this section we are going to generalize singleton,
subset, and concept approximations from Sect. 23.4.1
to corresponding approximations that are defined using
an additional parameter (or threshold), denoted by
α, and interpreted as a probability. A generalization of
standard approximations, called probabilistic
approximations, has been studied in many papers [23.19–26].

Let B be a subset of the attribute set A and X be
a subset of U.

A B-singleton probabilistic approximation of X with
the threshold α, 0 < α ≤ 1, denoted by $appr^{singleton}_{\alpha,B}(X)$,
is defined as follows


$\{x \mid x \in U, \; \Pr(X \mid K_B(x)) \geq \alpha\}$ ,

where

$\Pr(X \mid K_B(x)) = \frac{|X \cap K_B(x)|}{|K_B(x)|}$

is the conditional probability of X given $K_B(x)$ and |Y|
denotes the cardinality of set Y.

A B-subset probabilistic approximation of the
set X with the threshold α, 0 < α ≤ 1, denoted by
$appr^{subset}_{\alpha,B}(X)$, is defined as follows


$\bigcup \{K_B(x) \mid x \in U, \; \Pr(X \mid K_B(x)) \geq \alpha\}$ .


A B-concept probabilistic approximation of the
set X with the threshold α, 0 < α ≤ 1, denoted by
$appr^{concept}_{\alpha,B}(X)$, is defined as follows


$\bigcup \{K_B(x) \mid x \in X, \; \Pr(X \mid K_B(x)) \geq \alpha\}$ .

For simplicity, if B = A, the A-singleton, A-subset,
and A-concept probabilistic approximations
will be called singleton, subset, and concept
probabilistic approximations, and will be denoted by
$appr^{singleton}_{\alpha}(X)$, $appr^{subset}_{\alpha}(X)$, and $appr^{concept}_{\alpha}(X)$,
respectively.

Obviously, for the concept X, the probabilis- 
tic approximation of a given type (singleton, sub- 
set, or concept) of X computed for the threshold 
equal to the smallest positive conditional probabil- 
ity Pr(X | [x]) is equal to the standard upper ap- 
proximation of X of the same type. Additionally, 
the probabilistic approximation of a given type of X 
computed for the threshold equal to 1 is equal to 
the standard lower approximation of X of the same 


type. 
For the data set from Table 23.12, the set of blocks 
of attribute—value pairs is 


[(Wind, low)] = {1, 3, 5},
[(Wind, high)] = {4, 6, 7, 8},
[(Humidity, low)] = {1, 2, 3, 5},
[(Humidity, high)] = {4, 5, 6, 8},
[(Temperature, low)] = {1, 2, 5, 6},
[(Temperature, high)] = {1, 4, 6, 7, 8}.


The corresponding characteristic sets are 


$K_A(1) = K_A(3) = \{1, 3, 5\}$ ,
$K_A(2) = \{1, 2, 5\}$ ,
$K_A(4) = \{4, 6, 8\}$ ,
$K_A(5) = \{1, 5\}$ ,
$K_A(6) = K_A(8) = \{4, 6, 8\}$ ,
$K_A(7) = \{4, 6, 7, 8\}$ .


Conditional probabilities of the concept {1, 2, 3, 4} 
given a characteristic set K4(x) are presented in Ta- 
ble 23.13. 
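
The conditional probabilities and the resulting probabilistic approximations can be recomputed from the characteristic sets alone. The sketch below assumes the characteristic sets listed above and uses exact fractions for the thresholds, since rounded values such as 0.667 would fail a comparison with 2/3; the names pr and appr are ours.

```python
from fractions import Fraction

# Characteristic sets K_A(x) for Table 23.12, as listed in the text.
K = {
    1: {1, 3, 5}, 2: {1, 2, 5}, 3: {1, 3, 5}, 4: {4, 6, 8},
    5: {1, 5}, 6: {4, 6, 8}, 7: {4, 6, 7, 8}, 8: {4, 6, 8},
}
U = set(K)


def pr(X, x):
    """Conditional probability Pr(X | K_A(x)) as an exact fraction."""
    return Fraction(len(X & K[x]), len(K[x]))


def appr(X, alpha, kind):
    """Singleton, subset, or concept probabilistic approximation of X."""
    if kind == "singleton":
        return {x for x in U if pr(X, x) >= alpha}
    candidates = X if kind == "concept" else U   # "subset" tests every x in U
    result = set()
    for x in candidates:
        if pr(X, x) >= alpha:
            result |= K[x]
    return result


X = {1, 2, 3, 4}
thresholds = [Fraction(1, 4), Fraction(1, 3), Fraction(1, 2), Fraction(2, 3), Fraction(1)]
for kind in ("singleton", "subset", "concept"):
    for alpha in thresholds:
        print(kind, float(alpha), sorted(appr(X, alpha, kind)))
```

With the threshold equal to 1 all three approximations are empty, because no characteristic set is contained in the concept {1, 2, 3, 4}.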


Table 23.12 An incomplete decision table

        Attributes                           Decision
Case    Wind      Humidity    Temperature    Trip
1       low       low         *              yes
2       ?         low         low            yes
3       low       low         ?              yes
4       high      high        high           yes
5       low       *           low            no
6       high      high        *              no
7       high      ?           high           no
8       high      high        high           no


Table 23.13 Conditional probabilities

K_A(x)                    {1, 2, 5}   {1, 3, 5}   {1, 5}   {4, 6, 8}   {4, 6, 7, 8}
Pr({1, 2, 3, 4} | K_A(x))   0.667       0.667       0.5      0.333       0.25


For Table 23.13, all probabilistic approximations
(singleton, subset, and concept) are

$appr^{singleton}_{0.25}(\{1, 2, 3, 4\}) = U$ ,
$appr^{singleton}_{0.333}(\{1, 2, 3, 4\}) = \{1, 2, 3, 4, 5, 6, 8\}$ ,
$appr^{singleton}_{0.5}(\{1, 2, 3, 4\}) = \{1, 2, 3, 5\}$ ,
$appr^{singleton}_{0.667}(\{1, 2, 3, 4\}) = \{1, 2, 3\}$ ,
$appr^{singleton}_{1}(\{1, 2, 3, 4\}) = \emptyset$ ,
$appr^{subset}_{0.25}(\{1, 2, 3, 4\}) = U$ ,
$appr^{subset}_{0.333}(\{1, 2, 3, 4\}) = \{1, 2, 3, 4, 5, 6, 8\}$ ,
$appr^{subset}_{0.5}(\{1, 2, 3, 4\}) = \{1, 2, 3, 5\}$ ,
$appr^{subset}_{0.667}(\{1, 2, 3, 4\}) = \{1, 2, 3, 5\}$ ,
$appr^{subset}_{1}(\{1, 2, 3, 4\}) = \emptyset$ ,
$appr^{concept}_{0.25}(\{1, 2, 3, 4\}) = \{1, 2, 3, 4, 5, 6, 8\}$ ,
$appr^{concept}_{0.333}(\{1, 2, 3, 4\}) = \{1, 2, 3, 4, 5, 6, 8\}$ ,
$appr^{concept}_{0.5}(\{1, 2, 3, 4\}) = \{1, 2, 3, 5\}$ ,
$appr^{concept}_{0.667}(\{1, 2, 3, 4\}) = \{1, 2, 3, 5\}$ ,
$appr^{concept}_{1}(\{1, 2, 3, 4\}) = \emptyset$ .


Table 23.14 A modified decision table

        Attributes                           Decision
Case    Wind      Humidity    Temperature    Trip
1       low       low         *              yes
2       ?         low         low            yes
3       low       low         ?              yes
4       high      high        high           SPECIAL
5       low       *           low            no
6       high      high        *              SPECIAL
7       high      ?           high           SPECIAL
8       high      high        high           SPECIAL


For rule induction from probabilistic approximations
of the given concept a technique similar to the


one in Sect. 23.2 may be used. For any concept and the
probabilistic approximation of the concept we will create
a new decision table. Let us illustrate this idea with
inducing a rule set for the concept [(Trip, yes)] from
Table 23.12 using the concept probabilistic approximation
with α = 0.5. The corresponding modified decision table
is presented in Table 23.14.

In the data set presented in Table 23.14, all values
of Trip are copied from Table 23.12 for all cases
from


$appr^{concept}_{0.5}(\{1, 2, 3, 4\}) = \{1, 2, 3, 5\}$ ,


while for all remaining cases values of Trip are replaced
by the SPECIAL value. The MLEM2 rule induction
algorithm should be used with the corresponding type of
upper approximation (singleton, subset, or concept).
In our example,
the MLEM2 rule induction algorithm, using the concept
upper approximation, induces the following rule set
from Table 23.14:


1, 3, 4
(Humidity, low) → (Trip, yes),
1, 4, 4
(Wind, high) → (Trip, SPECIAL),
2, 1, 2
(Wind, low) & (Temperature, low) → (Trip, no).

The only rules that are useful should have (Trip, yes)
on the right-hand side. Thus, the only rule that survives
is:

1, 3, 4
(Humidity, low) → (Trip, yes).


23.5 Conclusions


Investigation of rule induction methods is subject to
intensive research activity. New versions of rule induction
algorithms based on probabilistic approximations
have been explored [23.27, 28]. Novel rule induction
algorithms in which computation of proba-

As quantitative generalizations of Pawlak rough
sets, probabilistic rough sets consider degrees of 
overlap between equivalence classes and the set. 
An equivalence class is put into the lower approx- 
imation if the conditional probability of the set, 
given the equivalence class, is equal to or above 
one threshold; an equivalence class is put into the 
upper approximation if the conditional probability
is above another threshold. We review
a basic model of probabilistic rough sets (i.e., 
decision-theoretic rough set model) and varia- 
tions. We present the main results of probabilistic 
rough sets by focusing on three issues: (a) interpre- 
tation and calculation of the required thresholds, 
(b) estimation of the required conditional proba- 
bilities, and (c) interpretation and applications of 
probabilistic rough set approximations.


24.1 Motivation for Studying Probabilistic Rough Sets


Rough set theory [24.1,2] provides a simple and ele- 
gant method for analyzing data represented in a tabular 
form called an information table. The rows of the table 
represent a finite set of objects, the columns represent 
a finite set of attributes, and each cell represents the 
value of an object on the corresponding attribute. With 
a limited number of attributes, we may only be able 
to describe some subsets of objects precisely [24.3, 4]. 
Those subsets that can be precisely described are called 
definable sets, and all other subsets are called undefin- 
able sets. A fundamental notion of rough set theory is 
the approximation of a subset of objects by a pair of de- 
finable sets from below and above, or equivalently, by 
three pairwise disjoint positive, negative, and boundary 
regions [24.4]. 

Pawlak rough set approximations are characterized 
by a zero tolerance of errors. That is, an object in the 
lower approximation certainly belongs to the set and an
object in the complement of the upper approximation
certainly does not belong to the set. This has motivated the
introduction of many different generalizations of rough 
sets. By introducing certain levels of errors, probabilis- 
tic rough sets [24.5, 6] are quantitative generalizations 
of the qualitative Pawlak rough sets. Although several
specific models of probabilistic rough sets had been
considered by some authors [24.7–10], a more general
model, called the decision-theoretic rough set (DTRS)
model, was first proposed by Yao et al. [24.11, 12]
based on the well-established Bayesian decision theory.
Other probabilistic models include variable precision
rough sets [24.13, 14], Bayesian rough sets [24.15–18],
parameterized rough sets [24.19, 20], game-theoretic
rough sets [24.21, 22], variable-consistency
indiscernibility-based and dominance-based rough
sets [24.23, 24], stochastic dominance-based rough
sets [24.25], naive Bayesian rough sets [24.26],
information-theoretic rough sets [24.27],
confirmation-theoretic rough sets [24.28], and many different types
of probabilistic rough set approximations [24.29, 30].

In this chapter, we present a basic model of
probabilistic rough sets and a brief review of other
probabilistic rough set models. We examine in particular
three fundamental issues, namely, the interpretation and
computation of the pair of thresholds, the estimation
of probability, and an application of three regions. We
also show how a probabilistic approach can be applied
when information related to some order representing
the extent to which some property related to considered
attributes has to be taken into account. This situation is
handled by the well-known rough set extension called
dominance-based rough set approach [24.31–34]. A full
understanding of these issues will greatly increase the
chance of success when applying probabilistic rough
sets in real-world applications.


24.2 Pawlak Rough Sets


We present a semantically meaningful definition of
rough set approximations and a simple method for
constructing rough set approximations.


24.2.1 Rough Set Approximations

In rough set theory, a finite set of objects is described by
using a finite set of attributes in a tabular form, called an
information table [24.2]. Formally, an information table
can be expressed as

$S = (U, AT, \{V_a \mid a \in AT\}, \{I_a \mid a \in AT\})$ ,

where

U is a finite nonempty set of objects called universe ,
AT is a finite nonempty set of attributes ,
$V_a$ is a nonempty set of values for $a \in AT$ ,
$I_a : U \to V_a$ is an information function .


The information table provides all available information 
about the set of objects, based on which we can perform 
tasks of analysis and inference. 

In an information table, one can introduce a
description language, as suggested by Marek and
Pawlak [24.3], to formally describe objects. We consider
a language DL that is recursively defined as
follows


(1) $(a = v) \in DL$, where $a \in AT$, $v \in V_a$,
(2) if $p, q \in DL$, then $\neg p, \; p \wedge q, \; p \vee q \in DL$.


Formulas defined by (1) are called atomic formulas. 
The satisfiability of formula p by an object x, written 
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x p, is defined as follows 


© xH (a= v) iff L(x) =v, 

Gi) xE =p, iff “GF p), 
Gii) x= p^q, iff x} pandxEq, 
(iv) x= pVq,iffxFEporxEq. 


If p is a formula, the set m(p) C U defined by 


m(p) = {xe U| xF p} (24.1) 


is called the meaning set of p. That is, the meaning set 
m(p) consists of all those objects that satisfy the for- 
mula p. 
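
The recursive definition of satisfiability translates directly into a recursive evaluator. The following is a minimal sketch, assuming a small hypothetical information table and an illustrative tuple encoding of formulas; neither the data nor the names `TABLE`, `satisfies`, and `meaning` come from the chapter.

```python
# Hypothetical information table: object -> {attribute: value}.
TABLE = {
    "o1": {"color": "red", "size": "small"},
    "o2": {"color": "red", "size": "large"},
    "o3": {"color": "blue", "size": "small"},
    "o4": {"color": "blue", "size": "large"},
}

# Assumed encoding of formulas as nested tuples:
# ("atom", a, v), ("not", p), ("and", p, q), ("or", p, q).
def satisfies(x, p):
    """Return True iff object x satisfies formula p, following rules (i)-(iv)."""
    kind = p[0]
    if kind == "atom":
        _, a, v = p
        return TABLE[x][a] == v
    if kind == "not":
        return not satisfies(x, p[1])
    if kind == "and":
        return satisfies(x, p[1]) and satisfies(x, p[2])
    if kind == "or":
        return satisfies(x, p[1]) or satisfies(x, p[2])
    raise ValueError(f"unknown connective: {kind}")

def meaning(p):
    """Meaning set m(p) of (24.1): all objects that satisfy p."""
    return {x for x in TABLE if satisfies(x, p)}

red_or_large = ("or", ("atom", "color", "red"), ("atom", "size", "large"))
print(sorted(meaning(red_or_large)))  # ['o1', 'o2', 'o4']
```

In this toy table the set {o1, o2, o4} is therefore definable, with `red_or_large` as one of its descriptions.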

With the introduction of a description language, we can formally describe an important characteristic of an information table, namely, that some subsets of objects are definable or describable while others are not. A subset of objects $X \subseteq U$ is called a definable set [24.3, 4] if there exists a formula $p$ such that


$$X = m(p); \tag{24.2}$$


otherwise, $X$ is called an undefinable set. The formula $p$ is called a description of $X$. Let $\mathrm{DEF}(U) \subseteq 2^U$ denote the family of all definable sets, where $2^U$ is the power set of $U$. By definition, $\mathrm{DEF}(U)$ contains the empty set $\emptyset$ and the entire universe $U$, and is closed under set complement, intersection, and union. In other words, $\mathrm{DEF}(U)$ is a sub-Boolean algebra of the power set $2^U$.

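For any subset of objects $X \subseteq U$, which may be either definable or undefinable, we define the following pair of lower and upper approximations


$$\begin{aligned}
\underline{\mathrm{apr}}(X) &= \text{the largest definable set contained in } X \\
&= \bigcup \{G \in \mathrm{DEF}(U) \mid G \subseteq X\}, \\
\overline{\mathrm{apr}}(X) &= \text{the smallest definable set containing } X \\
&= \bigcap \{G \in \mathrm{DEF}(U) \mid X \subseteq G\}.
\end{aligned} \tag{24.3}$$


By definition, it follows that $\underline{\mathrm{apr}}(X) \subseteq X \subseteq \overline{\mathrm{apr}}(X)$ for
any $X \subseteq U$, and $\underline{\mathrm{apr}}(X) = X = \overline{\mathrm{apr}}(X)$ if and only if
$X \in \mathrm{DEF}(U)$. The definition is semantically meaningful
in the sense that it clearly explains the motivation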
for introducing rough set approximations and provides 
an interpretation of the approximations. However, one 
cannot use this definition to construct rough set approx- 
imations easily. 


24.2.2 Construction of Rough Set Approximations


A simple method for constructing rough set approximations is through an equivalence relation. For an attribute $a \in AT$, the information function $I_a$ maps an object in $U$ to a value of $V_a$, that is, $I_a(x) \in V_a$. For an attribute $a \in AT$, we can define an equivalence relation $E_a$ as follows: for $x, y \in U$


$$x E_a y \iff I_a(x) = I_a(y). \tag{24.4}$$


The equivalence class containing $x$ is denoted by $[x]_a$. Similarly, for a subset of attributes $A \subseteq AT$, we define an equivalence relation $E_A$


$$x E_A y \iff \forall a \in A\ (I_a(x) = I_a(y)). \tag{24.5}$$


The equivalence class containing $x$ is denoted by $[x]_A$. By definition, it follows that, for $a \in AT$ and $A \subseteq AT$,


$$\begin{aligned}
E_{\{a\}} &= E_a, & [x]_{\{a\}} &= [x]_a, \\
E_A &= \bigcap_{a \in A} E_a, & [x]_A &= \bigcap_{a \in A} [x]_a.
\end{aligned} \tag{24.6}$$


That is, we can construct the equivalence relation in- 
duced by a subset of attributes A by using equivalence 
relations induced by individual attributes in A. 
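
As an illustration of (24.5) and (24.6), the partition induced by a subset of attributes can be computed by grouping objects on their value tuples. The sketch below assumes a hypothetical four-object table; the helper names `partition` and `eq_class` are ours, not the chapter's.

```python
from collections import defaultdict

# Hypothetical information table: object -> {attribute: value}.
TABLE = {
    "o1": {"color": "red", "size": "small"},
    "o2": {"color": "red", "size": "large"},
    "o3": {"color": "red", "size": "small"},
    "o4": {"color": "blue", "size": "large"},
}

def partition(table, attrs):
    """Return U/E_A as a set of frozensets: objects grouped by their values on attrs."""
    blocks = defaultdict(set)
    for x, row in table.items():
        blocks[tuple(row[a] for a in attrs)].add(x)
    return {frozenset(b) for b in blocks.values()}

def eq_class(table, attrs, x):
    """Equivalence class [x]_A of object x with respect to the attributes attrs."""
    return next(b for b in partition(table, attrs) if x in b)

# (24.6): [x]_A is the intersection of the single-attribute classes [x]_a.
by_both = eq_class(TABLE, ["color", "size"], "o1")
by_each = eq_class(TABLE, ["color"], "o1") & eq_class(TABLE, ["size"], "o1")
print(sorted(by_both), sorted(by_each))  # ['o1', 'o3'] ['o1', 'o3']
```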

Consider the equivalence relation $E_A \subseteq U \times U$ induced by a subset of attributes $A \subseteq AT$. The equivalence relation $E_A$ induces a partition $U/E_A$ of $U$, i.e., a family of nonempty and pairwise disjoint subsets whose union is the universe. For an object $x \in U$, its equivalence class is given by


$$[x]_A = \{y \in U \mid x E_A y\}. \tag{24.7}$$


By taking the union of a family of equivalence classes, one can construct an atomic sub-Boolean algebra $B(U/E_A)$ of $2^U$ with $U/E_A$ as the set of atoms


$$B(U/E_A) = \Bigl\{\bigcup F \Bigm| F \subseteq U/E_A\Bigr\}. \tag{24.8}$$


That is, $B(U/E_A)$ contains the empty set $\emptyset$ and the whole set $U$, and is closed with respect to set complement, intersection, and union. The three notions of the equivalence relation $E_A$, the partition $U/E_A$, and the atomic Boolean algebra $B(U/E_A)$ uniquely determine each other. We can therefore use $E_A$, $U/E_A$, and $B(U/E_A)$ interchangeably.

The pair $\mathrm{apr} = (U, E_A)$, equivalently, the pair $\mathrm{apr} = (U, U/E_A)$ or the pair $\mathrm{apr} = (U, B(U/E_A))$, is called an approximation space. Although the three representations are equivalent, each of them provides a different hint when we generalize rough sets. The pair $\mathrm{apr} = (U, E_A)$ is useful for generalizing rough sets using a nonequivalence relation [24.35]. The partition $U/E_A$ may be viewed as a granulation of the universe $U$, and the pair $\mathrm{apr} = (U, U/E_A)$ relates rough sets and granular computing [24.36]. The pair $\mathrm{apr} = (U, B(U/E_A))$ leads to a subsystem-based formulation and generalizations [24.37].

For a subset of attributes $A \subseteq AT$, if we restrict the formulas of $DL$ by using only attributes in $A$, we obtain a sublanguage $DL(A) \subseteq DL$. It can be proved that the family of all definable sets $\mathrm{DEF}_A(U)$ defined by $DL(A)$ is exactly the sub-Boolean algebra $B(U/E_A)$. With respect to a subset of attributes $A \subseteq AT$, each object $x$ is described by a logic formula


$$\bigwedge_{a \in A} a = I_a(x), \tag{24.9}$$


where $I_a(x) \in V_a$ and the atomic formula $a = I_a(x)$ indicates that the value of an object on attribute $a$ is $I_a(x)$. The equivalence class containing $x$, namely, $[x]_A$, is the set of those objects that satisfy the formula $\bigwedge_{a \in A} a = I_a(x)$. The formula can be viewed as a description of objects that are equivalent to $x$ with respect to $A$, including $x$ itself.

Based on the equivalence of $\mathrm{DEF}_A(U)$ and $B(U/E_A)$, we can equivalently define rough set approximations by using the equivalence classes $[x]_A$. For simplicity, we write $[x]$ when no confusion arises.


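For a subset of objects $X \subseteq U$, the pair of lower and upper approximations can be equivalently defined by

$$\begin{aligned}
\underline{\mathrm{apr}}(X) &= \{x \in U \mid [x] \subseteq X\}, \\
\overline{\mathrm{apr}}(X) &= \{x \in U \mid [x] \cap X \neq \emptyset\}.
\end{aligned} \tag{24.10}$$


Construction of rough set approximations by this definition is much easier. Alternatively, one can also define three pairwise disjoint positive, negative, and boundary regions [24.38]

$$\begin{aligned}
\mathrm{POS}(X) &= \{x \in U \mid [x] \subseteq X\}, \\
\mathrm{NEG}(X) &= \{x \in U \mid [x] \cap X = \emptyset\}, \\
\mathrm{BND}(X) &= \{x \in U \mid [x] \not\subseteq X \wedge [x] \cap X \neq \emptyset\}.
\end{aligned} \tag{24.11}$$

The pair of approximations and the three regions determine each other as follows

$$\begin{aligned}
\mathrm{POS}(X) &= \underline{\mathrm{apr}}(X), \\
\mathrm{NEG}(X) &= (\overline{\mathrm{apr}}(X))^c, \\
\mathrm{BND}(X) &= \overline{\mathrm{apr}}(X) - \underline{\mathrm{apr}}(X);
\end{aligned} \tag{24.12}$$

and

$$\begin{aligned}
\underline{\mathrm{apr}}(X) &= \mathrm{POS}(X), \\
\overline{\mathrm{apr}}(X) &= \mathrm{POS}(X) \cup \mathrm{BND}(X),
\end{aligned} \tag{24.13}$$


where $(\cdot)^c$ denotes the complement of a set. Each representation provides a distinctive interpretation of rough set approximations. We will use the three-region approximation in the rest of this chapter, due to its close connections to three-way decisions.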
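
The equivalence-class formulation can be sketched in a few lines. The example below assumes that the partition is already available as a list of Python sets; the data and the function name `regions` are hypothetical.

```python
def regions(partition, X):
    """Pawlak positive, negative, and boundary regions of X, following (24.11)."""
    pos, neg, bnd = set(), set(), set()
    for block in partition:
        if block <= X:            # [x] is a subset of X
            pos |= block
        elif not (block & X):     # [x] and X are disjoint
            neg |= block
        else:                     # [x] overlaps X without being contained in it
            bnd |= block
    return pos, neg, bnd

# Hypothetical partition U/E_A and target set X.
U_PARTITION = [{"o1", "o3"}, {"o2"}, {"o4", "o5"}]
X = {"o1", "o3", "o4"}

pos, neg, bnd = regions(U_PARTITION, X)
lower, upper = pos, pos | bnd     # (24.13)
print(sorted(lower), sorted(upper))  # ['o1', 'o3'] ['o1', 'o3', 'o4', 'o5']
```

Here o4 belongs to X but is indiscernible from o5, so both objects fall into the boundary region.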


24.3 A Basic Model of Probabilistic Rough Sets 


The decision-theoretic rough set (DTRS) model proposed by Yao et al. [24.11, 12] gives rise to a general form of probabilistic rough set approximations by using a pair of thresholds on conditional probabilities. The results enable us to formulate a basic model of probabilistic rough sets. However, we introduce the model in a way that is different from DTRS. We first interpret Pawlak rough sets in terms of probability and the two extreme values of probability (i.e., 1 and 0) and then generalize 1 and 0 into a pair of thresholds $(\alpha, \beta)$ with $0 \le \beta < \alpha \le 1$.

The Pawlak rough sets consider only a qualitative relationship between an equivalence class and a set, namely, whether an equivalence class is a subset of the set or has a nonempty intersection with the set. This qualitative nature becomes clearer with a probabilistic interpretation [24.6]. Suppose $\Pr(X|[x])$ denotes the conditional probability that an object is in $X$ given that the object is in $[x]$. The conditions for defining the three rough set regions can be equivalently expressed as


$$\begin{aligned}
[x] \subseteq X &\iff \Pr(X|[x]) \ge 1, \\
[x] \cap X = \emptyset &\iff \Pr(X|[x]) \le 0, \\
[x] \not\subseteq X \wedge [x] \cap X \neq \emptyset &\iff 0 < \Pr(X|[x]) < 1.
\end{aligned} \tag{24.14}$$


Although a probability can never be greater than 1 or less than 0, we purposely use the conditions $\ge 1$ and $\le 0$, whose intended meaning will become clearer later. By those conditions, the three Pawlak regions can be equivalently expressed as


$$\begin{aligned}
\mathrm{POS}(X) &= \{x \in U \mid \Pr(X|[x]) \ge 1\}, \\
\mathrm{NEG}(X) &= \{x \in U \mid \Pr(X|[x]) \le 0\}, \\
\mathrm{BND}(X) &= \{x \in U \mid 0 < \Pr(X|[x]) < 1\}.
\end{aligned} \tag{24.15}$$


They show that Pawlak rough sets only use the two extreme values, i.e., 1 and 0, of probability.

It is natural to generalize Pawlak rough sets by replacing 1 and 0 with some other values in the unit interval $[0,1]$. Given a pair of thresholds $(\alpha, \beta)$ with $0 \le \beta < \alpha \le 1$, the main results of probabilistic rough sets are the $(\alpha, \beta)$-probabilistic regions defined by


$$\begin{aligned}
\mathrm{POS}_{(\alpha,\beta)}(X) &= \{x \in U \mid \Pr(X|[x]) \ge \alpha\}, \\
\mathrm{NEG}_{(\alpha,\beta)}(X) &= \{x \in U \mid \Pr(X|[x]) \le \beta\}, \\
\mathrm{BND}_{(\alpha,\beta)}(X) &= \{x \in U \mid \beta < \Pr(X|[x]) < \alpha\}.
\end{aligned} \tag{24.16}$$


The Pawlak rough set model is a special case in which $\alpha = 1$ and $\beta = 0$. In the case when $0 < \beta = \alpha < 1$, the three regions are given by


$$\begin{aligned}
\mathrm{POS}_{(\alpha,\alpha)}(X) &= \{x \in U \mid \Pr(X|[x]) > \alpha\}, \\
\mathrm{NEG}_{(\alpha,\alpha)}(X) &= \{x \in U \mid \Pr(X|[x]) < \alpha\}, \\
\mathrm{BND}_{(\alpha,\alpha)}(X) &= \{x \in U \mid \Pr(X|[x]) = \alpha\}.
\end{aligned} \tag{24.17}$$


It may be commented that this special case is perhaps more of mathematical interest than of practical use. We use this particular definition in order to establish a connection to existing studies. As will be shown in subsequent discussions, when $\Pr(X|[x]) = \alpha = \beta$, the costs of assigning objects in $[x]$ to the positive, boundary, and negative regions, respectively, are the same. In fact, one may simply define two regions by assigning objects in the boundary region into either the positive or the negative region.
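
A minimal sketch of the $(\alpha, \beta)$-probabilistic regions follows. It assumes that the conditional probability is estimated by the relative frequency $|[x] \cap X| / |[x]|$, which is only one possible estimator; the data and the function name `prob_regions` are hypothetical.

```python
from fractions import Fraction

def prob_regions(partition, X, alpha, beta):
    """(alpha, beta)-probabilistic regions of (24.16), assuming beta < alpha.

    Pr(X|[x]) is estimated here by |[x] & X| / |[x]| (an assumption of this sketch).
    """
    pos, neg, bnd = set(), set(), set()
    for block in partition:
        p = Fraction(len(block & X), len(block))
        if p >= alpha:
            pos |= block
        elif p <= beta:
            neg |= block
        else:
            bnd |= block
    return pos, neg, bnd

# Hypothetical partition and target set; the three blocks have
# estimated probabilities 3/4, 1/2, and 1/4, respectively.
PARTITION = [{1, 2, 3, 4}, {5, 6}, {7, 8, 9, 10}]
X = {1, 2, 3, 5, 7}

relaxed = prob_regions(PARTITION, X, Fraction(3, 4), Fraction(1, 4))
pawlak = prob_regions(PARTITION, X, 1, 0)   # special case alpha = 1, beta = 0
print(relaxed)
print(pawlak)
```

With the relaxed thresholds the first and last blocks leave the boundary region, whereas under the Pawlak thresholds every block stays in it, since none is contained in or disjoint from X.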

The main results of the basic model of probabilistic rough sets were first proposed by Yao et al. [24.11, 12] in a DTRS model, based on Bayesian decision theory. The DTRS model covers all specific models introduced before it. The interpretation of Pawlak rough sets in terms of conditional probability, i.e., the model characterized by $\alpha = 1$ and $\beta = 0$, was first given by Wong and Ziarko [24.10]. A 0.5-model, characterized by $\alpha = \beta = 0.5$, was introduced by Wong and Ziarko [24.8] and Pawlak et al. [24.7], in which the positive region is defined by probability greater than 0.5, the negative by probability less than 0.5, and the boundary by probability equal to 0.5. A model characterized by $\alpha > 0.5$ and $\beta = 0.5$ was suggested by Wong and Ziarko [24.9]. The most recent developments on decision-theoretic rough sets can be found in a book edited by Li et al. [24.39] and papers [24.21, 40-50] in a journal special issue edited by Yao et al. [24.51].


24.4 Variants of Probabilistic Rough Sets 


Since the introduction of the decision-theoretic rough set
model, several new models have been proposed and investigated.
They offer related but different directions in
generalizing Pawlak rough sets by incorporating proba- 
bilistic information. 


24.4.1 Variable Precision Rough Sets 


The first version of variable precision rough sets was introduced by Ziarko [24.14], in which the standard set inclusion $[x] \subseteq X$ is generalized into a graded set inclusion $s([x], X)$, called a measure of the relative degree of misclassification of $[x]$ with respect to $X$. A particular measure suggested by Ziarko is given by


$$s([x], X) = 1 - \frac{|[x] \cap X|}{|[x]|}, \tag{24.18}$$


where $|\cdot|$ denotes the cardinality of a set. By introducing a threshold $0 \le z < 0.5$, one can define three regions as follows


$$\begin{aligned}
\mathrm{VPOS}_z(X) &= \{x \in U \mid s([x], X) \le z\}, \\
\mathrm{VNEG}_z(X) &= \{x \in U \mid s([x], X) \ge 1 - z\}, \\
\mathrm{VBND}_z(X) &= \{x \in U \mid z < s([x], X) < 1 - z\}.
\end{aligned} \tag{24.19}$$


A more general version using a pair of thresholds was later introduced by Katzberg and Ziarko [24.13] as follows: for $0 \le l < u \le 1$,


$$\begin{aligned}
\mathrm{VPOS}_{(l,u)}(X) &= \{x \in U \mid s([x], X) \le l\}, \\
\mathrm{VNEG}_{(l,u)}(X) &= \{x \in U \mid s([x], X) \ge u\}, \\
\mathrm{VBND}_{(l,u)}(X) &= \{x \in U \mid l < s([x], X) < u\}.
\end{aligned} \tag{24.20}$$

The one-threshold model may be considered as a special case of the two-threshold model with $l = z$ and $u = 1 - z$.

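One may interpret the ratio in (24.18) as an estimation of the conditional probability $\Pr(X|[x])$, namely

$$s([x], X) = 1 - \frac{|[x] \cap X|}{|[x]|} = 1 - \Pr(X|[x]). \tag{24.21}$$

By setting $\alpha = 1 - l$ and $\beta = 1 - u$, we immediately have

$$\begin{aligned}
\mathrm{POS}_{(1-l,1-u)}(X) &= \{x \in U \mid \Pr(X|[x]) \ge 1 - l\} \\
&= \{x \in U \mid s([x], X) \le l\} \\
&= \mathrm{VPOS}_{(l,u)}(X), \\
\mathrm{NEG}_{(1-l,1-u)}(X) &= \{x \in U \mid \Pr(X|[x]) \le 1 - u\} \\
&= \{x \in U \mid s([x], X) \ge u\} \\
&= \mathrm{VNEG}_{(l,u)}(X), \\
\mathrm{BND}_{(1-l,1-u)}(X) &= \{x \in U \mid 1 - u < \Pr(X|[x]) < 1 - l\} \\
&= \{x \in U \mid l < s([x], X) < u\} \\
&= \mathrm{VBND}_{(l,u)}(X).
\end{aligned} \tag{24.22}$$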


It follows that, when the particular set-inclusion mea- 
sure defined by (24.18) is used, the variable precision 
rough sets are coincident with the decision-theoretic 
rough sets. 
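
The correspondence in (24.22) can be checked numerically on a small example. The sketch below uses hypothetical data and function names, and works at the level of equivalence classes rather than objects for brevity.

```python
from fractions import Fraction

def misclassification(block, X):
    """Ziarko's relative degree of misclassification s([x], X) of (24.18)."""
    return 1 - Fraction(len(block & X), len(block))

def vprs_regions(partition, X, l, u):
    """Blocks in VPOS, VNEG, VBND of (24.20)."""
    pos = [b for b in partition if misclassification(b, X) <= l]
    neg = [b for b in partition if misclassification(b, X) >= u]
    bnd = [b for b in partition if l < misclassification(b, X) < u]
    return pos, neg, bnd

def prob_regions(partition, X, alpha, beta):
    """Blocks in POS, NEG, BND of (24.16), with Pr estimated by relative frequency."""
    def pr(b):
        return Fraction(len(b & X), len(b))
    pos = [b for b in partition if pr(b) >= alpha]
    neg = [b for b in partition if pr(b) <= beta]
    bnd = [b for b in partition if beta < pr(b) < alpha]
    return pos, neg, bnd

# Hypothetical partition and target set.
PARTITION = [{1, 2, 3, 4}, {5, 6}, {7, 8, 9, 10}]
X = {1, 2, 3, 5, 7}
l, u = Fraction(1, 4), Fraction(3, 4)

same = vprs_regions(PARTITION, X, l, u) == prob_regions(PARTITION, X, 1 - l, 1 - u)
print(same)  # True
```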

Variable precision rough sets provide an alter- 
native direction in generalizing Pawlak rough sets 
by considering a graded set-inclusion relation, which 
is not necessarily restricted to a probabilistic in- 
terpretation. If we use other set-inclusion measures, 
we will obtain other types of quantitative rough 
sets [24.38,52]. Unfortunately, subsequent develop- 
ments lose this crucial feature in an attempt to unify 
variable precision rough sets into probabilistic rough 
sets [24.53]. 


24.4.2 Parameterized Rough Sets 


Parameterized rough sets, proposed by Greco et al. [24.19, 20], generalize probabilistic rough sets by introducing a Bayesian confirmation measure and a pair of thresholds on the confirmation measure, in addition to a pair of thresholds on conditional probability. According to Fitelson [24.54], measures of confirmation quantify the degree to which a piece of evidence $E$ provides evidence for or against, or support for or against, a hypothesis $H$.


A measure of confirmation of a piece of evidence $E$ with respect to a hypothesis $H$ is denoted by $c(E, H)$. A confirmation measure $c(E, H)$ is required to satisfy the following minimal property:


$$c(E, H) \begin{cases}
> 0 & \text{if } \Pr(H|E) > \Pr(H), \\
= 0 & \text{if } \Pr(H|E) = \Pr(H), \\
< 0 & \text{if } \Pr(H|E) < \Pr(H).
\end{cases}$$


Two well-known Bayesian confirmation measures are [24.55]


$$\begin{aligned}
c_d([x], X) &= \Pr(X|[x]) - \Pr(X), \\
c_r([x], X) &= \frac{\Pr(X|[x])}{\Pr(X)}.
\end{aligned} \tag{24.23}$$


These measures have a probabilistic interpretation. The parameterized rough sets can therefore be viewed as a different formulation of probabilistic rough sets.

A first discussion of the relationships between confirmation measures and rough sets was given by Greco et al. [24.56]. Other contributions on the properties of confirmation measures, with special attention to their application to rough sets, are given in [24.57].

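Given a pair of thresholds $(s, t)$ with $t < s$, three $(\alpha, \beta, s, t)$-parameterized regions are defined in [24.19, 20]


$$\begin{aligned}
\mathrm{PPOS}_{(\alpha,\beta,s,t)}(X) &= \{x \in U \mid \Pr(X|[x]) \ge \alpha \wedge c([x], X) \ge s\}, \\
\mathrm{PNEG}_{(\alpha,\beta,s,t)}(X) &= \{x \in U \mid \Pr(X|[x]) \le \beta \wedge c([x], X) \le t\}, \\
\mathrm{PBND}_{(\alpha,\beta,s,t)}(X) &= \{x \in U \mid (\Pr(X|[x]) > \beta \vee c([x], X) > t) \\
&\qquad\qquad \wedge\ (\Pr(X|[x]) < \alpha \vee c([x], X) < s)\}.
\end{aligned} \tag{24.24}$$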


There exist many Bayesian confirmation measures, 
which makes the model of parameterized rough sets 
more flexible. On the other hand, due to lack of 
a general agreement on a Bayesian confirmation mea- 
sure, choosing an appropriate confirmation measure for 
a particular application may not be an easy task. 


24.4.3 Confirmation-Theoretic Rough Sets 


Although many Bayesian confirmation measures are related to the conditional probability $\Pr(X|[x])$, Zhou and Yao [24.26] argued that the conditional probability $\Pr(X|[x])$ and a Bayesian confirmation measure have very different semantics and should be used for different purposes. For example, the conditional probability $\Pr(X|[x])$ gives us an absolute degree of confidence in classifying objects from $[x]$ as belonging to $X$. On the other hand, a Bayesian measure, for example, $c_d$ or $c_r$, normally reflects a change of confidence in $X$ before and after knowing $[x]$. Thus, a Bayesian confirmation measure is useful to weigh the strength of evidence $[x]$ with respect to the hypothesis $X$. A mixture of conditional probability and confirmation measure in the parameterized rough sets may cause a semantic difficulty in interpreting the three regions.

To resolve this difficulty, Zhou and Yao [24.28] suggested a separation of the parameterized model into two models. One is the conventional probabilistic model and the other is a confirmation-theoretic model. For a Bayesian confirmation measure $c([x], X)$ and a pair of thresholds $(s, t)$ with $t < s$, three confirmation regions are defined by


$$\begin{aligned}
\mathrm{CPOS}_{(s,t)}(X) &= \{[x] \in U/E_A \mid c([x], X) \ge s\}, \\
\mathrm{CNEG}_{(s,t)}(X) &= \{[x] \in U/E_A \mid c([x], X) \le t\}, \\
\mathrm{CBND}_{(s,t)}(X) &= \{[x] \in U/E_A \mid t < c([x], X) < s\}.
\end{aligned} \tag{24.25}$$

For the case with $s = t$, we define

$$\begin{aligned}
\mathrm{CPOS}_{(s,s)}(X) &= \{[x] \in U/E_A \mid c([x], X) > s\}, \\
\mathrm{CNEG}_{(s,s)}(X) &= \{[x] \in U/E_A \mid c([x], X) < s\}, \\
\mathrm{CBND}_{(s,s)}(X) &= \{[x] \in U/E_A \mid c([x], X) = s\}.
\end{aligned} \tag{24.26}$$


In the definition, each equivalence class may be viewed as a piece of evidence. Thus, the partition $U/E_A$, instead of the universe, is divided into three regions. An equivalence class in the positive region supports $X$ to a degree at least $s$, an equivalence class in the negative region supports $X$ to a degree at most $t$ and may be viewed as against $X$, and an equivalence class in the boundary region is interpreted as neutral toward $X$.
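
To make the difference from the probabilistic regions concrete, the sketch below divides a partition, rather than the universe, into three confirmation regions. It assumes the difference measure $c_d$ with relative-frequency estimates of the probabilities; the data, the thresholds, and the function names are hypothetical.

```python
from fractions import Fraction

def c_d(block, X, universe_size):
    """Difference measure c_d([x], X) = Pr(X|[x]) - Pr(X), using frequency estimates."""
    return Fraction(len(block & X), len(block)) - Fraction(len(X), universe_size)

def confirmation_regions(partition, X, s, t):
    """Divide the partition into CPOS, CNEG, CBND following (24.25), assuming t < s."""
    n = sum(len(b) for b in partition)   # |U|, since the blocks are disjoint
    cpos = [b for b in partition if c_d(b, X, n) >= s]
    cneg = [b for b in partition if c_d(b, X, n) <= t]
    cbnd = [b for b in partition if t < c_d(b, X, n) < s]
    return cpos, cneg, cbnd

# Hypothetical partition and target set with Pr(X) = 1/2; the blocks have
# c_d values 1/4, 0, and -1/4, respectively.
PARTITION = [{1, 2, 3, 4}, {5, 6}, {7, 8, 9, 10}]
X = {1, 2, 3, 5, 7}

cpos, cneg, cbnd = confirmation_regions(PARTITION, X, Fraction(1, 5), Fraction(-1, 5))
print(cpos, cneg, cbnd)
```

The second block neither raises nor lowers the confidence in X, so it is classified as neutral evidence.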


24.4.4 Bayesian Rough Sets


Bayesian rough sets were proposed by Ślęzak and Ziarko [24.15, 16] as a probabilistic model in which the required pair of thresholds is interpreted using the a priori probability $\Pr(X)$. They introduced Bayesian rough sets and variable precision Bayesian rough sets.

For the Bayesian rough sets, the three regions are defined by


$$\begin{aligned}
\mathrm{BPOS}(X) &= \{x \in U \mid \Pr(X|[x]) > \Pr(X)\}, \\
\mathrm{BNEG}(X) &= \{x \in U \mid \Pr(X|[x]) < \Pr(X)\}, \\
\mathrm{BBND}(X) &= \{x \in U \mid \Pr(X|[x]) = \Pr(X)\}.
\end{aligned} \tag{24.27}$$


Bayesian rough sets can be viewed as a special case of the decision-theoretic rough sets when $\alpha = \beta = \Pr(X)$. Semantically, they are very different, however. In contrast to decision-theoretic rough sets, for a set with a higher a priori probability $\Pr(X)$, many equivalence classes may not be put into the positive region in the Bayesian rough set model, as the condition $\Pr(X|[x]) > \Pr(X)$ may not hold. For example, the positive region of the entire universe is always empty, namely, $\mathrm{BPOS}(U) = \emptyset$. This leads to a difficulty in interpreting the positive region as a lower approximation of a set.

The difficulty with Bayesian rough sets stems from the fact that they are in fact a special model of confirmation-theoretic rough sets, which is suitable for classifying pieces of evidence (i.e., equivalence classes), but is inappropriate for approximating a set. Recall that one Bayesian confirmation measure is given by $c_d([x], X) = \Pr(X|[x]) - \Pr(X)$. Therefore, Bayesian rough sets can be expressed as confirmation-theoretic rough sets as follows,


$$\begin{aligned}
\mathrm{BPOS}(X) &= \{x \in U \mid \Pr(X|[x]) > \Pr(X)\} \\
&= \{x \in U \mid c_d([x], X) > 0\} \\
&= \bigcup \mathrm{CPOS}_{(0,0)}(X), \\
\mathrm{BNEG}(X) &= \{x \in U \mid \Pr(X|[x]) < \Pr(X)\} \\
&= \{x \in U \mid c_d([x], X) < 0\} \\
&= \bigcup \mathrm{CNEG}_{(0,0)}(X), \\
\mathrm{BBND}(X) &= \{x \in U \mid \Pr(X|[x]) = \Pr(X)\} \\
&= \{x \in U \mid c_d([x], X) = 0\} \\
&= \bigcup \mathrm{CBND}_{(0,0)}(X).
\end{aligned} \tag{24.28}$$


That is, the Bayesian rough sets are a model of confirmation-theoretic rough sets characterized by the Bayesian confirmation measure $c_d$ with a pair of thresholds $s = t = 0$. Ślęzak and Ziarko [24.17] showed that Bayesian rough sets can also be interpreted by using other Bayesian confirmation measures.

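The three regions of the variable precision Bayesian rough sets are defined as follows [24.16]: for $\varepsilon \in [0, 1)$


$$\begin{aligned}
\mathrm{VBPOS}_\varepsilon(X) &= \{x \in U \mid \Pr(X|[x]) \ge 1 - \varepsilon(1 - \Pr(X))\}, \\
\mathrm{VBNEG}_\varepsilon(X) &= \{x \in U \mid \Pr(X|[x]) \le \varepsilon \Pr(X)\}, \\
\mathrm{VBBND}_\varepsilon(X) &= \{x \in U \mid \varepsilon \Pr(X) < \Pr(X|[x]) < 1 - \varepsilon(1 - \Pr(X))\}.
\end{aligned} \tag{24.29}$$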


Consider the Bayesian confirmation measure $c_r([x], X) = \Pr(X|[x])/\Pr(X)$.


For the condition of the positive region, when $\Pr(X^c) \neq 0$ we have


$$\Pr(X|[x]) \ge 1 - \varepsilon(1 - \Pr(X)) \iff c_r([x], X^c) \le \varepsilon, \tag{24.30}$$


which follows by rewriting the left-hand side as $\Pr(X^c|[x]) \le \varepsilon \Pr(X^c)$. Similarly, for the condition defining the negative region, when $\Pr(X) \neq 0$ we have

$$\Pr(X|[x]) \le \varepsilon \Pr(X) \iff c_r([x], X) \le \varepsilon. \tag{24.31}$$

That is, $[x]$ is put into the positive region if it confirms $X^c$ to a degree less than or equal to $\varepsilon$ and is put into the negative region if it confirms $X$ to a degree less than or equal to $\varepsilon$. In this way, we get a confirmation-theoretic interpretation of variable precision Bayesian rough sets.

Unlike the confirmation-theoretic model defined 
by (24.25), the positive region of variable precision 
Bayesian rough sets is defined based on the confirmation
of the complement of $X$, and the negative region is
defined based on the confirmation of $X$. This definition
is a bit awkward to interpret. Generally speaking, it may 
be more natural to define the positive region by those 
equivalence classes that confirm X to at least a certain 
degree. This suggests that one can redefine variable pre- 
cision Bayesian rough sets by using the framework of 
confirmation-theoretic rough sets. Moreover, one can 
use a pair of thresholds instead of one threshold. 


24.5 Three Fundamental Issues of Probabilistic Rough Sets 


For practical applications of probabilistic rough sets, 
one must consider at least the following three funda- 
mental issues [24.58, 59]: 


• Interpretation and determination of the required pair of thresholds,

• Estimation of the required conditional probabilities, and

• Interpretation and applications of the three probabilistic regions.


For each of the three issues, this section reviews one 
example of the possible methods. 


24.5.1 Decision-Theoretic Rough Set Model: Determining the Thresholds


A decision-theoretic model formulates the construction of rough set approximations as a Bayesian decision problem with a set of two states and a set of three actions [24.11, 12]. The set of states is given by $\Omega = \{X, X^c\}$, indicating that an element is in $X$ and not in $X$, respectively. For simplicity, we use the same symbol to denote both a subset $X$ and the corresponding state. Corresponding to the three regions, the set of actions is given by $\mathcal{A} = \{a_P, a_B, a_N\}$, denoting the actions in classifying an object $x$, namely, deciding $x \in \mathrm{POS}(X)$, deciding $x \in \mathrm{BND}(X)$, and deciding $x \in \mathrm{NEG}(X)$, respectively. The losses regarding the actions for different states are given by the $3 \times 2$ matrix


$$\begin{array}{c|cc}
 & X\ (P) & X^c\ (N) \\ \hline
a_P & \lambda_{PP} & \lambda_{PN} \\
a_B & \lambda_{BP} & \lambda_{BN} \\
a_N & \lambda_{NP} & \lambda_{NN}
\end{array}$$


In the matrix, $\lambda_{PP}$, $\lambda_{BP}$, and $\lambda_{NP}$ denote the losses incurred for taking actions $a_P$, $a_B$, and $a_N$, respectively, when an object belongs to $X$, and $\lambda_{PN}$, $\lambda_{BN}$, and $\lambda_{NN}$ denote the losses incurred for taking the same actions when the object does not belong to $X$.

The expected losses associated with taking different actions for objects in $[x]$ can be expressed as


$$\begin{aligned}
R(a_P|[x]) &= \lambda_{PP}\Pr(X|[x]) + \lambda_{PN}\Pr(X^c|[x]), \\
R(a_B|[x]) &= \lambda_{BP}\Pr(X|[x]) + \lambda_{BN}\Pr(X^c|[x]), \\
R(a_N|[x]) &= \lambda_{NP}\Pr(X|[x]) + \lambda_{NN}\Pr(X^c|[x]).
\end{aligned} \tag{24.32}$$

The Bayesian decision procedure suggests the following minimum-risk decision rules


(P) If $R(a_P|[x]) \le R(a_B|[x])$ and $R(a_P|[x]) \le R(a_N|[x])$, decide $x \in \mathrm{POS}(X)$;
(B) If $R(a_B|[x]) \le R(a_P|[x])$ and $R(a_B|[x]) \le R(a_N|[x])$, decide $x \in \mathrm{BND}(X)$;
(N) If $R(a_N|[x]) \le R(a_P|[x])$ and $R(a_N|[x]) \le R(a_B|[x])$, decide $x \in \mathrm{NEG}(X)$.


In order to make sure that the three regions are mutually disjoint, tie-breaking criteria should be added when two or three actions have the same risk. We use the following ordering for breaking a tie: $a_P$, $a_N$, $a_B$.
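
The minimum-risk rules with tie-breaking can be sketched as follows. The loss values are hypothetical and chosen only for illustration; the function names `expected_losses` and `decide` are ours.

```python
def expected_losses(p, loss):
    """Expected losses R(a|[x]) of (24.32) for p = Pr(X|[x]).

    loss[action] = (loss when the object is in X, loss when it is not in X).
    """
    return {a: lam_in * p + lam_out * (1 - p) for a, (lam_in, lam_out) in loss.items()}

def decide(p, loss, tie_order=("P", "N", "B")):
    """Return the minimum-risk action; ties are broken in the order a_P, a_N, a_B.

    Python's min() returns the first minimal item, so iterating over
    tie_order implements the tie-breaking rule.
    """
    risk = expected_losses(p, loss)
    return min(tie_order, key=lambda a: risk[a])

# Hypothetical loss function: (lambda_PP, lambda_PN), (lambda_BP, lambda_BN), (lambda_NP, lambda_NN).
LOSS = {"P": (0, 4), "B": (1, 1), "N": (4, 0)}

print(decide(0.9, LOSS))   # P
print(decide(0.5, LOSS))   # B
print(decide(0.1, LOSS))   # N
print(decide(0.75, LOSS))  # P  (a_P and a_B tie with risk 1.0; a_P wins)
```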

Consider a special class of loss functions with


$$(\mathrm{c0}) \quad \lambda_{PP} \le \lambda_{BP} < \lambda_{NP}, \qquad \lambda_{NN} \le \lambda_{BN} < \lambda_{PN}. \tag{24.33}$$


That is, the loss of classifying an object $x$ belonging to $X$ into the positive region $\mathrm{POS}(X)$ is less than or equal to the loss of classifying $x$ into the boundary region $\mathrm{BND}(X)$, and both of these losses are strictly less than the loss of classifying $x$ into the negative region $\mathrm{NEG}(X)$. The reverse order of losses is used for classifying an object not in $X$. With the condition (c0) and the equation $\Pr(X|[x]) + \Pr(X^c|[x]) = 1$, we can express the decision rules (P)-(N) in the following simplified form (for a detailed derivation, see [24.58])


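(P) If $\Pr(X|[x]) \ge \alpha$ and $\Pr(X|[x]) \ge \gamma$, decide $x \in \mathrm{POS}(X)$;
(B) If $\Pr(X|[x]) \le \alpha$ and $\Pr(X|[x]) \ge \beta$, decide $x \in \mathrm{BND}(X)$;
(N) If $\Pr(X|[x]) \le \beta$ and $\Pr(X|[x]) \le \gamma$, decide $x \in \mathrm{NEG}(X)$,


where

$$\begin{aligned}
\alpha &= \frac{\lambda_{PN} - \lambda_{BN}}{(\lambda_{PN} - \lambda_{BN}) + (\lambda_{BP} - \lambda_{PP})}, \\
\beta &= \frac{\lambda_{BN} - \lambda_{NN}}{(\lambda_{BN} - \lambda_{NN}) + (\lambda_{NP} - \lambda_{BP})}, \\
\gamma &= \frac{\lambda_{PN} - \lambda_{NN}}{(\lambda_{PN} - \lambda_{NN}) + (\lambda_{NP} - \lambda_{PP})}.
\end{aligned} \tag{24.34}$$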

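Each rule is defined by two out of the three parameters. By setting $\alpha > \beta$, namely


$$\frac{\lambda_{PN} - \lambda_{BN}}{(\lambda_{PN} - \lambda_{BN}) + (\lambda_{BP} - \lambda_{PP})} > \frac{\lambda_{BN} - \lambda_{NN}}{(\lambda_{BN} - \lambda_{NN}) + (\lambda_{NP} - \lambda_{BP})}, \tag{24.35}$$


we obtain the following condition on the loss function [24.58]


$$(\mathrm{c1}) \quad \frac{\lambda_{NP} - \lambda_{BP}}{\lambda_{BN} - \lambda_{NN}} > \frac{\lambda_{BP} - \lambda_{PP}}{\lambda_{PN} - \lambda_{BN}}. \tag{24.36}$$


The condition (c1) implies that $1 \ge \alpha > \gamma > \beta \ge 0$. In this case, after tie-breaking, we have the simplified rules [24.58]


(P) If $\Pr(X|[x]) \ge \alpha$, decide $x \in \mathrm{POS}(X)$;
(B) If $\beta < \Pr(X|[x]) < \alpha$, decide $x \in \mathrm{BND}(X)$;
(N) If $\Pr(X|[x]) \le \beta$, decide $x \in \mathrm{NEG}(X)$.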


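The parameter $\gamma$ is no longer needed. Each object can be put into one and only one region by using rules (P), (B), and (N). The $(\alpha, \beta)$-probabilistic positive, negative, and boundary regions are given, respectively, by


$$\begin{aligned}
\mathrm{POS}_{(\alpha,\beta)}(X) &= \{x \in U \mid \Pr(X|[x]) \ge \alpha\}, \\
\mathrm{BND}_{(\alpha,\beta)}(X) &= \{x \in U \mid \beta < \Pr(X|[x]) < \alpha\}, \\
\mathrm{NEG}_{(\alpha,\beta)}(X) &= \{x \in U \mid \Pr(X|[x]) \le \beta\}.
\end{aligned} \tag{24.37}$$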


The formulation provides a solid theoretical basis and 
a practical interpretation of the probabilistic rough sets. 
The threshold parameters are systematically calculated 
from a loss function. 
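
As an illustration, the thresholds of (24.34) can be computed from a loss function as sketched below. The six loss values are hypothetical, and integer (or rational) losses satisfying condition (c0) are assumed so that exact fractions can be used.

```python
from fractions import Fraction

def thresholds(l_pp, l_bp, l_np, l_pn, l_bn, l_nn):
    """Return (alpha, beta, gamma) of (24.34) for losses satisfying (c0)."""
    alpha = Fraction(l_pn - l_bn, (l_pn - l_bn) + (l_bp - l_pp))
    beta = Fraction(l_bn - l_nn, (l_bn - l_nn) + (l_np - l_bp))
    gamma = Fraction(l_pn - l_nn, (l_pn - l_nn) + (l_np - l_pp))
    return alpha, beta, gamma

def region(p, alpha, beta):
    """Simplified rules (P), (B), (N), valid when alpha > beta."""
    if p >= alpha:
        return "POS"
    if p <= beta:
        return "NEG"
    return "BND"

# Hypothetical losses: lambda_PP=0, lambda_BP=1, lambda_NP=4, lambda_PN=4, lambda_BN=1, lambda_NN=0.
alpha, beta, gamma = thresholds(0, 1, 4, 4, 1, 0)
print(alpha, beta, gamma)  # 3/4 1/4 1/2
print(region(Fraction(2, 3), alpha, beta))  # BND
```

For these losses the ordering $1 \ge \alpha > \gamma > \beta \ge 0$ holds, so $\gamma$ plays no role in the final rules.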

In the development of decision-theoretic rough sets, 
we assume that a loss function is given by experts in 
a particular application. There are studies on other types 
of loss functions and their acquisition [24.60]. Several 
other proposals have also been made regarding the in- 
terpretation and computation of the thresholds, includ- 
ing game-theoretic rough sets [24.21, 22], information- 
theoretic rough sets [24.27], and an optimization-based 
framework [24.43, 61, 62]. 


24.5.2 Naive Bayesian Rough Set Model: Estimating the Conditional Probability


The naive Bayesian rough set model was proposed by Yao and Zhou [24.59] as a practical method for estimating the conditional probability. First, we perform the logit transformation of the conditional probability


$$\operatorname{logit}(\Pr(X|[x])) = \log \frac{\Pr(X|[x])}{1 - \Pr(X|[x])} = \log \frac{\Pr(X|[x])}{\Pr(X^c|[x])}, \tag{24.38}$$


which is a monotonically increasing transformation of $\Pr(X|[x])$. Then, we apply Bayes' theorem


$$\Pr(X|[x]) = \frac{\Pr([x]|X)\Pr(X)}{\Pr([x])} \tag{24.39}$$

to infer the a posteriori probability $\Pr(X|[x])$ from the likelihood $\Pr([x]|X)$ of $[x]$ with respect to $X$ and the a priori probability $\Pr(X)$. Similarly, for $X^c$ we also have


$$\Pr(X^c|[x]) = \frac{\Pr([x]|X^c)\Pr(X^c)}{\Pr([x])}. \tag{24.40}$$


By substituting the results of (24.39) and (24.40) into (24.38), we immediately have


$$\begin{aligned}
\operatorname{logit}(\Pr(X|[x])) &= \log O(X|[x]) \\
&= \log \frac{\Pr(X|[x])}{\Pr(X^c|[x])} \\
&= \log \left( \frac{\Pr([x]|X)}{\Pr([x]|X^c)} \cdot \frac{\Pr(X)}{\Pr(X^c)} \right) \\
&= \log \frac{\Pr([x]|X)}{\Pr([x]|X^c)} + \log O(X),
\end{aligned} \tag{24.41}$$


where $O(X|[x])$ and $O(X)$ are the a posteriori and the a priori odds, respectively, and $\Pr([x]|X)/\Pr([x]|X^c)$ is the likelihood ratio.

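A threshold value on the probability can be expressed as another threshold value on the logarithm of the likelihood ratio. For the positive region, we have


$$\begin{aligned}
\Pr(X|[x]) \ge \alpha
&\iff \log \frac{\Pr(X|[x])}{\Pr(X^c|[x])} \ge \log \frac{\alpha}{1 - \alpha} \\
&\iff \log \frac{\Pr([x]|X)}{\Pr([x]|X^c)} + \log \frac{\Pr(X)}{\Pr(X^c)} \ge \log \frac{\alpha}{1 - \alpha} \\
&\iff \log \frac{\Pr([x]|X)}{\Pr([x]|X^c)} \ge \log \frac{\Pr(X^c)}{\Pr(X)} + \log \frac{\alpha}{1 - \alpha} = \alpha'.
\end{aligned} \tag{24.42}$$


Similar expressions can be obtained for the negative and boundary regions. The three regions can now be written as


$$\begin{aligned}
\mathrm{POS}_{(\alpha,\beta)}(X) &= \left\{x \in U \,\middle|\, \log \frac{\Pr([x]|X)}{\Pr([x]|X^c)} \ge \alpha'\right\}, \\
\mathrm{BND}_{(\alpha,\beta)}(X) &= \left\{x \in U \,\middle|\, \beta' < \log \frac{\Pr([x]|X)}{\Pr([x]|X^c)} < \alpha'\right\}, \\
\mathrm{NEG}_{(\alpha,\beta)}(X) &= \left\{x \in U \,\middle|\, \log \frac{\Pr([x]|X)}{\Pr([x]|X^c)} \le \beta'\right\},
\end{aligned} \tag{24.43}$$

where

$$\begin{aligned}
\alpha' &= \log \frac{\Pr(X^c)}{\Pr(X)} + \log \frac{\alpha}{1 - \alpha}, \\
\beta' &= \log \frac{\Pr(X^c)}{\Pr(X)} + \log \frac{\beta}{1 - \beta}.
\end{aligned} \tag{24.44}$$


With the transformation, we need to estimate the likelihoods, which are relatively easier to obtain.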

Suppose that an equivalence relation $E_A$ is defined by using a subset of attributes $A \subseteq AT$. In the naive Bayesian rough set model, we estimate the likelihood ratio $\Pr([x]_A|X)/\Pr([x]_A|X^c)$ through the likelihoods $\Pr([x]_a|X)$ and $\Pr([x]_a|X^c)$ defined by individual attributes, as the latter can be estimated more accurately. For this purpose, based on the results in (24.6), we make the following naive conditional independence assumptions


$$\begin{aligned}
\Pr([x]_A|X) &= \Pr\Bigl(\bigcap_{a \in A} [x]_a \Bigm| X\Bigr) = \prod_{a \in A} \Pr([x]_a|X), \\
\Pr([x]_A|X^c) &= \Pr\Bigl(\bigcap_{a \in A} [x]_a \Bigm| X^c\Bigr) = \prod_{a \in A} \Pr([x]_a|X^c).
\end{aligned} \tag{24.45}$$


By inserting them into (24.42) and assuming that $[x]$ is defined by a subset of attributes $A \subseteq AT$, namely, $[x] = [x]_A$, we have


$$\begin{aligned}
\log \frac{\Pr([x]_A|X)}{\Pr([x]_A|X^c)} \ge \alpha'
&\iff \log \frac{\prod_{a \in A} \Pr([x]_a|X)}{\prod_{a \in A} \Pr([x]_a|X^c)} \ge \alpha' \\
&\iff \sum_{a \in A} \log \frac{\Pr([x]_a|X)}{\Pr([x]_a|X^c)} \ge \alpha'.
\end{aligned} \tag{24.46}$$


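Similar conditions can be derived for the negative and boundary regions. Finally, the three regions can be defined as


$$\begin{aligned}
\mathrm{POS}_{(\alpha,\beta)}(X) &= \left\{x \in U \,\middle|\, \sum_{a \in A} \log \frac{\Pr([x]_a|X)}{\Pr([x]_a|X^c)} \ge \alpha'\right\}, \\
\mathrm{BND}_{(\alpha,\beta)}(X) &= \left\{x \in U \,\middle|\, \beta' < \sum_{a \in A} \log \frac{\Pr([x]_a|X)}{\Pr([x]_a|X^c)} < \alpha'\right\}, \\
\mathrm{NEG}_{(\alpha,\beta)}(X) &= \left\{x \in U \,\middle|\, \sum_{a \in A} \log \frac{\Pr([x]_a|X)}{\Pr([x]_a|X^c)} \le \beta'\right\},
\end{aligned} \tag{24.47}$$

where

$$\begin{aligned}
\alpha' &= \log \frac{\Pr(X^c)}{\Pr(X)} + \log \frac{\alpha}{1 - \alpha}, \\
\beta' &= \log \frac{\Pr(X^c)}{\Pr(X)} + \log \frac{\beta}{1 - \beta}.
\end{aligned} \tag{24.48}$$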


We obtain a model in which we only need to estimate 
likelihoods of equivalence classes induced by individ- 
ual attributes. 

The likelihoods $\Pr([x]_a|X)$ and $\Pr([x]_a|X^c)$ may be simply estimated based on the following frequencies


$$\begin{aligned}
\Pr([x]_a|X) &= \frac{|[x]_a \cap X|}{|X|}, \\
\Pr([x]_a|X^c) &= \frac{|[x]_a \cap X^c|}{|X^c|},
\end{aligned}$$

where $[x]_a = \{y \in U \mid I_a(y) = I_a(x)\}$. An equivalence class defined by a single attribute is usually large in comparison with an equivalence class defined by a subset of attributes. Probability estimation based on the former may be more accurate than based on the latter.
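
The estimation procedure can be sketched end to end on a toy table. The data below are hypothetical, and the sketch uses the plain frequency estimates without any smoothing, so it assumes that every single-attribute class intersects both $X$ and $X^c$ and that $0 < \beta < \alpha < 1$.

```python
import math

# Hypothetical information table and target set X (illustrative data only).
TABLE = {
    "o1": {"fever": "yes", "cough": "yes"},
    "o2": {"fever": "yes", "cough": "yes"},
    "o3": {"fever": "yes", "cough": "no"},
    "o4": {"fever": "no", "cough": "yes"},
    "o5": {"fever": "yes", "cough": "yes"},
    "o6": {"fever": "no", "cough": "no"},
    "o7": {"fever": "no", "cough": "no"},
    "o8": {"fever": "no", "cough": "yes"},
}
X = {"o1", "o2", "o3", "o4"}
ATTRS = ["fever", "cough"]

def log_ratio_sum(x):
    """Sum over a in A of log Pr([x]_a|X) / Pr([x]_a|X^c), as in (24.46)."""
    Xc = set(TABLE) - X
    total = 0.0
    for a in ATTRS:
        block = {y for y in TABLE if TABLE[y][a] == TABLE[x][a]}   # [x]_a
        pr_x = len(block & X) / len(X)
        pr_xc = len(block & Xc) / len(Xc)
        total += math.log(pr_x / pr_xc)   # fails on zero counts: no smoothing here
    return total

def shifted(threshold):
    """alpha' or beta' of (24.48): log prior odds of X^c plus the logit of the threshold."""
    prior_x = len(X) / len(TABLE)
    return math.log((1 - prior_x) / prior_x) + math.log(threshold / (1 - threshold))

def naive_bayes_regions(alpha, beta):
    a_prime, b_prime = shifted(alpha), shifted(beta)
    pos = {x for x in TABLE if log_ratio_sum(x) >= a_prime}
    neg = {x for x in TABLE if log_ratio_sum(x) <= b_prime}
    return pos, neg, set(TABLE) - pos - neg

pos, neg, bnd = naive_bayes_regions(alpha=0.75, beta=0.25)
print(sorted(pos), sorted(neg), sorted(bnd))
```

In practice zero counts are common, so some form of smoothing would usually be needed; that choice is outside the scope of this sketch.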

Naive Bayesian rough sets provide only one of the possible ways to estimate the conditional probability. Other estimation methods include logistic regression [24.46] and maximum likelihood estimators [24.63].


24.5.3 Three-Way Decisions: Interpreting the Three Regions


A theory of three-way decisions [24.64] is motivated by 
the needs for interpreting the three regions [24.65-67] 


and moves beyond rough sets. The main results of three- 
way decisions can be found in two recent books edited 
by Jia et al. [24.68] and Liu et al. [24.69], respectively. 
We present an interpretation of rough set three regions 
based on the framework of three-way decisions. 

In an information table, with respect to a subset of attributes $A \subseteq AT$, an object $x$ induces a logic formula


$$\bigwedge_{a \in A} a = I_a(x), \tag{24.49}$$


where $I_a(x) \in V_a$ and the atomic formula $a = I_a(x)$ indicates that the value of an object on attribute $a$ is $I_a(x)$. An object $y$ satisfies the formula if $I_a(y) = I_a(x)$ for all $a \in A$, that is


$$y \models \bigwedge_{a \in A} a = I_a(x) \iff \forall a \in A\ (I_a(y) = I_a(x)). \tag{24.50}$$


With these notations, we are ready to interpret the three rough set regions.

From the three regions, we can construct three 
classes of rules for classifying an object, called the pos- 
itive, negative, and boundary rules [24.58, 66, 67]. They 
are expressed in the following forms, for y € U: 


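• Positive rule induced by an equivalence class $[x] \subseteq \mathrm{POS}_{(\alpha,\beta)}(X)$:

  if $y \models \bigwedge_{a \in A} a = I_a(x)$, accept $y \in X$

• Negative rule induced by an equivalence class $[x] \subseteq \mathrm{NEG}_{(\alpha,\beta)}(X)$:

  if $y \models \bigwedge_{a \in A} a = I_a(x)$, reject $y \in X$

• Boundary rule induced by an equivalence class $[x] \subseteq \mathrm{BND}_{(\alpha,\beta)}(X)$:

  if $y \models \bigwedge_{a \in A} a = I_a(x)$, neither accept nor reject $y \in X$.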


The three types of rules have very different semantic interpretations as defined by their respective decisions. A positive rule allows us to accept an object $y$ to be a member of $X$, because $y$ has a higher probability of being in $X$ due to the facts that $y \in [x]_A$ and $\Pr(X|[x]_A) \ge \alpha$. A negative rule enables us to reject an object $y$ to be a member of $X$, because $y$ has a lower probability of being in $X$ due to the facts that $y \in [x]_A$ and $\Pr(X|[x]_A) \le \beta$. When the probability of $y$ being in $X$ is neither high nor low, a boundary rule makes a noncommitment decision. Although we explicitly give the class of boundary rules for convenience and completeness, we do not really need this class once we have both classes of positive and negative rules. Whenever we can neither accept nor reject an object to be a member of $X$, we choose a noncommitment decision.
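
The three classes of rules can be sketched as a lookup from object descriptions to decisions. The table, the thresholds, and the labels "accept", "reject", and "defer" are hypothetical, and the conditional probability is again estimated by relative frequency.

```python
# Hypothetical information table and target set X (illustrative data only).
TABLE = {
    "o1": {"fever": "yes", "cough": "yes"},
    "o2": {"fever": "yes", "cough": "yes"},
    "o3": {"fever": "yes", "cough": "no"},
    "o4": {"fever": "no", "cough": "yes"},
    "o5": {"fever": "yes", "cough": "yes"},
    "o6": {"fever": "no", "cough": "no"},
    "o7": {"fever": "no", "cough": "no"},
    "o8": {"fever": "no", "cough": "yes"},
}
X = {"o1", "o2", "o3", "o4"}
ATTRS = ["fever", "cough"]

def build_rules(table, attrs, X, alpha, beta):
    """Map each description (tuple of attribute values) to a three-way decision."""
    blocks = {}
    for y, row in table.items():
        blocks.setdefault(tuple(row[a] for a in attrs), set()).add(y)
    rules = {}
    for desc, block in blocks.items():
        p = len(block & X) / len(block)          # frequency estimate of Pr(X|[x]_A)
        rules[desc] = "accept" if p >= alpha else "reject" if p <= beta else "defer"
    return rules

def classify(rules, attrs, row):
    """Apply the rule matching the description of a new object.

    Descriptions never seen in the table get the noncommitment decision,
    which is a design choice of this sketch.
    """
    return rules.get(tuple(row[a] for a in attrs), "defer")

rules = build_rules(TABLE, ATTRS, X, alpha=0.6, beta=0.2)
print(rules)
print(classify(rules, ATTRS, {"fever": "no", "cough": "yes"}))  # defer
```

For the description (yes, yes) the estimated probability is 2/3, so the induced positive rule has an error rate of 1/3, which is below $1 - \alpha = 0.4$.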

Both actions of acceptance and rejection are associated
with errors and costs. The error rate of a positive
rule is given by $1 - \Pr(X \mid [x]_A)$, which, by definition of
the three regions, is at or below $1 - \alpha$. The error rate of
a negative rule is given by $\Pr(X \mid [x]_A)$ and is at or below
$\beta$. It becomes clear that the introduction of a
noncommitment decision is to ensure both a low level of
acceptance error and a low level of rejection error.
According to the 3 × 2 table in Sect. 24.5.1, the cost of
a positive rule is $\lambda_{PP}\Pr(X \mid [x]_A) + \lambda_{PN}(1 - \Pr(X \mid [x]_A))$
and is bounded above by $\alpha\lambda_{PP} + (1 - \alpha)\lambda_{PN}$. The cost of
a negative rule is $\lambda_{NP}\Pr(X \mid [x]_A) + \lambda_{NN}(1 - \Pr(X \mid [x]_A))$
and is bounded above by $\beta\lambda_{NP} + (1 - \beta)\lambda_{NN}$. From the
viewpoint of cost, a noncommitment decision is preferred
if its cost is less than that of an action of acceptance or
rejection.
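
The error and cost bounds above can be checked numerically. The following is a minimal sketch, not part of the original formulation: the function names, the thresholds, the loss values, and the probability p are all hypothetical and chosen only for illustration.

```python
def three_way_decision(p, alpha, beta):
    """Classify by p = Pr(X | [x]): accept, reject, or make no commitment."""
    if p >= alpha:
        return "accept"
    if p <= beta:
        return "reject"
    return "noncommit"


def expected_cost(p, loss_if_in_X, loss_if_not_in_X):
    """Expected cost of an action when the object is in X with probability p."""
    return loss_if_in_X * p + loss_if_not_in_X * (1 - p)


# Hypothetical thresholds and losses, chosen only for illustration.
alpha, beta = 0.75, 0.25
lam_PP, lam_PN = 0.0, 4.0  # accept when the object is in X / not in X
lam_NP, lam_NN = 4.0, 0.0  # reject when the object is in X / not in X

p = 0.875  # assumed value of Pr(X | [x]) for some equivalence class
decision = three_way_decision(p, alpha, beta)
accept_error = 1 - p                                        # error rate of the positive rule
accept_cost = expected_cost(p, lam_PP, lam_PN)              # cost of the positive rule
accept_cost_bound = alpha * lam_PP + (1 - alpha) * lam_PN   # bound stated in the text
```

With these assumed values the object is accepted, the error rate stays at or below $1 - \alpha$, and the cost stays at or below the stated bound.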


24.6 Dominance-Based Rough Set Approaches 


Very often the value sets $V_a$ of some attributes $a \in AT$
are ordered, in the sense that it is meaningful to consider
a binary relation $\succeq_a$ on $V_a$ such that, for $x, y \in U$,
$I_a(x) \succeq_a I_a(y)$ means that x possesses some property
related to attribute a at least as much as y. In this case,
it is natural to consider $\succeq_a$ as a complete preorder on $V_a$,
i.e., a transitive and strongly complete binary relation
on $V_a$ (let us remember that strong completeness means
that for all $v_a, u_a \in V_a$ we have $v_a \succeq_a u_a$ or $u_a \succeq_a v_a$,
and that this implies the reflexivity of $\succeq_a$). Observe
that the binary relation $\succeq_a^U$ on U, defined as $x \succeq_a^U y$ if
$I_a(x) \succeq_a I_a(y)$ for all $x, y \in U$, is a complete preorder.
The first type of properties considered in this perspective
were preferences encountered in Multiple Criteria
Decision Aiding (MCDA) (for a comprehensive collection
of state-of-the-art surveys see [24.70]), where for
$x, y \in U$, $I_a(x) \succeq_a I_a(y)$ means that x is at least as good as
y with respect to attribute a, which in this case is called
a criterion. If there are attributes $a \in AT$ related to some
complete preorder $\succeq_a$, then the indiscernibility relation
is unable to produce granules in U that take into account
the order generated by $\succeq_a$. To do so, the indiscernibility
relation has to be substituted by a new binary
relation on U that, using a term coming from MCDA, is
called the dominance relation. Suppose, for simplicity, that all
attributes a from AT are criteria related to corresponding
complete preorders $\succeq_a$.

We say that x dominates y with respect to $A \subseteq AT$
(shortly, x A-dominates y), denoted by $x \succeq_A y$, if
$I_a(x) \succeq_a I_a(y)$ for all $a \in A$. Since $\succeq_a^U$ is a complete
preorder on U for each $a \in AT$, $\succeq_A$ is a partial preorder on
U, i.e., $\succeq_A$ is a reflexive and transitive binary relation
on U.


For any $x \in U$ and for each nonempty $A \subseteq AT$, we
can define a positive and a negative cone of dominance,
denoted by $D_A^+(x)$ and $D_A^-(x)$, respectively,

$$\begin{aligned}
D_A^+(x) &= \{y \in U \mid y \succeq_A x\} ,\\
D_A^-(x) &= \{y \in U \mid x \succeq_A y\} .
\end{aligned} \tag{24.51}$$

For simplicity, we also simply write $D^+(x)$ and $D^-(x)$
when no confusion arises.

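Let us explain how the rough set concept has
been generalized to the dominance-based rough set
approach (DRSA) in order to enable granular computing
with dominance cones (for more details, see Chap. 22
and [24.31-34, 71-74]).

For any $X \subseteq U$ we define upward lower and upper
approximations $\underline{\mathrm{apr}}^+(X)$ and $\overline{\mathrm{apr}}^+(X)$, as well as
downward lower and upper approximations $\underline{\mathrm{apr}}^-(X)$
and $\overline{\mathrm{apr}}^-(X)$, as follows

$$\begin{aligned}
\underline{\mathrm{apr}}^+(X) &= \{x \in U \mid D^+(x) \subseteq X\} ,\\
\overline{\mathrm{apr}}^+(X) &= \{x \in U \mid D^-(x) \cap X \ne \emptyset\} ,\\
\underline{\mathrm{apr}}^-(X) &= \{x \in U \mid D^-(x) \subseteq X\} ,\\
\overline{\mathrm{apr}}^-(X) &= \{x \in U \mid D^+(x) \cap X \ne \emptyset\} .
\end{aligned} \tag{24.52}$$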


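For any $X \subseteq U$, using the cones of dominance $D^+(x)$ and
$D^-(x)$, we can define three upward pairwise disjoint
positive, negative, and boundary regions

$$\begin{aligned}
\mathrm{POS}^+(X) &= \{x \in U \mid D^+(x) \subseteq X\} ,\\
\mathrm{NEG}^+(X) &= \{x \in U \mid D^-(x) \cap X = \emptyset\} ,\\
\mathrm{BND}^+(X) &= \{x \in U \mid D^+(x) \not\subseteq X \text{ and } D^-(x) \cap X \ne \emptyset\} .
\end{aligned} \tag{24.53}$$

Analogously, for any $X \subseteq U$, we can define three
downward pairwise disjoint positive, negative, and
boundary regions

$$\begin{aligned}
\mathrm{POS}^-(X) &= \{x \in U \mid D^-(x) \subseteq X\} ,\\
\mathrm{NEG}^-(X) &= \{x \in U \mid D^+(x) \cap X = \emptyset\} ,\\
\mathrm{BND}^-(X) &= \{x \in U \mid D^-(x) \not\subseteq X \text{ and } D^+(x) \cap X \ne \emptyset\} .
\end{aligned} \tag{24.54}$$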


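Observe that the following complementarity properties
hold: for all $X \subseteq U$,

$$\begin{aligned}
\mathrm{POS}^+(X) &= \mathrm{NEG}^-(U - X) ,\\
\mathrm{POS}^-(X) &= \mathrm{NEG}^+(U - X) ,\\
\mathrm{BND}^+(X) &= \mathrm{BND}^-(U - X) ,\\
\mathrm{BND}^-(X) &= \mathrm{BND}^+(U - X) .
\end{aligned} \tag{24.55}$$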


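For all $X \subseteq U$, the pair of upward approximations
and the three upward regions determine each other as
follows

$$\begin{aligned}
\mathrm{POS}^+(X) &= \underline{\mathrm{apr}}^+(X) ,\\
\mathrm{NEG}^+(X) &= \left(\overline{\mathrm{apr}}^+(X)\right)^c ,\\
\mathrm{BND}^+(X) &= \overline{\mathrm{apr}}^+(X) - \underline{\mathrm{apr}}^+(X) ,
\end{aligned} \tag{24.56}$$

and

$$\begin{aligned}
\underline{\mathrm{apr}}^+(X) &= \mathrm{POS}^+(X) ,\\
\overline{\mathrm{apr}}^+(X) &= \mathrm{POS}^+(X) \cup \mathrm{BND}^+(X) .
\end{aligned} \tag{24.57}$$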


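Analogously, for all $X \subseteq U$, the pair of downward
approximations and the three downward regions determine
each other as follows

$$\begin{aligned}
\mathrm{POS}^-(X) &= \underline{\mathrm{apr}}^-(X) ,\\
\mathrm{NEG}^-(X) &= \left(\overline{\mathrm{apr}}^-(X)\right)^c ,\\
\mathrm{BND}^-(X) &= \overline{\mathrm{apr}}^-(X) - \underline{\mathrm{apr}}^-(X) ,
\end{aligned} \tag{24.58}$$

and

$$\begin{aligned}
\underline{\mathrm{apr}}^-(X) &= \mathrm{POS}^-(X) ,\\
\overline{\mathrm{apr}}^-(X) &= \mathrm{POS}^-(X) \cup \mathrm{BND}^-(X) .
\end{aligned} \tag{24.59}$$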
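
The cones and regions of this section can be computed directly on a small information table. The sketch below is only an illustration under stated assumptions: the table, the set X, and all function names are hypothetical, every attribute is treated as a numeric gain-type criterion, and dominance is checked on all attributes.

```python
def dominates(u, v):
    """True if evaluation vector u is at least as good as v on every criterion."""
    return all(a >= b for a, b in zip(u, v))


def cones(table):
    """Return the positive cones D+ and negative cones D- of every object."""
    d_plus = {x: {y for y in table if dominates(table[y], table[x])} for x in table}
    d_minus = {x: {y for y in table if dominates(table[x], table[y])} for x in table}
    return d_plus, d_minus


def upward_regions(table, X):
    """POS+, NEG+, BND+ computed from their set-theoretic definitions."""
    d_plus, d_minus = cones(table)
    pos = {x for x in table if d_plus[x] <= X}
    neg = {x for x in table if not (d_minus[x] & X)}
    bnd = {x for x in table if not d_plus[x] <= X and d_minus[x] & X}
    return pos, neg, bnd


def downward_regions(table, X):
    """POS-, NEG-, BND- computed from their set-theoretic definitions."""
    d_plus, d_minus = cones(table)
    pos = {x for x in table if d_minus[x] <= X}
    neg = {x for x in table if not (d_plus[x] & X)}
    bnd = {x for x in table if not d_minus[x] <= X and d_plus[x] & X}
    return pos, neg, bnd


# Hypothetical table: five objects evaluated on two criteria.
table = {"a": (1, 1), "b": (2, 1), "c": (1, 2), "d": (2, 2), "e": (3, 3)}
U = set(table)
X = {"b", "e"}  # object d dominates b but is not in X, so b and d are inconsistent

pos_up, neg_up, bnd_up = upward_regions(table, X)
pos_down_c, neg_down_c, bnd_down_c = downward_regions(table, U - X)
```

In this toy table the inconsistent pair b, d ends up in the upward boundary region, and the upward regions of X coincide with the corresponding downward regions of U - X, in line with the complementarity properties.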


24.7 A Basic Model of Dominance-Based Probabilistic Rough Sets 


DRSA considers only a qualitative relationship between
the positive and negative cones $D^+(x)$ and $D^-(x)$ and
a set X, namely, whether a positive or negative cone is a
subset of the set or has a nonempty intersection with
the set. This qualitative nature becomes clearer with
a probabilistic interpretation. Suppose that $\Pr(X \mid D^+(x))$
denotes the conditional probability that an object is
in X, given that the object is in $D^+(x)$, and that
$\Pr(X \mid D^-(x))$ denotes the conditional probability
that an object is in X, given that the object
is in $D^-(x)$. The conditions for defining the three
upward rough set regions can be equivalently expressed
as

$$\begin{aligned}
D^+(x) \subseteq X &\iff \Pr(X \mid D^+(x)) \ge 1 ,\\
D^-(x) \cap X = \emptyset &\iff \Pr(X \mid D^-(x)) \le 0 ,\\
D^+(x) \not\subseteq X \wedge D^-(x) \cap X \ne \emptyset &\iff \Pr(X \mid D^+(x)) < 1 \wedge \Pr(X \mid D^-(x)) > 0 .
\end{aligned} \tag{24.60}$$


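Analogously, the conditions for defining the three
downward rough set regions can be equivalently expressed
as

$$\begin{aligned}
D^-(x) \subseteq X &\iff \Pr(X \mid D^-(x)) \ge 1 ,\\
D^+(x) \cap X = \emptyset &\iff \Pr(X \mid D^+(x)) \le 0 ,\\
D^-(x) \not\subseteq X \wedge D^+(x) \cap X \ne \emptyset &\iff \Pr(X \mid D^-(x)) < 1 \wedge \Pr(X \mid D^+(x)) > 0 .
\end{aligned} \tag{24.61}$$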


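By those conditions, the three upward and three downward
DRSA regions can be equivalently expressed as

$$\begin{aligned}
\mathrm{POS}^+(X) &= \{x \in U \mid \Pr(X \mid D^+(x)) \ge 1\} ,\\
\mathrm{NEG}^+(X) &= \{x \in U \mid \Pr(X \mid D^-(x)) \le 0\} ,\\
\mathrm{BND}^+(X) &= \{x \in U \mid \Pr(X \mid D^+(x)) < 1 \wedge \Pr(X \mid D^-(x)) > 0\} ,\\
\mathrm{POS}^-(X) &= \{x \in U \mid \Pr(X \mid D^-(x)) \ge 1\} ,\\
\mathrm{NEG}^-(X) &= \{x \in U \mid \Pr(X \mid D^+(x)) \le 0\} ,\\
\mathrm{BND}^-(X) &= \{x \in U \mid \Pr(X \mid D^-(x)) < 1 \wedge \Pr(X \mid D^+(x)) > 0\} .
\end{aligned} \tag{24.62}$$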


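Observe that DRSA approximations use only the two
extreme values, i.e., 1 and 0, of probability.

It is natural to generalize DRSA approximations
by replacing 1 and 0 with some other values in the
unit interval [0, 1]. Given a pair of thresholds $\alpha, \beta$
with $0 \le \beta < \alpha \le 1$, the main results of probabilistic
DRSA are the $(\alpha, \beta)$-probabilistic regions defined
by

$$\begin{aligned}
\mathrm{POS}^+_{(\alpha,\beta)}(X) &= \{x \in U \mid \Pr(X \mid D^+(x)) \ge \alpha\} ,\\
\mathrm{NEG}^+_{(\alpha,\beta)}(X) &= \{x \in U \mid \Pr(X \mid D^-(x)) \le \beta\} ,\\
\mathrm{BND}^+_{(\alpha,\beta)}(X) &= \{x \in U \mid \Pr(X \mid D^+(x)) < \alpha \wedge \Pr(X \mid D^-(x)) > \beta\} ,\\
\mathrm{POS}^-_{(\alpha,\beta)}(X) &= \{x \in U \mid \Pr(X \mid D^-(x)) \ge \alpha\} ,\\
\mathrm{NEG}^-_{(\alpha,\beta)}(X) &= \{x \in U \mid \Pr(X \mid D^+(x)) \le \beta\} ,\\
\mathrm{BND}^-_{(\alpha,\beta)}(X) &= \{x \in U \mid \Pr(X \mid D^-(x)) < \alpha \wedge \Pr(X \mid D^+(x)) > \beta\} .
\end{aligned} \tag{24.63}$$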


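The DRSA rough set model is a special case in which
$\alpha = 1$ and $\beta = 0$. In the case when $0 < \beta = \alpha < 1$, the
three regions are given by

$$\begin{aligned}
\mathrm{POS}^+_{(\alpha,\alpha)}(X) &= \{x \in U \mid \Pr(X \mid D^+(x)) > \alpha\} ,\\
\mathrm{NEG}^+_{(\alpha,\alpha)}(X) &= \{x \in U \mid \Pr(X \mid D^-(x)) < \alpha\} ,\\
\mathrm{BND}^+_{(\alpha,\alpha)}(X) &= \{x \in U \mid \Pr(X \mid D^+(x)) \le \alpha \wedge \Pr(X \mid D^-(x)) \ge \alpha\} ,\\
\mathrm{POS}^-_{(\alpha,\alpha)}(X) &= \{x \in U \mid \Pr(X \mid D^-(x)) > \alpha\} ,\\
\mathrm{NEG}^-_{(\alpha,\alpha)}(X) &= \{x \in U \mid \Pr(X \mid D^+(x)) < \alpha\} ,\\
\mathrm{BND}^-_{(\alpha,\alpha)}(X) &= \{x \in U \mid \Pr(X \mid D^-(x)) \le \alpha \wedge \Pr(X \mid D^+(x)) \ge \alpha\} .
\end{aligned} \tag{24.64}$$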
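
As an illustration of the $(\alpha, \beta)$-probabilistic regions with $\beta < \alpha$, the following sketch estimates the conditional probabilities by relative frequencies within the cones, which is one possible assumption rather than the only choice. The table, the set X, and the function names are hypothetical.

```python
from fractions import Fraction


def dominates(u, v):
    """True if evaluation vector u is at least as good as v on every criterion."""
    return all(a >= b for a, b in zip(u, v))


def cond_prob(cone, X):
    """Frequency estimate of Pr(X | cone) = |cone ∩ X| / |cone| (cone is never empty)."""
    return Fraction(len(cone & X), len(cone))


def prob_upward_regions(table, X, alpha, beta):
    """Upward (alpha, beta)-probabilistic regions, assuming beta < alpha."""
    pos, neg, bnd = set(), set(), set()
    for x in table:
        d_plus = {y for y in table if dominates(table[y], table[x])}
        d_minus = {y for y in table if dominates(table[x], table[y])}
        p_plus, p_minus = cond_prob(d_plus, X), cond_prob(d_minus, X)
        if p_plus >= alpha:
            pos.add(x)
        if p_minus <= beta:
            neg.add(x)
        if p_plus < alpha and p_minus > beta:
            bnd.add(x)
    return pos, neg, bnd


# Hypothetical table: five objects evaluated on two gain-type criteria.
table = {"a": (1, 1), "b": (2, 1), "c": (1, 2), "d": (2, 2), "e": (3, 3)}
X = {"b", "e"}

classical = prob_upward_regions(table, X, 1, 0)                           # DRSA special case
relaxed = prob_upward_regions(table, X, Fraction(3, 5), Fraction(1, 4))   # relaxed thresholds
```

With $\alpha = 1$ and $\beta = 0$ the sketch reproduces the classical DRSA regions of this toy table, while relaxing the thresholds moves the two boundary objects into the positive and negative regions.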


24.8 Variants of Probabilistic Dominance-Based Rough Set Approach 


Several models generalizing dominance-based rough 
sets by incorporating probabilistic information can be 
considered. 


24.8.1 Variable Consistency 
Dominance-Based Rough Sets 


In a first version of variable consistency dominance-based
rough sets [24.23] (see also [24.24]), the standard
set inclusions $D^+(x) \subseteq X$ and $D^-(x) \subseteq X$ can be
generalized into graded set inclusions $s^+(D^+(x), X)$
and $s^-(D^-(x), X)$, called measures of the relative upward
and downward degree of misclassification of
$D^+(x)$ and $D^-(x)$ with respect to X, respectively.
A particular upward and downward measure is given
by

$$\begin{aligned}
s^+(D^+(x), X) &= 1 - \frac{|D^+(x) \cap X|}{|D^+(x)|} ,\\
s^-(D^-(x), X) &= 1 - \frac{|D^-(x) \cap X|}{|D^-(x)|} .
\end{aligned} \tag{24.65}$$


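By introducing a threshold $0 \le z < 0.5$, one can define
three upward and downward regions as follows

$$\begin{aligned}
\mathrm{VPOS}^+_z(X) &= \{x \in U \mid s^+(D^+(x), X) \le z\} ,\\
\mathrm{VNEG}^+_z(X) &= \{x \in U \mid s^-(D^-(x), X) \ge 1 - z\} ,\\
\mathrm{VBND}^+_z(X) &= \{x \in U \mid s^+(D^+(x), X) > z \wedge s^-(D^-(x), X) < 1 - z\} ,\\
\mathrm{VPOS}^-_z(X) &= \{x \in U \mid s^-(D^-(x), X) \le z\} ,\\
\mathrm{VNEG}^-_z(X) &= \{x \in U \mid s^+(D^+(x), X) \ge 1 - z\} ,\\
\mathrm{VBND}^-_z(X) &= \{x \in U \mid s^-(D^-(x), X) > z \wedge s^+(D^+(x), X) < 1 - z\} .
\end{aligned} \tag{24.66}$$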


A more generalized version using a pair of thresholds
can be defined as follows: for $0 \le l < u \le 1$,

$$\begin{aligned}
\mathrm{VPOS}^+_{(l,u)}(X) &= \{x \in U \mid s^+(D^+(x), X) \le l\} ,\\
\mathrm{VNEG}^+_{(l,u)}(X) &= \{x \in U \mid s^-(D^-(x), X) \ge u\} ,\\
\mathrm{VBND}^+_{(l,u)}(X) &= \{x \in U \mid s^+(D^+(x), X) > l \wedge s^-(D^-(x), X) < u\} ,\\
\mathrm{VPOS}^-_{(l,u)}(X) &= \{x \in U \mid s^-(D^-(x), X) \le l\} ,\\
\mathrm{VNEG}^-_{(l,u)}(X) &= \{x \in U \mid s^+(D^+(x), X) \ge u\} ,\\
\mathrm{VBND}^-_{(l,u)}(X) &= \{x \in U \mid s^-(D^-(x), X) > l \wedge s^+(D^+(x), X) < u\} .
\end{aligned} \tag{24.67}$$

The one-threshold model may be considered as a special
case of the two-threshold model with $l = z$ and
$u = 1 - z$.

One may interpret the ratios in (24.65) as estimates
of the conditional probabilities $\Pr(X \mid D^+(x))$ and
$\Pr(X \mid D^-(x))$, namely,

$$\begin{aligned}
s^+(D^+(x), X) &= 1 - \frac{|D^+(x) \cap X|}{|D^+(x)|} = 1 - \Pr(X \mid D^+(x)) ,\\
s^-(D^-(x), X) &= 1 - \frac{|D^-(x) \cap X|}{|D^-(x)|} = 1 - \Pr(X \mid D^-(x)) .
\end{aligned} \tag{24.68}$$
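By setting $\alpha = 1 - l$ and $\beta = 1 - u$, we immediately get

$$\begin{aligned}
\mathrm{POS}^+_{(1-l,1-u)}(X) &= \{x \in U \mid \Pr(X \mid D^+(x)) \ge 1 - l\} \\
&= \{x \in U \mid s^+(D^+(x), X) \le l\} \\
&= \mathrm{VPOS}^+_{(l,u)}(X) ,\\
\mathrm{NEG}^+_{(1-l,1-u)}(X) &= \{x \in U \mid \Pr(X \mid D^-(x)) \le 1 - u\} \\
&= \{x \in U \mid s^-(D^-(x), X) \ge u\} \\
&= \mathrm{VNEG}^+_{(l,u)}(X) ,\\
\mathrm{BND}^+_{(1-l,1-u)}(X) &= \{x \in U \mid \Pr(X \mid D^+(x)) < 1 - l \wedge \Pr(X \mid D^-(x)) > 1 - u\} \\
&= \{x \in U \mid s^+(D^+(x), X) > l \wedge s^-(D^-(x), X) < u\} \\
&= \mathrm{VBND}^+_{(l,u)}(X) ,\\
\mathrm{POS}^-_{(1-l,1-u)}(X) &= \{x \in U \mid \Pr(X \mid D^-(x)) \ge 1 - l\} \\
&= \{x \in U \mid s^-(D^-(x), X) \le l\} \\
&= \mathrm{VPOS}^-_{(l,u)}(X) ,\\
\mathrm{NEG}^-_{(1-l,1-u)}(X) &= \{x \in U \mid \Pr(X \mid D^+(x)) \le 1 - u\} \\
&= \{x \in U \mid s^+(D^+(x), X) \ge u\} \\
&= \mathrm{VNEG}^-_{(l,u)}(X) ,\\
\mathrm{BND}^-_{(1-l,1-u)}(X) &= \{x \in U \mid \Pr(X \mid D^-(x)) < 1 - l \wedge \Pr(X \mid D^+(x)) > 1 - u\} \\
&= \{x \in U \mid s^-(D^-(x), X) > l \wedge s^+(D^+(x), X) < u\} \\
&= \mathrm{VBND}^-_{(l,u)}(X) .
\end{aligned} \tag{24.69}$$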
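
The correspondence between the variable consistency regions and the probabilistic regions can be checked on a toy example. In the sketch below the positive cones are given directly as hypothetical data, the probability is the frequency estimate of (24.68), and the function names are illustrative only.

```python
from fractions import Fraction


def misclassification(cone, X):
    """Degree of misclassification s(cone, X) = 1 - |cone ∩ X| / |cone|."""
    return 1 - Fraction(len(cone & X), len(cone))


def vpos_upward(cones_plus, X, l):
    """Variable consistency upward positive region: s+(D+(x), X) <= l."""
    return {x for x, cone in cones_plus.items() if misclassification(cone, X) <= l}


def pos_upward(cones_plus, X, alpha):
    """Probabilistic upward positive region: Pr(X | D+(x)) >= alpha."""
    return {x for x, cone in cones_plus.items()
            if Fraction(len(cone & X), len(cone)) >= alpha}


# Hypothetical positive cones D+(x) of five objects and a target set X.
cones_plus = {
    "a": {"a", "b", "c", "d", "e"},
    "b": {"b", "d", "e"},
    "c": {"c", "d", "e"},
    "d": {"d", "e"},
    "e": {"e"},
}
X = {"b", "e"}

l = Fraction(1, 3)
region_v = vpos_upward(cones_plus, X, l)
region_p = pos_upward(cones_plus, X, 1 - l)  # alpha = 1 - l
```

Exact rational arithmetic is used so that boundary cases such as $s^+ = l$ are compared without rounding error.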


24.8.2 Parameterized Dominance-Based 
Rough Sets 


Parameterized rough sets based on dominance [24.24]
generalize variable consistency DRSA by introducing
a Bayesian confirmation measure and a pair of
thresholds on the confirmation measure, in addition
to a pair of thresholds on conditional probability. Let
$c^+(D^+(x), X)$ and $c^-(D^-(x), X)$ denote a Bayesian
upward and downward confirmation measure, respectively,
that indicate the degree to which the positive or negative
cones $D^+(x)$ and $D^-(x)$ confirm the hypothesis
X. The upward and downward Bayesian confirmation
measures corresponding to those introduced in
Sect. 24.4.3 are

$$\begin{aligned}
c_d^+(D^+(x), X) &= \Pr(X \mid D^+(x)) - \Pr(X) ,\\
c_d^-(D^-(x), X) &= \Pr(X \mid D^-(x)) - \Pr(X) ,\\
c_r^+(D^+(x), X) &= \frac{\Pr(X \mid D^+(x))}{\Pr(X)} ,\\
c_r^-(D^-(x), X) &= \frac{\Pr(X \mid D^-(x))}{\Pr(X)} .
\end{aligned} \tag{24.70}$$


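Given a pair of thresholds $(s, t)$ with $t < s$, three
$(\alpha, \beta, s, t)$-parameterized regions can be defined as
follows

$$\begin{aligned}
\mathrm{PPOS}^+_{(\alpha,\beta,s,t)}(X) &= \{x \in U \mid \Pr(X \mid D^+(x)) \ge \alpha \wedge c^+(D^+(x), X) \ge s\} ,\\
\mathrm{PNEG}^+_{(\alpha,\beta,s,t)}(X) &= \{x \in U \mid \Pr(X \mid D^-(x)) \le \beta \wedge c^-(D^-(x), X) \le t\} ,\\
\mathrm{PBND}^+_{(\alpha,\beta,s,t)}(X) &= \{x \in U \mid (\Pr(X \mid D^+(x)) < \alpha \vee c^+(D^+(x), X) < s) \\
&\qquad \wedge (\Pr(X \mid D^-(x)) > \beta \vee c^-(D^-(x), X) > t)\} ,\\
\mathrm{PPOS}^-_{(\alpha,\beta,s,t)}(X) &= \{x \in U \mid \Pr(X \mid D^-(x)) \ge \alpha \wedge c^-(D^-(x), X) \ge s\} ,\\
\mathrm{PNEG}^-_{(\alpha,\beta,s,t)}(X) &= \{x \in U \mid \Pr(X \mid D^+(x)) \le \beta \wedge c^+(D^+(x), X) \le t\} ,\\
\mathrm{PBND}^-_{(\alpha,\beta,s,t)}(X) &= \{x \in U \mid (\Pr(X \mid D^-(x)) < \alpha \vee c^-(D^-(x), X) < s) \\
&\qquad \wedge (\Pr(X \mid D^+(x)) > \beta \vee c^+(D^+(x), X) > t)\} .
\end{aligned} \tag{24.71}$$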
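
To see how the confirmation thresholds interact with the probability thresholds, the following sketch evaluates the upward parameterized regions for given conditional probabilities. It assumes the difference-type confirmation measure $c_d$; the probability values, the prior, the thresholds, and the function names are hypothetical.

```python
from fractions import Fraction as F


def c_d(p, prior):
    """Difference-type Bayesian confirmation: Pr(X | cone) - Pr(X)."""
    return p - prior


def parameterized_upward_regions(p_plus, p_minus, prior, alpha, beta, s, t):
    """Upward (alpha, beta, s, t)-parameterized regions from given probabilities."""
    pos, neg, bnd = set(), set(), set()
    for x in p_plus:
        cp = c_d(p_plus[x], prior)   # confirmation by the positive cone of x
        cm = c_d(p_minus[x], prior)  # confirmation by the negative cone of x
        if p_plus[x] >= alpha and cp >= s:
            pos.add(x)
        if p_minus[x] <= beta and cm <= t:
            neg.add(x)
        if (p_plus[x] < alpha or cp < s) and (p_minus[x] > beta or cm > t):
            bnd.add(x)
    return pos, neg, bnd


# Hypothetical values of Pr(X | D+(x)), Pr(X | D-(x)) and Pr(X).
p_plus = {"a": F(2, 5), "b": F(2, 3), "c": F(1, 3), "d": F(1, 2), "e": F(1)}
p_minus = {"a": F(0), "b": F(1, 2), "c": F(0), "d": F(1, 4), "e": F(2, 5)}
prior = F(2, 5)

with_confirmation = parameterized_upward_regions(
    p_plus, p_minus, prior, F(1, 2), F(1, 4), F(1, 5), F(-1, 5))
probability_only = parameterized_upward_regions(
    p_plus, p_minus, prior, F(1, 2), F(1, 4), F(-1), F(1))  # thresholds s, t never bind
```

In this example object d satisfies both probability thresholds but neither confirmation threshold, so adding the confirmation requirement moves it from the positive and negative regions into the boundary region.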


Let us remember that a family of consistency
measures, called gain-type consistency measures, and of
inconsistency measures, called cost-type consistency
measures, which is larger than the family of confirmation
measures, and the related dominance-based rough sets have
been considered in [24.24]. For any $x \in U$ and $X \subseteq U$, for
a consistency measure $m_c(x, X)$, x can be assigned to
the positive region of X if $m_c(x, X) \ge \alpha$, with $\alpha$ being
a proper threshold, while for an inconsistency measure
$m_{ic}(x, X)$, x can be assigned to the positive region
of X if $m_{ic}(x, X) \le \alpha$. A consistency measure $m_c(x, X)$
or an inconsistency measure $m_{ic}(x, X)$ is monotonic
(Sect. 22.3.2 in Chap. 22) if it does not deteriorate
when:


(m1) The set of attributes is growing, 

(m2) The set of objects is growing, 

(m3) x improves its evaluation, so that it dominates 
more objects. 


Among the considered consistency and inconsistency
measures, one that is particularly interesting, because
it enjoys all the considered monotonicity
properties (m1)-(m3) while maintaining a reasonably
easy formulation, is the inconsistency measure $\epsilon'$, which
is expressed as follows:

• In the case of dominance-based upward approximation

$$\epsilon^{\prime +}(x, X) = \frac{|D^+(x) \cap (U - X)|}{|X|} ,$$

• In the case of dominance-based downward approximation

$$\epsilon^{\prime -}(x, X) = \frac{|D^-(x) \cap (U - X)|}{|X|} .$$

Observe that, as explained in [24.24], consistency
and inconsistency measures can be properly reformulated
in order to be used in indiscernibility-based rough
sets. For example, the inconsistency measure $\epsilon'$ in the case of
indiscernibility-based rough sets becomes

$$\epsilon'(x, X) = \frac{|[x] \cap (U - X)|}{|X|} .$$

24.8.3 Confirmation-Theoretic 
Dominance-Based Rough Sets 


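A separation of the parameterized model into two
models within DRSA can be constructed as follows.
One is the conventional probabilistic model and the
other is a confirmation-theoretic model. For an upward
and a downward Bayesian confirmation measure
($c^+(D^+(x), X)$ and $c^-(D^-(x), X)$), and a pair of
thresholds $(s, t)$ with $t < s$, three confirmation regions
are defined by

$$\begin{aligned}
\mathrm{CPOS}^+_{(s,t)}(X) &= \{x \in U \mid c^+(D^+(x), X) \ge s\} ,\\
\mathrm{CNEG}^+_{(s,t)}(X) &= \{x \in U \mid c^-(D^-(x), X) \le t\} ,\\
\mathrm{CBND}^+_{(s,t)}(X) &= \{x \in U \mid c^+(D^+(x), X) < s \wedge c^-(D^-(x), X) > t\} ,\\
\mathrm{CPOS}^-_{(s,t)}(X) &= \{x \in U \mid c^-(D^-(x), X) \ge s\} ,\\
\mathrm{CNEG}^-_{(s,t)}(X) &= \{x \in U \mid c^+(D^+(x), X) \le t\} ,\\
\mathrm{CBND}^-_{(s,t)}(X) &= \{x \in U \mid c^-(D^-(x), X) < s \wedge c^+(D^+(x), X) > t\} .
\end{aligned} \tag{24.72}$$

For the case with $s = t$, we define

$$\begin{aligned}
\mathrm{CPOS}^+_{(s,s)}(X) &= \{x \in U \mid c^+(D^+(x), X) > s\} ,\\
\mathrm{CNEG}^+_{(s,s)}(X) &= \{x \in U \mid c^-(D^-(x), X) < s\} ,\\
\mathrm{CBND}^+_{(s,s)}(X) &= \{x \in U \mid c^+(D^+(x), X) \le s \wedge c^-(D^-(x), X) \ge s\} ,\\
\mathrm{CPOS}^-_{(s,s)}(X) &= \{x \in U \mid c^-(D^-(x), X) > s\} ,\\
\mathrm{CNEG}^-_{(s,s)}(X) &= \{x \in U \mid c^+(D^+(x), X) < s\} ,\\
\mathrm{CBND}^-_{(s,s)}(X) &= \{x \in U \mid c^-(D^-(x), X) \le s \wedge c^+(D^+(x), X) \ge s\} .
\end{aligned} \tag{24.73}$$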


24.8.4 Bayesian Dominance-Based 
Rough Sets 


A Bayesian DRSA model, in which the required pair of
thresholds is interpreted using the a priori probability
$\Pr(X)$, can be defined as an extension of the Bayesian
DRSA and the variable consistency Bayesian DRSA, as
explained below.

For the Bayesian DRSA, the three upward and
downward regions are defined by

$$\begin{aligned}
\mathrm{BPOS}^+(X) &= \{x \in U \mid \Pr(X \mid D^+(x)) > \Pr(X)\} ,\\
\mathrm{BNEG}^+(X) &= \{x \in U \mid \Pr(X \mid D^-(x)) < \Pr(X)\} ,\\
\mathrm{BBND}^+(X) &= \{x \in U \mid \Pr(X \mid D^+(x)) \le \Pr(X) \wedge \Pr(X \mid D^-(x)) \ge \Pr(X)\} ,\\
\mathrm{BPOS}^-(X) &= \{x \in U \mid \Pr(X \mid D^-(x)) > \Pr(X)\} ,\\
\mathrm{BNEG}^-(X) &= \{x \in U \mid \Pr(X \mid D^+(x)) < \Pr(X)\} ,\\
\mathrm{BBND}^-(X) &= \{x \in U \mid \Pr(X \mid D^-(x)) \le \Pr(X) \wedge \Pr(X \mid D^+(x)) \ge \Pr(X)\} .
\end{aligned} \tag{24.74}$$

Bayesian dominance-based rough sets can be viewed
as a special case of the decision-theoretic DRSA when
$\alpha = \beta = \Pr(X)$.

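Recalling the upward and downward DRSA
Bayesian confirmation measures

$$c_d^+(D^+(x), X) = \Pr(X \mid D^+(x)) - \Pr(X)$$

and

$$c_d^-(D^-(x), X) = \Pr(X \mid D^-(x)) - \Pr(X) ,$$

Bayesian dominance-based rough sets can be expressed
as confirmation-theoretic dominance-based rough sets
as follows

$$\begin{aligned}
\mathrm{BPOS}^+(X) &= \{x \in U \mid \Pr(X \mid D^+(x)) > \Pr(X)\} \\
&= \{x \in U \mid c_d^+(D^+(x), X) > 0\} ,\\
\mathrm{BNEG}^+(X) &= \{x \in U \mid \Pr(X \mid D^-(x)) < \Pr(X)\} \\
&= \{x \in U \mid c_d^-(D^-(x), X) < 0\} ,\\
\mathrm{BBND}^+(X) &= \{x \in U \mid \Pr(X \mid D^+(x)) \le \Pr(X) \wedge \Pr(X \mid D^-(x)) \ge \Pr(X)\} \\
&= \{x \in U \mid c_d^+(D^+(x), X) \le 0 \wedge c_d^-(D^-(x), X) \ge 0\} ,\\
\mathrm{BPOS}^-(X) &= \{x \in U \mid \Pr(X \mid D^-(x)) > \Pr(X)\} \\
&= \{x \in U \mid c_d^-(D^-(x), X) > 0\} ,\\
\mathrm{BNEG}^-(X) &= \{x \in U \mid \Pr(X \mid D^+(x)) < \Pr(X)\} \\
&= \{x \in U \mid c_d^+(D^+(x), X) < 0\} ,\\
\mathrm{BBND}^-(X) &= \{x \in U \mid \Pr(X \mid D^-(x)) \le \Pr(X) \wedge \Pr(X \mid D^+(x)) \ge \Pr(X)\} \\
&= \{x \in U \mid c_d^-(D^-(x), X) \le 0 \wedge c_d^+(D^+(x), X) \ge 0\} .
\end{aligned} \tag{24.75}$$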


That is, the Bayesian rough sets are models of
confirmation-theoretic rough sets characterized by the
upward and downward Bayesian confirmation measures
$c_d^+$ and $c_d^-$ with a pair of thresholds $s = t = 0$.

The three upward and downward regions of the
variable precision Bayesian rough sets are defined as
follows: for $\epsilon \in [0, 1)$,

$$\begin{aligned}
\mathrm{VBPOS}_\epsilon(X) &= \{x \in U \mid \Pr(X \mid [x]) \ge 1 - \epsilon(1 - \Pr(X))\} ,\\
\mathrm{VBNEG}_\epsilon(X) &= \{x \in U \mid \Pr(X \mid [x]) \le \epsilon \Pr(X)\} ,\\
\mathrm{VBBND}_\epsilon(X) &= \{x \in U \mid \epsilon \Pr(X) < \Pr(X \mid [x]) < 1 - \epsilon(1 - \Pr(X))\} .
\end{aligned} \tag{24.76}$$


24.9 Three Fundamental Issues of Probabilistic Dominance-Based Rough Sets


Also for probabilistic dominance-based rough sets, one 
must consider the three fundamental issues of inter- 
pretation and determination of the required pair of 
thresholds, estimation of the required conditional prob- 
abilities, and interpretation and applications of three 
probabilistic regions. 

These three issues are considered in this section 
with respect to dominance-based rough sets. 


24.9.1 Decision-Theoretic 
Dominance-Based 
Rough Set Model: 
Determining the Thresholds 


Following [24.75], a decision-theoretic model formulates
the construction of dominance-based rough set
approximations as a Bayesian decision problem with a set
of two states $\Omega = \{X, X^c\}$, indicating that an element
is in X and not in X, respectively. In the case of upward
dominance-based rough sets, we consider a set
of three actions $\mathcal{A}^+ = \{a_P^+, a_B^+, a_N^+\}$, with $a_P^+$ deciding
$x \in \mathrm{POS}^+(X)$, $a_B^+$ deciding $x \in \mathrm{BND}^+(X)$, and $a_N^+$
deciding $x \in \mathrm{NEG}^+(X)$, respectively. In the case of downward
dominance-based rough sets, we consider a set of
three actions $\mathcal{A}^- = \{a_P^-, a_B^-, a_N^-\}$, with $a_P^-$ deciding
$x \in \mathrm{POS}^-(X)$, $a_B^-$ deciding $x \in \mathrm{BND}^-(X)$, and $a_N^-$ deciding
$x \in \mathrm{NEG}^-(X)$, respectively. The losses regarding the
actions for different states are given by the 6 × 2 matrix

$$\begin{array}{c|cc}
 & X\ (P) & X^c\ (N) \\ \hline
a_P^+ & \lambda_{PP}^+ & \lambda_{PN}^+ \\
a_B^+ & \lambda_{BP}^+ & \lambda_{BN}^+ \\
a_N^+ & \lambda_{NP}^+ & \lambda_{NN}^+ \\
a_P^- & \lambda_{PP}^- & \lambda_{PN}^- \\
a_B^- & \lambda_{BP}^- & \lambda_{BN}^- \\
a_N^- & \lambda_{NP}^- & \lambda_{NN}^-
\end{array}$$

In the matrix:

• In the case that upward dominance-based rough
approximations are considered, $\lambda_{PP}^+$, $\lambda_{BP}^+$, and $\lambda_{NP}^+$
denote the losses incurred for taking actions $a_P^+$,
$a_B^+$, and $a_N^+$, respectively, when an object belongs
to X, and $\lambda_{PN}^+$, $\lambda_{BN}^+$, and $\lambda_{NN}^+$ denote the losses
incurred for taking the same actions when the object
does not belong to X,

• In the case that downward dominance-based rough
approximations are considered, $\lambda_{PP}^-$, $\lambda_{BP}^-$, and $\lambda_{NP}^-$
denote the losses incurred for taking actions $a_P^-$, $a_B^-$,
and $a_N^-$, respectively, when an object belongs to X,
and $\lambda_{PN}^-$, $\lambda_{BN}^-$, and $\lambda_{NN}^-$ denote the losses incurred for
taking the same actions when the object does not
belong to X.


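In the case that upward dominance-based rough
approximations are considered, the expected losses
associated with taking different actions for objects in
$D^+(x)$ can be expressed as

$$\begin{aligned}
R(a_P^+ \mid D^+(x)) &= \lambda_{PP}^+ \Pr(X \mid D^+(x)) + \lambda_{PN}^+ \Pr(X^c \mid D^+(x)) ,\\
R(a_B^+ \mid D^+(x)) &= \lambda_{BP}^+ \Pr(X \mid D^+(x)) + \lambda_{BN}^+ \Pr(X^c \mid D^+(x)) ,\\
R(a_N^+ \mid D^+(x)) &= \lambda_{NP}^+ \Pr(X \mid D^+(x)) + \lambda_{NN}^+ \Pr(X^c \mid D^+(x)) .
\end{aligned} \tag{24.77}$$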


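In the case that downward dominance-based rough
approximations are considered, the expected losses
associated with taking different actions for objects in
$D^-(x)$ can be expressed as

$$\begin{aligned}
R(a_P^- \mid D^-(x)) &= \lambda_{PP}^- \Pr(X \mid D^-(x)) + \lambda_{PN}^- \Pr(X^c \mid D^-(x)) ,\\
R(a_B^- \mid D^-(x)) &= \lambda_{BP}^- \Pr(X \mid D^-(x)) + \lambda_{BN}^- \Pr(X^c \mid D^-(x)) ,\\
R(a_N^- \mid D^-(x)) &= \lambda_{NP}^- \Pr(X \mid D^-(x)) + \lambda_{NN}^- \Pr(X^c \mid D^-(x)) .
\end{aligned} \tag{24.78}$$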


In the case that upward dominance-based rough
approximations are considered, the Bayesian decision
procedure suggests the following minimum-risk decision
rules

(P+) If $R(a_P^+ \mid D^+(x)) \le R(a_B^+ \mid D^+(x))$
and $R(a_P^+ \mid D^+(x)) \le R(a_N^+ \mid D^+(x))$,
decide $x \in \mathrm{POS}^+(X)$;

(B+) If $R(a_B^+ \mid D^+(x)) \le R(a_P^+ \mid D^+(x))$
and $R(a_B^+ \mid D^+(x)) \le R(a_N^+ \mid D^+(x))$,
decide $x \in \mathrm{BND}^+(X)$;

(N+) If $R(a_N^+ \mid D^+(x)) \le R(a_P^+ \mid D^+(x))$
and $R(a_N^+ \mid D^+(x)) \le R(a_B^+ \mid D^+(x))$,
decide $x \in \mathrm{NEG}^+(X)$.


In the case that downward dominance-based rough
approximations are considered, the Bayesian decision
procedure suggests the following minimum-risk decision
rules

(P-) If $R(a_P^- \mid D^-(x)) \le R(a_B^- \mid D^-(x))$
and $R(a_P^- \mid D^-(x)) \le R(a_N^- \mid D^-(x))$,
decide $x \in \mathrm{POS}^-(X)$;

(B-) If $R(a_B^- \mid D^-(x)) \le R(a_P^- \mid D^-(x))$
and $R(a_B^- \mid D^-(x)) \le R(a_N^- \mid D^-(x))$,
decide $x \in \mathrm{BND}^-(X)$;

(N-) If $R(a_N^- \mid D^-(x)) \le R(a_P^- \mid D^-(x))$
and $R(a_N^- \mid D^-(x)) \le R(a_B^- \mid D^-(x))$,
decide $x \in \mathrm{NEG}^-(X)$.


Also in the case that dominance-based rough
approximations are considered, when two or three actions
have the same risk, one can use the same ordering for
breaking a tie as in the case of indiscernibility-based rough
approximations: $a_P^+, a_N^+, a_B^+$ in the case that upward
rough approximations are considered, and $a_P^-, a_N^-, a_B^-$
in the case that downward rough approximations are
considered.

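Analogously to Sect. 24.5.1, let us consider the special
class of loss functions with

$$\begin{aligned}
(\mathrm{c0}^+)\quad & \lambda_{PP}^+ \le \lambda_{BP}^+ < \lambda_{NP}^+ , \qquad \lambda_{NN}^+ \le \lambda_{BN}^+ < \lambda_{PN}^+ ,\\
(\mathrm{c0}^-)\quad & \lambda_{PP}^- \le \lambda_{BP}^- < \lambda_{NP}^- , \qquad \lambda_{NN}^- \le \lambda_{BN}^- < \lambda_{PN}^- .
\end{aligned} \tag{24.79}$$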


With the conditions (c0$^+$) and (c0$^-$), and the equations

$$\Pr(X \mid D^+(x)) + \Pr(X^c \mid D^+(x)) = 1$$

and

$$\Pr(X \mid D^-(x)) + \Pr(X^c \mid D^-(x)) = 1 ,$$

we can express the decision rules (P+)-(N+) and
(P-)-(N-) in the following simplified form

(P+) If $\Pr(X \mid D^+(x)) \ge \alpha^+$
and $\Pr(X \mid D^+(x)) \ge \gamma^+$,
decide $x \in \mathrm{POS}^+(X)$;

(B+) If $\Pr(X \mid D^+(x)) \le \alpha^+$
and $\Pr(X \mid D^+(x)) \ge \beta^+$,
decide $x \in \mathrm{BND}^+(X)$;

(N+) If $\Pr(X \mid D^+(x)) \le \beta^+$
and $\Pr(X \mid D^+(x)) \le \gamma^+$,
decide $x \in \mathrm{NEG}^+(X)$;

(P-) If $\Pr(X \mid D^-(x)) \ge \alpha^-$
and $\Pr(X \mid D^-(x)) \ge \gamma^-$,
decide $x \in \mathrm{POS}^-(X)$;

(B-) If $\Pr(X \mid D^-(x)) \le \alpha^-$
and $\Pr(X \mid D^-(x)) \ge \beta^-$,
decide $x \in \mathrm{BND}^-(X)$;

(N-) If $\Pr(X \mid D^-(x)) \le \beta^-$
and $\Pr(X \mid D^-(x)) \le \gamma^-$,
decide $x \in \mathrm{NEG}^-(X)$,


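where

$$\begin{aligned}
\alpha^+ &= \frac{\lambda_{PN}^+ - \lambda_{BN}^+}{(\lambda_{PN}^+ - \lambda_{BN}^+) + (\lambda_{BP}^+ - \lambda_{PP}^+)} ,\\
\beta^+ &= \frac{\lambda_{BN}^+ - \lambda_{NN}^+}{(\lambda_{BN}^+ - \lambda_{NN}^+) + (\lambda_{NP}^+ - \lambda_{BP}^+)} ,\\
\gamma^+ &= \frac{\lambda_{PN}^+ - \lambda_{NN}^+}{(\lambda_{PN}^+ - \lambda_{NN}^+) + (\lambda_{NP}^+ - \lambda_{PP}^+)} ,\\
\alpha^- &= \frac{\lambda_{PN}^- - \lambda_{BN}^-}{(\lambda_{PN}^- - \lambda_{BN}^-) + (\lambda_{BP}^- - \lambda_{PP}^-)} ,\\
\beta^- &= \frac{\lambda_{BN}^- - \lambda_{NN}^-}{(\lambda_{BN}^- - \lambda_{NN}^-) + (\lambda_{NP}^- - \lambda_{BP}^-)} ,\\
\gamma^- &= \frac{\lambda_{PN}^- - \lambda_{NN}^-}{(\lambda_{PN}^- - \lambda_{NN}^-) + (\lambda_{NP}^- - \lambda_{PP}^-)} .
\end{aligned} \tag{24.80}$$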


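By setting $\alpha^+ > \beta^+$ and $\alpha^- > \beta^-$, we obtain that
$1 \ge \alpha^+ > \gamma^+ > \beta^+ \ge 0$ and $1 \ge \alpha^- > \gamma^- > \beta^- \ge 0$,
so that, after tie breaking, we get the following simplified
rules

(P+) If $\Pr(X \mid D^+(x)) \ge \alpha^+$,
decide $x \in \mathrm{POS}^+(X)$;

(B+) If $\beta^+ < \Pr(X \mid D^+(x)) < \alpha^+$,
decide $x \in \mathrm{BND}^+(X)$;

(N+) If $\Pr(X \mid D^+(x)) \le \beta^+$,
decide $x \in \mathrm{NEG}^+(X)$;

(P-) If $\Pr(X \mid D^-(x)) \ge \alpha^-$,
decide $x \in \mathrm{POS}^-(X)$;

(B-) If $\beta^- < \Pr(X \mid D^-(x)) < \alpha^-$,
decide $x \in \mathrm{BND}^-(X)$;

(N-) If $\Pr(X \mid D^-(x)) \le \beta^-$,
decide $x \in \mathrm{NEG}^-(X)$,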


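so that the parameters $\gamma^+$ and $\gamma^-$ are no longer needed.
Each object can be put into one and only one upward
region, and one and only one downward region, by using
rules (P+), (B+), and (N+), and (P-), (B-), and
(N-), respectively. The upward $(\alpha^+, \beta^+)$-probabilistic
positive, negative, and boundary regions and the downward
$(\alpha^-, \beta^-)$-probabilistic positive, negative, and boundary
regions are given, respectively, by

$$\begin{aligned}
\mathrm{POS}^+_{(\alpha^+,\beta^+)}(X) &= \{x \in U \mid \Pr(X \mid D^+(x)) \ge \alpha^+\} ,\\
\mathrm{BND}^+_{(\alpha^+,\beta^+)}(X) &= \{x \in U \mid \beta^+ < \Pr(X \mid D^+(x)) < \alpha^+\} ,\\
\mathrm{NEG}^+_{(\alpha^+,\beta^+)}(X) &= \{x \in U \mid \Pr(X \mid D^+(x)) \le \beta^+\} ,\\
\mathrm{POS}^-_{(\alpha^-,\beta^-)}(X) &= \{x \in U \mid \Pr(X \mid D^-(x)) \ge \alpha^-\} ,\\
\mathrm{BND}^-_{(\alpha^-,\beta^-)}(X) &= \{x \in U \mid \beta^- < \Pr(X \mid D^-(x)) < \alpha^-\} ,\\
\mathrm{NEG}^-_{(\alpha^-,\beta^-)}(X) &= \{x \in U \mid \Pr(X \mid D^-(x)) \le \beta^-\} .
\end{aligned} \tag{24.81}$$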
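
The passage from loss functions to thresholds can be illustrated numerically. The sketch below computes the three thresholds of (24.80) for one hypothetical loss table satisfying the conditions of (24.79), and compares the minimum-risk decision, with ties broken in the order positive, negative, boundary, against the simplified threshold rules. All names and loss values are assumptions made for the example; the same code applies to the upward or the downward case.

```python
from fractions import Fraction as F


def thresholds(l_PP, l_BP, l_NP, l_PN, l_BN, l_NN):
    """Thresholds alpha, beta, gamma computed from the six losses."""
    alpha = (l_PN - l_BN) / ((l_PN - l_BN) + (l_BP - l_PP))
    beta = (l_BN - l_NN) / ((l_BN - l_NN) + (l_NP - l_BP))
    gamma = (l_PN - l_NN) / ((l_PN - l_NN) + (l_NP - l_PP))
    return alpha, beta, gamma


def decide_by_risk(p, l_PP, l_BP, l_NP, l_PN, l_BN, l_NN):
    """Minimum-risk action for p = Pr(X | cone)."""
    risk = {
        "POS": l_PP * p + l_PN * (1 - p),
        "NEG": l_NP * p + l_NN * (1 - p),
        "BND": l_BP * p + l_BN * (1 - p),
    }
    # min returns the first minimal item, so ties break as POS, NEG, BND.
    return min(("POS", "NEG", "BND"), key=risk.get)


def decide_by_threshold(p, alpha, beta):
    """Simplified rules: p >= alpha, p <= beta, or strictly in between."""
    if p >= alpha:
        return "POS"
    if p <= beta:
        return "NEG"
    return "BND"


# Hypothetical losses satisfying l_PP <= l_BP < l_NP and l_NN <= l_BN < l_PN.
losses = dict(l_PP=F(0), l_BP=F(1), l_NP=F(4), l_PN=F(4), l_BN=F(1), l_NN=F(0))
alpha, beta, gamma = thresholds(**losses)
```

For these assumed losses the thresholds are $\alpha = 3/4$, $\gamma = 1/2$, and $\beta = 1/4$, and both decision procedures agree on a grid of probability values, including the tie points $p = \alpha$ and $p = \beta$.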


An alternative decision-theoretic model for
dominance-based rough sets, which takes into account in the
cost function the conditional probabilities $\Pr(X \mid D^+(x))$
and $\Pr(X^c \mid D^-(x))$ for upward rough approximations, as
well as the conditional probabilities $\Pr(X \mid D^-(x))$ and
$\Pr(X^c \mid D^+(x))$ for downward rough approximations, can
be defined as follows.

In the case that upward dominance-based rough
approximations are considered, the expected losses
associated with taking different actions for objects in
$D^+(x)$ can be expressed as

$$\begin{aligned}
R(a_P^+ \mid D^+(x), D^-(x)) &= \lambda_{PP}^+ \Pr(X \mid D^+(x)) + \lambda_{PN}^+ \Pr(X^c \mid D^-(x)) ,\\
R(a_B^+ \mid D^+(x), D^-(x)) &= \lambda_{BP}^+ \Pr(X \mid D^+(x)) + \lambda_{BN}^+ \Pr(X^c \mid D^-(x)) ,\\
R(a_N^+ \mid D^+(x), D^-(x)) &= \lambda_{NP}^+ \Pr(X \mid D^+(x)) + \lambda_{NN}^+ \Pr(X^c \mid D^-(x)) .
\end{aligned} \tag{24.82}$$


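In the case that downward dominance-based rough
approximations are considered, the expected losses
associated with taking different actions for objects in
$D^-(x)$ can be expressed as

$$\begin{aligned}
R(a_P^- \mid D^+(x), D^-(x)) &= \lambda_{PP}^- \Pr(X \mid D^-(x)) + \lambda_{PN}^- \Pr(X^c \mid D^+(x)) ,\\
R(a_B^- \mid D^+(x), D^-(x)) &= \lambda_{BP}^- \Pr(X \mid D^-(x)) + \lambda_{BN}^- \Pr(X^c \mid D^+(x)) ,\\
R(a_N^- \mid D^+(x), D^-(x)) &= \lambda_{NP}^- \Pr(X \mid D^-(x)) + \lambda_{NN}^- \Pr(X^c \mid D^+(x)) .
\end{aligned} \tag{24.83}$$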


24.9.2 Stochastic Dominance-Based 
Rough Set Approach: 
Estimating the Conditional 
Probability 


The naive Bayesian rough set model presented for rough
sets based on indiscernibility in Sect. 24.5.2 can be
extended quite straightforwardly to rough sets based on
dominance. Thus, in this section, we present a different
approach to estimating probabilities for rough approximations:
the stochastic rough set approach [24.25] (see also
Sect. 22.3.3 in Chap. 22). It can also be applied to rough
sets based on indiscernibility, but here we present this
approach for rough sets based on
dominance. In the following, we shall consider upward
dominance-based approximations of a given $X \subseteq U$.
However, the same approach can be used for downward
dominance-based approximations. From a probabilistic
point of view, the assignment of object x to $X \subseteq U$
can be made with probabilities $\Pr(X \mid D^+(x))$ and
$\Pr(X \mid D^-(x))$. These probabilities are supposed to satisfy
the usual axioms of probability

$$\begin{aligned}
\Pr(U \mid D^+(x)) &= 1 ,\\
\Pr(U - X \mid D^+(x)) &= 1 - \Pr(X \mid D^+(x)) ,\\
\Pr(U \mid D^-(x)) &= 1 ,\\
\Pr(U - X \mid D^-(x)) &= 1 - \Pr(X \mid D^-(x)) .
\end{aligned}$$


Moreover, this probability has to satisfy an axiom related
to the choice of the rough upward approximation,
i.e., the positive monotonic relationship one expects
between membership in $X \subseteq U$ and possession of the
properties related to attributes from AT, i.e., the dominance
relation $\succeq$: for any $x, y \in U$ such that $x \succeq y$,

(i) $\Pr(X \mid D^+(x)) \ge \Pr(X \mid D^+(y))$,
(ii) $\Pr(U - X \mid D^-(x)) \le \Pr(U - X \mid D^-(y))$.

Condition (i) says that if object x possesses properties
related to attributes from AT at least as much as object y, i.e.,
$x \succeq y$, then the probability that x belongs to X has to
be not smaller than the probability that y belongs to X.
Analogously, condition (ii) says that since $x \succeq y$,
the probability that x does not belong to X should not
be greater than the probability that y does not belong to
X. Observe that (ii) can be written also as

(ii') $\Pr(X \mid D^-(x)) \ge \Pr(X \mid D^-(y))$.


These probabilities are unknown but can be estimated
from data. For each $X \subseteq U$, we have a binary
problem of estimating the conditional probabilities
$\Pr(X \mid D^+(x)) = 1 - \Pr(U - X \mid D^+(x))$ and the
conditional probabilities $\Pr(X \mid D^-(x)) = 1 - \Pr(U - X \mid D^-(x))$.
It can be solved by isotonic regression [24.25].
For $X \subseteq U$ and for any $x \in U$, let $y(x, X) = 1$
if $x \in X$, otherwise $y(x, X) = 0$. Then one can
choose as estimates $\Pr^*(X \mid D^+(x))$ and $\Pr^*(X \mid D^-(x))$
the values of $\Pr(X \mid D^+(x))$ and $\Pr(X \mid D^-(x))$ that
minimize the squared distance to the class assignment
$y(x, X)$, subject to the monotonicity constraints related
to the dominance relation $\succeq$ on the attributes from AT
(see also Sect. 22.3.3 in Chap. 22):

$$\begin{aligned}
\text{Minimize} \quad & \sum_{x \in U} \left[ \left(y(x, X) - \Pr(X \mid D^+(x))\right)^2 + \left(y(x, X) - \Pr(X \mid D^-(x))\right)^2 \right] \\
\text{subject to} \quad & \Pr(X \mid D^+(x)) \ge \Pr(X \mid D^+(z)) \text{ and } \Pr(X \mid D^-(x)) \ge \Pr(X \mid D^-(z)) \text{ if } x \succeq z , \\
& \text{for all } x, z \in U .
\end{aligned}$$

Then, stochastic $\alpha$-lower approximations of $X \subseteq U$
can be defined as

$$\begin{aligned}
\underline{P}^{\alpha}(X) &= \{x \in U : \Pr(X \mid D^+(x)) \ge \alpha\} ,\\
\underline{P}^{\alpha}(U - X) &= \{x \in U : \Pr(U - X \mid D^-(x)) \ge \alpha\} .
\end{aligned}$$


Replacing the unknown probabilities $\Pr(X \mid D^+(x))$
and $\Pr(U - X \mid D^-(x))$ by their estimates
$\Pr^*(X \mid D^+(x))$ and $\Pr^*(U - X \mid D^-(x))$
obtained from isotonic regression, we get

$$\begin{aligned}
\underline{P}^{\alpha}(X) &= \{x \in U : \Pr^*(X \mid D^+(x)) \ge \alpha\} ,\\
\underline{P}^{\alpha}(U - X) &= \{x \in U : \Pr^*(U - X \mid D^-(x)) \ge \alpha\} ,
\end{aligned}$$

where the parameter $\alpha \in [0.5, 1]$ controls the allowed
amount of inconsistency.

Solving isotonic regression requires $O(|U|^4)$ time,
but a good heuristic needs only $O(|U|^2)$.

In fact, as shown in [24.25] and recalled in
Sect. 22.3.3 in Chap. 22, we do not really need to know
the probability estimates to obtain stochastic lower
approximations. We only need to know for which objects
$x \in U$ we have $\Pr^*(X \mid D^+(x)) \ge \alpha$ and for which $x \in U$ we have
$\Pr^*(U - X \mid D^-(x)) \ge \alpha$ (i.e., $\Pr^*(X \mid D^-(x)) \le 1 - \alpha$).
This can be found by solving a linear programming
(reassignment) problem.

As before, $y(x, X) = 1$ if $x \in X$, otherwise
$y(x, X) = 0$. Let $d(x, X)$ be the decision variable
which determines a new class assignment for object
x. Then, reassign objects to X if $d^*(x, X) = 1$, and to
$U - X$ if $d^*(x, X) = 0$, such that the new class assignments
are consistent with the dominance principle,
where $d^*(x, X)$ results from solving the following
linear programming problem

$$\begin{aligned}
\text{Minimize} \quad & \sum_{x \in U} w_{y(x,X)} \, |y(x, X) - d(x, X)| \\
\text{subject to} \quad & d(x, X) \ge d(z, X) \text{ if } x \succeq z , \text{ for all } x, z \in U ,
\end{aligned}$$

where $w_1$ and $w_0$ are arbitrary positive weights.


Due to unimodularity of the constraint matrix, the
optimal solution of this linear programming problem is
always integer, i.e., $d^*(x, X) \in \{0, 1\}$. For all objects
consistent with the dominance principle, $d^*(x, X) = y(x, X)$.
If we set $w_0 = \alpha$ and $w_1 = 1 - \alpha$, then the
optimal solution $d^*(x, X)$ satisfies: $d^*(x, X) = 1 \iff
\Pr^*(X \mid D^+(x)) \ge \alpha$. If we set $w_0 = 1 - \alpha$ and $w_1 = \alpha$,
then the optimal solution $d^*(x, X)$ satisfies: $d^*(x, X) =
0 \iff \Pr^*(X \mid D^-(x)) \le 1 - \alpha$.

Solving the reassignment problem twice, we can
obtain the lower approximations $\underline{P}^{\alpha}(X)$, $\underline{P}^{\alpha}(U - X)$,
without knowing the probability estimates.
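
On a very small universe the reassignment problem can be solved by exhaustive search, which makes the role of the weights easy to inspect. The sketch below is only an illustration and deliberately replaces linear programming with brute-force enumeration of all 0/1 assignments; the table, the set X, the value $\alpha = 3/5$, and the function names are hypothetical.

```python
from fractions import Fraction as F
from itertools import product


def dominates(u, v):
    """True if evaluation vector u is at least as good as v on every criterion."""
    return all(a >= b for a, b in zip(u, v))


def reassign(table, X, w0, w1):
    """Brute-force the weighted reassignment problem on a small universe.

    w1 weights errors on objects with y = 1 (in X), w0 on objects with y = 0.
    Returns the first optimal monotone 0/1 assignment found and its cost.
    """
    objs = sorted(table)
    best, best_cost = None, None
    for bits in product((0, 1), repeat=len(objs)):
        d = dict(zip(objs, bits))
        # Dominance principle: if x dominates z then d(x) >= d(z).
        if not all(d[x] >= d[z] for x in objs for z in objs
                   if dominates(table[x], table[z])):
            continue
        cost = sum((w1 if x in X else w0) * abs((x in X) - d[x]) for x in objs)
        if best_cost is None or cost < best_cost:
            best, best_cost = d, cost
    return best, best_cost


# Hypothetical table: d dominates b, yet b is in X and d is not.
table = {"a": (1, 1), "b": (2, 1), "c": (1, 2), "d": (2, 2), "e": (3, 3)}
X = {"b", "e"}
alpha = F(3, 5)

d_first, cost_first = reassign(table, X, w0=alpha, w1=1 - alpha)
d_second, cost_second = reassign(table, X, w0=1 - alpha, w1=alpha)
lower_X = {x for x, v in d_first.items() if v == 1}        # candidates for the lower approximation of X
lower_not_X = {x for x, v in d_second.items() if v == 0}   # candidates for the lower approximation of U - X
```

In this toy table the inconsistent pair b, d is reassigned to 0 in the first problem and to 1 in the second, so neither object enters either lower approximation. Exhaustive search is exponential in $|U|$ and is shown here only to make the objective and the constraints concrete.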


24.9.3 Three-Way Decisions: 
Interpreting the Three Regions 
in the Case of Dominance-Based 
Rough Sets 


In this section, we present an interpretation of the three
dominance-based rough set regions within
the framework of three-way decisions.

In an information table, with respect to a subset of
attributes $A \subseteq AT$, an object x induces the logic formulae

$$\bigwedge_{a \in A} I_a(x) \succeq_a v_a , \tag{24.84}$$

$$\bigwedge_{a \in A} v_a \succeq_a I_a(x) , \tag{24.85}$$

where $I_a(x), v_a \in V_a$ and

• The atomic formula $v_a \succeq_a I_a(x)$ indicates that object
x, taking value $I_a(x)$ on attribute a, possesses a property
related to a not more than any object y taking value
$I_a(y) = v_a$ on attribute a.

• The atomic formula $I_a(x) \succeq_a v_a$ indicates that object
x, taking value $I_a(x)$ on attribute a, possesses a property
related to a not less than any object y taking value
$I_a(y) = v_a$ on attribute a.

Thus, an object y satisfies the formula
$\bigwedge_{a \in A} I_a(x) \succeq_a v_a$ if $I_a(y) \succeq_a v_a$ for all $a \in A$,
that is,

$$y \models \bigwedge_{a \in A} I_a(x) \succeq_a v_a \iff \forall a \in A,\ (I_a(y) \succeq_a v_a) . \tag{24.86}$$

Analogously, an object y satisfies the formula
$\bigwedge_{a \in A} v_a \succeq_a I_a(x)$ if $v_a \succeq_a I_a(y)$ for all $a \in A$,
that is,

$$y \models \bigwedge_{a \in A} v_a \succeq_a I_a(x) \iff \forall a \in A,\ (v_a \succeq_a I_a(y)) . \tag{24.87}$$


With these notations, we are ready to interpret upward 
and downward dominance-based rough set three re- 
gions. 

From the upward and downward three regions, we 
can construct three classes of rules for classifying an 
object, called the upward and downward positive, neg- 
ative, and boundary rules. 

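They are expressed in the following forms: for
$y \in U$,

• Positive rule induced by an upward cone
$D^+(x) \subseteq \mathrm{POS}^+_{(\alpha,\beta)}(X)$:

$$\text{if } y \models \bigwedge_{a \in A} I_a(x) \succeq_a v_a , \text{ accept } y \in X ,$$

• Negative rule induced by the complement of an upward
cone $U - D^+(x) \subseteq \mathrm{NEG}^+_{(\alpha,\beta)}(X)$:

$$\text{if } y \models \neg \bigwedge_{a \in A} I_a(x) \succeq_a v_a , \text{ reject } y \in X ,$$

• Boundary rule induced by an upward cone $D^+(x)$
and its complement $U - D^+(x)$ such that
$D^+(x) \not\subseteq \mathrm{POS}^+_{(\alpha,\beta)}(X)$
and $(U - D^+(x)) \not\subseteq \mathrm{NEG}^+_{(\alpha,\beta)}(X)$:

$$\text{if } y \models \bigwedge_{a \in A} I_a(x) \succeq_a v_a \wedge \neg \bigwedge_{a \in A} I_a(x) \succeq_a u_a , \text{ neither accept nor reject } y \in X ,$$

• Positive rule induced by a downward cone
$D^-(x) \subseteq \mathrm{POS}^-_{(\alpha,\beta)}(X)$:

$$\text{if } y \models \bigwedge_{a \in A} v_a \succeq_a I_a(x) , \text{ accept } y \in X ,$$

• Negative rule induced by the complement of a
downward cone $U - D^-(x) \subseteq \mathrm{NEG}^-_{(\alpha,\beta)}(X)$:

$$\text{if } y \models \neg \bigwedge_{a \in A} v_a \succeq_a I_a(x) , \text{ reject } y \in X ,$$

• Boundary rule induced by a downward cone $D^-(x)$
and its complement $U - D^-(x)$ such that
$D^-(x) \not\subseteq \mathrm{POS}^-_{(\alpha,\beta)}(X)$
and $(U - D^-(x)) \not\subseteq \mathrm{NEG}^-_{(\alpha,\beta)}(X)$:

$$\text{if } y \models \bigwedge_{a \in A} v_a \succeq_a I_a(x) \wedge \neg \bigwedge_{a \in A} u_a \succeq_a I_a(x) , \text{ neither accept nor reject } y \in X .$$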


The three types of rules have semantic interpretations
analogous to those induced by probabilistic rough
sets based on indiscernibility presented in Sect. 24.5.3.
Let us consider the rules related to $\mathrm{POS}^+$ and $\mathrm{NEG}^+$.
A positive rule allows us to accept an object y as
a member of X, because y has a higher probability
of being in X due to the facts that $y \in D^+(x)$ and
$\Pr(X \mid D^+(x)) \ge \alpha^+$. A negative rule enables us to
reject an object y as a member of X, because y has
a lower probability of being in X due to the facts that
$y \in D^+(x)$ and $\Pr(X \mid D^+(x)) \le \beta^+$. When the probability
of y being in X is neither high nor low, a boundary
rule makes a noncommitment decision.

The error rate of a positive rule is given by
$1 - \Pr(X \mid D^+(x))$, which, by definition of the three
regions, is at or below $1 - \alpha^+$. The error rate of a negative
rule is given by $\Pr(X \mid D^+(x))$ and is at or below
$\beta^+$. The cost of a positive rule is $\lambda_{PP}^+ \Pr(X \mid D^+(x)) +
\lambda_{PN}^+ (1 - \Pr(X \mid D^+(x)))$ and is bounded above by
$\alpha^+ \lambda_{PP}^+ + (1 - \alpha^+) \lambda_{PN}^+$. The cost of a negative rule
is $\lambda_{NP}^+ \Pr(X \mid D^+(x)) + \lambda_{NN}^+ (1 - \Pr(X \mid D^+(x)))$ and is
bounded above by $\beta^+ \lambda_{NP}^+ + (1 - \beta^+) \lambda_{NN}^+$.

24.10 Conclusions


A basic probabilistic rough set model is formulated 
by using a pair of thresholds on conditional probabil- 
ities, which leads to flexibility and robustness when 
performing classification or decision-making tasks. 
Three theories are the supporting pillars of proba- 
bilistic rough sets. Bayesian decision theory enables 
us to determine and interpret the required thresholds
by using more operational notions such as loss,
cost, risk, etc. Bayesian inference enables us to estimate
the conditional probability accurately. A theory
of three-way decisions allows us to make a wise de- 
cision in the presence of incomplete or insufficient 
information. 
This chapter reviews three formulations of rough set
theory, i.e., the element-based definition, the granule-based
definition, and the subsystem-based definition.
These formulations are adopted to generalize rough 
sets from three directions. The first direction is to 
use an arbitrary binary relation to generalize the 
equivalence relation in the element-based defini- 
tion. The second is to use a covering to generalize 
the partition in the granule-based definition, and 
the third to use a subsystem to generalize the 
Boolean algebra in the subsystem-based defini- 
tion. In addition, we provide some insights into 
the theoretical aspects of these generalizations, 
mainly with respect to relations with non-classical 
logic and topology theory. 


In the Pawlak rough set model, the relationships of 
objects are defined by equivalence relations [25.1, 
2]. In addition, we may obtain two other equiva- 
lent structures: the partition, induced by the equiv- 
alence relations, and an atomic Boolean algebra, 
formed by the equivalence classes as its set of 
atoms [25.2,3]. In other words, we have three 
equivalent formulations of rough sets, namely, the 
equivalence relation-based formulation, the partition- 
based formulation, and the Boolean algebra-based 
formulation [25.4]. The approximation operators
$\underline{\mathrm{apr}}$ and $\overline{\mathrm{apr}}$ are defined by an equivalence
relation $E$, a partition $U/E$, and a Boolean algebra
$B(U/E)$, respectively [25.3, 5]. Although mathematically
equivalent, these three formulations give
different insights into the theory. More interest- 
ingly, when rough sets are generalized, the three 
formulations are no longer equivalent and thus 
give new directions for the exploration of rough
sets. 

This chapter aims to explore these different gener- 
alizations. The discussion is organized in two parts. In 
the first part, we review and summarize relation-based, 
covering-based, and subsystem-based rough sets, based 
on several articles by Yao [25.4, 6, 7]. In the second part, 
we will give some insight into the theoretical aspects of 
these generalizations, mainly with respect to relations 
with nonclassical logic (modal and many-valued) and 
topology theory. Note that this second part
partially overlaps with the first one; however, the scopes
are different. Indeed, whereas the first part explains the
models and their genesis, the second one is only devoted 
to some theoretical aspects. As such, the second part 
can be skipped by readers who are not so interested in 
fine details but may still have a clear view of the whole 
landscape of these kinds of generalized rough sets. 
25.1 Definition and Approximations of the Models


In this section we discuss three equivalent formulations, 
namely, the equivalence relation-based formulation, the 
partition-based formulation, and the Boolean algebra- 
based formulation. 


25.1.1 A Framework 
for Generalizing Rough Sets 


For a systematic study on the generalization of Pawlak
rough sets, Yao provided a framework for classifying
commonly used definitions of rough set approximations
into three types: the element-based definition, the
granule-based definition, and the subsystem-based def- 
inition [25.5]. He argued that these types offer three 
directions for generalizing rough set models. We adapt 
this framework in the following discussion. 

Suppose the universe $U$ is a finite and nonempty set
and let $E \subseteq U \times U$ be an equivalence relation on $U$. The
equivalence class containing $x$ is denoted as

\[ [x]_E = \{ y \mid y \in U,\ xEy \}. \]

The family of all equivalence classes is known as a quotient
set, denoted by

\[ U/E = \{ [x]_E \mid x \in U \}. \]

$U/E$ defines a partition of $U$. The family of all definable
sets forms $B(U/E)$. It can be obtained from $U/E$ by adding
the empty set $\emptyset$ and closing the result under set union.
The family of all definable sets is a subsystem of $2^U$, that is,
$B(U/E) \subseteq 2^U$ [25.1]. The standard rough set theory deals with the
approximation of any subset of $U$ in terms of definable
subsets in $B(U/E)$. From different representations of an
equivalence relation, three definitions of Pawlak rough
set approximations can be obtained as follows:


Fig. 25.1 Different formulations of approximation operators. Each definition of the Pawlak rough set model (left) is generalized to a counterpart in the generalized rough set model (right):

Element-based definition:    equivalence relation $E$      →  any binary relation $R \subseteq U \times U$
Granule-based definition:    partition $U/E$               →  covering $C$
Subsystem-based definition:  Boolean algebra $B(U/E)$      →  any subsystem


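• Element-based definitions [25.3, 5]

\[
\begin{aligned}
\underline{\mathrm{apr}}(A) &= \{ x \mid x \in U,\ [x]_E \subseteq A \} \\
 &= \{ x \mid x \in U,\ \forall y \in U\, [xEy \Rightarrow y \in A] \}, \\
\overline{\mathrm{apr}}(A) &= \{ x \mid x \in U,\ [x]_E \cap A \neq \emptyset \} \\
 &= \{ x \mid x \in U,\ \exists y \in U\, [xEy \wedge y \in A] \}.
\end{aligned}
\tag{25.1}
\]

• Granule-based definitions [25.3, 5, 8]

\[
\begin{aligned}
\underline{\mathrm{apr}}(A) &= \bigcup \{ [x]_E \mid [x]_E \in U/E,\ [x]_E \subseteq A \} \\
 &= \bigcup \{ X \mid X \in U/E,\ X \subseteq A \}, \\
\overline{\mathrm{apr}}(A) &= \bigcup \{ [x]_E \mid [x]_E \in U/E,\ [x]_E \cap A \neq \emptyset \} \\
 &= \bigcup \{ X \mid X \in U/E,\ X \cap A \neq \emptyset \}.
\end{aligned}
\tag{25.2}
\]

• Subsystem-based definition [25.3, 5, 8]

\[
\begin{aligned}
\underline{\mathrm{apr}}(A) &= \bigcup \{ X \mid X \in B(U/E),\ X \subseteq A \}, \\
\overline{\mathrm{apr}}(A) &= \bigcap \{ X \mid X \in B(U/E),\ A \subseteq X \}.
\end{aligned}
\tag{25.3}
\]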


The three equivalent definitions offer different
interpretations of rough set approximations [25.5].
According to the element-based definition, an element $x$
is in the lower approximation $\underline{\mathrm{apr}}(A)$ of a set $A$ if all
of its equivalent elements are in $A$; the element is in
the upper approximation $\overline{\mathrm{apr}}(A)$ if at least one of its
equivalent elements is in $A$ [25.5]. According to the
granule-based definition, $\underline{\mathrm{apr}}(A)$ is the union of
equivalence classes that are subsets of $A$; $\overline{\mathrm{apr}}(A)$ is the union of
equivalence classes that have a nonempty intersection
with $A$ [25.5]. According to the subsystem-based definition,
$\underline{\mathrm{apr}}(A)$ is the largest definable set in the subsystem
$B(U/E)$ that is contained in $A$; $\overline{\mathrm{apr}}(A)$ is the smallest
definable set in the subsystem $B(U/E)$ that contains
$A$ [25.5].
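
The equivalence of the three definitions can be illustrated on a small universe. The sketch below is an illustration under stated assumptions rather than code from the cited works: the function names, the example universe, and the choice of equivalence relation (equal remainder modulo 3) are all hypothetical.

```python
from itertools import combinations


def partition(universe, key):
    """Quotient set U/E for the relation x E y iff key(x) == key(y)."""
    blocks = {}
    for x in universe:
        blocks.setdefault(key(x), set()).add(x)
    return [frozenset(b) for b in blocks.values()]


def element_based(universe, blocks, a):
    cls = {x: b for b in blocks for x in b}  # maps x to its class [x]_E
    lower = {x for x in universe if cls[x] <= a}
    upper = {x for x in universe if cls[x] & a}
    return lower, upper


def granule_based(blocks, a):
    lower = set().union(*[b for b in blocks if b <= a])
    upper = set().union(*[b for b in blocks if b & a])
    return lower, upper


def subsystem_based(universe, blocks, a):
    # B(U/E): all unions of equivalence classes, including the empty union.
    definable = [frozenset().union(*c)
                 for r in range(len(blocks) + 1)
                 for c in combinations(blocks, r)]
    lower = set().union(*[d for d in definable if d <= a])
    upper = set(universe)
    for d in definable:
        if a <= d:
            upper &= d  # intersect all definable supersets of A
    return lower, upper


U = set(range(1, 7))
blocks = partition(U, lambda x: x % 3)  # classes {1,4}, {2,5}, {3,6}
A = {1, 4, 5}
results = [element_based(U, blocks, A),
           granule_based(blocks, A),
           subsystem_based(U, blocks, A)]
```

For this example all three entries of `results` coincide: the lower approximation is {1, 4} and the upper approximation is {1, 2, 4, 5}. The subsystem-based variant enumerates every union of classes, so it is only practical for very small partitions.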

Figure 25.1, adapted from Yao and Yao [25.4], 
shows three directions in generalized rough set mod- 
els. In the Pawlak model, the definitions of approx- 
imation operators based on the equivalence relation, 
partition and Boolean algebra B(U/E) are equivalent. 
The symbol $\Leftrightarrow$ is used to show a one-to-one two-way
construction process. However, the generalized definitions
of approximation operators using arbitrary binary
relations, coverings, and subsystems are not equivalent.
In other words, the corresponding subsystem or covering
may not be found based on an arbitrary binary
relation. With each formulation, various definitions of
approximation operators can be examined. One may 
consider an arbitrary binary relation in generalizing 
the equivalence relation in the element-based defini- 
tion, a covering in generalizing the partition in the 
granule-based definition, and other subsystems in gen- 
eralizing the Boolean algebra in the subsystem-based 
definition [25.4, 5]. 


25.1.2 Binary Relation-Based Rough Sets 


In the development of the theory of rough sets, ap- 
proximation operators are typically defined by using 
equivalence relations which are reflexive, symmetric, 
and transitive [25.2]. The Pawlak rough set model can 
be extended by using any arbitrary binary relation to re- 
place the equivalence relation. Wybraniec-Skardowska 
introduced different rough set models based on vari- 
ous types of binary relations [25.9]. Pawlak pointed 
out that any type of relations may be assumed on 
the universe for the development of a rough set the- 
ory [25.10]. Yao etal. extended conventional rough 
set models by considering various types of relations 
by drawing results from modal logics [25.11]. Simi- 
larly to defining different types of modal logic sys- 
tems, different rough set models were defined by us- 
ing classes of binary relations satisfying various sets 
of properties, formed by serial, reflective, symmetric, 
transitive, and Euclidean relations, and their combina- 
tions. Slowinski and Vanderpooten considered a special 
case in which a reflexive (not necessarily symmet- 
ric and transitive) similarity relation was used [25.12]. 
Greco etal. examined a fuzzy rough approximation 
based on fuzzy similarity relations [25.13]. Guan and 
Wang investigated the relationships among 12 differ- 
ent basic definitions of approximations and suggested 
the suitable generalized definitions of approximations 
for each class of generalized indiscernibility rela- 
tions [25.14]. 

A binary relation $R$ may be conveniently represented
by a mapping $n : U \to 2^U$, i.e., $n$ is a neighborhood
operator and $n(x)$ consists of all $R$-related
elements of $x$. In the element-based definition, the
equivalence class $[x]_E$ can be viewed as a neighborhood
of $x$ consisting of objects equivalent to $x$. In general,
one may consider any type of neighborhood of $x$,
consisting of objects related to $x$, to form more general
approximation operators. By extending (25.1), we can
define lower and upper approximation operators as
follows [25.15]

\[
\begin{aligned}
\underline{\mathrm{apr}}_n(A) &= \{ x \mid x \in U,\ n(x) \subseteq A \} \\
 &= \{ x \mid x \in U,\ \forall y \in U\, (y \in n(x) \Rightarrow y \in A) \}, \\
\overline{\mathrm{apr}}_n(A) &= \{ x \mid x \in U,\ n(x) \cap A \neq \emptyset \} \\
 &= \{ x \mid x \in U,\ \exists y \in U\, (y \in n(x) \wedge y \in A) \}.
\end{aligned}
\tag{25.4}
\]

The set $\underline{\mathrm{apr}}_n(A)$ consists of elements whose $R$-related
elements are all in $A$, and $\overline{\mathrm{apr}}_n(A)$ consists of elements
at least one of whose $R$-related elements is
in $A$. The pair $(\underline{\mathrm{apr}}_n(A), \overline{\mathrm{apr}}_n(A))$ of lower and upper
approximations is a generalized rough set of $A$ induced
by the binary relation $R$.
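
A minimal sketch of these neighborhood-based operators follows. It is an illustration only: the function names and the small example relation are assumptions, and the neighborhood of an object is taken to be the set of objects it is related to.

```python
def successor_neighborhood(universe, relation):
    """n(x) = {y | x R y} for a relation given as a set of ordered pairs."""
    return {x: {y for y in universe if (x, y) in relation} for x in universe}


def apr_lower(universe, n, a):
    """Objects whose related objects all lie in A."""
    return {x for x in universe if n[x] <= a}


def apr_upper(universe, n, a):
    """Objects with at least one related object in A."""
    return {x for x in universe if n[x] & a}


U = {1, 2, 3, 4}
R = {(1, 2), (2, 3), (3, 3)}  # object 4 is related to nothing
n = successor_neighborhood(U, R)
A = {2, 3}
lower = apr_lower(U, n, A)
upper = apr_upper(U, n, A)
```

Here the lower approximation is the whole universe while the upper approximation is {1, 2, 3}: object 4 has an empty neighborhood, so it satisfies the lower condition vacuously and fails the upper one. For an arbitrary relation, the lower approximation therefore need not be contained in the upper approximation, although the two operators remain dual with respect to set complement.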

A neighborhood operator can be defined by using
a binary relation [25.6, 12, 16]. Suppose $R \subseteq U \times U$ is
a binary relation on the universe $U$. A successor
neighborhood operator $R_s : U \to 2^U$ can be defined as

\[ xR_s = \{ y \mid y \in U,\ xRy \}. \]

Conversely, a binary relation can be constructed from
its successor neighborhood as

\[ xRy \Leftrightarrow y \in xR_s. \]


Generalized approximations by a neighborhood oper- 
ator can be equivalently formulated by using a binary 
relation [25.4]. This formulation connects generalized 
approximation operators with the necessity and possi- 
bility operators in modal logic [25.6]. There are many 
types of generalized approximation operators defined 
by neighborhood operators that are induced by a binary 
relation or a family of binary relations [25.3, 15-19]. 

For an arbitrary relation, generalized rough set op- 
erators do not necessarily satisfy all the properties in 
the Pawlak rough set model. Nevertheless, the follow- 
ing properties hold in rough set models induced by any 
binary relation [25.3, 6, 20] 


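(L1) $\underline{\mathrm{apr}}(A) = (\overline{\mathrm{apr}}(A^c))^c$,
(L2) $\underline{\mathrm{apr}}(U) = U$,
(L3) $\underline{\mathrm{apr}}(A \cap B) = \underline{\mathrm{apr}}(A) \cap \underline{\mathrm{apr}}(B)$,
(L4) $\underline{\mathrm{apr}}(A \cup B) \supseteq \underline{\mathrm{apr}}(A) \cup \underline{\mathrm{apr}}(B)$,
(L5) $A \subseteq B \Rightarrow \underline{\mathrm{apr}}(A) \subseteq \underline{\mathrm{apr}}(B)$,
(K) $\underline{\mathrm{apr}}(A^c \cup B) \subseteq (\underline{\mathrm{apr}}(A))^c \cup \underline{\mathrm{apr}}(B)$,
(U1) $\overline{\mathrm{apr}}(A) = (\underline{\mathrm{apr}}(A^c))^c$,
(U2) $\overline{\mathrm{apr}}(\emptyset) = \emptyset$,
(U3) $\overline{\mathrm{apr}}(A \cup B) = \overline{\mathrm{apr}}(A) \cup \overline{\mathrm{apr}}(B)$,
(U4) $\overline{\mathrm{apr}}(A \cap B) \subseteq \overline{\mathrm{apr}}(A) \cap \overline{\mathrm{apr}}(B)$,
(U5) $A \subseteq B \Rightarrow \overline{\mathrm{apr}}(A) \subseteq \overline{\mathrm{apr}}(B)$.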


A relation $R$ is a serial relation if for all $x \in U$ there
exists a $y \in U$ such that $xRy$; a relation is a reflexive
relation if for all $x \in U$ the relationship $xRx$ holds; a relation
is a symmetric relation if for all $x, y \in U$, $xRy$ implies $yRx$;
a relation is a transitive relation if for all
$x, y, z \in U$, $xRy$ and $yRz$ imply $xRz$; a relation
is Euclidean if for all $x, y, z \in U$, $xRy$ and $xRz$
imply $yRz$ [25.6, 15]. By using the mapping $n$, we can express
the conditions on a binary relation equivalently as
follows [25.6, 15, 21]:

Serial      $x \in U,\ n(x) \neq \emptyset$
Reflexive   $x \in U,\ x \in n(x)$
Symmetric   $x, y \in U,\ x \in n(y) \Rightarrow y \in n(x)$
Transitive  $x, y \in U,\ y \in n(x) \Rightarrow n(y) \subseteq n(x)$
Euclidean   $x, y \in U,\ y \in n(x) \Rightarrow n(x) \subseteq n(y)$.


Different binary relations have different properties. 
The five properties of a binary relation, namely, the 
serial, reflexive, symmetric, transitive, and Euclidean 
properties, induce five properties for the approximation 
operators [25.6, 20,21]. We use the same labeling sys- 
tem as in modal logic to label these properties [25.6]: 


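Serial:      Property (D)  $\underline{\mathrm{apr}}(A) \subseteq \overline{\mathrm{apr}}(A)$ holds
Reflexive:   Property (T)  $\underline{\mathrm{apr}}(A) \subseteq A$ holds
Symmetric:   Property (B)  $A \subseteq \underline{\mathrm{apr}}(\overline{\mathrm{apr}}(A))$ holds
Transitive:  Property (4)  $\underline{\mathrm{apr}}(A) \subseteq \underline{\mathrm{apr}}(\underline{\mathrm{apr}}(A))$ holds
Euclidean:   Property (5)  $\overline{\mathrm{apr}}(A) \subseteq \underline{\mathrm{apr}}(\overline{\mathrm{apr}}(A))$ holds.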
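
The correspondence between relation properties and operator properties can be explored by brute force on a small universe. The sketch below is a hedged illustration: the function names and example relations are hypothetical, and each property is tested over every subset of the universe, so it is suitable only for very small examples.

```python
from itertools import chain, combinations


def neighborhoods(universe, relation):
    return {x: frozenset(y for y in universe if (x, y) in relation)
            for x in universe}


def lower(universe, n, a):
    return frozenset(x for x in universe if n[x] <= a)


def upper(universe, n, a):
    return frozenset(x for x in universe if n[x] & a)


def subsets(universe):
    items = list(universe)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))]


def satisfied_properties(universe, relation):
    """Labels among D, T, B, 4, 5 that hold for every subset A."""
    n = neighborhoods(universe, relation)
    lo = lambda a: lower(universe, n, a)
    up = lambda a: upper(universe, n, a)
    checks = {
        "D": lambda a: lo(a) <= up(a),
        "T": lambda a: lo(a) <= a,
        "B": lambda a: a <= lo(up(a)),
        "4": lambda a: lo(a) <= lo(lo(a)),
        "5": lambda a: up(a) <= lo(up(a)),
    }
    return {name for name, ok in checks.items()
            if all(ok(a) for a in subsets(universe))}


U = {1, 2, 3}
identity = {(x, x) for x in U}
# Reflexive and symmetric, but not transitive: 1 R 2 and 2 R 3, yet not 1 R 3.
compatibility = identity | {(1, 2), (2, 1), (2, 3), (3, 2)}
# An equivalence relation with classes {1, 2} and {3}.
equivalence = identity | {(1, 2), (2, 1)}

compat_props = satisfied_properties(U, compatibility)
equiv_props = satisfied_properties(U, equivalence)
```

For the compatibility relation only (D), (T), and (B) hold on this example, whereas the equivalence relation satisfies all five properties.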


By combining these properties, one can construct
more rough set models [25.6, 20]. In addition to the
properties mentioned above, (K) denotes the property that
holds for any binary relation, i.e., no special property is
required. We use a series of property labels, i.e., (K),
(D), (T), (B), (4), (5), to represent the rough set models
built on relations with these properties. For example, the
KTB rough set model is built on a compatibility relation
$R$, i.e., a relation that is reflexive and symmetric. In such
a model, properties (K), (D), (T), and (B) hold; however,
properties (4) and (5) need not hold. Property (D)
does not explicitly appear in this label because (D) can
be obtained from (T). If $R$ is reflexive, symmetric, and
transitive, i.e., $R$ is an equivalence relation, we obtain
the Pawlak rough set model [25.6, 20]. The approximation
operators then satisfy all properties (D), (T), (B), (4),
and (5).

Figure 25.2 summarizes the relationships between 
these models [25.6, 20, 21]. The label of the model in- 
dicates the characterization properties of that model. 
A line connecting two models indicates that the model 
on the upper level is also a model on the lower level. 
For example, a KT5 model is a KT4 model, as KT5
connects down to KT4. It should be noted that the lines that
can be derived by transitivity are not explicitly shown.
The model K may be considered as the basic model
because it does not require any special property on the
binary relation. All other models are built on top of the
model K, which can be regarded as the weakest model.
The model KT5, i.e., the Pawlak rough set model, is the
strongest model.

With the element-based definition, we can obtain bi- 
nary relation-based rough set models by generalizing 
the equivalence relation to binary relations. Different 
binary relations can induce different rough set mod- 
els with different properties, as was discussed above. 
This generalization not only deepens our understand- 
ing of rough sets, but also enriches the rough set 
theory. 


25.1.3 Covering-Based Rough Sets 


A covering of a universe is a family of subsets of the 
universe such that their union is the universe. By al- 
lowing nonempty overlap of two subsets, a covering 
is a generalized mathematical structure of a partition
[25.22]. These subsets in a covering or a partition
can be considered as granules based on the concepts
in granular computing [25.23]. By generalizing the
partition to covering in granule-based approximation
definitions, we form a more general definition and we
call this approach a granule-based definition. In this
section, we mainly investigate covering-based rough
sets.

Fig. 25.2 Rough set models (after [25.6])

Zakowski proposed the notion of covering-based 
rough set approximations in 1983 [25.24]. Lower 
and upper approximation operators are defined by 
a straightforward generalization of the rough set defi- 
nition proposed by Pawlak. However, the generalized 
approximation operators are not dual to each other with 
respect to set complements [25.15, 25]. Pomykala stud- 
ied two pairs of dual approximation operators [25.25]. 
The lower approximation operator in one pair is the 
same as the Zakowski lower approximation operator, 
and the upper approximation operator in the other pair 
is the same as the Zakowski upper approximation operator
[25.25]. Pomykala also suggested and examined
additional pairs of dual approximation operators that 
are induced by a covering. Furthermore, he considered 
coverings produced by tolerance relations in an incom- 
plete information table [25.26]. 

Instead of using duality, Wybraniec-Skardowska 
studied pairs of approximation operators linked together
by a different type of relation [25.9]. Given an
upper approximation operator, the corresponding lower 
approximation operator is defined based on the up- 
per approximations of singleton subsets. Several such 
pairs of approximation operators were studied based on 
a covering and a tolerance relation defined by a cover- 
ing, including some of those used by Zakowski [25.24] 
and Pomykala [25.25]. Yao investigated dual approx- 
imation operators by using coverings induced by the 
predecessor and/or successor neighborhoods of serial 
or inverse serial binary relations. The two pairs of 
dual approximation operators introduced by Pomykala 
were examined and the conditions for their equiva- 
lence to those obtained from a binary relation were 
given [25.5, 15]. Couso and Dubois proposed a loose 
pair and a tight pair [25.27]. They presented an inter- 
esting investigation of the two pairs of approximation 
operators within the context of incomplete information. 
The two pairs of operators were shown to be related to 
the family of approximation operators produced by all 
partitions consistent with a covering induced by an ill- 
known attribute function in an incomplete information 
table. Restrepo et al. investigated different relationships 
between commonly used operators using concepts of 
duality and other properties [25.28]. They also showed
that a pair consisting of a lower and an upper approximation
operator can be dual and adjoint at the same
time.

By using the minimum neighborhood of an object 
(i. e., the intersection of subsets in the minimal descrip- 
tion of the object), Wang et al. introduced a pair of 
dual approximation operators [25.29]. The same pair of 
approximation operators was also used and examined 
by Xu and Wang [25.30] and Xu and Zhang [25.31]. 
Zhu’s team systematically studied five types of approx- 
imation operators [25.19, 32-36]. The lower approxi- 
mation operator is the Zakowski lower approximation 
operator, and the upper approximation operators are 
different. They investigated properties of these operators
and their relationships and provided a set of axioms
for characterizing these operators. Liu examined
covering-based rough sets from constructive and ax- 
iomatic approaches [25.37]. The relationships among 
four types of covering-based rough sets and the topolo- 
gies induced by different covering approximations were 
discussed. Zhang and Luo investigated relationships 
between relation-based rough sets and covering-based 
rough sets [25.38]. They also presented some suffi- 
cient and necessary conditions for different types of 
covering-based rough sets to be equal. 

We will elaborate on how to obtain rough sets 
by generalizing a partition to a covering, as well as 
duality, loose pairs, and tight pairs of approximation 
operators. Let C be a covering of the universe U. By 
replacing a partition U/E with a covering C and equiv- 
alence classes with subsets in C in the granule-based 
definition, a pair of approximation operators can be ob- 
tained [25.24]. However, they are not a pair of dual 
operators [25.25]. To overcome this problem, Yao sug- 
gested that one can generalize one of them and define 
the other by duality [25.7, 15]. The granule-based defi- 
nition can be generalized in two ways, i. e., (1) the lower 
approximation operator is extended from partition to 
covering and the upper approximation operator is rede- 
fined by duality, (2) the upper approximation operator 
is extended from partition to covering and the lower ap- 
proximation operator is redefined by duality [25.4, 7]. 
The results are two pairs of dual approximation opera- 
tors [25.4] 


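\[
\begin{aligned}
\underline{\mathrm{apr}}'(A) &= \bigcup \{ X \mid X \in C,\ X \subseteq A \} \\
 &= \{ x \mid x \in U,\ \exists X \in C\, [x \in X,\ X \subseteq A] \}, \\
\overline{\mathrm{apr}}'(A) &= (\underline{\mathrm{apr}}'(A^c))^c \\
 &= \{ x \mid x \in U,\ \forall X \in C\, [x \in X \Rightarrow X \cap A \neq \emptyset] \},
\end{aligned}
\tag{25.5}
\]

and

\[
\begin{aligned}
\underline{\mathrm{apr}}''(A) &= (\overline{\mathrm{apr}}''(A^c))^c \\
 &= \{ x \mid x \in U,\ \forall X \in C\, [x \in X \Rightarrow X \subseteq A] \}, \\
\overline{\mathrm{apr}}''(A) &= \bigcup \{ X \mid X \in C,\ X \cap A \neq \emptyset \} \\
 &= \{ x \mid x \in U,\ \exists X \in C\, [x \in X,\ X \cap A \neq \emptyset] \}.
\end{aligned}
\tag{25.6}
\]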

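We may define two pairs of dual approximation operators
for each covering. Both pairs are consistent with
the Pawlak definition. The following relationships hold
for the above approximation operators [25.4, 25]

\[
\underline{\mathrm{apr}}''(A) \subseteq \underline{\mathrm{apr}}'(A) \subseteq A \subseteq \overline{\mathrm{apr}}'(A) \subseteq \overline{\mathrm{apr}}''(A).
\tag{25.7}
\]

Therefore, the pair $(\underline{\mathrm{apr}}'(A), \overline{\mathrm{apr}}'(A))$ is called a pair of
tighter approximations and the pair $(\underline{\mathrm{apr}}''(A), \overline{\mathrm{apr}}''(A))$
is called a pair of looser approximations [25.25]. Furthermore,
any approximation pair $(\underline{\mathrm{apr}}(A), \overline{\mathrm{apr}}(A))$ produced by other
authors is bounded by
$(\underline{\mathrm{apr}}'(A), \overline{\mathrm{apr}}'(A))$
and
$(\underline{\mathrm{apr}}''(A), \overline{\mathrm{apr}}''(A))$
if
$\underline{\mathrm{apr}}(A) \subseteq A \subseteq \overline{\mathrm{apr}}(A)$.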
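
The tight and loose pairs can be computed directly from a covering. The following sketch is illustrative only: the function names and the example covering are assumptions, and in each pair one operator is computed from its union-based form while the other is obtained by duality.

```python
def tight_pair(universe, covering, a):
    """Lower: union of blocks inside A. Upper: dual of that lower operator."""
    lower = set().union(*[k for k in covering if k <= a])
    complement = universe - a
    upper = universe - set().union(*[k for k in covering if k <= complement])
    return lower, upper


def loose_pair(universe, covering, a):
    """Upper: union of blocks meeting A. Lower: dual of that upper operator."""
    upper = set().union(*[k for k in covering if k & a])
    complement = universe - a
    lower = universe - set().union(*[k for k in covering if k & complement])
    return lower, upper


U = {1, 2, 3, 4}
C = [frozenset({1, 2}), frozenset({2, 3}), frozenset({3, 4})]
A = {1, 2}

tight_lower, tight_upper = tight_pair(U, C, A)
loose_lower, loose_upper = loose_pair(U, C, A)
```

On this example the loose lower approximation is {1}, both tight approximations equal {1, 2}, and the loose upper approximation is {1, 2, 3}, which is an instance of the inclusion chain between the two pairs.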


In addition to this fundamental generalization of 
a rough set to a covering, more than 20 different ap- 
proximation pairs have been defined [25.4, 39]. Their 
properties were studied in [25.39], where the inclusion 
relationship occurring among two sets and their approx- 
imations was considered. All these approaches were 
categorized in a recent study [25.4]. 

We recall some notions that will be useful in 
Sect. 25.2.2 when dealing with the topological charac- 
terization of approximations. 


Definition 25.1 [25.39] 
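Let $\mathcal{C}$ be a covering on a universe $U$ and $x \in U$. We
define:

• The neighborhood of $x$: $\gamma(x) = \bigcap \{ C \in \mathcal{C} : x \in C \}$.

• The friends of $x$: $\delta(x) = \bigcup \{ C \in \mathcal{C} : x \in C \}$.

• The partition generated by a covering: $\Pi_{\mathcal{C}}(x) = \{ y \in U : \forall C \in \mathcal{C},\ (x \in C \Leftrightarrow y \in C) \}$.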
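
The three notions of Definition 25.1 can be sketched as follows. This is an illustration under stated assumptions: the function names and the example covering are hypothetical, and a covering is represented as a list of frozensets.

```python
def neighborhood(covering, x):
    """Intersection of all blocks of the covering that contain x."""
    blocks = [c for c in covering if x in c]
    return frozenset.intersection(*blocks)


def friends(covering, x):
    """Union of all blocks of the covering that contain x."""
    return frozenset().union(*[c for c in covering if x in c])


def generated_partition_class(universe, covering, x):
    """Objects that belong to exactly the same blocks as x."""
    return frozenset(y for y in universe
                     if all((x in c) == (y in c) for c in covering))


U = {1, 2, 3, 4}
C = [frozenset({1, 2}), frozenset({2, 3}), frozenset({3, 4})]

nbh_2 = neighborhood(C, 2)
friends_2 = friends(C, 2)
class_1 = generated_partition_class(U, C, 1)
```

For object 2, which lies in the first two blocks, the neighborhood is {2} and the set of friends is {1, 2, 3}. Object 1 shares its block membership pattern with no other object, so its class in the generated partition is {1}.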


Using the above operators, the following approxi- 
mation pairs are introduced 


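\[
\begin{aligned}
L_1(A) &= \bigcup \{ \delta(x) : \delta(x) \subseteq A \}, \\
U_1(A) &= L_1(A^c)^c,
\end{aligned}
\tag{25.8}
\]

\[
\begin{aligned}
L_2(A) &= \bigcup \{ C \in \mathcal{C} : C \subseteq A \}, \\
U_2(A) &= L_2(A^c)^c = \bigcap \{ C^c : C \in \mathcal{C},\ C \cap A = \emptyset \},
\end{aligned}
\tag{25.9}
\]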
\[
\begin{aligned}
L_3(A) &= L_2(A), \\
U_3(A) &= \bigcup \{ C : C \cap A \neq \emptyset \} \setminus L_2(A^c),
\end{aligned}
\tag{25.10}
\]
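\[
\begin{aligned}
L_4(A) &= \bigcup \{ \Pi_{\mathcal{C}}(x) : \Pi_{\mathcal{C}}(x) \subseteq A \}, \\
U_4(A) &= \bigcup \{ \Pi_{\mathcal{C}}(x) : \Pi_{\mathcal{C}}(x) \cap A \neq \emptyset \},
\end{aligned}
\tag{25.11}
\]

\[
\begin{aligned}
L_5(A) &= \{ x \in U : \gamma(x) \subseteq A \}, \\
U_5(A) &= \{ x \in U : \gamma(x) \cap A \neq \emptyset \},
\end{aligned}
\tag{25.12}
\]

\[
\begin{aligned}
L_6(A) &= U_6(A^c)^c, \\
U_6(A) &= \bigcup \{ \gamma(x) : x \in A \}.
\end{aligned}
\tag{25.13}
\]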


These approximation pairs have been introduced 
and studied in several papers: approximation pair 1 
(25.8) can be found in [25.40], approximation pair 2 
(25.9) in [25.5, 40—43], approximation pair 3 (25.10) 
in [25.42], approximation pair 4 (25.11) in many 
papers starting from [25.44], approximation pair 5 
(25.12) in [25.41, 45], and approximation pair 6 (25.13) 
in [25.45,46]. As we will discuss, they all show nice 
topological properties. 

By simply replacing a partition with a covering, 
a generalized mathematical structure of a partition, in 
the granule-based definition, we form new rough sets. 
The lower and upper approximation operators are not
necessarily dual. We may redefine one of them to obtain
the dual approximation operators. There are two types 
of approximation, a tight pair and a loose pair. The two 
pairs provide the boundary when new approximation 
operators are introduced. 


25.1.4 Subsystem-Based Rough Sets 


In the Pawlak rough set model, the same subsystem 
is used to define lower and upper approximation oper- 
ators. When generalizing the subsystem-based defini- 
tion, two subsystems may be used, one for the lower 
approximation, which is closed under union, and the 
other for the upper approximation, which is closed 
under intersection [25.4, 7,47]. To ensure duality of ap- 
proximation operators, the two subsystems should be 
dual systems with respect to set complement [25.4]. 
Given a closure system $\overline{S}$, its dual system $\underline{S}$ can be
constructed as $\underline{S} = \{ \sim X \mid X \in \overline{S} \}$. The system $\overline{S}$ contains the
universe $U$ and is closed under set intersection. The system
$\underline{S}$ contains the empty set $\emptyset$ and is closed under set
union [25.4]. A pair of lower and upper approximation
operators with respect to $\overline{S}$ is defined as [25.4]

\[
\begin{aligned}
\underline{\mathrm{apr}}(A) &= \bigcup \{ X \mid X \in \underline{S},\ X \subseteq A \}, \\
\overline{\mathrm{apr}}(A) &= \bigcap \{ X \mid X \in \overline{S},\ A \subseteq X \}.
\end{aligned}
\tag{25.14}
\]

In the Pawlak rough set model, the two systems $\underline{S}$
and $\overline{S}$ are the same, namely, $\underline{S} = \overline{S}$, which is closed
under set complement, union, and intersection. That
is, it is a Boolean algebra. The subsystem-based
definition provides a way to approximate any set in $2^U$
by a pair of sets in $\underline{S}$ and $\overline{S}$, respectively [25.4]. The
subsystem-based definition can be generalized by using 
different mathematical structures, such as topological 
spaces [25.7,47,48], closure systems [25.7,47], lat- 
tices [25.7, 49], and posets [25.7, 50]. 

In an arbitrary topological space, the family of
open sets generally differs from the family of closed sets. Let
$(U, O(U))$ be a topological space, where $O(U) \subseteq 2^U$ is
a family of subsets of $U$ called open sets. The family
of (topological) open sets contains $\emptyset$ and $U$. The family
of open sets is closed under union and finite intersection.
The family of all (topological) closed sets
$C(U) = \{ \neg X \mid X \in O(U) \}$ contains $\emptyset$ and $U$, and is closed
under intersection and finite union. A pair of generalized
approximation operators can be defined by replacing
$B(U/E)$ with $O(U)$ for the lower approximation operator,
and $B(U/E)$ with $C(U)$ for the upper approximation
operator [25.5]. The definitions of approximation operators
are [25.5, 7, 50]

\[
\begin{aligned}
\underline{\mathrm{apr}}(A) &= \bigcup \{ X \mid X \in O(U),\ X \subseteq A \}, \\
\overline{\mathrm{apr}}(A) &= \bigcap \{ X \mid X \in C(U),\ A \subseteq X \}.
\end{aligned}
\tag{25.15}
\]
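
In a finite topological space these operators are the interior and the closure, which can be sketched as follows. The function names and the small chain topology are assumptions made for illustration; closed sets are obtained as complements of the given open sets.

```python
def interior(open_sets, a):
    """Lower approximation: union of the open sets contained in A."""
    return set().union(*[o for o in open_sets if o <= a])


def closure(universe, open_sets, a):
    """Upper approximation: intersection of the closed sets containing A."""
    result = set(universe)
    for o in open_sets:
        closed = universe - o  # complement of an open set
        if a <= closed:
            result &= closed
    return result


U = frozenset({1, 2, 3})
# A chain topology on U: closed under union and intersection.
O = [frozenset(), frozenset({1}), frozenset({1, 2}), U]
A = {2}

lower = interior(O, A)
upper = closure(U, O, A)
```

For this example the lower approximation of {2} is empty, since no nonempty open set fits inside it, while the upper approximation is {2, 3}. The two operators are dual: the closure of a set is the complement of the interior of its complement.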


The rough set model can be generalized by us- 
ing closure systems. A family of subsets of U, C(U), 
is called a closure system if it contains U and is 
closed under intersection. By collecting the complements
of members of $C(U)$, we can obtain another
system $O(U) = \{ \neg X \mid X \in C(U) \}$, which contains the
empty set $\emptyset$ and is closed under union. In this case,
a pair of approximation operators in a closure system
can be defined by replacing $B(U/E)$ with $O(U)$ for the
lower approximation operator, and $B(U/E)$ with $C(U)$
for the upper approximation operator. The definitions of
approximation operators are [25.5, 7]

\[
\begin{aligned}
\underline{\mathrm{apr}}(A) &= \bigcup \{ X \mid X \in O(U),\ X \subseteq A \}, \\
\overline{\mathrm{apr}}(A) &= \bigcap \{ X \mid X \in C(U),\ A \subseteq X \}.
\end{aligned}
\tag{25.16}
\]

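The power set of the universe is a special lattice.
Suppose $(B, \neg, \wedge, \vee, 0, 1)$ is a finite Boolean algebra
and $(B_0, \neg, \wedge, \vee, 0, 1)$ is a sub-Boolean algebra. One
may approximate an element of $B$ by using elements
of $B_0$ [25.7]

\[
\begin{aligned}
\underline{\mathrm{apr}}(x) &= \bigvee \{ y \mid y \in B_0,\ y \leq x \}, \\
\overline{\mathrm{apr}}(x) &= \bigwedge \{ y \mid y \in B_0,\ x \leq y \}.
\end{aligned}
\tag{25.17}
\]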

We consider a more generalized definition in which 
the Boolean algebra B is replaced by a completely dis- 
tributive lattice [25.51], and one subsystem is used. 
A subsystem O(B) of B satisfies the following ax- 
ioms [25.7]: 


(O1) $0 \in O(B)$, $1 \in O(B)$;

(O2) for any subsystem $D \subseteq O(B)$, if there exists
a least upper bound $\mathrm{LUB}(D) = \bigvee D$, it belongs
to $O(B)$;

(O3) $O(B)$ is closed under finite meets.


Elements of $O(B)$ are referred to as inner definable
elements. The complement of an inner definable
element is called an outer definable element. The set
of outer definable elements $C(B) = \{ \neg x \mid x \in O(B) \}$ is
characterized by the following axioms [25.7]:


(C1) $0 \in C(B)$, $1 \in C(B)$;

(C2) for any subsystem $D \subseteq C(B)$, if there exists
a greatest lower bound $\mathrm{GLB}(D) = \bigwedge D$, it belongs
to $C(B)$;

(C3) $C(B)$ is closed under finite joins.


From the sets of inner and outer definable elements,
we define the following approximation operators [25.7]

\[
\begin{aligned}
\underline{\mathrm{apr}}(x) &= \bigvee \{ y \mid y \in O(B),\ y \leq x \}, \\
\overline{\mathrm{apr}}(x) &= \bigwedge \{ y \mid y \in C(B),\ x \leq y \}.
\end{aligned}
\tag{25.18}
\]

Let $(L, \leq, 0, 1)$ be a bounded lattice. Suppose $O(L)$
is a subset of $L$ such that it contains 0 and is closed under
join, and $C(L)$ a subset of $L$ such that it contains 1
and is closed under meets. They are complete lattices,
although the meet of $O(L)$ and the join of $C(L)$ may be
different from those of $L$. Based on these two systems,
we can define two other approximation operators as
follows [25.7]

\[
\begin{aligned}
\underline{\mathrm{apr}}(x) &= \bigvee \{ y \mid y \in O(L),\ y \leq x \}, \\
\overline{\mathrm{apr}}(x) &= \bigwedge \{ y \mid y \in C(L),\ x \leq y \}.
\end{aligned}
\tag{25.19}
\]

The operator $\overline{\mathrm{apr}}$ is a closure operator [25.7]. $C(L)$
corresponds to the closure system in the set-theoretic
framework. However, since a lattice may not be
complemented, we must explicitly consider both $O(L)$
and $C(L)$. That is, the system $(L, O(L), C(L))$,
or equivalently the system $(L, \underline{\mathrm{apr}}, \overline{\mathrm{apr}})$, is used
for the generalization of Pawlak approximation
operators.

The subsystem-based formulation provides an important
interpretation of rough set theory. It allows us
to study rough set theory in the contexts of many
algebraic systems [25.47]. This naturally leads to the
generalization of rough set approximations. With the
subsystem-based definition, we have examined the
generalized approximation operators by using topological
spaces, closure systems, lattices, and posets in this
subsection.


25.2 Theoretical Approaches


In this section, we further develop the previously
outlined links with modal logics and topology.


25.2.1 Logical Setting


The minimal modal system K [25.52] is at the basis
of any modal logic. Its language $\mathcal{L}$ is the usual
one of propositional logic plus necessity $\Box$ and
possibility $\Diamond$. That is,
$\mathcal{L} = a \in V \mid \neg\alpha \mid \alpha \wedge \beta \mid \Box(\alpha)$, where
$V = \{a, b, c, \ldots\}$ is the set of propositional variables
and $\neg$, $\wedge$, $\Box$ are the negation, conjunction, and
necessity connectives. As usual, other connectives can be
derived: disjunction $\alpha \vee \beta$ stands for $\neg(\neg\alpha \wedge \neg\beta)$,
implication $\alpha \to \beta$ stands for $\neg\alpha \vee \beta$, and possibility is
$\Diamond\alpha = \neg\Box(\neg\alpha)$.

The axioms are those of Boolean logic plus the
axioms to characterize the modal connectives:

(B1) $\alpha \to (\beta \to \alpha)$
(B2) $(\alpha \to (\beta \to \gamma)) \to ((\alpha \to \beta) \to (\alpha \to \gamma))$
(B3) $(\neg\alpha \to \neg\beta) \to (\beta \to \alpha)$
(K) $\Box(\alpha \to \beta) \to (\Box\alpha \to \Box\beta)$.

The rules are modus ponens: if $\vdash \alpha$ and $\vdash \alpha \to \beta$,
then $\vdash \beta$; and necessitation: if $\vdash \alpha$, then $\vdash \Box\alpha$.

In our context, the semantics is given through
a model $M = (X, R, v)$, where $(X, R)$ is an approximation
space (that is, a universe with a binary relation)
and $v$ is the interpretation that, given a variable, returns
a subset of elements of the universe: $v(a) \subseteq X$. Using
the standard modal logic terminology, $X$ is the set of
possible worlds, $R$ the accessibility relation, and $v(a)$
represents the set of possible worlds where $a$ holds. The
interpretation $v$ can recursively be extended to any
formula $\alpha$ as


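\[
\begin{aligned}
v(\neg\alpha) &= v(\alpha)^c, \\
v(\alpha_1 \wedge \alpha_2) &= v(\alpha_1) \cap v(\alpha_2), \\
v(\alpha_1 \vee \alpha_2) &= v(\alpha_1) \cup v(\alpha_2),
\end{aligned}
\]

and modal operators are mapped to lower and upper
approximations according to definition (25.4)

\[
\begin{aligned}
v(\Box\alpha) &= \underline{\mathrm{apr}}_n(v(\alpha)), \\
v(\Diamond\alpha) &= \overline{\mathrm{apr}}_n(v(\alpha)).
\end{aligned}
\]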
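
The recursive extension of the interpretation can be sketched as a small evaluator. This is an illustrative sketch, not an implementation from the cited literature: the tuple encoding of formulas, the function name `evaluate`, and the example model are all assumptions. Necessity is evaluated as a lower approximation and possibility as an upper approximation with respect to successor neighborhoods.

```python
def evaluate(formula, worlds, relation, valuation):
    """Return the set of worlds where `formula` holds.

    Formulas are nested tuples: ("var", name), ("not", f), ("and", f, g),
    ("or", f, g), ("box", f), ("dia", f).
    """
    op = formula[0]
    if op == "var":
        return set(valuation[formula[1]])
    if op == "not":
        return worlds - evaluate(formula[1], worlds, relation, valuation)
    if op in ("and", "or"):
        left = evaluate(formula[1], worlds, relation, valuation)
        right = evaluate(formula[2], worlds, relation, valuation)
        return left & right if op == "and" else left | right
    if op in ("box", "dia"):
        inner = evaluate(formula[1], worlds, relation, valuation)
        succ = {x: {y for y in worlds if (x, y) in relation} for x in worlds}
        if op == "box":  # lower approximation of the inner truth set
            return {x for x in worlds if succ[x] <= inner}
        return {x for x in worlds if succ[x] & inner}  # upper approximation
    raise ValueError(f"unknown connective: {op!r}")


X = {1, 2, 3}
R = {(1, 2), (2, 3), (3, 3)}
v = {"a": {3}}
a = ("var", "a")

box_a = evaluate(("box", a), X, R, v)
dia_a = evaluate(("dia", a), X, R, v)
dual = evaluate(("not", ("box", ("not", a))), X, R, v)
```

In this model both the necessity and the possibility of the variable hold exactly in worlds 2 and 3, and the derived form of possibility (negated necessity of the negation) yields the same set as the direct evaluation.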


It is well known from modal logic [25.52] that, once
the basic axioms (B1)-(B3) and (K) are fixed, each
relation property corresponds to a different modal axiom,
as listed in Table 25.1.

Clearly, these axioms reflect the properties on rough 
approximations given in Sect. 25.1.2 and can be used to 
generate all the logics given in Fig. 25.2. 

Table 25.1 Correspondence between modal axioms and relation properties

Name   Axiom                                      Property
T      $\Box\alpha \to \alpha$                    Reflexive
4      $\Box\alpha \to \Box(\Box\alpha)$          Transitive
5      $\Diamond\alpha \to \Box(\Diamond\alpha)$  Euclidean
D      $\Box\alpha \to \Diamond\alpha$            Serial
B      $\alpha \to \Box\Diamond\alpha$            Symmetric

Other kinds of generalized rough set models have
been studied in the literature under the framework of
modal logic. In particular, nondeterministic information
logic (NIL) [25.53] is defined to capture those information
tables in which more than one value can correspond
to each pair (object, attribute). For instance, if we have
a feature color, then each object may be assigned
more than one color. Given this extended
definition, several new relations can be introduced;
these were studied in the Orlowska-Pawlak seminal
paper [25.53] and some subsequent studies [25.54-56].
Some of these relations are:


• Similarity (connection)

  $x\,S\,y$ iff $f(x,a) \cap f(y,a) \neq \emptyset$ for all $a \in A$.
• Inclusion

  $x$ is included in $y$ iff $f(x,a) \subseteq f(y,a)$ for all $a \in A$.
• Indiscernibility

  $x\,\mathrm{Ind}\,y$ iff $f(x,a) = f(y,a)$ for all $a \in A$.
• Weak indiscernibility

  $x\,W_i\,y$ iff $f(x,a) = f(y,a)$ for some $a \in A$.
• Weak similarity

  $x\,W_s\,y$ iff $f(x,a) \cap f(y,a) \neq \emptyset$ for some $a \in A$.
• Complementarity

  $x\,\mathrm{Com}\,y$ iff $f(x,a) = \mathrm{VAL}_a \setminus f(y,a)$.
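
The relations listed above can be checked mechanically on a small nondeterministic information table. The following is only a minimal sketch: the table `f`, the attribute list `A`, and all function names are hypothetical illustrations and do not come from [25.53].

```python
# Hypothetical nondeterministic information table:
# f[(object, attribute)] is the SET of values the object may take.
f = {
    ("x1", "color"): {"red", "blue"},
    ("x2", "color"): {"red"},
    ("x3", "color"): {"green"},
    ("x1", "size"): {"small"},
    ("x2", "size"): {"small"},
    ("x3", "size"): {"small", "large"},
}
A = ["color", "size"]  # attribute set (assumed for the example)


def similar(x, y):
    """Value sets intersect for ALL attributes."""
    return all(f[(x, a)] & f[(y, a)] for a in A)


def included(x, y):
    """Value sets of x are contained in those of y for ALL attributes."""
    return all(f[(x, a)] <= f[(y, a)] for a in A)


def indiscernible(x, y):
    """Value sets coincide for ALL attributes."""
    return all(f[(x, a)] == f[(y, a)] for a in A)


def weakly_indiscernible(x, y):
    """Value sets coincide for SOME attribute."""
    return any(f[(x, a)] == f[(y, a)] for a in A)


def weakly_similar(x, y):
    """Value sets intersect for SOME attribute."""
    return any(f[(x, a)] & f[(y, a)] for a in A)
```

In this toy table, `x1` and `x2` are similar but not indiscernible, while `x1` and `x3` are only weakly similar, because their color sets are disjoint.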


We also mention the logic for data analysis 
(DAL) [25.57], which is meant to deal with approxi- 
mation spaces with more than one equivalence relation 
(X, Ri). 

Besides modal logic, in standard rough set theory 
based on one equivalence relation, several authors have 
dealt with a many-valued logic approach [25.58, 59], 
also with some criticism from the point of view of 
the interpretation of results [25.60]. On the other hand, 
there have been only a few attempts to link generalized 
rough sets and many-valued logic. One of the reasons 
is the intrinsic difficulty that arises when trying to de- 
fine intersection and union of rough sets (in an algebraic 
context defining a lattice and not only a poset) without 
imposing some restrictions (see, for example, [25.61, 62]).

A recent work [25.63] deals with many-valued logic in coverings and in
particular with the lower and upper approximations defined in (25.6).
The novelty of the approach lies in the introduction of a
subordination relation among objects

$$x \preceq y \quad\text{iff}\quad \forall C \in \mathcal{C}\; (y \in C \Rightarrow x \in C),$$

which is strictly linked to the notion of neighborhood. Indeed,

$$x \preceq y \quad\text{iff}\quad x \in \gamma(y).$$


We also remark that a similar preorder relation defined 
by a topology is used in the bitopological approach to 
dominance-based rough sets [25.64]. This link could 
bring new insight into the many-valued approach to 
covering-based rough sets. 

The syntax of the logic in [25.63] consists of two types of variables:
object variables $x, y, \ldots$ and set variables $A, B, \ldots$ Atomic
formulae are $x \preceq y$ and $x \in A$ (where $A$ can be a set
variable or a composition of set variables), and compound formulae are
obtained with the usual logical connectives $\neg, \wedge, \vee$. The
axioms are given in the form of a sequent calculus, and the
interpretation mapping $I$ is given with respect to a covering
$\mathcal{C}$ on atomic formulae as

$$I(x \preceq y) = \begin{cases} t & \text{if } v(x) \preceq v(y), \\ f & \text{otherwise,} \end{cases}$$

where $v$ maps each object variable to an object in the universe $U$, and

$$I(x \in A) = \begin{cases} t & \text{if } v(x) \in \underline{\mathrm{apr}}(w(A)), \\ f & \text{if } v(x) \in \underline{\mathrm{apr}}(w(A)^c), \\ u & \text{otherwise,} \end{cases}$$


where w maps each set variable and set formula to 
a subset of objects and u is a third truth value rep- 
resenting the unknown. The interpretation extends to 
compound formulae by truth functional application of 
Kleene three-valued logic. The logic is proven to be 
sound but complete only with respect to the sublan- 
guage of atomic formulae. 

We remark that this logic suffers from the known problems of using
a three-valued logic, such as Kleene's, to capture an epistemic notion
like uncertainty. For instance, even if we are not sure whether an
element $x \in A$, we can undoubtedly say that $(x \in A)$ or
$\neg(x \in A)$ (tertium non datur). On the contrary, with the above
interpretation we obtain $I((x \in A) \vee \neg(x \in A)) = u$
whenever $x$ is in the boundary of $A$, that is, $I(x \in A) = u$.
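
The failure of the excluded middle can be reproduced with a few lines of code. This is a minimal sketch under stated assumptions: truth values are encoded numerically (t = 1, u = 0.5, f = 0), the connectives are those of Kleene's strong three-valued logic, and the sets `lower_a` and `lower_not_a` are hypothetical stand-ins for the lower approximations of $w(A)$ and of its complement.

```python
# Numeric encoding of the three truth values (an assumption of this sketch).
T, U, F = 1.0, 0.5, 0.0


def k_not(a):
    """Kleene negation."""
    return 1.0 - a


def k_and(a, b):
    """Kleene conjunction."""
    return min(a, b)


def k_or(a, b):
    """Kleene disjunction."""
    return max(a, b)


def interpret_membership(obj, lower_a, lower_not_a):
    """t inside the lower approximation of A, f inside the lower
    approximation of the complement, u otherwise (boundary)."""
    if obj in lower_a:
        return T
    if obj in lower_not_a:
        return F
    return U


# Hypothetical universe {1, 2, 3, 4}; objects 2 and 3 form the boundary.
lower_a = {1}
lower_not_a = {4}

value = interpret_membership(2, lower_a, lower_not_a)
excluded_middle = k_or(value, k_not(value))
```

For the boundary object the disjunction evaluates to `U`, not `T`, which is exactly the behavior criticized in the text.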


25.2.2 Topology 


We saw in Sect. 25.1.4 that the subsystem approach can 
be generalized by the help of topological notions. Here, 
we further develop this topic and show which covering- 
based approximations have a topological behavior. 

Let us consider a lattice structure and define on it a notion of
closure [25.65, 66].

Definition 25.2
Given a lattice $\mathcal{L}$, a map $c : \mathcal{L} \to \mathcal{L}$ is
a closure operator if for all $x, y \in \mathcal{L}$:

(C1op) $x \le c(x)$
(C2op) If $x \le y$ then $c(x) \le c(y)$
(C3op) $c(c(x)) = c(x)$

The map $c$ is a topological closure if, in addition to (C1op)-(C3op),
it satisfies

(C4op) $c(a) \vee c(b) = c(a \vee b)$.

The map $c$ is an Alexandroff closure if, in addition to (C1op)-(C3op),
it satisfies

(C5op) $\bigvee_j c(a_j) = c\bigl(\bigvee_j a_j\bigr)$.

Of course, any Alexandroff closure is a topological one and, on
a finite universe, the two notions coincide.

On a complemented lattice, an interior operator is defined by duality
as $i(x) = c(x')'$, and properties dual to (C1op)-(C5op) hold. On the
other hand, if the lattice is not complemented, an interior operator
must be explicitly defined, as discussed in Sect. 25.1.4.
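
On a finite powerset lattice the closure axioms can be verified by brute force. The sketch below assumes a hypothetical three-element universe `X` and a partition `BLOCKS`; the map `upper` is the Pawlak upper approximation induced by that partition, and `interior` is obtained by the duality $i(x) = c(x')'$. All names are illustrative.

```python
from itertools import combinations


def powerset(universe):
    """All subsets of a finite universe, as frozensets."""
    items = sorted(universe)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def is_closure(c, universe):
    """Check (C1op)-(C3op): extensive, monotone, idempotent."""
    subsets = powerset(universe)
    extensive = all(a <= c(a) for a in subsets)
    monotone = all(c(a) <= c(b) for a in subsets for b in subsets if a <= b)
    idempotent = all(c(c(a)) == c(a) for a in subsets)
    return extensive and monotone and idempotent


def is_topological(c, universe):
    """Check (C4op) in addition: c(a) | c(b) == c(a | b)."""
    subsets = powerset(universe)
    return is_closure(c, universe) and all(
        c(a) | c(b) == c(a | b) for a in subsets for b in subsets)


# Hypothetical example data.
X = frozenset({1, 2, 3})
BLOCKS = [frozenset({1, 2}), frozenset({3})]


def upper(a):
    """Union of all blocks that meet a."""
    return frozenset().union(*[b for b in BLOCKS if b & a])


def interior(a):
    """Dual interior operator: complement of the closure of the complement."""
    return X - upper(X - a)
```

Because the universe is finite, passing `is_topological` here also means the operator is an Alexandroff closure, in line with the remark after Definition 25.2.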

To the above algebraic definition of a closure opera- 
tor there corresponds an equivalent one based on closed 
sets as we saw in Sect. 25.1.4. More precisely: 


Definition 25.3
Let $\mathcal{L}$ be a lattice and $\mathcal{C} \subseteq \mathcal{L}$
a subset of elements which is closed under arbitrary intersections,
that is, axioms (C1)-(C3) are satisfied. Then, a closure operator
satisfying properties (C1op)-(C3op) is defined as
$c(a) = \bigwedge\{u \in \mathcal{C} : a \le u\}$.

A topological closure is such that the union of a finite family of
closed elements is closed, i.e., $(\bigvee_{i \in I} c_i) \in \mathcal{C}$
with $I$ a finite set of indexes, and an Alexandroff topology is
obtained if $\mathcal{C}$ is closed under arbitrary unions.


Now, while subsystem-based rough sets are naturally based on
a topological ground, covering rough sets can also be classified with
respect to topological properties. First of all, let us consider the
lower and upper approximations defined in (25.5). They are an interior
and a closure operator, respectively. On the other hand, the upper
approximation in (25.6) is not a closure since, in general, it does
not satisfy condition (C3).

Moreover, let us consider a covering $\mathcal{C}(X)$ of a universe
and the neighborhood of an element $x \in X$ with respect to
$\mathcal{C}(X)$, defined as $\gamma(x)$ in Definition 25.1. It is well
known that an Alexandroff closure operator is induced as the map
$c_\gamma : \mathcal{P}(X) \to \mathcal{P}(X)$ defined as
$c_\gamma(A) = \bigcup\{\gamma(a) : a \in A\}$, which corresponds to the
upper approximation $U_6$ in (25.13), and consequently the dual
operator $L_6$ is an interior operator.

More generally, all the upper approximations $U_1$-$U_6$ are closure
operators. In particular, $U_4$-$U_6$ are also topological closures,
and since duality holds with respect to all approximation pairs but
$(L_3, U_3)$ and since $L_3 = L_4$, all lower approximations are
interior operators. This result can be easily established by checking
that the properties satisfied by the approximations include those of
Definition 25.2 (see Table 25.1 in [25.39]).


25.3 Conclusion


Three equivalent approaches to Pawlak rough sets can be given based on
an equivalence relation, a partition of the universe, or a Boolean
algebra. These different views generate three different possible
generalizations of the classical model: binary relation, covering, and
subsystem-based rough sets. We have reviewed these models and given
the definitions of rough approximations in the different contexts. It
can be seen that different models show interesting mathematical
properties. In particular, binary relation-based rough sets have their
roots in modal logic, whereas covering and subsystem-based rough sets
are linked to topology.

Nowadays, generalized rough sets are continuously being defined and we
can encounter, for instance, more than 20 definitions of
approximations based on coverings [25.4]. There is, however, a lack of
interpretation in this collection. Efforts should be made to
understand the meaning and usefulness of the already defined
approximations. This should also be considered when defining new
approximations. Besides an intrinsic theoretical interest, a logical
approach could also be useful in this direction. Indeed, while in the
case of binary relation-based rough sets we have a clear logical
framework, the same cannot be said about covering and subsystem-based
rough sets, where only a few results are known.

26. Fuzzy-Rough Hybridization


Masahiro Inuiguchi, Wei-Zhi Wu, Chris Cornelis, Nele Verbiest 


Fuzzy sets and rough sets are known as uncertainty models. They were
proposed to treat different aspects of uncertainty. Therefore, it is
natural to combine them to build more powerful mathematical tools for
treating problems under uncertainty. In this chapter, we describe the
state-of-the-art in the combinations of fuzzy and rough sets, divided
into three parts.

In the first part, we describe two kinds of models of fuzzy rough
sets: one is the classification-oriented model and the other is the
approximation-oriented model. We describe the fundamental properties
and show the relations between those models. Moreover, because those
models use logical connectives such as conjunction and implication
functions, the selection of logical connectives can sometimes be
a question. We therefore propose a logical connective-free model of
fuzzy rough sets.

In the second part, we develop a generalized 
fuzzy rough set model. We first introduce general 
types of belief structures and their induced dual 
pairs of belief and plausibility functions in the 
fuzzy environment. We then build relationships 
between belief and plausibility functions in the 
Dempster-Shafer theory of evidence and the lower 
and upper approximations in rough set theory in 
various situations. We also provide the potential 
applications of the main results to intelligent in- 
formation systems. 

In the third part, we give an overview of the practical applications
of fuzzy rough sets. The main focus will be on the machine-learning
domain. In particular, we review fuzzy-rough approaches for attribute
selection, instance selection, classification, and prediction.


26.1 Introduction to Fuzzy-Rough Hybridization
26.2 Classification- Versus Approximation-Oriented Fuzzy Rough Set Models
     26.2.1 Classification-Oriented Fuzzy Rough Sets
     26.2.2 Approximation-Oriented Fuzzy Rough Sets
     26.2.3 Relations Between Two Kinds of Fuzzy Rough Sets
     26.2.4 The Other Approximation-Oriented Fuzzy Rough Sets
     26.2.5 Remarks
26.3 Generalized Fuzzy Belief Structures with Application in Fuzzy Information Systems
     26.3.1 Belief Structures and Belief Functions
     26.3.2 Belief Structures of Rough Approximations
     26.3.3 Conclusion of This Section
26.4 Applications of Fuzzy Rough Sets
     26.4.1 Applications in Machine Learning
     26.4.2 Other Applications
References


26.1 Introduction to Fuzzy-Rough Hybridization 


Rough set approaches [26.1,2] have been successfully applied to
various fields related to data analysis, knowledge discovery, decision
analysis, and so on. In order to expand the application area and to
develop the theory further, rough sets have been generalized under
various settings. There are two different generalizations. One relaxes
the precision so that the sizes of lower and upper approximations are
controlled by a precision parameter. This generalized rough set is
called a variable precision rough set. The other generalizes the
approximation space, i.e., the structure of background knowledge.
Many researchers generalized an equivalence relation, which is often
referred to as an indiscernibility relation, to a general binary
relation or a family. Many other researchers [26.3-23] generalized an
equivalence relation to a fuzzy binary relation or a family of fuzzy
sets.

In this chapter, we describe the generalizations 
of rough sets in the latter sense. More precisely, 
we concentrate on the fuzzy generalizations of rough 
set approaches called fuzzy rough hybridizations. 
Fuzzy rough sets were originally proposed by Naka- 
mura [26.3] and by Dubois and Prade [26.4,5]. 
The fundamental properties of fuzzy rough sets have 
been investigated by Dubois and Prade [26.4,5] and 
Radzikowska and Kerre [26.9]. In those studies, an 
equivalence relation of approximation space in the orig- 
inal rough sets is generalized to a fuzzy equivalence 
relation. Greco et al. [26.7] proposed fuzzy rough sets 
under a fuzzy dominance relation. Those fuzzy rough 
sets are based on possibility and necessity measures di- 
rectly. Moreover, this type of fuzzy rough sets is defined 
under more generalized settings [26.11, 15] and differ- 
ent types of fuzzy rough sets were proposed based on 
certainty qualifications by Inuiguchi and Tanino [26.10, 
12] and also based on modifier functions by Greco 
et al. [26.24, 25]. The fuzzy rough set model can be 
used to deal with attribute reduction in information sys- 
tems with fuzzy decision while the fuzzy rough set 
model can be employed in reasoning and knowledge 
acquisition with decision tables with real-valued condi- 
tional attributes or quantitative data (see, for example, 
[26.26-36]). 

In the first part of this chapter, we introduce three models of fuzzy
rough sets. Those models are classified into two groups, i.e.,
classification-oriented fuzzy rough set models and
approximation-oriented fuzzy rough set models, a distinction proposed
by Inuiguchi [26.37] originally in the crisp setting. In the
classification-oriented models, we are interested in a set to which objects
belong. We evaluate each object whether its member- 
ship to a set X is consistent with all information we 
have at hand or not. The positive region of X is de- 
fined by collecting all objects whose memberships to 
X are consistent with whole information. The possi- 
ble region of X is defined by collecting all objects 
whose memberships to X are conceivable from some 
part of information but not consistent with all infor- 
mation. Then the fuzzy rough set of X is defined by 
a pair of the positive and possible regions of X. On the contrary, in
approximation-oriented models, we are interested in the approximations
of a set by using elementary sets of a family. We approximate a set X
by unions of the elementary sets and by intersections of the
complementary sets of the elementary sets. The lower and upper
approximations are defined by the inner and outer approximations of X,
respectively. A rough set of X is defined by a pair of the lower and
upper approximations. We show that one of the three models belongs to
the group of classification-oriented models and the remaining two
models belong to the group of approximation-oriented models.

Another important method used to deal with un- 
certainty in intelligent systems is the Dempster-Shafer 
theory of evidence [26.38]. Shafer’s belief and plausi- 
bility functions are constructed under the assumption 
that the focal elements in the belief structure are all 
crisp. In some situations, it seems to be quite natural 
that the evidence mass may be assigned to a fuzzy sub- 
set of the universe of discourse. In fact, combining the 
Dempster-Shafer theory and fuzzy set theory has been 
suggested to be a way to deal with different kinds of un- 
certain information in intelligent systems in a number of 
studies. It is demonstrated that the lower and upper ap- 
proximation operators in rough set theory have strong 
relationship with the belief and plausibility functions in 
the Dempster-Shafer theory of evidence [26.21, 23, 39- 
44]. The Dempster-Shafer theory of evidence may be 
used to analyze knowledge acquisition in information 
systems (see, for example, [26.45—49]). 

In the second part of this chapter, we will explore 
the relationships between belief and plausibility func- 
tions in the Dempster-Shafer theory of evidence and 
the lower and upper approximations in rough set theory 
with their potential applications to intelligent informa- 
tion systems. 

Both fuzzy set and rough set theories have fostered 
broad research communities and have been applied in 
a wide range of settings. More recently, this has also ex- 
tended to the hybrid fuzzy rough set models. The third 
part of this chapter tries to give a sample of those appli- 
cations, which are in particular numerous for machine 
learning but which also cover many other fields, like 
image processing, decision making, and information re- 
trieval. 

Note that we do not consider applications that 
simply involve a joint application of fuzzy sets and 
rough sets, like for instance a rough classifier that in- 
duces fuzzy rules. Rather, we focus on applications that 
specifically involve one of the fuzzy rough set models 
discussed in the previous sections. 




This chapter is organized as follows. In the next section, three
models of fuzzy rough sets are explained, divided into two groups. In
Sect. 26.3, we introduce generalized fuzzy belief structures with
application in fuzzy information systems. In Sect. 26.4, we give an
overview of the practical applications of fuzzy rough sets, focusing
on the machine-learning domain.


26.2 Classification- Versus Approximation-Oriented
Fuzzy Rough Set Models


In this section, we review three kinds of fuzzy rough 
sets from classification-oriented and approximation- 
oriented points of view. Focusing on the membership 
of an object to a set X under the indiscernibility re- 
lation, the classical rough set defined by a pair of 
lower and upper approximations of a set X can be seen 
as a classifier of objects into three disjoint regions: 
positive, negative, and boundary regions of a set X. 
Namely, the lower approximation defines the positive 
region, the complement of the upper approximation de- 
fines the negative region and the difference between 
upper and lower approximations defines the boundary 
region. On the other hand, focusing on the approx- 
imations of X by means of elementary sets of the 
partition, the rough set of X defines the inner and 
outer approximations of X. Namely, the lower approx- 
imation defines the inner approximation and the upper 
approximation defines the outer approximation. Those 
two different views of rough sets give different defi- 
nitions of rough sets in the generalized settings (see 
Inuiguchi [26.50]). In this section, we describe fuzzy 
rough sets in a generalized setting from those points of 
view and show the fundamental properties, differences, 
and similarities. 


26.2.1 Classification-Oriented 
Fuzzy Rough Sets 


Definitions in Crisp Setting 
In this subsection, we define fuzzy rough sets under the
interpretation of rough sets as a classification of objects into
positive, negative, and boundary regions of a set and describe their
properties. As an introduction, we first describe the definitions of
positive and possible regions of a set in the crisp setting. Let $U$
be a set of all objects. Assume that we do not know which objects fit
a particular concept $C$, but we have pieces of information telling us
that some objects fit $C$ and that the other objects do not. Let
$X \subseteq U$ be the set of objects which are supposed to fit $C$
according to the information and $U - X$ the set of objects which are
supposed not to fit $C$. On the other hand, there is knowledge about
$C$ expressed by a binary relation $P \subseteq U \times U$. Under the
binary relation $P$, we presume that $y$ fits $C$ from the facts
$(y, x) \in P$ and $x$ fits $C$.

Under this circumstance, we investigate credible members of $X$ and
plausible members of $X$. Objects whose membership to $X$ is
consistent with the knowledge can be understood as credible members of
$X$, while objects whose membership to $X$ is presumable from the
information and the knowledge can be understood as plausible members.
For convenience, we define $P(x) = \{y \in U \mid (y, x) \in P\}$,
which is the set of objects whose membership to $X$ is presumed from
the fact $x \in X$. Therefore, if $x \in X$ satisfies
$\forall y \in P(x),\ y \in X$, or simply $P(x) \subseteq X$, then $x$
can be considered a credible member of $X$. Thus, the set of credible
members of $X$ is defined by

$$P_*(X) = \{x \in X \mid P(x) \subseteq X\} = X \cap \{x \in U \mid P(x) \subseteq X\}. \tag{26.1}$$

On the other hand, we may presume $x \in X$ if $x \in X$ or
$\exists y \in X,\ x \in P(y)$ under the information and the
knowledge. Then the set of plausible members of $X$ can be defined by

$$P^*(X) = X \cup \{x \in U \mid \exists y \in X,\ x \in P(y)\}. \tag{26.2}$$

$P_*(X)$ is called the positive region of $X$ and $P^*(X)$ is called
the possible region of $X$. Note that we do not assume the reflexivity
of $P$, i.e., $\forall x \in U,\ (x, x) \in P$. This is why we take
the intersection with $X$ in the definition of $P_*(X)$ and the union
with $X$ in the definition of $P^*(X)$. The intersection and the union
can be dropped when $P$ is reflexive.
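
The crisp definitions (26.1) and (26.2) translate directly into set comprehensions. The following is a minimal sketch; the universe `U`, the relation `P`, the set `X`, and the function names are hypothetical example data, with pairs in `P` written as `(y, x)` as in the text.

```python
def image(P, x, U):
    """P(x) = {y in U | (y, x) in P}: objects presumed to fit once x fits."""
    return {y for y in U if (y, x) in P}


def positive_region(P, X, U):
    """(26.1): members of X whose presumed objects all lie in X."""
    return {x for x in X if image(P, x, U) <= X}


def possible_region(P, X, U):
    """(26.2): X together with everything presumed from a member of X."""
    return X | {x for x in U if any(x in image(P, y, U) for y in X)}


# Hypothetical example: P is neither reflexive nor symmetric.
U = {1, 2, 3, 4}
P = {(2, 1), (3, 2)}   # from "1 fits" presume "2 fits"; from "2 fits" presume "3 fits"
X = {1, 2}

lower = positive_region(P, X, U)
upper = possible_region(P, X, U)
```

Here object 2 is not credible because it would force object 3 into $X$, while object 3 becomes plausible for the same reason. The inclusions $P_*(X) \subseteq X \subseteq P^*(X)$ hold although `P` is not reflexive.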

Now suppose that the knowledge about $C$ is expressed by a binary
relation $Q \subseteq U \times U$ instead of $P$. Under the binary
relation $Q$, we presume that $y$ does not fit $C$ from the facts
$(y, x) \in Q$ and $x$ does not fit $C$. In this case, we directly
obtain positive and possible regions of $U - X$, respectively, by

$$Q_*(U - X) = \{x \in U - X \mid Q(x) \subseteq U - X\} = (U - X) \cap \{x \in U \mid Q(x) \subseteq U - X\}, \tag{26.3}$$

$$Q^*(U - X) = (U - X) \cup \{x \in U \mid \exists y \in U - X,\ x \in Q(y)\}. \tag{26.4}$$

Because an object that is not a member of $Q_*(U - X)$ can be seen as
a plausible member of $X$ and an object which is not a member of
$Q^*(U - X)$ can be seen as a credible member of $X$, we may define
positive and possible regions of $X$ by

$$Q_*(X) = U - Q^*(U - X), \qquad Q^*(X) = U - Q_*(U - X). \tag{26.5}$$

Inuiguchi [26.50] investigated the properties of those positive and
possible regions.


Definitions in Fuzzy Setting and Their Properties
We now extend those definitions of positive and possible regions to
the fuzzy setting. First, we assume that a fuzzy set $X \subseteq U$
and a fuzzy binary relation $P \subseteq U \times U$ are given. Their
membership functions $\mu_X(x)$ and $\mu_P(y, x)$ show the membership
degree of $x \in U$ to the fuzzy set $X$ and the degree to which we
presume that $y$ is a member of $X$ from the fact that $x$ is a member
of the fuzzy set $X$, where $\mu_X : U \to [0, 1]$ and
$\mu_P : U \times U \to [0, 1]$. We define $P(x)$ by its membership
function $\mu_{P(x)}(y) = \mu_P(y, x)$.

To define the positive region under this circumstance, we should
consider the consistency degree of the information that $x$ is
a member of $X$ to membership degree $\mu_X(x)$ with the knowledge
$P$. This can be measured by the truth value of the statement
"$y \in P(x)$ implies $y \in X$" under fuzzy sets $P(x)$ and $X$. The
truth value of this statement can be defined by a necessity measure
$\inf_{y \in U} I(\mu_{P(x)}(y), \mu_X(y))$ with an implication
function $I : [0, 1] \times [0, 1] \to [0, 1]$ such that
$I(0, 0) = I(0, 1) = I(1, 1) = 1$, $I(1, 0) = 0$, $I(\cdot, a)$ is
decreasing for any $a \in [0, 1]$ and $I(a, \cdot)$ is increasing for
any $a \in [0, 1]$. Therefore, in analogy to (26.1), the membership
function of the positive region $P_*(X)$ of $X$ can be defined by

$$\mu_{P_*(X)}(x) = \min\Bigl(\mu_X(x),\ \inf_{y \in U} I(\mu_{P(x)}(y), \mu_X(y))\Bigr) = \min\Bigl(\mu_X(x),\ \inf_{y \in U} I(\mu_P(y, x), \mu_X(y))\Bigr), \tag{26.6}$$

where we note that the intersection $C \cap D$ of two fuzzy sets
$C, D \subseteq U$ is normally defined by
$\mu_{C \cap D}(x) = \min(\mu_C(x), \mu_D(x))$, $\forall x \in U$;
$\mu_{C \cap D}$, $\mu_C$ and $\mu_D$ are the membership functions of
$C \cap D$, $C$ and $D$. However, some researchers use t-norms [26.51]
instead of the min operation. A t-norm $t$ is a conjunction function
$t : [0, 1] \times [0, 1] \to [0, 1]$ such that
(t1) $\forall a \in [0, 1]$, $t(a, 1) = t(1, a) = a$ (boundary condition),
(t2) $\forall a, b \in [0, 1]$, $t(a, b) = t(b, a)$ (commutativity), and
(t3) $\forall a, b, c \in [0, 1]$, $t(a, t(b, c)) = t(t(a, b), c)$ (associativity).

Now let us define the possible region when $X$ and $P$ are a fuzzy
set and a fuzzy binary relation, respectively. To do this, we should
define the truth value of the statement "there exists $y \in X$ such
that $x \in P(y)$" under fuzzy sets $X$ and $P(y)$. The truth value of
this statement can be obtained by a possibility measure
$\sup_{y \in U} T(\mu_{P(y)}(x), \mu_X(y))$ with a conjunction function
$T : [0, 1] \times [0, 1] \to [0, 1]$ such that $T(1, 1) = 1$,
$T(0, 0) = T(0, 1) = T(1, 0) = 0$ and $T$ is increasing in both
arguments. Therefore, in analogy to (26.2), the membership function
of the possible region $P^*(X)$ of $X$ can be defined by

$$\mu_{P^*(X)}(x) = \max\Bigl(\mu_X(x),\ \sup_{y \in U} T(\mu_{P(y)}(x), \mu_X(y))\Bigr) = \max\Bigl(\mu_X(x),\ \sup_{y \in U} T(\mu_P(x, y), \mu_X(y))\Bigr), \tag{26.7}$$

where we note that the union $C \cup D$ of two fuzzy sets
$C, D \subseteq U$ is normally defined by
$\mu_{C \cup D}(x) = \max(\mu_C(x), \mu_D(x))$, $\forall x \in U$;
$\mu_{C \cup D}$ is the membership function of $C \cup D$. However,
some researchers use t-conorms [26.51] instead of the max operation.
A t-conorm $s$ is a function $s : [0, 1] \times [0, 1] \to [0, 1]$
such that
(s1) $\forall a \in [0, 1]$, $s(a, 0) = s(0, a) = a$ (boundary condition),
(s2) $\forall a, b \in [0, 1]$, $s(a, b) = s(b, a)$ (commutativity),
(s3) $\forall a, b, c \in [0, 1]$, $s(a, s(b, c)) = s(s(a, b), c)$ (associativity), and
(s4) $\forall a, b, c, d$ such that $a \ge c$ and $b \ge d$, $s(a, b) \ge s(c, d)$ (monotonicity).
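
For a finite universe, the infimum and supremum in (26.6) and (26.7) reduce to `min` and `max`, so both regions can be computed directly. The sketch below is only an illustration under stated assumptions: it picks the Kleene-Dienes implication $I(a, b) = \max(1 - a, b)$ and $T = \min$, both of which satisfy the conditions required above, and the membership dictionaries `mu_X` and `mu_P` are hypothetical data.

```python
def kleene_dienes(a, b):
    """Implication I(a, b) = max(1 - a, b) (one admissible choice)."""
    return max(1.0 - a, b)


def t_min(a, b):
    """Conjunction T(a, b) = min(a, b) (one admissible choice)."""
    return min(a, b)


def positive_region(mu_X, mu_P, U, I=kleene_dienes):
    """(26.6): min(mu_X(x), inf_y I(mu_P(y, x), mu_X(y)))."""
    return {x: min(mu_X[x], min(I(mu_P[(y, x)], mu_X[y]) for y in U))
            for x in U}


def possible_region(mu_X, mu_P, U, T=t_min):
    """(26.7): max(mu_X(x), sup_y T(mu_P(x, y), mu_X(y)))."""
    return {x: max(mu_X[x], max(T(mu_P[(x, y)], mu_X[y]) for y in U))
            for x in U}


# Hypothetical two-object example.
U = ["a", "b"]
mu_X = {"a": 0.8, "b": 0.3}
mu_P = {("a", "a"): 1.0, ("a", "b"): 0.6,
        ("b", "a"): 0.5, ("b", "b"): 1.0}

lower = positive_region(mu_X, mu_P, U)
upper = possible_region(mu_X, mu_P, U)
```

With these numbers the positive region has degrees 0.5 and 0.3 and the possible region 0.8 and 0.5, so the memberships of $X$ are sandwiched between the two regions as expected.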

Note that we do not assume the reflexivity of $P$, i.e.,
$\mu_P(x, x) = 1$, $\forall x \in U$, so that we take the minimum
between $\mu_X$ and $\inf_{y \in U} I(\mu_{P(x)}(y), \mu_X(y))$ in
(26.6) and the maximum between $\mu_X$ and
$\sup_{y \in U} T(\mu_{P(y)}(x), \mu_X(y))$ in (26.7). When $P$ is
reflexive, $I(1, a) \le a$ and $T(1, a) \ge a$ for all $a \in [0, 1]$,
we have

$$\mu_{P_*(X)}(x) = \inf_{y \in U} I(\mu_P(y, x), \mu_X(y)), \qquad \mu_{P^*(X)}(x) = \sup_{y \in U} T(\mu_P(x, y), \mu_X(y)). \tag{26.8}$$

Those definitions of lower and upper approximations have been proposed
by Dubois and Prade [26.4,5] and Radzikowska and Kerre [26.9]. They
assumed the reflexivity of $P$ and $I(1, a) = T(1, a) = a$ for all
$a \in [0, 1]$. Moreover, the definitions of (26.8) are used even when
$P$ is not reflexive and neither $I$ nor $T$ satisfies the boundary
conditions $I(1, a) = T(1, a) = a$ for all $a \in [0, 1]$ [26.15, 52].
In such a generalized situation, we may lose the inclusion of
$P_*(X)$ in $X$ and that of $X$ in $P^*(X)$ for $P_*(X)$ and $P^*(X)$
defined by (26.8). The definitions of $P_*(X)$ and $P^*(X)$ by (26.6)
and (26.7), obtained from the interpretations of positive and possible
regions of $X$, satisfy the inclusion of $P_*(X)$ in $X$ and that of
$X$ in $P^*(X)$ even in the generalized situation.

Using the positive region $P_*(X)$ and the possible region $P^*(X)$,
we can define a fuzzy rough set of $X$ as a pair $(P_*(X), P^*(X))$.
We call such fuzzy rough sets classification-oriented fuzzy rough sets
under a positively extensive relation $P$ of $X$ (for short, CP-fuzzy
rough sets). Note that the relation $P$ depends on the meaning of the
set $X$. Thus, we cannot always define the CP-fuzzy rough set of
$U - X$ by the same relation $P$.

To define a CP-fuzzy rough set of $U - X$, we should introduce another
fuzzy relation $Q \subseteq U \times U$ such that
$\mu_{Q(x)}(y) = \mu_Q(y, x)$ represents the degree to which we presume
an object $y$ to be a member of $U - X$ from the fact that $x$ is
a member of $U - X$, where $\mu_Q : U \times U \to [0, 1]$ is the
membership function of the fuzzy relation $Q$. In the same way, we
define positive and possible regions of $U - X$ under the fuzzy
relation $Q$ by the following membership functions

$$\mu_{Q_*(U - X)}(x) = \min\Bigl(n(\mu_X(x)),\ \inf_{y \in U} I(\mu_Q(y, x), n(\mu_X(y)))\Bigr), \tag{26.9}$$

$$\mu_{Q^*(U - X)}(x) = \max\Bigl(n(\mu_X(x)),\ \sup_{y \in U} T(\mu_Q(x, y), n(\mu_X(y)))\Bigr), \tag{26.10}$$

where $U - X$ is defined by the membership function $n(\mu_X(\cdot))$
and $n : [0, 1] \to [0, 1]$ is a strong negation, i.e., a decreasing
function such that $n(n(a)) = a$, $a \in [0, 1]$ (involutive). The
involution implies the continuity of $n$.

Using $Q_*(U - X)$ and $Q^*(U - X)$, in analogy to (26.5), we can
define the positive region $Q_*(X)$ and the possible region $Q^*(X)$
of $X$ by the following membership functions

$$\mu_{Q_*(X)}(x) = \min\Bigl(\mu_X(x),\ \inf_{y \in U} n\bigl(T(\mu_Q(x, y), n(\mu_X(y)))\bigr)\Bigr), \tag{26.11}$$

$$\mu_{Q^*(X)}(x) = \max\Bigl(\mu_X(x),\ \sup_{y \in U} n\bigl(I(\mu_Q(y, x), n(\mu_X(y)))\bigr)\Bigr). \tag{26.12}$$

We can define a fuzzy rough set of $X$ as a pair $(Q_*(X), Q^*(X))$
with the positive region $Q_*(X)$ and the possible region $Q^*(X)$. We
call this type of rough sets classification-oriented fuzzy rough sets
under a negatively extensive relation $Q$ of $X$ (for short, CN-fuzzy
rough sets).

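Let us discuss the properties of CP- and CN-fuzzy rough sets. By
definition, we have

$$P_*(X) \subseteq X \subseteq P^*(X), \qquad Q_*(X) \subseteq X \subseteq Q^*(X), \tag{26.13}$$
$$P_*(\emptyset) = P^*(\emptyset) = Q_*(\emptyset) = Q^*(\emptyset) = \emptyset, \tag{26.14}$$
$$P_*(U) = P^*(U) = Q_*(U) = Q^*(U) = U, \tag{26.15}$$
$$P_*(X \cap Y) = P_*(X) \cap P_*(Y), \qquad P^*(X \cup Y) = P^*(X) \cup P^*(Y), \tag{26.16}$$
$$Q_*(X \cap Y) = Q_*(X) \cap Q_*(Y), \qquad Q^*(X \cup Y) = Q^*(X) \cup Q^*(Y), \tag{26.17}$$
$$X \subseteq Y \text{ implies } P_*(X) \subseteq P_*(Y), \qquad X \subseteq Y \text{ implies } P^*(X) \subseteq P^*(Y), \tag{26.18}$$
$$X \subseteq Y \text{ implies } Q_*(X) \subseteq Q_*(Y), \qquad X \subseteq Y \text{ implies } Q^*(X) \subseteq Q^*(Y), \tag{26.19}$$
$$P_*(X \cup Y) \supseteq P_*(X) \cup P_*(Y), \qquad P^*(X \cap Y) \subseteq P^*(X) \cap P^*(Y), \tag{26.20}$$
$$Q_*(X \cup Y) \supseteq Q_*(X) \cup Q_*(Y), \qquad Q^*(X \cap Y) \subseteq Q^*(X) \cap Q^*(Y), \tag{26.21}$$

where the inclusion relation $A \subseteq B$ between two fuzzy sets
$A$ and $B$ is defined by $\mu_A(x) \le \mu_B(x)$ for all $x \in U$.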

The properties satisfied under some conditions are listed as follows
(see Inuiguchi [26.37]):

(1) When $I(a, b) = n(T(a, n(b)))$ for all $a, b \in [0, 1]$ and $Q$
    is the converse of $P$, i.e., $\mu_Q(x, y) = \mu_P(y, x)$ for all
    $x, y \in U$, we have

    $$P_*(X) = U - Q^*(U - X) = Q_*(X), \tag{26.22}$$
    $$P^*(X) = U - Q_*(U - X) = Q^*(X). \tag{26.23}$$

(2) When $T(a, I(a, b)) \le b$ holds for all $a, b \in [0, 1]$, we have

    $$X \supseteq P^*(P_*(X)) \supseteq P_*(X) \supseteq P_*(P_*(X)), \tag{26.24}$$
    $$X \subseteq Q_*(Q^*(X)) \subseteq Q^*(X) \subseteq Q^*(Q^*(X)). \tag{26.25}$$

(3) When $I(a, T(a, b)) \ge b$ holds for all $a, b \in [0, 1]$, we have

    $$X \subseteq P_*(P^*(X)) \subseteq P^*(X) \subseteq P^*(P^*(X)), \tag{26.26}$$
    $$X \supseteq Q^*(Q_*(X)) \supseteq Q_*(X) \supseteq Q_*(Q_*(X)). \tag{26.27}$$

(4) Let $P$ and $Q$ be $T'$-transitive. The following assertions are
    valid:
    (a) When $I$ is upper semicontinuous and satisfies
        $I(a, I(b, c)) = I(T'(b, a), c)$ for all $a, b, c \in [0, 1]$,
        we have

        $$P_*(P_*(X)) = P_*(X), \qquad Q^*(Q^*(X)) = Q^*(X). \tag{26.28}$$

    (b) When $T = T'$ is lower semicontinuous and satisfies
        $T(a, T(b, c)) = T(T(a, b), c)$ for all $a, b, c \in [0, 1]$
        (associativity), we have

        $$P^*(P^*(X)) = P^*(X), \qquad Q_*(Q_*(X)) = Q_*(X). \tag{26.29}$$

(5) When $P$ and $Q$ are reflexive and $T$-transitive, the following
    assertions are valid:
    (a) If $I(a, \cdot)$ is upper semicontinuous, $I(1, a) \le a$, and
        $T = \xi[I]$ is associative, then we have

        $$P^*(P_*(X)) = P_*(X), \qquad Q_*(Q^*(X)) = Q^*(X). \tag{26.30}$$

    (b) If $I(a, b) = n(\xi[I](a, n(b)))$ and the conditions of (a) are
        satisfied, then we have

        $$P_*(P^*(X)) = P^*(X), \qquad Q^*(Q_*(X)) = Q_*(X). \tag{26.31}$$
(26.31) 


Here a fuzzy relation $P$ is said to be $T'$-transitive if and only if
$P$ satisfies $\mu_P(x, z) \ge T'(\mu_P(x, y), \mu_P(y, z))$ for all
$x, y, z \in U$ and for a conjunction function $T'$. We can generate
a function $\xi[I] : [0, 1] \times [0, 1] \to [0, 1]$ by
$\xi[I](a, b) = \inf\{s \in [0, 1] \mid I(a, s) \ge b\}$ when
a function $I : [0, 1] \times [0, 1] \to [0, 1]$ is given. $\xi[I]$ is
a conjunction function when $I$ satisfies $I(1, a) < 1$ for all
$a \in [0, 1)$.

Concerning to the assumption of (1), it is known 
that a function J’ defined by I’ (a, b) = n(T(a,n(b))) is 
an implication function and that a function T’ defined 
by T’(a,b) =n((a,n(b))) is a conjunction function 
(see, for example, Inuiguchi and Sakawa [26.51,53)). 
The assumption of (2) corresponds to modus ponens, 
i.e., A and (A — B) implies B. Therefore, it is a nat- 
ural assumption. However, this cannot hold for any 
implication and conjunction functions. For example, 
consider functions T(a,b) = min(a,b) and I(a, b) = 
max(1—a,b) which are often used in possibility the- 
ory. T(a, I(a, b)) < b does not always hold. On the other 
hand, the assumption holds for any T and J such that 
T(a, b) < min(a, b) for all a,b € [0,1] and Z(a, b) < b 
for all a,b € [0, 1] satisfying a > b. Thus, a t-norm T 
and a residual implication / of a t-norm satisfies the as- 
sumption, i. e., J is defined by I(a, b) = sup{s € [0, 1] | 
t(a,s) < b}, for a,b € [0, 1]. The assumption of (3) is 
dual with that of (2). Namely, for any implication func- 
tion 7, there exists a conjunction function T” such that 
I(a, b) = n(T’(a, n(b))), and for any conjunction func- 
tion T, there exists an implication function /’ such that 
T(a, b) = n(I'(a, n(b))). Using T’ and I’, the assump- 
tion I(a, T(a, b)) = b is equivalent to T’(a, T’ (a, b)) < b 
which is the same as the assumption of (2). 

The assumption of (3) is satisfied with $I$ and $T$ such that
$I(a, b) \ge \max(n(a), b)$ for all $a, b \in [0, 1]$ and $T(a, b) \ge b$ for
all $a, b \in [0, 1]$ satisfying $a > n(b)$. The assumption of (4)-(a) is
satisfied with residual implication functions of lower semicontinuous
t-norms $T'$ and S-implication functions with respect to lower
semicontinuous t-norms $T'$, where an S-implication function $I$ with
respect to the t-norm $T'$ is defined by $I(a, b) = n(T'(a, n(b)))$,
$a, b \in [0, 1]$, with a strong negation $n$. The assumption of (4)-(b) is
satisfied with lower semicontinuous t-norms $T$. These assumptions are
satisfied by many well-known implication and conjunction functions.
26.2.2 Approximation-Oriented Fuzzy Rough Sets


Definitions in Crisp Setting 

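In this section, we define fuzzy rough sets under the interpretation of
rough sets as approximation of sets and describe their properties. We
first describe the definitions of lower and upper approximations in the
crisp setting. We assume that a family of subsets of $U$,
$\mathcal{F} = \{F_i \mid i = 1, 2, \ldots, p\}$, is given. Each elementary set
$F_i$ is a meaningful set of objects, such as a set of objects satisfying
some properties. The sets $F_i$ can be seen as information granules with
which we would like to express a set of objects. Given a set
$X \subseteq U$, an understated expression of $X$, or in other words, an inner
approximation of $X$ by means of unions of the $F_i$, is obtained by

$$\mathcal{F}_*^{\cup}(X) = \bigcup \{F_i \in \mathcal{F} \mid F_i \subseteq X\}. \tag{26.32}$$

On the other hand, an overstated expression of $X$, or in other words, an
outer approximation of $X$ by means of unions of the $F_i$, is obtained by

$$\mathcal{F}^*_{\cap}(X) = \bigcap \Big\{ \bigcup_{i \in J} F_i \;\Big|\; \bigcup_{i \in J} F_i \supseteq X,\ J \subseteq \{1, 2, \ldots, p, o\} \Big\}, \tag{26.33}$$

where we define $F_o = U$. We add $F_o$ considering cases where there is no
$J \subseteq \{1, 2, \ldots, p\}$ such that $\bigcup_{i \in J} F_i \supseteq X$. In such
cases, we obtain $\mathcal{F}^*_{\cap}(X) = U$ owing to the existence of
$F_o = U$. $\mathcal{F}_*^{\cup}(X)$ and $\mathcal{F}^*_{\cap}(X)$ are called lower
and upper approximations of $X$, respectively. Applying those
approximations to $U - X$, we obtain $\mathcal{F}_*^{\cup}(U - X)$ and
$\mathcal{F}^*_{\cap}(U - X)$. From those, we obtain

$$\mathcal{F}_*^{\cap}(X) = U - \mathcal{F}^*_{\cap}(U - X) = \bigcup \Big\{ U - \bigcup_{i \in J} F_i \;\Big|\; \bigcup_{i \in J} F_i \supseteq U - X,\ J \subseteq \{1, 2, \ldots, p, o\} \Big\}, \tag{26.34}$$

$$\mathcal{F}^*_{\cup}(X) = U - \mathcal{F}_*^{\cup}(U - X) = \bigcap \{ U - F_i \mid F_i \subseteq U - X,\ i \in \{1, 2, \ldots, p, e\} \}, \tag{26.35}$$

where we define $F_e = \emptyset$. We note that $\mathcal{F}_*^{\cap}(X)$ and
$\mathcal{F}^*_{\cup}(X)$ are not always the same as $\mathcal{F}_*^{\cup}(X)$ and
$\mathcal{F}^*_{\cap}(X)$, respectively. The properties of those lower and upper
approximations are studied by Inuiguchi [26.50].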
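
The four crisp approximations (26.32) to (26.35) can be sketched for a finite universe as follows. This is an illustrative sketch only: the function names and the small example family are assumptions, and the subfamilies $J$ are enumerated by brute force, which is only feasible for small $p$.

```python
from itertools import combinations

def lower_union(family, X):
    """(26.32): union of all granules F_i contained in X."""
    result = set()
    for F in family:
        if F <= X:
            result |= F
    return result

def upper_intersection(family, X, U):
    """(26.33): intersection of all unions of granules covering X (F_o = U added)."""
    granules = list(family) + [set(U)]          # F_o = U guarantees a cover exists
    result = set(U)
    for size in range(1, len(granules) + 1):
        for combo in combinations(granules, size):
            cover = set().union(*combo)
            if cover >= X:
                result &= cover
    return result

def lower_intersection(family, X, U):
    """(26.34): U - upper_intersection(U - X)."""
    return U - upper_intersection(family, U - X, U)

def upper_union(family, X, U):
    """(26.35): U - lower_union(U - X)."""
    return U - lower_union(family, U - X)

U = {1, 2, 3, 4, 5, 6}
family = [{1, 2}, {2, 3, 4}, {5}]
X = {1, 2, 3}
print(lower_union(family, X), upper_intersection(family, X, U))
print(lower_intersection(family, X, U), upper_union(family, X, U))
```

In this example the two lower approximations differ from each other, as do the two upper approximations, which illustrates the remark that they are not always the same.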


Definitions by Certainty Qualifications in Fuzzy Setting

We extend those lower and upper approximations to cases where
$\mathcal{F}$ is a family of fuzzy sets in $U$ and $X$ is a fuzzy set in $U$.
To do this, we extend the intersection, union, complement, and the
inclusion relation to the fuzzy setting. The intersection, union, and
complement are defined by the min operation, the max operation, and a
strong negation $n$, i.e., $C \cap D$, $C \cup D$, and $U - C$ for fuzzy sets
$C$ and $D$ are defined by the membership functions
$\mu_{C \cap D}(x) = \min(\mu_C(x), \mu_D(x))$,
$\mu_{C \cup D}(x) = \max(\mu_C(x), \mu_D(x))$, and
$\mu_{U - C}(x) = n(\mu_C(x))$, $\forall x \in U$, respectively. The inclusion
relation $C \subseteq D$ is extended to an inclusion relation with degree
$\mathrm{Inc}(C, D) = \inf_{x \in U} I(\mu_C(x), \mu_D(x))$, where $I$ is an
implication function.

First let us define a lower approximation by extending (26.32). In
(26.32), before applying the union, we collect the $F_i$ such that
$F_i \subseteq X$. This procedure cannot be extended directly to the fuzzy
setting, because there the inclusion relation holds only to a degree.
Namely, each $F_i$ has a degree $q_i = \mathrm{Inc}(F_i, X)$. This means that
$X$ includes $F_i$ to a degree $q_i$. Therefore, by using $F_i$, $X$ is
expressed as a fuzzy set including $F_i$ to a degree $q_i$. In other words,
$X$ is a fuzzy set $Y$ satisfying

$$\mathrm{Inc}(F_i, Y) = \inf_{x \in U} I(\mu_{F_i}(x), \mu_Y(x)) = q_i. \tag{26.36}$$

We note that there exists a solution satisfying (26.36) because $q_i$ is
defined by $\mathrm{Inc}(F_i, X)$. There can be many solutions $Y$ satisfying
(26.36), and the intersection and union of those solutions can be seen as
inner and outer approximations of $X$ by $F_i$. Because we are now
extending (26.32) and are interested in the lower approximation, we
consider the intersection of fuzzy sets including $F_i$ to a degree $q_i$.
Let us consider

$$\mathrm{Inc}(F_i, Y) = \inf_{x \in U} I(\mu_{F_i}(x), \mu_Y(x)) \ge q_i, \tag{26.37}$$

instead of (26.36). Equation (26.37) is called a converse-certainty
qualification [26.10] (or possibility-qualification). Because $I(a, \cdot)$ is
increasing for any $a \in [0, 1]$ for an implication function $I$ and
(26.36) has a solution, the intersection of solutions of (26.36) is the
same as the intersection of solutions of (26.37). Moreover, because $I$ is
upper semicontinuous, we obtain the intersection of solutions of (26.37)
as the smallest solution $\check{Y}$ of (26.37) defined by the following
membership function

$$\mu_{\check{Y}}(x) = \inf\{s \in [0, 1] \mid I(\mu_{F_i}(x), s) \ge q_i\} = \xi[I](\mu_{F_i}(x), q_i). \tag{26.38}$$

We have $\check{Y} \subseteq X$. Because we have many $F_i \in \mathcal{F}$, the
lower approximation $\mathcal{F}_*^{\xi}(X)$ of $X$ is defined by the following
membership function

$$\mu_{\mathcal{F}_*^{\xi}(X)}(x) = \sup_{F \in \mathcal{F}} \xi[I]\Big(\mu_F(x), \inf_{y \in U} I(\mu_F(y), \mu_X(y))\Big), \tag{26.39}$$

where $\mathcal{F}$ can have infinitely many elementary fuzzy sets $F$.

Because (26.32) is extended to (26.39), (26.35) is extended to the
following equation in the sense that
$\mathcal{F}^*_{\xi}(X) = U - \mathcal{F}_*^{\xi}(U - X)$:

$$\mu_{\mathcal{F}^*_{\xi}(X)}(x) = \inf_{F \in \mathcal{F}} n\Big(\xi[I]\Big(\mu_F(x), \inf_{y \in U} I(\mu_F(y), n(\mu_X(y)))\Big)\Big), \tag{26.40}$$

where $\mu_{\mathcal{F}^*_{\xi}(X)}$ is the membership function of the upper
approximation $\mathcal{F}^*_{\xi}(X)$ of $X$.
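
As an illustration of (26.39) and (26.40), the following sketch computes both approximations on a finite universe. It assumes the Dienes implication $I(a, b) = \max(1 - a, b)$ and the standard negation $n(a) = 1 - a$; for this choice, $\xi[I](a, b)$ equals $0$ if $1 - a \ge b$ and $b$ otherwise. The function names and data are illustrative assumptions, not taken from the chapter.

```python
def dienes(a, b):
    """Dienes implication I(a, b) = max(1 - a, b)."""
    return max(1.0 - a, b)

def xi_dienes(a, b):
    """xi[I](a, b) = inf{s | I(a, s) >= b}, closed form for the Dienes implication."""
    return 0.0 if 1.0 - a >= b else b

def inclusion(F, X):
    """Inc(F, X) = inf_x I(mu_F(x), mu_X(x)) on a finite universe."""
    return min(dienes(f, x) for f, x in zip(F, X))

def lower_xi(family, X):
    """(26.39): sup over F of xi[I](mu_F(x), Inc(F, X))."""
    degrees = [inclusion(F, X) for F in family]
    return [max(xi_dienes(F[k], q) for F, q in zip(family, degrees))
            for k in range(len(X))]

def upper_xi(family, X):
    """(26.40): complement of the lower approximation of the complement."""
    complement = [1.0 - x for x in X]
    return [1.0 - v for v in lower_xi(family, complement)]

# Membership vectors over a four-element universe (hypothetical data).
family = [[1.0, 0.75, 0.25, 0.0], [0.0, 0.25, 0.75, 1.0]]
X = [1.0, 0.75, 0.5, 0.25]
lower = lower_xi(family, X)
upper = upper_xi(family, X)
print(lower, upper)
```

On this data the lower approximation lies pointwise below $X$ and the upper approximation pointwise above it.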

Now let us consider the extension of (26.33). In this case, before
applying the intersection, we collect the unions $\bigcup_{i \in J} F_i$ such
that $\bigcup_{i \in J} F_i \supseteq X$. In the fuzzy setting, each
$\bigcup_{i \in J} F_i$ has a degree
$r_J = \mathrm{Inc}(X, \bigcup_{i \in J} F_i)$. This means that $X$ is included
in $\bigcup_{i \in J} F_i$ to a degree $r_J$. Therefore, by using $F_i$,
$i \in J$, $X$ is expressed as a fuzzy set included in
$\bigcup_{i \in J} F_i$ to a degree $r_J$. In other words, $X$ is a fuzzy set
$Y$ satisfying

$$\mathrm{Inc}\Big(Y, \bigcup_{i \in J} F_i\Big) = \inf_{x \in U} I\Big(\mu_Y(x), \max_{i \in J} \mu_{F_i}(x)\Big) = r_J. \tag{26.41}$$

We note that there exists a solution satisfying (26.41) because $r_J$ is
defined by $\mathrm{Inc}(X, \bigcup_{i \in J} F_i)$. There can be many solutions
$Y$ satisfying (26.41), and the intersection and union of those solutions
can be seen as inner and outer approximations of $X$ by
$\bigcup_{i \in J} F_i$. Because we are now extending (26.33) and are
interested in the upper approximation, we consider the union of fuzzy
sets included in $\bigcup_{i \in J} F_i$ to a degree $r_J$. Let us consider

$$\mathrm{Inc}\Big(Y, \bigcup_{i \in J} F_i\Big) = \inf_{x \in U} I\Big(\mu_Y(x), \max_{i \in J} \mu_{F_i}(x)\Big) \ge r_J, \tag{26.42}$$

instead of (26.41). Equation (26.42) is called a certainty qualification
[26.10, 54]. Because $I(\cdot, a)$ is decreasing for any $a \in [0, 1]$ for an
implication function $I$ and (26.41) has a solution, the union of
solutions of (26.41) is the same as the union of solutions of (26.42).
Moreover, because $I$ is upper semicontinuous, we obtain the union of
solutions of (26.42) as the greatest solution $\hat{Y}$ of (26.42) defined by
the following membership function

$$\mu_{\hat{Y}}(x) = \sup\Big\{s \in [0, 1] \;\Big|\; I\Big(s, \max_{i \in J} \mu_{F_i}(x)\Big) \ge r_J\Big\} = \sigma[I]\Big(r_J, \max_{i \in J} \mu_{F_i}(x)\Big), \tag{26.43}$$

where we define $\sigma[I](a, b) = \sup\{s \in [0, 1] \mid I(s, b) \ge a\}$ for
$a, b \in [0, 1]$. We have $X \subseteq \hat{Y}$. Because we have many
$\bigcup_{i \in J} F_i$, the upper approximation $\mathcal{F}^*_{\sigma}(X)$ of $X$ is
defined by the following membership function

$$\mu_{\mathcal{F}^*_{\sigma}(X)}(x) = \inf_{\mathcal{T} \subseteq \mathcal{F}} \sigma[I]\Big(\inf_{y \in U} I\Big(\mu_X(y), \sup_{F \in \mathcal{T}} \mu_F(y)\Big), \sup_{F \in \mathcal{T}} \mu_F(x)\Big), \tag{26.44}$$

where $\mathcal{F}$ can have infinitely many elementary fuzzy sets $F$. We note
that $\sigma[I]$ becomes an implication function.

Because (26.33) is extended to (26.44), (26.34) is extended to the
following equation in the sense that
$\mathcal{F}_*^{\sigma}(X) = U - \mathcal{F}^*_{\sigma}(U - X)$:

$$\mu_{\mathcal{F}_*^{\sigma}(X)}(x) = \sup_{\mathcal{T} \subseteq \mathcal{F}} n\Big(\sigma[I]\Big(\inf_{y \in U} I\Big(n(\mu_X(y)), \sup_{F \in \mathcal{T}} \mu_F(y)\Big), \sup_{F \in \mathcal{T}} \mu_F(x)\Big)\Big), \tag{26.45}$$

where $\mu_{\mathcal{F}_*^{\sigma}(X)}$ is the membership function of the lower
approximation $\mathcal{F}_*^{\sigma}(X)$ of $X$.
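
A brute-force sketch of the upper approximation (26.44) on a finite universe is given below. It assumes the Dienes implication, for which $\sigma[I](a, b) = 1$ if $b \ge a$ and $1 - a$ otherwise, and it enumerates only nonempty subfamilies $\mathcal{T}$; both choices, like the function names and data, are illustrative assumptions.

```python
from itertools import combinations

def dienes(a, b):
    """Dienes implication I(a, b) = max(1 - a, b)."""
    return max(1.0 - a, b)

def sigma_dienes(a, b):
    """sigma[I](a, b) = sup{s | I(s, b) >= a}, closed form for the Dienes implication."""
    return 1.0 if b >= a else 1.0 - a

def upper_sigma(family, X):
    """(26.44): inf over nonempty subfamilies T of sigma[I](Inc(X, union T), union T)."""
    n_points = len(X)
    result = [1.0] * n_points
    for size in range(1, len(family) + 1):
        for combo in combinations(family, size):
            # Membership function of the union of the selected granules.
            g = [max(F[k] for F in combo) for k in range(n_points)]
            # Degree to which X is included in that union.
            r_T = min(dienes(X[k], g[k]) for k in range(n_points))
            result = [min(result[k], sigma_dienes(r_T, g[k])) for k in range(n_points)]
    return result

family = [[1.0, 0.75, 0.25, 0.0], [0.0, 0.25, 0.75, 1.0]]
X = [1.0, 0.75, 0.5, 0.25]
upper = upper_sigma(family, X)
print(upper)
```

The enumeration is exponential in the number of granules, so this form is only meant to make the definition concrete for very small families.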
These four approximations were originally proposed by Inuiguchi and
Tanino [26.10]. They selected a pair
$(\mathcal{F}_*^{\xi}(X), \mathcal{F}^*_{\sigma}(X))$ to define a rough set of $X$.
However, in connection with the crisp case, Inuiguchi [26.37] selected
pairs $(\mathcal{F}_*^{\xi}(X), \mathcal{F}^*_{\xi}(X))$ and
$(\mathcal{F}_*^{\sigma}(X), \mathcal{F}^*_{\sigma}(X))$ for the definitions of rough
sets of $X$. In this chapter, a pair
$(\mathcal{F}_*^{\xi}(X), \mathcal{F}^*_{\xi}(X))$ is called a $\xi$-fuzzy rough set
and a pair $(\mathcal{F}_*^{\sigma}(X), \mathcal{F}^*_{\sigma}(X))$ a $\sigma$-fuzzy
rough set.


Properties 
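First, we show properties about the representations of lower and upper
approximations defined by (26.39), (26.40), (26.44), and (26.45). We have
the following equalities (see Inuiguchi [26.37]):

$$\mu_{\mathcal{F}_*^{\xi}(X)}(x) = \sup\{\xi[I](\mu_F(x), h) \mid F \in \mathcal{F},\ h \in [0, 1] \text{ such that } \xi[I](\mu_F(y), h) \le \mu_X(y),\ \forall y \in U\}, \tag{26.46}$$

$$\mu_{\mathcal{F}_*^{\sigma}(X)}(x) = \sup\Big\{n\Big(\sigma[I]\Big(h, \sup_{F \in \mathcal{T}} \mu_F(x)\Big)\Big) \;\Big|\; \mathcal{T} \subseteq \mathcal{F},\ h \in [0, 1] \text{ such that } \sigma[I]\Big(h, \sup_{F \in \mathcal{T}} \mu_F(y)\Big) \ge n(\mu_X(y)),\ \forall y \in U\Big\}, \tag{26.47}$$

$$\mu_{\mathcal{F}^*_{\xi}(X)}(x) = \inf\{n(\xi[I](\mu_F(x), h)) \mid F \in \mathcal{F},\ h \in [0, 1] \text{ such that } \xi[I](\mu_F(y), h) \le n(\mu_X(y)),\ \forall y \in U\}, \tag{26.48}$$

$$\mu_{\mathcal{F}^*_{\sigma}(X)}(x) = \inf\Big\{\sigma[I]\Big(h, \sup_{F \in \mathcal{T}} \mu_F(x)\Big) \;\Big|\; \mathcal{T} \subseteq \mathcal{F},\ h \in [0, 1] \text{ such that } \sigma[I]\Big(h, \sup_{F \in \mathcal{T}} \mu_F(y)\Big) \ge \mu_X(y),\ \forall y \in U\Big\}. \tag{26.49}$$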

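Using these equations, the following properties can be easily obtained:

$$\mathcal{F}_*^{\xi}(X) \subseteq X \subseteq \mathcal{F}^*_{\xi}(X), \quad \mathcal{F}_*^{\sigma}(X) \subseteq X \subseteq \mathcal{F}^*_{\sigma}(X), \tag{26.50}$$

$$\mathcal{F}_*^{\xi}(\emptyset) = \mathcal{F}_*^{\sigma}(\emptyset) = \emptyset, \quad \mathcal{F}^*_{\xi}(U) = \mathcal{F}^*_{\sigma}(U) = U, \tag{26.51}$$

$$X \subseteq Y \text{ implies } \mathcal{F}_*^{\xi}(X) \subseteq \mathcal{F}_*^{\xi}(Y), \quad X \subseteq Y \text{ implies } \mathcal{F}_*^{\sigma}(X) \subseteq \mathcal{F}_*^{\sigma}(Y), \tag{26.52}$$

$$X \subseteq Y \text{ implies } \mathcal{F}^*_{\xi}(X) \subseteq \mathcal{F}^*_{\xi}(Y), \quad X \subseteq Y \text{ implies } \mathcal{F}^*_{\sigma}(X) \subseteq \mathcal{F}^*_{\sigma}(Y), \tag{26.53}$$

$$\mathcal{F}_*^{\xi}(X \cup Y) \supseteq \mathcal{F}_*^{\xi}(X) \cup \mathcal{F}_*^{\xi}(Y), \quad \mathcal{F}_*^{\sigma}(X \cup Y) \supseteq \mathcal{F}_*^{\sigma}(X) \cup \mathcal{F}_*^{\sigma}(Y), \tag{26.54}$$

$$\mathcal{F}^*_{\xi}(X \cap Y) \subseteq \mathcal{F}^*_{\xi}(X) \cap \mathcal{F}^*_{\xi}(Y), \quad \mathcal{F}^*_{\sigma}(X \cap Y) \subseteq \mathcal{F}^*_{\sigma}(X) \cap \mathcal{F}^*_{\sigma}(Y), \tag{26.55}$$

$$\mathcal{F}_*^{\xi}(U - X) = U - \mathcal{F}^*_{\xi}(X), \quad \mathcal{F}_*^{\sigma}(U - X) = U - \mathcal{F}^*_{\sigma}(X). \tag{26.56}$$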

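Furthermore, we can prove the following properties
(see Inuiguchi [26.37]):

(7) The following assertions are valid:
(a) If $a > 0$, $b < 1$ imply $I(a, b) < 1$ and
$\inf_{x \in U} \sup_{F \in \mathcal{F}} \mu_F(x) > 0$, then we have
$\mathcal{F}_*^{\xi}(U) = U$ and $\mathcal{F}^*_{\xi}(\emptyset) = \emptyset$.
(b) If $b < 1$ implies $I(1, b) < 1$ and
$\inf_{x \in U} \sup_{F \in \mathcal{F}} \mu_F(x) = 1$, then we have
$\mathcal{F}_*^{\xi}(U) = U$ and $\mathcal{F}^*_{\xi}(\emptyset) = \emptyset$.
(c) If $a > 0$, $b < 1$ imply $I(a, b) < 1$ and $\forall x \in U$,
$\exists F \in \mathcal{F}$ such that $\mu_F(x) < 1$, then we have
$\mathcal{F}_*^{\sigma}(U) = U$ and $\mathcal{F}^*_{\sigma}(\emptyset) = \emptyset$.
(d) If $a > 0$ implies $I(a, 0) < 1$ and $\forall x \in U$,
$\exists F \in \mathcal{F}$ such that $\mu_F(x) = 0$, then we have
$\mathcal{F}_*^{\sigma}(U) = U$ and $\mathcal{F}^*_{\sigma}(\emptyset) = \emptyset$.

(8) We have

$$\mathcal{F}_*^{\sigma}(X \cap Y) = \mathcal{F}_*^{\sigma}(X) \cap \mathcal{F}_*^{\sigma}(Y), \quad \mathcal{F}^*_{\sigma}(X \cup Y) = \mathcal{F}^*_{\sigma}(X) \cup \mathcal{F}^*_{\sigma}(Y). \tag{26.57}$$

Moreover, if $\forall a \in [0, 1]$, $I(a, a) = 1$ and $\forall F_i, F_j \in \mathcal{F}$,
$F_i \neq F_j$, $F_i \cap F_j = \emptyset$, we have

$$\mathcal{F}_*^{\xi}(X \cap Y) = \mathcal{F}_*^{\xi}(X) \cap \mathcal{F}_*^{\xi}(Y), \quad \mathcal{F}^*_{\xi}(X \cup Y) = \mathcal{F}^*_{\xi}(X) \cup \mathcal{F}^*_{\xi}(Y). \tag{26.58}$$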


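(9) We have

$$\mathcal{F}_*^{\xi}(\mathcal{F}_*^{\xi}(X)) = \mathcal{F}_*^{\xi}(X), \quad \mathcal{F}^*_{\sigma}(\mathcal{F}^*_{\sigma}(X)) = \mathcal{F}^*_{\sigma}(X), \tag{26.59}$$

$$\mathcal{F}^*_{\xi}(\mathcal{F}^*_{\xi}(X)) = \mathcal{F}^*_{\xi}(X), \quad \mathcal{F}_*^{\sigma}(\mathcal{F}_*^{\sigma}(X)) = \mathcal{F}_*^{\sigma}(X). \tag{26.60}$$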


Inuiguchi and Tanino [26.10] first proposed this type of fuzzy rough
sets. They demonstrated its advantage in approximation when $P$ is
reflexive and symmetric, $I$ is the Dienes implication, and $T$ is the
minimum operation. Inuiguchi and Tanino [26.55] showed that, by
selecting a necessity measure capable of expressing various inclusion
situations, the approximations become better, i.e., the differences
between lower and upper approximations satisfying (26.58) become
smaller. Moreover, Inuiguchi and Tanino [26.56] applied these fuzzy
rough sets to function approximation.


26.2.3 Relations Between Two Kinds 
of Fuzzy Rough Sets 


Under the given fuzzy relations $P$ and $Q$ described in Sect. 26.2.1, we
discuss the relations between the two kinds of fuzzy rough sets.
Families of fuzzy sets are defined by
$\mathcal{P} = \{P(x) \mid x \in U\}$ and $\mathcal{Q} = \{Q(x) \mid x \in U\}$.
We have the following assertions:

(10) When $P$ and $Q$ are reflexive and $I(1, a) = a$, we have

$$P_*(X) \subseteq \mathcal{P}_*^{\xi}(X), \quad \mathcal{Q}^*_{\xi}(X) \subseteq Q^*(X). \tag{26.61}$$

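(11) When $P$ and $Q$ are reflexive, $X$ is a crisp set, $a \le b$ if and
only if $I(a, b) = 1$, and $T(a, 1) = a$ for all $a \in [0, 1]$, we have

$$P_*(X) \subseteq \mathcal{P}_*^{\sigma}(X), \quad \mathcal{Q}^*_{\sigma}(X) \subseteq Q^*(X). \tag{26.62}$$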

(12) When $P$ and $Q$ are $T$-transitive, the following assertions are
valid:
(a) When $T = \xi[I]$ is associative, we have

$$\mathcal{P}_*^{\xi}(X) \subseteq P_*(X), \quad Q^*(X) \subseteq \mathcal{Q}^*_{\xi}(X). \tag{26.63}$$

(b) When $T = \xi[\sigma[I]]$ and
$\sigma[I](a, \sigma[I](b, c)) = \sigma[I](b, \sigma[I](a, c))$ for all
$a, b, c \in [0, 1]$, we have

$$\mathcal{P}_*^{\sigma}(X) \subseteq P_*(X), \quad Q^*(X) \subseteq \mathcal{Q}^*_{\sigma}(X). \tag{26.64}$$

Here we define

$$\zeta[T](a, b) = \sup\{s \in [0, 1] \mid T(a, s) \le b\}.$$

This functional $\zeta$ can produce an implication function from a
conjunction function $T$. Note that $\zeta[\xi[I]] = I$ and $\xi[\zeta[T]] = T$
for upper semicontinuous $I$ and lower semicontinuous $T$ (see Inuiguchi
and Sakawa [26.53]).


26.2.4 The Other Approximation-Oriented 
Fuzzy Rough Sets 


Greco et al. [26.24, 25] proposed fuzzy rough sets corresponding to a
gradual rule [26.57], "the more an object is in $G$, the more it is in
$X$," with fuzzy sets $G$ and $X$. Corresponding to this gradual rule, we
may define the lower approximation $G_*^{+}(X)$ of $X$ and the upper
approximation $G^*_{+}(X)$ of $X$, respectively, by the following
membership functions

$$\mu_{G_*^{+}(X)}(x) = \inf\{\mu_X(z) \mid z \in U,\ \mu_G(z) \ge \mu_G(x)\}, \tag{26.65}$$

$$\mu_{G^*_{+}(X)}(x) = \sup\{\mu_X(z) \mid z \in U,\ \mu_G(z) \le \mu_G(x)\}. \tag{26.66}$$

When we have a gradual rule, "the less an object is in $G$, the more it
is in $X$," we define the lower approximation $G_*^{-}(X)$ of $X$ and the
upper approximation $G^*_{-}(X)$ of $X$, respectively, by the following
membership functions

$$\mu_{G_*^{-}(X)}(x) = \inf\{\mu_X(z) \mid z \in U,\ \mu_G(z) \le \mu_G(x)\}, \tag{26.67}$$

$$\mu_{G^*_{-}(X)}(x) = \sup\{\mu_X(z) \mid z \in U,\ \mu_G(z) \ge \mu_G(x)\}. \tag{26.68}$$

Moreover, when a complex gradual rule, "the more an object is in $G^{+}$
and the less it is in $G^{-}$, the more it is in $X$," is given, the lower
approximation $\mathcal{G}_*^{\pm}(X)$ and upper approximation
$\mathcal{G}^*_{\pm}(X)$ are defined, respectively, by the following equations

$$\mu_{\mathcal{G}_*^{\pm}(X)}(x) = \inf\{\mu_X(z) \mid z \in U,\ \mu_{G^{+}}(z) \ge \mu_{G^{+}}(x),\ \mu_{G^{-}}(z) \le \mu_{G^{-}}(x)\}, \tag{26.69}$$

$$\mu_{\mathcal{G}^*_{\pm}(X)}(x) = \sup\{\mu_X(z) \mid z \in U,\ \mu_{G^{+}}(z) \le \mu_{G^{+}}(x),\ \mu_{G^{-}}(z) \ge \mu_{G^{-}}(x)\}, \tag{26.70}$$

where we define $\mathcal{G} = \{G^{+}, G^{-}\}$.
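
The approximations (26.65) and (26.66) need only comparisons of membership degrees, which the following sketch makes explicit for a finite universe. Fuzzy sets are represented as dictionaries from objects to degrees; the function names and data are illustrative assumptions.

```python
def gradual_lower_plus(mu_G, mu_X):
    """(26.65): inf of mu_X over objects z that are at least as much in G as x."""
    return {x: min(mu_X[z] for z in mu_G if mu_G[z] >= mu_G[x]) for x in mu_G}

def gradual_upper_plus(mu_G, mu_X):
    """(26.66): sup of mu_X over objects z that are at most as much in G as x."""
    return {x: max(mu_X[z] for z in mu_G if mu_G[z] <= mu_G[x]) for x in mu_G}

mu_G = {"u1": 0.2, "u2": 0.5, "u3": 0.5, "u4": 0.9}
mu_X = {"u1": 0.3, "u2": 0.4, "u3": 0.7, "u4": 0.6}

lower = gradual_lower_plus(mu_G, mu_X)
upper = gradual_upper_plus(mu_G, mu_X)
print(lower)
print(upper)
```

Both results are monotonically increasing in $\mu_G$, in line with the gradual rule "the more an object is in $G$, the more it is in $X$," and they bracket $\mu_X$ from below and above.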

The fuzzy rough sets are defined by pairs of those lower and upper
approximations. This approach is advantageous in that (i) no logical
connectives such as implication functions or conjunction functions are
used and (ii) the fuzzy rough sets correspond to gradual rules (see
Greco et al. [26.24, 25]). However, we need background knowledge about
the monotone properties between $G$ (or $\mathcal{G}$) and $X$.

This approach can be seen from the viewpoint of modifier functions of
fuzzy sets. A modifier function $\varphi$ is generally a function from
$[0, 1]$ to $[0, 1]$ [26.58]. Functions defined by $\varphi_1(x) = x^2$,
$\varphi_2(x) = \sqrt{x}$, and $\varphi_3(x) = 1 - x$ are known as modifier
functions corresponding to the modifying words very, more or less, and
not. Namely, given a fuzzy set $A$, we may define fuzzy sets very $A$,
more or less $A$, and not-$A$ by the following membership functions

$$\mu_{\text{very } A}(x) = (\mu_A(x))^2, \quad \mu_{\text{more or less } A}(x) = \sqrt{\mu_A(x)}, \quad \mu_{\text{not-}A}(x) = 1 - \mu_A(x). \tag{26.71}$$

Such modifier functions are often used in approximate/fuzzy reasoning
[26.59, 60], especially in the indirect method of fuzzy reasoning, which
is also called the truth value space method.

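Namely, we may define the lower approximation $\Phi_*(X)$ of $X$ and the
upper approximation $\Phi^*(X)$ of $X$ by means of a fuzzy set $G$ by the
following membership functions

$$\mu_{\Phi_*(X)}(x) = \varphi_*(\mu_G(x)), \quad \mu_{\Phi^*(X)}(x) = \varphi^*(\mu_G(x)), \tag{26.72}$$

where $\Phi = \{\varphi_*, \varphi^*\}$ and the modifier functions $\varphi_*$ and
$\varphi^*$ are selected to satisfy

$$\varphi_*(\mu_G(x)) \le \mu_X(x), \quad \varphi^*(\mu_G(x)) \ge \mu_X(x), \quad \forall x \in U. \tag{26.73}$$

Indeed,

$$\xi[I]\Big(\cdot, \inf_{y \in U} I(\mu_G(y), \mu_X(y))\Big) \quad \text{and} \quad \sigma[I]\Big(\inf_{y \in U} I(\mu_X(y), \mu_G(y)), \cdot\Big)$$

are modifier functions satisfying (26.73), and these are used to define
$\mathcal{F}_*^{\xi}(X)$ and $\mathcal{F}^*_{\sigma}(X)$ in (26.39) and (26.44),
respectively. We note that we consider multiple fuzzy sets
$G = F \in \mathcal{F}$ in (26.39) and apply the union because we have

$$\xi[I]\Big(\mu_G(x), \inf_{y \in U} I(\mu_G(y), \mu_X(y))\Big) \le \mu_X(x), \quad \forall x \in U \text{ for all } G \in \mathcal{F}.$$

Similarly, we consider multiple fuzzy sets $G$ defined by
$\mu_G(x) = \sup_{F \in \mathcal{T}} \mu_F(x)$, $x \in U$, in (26.44) and we apply
the intersection because we have
$\sigma[I](\inf_{y \in U} I(\mu_X(y), \mu_G(y)), \mu_G(x)) \ge \mu_X(x)$,
$\forall x \in U$, for all those fuzzy sets $G$.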

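In the definitions of (26.65) to (26.68), the following modifier
functions are used, respectively:

$$\varphi_*^{+}(\alpha) = \sup\{\psi_*^{+}(\beta) \mid \beta \in [0, \alpha]\}, \quad \varphi^*_{+}(\alpha) = \inf\{\psi^*_{+}(\beta) \mid \beta \in [\alpha, 1]\}, \tag{26.74}$$

$$\varphi_*^{-}(\alpha) = \sup\{\psi_*^{-}(\beta) \mid \beta \in [\alpha, 1]\}, \quad \varphi^*_{-}(\alpha) = \inf\{\psi^*_{-}(\beta) \mid \beta \in [0, \alpha]\}, \tag{26.75}$$

where we define

$$\psi_*^{+}(\alpha) = \inf\{\mu_X(z) \mid z \in U,\ \mu_G(z) \ge \alpha\}, \tag{26.76}$$

$$\psi^*_{+}(\alpha) = \sup\{\mu_X(z) \mid z \in U,\ \mu_G(z) \le \alpha\}, \tag{26.77}$$

$$\psi_*^{-}(\alpha) = \inf\{\mu_X(z) \mid z \in U,\ \mu_G(z) \le \alpha\}, \tag{26.78}$$

$$\psi^*_{-}(\alpha) = \sup\{\mu_X(z) \mid z \in U,\ \mu_G(z) \ge \alpha\}, \tag{26.79}$$

with $\inf \emptyset = 0$ and $\sup \emptyset = 1$. We note that $\varphi_*^{+}$ and
$\varphi^*_{+}$ are monotonically increasing, while $\varphi_*^{-}$ and
$\varphi^*_{-}$ are monotonically decreasing. These monotonicities are
imposed in order to fit the supposed gradual rules. However, such
monotonicities do not hold for the functions $\psi_*^{+}$, $\psi^*_{+}$,
$\psi_*^{-}$, and $\psi^*_{-}$. In the cases of (26.69) and (26.70), we should
extend the modifier function to a generalized modifier function, which
is a function from $[0, 1] \times [0, 1]$ to $[0, 1]$, because we have two
fuzzy sets in the premise of the corresponding gradual rule. The
generalized modifier functions associated with (26.69) and (26.70) are
obtained as

$$\varphi_*^{\pm}(\alpha_1, \alpha_2) = \sup\{\psi_*^{\pm}(\beta_1, \beta_2) \mid \beta_1 \in [0, \alpha_1],\ \beta_2 \in [\alpha_2, 1]\}, \tag{26.80}$$

$$\varphi^*_{\pm}(\alpha_1, \alpha_2) = \inf\{\psi^*_{\pm}(\beta_1, \beta_2) \mid \beta_1 \in [\alpha_1, 1],\ \beta_2 \in [0, \alpha_2]\}, \tag{26.81}$$

where we define

$$\psi_*^{\pm}(\beta_1, \beta_2) = \inf\{\mu_X(z) \mid z \in U,\ \mu_{G^{+}}(z) \ge \beta_1,\ \mu_{G^{-}}(z) \le \beta_2\}, \tag{26.82}$$

$$\psi^*_{\pm}(\beta_1, \beta_2) = \sup\{\mu_X(z) \mid z \in U,\ \mu_{G^{+}}(z) \le \beta_1,\ \mu_{G^{-}}(z) \ge \beta_2\}, \tag{26.83}$$

with $\inf \emptyset = 0$ and $\sup \emptyset = 1$. We note that $\varphi_*^{\pm}$ and
$\varphi^*_{\pm}$ are monotonically increasing in the first argument and
monotonically decreasing in the second argument.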

Moreover, when we do not have any background knowledge about the
relation between $G$ and $X$ that could be expressed by a gradual rule,
we may define the lower approximation $G_*^{=}(X)$ and the upper
approximation $G^*_{=}(X)$ by the following membership functions

$$\mu_{G_*^{=}(X)}(x) = \inf\{\mu_X(z) \mid z \in U,\ \mu_G(z) = \mu_G(x)\}, \tag{26.84}$$

$$\mu_{G^*_{=}(X)}(x) = \sup\{\mu_X(z) \mid z \in U,\ \mu_G(z) = \mu_G(x)\}. \tag{26.85}$$

The modifier functions associated with these approximations are
obtained as

$$\varphi_*^{=}(\alpha) = \inf\{\mu_X(z) \mid z \in U,\ \mu_G(z) = \alpha\}, \tag{26.86}$$

$$\varphi^*_{=}(\alpha) = \sup\{\mu_X(z) \mid z \in U,\ \mu_G(z) = \alpha\}, \tag{26.87}$$

where we define $\inf \emptyset = 0$ and $\sup \emptyset = 1$. Equation (26.87) is
the same as the inverse truth qualification [26.59, 60] of $X$ based on
$G$.

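We describe the properties of the approximations defined by (26.65) to
(26.68); the other approximations, defined by (26.69), (26.70), (26.84),
and (26.85), have similar properties. We have the following properties
for the approximations defined by (26.65) to (26.68) (see Greco et al.
[26.25] for a part of these properties):

$$G_*^{+}(X) \subseteq X \subseteq G^*_{+}(X), \quad G_*^{-}(X) \subseteq X \subseteq G^*_{-}(X), \tag{26.88}$$

$$G_*^{+}(\emptyset) = G^*_{+}(\emptyset) = G_*^{-}(\emptyset) = G^*_{-}(\emptyset) = \emptyset, \tag{26.89}$$

$$G_*^{+}(U) = G^*_{+}(U) = G_*^{-}(U) = G^*_{-}(U) = U, \tag{26.90}$$

$$G_*^{+}(X \cap Y) = G_*^{+}(X) \cap G_*^{+}(Y), \quad G_*^{-}(X \cap Y) = G_*^{-}(X) \cap G_*^{-}(Y), \tag{26.91}$$

$$G^*_{+}(X \cup Y) = G^*_{+}(X) \cup G^*_{+}(Y), \quad G^*_{-}(X \cup Y) = G^*_{-}(X) \cup G^*_{-}(Y), \tag{26.92}$$

$$X \subseteq Y \text{ implies } G_*^{+}(X) \subseteq G_*^{+}(Y), \quad X \subseteq Y \text{ implies } G^*_{+}(X) \subseteq G^*_{+}(Y), \tag{26.93}$$

$$X \subseteq Y \text{ implies } G_*^{-}(X) \subseteq G_*^{-}(Y), \quad X \subseteq Y \text{ implies } G^*_{-}(X) \subseteq G^*_{-}(Y), \tag{26.94}$$

$$G_*^{+}(X \cup Y) \supseteq G_*^{+}(X) \cup G_*^{+}(Y), \quad G^*_{+}(X \cap Y) \subseteq G^*_{+}(X) \cap G^*_{+}(Y), \tag{26.95}$$

$$G_*^{-}(X \cup Y) \supseteq G_*^{-}(X) \cup G_*^{-}(Y), \quad G^*_{-}(X \cap Y) \subseteq G^*_{-}(X) \cap G^*_{-}(Y), \tag{26.96}$$

$$G_*^{+}(U \setminus X) = U \setminus G^*_{-}(X) = U \setminus (U \setminus G)^*_{+}(X) = (U \setminus G)_*^{-}(U \setminus X), \tag{26.97}$$

$$G_*^{-}(U \setminus X) = U \setminus G^*_{+}(X) = U \setminus (U \setminus G)^*_{-}(X) = (U \setminus G)_*^{+}(U \setminus X), \tag{26.98}$$

$$G_*^{+}(G_*^{+}(X)) = G^*_{+}(G_*^{+}(X)) = G_*^{+}(X), \quad G^*_{+}(G^*_{+}(X)) = G_*^{+}(G^*_{+}(X)) = G^*_{+}(X), \tag{26.99}$$

$$G_*^{-}(G_*^{-}(X)) = G^*_{-}(G_*^{-}(X)) = G_*^{-}(X), \quad G^*_{-}(G^*_{-}(X)) = G_*^{-}(G^*_{-}(X)) = G^*_{-}(X), \tag{26.100}$$

where $U \setminus X$ is a fuzzy set defined by its membership function
$\mu_{U \setminus X}(x) = N(\mu_X(x))$, $\forall x \in U$, with a strictly
decreasing function $N: [0, 1] \to [0, 1]$. We found that all fundamental
properties [26.2] of the classical rough set are preserved.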


26.2.5 Remarks 


Three types of fuzzy rough set models have been described, divided into
two groups: classification-oriented and approximation-oriented models.
The classification-oriented fuzzy rough set models have been
investigated far more extensively. However, the approximation-oriented
fuzzy rough set models would be more important because they are
associated with rules. While approximation-oriented fuzzy rough set
models based on modifiers are strongly related to gradual rules,
approximation-oriented fuzzy rough set models based on certainty
qualification are related to uncertain generation rules (uncertain
qualification rules: certainty rules and possibility rules) [26.54],
i.e., rules such as "the more an object is in $A$, the more certain
(possible) it is in $B$." While approximation-oriented fuzzy rough set
models based on modifiers need a modifier function for each granule $G$,
the approximation-oriented fuzzy rough set models based on certainty
qualification need only a degree of inclusion for each granule $F$.
Therefore, the latter may work well for data compression such as image
compression and speech compression.
26.3 Generalized Fuzzy Belief Structures with Application in Fuzzy Information Systems


In rough set theory there exists a pair of approximation 
operators, the lower and upper approximations, whereas 
in the Dempster-Shafer theory of evidence there ex- 
ists a dual pair of uncertainty measures, the belief and 
plausibility functions. In this section, general types of 
belief structures and their induced dual pairs of belief 
and plausibility functions are first introduced. Relation- 
ships between belief and plausibility functions in the 
Dempster-Shafer theory of evidence and the lower and 
upper approximations in rough set theory are then es- 
tablished. It is shown that the probabilities of lower 
and upper approximations induced from an approxima- 
tion space yield a dual pair of belief and plausibility 
functions. And for any belief structure there must exist 
a probability approximation space such that the belief 
and plausibility functions defined by the given belief 
structure are, respectively, the lower and upper proba- 
bilities induced by the approximation space. The pair 
of lower and upper approximations of a set capture 
the non-numeric aspect of uncertainty of the set which 
can be interpreted as the qualitative representation of 
the set, whereas the pair of the belief and plausibil- 
ity measures of the set capture the numeric aspect of 
uncertainty of the set which can be treated as the quan- 
titative characterization of the set. Finally, the potential 
applications of the main results to intelligent informa- 
tion systems are explored. 


26.3.1 Belief Structures and Belief Functions 


In this section, we recall some basic notions related to 
belief structures with their induced belief and plausibil- 
ity functions. 


Belief and Plausibility Functions Derived 

from a Crisp Belief Structure 
Throughout this section, $U$ will be a nonempty set called the universe
of discourse. The class of all subsets (respectively, fuzzy subsets) of
$U$ will be denoted by $\mathcal{P}(U)$ (respectively, by $\mathcal{F}(U)$). For
any $A \in \mathcal{F}(U)$, the complement of $A$ will be denoted by $\sim A$,
i.e.,

$$(\sim A)(x) = 1 - A(x) \quad \text{for all } x \in U.$$


The basic representational structure in the 
Dempster-Shafer theory of evidence is a belief 
structure. 


Definition 26.1 

Let $U$ be a nonempty set, which may be infinite. A set function
$m: \mathcal{P}(U) \to [0, 1]$ is referred to as a crisp basic probability
assignment if it satisfies axioms (M1) and (M2):

$$\text{(M1)} \quad m(\emptyset) = 0, \qquad \text{(M2)} \quad \sum_{X \subseteq U} m(X) = 1.$$

A set $X \in \mathcal{P}(U)$ with nonzero basic probability assignment is
referred to as a focal element of $m$. We denote by $\mathcal{M}$ the family
of all focal elements of $m$. The pair $(\mathcal{M}, m)$ is called a crisp
belief structure on $U$.


Lemma 26.1 
Let $(\mathcal{M}, m)$ be a crisp belief structure on $U$. Then the focal
elements of $m$ constitute a countable set.


Proof: For any $n \in \{1, 2, \ldots\}$, let

$$H_n = \{A \in \mathcal{M} \mid m(A) > 1/n\}.$$

By axiom (M2), each $H_n$ is a finite set: it has fewer than $n$ elements,
since otherwise the masses of its elements would sum to more than 1.
Since $\mathcal{M} = \bigcup_{n=1}^{\infty} H_n$, we conclude that $\mathcal{M}$ is
countable. ∎

Associated with each belief structure, a pair of belief and plausibility
functions can be defined.


Definition 26.2 

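Let $(\mathcal{M}, m)$ be a crisp belief structure on $U$. A set function
$\mathrm{Bel}: \mathcal{P}(U) \to [0, 1]$ is called a CC-belief function on $U$
if

$$\mathrm{Bel}(X) = \sum_{M \subseteq X} m(M), \quad \forall X \in \mathcal{P}(U). \tag{26.101}$$

A set function $\mathrm{Pl}: \mathcal{P}(U) \to [0, 1]$ is called a
CC-plausibility function on $U$ if

$$\mathrm{Pl}(X) = \sum_{M \cap X \neq \emptyset} m(M), \quad \forall X \in \mathcal{P}(U). \tag{26.102}$$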

Remark 26.1 

Since $\mathcal{M}$ is a countable set and all masses are nonnegative,
changing the order of summation does not change the values of the
(countably) infinite sums in (26.101) and (26.102). Therefore,
Definition 26.2 is reasonable.
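
For a finite universe, (26.101) and (26.102) translate directly into code. The sketch below represents a crisp belief structure as a dictionary from focal elements (frozensets) to masses; the function names and the example masses are illustrative assumptions.

```python
def bel(m, X):
    """(26.101): total mass of the focal elements contained in X."""
    return sum(mass for focal, mass in m.items() if focal <= X)

def pl(m, X):
    """(26.102): total mass of the focal elements intersecting X."""
    return sum(mass for focal, mass in m.items() if focal & X)

U = frozenset({"a", "b", "c"})
m = {frozenset({"a"}): 0.5, frozenset({"a", "b"}): 0.25, U: 0.25}

X = frozenset({"a", "b"})
print(bel(m, X), pl(m, X))
print(pl(m, X) == 1.0 - bel(m, U - X))   # duality between Bel and Pl
```

The last line checks, for this one set, the duality between belief and plausibility with respect to the complement $U - X$.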
The CC-belief function and CC-plausibility function based on the same
belief structure are connected by the dual property

$$\mathrm{Pl}(X) = 1 - \mathrm{Bel}(\sim X), \quad \forall X \in \mathcal{P}(U), \tag{26.103}$$

and moreover,

$$\mathrm{Bel}(X) \le \mathrm{Pl}(X), \quad \forall X \in \mathcal{P}(U). \tag{26.104}$$

When $U$ is finite, a CC-belief function can be equivalently defined as
a monotone Choquet capacity [26.61] on $U$ which satisfies the following
properties [26.38]:

(MC1) $\mathrm{Bel}(\emptyset) = 0$,
(MC2) $\mathrm{Bel}(U) = 1$,
(MC3) for all $X_i \in \mathcal{P}(U)$, $i = 1, 2, \ldots, k$,

$$\mathrm{Bel}\Big(\bigcup_{i=1}^{k} X_i\Big) \ge \sum_{\emptyset \neq J \subseteq \{1, 2, \ldots, k\}} (-1)^{|J|+1}\, \mathrm{Bel}\Big(\bigcap_{i \in J} X_i\Big). \tag{26.105}$$

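Similarly, a CC-plausibility function can be equivalently defined as an
alternating Choquet capacity on $U$ which satisfies the following
properties:

(AC1) $\mathrm{Pl}(\emptyset) = 0$,
(AC2) $\mathrm{Pl}(U) = 1$,
(AC3) for all $X_i \in \mathcal{P}(U)$, $i = 1, 2, \ldots, k$,

$$\mathrm{Pl}\Big(\bigcap_{i=1}^{k} X_i\Big) \le \sum_{\emptyset \neq J \subseteq \{1, 2, \ldots, k\}} (-1)^{|J|+1}\, \mathrm{Pl}\Big(\bigcup_{i \in J} X_i\Big). \tag{26.106}$$

A monotone Choquet capacity is a belief function in which the basic
probability assignment can be calculated by using the Möbius transform

$$m(X) = \sum_{Y \subseteq X} (-1)^{|X \setminus Y|}\, \mathrm{Bel}(Y), \quad X \in \mathcal{P}(U). \tag{26.107}$$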
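
The Möbius transform (26.107) can be checked on a small example by first computing Bel from a mass assignment and then recovering the masses. This is a sketch for a finite universe; the helper names and the example masses are illustrative assumptions.

```python
from itertools import combinations

def subsets(s):
    """Yield all subsets of s as frozensets."""
    items = sorted(s)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

def bel(m, X):
    """(26.101): total mass of the focal elements contained in X."""
    return sum(mass for focal, mass in m.items() if focal <= X)

def moebius(bel_fn, U):
    """(26.107): m(X) = sum over Y subset of X of (-1)^{|X \\ Y|} Bel(Y)."""
    return {X: sum((-1) ** len(X - Y) * bel_fn(Y) for Y in subsets(X))
            for X in subsets(U)}

U = frozenset({"a", "b", "c"})
m = {frozenset({"a"}): 0.5, frozenset({"a", "b"}): 0.25, U: 0.25}

recovered = moebius(lambda X: bel(m, X), U)
print({tuple(sorted(X)): v for X, v in recovered.items() if abs(v) > 1e-12})
```

Only the three original focal elements come back with nonzero mass; every other subset is assigned zero.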


A crisp belief structure can also induce a dual pair of fuzzy belief and
plausibility functions.


Definition 26.3
Let $(\mathcal{M}, m)$ be a crisp belief structure on $U$. A fuzzy set
function $\mathrm{Bel}: \mathcal{F}(U) \to [0, 1]$ is called a CF-belief function
on $U$ if

$$\mathrm{Bel}(X) = \sum_{A \in \mathcal{M}} m(A) N_A(X), \quad \forall X \in \mathcal{F}(U). \tag{26.108}$$

A fuzzy set function $\mathrm{Pl}: \mathcal{F}(U) \to [0, 1]$ is called a
CF-plausibility function on $U$ if

$$\mathrm{Pl}(X) = \sum_{A \in \mathcal{M}} m(A) \Pi_A(X), \quad \forall X \in \mathcal{F}(U), \tag{26.109}$$

where $N_A: \mathcal{F}(U) \to [0, 1]$ and $\Pi_A: \mathcal{F}(U) \to [0, 1]$ are,
respectively, the necessity measure and the possibility measure
determined by the crisp set $A$, defined as follows:

$$N_A(X) = \bigwedge_{u \in A} X(u), \quad X \in \mathcal{F}(U), \tag{26.110}$$

$$\Pi_A(X) = \bigvee_{u \in A} X(u), \quad X \in \mathcal{F}(U). \tag{26.111}$$


Belief and Plausibility Functions Derived 
from a Fuzzy Belief Structure 


Definition 26.4 

Let $U$ be a nonempty set, which may be infinite. A set function
$m: \mathcal{F}(U) \to [0, 1]$ is referred to as a fuzzy basic probability
assignment if it satisfies axioms (FM1) and (FM2):

$$\text{(FM1)} \quad m(\emptyset) = 0, \qquad \text{(FM2)} \quad \sum_{X \in \mathcal{F}(U)} m(X) = 1.$$

A fuzzy set $X \in \mathcal{F}(U)$ with $m(X) > 0$ is referred to as a focal
element of $m$. We denote by $\mathcal{M}$ the family of all focal elements
of $m$. The pair $(\mathcal{M}, m)$ is called a fuzzy belief structure.


Lemma 26.2 
[26.62] Let (M, m) be a fuzzy belief structure on W. 
Then the focal elements of m constitute a countable set. 


In the discussion to follow, all the focal elements are 
supposed to be normal, i. e., for any A € M, there exists 
an x € U such that A(x) = 1. Associated with the fuzzy 
belief structure (M,m), two pairs of fuzzy belief and 
plausibility functions can be derived. 
Definition 26.5
Let $U$ be a nonempty set, which may be infinite, and $(\mathcal{M}, m)$ a
fuzzy belief structure on $U$. A crisp set function
$\mathrm{Bel}: \mathcal{P}(U) \to [0, 1]$ is referred to as an FC-belief function
on $U$ if

$$\mathrm{Bel}(X) = \sum_{A \in \mathcal{M}} m(A) N_A(X), \quad \forall X \in \mathcal{P}(U). \tag{26.112}$$

A crisp set function $\mathrm{Pl}: \mathcal{P}(U) \to [0, 1]$ is called an
FC-plausibility function on $U$ if

$$\mathrm{Pl}(X) = \sum_{A \in \mathcal{M}} m(A) \Pi_A(X), \quad \forall X \in \mathcal{P}(U), \tag{26.113}$$

where $N_A: \mathcal{P}(U) \to [0, 1]$ and $\Pi_A: \mathcal{P}(U) \to [0, 1]$ are,
respectively, the necessity measure and the possibility measure
determined by the fuzzy set $A$, defined as follows:

$$N_A(X) = \bigwedge_{u \notin X} (1 - A(u)), \quad X \in \mathcal{P}(U), \tag{26.114}$$

$$\Pi_A(X) = \bigvee_{u \in X} A(u), \quad X \in \mathcal{P}(U). \tag{26.115}$$


Definition 26.6 
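Let $U$ be a nonempty set, which may be infinite, and $(\mathcal{M}, m)$ a
fuzzy belief structure on $U$. A fuzzy set function
$\mathrm{Bel}: \mathcal{F}(U) \to [0, 1]$ is referred to as an FF-belief function
on $U$ if

$$\mathrm{Bel}(X) = \sum_{A \in \mathcal{M}} m(A) N_A(X), \quad \forall X \in \mathcal{F}(U). \tag{26.116}$$

A fuzzy set function $\mathrm{Pl}: \mathcal{F}(U) \to [0, 1]$ is called an
FF-plausibility function on $U$ if

$$\mathrm{Pl}(X) = \sum_{A \in \mathcal{M}} m(A) \Pi_A(X), \quad \forall X \in \mathcal{F}(U), \tag{26.117}$$

where $N_A: \mathcal{F}(U) \to [0, 1]$ and $\Pi_A: \mathcal{F}(U) \to [0, 1]$ are,
respectively, the necessity measure and the possibility measure
determined by the fuzzy set $A$, defined as follows:

$$N_A(X) = \bigwedge_{u \in U} (X(u) \vee (1 - A(u))), \quad X \in \mathcal{F}(U), \tag{26.118}$$

$$\Pi_A(X) = \bigvee_{u \in U} (X(u) \wedge A(u)), \quad X \in \mathcal{F}(U). \tag{26.119}$$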
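
On a finite universe, the FF-belief and FF-plausibility functions of Definition 26.6 reduce to weighted sums of min/max expressions. The sketch below represents fuzzy sets as dictionaries and a fuzzy belief structure as a list of (focal element, mass) pairs; these representations, the names, and the data are illustrative assumptions.

```python
def necessity(A, X):
    """(26.118): N_A(X) = min over u of max(X(u), 1 - A(u))."""
    return min(max(X[u], 1.0 - A[u]) for u in A)

def possibility(A, X):
    """(26.119): Pi_A(X) = max over u of min(X(u), A(u))."""
    return max(min(X[u], A[u]) for u in A)

def ff_bel(structure, X):
    """(26.116): mass-weighted sum of necessity degrees."""
    return sum(mass * necessity(A, X) for A, mass in structure)

def ff_pl(structure, X):
    """(26.117): mass-weighted sum of possibility degrees."""
    return sum(mass * possibility(A, X) for A, mass in structure)

# Two normal fuzzy focal elements with masses summing to 1 (hypothetical data).
structure = [
    ({"a": 1.0, "b": 0.5, "c": 0.0}, 0.5),
    ({"a": 0.25, "b": 1.0, "c": 0.75}, 0.5),
]
X = {"a": 0.75, "b": 0.5, "c": 0.25}
not_X = {u: 1.0 - v for u, v in X.items()}

print(ff_bel(structure, X), ff_pl(structure, X))
print(abs(ff_bel(structure, X) - (1.0 - ff_pl(structure, not_X))) < 1e-12)
```

The second print checks, for this one fuzzy set, the duality between Bel and Pl with respect to the complement.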


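It can be proved that the belief and plausibility functions derived from
the same fuzzy belief structure $(\mathcal{M}, m)$ are dual, that is,

$$\mathrm{Bel}(X) = 1 - \mathrm{Pl}(\sim X), \quad \forall X \in \mathcal{F}(U), \tag{26.120}$$

and

$$\mathrm{Bel}(X) \le \mathrm{Pl}(X), \quad \forall X \in \mathcal{F}(U). \tag{26.121}$$

Moreover, Bel is a fuzzy monotone Choquet capacity of infinite order on
$U$ which satisfies axioms (FMC1) to (FMC3):

(FMC1) $\mathrm{Bel}(\emptyset) = 0$,
(FMC2) $\mathrm{Bel}(U) = 1$,
(FMC3) for $X_i \in \mathcal{F}(U)$, $i = 1, 2, \ldots, n$, $n \in \mathbb{N}$,

$$\mathrm{Bel}\Big(\bigcup_{i=1}^{n} X_i\Big) \ge \sum_{\emptyset \neq J \subseteq \{1, 2, \ldots, n\}} (-1)^{|J|+1}\, \mathrm{Bel}\Big(\bigcap_{j \in J} X_j\Big). \tag{26.122}$$

And Pl is a fuzzy alternating Choquet capacity of infinite order on $U$
which obeys axioms (FAC1) to (FAC3):

(FAC1) $\mathrm{Pl}(\emptyset) = 0$,
(FAC2) $\mathrm{Pl}(U) = 1$,
(FAC3) for $X_i \in \mathcal{F}(U)$, $i = 1, 2, \ldots, n$, $n \in \mathbb{N}$,

$$\mathrm{Pl}\Big(\bigcap_{i=1}^{n} X_i\Big) \le \sum_{\emptyset \neq J \subseteq \{1, 2, \ldots, n\}} (-1)^{|J|+1}\, \mathrm{Pl}\Big(\bigcup_{j \in J} X_j\Big). \tag{26.123}$$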


26.3.2 Belief Structures 
of Rough Approximations 


In this section, we show relationships between various 
belief and plausibility functions in Dempster-Shafer 
theory of evidence and the lower and upper approxi- 
mations in rough set theory with potential applications. 


Belief Functions 
versus Crisp Rough Approximations 


Definition 26.7 

Let $U$ and $W$ be two nonempty universes of discourse. A subset
$R \in \mathcal{P}(U \times W)$ is referred to as a binary relation from $U$ to
$W$. The relation $R$ is referred to as serial if for any $x \in U$ there
exists $y \in W$ such that $(x, y) \in R$. If $U = W$,
$R \in \mathcal{P}(U \times U)$ is called a binary relation on $U$. A relation
$R \in \mathcal{P}(U \times U)$ is referred to as reflexive if $(x, x) \in R$ for
all $x \in U$; $R$ is referred to as symmetric if $(x, y) \in R$ implies
$(y, x) \in R$ for all $x, y \in U$; $R$ is referred to as transitive if for
any $x, y, z \in U$, $(x, y) \in R$ and $(y, z) \in R$ imply $(x, z) \in R$; $R$
is referred to as Euclidean if for any $x, y, z \in U$, $(x, y) \in R$ and
$(x, z) \in R$ imply $(y, z) \in R$; $R$ is referred to as an equivalence
relation if $R$ is reflexive, symmetric, and transitive.


Assume that $R$ is an arbitrary binary relation from $U$ to $W$. One can
define a set-valued mapping $R_s: U \to \mathcal{P}(W)$ by

$$R_s(x) = \{y \in W \mid (x, y) \in R\}, \quad x \in U. \tag{26.124}$$

$R_s(x)$ is called the successor neighborhood of $x$ with respect to $R$
[26.63]. Obviously, any set-valued mapping $F$ from $U$ to $W$ defines a
binary relation from $U$ to $W$ by setting
$R = \{(x, y) \in U \times W \mid y \in F(x)\}$. For $A \in \mathcal{P}(W)$, let
$j(A) = R_s^{-1}(A)$ be the counter-image of $A$ under the set-valued
mapping $R_s$, i.e.,

$$j(A) = \begin{cases} R_s^{-1}(A) = \{u \in U \mid R_s(u) = A\}, & \text{if } A \in \{R_s(x) \mid x \in U\}, \\ \emptyset, & \text{otherwise.} \end{cases} \tag{26.125}$$

Then it is well known that $j$ satisfies the properties (J1) and (J2):

(J1) $A \neq B \Longrightarrow j(A) \cap j(B) = \emptyset$,
(J2) $\bigcup_{A \in \mathcal{P}(W)} j(A) = U$.

Definition 26.8 

If $R$ is an arbitrary relation from $U$ to $W$, then the triple
$(U, W, R)$ is referred to as a generalized approximation space. For any
set $A \subseteq W$, a pair of lower and upper approximations,
$\underline{R}(A)$ and $\overline{R}(A)$, are, respectively, defined by

$$\underline{R}(A) = \{x \in U \mid R_s(x) \subseteq A\}, \quad \overline{R}(A) = \{x \in U \mid R_s(x) \cap A \neq \emptyset\}. \tag{26.126}$$

The pair $(\underline{R}(A), \overline{R}(A))$ is referred to as a generalized
crisp rough set, and $\underline{R}$ and
$\overline{R}: \mathcal{P}(W) \to \mathcal{P}(U)$ are called the lower and upper
generalized approximation operators, respectively.
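
The generalized approximations of (26.126) can be sketched as follows for finite $U$ and $W$, with the relation stored as a set of pairs. The function names and the small relation are illustrative assumptions.

```python
def successor(R, x):
    """(26.124): successor neighborhood R_s(x) = {y | (x, y) in R}."""
    return {y for (u, y) in R if u == x}

def lower(R, U, A):
    """Lower approximation: objects whose successor neighborhood lies inside A."""
    return {x for x in U if successor(R, x) <= A}

def upper(R, U, A):
    """Upper approximation: objects whose successor neighborhood meets A."""
    return {x for x in U if successor(R, x) & A}

U = {1, 2, 3}
W = {"p", "q", "r"}
R = {(1, "p"), (1, "q"), (2, "q"), (3, "r")}   # a serial relation from U to W

A = {"q"}
print(lower(R, U, A), upper(R, U, A))
```

If $R$ were not serial, an object with an empty successor neighborhood would belong to every lower approximation but to no upper approximation, so the lower approximation would no longer be contained in the upper one.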


If $U$ is a countable set, $P$ a normalized probability measure on $U$,
i.e., $P(\{x\}) > 0$ for all $x \in U$, and $R$ an arbitrary relation from
$U$ to $W$, then $((U, P), W, R)$ is referred to as a probability
approximation space.


Theorem 26.1 [26.43]
Assume that $((U, P), W, R)$ is a serial probability approximation
space. For $X \in \mathcal{P}(W)$, define

$$m(X) = P(j(X)), \quad \mathrm{Bel}(X) = P(\underline{R}(X)), \quad \mathrm{Pl}(X) = P(\overline{R}(X)). \tag{26.127}$$

Then $m: \mathcal{P}(W) \to [0, 1]$ is a basic probability assignment on $W$,
and $\mathrm{Bel}: \mathcal{P}(W) \to [0, 1]$ and
$\mathrm{Pl}: \mathcal{P}(W) \to [0, 1]$ are, respectively, the CC-belief and
CC-plausibility functions on $W$.

Conversely, let $(\mathcal{M}, m)$ be any crisp belief structure on $W$,
which may be infinite. If $\mathrm{Bel}: \mathcal{P}(W) \to [0, 1]$ and
$\mathrm{Pl}: \mathcal{P}(W) \to [0, 1]$ are, respectively, the CC-belief and
CC-plausibility functions defined in Definition 26.2, then there exists
a countable set $U$, a serial relation $R$ from $U$ to $W$, and a
normalized probability measure $P$ on $U$ such that

$$\mathrm{Bel}(X) = P(\underline{R}(X)), \quad \mathrm{Pl}(X) = P(\overline{R}(X)), \quad \forall X \in \mathcal{P}(W). \tag{26.128}$$
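
The first half of Theorem 26.1 can be verified exhaustively on a small serial probability approximation space. The sketch below assumes a uniform probability on $U$ and uses exact fractions; the relation, the names, and the helper `powerset` are illustrative assumptions.

```python
from fractions import Fraction
from itertools import combinations

U = [1, 2, 3, 4]
W = ["p", "q", "r"]
R = {(1, "p"), (1, "q"), (2, "q"), (3, "r"), (4, "q")}   # serial relation
P = {x: Fraction(1, len(U)) for x in U}                  # uniform probability

def successor(x):
    return frozenset(y for (u, y) in R if u == x)

def prob(S):
    return sum((P[x] for x in S), Fraction(0))

# m(X) = P(j(X)), where j(X) = {u in U | R_s(u) = X}.
m = {}
for x in U:
    m[successor(x)] = m.get(successor(x), Fraction(0)) + P[x]

def lower(X):
    return {x for x in U if successor(x) <= X}

def upper(X):
    return {x for x in U if successor(x) & X}

def bel(X):
    return sum((v for A, v in m.items() if A <= X), Fraction(0))

def pl(X):
    return sum((v for A, v in m.items() if A & X), Fraction(0))

def powerset(items):
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

consistent = all(bel(X) == prob(lower(X)) and pl(X) == prob(upper(X))
                 for X in powerset(W))
print(m, consistent)
```

Here the belief and plausibility values computed from the induced mass assignment coincide with the probabilities of the lower and upper approximations for every subset of $W$.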


The notion of information systems (sometimes 
called data tables, attribute-value systems, knowledge 
representation systems etc.) provides a convenient tool 
for the representation of objects in terms of their at- 
tribute values. 

An information system is a pair $(U, A)$, where $U =
\{x_1, x_2, \ldots, x_n\}$ is a nonempty, finite set of objects called
the universe of discourse and $A = \{a_1, a_2, \ldots, a_m\}$ is
a nonempty, finite set of attributes, such that $a: U \to V_a$
for any $a \in A$, where $V_a$ is called the domain of $a$.

Each nonempty subset $B \subseteq A$ determines an
indiscernibility relation as follows:

$$
R_B = \{(x, y) \in U \times U \mid a(x) = a(y), \ \forall a \in B\}.
\tag{26.129}
$$

Since $R_B$ is an equivalence relation on $U$, it forms a
partition $U/R_B = \{[x]_B \mid x \in U\}$ of $U$, where $[x]_B$ denotes
the equivalence class determined by $x$ with respect to
(w.r.t.) $B$, i.e., $[x]_B = \{y \in U \mid (x, y) \in R_B\}$.

Let $(U, A)$ be an information system and $B \subseteq A$. For any
$X \subseteq U$, denote

$$
\underline{R_B}(X) = \{x \in U \mid [x]_B \subseteq X\}, \qquad
\overline{R_B}(X) = \{x \in U \mid [x]_B \cap X \neq \emptyset\},
\tag{26.130}
$$

where $\underline{R_B}(X)$ and $\overline{R_B}(X)$ are, respectively, referred to as
the lower and upper approximations of $X$ w.r.t. $(U, R_B)$,
the knowledge generated by $B$. Objects in $\underline{R_B}(X)$ can
be classified with certainty as elements of $X$ on the basis
of knowledge in $(U, R_B)$, whereas objects in $\overline{R_B}(X)$ can
only be classified as possible elements of $X$ on the basis
of knowledge in $(U, R_B)$.

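For $B \subseteq A$ and $X \subseteq U$, denote $\mathrm{Bel}_B(X) = P(\underline{R_B}(X))$
and $\mathrm{Pl}_B(X) = P(\overline{R_B}(X))$, where $P(Y) = |Y|/|U|$ and $|Y|$
is the cardinality of a set $Y$. Then $\mathrm{Bel}_B$ and $\mathrm{Pl}_B$ are a
CC-belief function and a CC-plausibility function on $U$,
respectively, and the corresponding mass distribution is

$$
m_B(Y) = \begin{cases}
P(Y), & \text{if } Y \in U/R_B, \\
0, & \text{otherwise}.
\end{cases}
$$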
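The construction in (26.129) and (26.130), together with the induced belief and plausibility values, can be sketched as below. The data table, attribute names, and helper functions are hypothetical and serve only to illustrate the definitions.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical information system: objects described by two attributes.
table = {
    "x1": {"color": "red", "size": "small"},
    "x2": {"color": "red", "size": "small"},
    "x3": {"color": "blue", "size": "small"},
    "x4": {"color": "blue", "size": "large"},
}

def partition(table, B):
    """U/R_B: group objects that agree on every attribute in B."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in B)].add(obj)
    return list(classes.values())

def approximations(table, B, X):
    """Lower and upper approximations of X w.r.t. (U, R_B)."""
    lower, upper = set(), set()
    for block in partition(table, B):
        if block <= X:
            lower |= block
        if block & X:
            upper |= block
    return lower, upper

def bel_pl(table, B, X):
    """Bel_B(X) and Pl_B(X) with P(Y) = |Y| / |U|."""
    lower, upper = approximations(table, B, X)
    n = len(table)
    return Fraction(len(lower), n), Fraction(len(upper), n)

X = {"x1", "x2", "x3"}
coarse = bel_pl(table, ["color"], X)          # x3 and x4 are indiscernible
fine = bel_pl(table, ["color", "size"], X)    # every class is inside or outside X
```

With the coarser knowledge the gap between belief and plausibility is wide; adding the second attribute closes it, since $X$ then becomes a union of equivalence classes.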


A decision system (sometimes called a decision table)
is a pair $(U, C \cup \{d\})$, where $(U, C)$ is an information
system and $d$ is a distinguished attribute called the
decision; in this case $C$ is called the conditional attribute
set, and $d$ is a map $d: U \to V_d$ of the universe $U$ into the
value set $V_d$. We assume, without any loss of generality,
that $V_d = \{1, 2, \ldots, r\}$. Define

$$
R_d = \{(x, y) \in U \times U \mid d(x) = d(y)\}.
\tag{26.131}
$$

Then we obtain the partition $U/R_d = \{D_1, D_2, \ldots, D_r\}$
of $U$ into decision classes, where $D_j = \{x \in U \mid d(x) =
j\}$, $j \le r$. If $R_C \subseteq R_d$, then the decision system $(U, C \cup
\{d\})$ is consistent; otherwise it is inconsistent. One
can acquire certain decision rules from consistent
decision systems and uncertain decision rules from
inconsistent decision systems.
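The consistency condition $R_C \subseteq R_d$ says that objects indiscernible on the conditional attributes must share the same decision. A minimal check, assuming rows stored as dictionaries with made-up attribute names, could look like this:

```python
def is_consistent(rows, C, d):
    """True iff objects that agree on all attributes in C also agree on d."""
    decision_of = {}
    for row in rows:
        key = tuple(row[a] for a in C)
        # Remember the first decision seen for this C-class and compare later ones.
        if decision_of.setdefault(key, row[d]) != row[d]:
            return False
    return True

# Hypothetical decision tables.
consistent_rows = [
    {"temp": "high", "cough": "yes", "flu": 1},
    {"temp": "high", "cough": "no", "flu": 2},
    {"temp": "low", "cough": "no", "flu": 2},
]
inconsistent_rows = consistent_rows + [
    {"temp": "high", "cough": "yes", "flu": 2},  # clashes with the first row
]

ok = is_consistent(consistent_rows, ["temp", "cough"], "flu")
clash = is_consistent(inconsistent_rows, ["temp", "cough"], "flu")
```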


Belief Functions 
versus Rough Fuzzy Approximations 


Definition 26.9 

Let $(U, W, R)$ be a generalized approximation space.
For a fuzzy set $A \in \mathcal{F}(W)$, the lower and upper
approximations of $A$, $\underline{RF}(A)$ and $\overline{RF}(A)$, with respect to
the approximation space $(U, W, R)$ are fuzzy sets of $U$
whose membership functions, for each $x \in U$, are defined by

$$
\overline{RF}(A)(x) = \bigvee_{y \in R_s(x)} A(y), \quad x \in U,
\tag{26.132}
$$

$$
\underline{RF}(A)(x) = \bigwedge_{y \in R_s(x)} A(y), \quad x \in U.
\tag{26.133}
$$

The pair $(\underline{RF}(A), \overline{RF}(A))$ is referred to as a generalized
rough fuzzy set, and $\underline{RF}, \overline{RF}: \mathcal{F}(W) \to \mathcal{F}(U)$ are
referred to as lower and upper generalized rough fuzzy
approximation operators, respectively.
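For finite universes, the supremum and infimum in (26.132) and (26.133) reduce to a maximum and a minimum over the successor set. The following sketch assumes a serial relation given as a dictionary of successor sets; the data are invented for illustration.

```python
def rough_fuzzy_approximations(successors, A):
    """Return (lower, upper) membership dictionaries for the fuzzy set A.

    successors[x] plays the role of R_s(x); it is assumed to be nonempty
    (serial relation), so min and max are well defined.
    """
    lower, upper = {}, {}
    for x, ys in successors.items():
        values = [A[y] for y in ys]
        lower[x] = min(values)   # infimum over R_s(x)
        upper[x] = max(values)   # supremum over R_s(x)
    return lower, upper

successors = {"x1": {"y1", "y2"}, "x2": {"y2"}, "x3": {"y1", "y3"}}
A = {"y1": 0.9, "y2": 0.4, "y3": 0.0}

lower, upper = rough_fuzzy_approximations(successors, A)
```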


In the discussion to follow, we always assume that
$(U, \mathcal{A}, P)$ is a probability space, i.e., $U$ is a nonempty
set, $\mathcal{A} \subseteq \mathcal{P}(U)$ a $\sigma$-algebra on $U$, and $P$ a probability
measure on $U$.


Definition 26.10 

A fuzzy set $A \in \mathcal{F}(U)$ is said to be measurable w.r.t.
$(U, \mathcal{A})$ if $A: U \to [0, 1]$ is a measurable function w.r.t.
$\mathcal{A}$-$\mathcal{B}([0, 1])$, where $\mathcal{B}([0, 1])$ is the family of Borel sets
on $[0, 1]$. We denote by $\mathcal{F}(U, \mathcal{A})$ the family of all
measurable fuzzy sets of $U$ w.r.t. $\mathcal{A}$-$\mathcal{B}([0, 1])$.


For any measurable fuzzy set $A \in \mathcal{F}(U, \mathcal{A})$, since
$A_\alpha \in \mathcal{A}$ for all $\alpha \in [0, 1]$, $A_\alpha$ is a measurable set on
the probability space $(U, \mathcal{A}, P)$ and hence $P(A_\alpha) \in [0, 1]$.
Since $f(\alpha) = P(A_\alpha)$ is monotone decreasing and
left continuous, $f$ is integrable; we denote the integral
by $\int_0^1 P(A_\alpha)\, d\alpha$.


Definition 26.11 
Let a fuzzy set $A$ be measurable w.r.t. $(U, \mathcal{A})$, and let $P$ be
a probability measure on $(U, \mathcal{A})$. Denote

$$
P(A) = \int_0^1 P(A_\alpha)\, d\alpha.
\tag{26.134}
$$

$P(A)$ is called the probability of $A$.


For a singleton set {x}, we will write P(x) instead of 
P({x}) for short. 


Proposition 26.1 
[26.21, 64] The fuzzy probability measure $P$ in
Definition 26.11 satisfies the following properties:

(1) $P(A) \in [0, 1]$ and $P(A) + P(\sim A) = 1$, for all $A \in
\mathcal{F}(U, \mathcal{A})$.

(2) $P$ is countably additive, i.e., for $A_i \in \mathcal{F}(U, \mathcal{A})$, $i =
1, 2, \ldots$, with $A_i \cap A_j = \emptyset$ for all $i \neq j$,

$$
P\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} P(A_i).
\tag{26.135}
$$

(3) $A, B \in \mathcal{F}(U, \mathcal{A})$, $A \subseteq B \implies P(A) \le P(B)$.

(4) If $U = \{u_i \mid i = 1, 2, \ldots\}$ is an infinite countable set
and $\mathcal{A} = \mathcal{P}(U)$, then for all $A \in \mathcal{F}(U)$,

$$
P(A) = \int_0^1 P(A_\alpha)\, d\alpha = \sum_{x \in U} A(x) P(x).
\tag{26.136}
$$

(5) If $U$ is a finite set with $|U| = n$, $\mathcal{A} = \mathcal{P}(U)$, and
$P(u) = 1/n$, then $P(A) = \int_0^1 P(A_\alpha)\, d\alpha = |A|/n$ for
all $A \in \mathcal{P}(U)$.
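On a finite universe the integral in (26.134) is the integral of a step function, so it can be evaluated exactly and compared with the weighted sum of (26.136). The sketch below assumes a three-element universe with made-up memberships and probabilities, and takes the $\alpha$-cut to be $A_\alpha = \{x \mid A(x) \ge \alpha\}$.

```python
from fractions import Fraction as F

def prob_by_sum(A, P):
    """Weighted sum of memberships, as on the right-hand side of (26.136)."""
    return sum(A[x] * P[x] for x in A)

def prob_by_cuts(A, P):
    """Exact integral of alpha -> P(A_alpha) over [0, 1] for finite U.

    P(A_alpha) is constant between consecutive membership levels,
    so the integral is a finite sum of rectangle areas.
    """
    levels = sorted({v for v in A.values() if v > 0})
    total, previous = F(0), F(0)
    for a in levels:
        cut_probability = sum(P[x] for x in A if A[x] >= a)
        total += (a - previous) * cut_probability
        previous = a
    return total

A = {"u1": F(1, 2), "u2": F(1), "u3": F(0)}
P = {"u1": F(1, 2), "u2": F(1, 4), "u3": F(1, 4)}
complement = {x: 1 - A[x] for x in A}

p_sum = prob_by_sum(A, P)
p_cuts = prob_by_cuts(A, P)
p_complement = prob_by_sum(complement, P)
```

Both routes give the same value, and the probabilities of $A$ and its complement add up to one, in line with property (1).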


Theorem 26.2 
Assume that $((U, P), W, R)$ is a serial probability
approximation space. For $X \in \mathcal{F}(W)$, define

$$
m(X) = P(j(X)), \qquad
\mathrm{Bel}(X) = P(\underline{RF}(X)), \qquad
\mathrm{Pl}(X) = P(\overline{RF}(X)).
\tag{26.137}
$$

Then $m: \mathcal{P}(W) \to [0, 1]$ is a basic probability
assignment on $W$, and $\mathrm{Bel}: \mathcal{F}(W) \to [0, 1]$ and
$\mathrm{Pl}: \mathcal{F}(W) \to [0, 1]$ are, respectively, the CF-belief
and CF-plausibility functions on $W$.

Conversely, let $(\mathcal{M}, m)$ be any crisp belief structure
on $W$, which may be infinite. If $\mathrm{Bel}: \mathcal{F}(W) \to [0, 1]$ and
$\mathrm{Pl}: \mathcal{F}(W) \to [0, 1]$ are, respectively, the CF-belief and
CF-plausibility functions defined in Definition 26.3,
then there exist a countable set $U$, a serial relation $R$
from $U$ to $W$, and a normalized probability measure $P$
on $U$ such that

$$
\mathrm{Bel}(X) = P(\underline{RF}(X)), \qquad
\mathrm{Pl}(X) = P(\overline{RF}(X)), \qquad
\forall X \in \mathcal{F}(W).
\tag{26.138}
$$


For a decision table $(U, C \cup \{d\})$, where $V_d =
\{d_1, d_2, \ldots, d_r\}$, $d$ is called a fuzzy decision if, for each
$x \in U$, $d(x)$ is a fuzzy subset of $V_d$, i.e., $d: U \to \mathcal{F}(V_d)$.
Without loss of generality, we represent $d$ as follows:

$$
d(x_i) = d_{i1}/d_1 + d_{i2}/d_2 + \cdots + d_{ir}/d_r, \quad i = 1, 2, \ldots, n,
\tag{26.139}
$$

where $d_{ij} \in [0, 1]$. In this case, $(U, C \cup \{d\})$ is called an
information system with fuzzy decision. For the fuzzy
decision $d$, we define a fuzzy indiscernibility binary
relation $R_d$ on $U$ as follows: for $i, k = 1, 2, \ldots, n$,

$$
R_d(x_i, x_k) = \min\{1 - |d_{ij} - d_{kj}| : j = 1, 2, \ldots, r\}.
\tag{26.140}
$$

Then, we obtain a fuzzy similarity class $S_d(x)$ of $x \in U$
in the system $(U, C \cup \{d\})$ as follows:

$$
S_d(x)(y) = R_d(x, y), \quad y \in U.
\tag{26.141}
$$

Since $S_d(x)(x) = R_d(x, x) = 1$, we see that $S_d(x): U \to
[0, 1]$ is a normalized fuzzy set of $U$. Denote by $U/R_d$
the fuzzy similarity classes induced by the fuzzy
decision $d$, i.e.,

$$
U/R_d = \{S_d(x) \mid x \in U\}.
\tag{26.142}
$$

For $B \subseteq C$ and $X \in \mathcal{F}(U)$, we define the lower and
upper approximations of $X$ w.r.t. $(U, R_B)$ as follows:

$$
\underline{RF_B}(X)(x) = \bigwedge_{y \in S_B(x)} X(y), \quad x \in U, \qquad
\overline{RF_B}(X)(x) = \bigvee_{y \in S_B(x)} X(y), \quad x \in U.
\tag{26.143}
$$


Theorem 26.3 

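Let $(U, C \cup \{d\})$ be an information system with fuzzy
decision. For $B \subseteq C$ and $X \in \mathcal{F}(U)$, if $\underline{RF_B}(X)$ and
$\overline{RF_B}(X)$ are, respectively, the lower and upper
approximations of $X$ w.r.t. $(U, R_B)$ defined by Definition 26.9,
denote

$$
\mathrm{Bel}_B(X) = P(\underline{RF_B}(X)), \qquad
\mathrm{Pl}_B(X) = P(\overline{RF_B}(X)),
\tag{26.144}
$$

where $P(X) = \sum_{x \in U} X(x)/|U|$ for $X \in \mathcal{F}(U)$. Then
$\mathrm{Bel}_B: \mathcal{F}(U) \to [0, 1]$ and $\mathrm{Pl}_B: \mathcal{F}(U) \to [0, 1]$ are,
respectively, a CF-belief function and a CF-plausibility
function on $U$, and the corresponding basic probability
assignment $m_B$ is

$$
m_B(Y) = \begin{cases}
P(Y) = |Y|/|U|, & \text{if } Y \in U/R_B, \\
0, & \text{otherwise}.
\end{cases}
\tag{26.145}
$$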


Belief Functions 
versus Fuzzy Rough Approximations 


Definition 26.12 

Let $U$ and $W$ be two nonempty universes of discourse.
A fuzzy subset $R \in \mathcal{F}(U \times W)$ is referred to as a
binary relation from $U$ to $W$; $R(x, y)$ is the degree of
relation between $x$ and $y$, where $(x, y) \in U \times W$. The
fuzzy relation $R$ is referred to as serial if for each $x \in U$,
$\bigvee_{y \in W} R(x, y) = 1$. If $U = W$, $R \in \mathcal{F}(U \times U)$ is called
a fuzzy binary relation on $U$. $R$ is referred to as a
reflexive fuzzy relation if $R(x, x) = 1$ for all $x \in U$; $R$ is
referred to as a symmetric fuzzy relation if $R(x, y) =
R(y, x)$ for all $x, y \in U$; $R$ is referred to as a transitive
fuzzy relation if $R(x, z) \ge \bigvee_{y \in U} (R(x, y) \wedge R(y, z))$ for all
$x, z \in U$; $R$ is referred to as an equivalence fuzzy
relation if it is reflexive, symmetric, and transitive.


Definition 26.13 

Let $U$ and $W$ be two nonempty universes of discourse
and $R$ a fuzzy relation from $U$ to $W$. The triple
$(U, W, R)$ is called a generalized fuzzy approximation
space. For any set $A \in \mathcal{F}(W)$, the lower and upper
approximations of $A$, $\underline{FR}(A)$ and $\overline{FR}(A)$, with respect to
the approximation space $(U, W, R)$ are fuzzy sets of $U$
whose membership functions, for each $x \in U$, are defined by

$$
\begin{aligned}
\overline{FR}(A)(x) &= \bigvee_{y \in W} [R(x, y) \wedge A(y)], \quad x \in U, \\
\underline{FR}(A)(x) &= \bigwedge_{y \in W} [(1 - R(x, y)) \vee A(y)], \quad x \in U.
\end{aligned}
\tag{26.146}
$$

The pair $(\underline{FR}(A), \overline{FR}(A))$ is referred to as a generalized
fuzzy rough set, and $\underline{FR}, \overline{FR}: \mathcal{F}(W) \to \mathcal{F}(U)$ are
referred to as lower and upper generalized fuzzy rough
approximation operators, respectively.


Theorem 26.4 

Let $(U, W, R)$ be a serial fuzzy approximation space in
which $U$ is a countable set and $P$ a probability measure
on $U$. If $\underline{FR}$ and $\overline{FR}$ are the fuzzy rough approximation
operators defined in Definition 26.13, denote

$$
\mathrm{Bel}(X) = P(\underline{FR}(X)), \qquad
\mathrm{Pl}(X) = P(\overline{FR}(X)), \qquad X \in \mathcal{F}(W).
\tag{26.147}
$$

Then $\mathrm{Bel}: \mathcal{F}(W) \to [0, 1]$ and $\mathrm{Pl}: \mathcal{F}(W) \to [0, 1]$ are,
respectively, FF-fuzzy belief and FF-plausibility
functions on $W$.

Conversely, if $(\mathcal{M}, m)$ is a fuzzy belief structure
on $W$, and $\mathrm{Bel}: \mathcal{F}(W) \to [0, 1]$ and $\mathrm{Pl}: \mathcal{F}(W) \to [0, 1]$ are
the pair of FF-fuzzy belief function and FF-plausibility
function defined in Definition 26.6, then there exist
a countable set $U$, a serial fuzzy relation $R$ from $U$ to
$W$, and a probability measure $P$ on $U$ such that for all
$X \in \mathcal{F}(W)$,

$$
\mathrm{Bel}(X) = P(\underline{FR}(X)) = \sum_{x \in U} \underline{FR}(X)(x) P(x),
\tag{26.148}
$$

$$
\mathrm{Pl}(X) = P(\overline{FR}(X)) = \sum_{x \in U} \overline{FR}(X)(x) P(x).
\tag{26.149}
$$
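A small numerical illustration of (26.146) to (26.149) may help. The sketch assumes finite universes, a serial fuzzy relation stored as nested dictionaries, and a uniform probability on $U$; all values and names are invented, and memberships are chosen as exact binary fractions so that equality tests are safe.

```python
def fuzzy_rough_approximations(R, A):
    """Lower and upper fuzzy rough approximations of the fuzzy set A.

    R[x][y] is the degree to which x is related to y; min and max play
    the roles of the infimum/supremum and of the connectives in (26.146).
    """
    lower = {x: min(max(1 - r, A[y]) for y, r in row.items()) for x, row in R.items()}
    upper = {x: max(min(r, A[y]) for y, r in row.items()) for x, row in R.items()}
    return lower, upper

def probability(fuzzy_set, P):
    """P(X) as the weighted sum of memberships, as in (26.148) and (26.149)."""
    return sum(fuzzy_set[x] * P[x] for x in fuzzy_set)

# Serial fuzzy relation: each row attains the value 1 somewhere.
R = {
    "x1": {"y1": 1.0, "y2": 0.5},
    "x2": {"y1": 0.25, "y2": 1.0},
}
A = {"y1": 0.75, "y2": 0.25}
P = {"x1": 0.5, "x2": 0.5}

lower, upper = fuzzy_rough_approximations(R, A)
bel = probability(lower, P)
pl = probability(upper, P)
```

In this example the belief value does not exceed the plausibility value, which is what seriality of $R$ guarantees.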


A pair $(U, A)$ is called a fuzzy information system if
each $a \in A$ is a fuzzy attribute, i.e., for each $x \in U$, $a(x)$
is a fuzzy subset of $V_a$, that is, $a: U \to \mathcal{F}(V_a)$. Similar
to (26.140), we can define a reflexive fuzzy binary
relation $R_a$ on $U$, and consequently, for any attribute subset
$B \subseteq A$ one can define a reflexive fuzzy relation $R_B$ as
follows:

$$
R_B = \bigcap_{a \in B} R_a.
\tag{26.150}
$$

For $X \in \mathcal{F}(U)$, denote

$$
\mathrm{Bel}_B(X) = P(\underline{FR_B}(X)), \qquad
\mathrm{Pl}_B(X) = P(\overline{FR_B}(X)),
\tag{26.151}
$$

where $P(X) = \sum_{x \in U} X(x)/|U|$ for $X \in \mathcal{F}(U)$. Then,
according to Theorem 26.4, $\mathrm{Bel}_B: \mathcal{F}(U) \to [0, 1]$ and
$\mathrm{Pl}_B: \mathcal{F}(U) \to [0, 1]$ are, respectively, an FF-fuzzy belief
function and an FF-plausibility function on $U$. More
specifically, if $X$ in (26.151) is a crisp subset of $U$,
then $\mathrm{Bel}_B: \mathcal{P}(U) \to [0, 1]$ and $\mathrm{Pl}_B: \mathcal{P}(U) \to [0, 1]$
defined by (26.151) are, respectively, FC-fuzzy belief
functions and FC-plausibility functions on $U$. Based
on these observations, we believe that FF-fuzzy belief
functions and FF-plausibility functions can be used
to analyze uncertainty in fuzzy information systems with
fuzzy decision, whereas FC-fuzzy belief functions
and FC-plausibility functions can be employed to deal
with knowledge discovery in fuzzy information systems
with crisp decision.


26.3.3 Conclusion of This Section 


The lower and upper approximations of a set capture 
the non-numeric aspect of uncertainty of the set which 
can be interpreted as the qualitative representation of 
the set, whereas the pair of the belief and plausibil- 
ity measures of the set characterize the numeric aspect 
of uncertainty of the set which can be treated as the 
quantitative characterization of the set. In this section,
we have introduced some generalized belief and
plausibility functions in the Dempster-Shafer
theory of evidence. We have shown that the fuzzy belief
and plausibility functions can be interpreted in terms of
the lower and upper approximations in rough set theory.
That is, the belief and plausibility functions in the
Dempster-Shafer theory of evidence can be represented
as the probabilities of lower and upper approximations
in rough set theory; thus, rough set theory may be
regarded as the basis of the Dempster-Shafer theory of
evidence. Also, the Dempster-Shafer theory of evidence
in the fuzzy environment provides a potentially useful
tool for reasoning and knowledge acquisition in fuzzy
systems and fuzzy decision systems.


26.4 Applications of Fuzzy Rough Sets 


Both fuzzy set and rough set theories have fostered 
broad research communities and have been applied 
in a wide range of settings. More recently, this has 
also extended to the hybrid fuzzy rough set models. 
This section tries to give a sample of those applica- 
tions, which are in particular numerous for machine 
learning but which also cover many other fields, like 
image processing, decision making, and information re- 
trieval. 

Note that we do not consider applications that 
simply involve a joint application of fuzzy sets and 
rough sets, like for instance a rough classifier that in- 
duces fuzzy rules. Rather, we focus on applications that 
specifically involve one of the fuzzy rough set models 
discussed in the previous sections. 


26.4.1 Applications in Machine Learning 


Feature Selection 
The most prominent application of classical rough set
theory is undoubtedly semantics-preserving data
dimensionality reduction: the removal of attributes
(features) from information systems without
sacrificing the ability to discern between different
objects. (Recall that an information system $(U, A)$ consists
of a nonempty set $U$ of objects that are described by
a set of attributes $A$.) A minimal attribute subset $B \subseteq A$ that maintains
objects' discernibility is called a reduct. For classifica-
tion tasks, it is sufficient to be able to discern between 
objects belonging to different classes, in which case 
a decision reduct, also called relative reduct, is sought. 

The traditional rough set model sets forth a crisp 
notion of discernibility, where two objects are either 
discernible or not w.r.t. a set of attributes B based on 
their values for all attributes in B. To be able to handle 
numerical data, discretization is required. Fuzzy-rough
feature selection avoids this external preprocessing step
by incorporating graded indiscernibility between
objects directly into the data reduction process. Moreover,
using fuzzy partitions, in which objects can belong to
different classes to varying degrees, yields
a more flexible data representation.

Chronologically, the oldest proposal to apply 
fuzzy rough sets to feature selection is due to 
Kuncheva [26.26] in 1992. However, rather than using 
Dubois and Prade’s definition, she proposed her own 
notion of a fuzzy rough set based on an inclusion mea- 
sure. Based on this, she defined a quality measure for 
evaluating attribute subsets w.r.t. their ability to approx- 
imate a predetermined fuzzy partition on the data, and 
illustrated its usefulness on a medical data set. 

Jensen and Shen [26.27, 29] were the first to
propose a reduction method that generalizes the classical
rough set positive region and dependency function. In
particular, the dependency degree $\gamma_B$, with $B \subseteq A$, is
used to guide a hill-climbing search in which, starting
from $B = \emptyset$, in each step an attribute $a$ is added such
that $\gamma_{B \cup \{a\}}$ is maximal. The search ends when there is
no further increase in the measure. This is the Quick
Reduct algorithm. In [26.65] they replaced this simple
greedy search heuristic by a more complex one based
on ant colony optimization.
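The hill-climbing loop described above can be sketched as follows. As a simplifying assumption, the sketch uses the classical crisp dependency degree (the fraction of objects in the positive region) as a stand-in for the fuzzy-rough measure, whose exact definition is not reproduced here; the function names and the toy table are hypothetical.

```python
from collections import defaultdict

def gamma(rows, B, d):
    """Crisp dependency degree: share of objects whose B-class has one decision."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[a] for a in B)].append(row[d])
    positive = sum(len(ds) for ds in groups.values() if len(set(ds)) == 1)
    return positive / len(rows)

def quick_reduct(rows, attributes, d):
    """Greedy search: add the attribute with the largest gain until no gain remains."""
    B, best = [], gamma(rows, [], d)
    while True:
        candidates = [a for a in attributes if a not in B]
        if not candidates:
            return B
        scored = [(gamma(rows, B + [a], d), a) for a in candidates]
        score, chosen = max(scored, key=lambda pair: pair[0])  # ties: first candidate
        if score <= best:
            return B
        B.append(chosen)
        best = score

# Toy decision table: d is 1 only when a = 1 and b = 1; c merely copies a.
rows = [
    {"a": 0, "b": 0, "c": 0, "d": 0},
    {"a": 0, "b": 1, "c": 0, "d": 0},
    {"a": 1, "b": 0, "c": 1, "d": 0},
    {"a": 1, "b": 1, "c": 1, "d": 1},
]
selected = quick_reduct(rows, ["a", "b", "c"], "d")
```

Being greedy, the search returns a subset on which the measure can no longer be increased by adding a single attribute; it is not guaranteed to be minimal, and on tables where no single attribute helps (for example an XOR-like decision) it can stop prematurely.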

Hu et al. [26.66] formally defined the notions of 
reduct and decision reduct in the fuzzy-rough case, 
referring to the invariance of the fuzzy partition in- 
duced by the data, and of the fuzzy positive region, 
respectively. They also showed that minimal subsets 
that are invariant w.r.t. (conditional) entropy are (deci- 
sion) reducts. 

Tsang et al. [26.67] proposed a method based on 
the discernibility matrix and function to find all de- 
cision reducts where invariance of the fuzzy positive 
region defined using Dubois and Prade’s definition is 
imposed, and proved its correctness. In [26.68], an ex- 
tension of this method is defined that finds all decision 
reducts where the approximations are defined using
a lower semicontinuous t-norm $\mathcal{T}$ and its R-implicator.
The particular case using Łukasiewicz connectives was
studied in [26.69]. Later, Zhao and Tsang [26.31]
studied relationships that exist between different kinds of
decision reducts, defined using different types of fuzzy
connectives.

In [26.70], Jensen and Shen introduced three differ- 
ent quality measures for evaluating attribute subsets: the 
first one is a revised version of their previously defined 
degree of dependency, the second one is based on the 
fuzzy boundary region, and the third one on the satisfac- 
tion of the clauses of the fuzzy discernibility function. 
On the other hand, in [26.71], Cornelis et al. proposed 
the definition of fuzzy M-decision reducts, where M is 
an increasing, [0, 1]-valued quality measure. They stud- 
ied two measures based on the fuzzy positive region 
and two more based on the fuzzy discernibility func- 
tion, and applied them to classification and regression 
problems. 

In [26.33], Chen and Zhao studied the concept of lo- 
cal reduction: instead of looking for a global reduction, 
where the whole positive region is considered as an in- 
variant, they focus on subsets of decision classes and 
identify the conditional attributes that provide minimal 
descriptions for them. 

Over the past few years, there has also been consid- 
erable interest in the application of noise-tolerant fuzzy 
rough set models to feature selection, where the aim is 
to make the reduction more robust in the presence of 
noisy or erroneous data. For instance, Hu et al. [26.72] 
defined fuzzy rough sets as an extension of variable 
precision rough sets, and used a corresponding notion 
of positive region to guide a greedy search algorithm. 
In [26.73], Cornelis and Jensen evaluated the vaguely
quantified rough set (VQRS) approach to feature
selection. They found that because the model does not
satisfy monotonicity w.r.t. the fuzzy relation R, adding
more attributes does not always lead to an expansion of
the fuzzy positive region, so the hill-climbing search
can stall before reaching a reduct. Furthermore, in [26.74]
Hu et al., inspired by the idea of soft margin support
vector machines, introduced soft fuzzy rough sets and
applied them to feature selection.

He et al. [26.75] consider the problem of fuzzy-rough
feature selection for decision systems with fuzzy
decisions, that is, where the decision attribute is
characterized by a fuzzy T-similarity relation instead of a crisp
one. This is the case in regression problems. They give
an algorithm for finding all decision reducts and another
one for finding a single reduct.

The relatively high complexity of fuzzy-rough feature
selection algorithms somewhat limits their applicability
to large datasets. In view of this, Chen et al. [26.76]
propose a fast algorithm to obtain one reduct, based
on a procedure to find the minimal elements of the
discernibility matrix of [26.67]. The algorithm is
compared w.r.t. execution time with the proposals in [26.70]
and [26.67], and turns out to be considerably faster. On the other
hand, Qian et al. [26.77] implement an efficient version
of feature selection using the model of Hu et al. [26.72].

The use of kernel functions as fuzzy similarity re- 
lations in feature selection algorithms has also sparked 
researchers’ interest. In particular, Du et al. [26.78] ap- 
ply fuzzy-rough feature selection with kernelized fuzzy 
rough sets to yawn detection, while Chen et al. [26.79] 
propose parameterized attribute reduction with Gaus- 
sian kernel-based fuzzy rough sets. He and Wu [26.80] 
develop a new method to compute membership for 
fuzzy support vector machines (FSVMs) by using 
a Gaussian kernel-based fuzzy rough set, and em- 
ploy a technique of attribute reduction using Gaussian 
kernel-based fuzzy rough sets to perform feature selec- 
tion for FSVMs. 

Finally, Derrac et al. [26.81] combine fuzzy-rough 
feature selection with evolutionary instance selection. 


Instance Selection 

Instance selection can be seen as the orthogonal task 
to feature selection: here the goal is to reduce an in- 
formation system (U,A) by removing objects from U. 
The first work on instance selection using fuzzy rough 
set theory was presented in [26.82]. The main idea is 
that instances for which the fuzzy rough lower approx- 
imation membership is lower than a certain threshold 
are removed. This idea was improved in [26.83], where 
the selection threshold is optimized. This method has 
been applied in combination with evolutionary feature 
selection in [26.84] and for imbalanced classification 
problems in [26.85, 86], in combination with resam- 
pling methods. 


Classification 
Fuzzy rough sets have been widely used for classifi- 
cation purposes, either by means of rule induction or 
by plugging them into existing classifiers like nearest 
neighbor classifiers, decision trees, and support vector 
machines (SVM). 

The earliest work on rule induction using fuzzy 
rough set theory can be found in [26.25]. In this pa- 
per, the authors propose a fuzzy rough framework 
to induce fuzzy decision rules that does not use any 
fuzzy logical connectives. Later, in [26.30], an approach 
that generates rules from data using fuzzy reducts was 
presented, with a fuzzy rough feature selection
preprocessing step. In [26.87], the authors noticed that using
feature selection as a preprocessing step often leads to
overly specific rules, and proposed an algorithm for
simultaneous feature selection and rule induction. In [26.88,
89], a rule-based classifier is built using the so-called 
consistency degree as a critical value to keep the dis- 
cernibility information invariant in the rule-induction 
process. Another approach to fuzzy rough rule induc- 
tion can be found in [26.90], where rules are found from 
training data with hierarchical and quantitative attribute 
values. The most recent work can be found in [26.91], 
where fuzzy equivalence relations are used to model 
different types of attributes in order to obtain small rule 
sets from hybrid data, and in [26.92] where a harmony 
search algorithm is proposed to generate emerging rule 
sets. 

In [26.93], the K nearest neighbor method was 
improved using fuzzy set theory. So far, three differ- 
ent fuzzy-rough-based approaches have been used to 
improve this fuzzy nearest neighbor (FNN) classifier. 
In [26.94], the author introduces a fuzzy rough own- 
ership function and plugs it into the FNN algorithm. 
In [26.95–98], the extent to which the nearest neighbors
belong to the fuzzy lower and upper approximations of
a certain class is used to predict the class of the
target instance; these techniques are applied in [26.99] to
mammographic risk analysis. Finally, in [26.100], the
FNN algorithm is improved using the fuzzy rough
positive regions as weights for the nearest neighbors.

During the last decade, several authors have worked 
on fuzzy rough improvements of decision trees. The 
common idea of these methods is that during the 
construction phase of the decision tree, the feature 
significances are measured using fuzzy rough
techniques [26.101–104]. In [26.105–107], the kernel
functions of the SVM are redefined using fuzzy rough sets,
to take into account the inconsistency between condi- 
tional attributes and the decision class. In [26.80], this 
approach is combined with fuzzy rough feature selec- 
tion. In [26.108], SVMs are reformulated by plugging 
in the fuzzy rough memberships of all training samples 
into the constraints of the SVMs. 


Clustering 
Many authors have worked on clustering methods that 
use both fuzzy set theory and rough set theory, but to the 
best of our knowledge, only two approaches use fuzzy 
rough sets for clustering. In [26.109], fuzzy rough sets 
are used to measure the intracluster similarity, in order 
to estimate the optimal number of clusters. In [26.110], 
a fuzzy rough measure is used to measure the similarity 
between genes in microarray analysis, in order to gen- 
erate clusters such that genes within a cluster are highly 


correlated to the sample categories, while those in dif- 
ferent clusters are as dissimilar as possible. 


Neural Networks 

There are many approaches to incorporate fuzzy rough 
set theory in neural networks. One option is to use fuzzy 
rough set theory to reduce the problem that samples 
in the same input clusters can have different classes. 
The resulting fuzzy rough neural networks are designed 
such that they work as fuzzy rough membership
functions [26.111–114]. A related approach is to use fuzzy
rough set theory to find the importance of each subset
of information sources of subnetworks [26.115]. Other
approaches use fuzzy rough set theory to measure the
importance of each feature in the input layer of the
neural network [26.116–118].


26.4.2 Other Applications 


Image Processing 
Fuzzy rough sets have been used in several domains of 
image processing. They are especially suitable for these 
tasks because they can capture both indiscernibility and 
vagueness, which are two important aspects of image 
processing. 

In [26.119, 120], fuzzy-rough-based image seg- 
menting methods are proposed and applied in a tra- 
ditional Chinese medicine tongue image segmenta- 
tion experiment. Often, fuzzy rough attribute reduction 
methods are proposed for image processing problems, 
as in [26.121] or in [26.122], where the methods are 
applied for face recognition. In [26.123], a method 
for edge detection is proposed by building a hierar- 
chy of rough-fuzzy sets to exploit the uncertainty and 
vagueness at different image resolutions. Another
aspect of image processing is texture segmentation; this
problem is tackled in [26.124] using rough-fuzzy sets.
In [26.125], the authors solve the image classification 
problem using a nearest neighbor clustering algorithm 
based on fuzzy rough set theory, and apply their algo- 
rithm to hand gesture recognition. In [26.126], a com- 
bined approach of neural network classification systems 
with a fuzzy rough sets based feature reduction method 
is presented. In [26.127], fuzzy rough feature reduction 
techniques are applied to a large-scale Mars McMurdo 
panorama image. 


Decision Making 
Fuzzy rough set theory has many applications in
decision making. In [26.128], the authors calculate the
fuzzy rough memberships of software components in
previous projects and decide based on these values
which ones to reuse in a new program. In [26.129,
130], a multiobjective decision-making model based on
fuzzy rough set theory is used to solve the inventory
problem. In [26.131], variable precision fuzzy rough
sets are used to develop a decision-making model
that is applied to IT offshore outsourcing risk
evaluation. Another approach can be found in [26.132], where
the chosen decision is the one associated with the
instance that has the maximal sum of lower and upper
soft fuzzy rough approximations. Recent work can be
found in [26.133], where a fuzzy rough set model over
two universes is defined to develop a general
decision-making framework under uncertainty for
solving a medical diagnosis problem.


Information Retrieval, Data Mining, 

and the Web 
Fuzzy rough sets have been used to model impreci- 
sion and vagueness in databases. In [26.134], the au-

27. Artificial Neural Network Models


Peter Tino, Lubica Benuskova, Alessandro Sperduti 


We outline the main models and developments 
in the broad field of artificial neural networks 
(ANN). A brief introduction to biological neurons 
motivates the initial formal neuron model — the 
perceptron. We then study how such formal neu- 
rons can be generalized and connected in network 
structures. Starting with the biologically motivated 
layered structure of ANN (feed-forward ANN), the 
networks are then generalized to include feedback 
loops (recurrent ANN) and even more abstract gen- 
eralized forms of feedback connections (recursive 
neuronal networks) enabling processing of struc- 
tured data, such as sequences, trees, and graphs. 
We also introduce ANN models capable of form- 
ing topographic lower-dimensional maps of data 
(self-organizing maps). For each ANN type we out- 


The human brain is arguably one of the most excit- 
ing products of evolution on Earth. It is also the most 
powerful information processing tool so far. Learning 
based on examples and parallel signal processing lead 
to emergent macro-scale behavior of neural networks 
in the brain, which cannot be easily linked to the be- 
havior of individual micro-scale components (neurons). 
In this chapter, we will introduce artificial neural net- 
work (ANN) models motivated by the brain that can 
learn in the presence of a teacher. During the course of 
learning the teacher specifies what the right responses 
to input examples should be. In addition, we will also 
mention ANNs that can learn without a teacher, based 
on principles of self-organization. 

To set the context, we will begin by introducing ba- 
sic neurobiology. We will then describe the perceptron 
model, which, even though rather old and simple, is an 


27.1 Biological Neurons 


It is estimated that there are about 10!? neural cells 
(neurons) in the human brain. Two-thirds of the neurons 
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line the basic principles of training the corre- 
sponding ANN models on an appropriate data 
collection. 


important building block of more complex feed-forward 
ANN models. Such models can be used to approximate 
complex non-linear functions or to learn a variety of as- 
sociation tasks. The feed-forward models are capable of 
processing patterns without temporal association. In the 
presence of temporal dependencies, e.g., when learning 
to predict future elements of a time series (with certain 
prediction horizon), the feed-forward ANN needs to 
be extended with a memory mechanism to account for 
temporal structure in the data. This will naturally lead 
us to recurrent neural network models (RNN), which 
besides feed-forward connections also contain feedback 
loops to preserve, in the form of the information pro- 
cessing state, information about the past. RNN can be 
further extended to recursive ANNs (RecNN), which 
can process structured data such as trees and acyclic 
graphs. 


form a 4—6 mm thick cortex that is assumed to be the 
center of cognitive processes. Within each neuron com- 
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Dendrites 


plex biological processes take place, ensuring that it can 
process signals from other neurons, as well as send its 
own signals to them. The signals are of electro-chemical 
nature. In a simplified way, signals between the neurons 
can be represented by real numbers quantifying the in- 
tensity of the incoming or outgoing signals. The point 
of signal transmission from one neuron to the other is 
called the synapse. Within synapse the incoming sig- 
nal can be reinforced or damped. This is represented by 
the weight of the synapse. A single neuron can have up 
to 10°—10° such points of entry (synapses). The input 
to the neuron is organized along dendrites and the soma 
(Fig. 27.1). Thousands of dendrites form a rich tree-like 
structure on which most synapses reside. 

Signals from other neurons can be either excitatory 
(positive) or inhibitory (negative), relayed via exci- 
tatory or inhibitory synapses. When the sum of the 
positive and negative contributions (signals) from other 
neurons, weighted by the synaptic weights, becomes 
greater than a certain excitation threshold, the neuron 
will generate an electric spike that will be transmitted 
over the output channel called the axon. At the end of 


27.2 Perceptron 


The perceptron is a simple neuron model that takes input
signals (patterns) coded as (real) input vectors
$x = (x_1, x_2, \ldots, x_{n+1})$ through the associated (real) vector
of synaptic weights $w = (w_1, w_2, \ldots, w_{n+1})$. The output
$o$ is determined by

$$o = f(\mathit{net}) = f(w \cdot x) = f\left(\sum_{j=1}^{n+1} w_j x_j\right), \qquad (27.1)$$


where net denotes the weighted sum of inputs (i.e., the dot
product of the weight and input vectors) and f is the
activation function. By convention, if there are n inputs to


Fig. 27.1 Schematic illustration of the 
basic information processing struc- 
ture of the biological neuron 




axon, there are thousands of output branches whose ter- 
minals form synapses on other neurons in the network. 
Typically, as a result of input excitation, the neuron can 
generate a series of spikes of some average frequency,
about 1 to $10^2$ Hz. The frequency is proportional to the
overall stimulation of the neuron. 

The first principle of information coding and rep- 
resentation in the brain is redundancy. It means that 
each piece of information is processed by a redun- 
dant set of neurons, so that in the case of partial 
brain damage the information is not lost completely. As 
a result, and crucially — in contrast to conventional com- 
puter architectures, gradually increasing damage to the 
computing substrate (neurons plus their interconnec- 
tion structure) will only result in gradually decreasing 
processing capabilities (graceful degradation). Further- 
more, it is important what set of neurons participate 
in coding a particular piece of information (distributed 
representation). Each neuron can participate in cod- 
ing of many pieces of information, in conjunction with 
other neurons. The information is thus associated with 
patterns of distributed activity on sets of neurons. 


the perceptron, the input (n + 1) will be fixed to −1 and
the associated weight to $w_{n+1} = \theta$, which is the value
of the excitation threshold.


Fig. 27.2 Schematic illustration of the perceptron model

In 1958 Rosenblatt [27.1] introduced a discrete
perceptron model with a bipolar activation function
(Fig. 27.2)


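$$f(\mathit{net}) = \mathrm{sign}(\mathit{net}) = \begin{cases} +1 & \text{if } \mathit{net} \ge 0 \iff \sum_{j=1}^{n} w_j x_j \ge \theta, \\ -1 & \text{if } \mathit{net} < 0 \iff \sum_{j=1}^{n} w_j x_j < \theta. \end{cases} \qquad (27.2)$$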


The boundary equation 


$$\sum_{j=1}^{n} w_j x_j - \theta = 0, \qquad (27.3)$$


parameterizes a hyperplane in n-dimensional space with 
normal vector w. 

The perceptron can classify input patterns into two 
classes, if the classes can indeed be separated by an 
(n— 1)-dimensional hyperplane (27.3). In other words, 
the perceptron can deal with linearly-separable prob- 
lems only, such as logical functions AND or OR. XOR, 
on the other hand, is not linearly separable (Fig. 27.3). 
Rosenblatt showed that there is a simple training rule 
that will find the separating hyperplane, provided that 
the patterns are linearly separable. 

As we shall see, a general rule for training many 
ANN models (not only the perceptron) can be for- 
mulated as follows: the weight vector w is changed 
proportionally to the product of the input vector and 
a learning signal s. The learning signal s is a function 
of w, x, and possibly a teacher feedback d 


$$s = s(w, x, d) \quad \text{or} \quad s = s(w, x). \qquad (27.4)$$


In the former case, we talk about supervised learning 
(with direct guidance from a teacher); the latter case is 
known as unsupervised learning. The update of the j-th 
weight can be written as 


$$w_j(t+1) = w_j(t) + \Delta w_j(t) = w_j(t) + \alpha\, s(t)\, x_j(t). \qquad (27.5)$$

Fig. 27.3 Linearly separable and non-separable problems

The positive constant $0 < \alpha < 1$ is called the learning
rate.

In the case of the perceptron, the learning signal is
the disproportion (difference) between the desired (target)
and the actual (produced by the model) response,
$s = d - o = \delta$. The update rule is known as the $\delta$ (delta)
rule

$$\Delta w_j = \alpha (d - o)\, x_j. \qquad (27.6)$$

The same rule can, of course, be used to update the
activation threshold $w_{n+1} = \theta$.
Consider a training set 


$$A_{\mathrm{train}} = \{(x^1, d^1), (x^2, d^2), \ldots, (x^p, d^p), \ldots, (x^P, d^P)\}$$

consisting of P (input, target) pairs. The perceptron
training algorithm can be formally written as:

• Step 1: Set $\alpha \in (0, 1)$. Initialize the weights randomly
  from (−1, 1). Set the counters to k = 1, p = 1 (k indexes
  sweeps through $A_{\mathrm{train}}$, p indexes individual training
  patterns).
• Step 2: Consider input $x^p$, calculate the output
  $o^p = \mathrm{sign}\left(\sum_{j=1}^{n+1} w_j x_j^p\right)$.
• Step 3: Weight update: $w_j \leftarrow w_j + \alpha (d^p - o^p)\, x_j^p$, for
  j = 1, …, n + 1.
• Step 4: If p < P, set p ← p + 1 and go to step 2. Otherwise
  go to step 5.
• Step 5: Fix the weights and calculate the cumulative
  error E on $A_{\mathrm{train}}$.
• Step 6: If E = 0, finish training. Otherwise, set p = 1,
  k = k + 1 and go to step 2. A new training epoch
  starts.
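
The six steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the chapter: the names `train_perceptron` and `predict`, the default learning rate, the `max_epochs` safety cap, and the choice of counting misclassified patterns as the cumulative error E are all made up for the example.

```python
import random

def sign(v):
    # Bipolar activation (27.2): +1 for net >= 0, -1 otherwise.
    return 1 if v >= 0 else -1

def predict(w, x):
    # Augment the input with the fixed component x_{n+1} = -1, so that
    # the last weight plays the role of the threshold theta.
    xa = list(x) + [-1.0]
    return sign(sum(wj * xj for wj, xj in zip(w, xa)))

def train_perceptron(patterns, alpha=0.1, max_epochs=1000, seed=0):
    """patterns: list of (x, d), x a list of n floats, d in {-1, +1}."""
    rng = random.Random(seed)
    n = len(patterns[0][0])
    # Step 1: random initial weights from (-1, 1), threshold included.
    w = [rng.uniform(-1.0, 1.0) for _ in range(n + 1)]
    for epoch in range(1, max_epochs + 1):
        for x, d in patterns:
            xa = list(x) + [-1.0]
            o = predict(w, x)                      # Step 2
            for j in range(n + 1):                 # Step 3, rule (27.6)
                w[j] += alpha * (d - o) * xa[j]
        # Steps 5-6: with the weights fixed, count misclassified patterns.
        errors = sum(1 for x, d in patterns if predict(w, x) != d)
        if errors == 0:
            return w, epoch
    return w, max_epochs  # not separated within the cap (assumed fallback)

# Logical AND with bipolar targets is linearly separable.
and_data = [([0.0, 0.0], -1), ([0.0, 1.0], -1), ([1.0, 0.0], -1), ([1.0, 1.0], 1)]
weights, epochs_used = train_perceptron(and_data)
print(weights, epochs_used)
```

For a problem that is not linearly separable, such as XOR, the loop would simply run until the assumed `max_epochs` cap is reached.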
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27.3 Multilayered Feed-Forward ANN Models 


A breakthrough in our ability to construct and train 
more complex multilayered ANNs came in 1986, when 
Rumelhart et al. [27.2] introduced the error back-propagation
method. It is based on making the transfer
functions differentiable (hence the error functional to be 
minimized is differentiable as well) and finding a local 
minimum of the error functional by the gradient-based 
steepest descent method. 

We will show derivation of the back-propagation 
algorithm for two-layer feed-forward ANN as demon- 
strated, e.g., in [27.3]. Of course, the same principles 
can be applied to a feed-forward ANN architecture with 
any (finite) number of layers. In feed-forward ANNs 
neurons are organized in layers. There are no connec- 
tions among neurons within the same layer; connections 
only exist between successive layers. Each neuron from
layer l has connections to each neuron in layer l + 1.

As has already been mentioned, the activation functions
need to be differentiable and are usually of the
sigmoid shape. The most common activation functions
are

• Unipolar sigmoid:

$$f(\mathit{net}) = \frac{1}{1 + \exp(-\lambda\, \mathit{net})} \qquad (27.7)$$

• Bipolar sigmoid (hyperbolic tangent):

$$f(\mathit{net}) = \frac{2}{1 + \exp(-\lambda\, \mathit{net})} - 1 \qquad (27.8)$$

The constant $\lambda > 0$ determines the steepness of the sigmoid
curve and is commonly set to 1. In the limit
$\lambda \to \infty$ the bipolar sigmoid tends to the sign function
(used in the perceptron) and the unipolar sigmoid tends
to the step function.

Consider the single-layer ANN in Fig. 27.4. The
input and output vectors are $y = (y_1, \ldots, y_j, \ldots, y_J)$
and $o = (o_1, \ldots, o_k, \ldots, o_K)$, respectively, where
$o_k = f(\mathit{net}_k)$ and

$$\mathit{net}_k = \sum_{j=1}^{J} w_{kj}\, y_j. \qquad (27.9)$$

Set $y_J = -1$ and $w_{kJ} = \theta_k$, a threshold for the k = 1, …, K
output neurons. The desired output is
$d = (d_1, \ldots, d_k, \ldots, d_K)$.

After training, we would like, for all training patterns
p = 1, …, P from $A_{\mathrm{train}}$, the model output to
closely resemble the desired values (targets). The training
problem is transformed into an optimization one by
defining the error function

$$E_p = \frac{1}{2} \sum_{k=1}^{K} (d_{pk} - o_{pk})^2, \qquad (27.10)$$

where p is the training point index. $E_p$ is the sum of
squares of errors on the output neurons. During learning
we seek the weight setting that minimizes
$E_p$. This will be done using gradient-based steepest
descent on $E_p$,


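$$\Delta w_{kj} = -\alpha \frac{\partial E_p}{\partial w_{kj}} = -\alpha \frac{\partial E_p}{\partial(\mathit{net}_k)} \frac{\partial(\mathit{net}_k)}{\partial w_{kj}} = \alpha\, \delta_{ok}\, y_j, \qquad (27.11)$$

where $\alpha$ is a positive learning rate. Note that
$-\partial E_p / \partial(\mathit{net}_k) = \delta_{ok}$, which is the generalized training
signal on the k-th output neuron. The partial derivative
$\partial(\mathit{net}_k) / \partial w_{kj}$ is equal to $y_j$ (27.9). Furthermore,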


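$$\delta_{ok} = -\frac{\partial E_p}{\partial(\mathit{net}_k)} = -\frac{\partial E_p}{\partial o_k} \frac{\partial o_k}{\partial(\mathit{net}_k)} = (d_{pk} - o_{pk})\, f'_k, \qquad (27.12)$$

where $f'_k$ denotes the derivative of the activation function
with respect to $\mathit{net}_k$. For the unipolar sigmoid
(27.7) with $\lambda = 1$, we have $f'_k = o_k (1 - o_k)$. For the bipolar
sigmoid (27.8) with $\lambda = 1$, $f'_k = (1/2)(1 - o_k^2)$. The rule for
updating the j-th weight of the k-th output neuron reads as

$$\Delta w_{kj} = \alpha (d_{pk} - o_{pk})\, f'_k\, y_j, \qquad (27.13)$$

where $(d_{pk} - o_{pk})\, f'_k = \delta_{ok}$ is the generalized error signal
flowing back through all connections ending in the k-th
output neuron. Note that if we put $f'_k = 1$, we would
obtain the perceptron learning rule (27.6).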
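
The two derivative identities used in (27.12) can be checked numerically. The sketch below assumes $\lambda = 1$ and compares each closed form against a central finite difference; the helper names (`unipolar`, `bipolar`, `numeric_derivative`, `output_delta`) are illustrative only.

```python
import math

def unipolar(net, lam=1.0):
    # Unipolar sigmoid (27.7).
    return 1.0 / (1.0 + math.exp(-lam * net))

def bipolar(net, lam=1.0):
    # Bipolar sigmoid (27.8).
    return 2.0 / (1.0 + math.exp(-lam * net)) - 1.0

def numeric_derivative(f, net, h=1e-5):
    # Central finite difference, used only as an independent check.
    return (f(net + h) - f(net - h)) / (2.0 * h)

def output_delta(d, o):
    # delta_ok of (27.12) for a unipolar output unit with lambda = 1.
    return (d - o) * o * (1.0 - o)

for net in (-2.0, 0.0, 0.7):
    o_uni, o_bi = unipolar(net), bipolar(net)
    ok_uni = abs(numeric_derivative(unipolar, net) - o_uni * (1.0 - o_uni)) < 1e-8
    ok_bi = abs(numeric_derivative(bipolar, net) - 0.5 * (1.0 - o_bi ** 2)) < 1e-8
    print(net, ok_uni, ok_bi)
```

Expressing $f'_k$ through the output $o_k$ is what makes the rule cheap: no extra evaluation of the exponential is needed during the backward pass.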


Fig. 27.4 A single-layer ANN

We will now extend the network with another layer,
called the hidden layer (Fig. 27.5).

Input to the network is identical with the input
vector $x = (x_1, \ldots, x_i, \ldots, x_I)$ for the hidden layer.
The output neurons process as inputs the outputs
$y = (y_1, \ldots, y_j, \ldots, y_J)$, $y_j = f(\mathit{net}_j)$, from the hidden layer.
Hence,

$$\mathit{net}_j = \sum_{i=1}^{I} v_{ji}\, x_i. \qquad (27.14)$$

As before, the last (in this case the I-th) input is fixed to
−1. Recall that the same holds for the output of the J-th
hidden neuron. Activation thresholds for hidden neurons
are $v_{jI} = \theta_j$ for j = 1, …, J.

Equations (27.11)—(27.13) describe modification of 
weights from the hidden to the output layer. We will 
now show how to modify weights from the input to the 
hidden layer. We would still like to minimize E, (27.10) 
through the steepest descent. 

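The hidden weight $v_{ji}$ will be modified as follows

$$\Delta v_{ji} = -\alpha \frac{\partial E_p}{\partial v_{ji}} = -\alpha \frac{\partial E_p}{\partial(\mathit{net}_j)} \frac{\partial(\mathit{net}_j)}{\partial v_{ji}} = \alpha\, \delta_{yj}\, x_i. \qquad (27.15)$$

Again, $-\partial E_p / \partial(\mathit{net}_j) = \delta_{yj}$ is the generalized training
signal on the j-th hidden neuron that should flow on the
input weights. As before, $\partial(\mathit{net}_j) / \partial v_{ji} = x_i$ (27.14).
Furthermore,

$$\delta_{yj} = -\frac{\partial E_p}{\partial(\mathit{net}_j)} = -\frac{\partial E_p}{\partial y_j} \frac{\partial y_j}{\partial(\mathit{net}_j)} = -\frac{\partial E_p}{\partial y_j}\, f'_j, \qquad (27.16)$$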


Fig. 27.5 A two-layer feed-forward ANN 


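where $f'_j$ is the derivative of the activation function in
the hidden layer with respect to $\mathit{net}_j$,

$$\frac{\partial E_p}{\partial y_j} = -\sum_{k=1}^{K} (d_{pk} - o_{pk}) \frac{\partial f(\mathit{net}_k)}{\partial y_j} = -\sum_{k=1}^{K} (d_{pk} - o_{pk}) \frac{\partial f(\mathit{net}_k)}{\partial(\mathit{net}_k)} \frac{\partial(\mathit{net}_k)}{\partial y_j}. \qquad (27.17)$$

Since $f'_k$ is the derivative of the output neuron sigmoid
with respect to $\mathit{net}_k$ and $\partial(\mathit{net}_k) / \partial y_j = w_{kj}$ (27.9), we
have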


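$$\frac{\partial E_p}{\partial y_j} = -\sum_{k=1}^{K} (d_{pk} - o_{pk})\, f'_k\, w_{kj} = -\sum_{k=1}^{K} \delta_{ok}\, w_{kj}. \qquad (27.18)$$

Plugging this into (27.16) we obtain

$$\delta_{yj} = \left(\sum_{k=1}^{K} \delta_{ok}\, w_{kj}\right) f'_j. \qquad (27.19)$$

Finally, the weights from the input to the hidden layer
are modified as follows

$$\Delta v_{ji} = \alpha \left(\sum_{k=1}^{K} \delta_{ok}\, w_{kj}\right) f'_j\, x_i. \qquad (27.20)$$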


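Consider now the general case of m hidden layers. For
the n-th hidden layer we have

$$\Delta v^{n}_{ji} = \alpha\, \delta^{n}_{yj}\, x^{n-1}_{i}, \qquad (27.21)$$

where

$$\delta^{n}_{yj} = \left(\sum_{k=1}^{K} \delta^{n+1}_{k}\, w^{n+1}_{kj}\right) (f^{n}_{j})', \qquad (27.22)$$

and $(f^{n}_{j})'$ is the derivative of the activation function of
the n-th layer with respect to $\mathit{net}^{n}_{j}$.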

Often, the learning speed can be improved by using 
the so-called momentum term 


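$$\Delta w_{kj}(t) \leftarrow \Delta w_{kj}(t) + \mu\, \Delta w_{kj}(t-1),$$
$$\Delta v_{ji}(t) \leftarrow \Delta v_{ji}(t) + \mu\, \Delta v_{ji}(t-1), \qquad (27.23)$$

where $\mu \in (0, 1)$ is the momentum rate.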

Consider a training set 

$$A_{\mathrm{train}} = \{(x^1, d^1), (x^2, d^2), \ldots, (x^p, d^p), \ldots, (x^P, d^P)\}.$$


The back-propagation algorithm for training feed-forward
ANNs can be summarized as follows:

• Step 1: Set $\alpha \in (0, 1)$. Randomly initialize weights
  to small values, e.g., in the interval (−0.5, 0.5).
  Counters and the error are initialized as follows:
  k = 1, p = 1, E = 0. E denotes the accumulated error
  across training patterns

$$E = \sum_{p=1}^{P} E_p, \qquad (27.24)$$

  where $E_p$ is given in (27.10). Set a tolerance threshold
  $\varepsilon$ for the error. The threshold will be used to stop
  the training process.
• Step 2: Apply input $x^p$ and compute the corresponding
  $y^p$ and $o^p$.
• Step 3: For every output neuron, calculate $\delta_{ok}$
  (27.12); for every hidden neuron determine $\delta_{yj}$ (27.19).
• Step 4: Modify the weights $w_{kj} \leftarrow w_{kj} + \alpha\, \delta_{ok}\, y_j$ and
  $v_{ji} \leftarrow v_{ji} + \alpha\, \delta_{yj}\, x_i$.
• Step 5: If p < P, set p = p + 1 and go to step 2. Otherwise
  go to step 6.
• Step 6: Fixing the weights, calculate E. If $E < \varepsilon$,
  stop training; otherwise permute elements of $A_{\mathrm{train}}$,
  set E = 0, p = 1, k = k + 1, and go to step 2.
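
The algorithm above can be sketched for a two-layer network with unipolar sigmoid units ($\lambda = 1$) as follows. This is an illustrative sketch, not the authors' implementation: the function names, the nested-list weight layout, the default hyperparameters and the `max_epochs` cap are assumptions. Weights are stored as `V[j][i]` (input to hidden) and `W[k][j]` (hidden to output), and the fixed −1 components implement the thresholds.

```python
import math
import random

def f(net):
    # Unipolar sigmoid (27.7) with lambda = 1.
    return 1.0 / (1.0 + math.exp(-net))

def forward(V, W, x):
    # x already carries the fixed last input -1; the hidden vector y is
    # extended by the fixed component -1 feeding the output thresholds.
    y = [f(sum(vji * xi for vji, xi in zip(vj, x))) for vj in V] + [-1.0]
    o = [f(sum(wkj * yj for wkj, yj in zip(wk, y))) for wk in W]
    return y, o

def pattern_error(o, d):
    # E_p of (27.10).
    return 0.5 * sum((dk - ok) ** 2 for dk, ok in zip(d, o))

def deltas(V, W, x, d):
    y, o = forward(V, W, x)
    # Output training signals (27.12), using f' = o (1 - o).
    delta_o = [(dk - ok) * ok * (1.0 - ok) for dk, ok in zip(d, o)]
    # Hidden training signals (27.19), using f' = y (1 - y).
    delta_y = [sum(delta_o[k] * W[k][j] for k in range(len(W))) * y[j] * (1.0 - y[j])
               for j in range(len(V))]
    return y, o, delta_o, delta_y

def train(patterns, n_hidden=3, alpha=0.5, eps=0.01, max_epochs=5000, seed=0):
    rng = random.Random(seed)
    n_in = len(patterns[0][0]) + 1          # inputs plus the fixed -1
    n_out = len(patterns[0][1])
    V = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    data = [(list(x) + [-1.0], list(d)) for x, d in patterns]
    E = float("inf")
    for _ in range(max_epochs):
        for x, d in data:                                   # Steps 2-5
            y, o, delta_o, delta_y = deltas(V, W, x, d)
            for k in range(n_out):
                for j in range(n_hidden + 1):
                    W[k][j] += alpha * delta_o[k] * y[j]
            for j in range(n_hidden):
                for i in range(n_in):
                    V[j][i] += alpha * delta_y[j] * x[i]
        # Step 6: accumulated error (27.24) with the weights fixed.
        E = sum(pattern_error(forward(V, W, x)[1], d) for x, d in data)
        if E < eps:
            break
        rng.shuffle(data)
    return V, W, E
```

A typical call would be `train([([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]), ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])])` for the XOR problem. Because steepest descent only finds a local minimum, whether the tolerance is reached depends on the random initialization.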


Consider a feed-forward ANN with fixed weights
and a single output unit. It can be considered a real-valued
function G on I-dimensional vectorial inputs,

$$G(x) = f\left(\sum_{j=1}^{J} w_j\, f\left(\sum_{i=1}^{I} v_{ji}\, x_i\right)\right).$$

There has been a series of results showing that such
a parameterized function class is sufficiently rich in the
space of reasonable functions (see, e.g., [27.4]). For
example, for any smooth function F over a compact domain
and a precision threshold $\varepsilon$, for a sufficiently large
number J of hidden units there is a weight setting so
that G is not further away from F than $\varepsilon$ (in the $L_2$ norm).

When training a feed-forward ANN a key decision 
must be made about how complex the model should be. 
In other words, how many hidden units J one should 
use. If J is too small, the model will be too rigid (high 


27.4 Recurrent ANN Models 


Consider a situation where the associations in the training
set we would like to learn are of the following
(abstract) form: a → α, b → β, b → α, b → γ, c → α,
c → γ, d → α, etc., where the Latin and Greek letters
stand for input and output vectors, respectively. It is


bias) and it will not be able to sufficiently adapt to the 
data. However, under different samples from the same 
data generating process, the resulting trained models 
will vary relatively little (low variance). On the other 
hand, if J is too high, the model will be too complex,
modeling even such irrelevant features of the data as
output noise. The particular data will be interpolated
exactly (low bias), but the variability of fitted models 
under different training samples from the same process 
will be immense. It is, therefore, important to set J to an 
optimal value, reflecting the complexity of the data gen- 
erating process. This is usually achieved by splitting the 
data into three disjoint sets — training, validation, and 
test sets. Models with different numbers of hidden units 
are trained on the training set, their performance is then 
checked on a held-out validation set. The optimal num- 
ber of hidden units is selected based on the (smallest) 
validation error. Finally, the test set is used for inde- 
pendent comparison of selected models from different 
model classes. 

If the data set is not large enough, one can perform 
such a model selection using k-fold cross-validation. 
The data for model construction (this data would be 
considered training and validation sets in the scenario 
above) is split into k disjoint folds. One fold is selected 
as the validation fold, the other k— 1 will be used for 
training. This is repeated k times, yielding k estimates 
of the validation error. The validation error is then cal- 
culated as the mean of those k estimates. 
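
The k-fold procedure can be sketched generically. In the sketch below the callbacks `fit` and `error` are hypothetical placeholders for "train a model with a given number of hidden units" and "compute its error on held-out data"; the function names and the random fold assignment are assumptions for illustration.

```python
import random

def k_fold_splits(n_items, k, seed=0):
    # Shuffle the indices once and deal them into k disjoint folds.
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        trn = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield trn, val

def cross_validation_error(data, k, fit, error, seed=0):
    # Mean of the k validation-error estimates.
    estimates = []
    for trn_idx, val_idx in k_fold_splits(len(data), k, seed):
        model = fit([data[j] for j in trn_idx])
        estimates.append(error(model, [data[j] for j in val_idx]))
    return sum(estimates) / k

# Toy usage: the "model" is just the mean target of the training folds.
toy = [(float(i), 2.0 + 0.1 * (i % 3)) for i in range(12)]
mean_fit = lambda trn: sum(d for _, d in trn) / len(trn)
mse = lambda model, val: sum((d - model) ** 2 for _, d in val) / len(val)
print(cross_validation_error(toy, 4, mean_fit, mse))
```

To select the number of hidden units J, one would call `cross_validation_error` once per candidate J (with `fit` closing over that J) and keep the candidate with the smallest returned value.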

We have described data-based methods for model 
selection. Other alternatives are available. For exam- 
ple, by turning an ANN into a probabilistic model (e.g., 
by including an appropriate output noise model), un- 
der some prior assumptions on weights (e.g., a-priori 
small weights are preferred), one can perform Bayesian 
model selection (through model evidence) [27.5]. 

There are several seminal books on feed-forward 
ANNs with well-documented theoretical foundations 
and practical applications, e.g., [27.3,6,7]. We refer 
the interested reader to those books as good starting 
points as the breadth of theory and applications of feed- 
forward ANNs is truly immense. 


clear that now for one input item there can be different 
output associations, depending on the temporal con- 
text in which the training items are presented. In other 
words, the model output is determined not only by the 
input, but also by the history of presented items so far.
Obviously, the feed-forward ANN model described in
the previous section cannot be used in such cases and 
the model must be further extended so that the temporal 
context is properly represented. 

The architecturally simplest solution is provided 
by the so-called time delay neural network (TDNN) 
(Fig. 27.6). The input window into the past has a finite 
length D. If the output is an estimate of the next item 
of the input time series, such a network realizes a non- 
linear autoregressive model of order D. 
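
For next-item prediction, the TDNN training set can be built by sliding a window over the series. The sketch below assumes that the window holds the current item and the D items before it, ordered from most recent to oldest; the function name `delay_windows` is hypothetical.

```python
def delay_windows(series, D):
    """Return (window, target) pairs for next-item prediction.

    window = [s(t), s(t-1), ..., s(t-D)], target = s(t+1).
    The ordering inside the window is an assumption of this sketch.
    """
    pairs = []
    for t in range(D, len(series) - 1):
        window = [series[t - k] for k in range(D + 1)]
        pairs.append((window, series[t + 1]))
    return pairs

for window, target in delay_windows([0, 1, 2, 3, 4, 5], D=2):
    print(window, "->", target)
```

Each pair can then be fed to an ordinary feed-forward network, which is why training methods developed for feed-forward ANNs carry over to the TDNN.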

If we are lucky, even such a simple solution can be 
sufficient to capture the temporal structure hidden in the 
data. An advantage of the TDNN architecture is that 
some training methods developed for feed-forward net- 
works can be readily used. A disadvantage of TDNN 
networks is that fixing a finite order D may not be ade- 
quate for modeling the temporal structure of the data 
generating source. TDNN enables the feed-forward
ANN to see, besides the current input at time t, the other
inputs from the past up to time t − D. Of course, during
the training, it is now imperative to preserve the order of 
training items in the training set. TDNN has been suc- 
cessfully applied in many fields where spatial-temporal 
structures are naturally present, such as robotics, speech 
recognition, etc. [27.8, 9]. 

In order to extend the ANN architecture so that the 
variable (even unbounded) length of input window can 
be flexibly considered, we need a different way of cap- 
turing the temporal context. This is achieved through 
the so-called state space formulation. In this case, we 
will need to change our outlook on training. The new 
architectures of this type are known as recurrent neural 
networks (RNN). 

As in feed-forward ANNs, there are connections be- 
tween the successive layers. In addition, and in contrast 
to feed-forward ANNs, connections between neurons of 
the same layer are allowed, but subject to a time delay.

Fig. 27.6 TDNN of order D

It also may be possible to have connections from
a higher-level layer to a lower layer, again subject to 
a time delay. In many cases it is, however, more conve- 
nient to introduce an additional fictional context layer 
that contains delayed activations of neurons from the 
selected layer(s) and represent the resulting RNN archi- 
tecture as a feed-forward architecture with some fixed 
one-to-one delayed connections. As an example, con- 
sider the so-called simple recurrent network (SRN) of 
Elman [27.10] shown in Fig. 27.7. The output of SRN 
at time t is given by 


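$$o_k^{(t)} = f\left(\sum_{j=1}^{J} m_{kj}\, y_j^{(t)}\right),$$
$$y_j^{(t)} = f\left(\sum_{i=1}^{J} w_{ji}\, y_i^{(t-1)} + \sum_{i=1}^{I} v_{ji}\, x_i^{(t)}\right). \qquad (27.25)$$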


The hidden layer constitutes the state of the input- 
driven dynamical system whose role it is to represent 
the relevant (with respect to the output) information
about the input history seen so far. The state (as in a
generic state space model) is updated recursively.

Fig. 27.7 Schematic depiction of the SRN architecture

Fig. 27.8 Schematic depiction of Jordan's RNN architecture
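
The recursive state update of the SRN can be sketched as a short forward pass. The sketch assumes unipolar sigmoid units, recurrent weights `W[j][i]`, input weights `V[j][i]` and hidden-to-output weights `M[k][j]`; the function name `srn_run`, the explicit initial state `y0` and the omission of thresholds are simplifications made for illustration.

```python
import math

def f(net):
    # Unipolar sigmoid with lambda = 1.
    return 1.0 / (1.0 + math.exp(-net))

def srn_run(M, W, V, inputs, y0):
    """Run an SRN over a sequence; return all outputs and the final state."""
    y = list(y0)                     # context units at the sequence start
    outputs = []
    for x in inputs:
        # New state from the previous state (context) and current input.
        y = [f(sum(W[j][i] * y[i] for i in range(len(y))) +
               sum(V[j][i] * x[i] for i in range(len(x))))
             for j in range(len(W))]
        # Output is read from the current state only.
        outputs.append([f(sum(M[k][j] * y[j] for j in range(len(y))))
                        for k in range(len(M))])
    return outputs, y

# The same final input (0) yields different outputs for different histories.
M, W, V = [[1.0]], [[2.0]], [[1.0]]
out_a, _ = srn_run(M, W, V, [[1.0], [0.0]], [0.0])
out_b, _ = srn_run(M, W, V, [[0.0], [0.0]], [0.0])
print(out_a[-1], out_b[-1])
```

The two runs end with the same input item yet produce different outputs, which is exactly the dependence on temporal context that a feed-forward ANN cannot represent.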

Many variations on such architectures with time- 
delayed feedback loops exist. For example, Jor- 
dan [27.11] suggested to feed back the outputs as 
the relevant temporal context, or Bengio et al. [27.12] 
mixed the temporal context representations of SRN and 
the Jordan network into a single architecture. Schematic 
representations of these architectures are shown in 
Figs. 27.8 and 27.9. 

Training in such architectures is more complex 
than training of feed-forward ANNs. The principal 
problem is that changes in weights propagate in time 
and this needs to be explicitly represented in the up- 
date rules. We will briefly mention two approaches 
to training RNNs, namely back-propagation through 
time (BPTT) [27.13] and real-time recurrent learning 
(RTRL) [27.14].

Fig. 27.9 Schematic depiction of Bengio's RNN architecture

Fig. 27.10 A two-neuron SRN

We will demonstrate BPTT on a classification
task, where the label of the input sequence
is known only after T time steps (i.e., after T input 
items have been processed). The RNN is unfolded in 
time to form a feed-forward network with T hidden lay- 
ers. Figure 27.10 shows a simple two-neuron RNN and 
Fig. 27.11 represents its unfolded form for T = 2 time 
steps. 

The first input comes at time t = 1 and the last at 
t= T. Activities of context units are initialized at the 
beginning of each sequence to some fixed numbers. 
The unfolded network is then trained as a feed-forward 
network with T hidden layers. At the end of the se- 
quence, the model output is determined as 


J 


T T T 
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Having the model output enables us to compute the er- 
ror 
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2 
; (27.27) 


The hidden-to-output weights are modified according to 
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Fig. 27.11 Two-neuron SRN unfolded in time for T = 2 
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where 
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The other weight updates are calculated as follows 


(27.29) 
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etc. The final weight updates are the averages of the 
T partial weight update suggestions calculated on the 
unfolded network 
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For every new training sequence (of possibly different 
length T) the network is unfolded to the desired length 
and the weight update process is repeated. In some 
cases (e.g., continual prediction on time series), it is 
necessary to set the maximum unfolding length L that 
will be used in every update step. Of course, in such 
cases we can lose vital information from the past. This 
problem is eliminated in the RTRL methodology. 
Consider again the SRN architecture in Fig. 27.7.
In RTRL the weights are updated on-line, i.e., at every


time step t 
dE 
Aw? =-a ; 
4 ə (O) 
kj 
OE 
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The updates of hidden-to-output weights are straight- 
forward 


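$$\Delta m_{kj}^{(t)} = \alpha\, \delta_k^{(t)}\, y_j^{(t)} = \alpha \left(d_k^{(t)} - o_k^{(t)}\right) f'_k\!\left(\mathit{net}_k^{(t)}\right) y_j^{(t)}. \qquad (27.36)$$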
For the other weights we have

and $\delta^{\mathrm{kron}}_{jh}$ is the Kronecker delta ($\delta^{\mathrm{kron}}_{jh} = 1$ if j = h;
$\delta^{\mathrm{kron}}_{jh} = 0$ otherwise). The partial derivatives required
for the weight updates can be recursively updated using
(27.37)–(27.39). To initialize training, the partial
derivatives at t = 0 are usually set to 0.

There is a well-known problem associated with 
gradient-based parameter fitting in recurrent networks 
(and, in fact, in any parameterized state space models 
of similar form) [27.15]. In order to latch an important 
piece of past information for future use, the state- 
transition dynamics (27.25) should have an attractive 
set. 

However, in the neighborhood of such an attractive 
set, the derivatives of the dynamic state-transition map
vanish. Vanishingly small derivatives cannot be reliably
propagated back through time in order to form a useful
latching set. This is known as the information latching
problem. Several suggestions for dealing with the information
latching problem have been made, e.g., [27.16].
The most prominent include long short-term memory
(LSTM) RNNs [27.17] and reservoir computation models
[27.18].

LSTM models operate with a specially designed 
formal neuron model that contains so-called gate units. 
The gates determine when the input is significant (in 
terms of the task given) to be remembered, whether 
the neuron should continue to remember the value, and 
when the value should be output. The LSTM architec- 
ture is especially suitable for situations where there are 
long time intervals of unknown size between impor- 
tant events. LSTM models have been shown to provide 
superior results over traditional RNNs in a variety of 
applications (e.g., [27.19, 20]). 

Reservoir computation models try to avoid the 
information latching problem by fixing the state- 
transition part of the RNN. Only linear readout from 
the state activations (hidden recurrent layer) producing 
the output is fit to the data. The state space with the as- 


sociated dynamic state transition structure is called the 
reservoir. The main idea is that the reservoir should be 
sufficiently complex so as to capture a large number of 
potentially useful features of the input stream that can 
be then exploited by the simple readout. 

The reservoir computing models differ in how the 
fixed reservoir is constructed and what form the readout 
takes. For example, echo state networks (ESN) [27.21] 
have fixed RNN dynamics (27.25), but with a lin- 
ear hidden-to-output layer map. Liquid state machines 
(LSM) [27.22] also have (mostly) linear readout, but 
the reservoirs are realized through the dynamics of a set 
of coupled spiking neuron models. Fractal prediction 
machines (FPM) [27.23] are reservoir RNN models for 
processing discrete sequences. The reservoir dynamics 
is driven by an affine iterative function system and the 
readout is constructed as a collection of multinomial 
distributions. Reservoir models have been successfully 
applied in many practical applications with competitive 
results, e.g., [27.21, 24, 25].
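
The reservoir idea can be sketched with a small echo-state-style model: the recurrent and input weights are drawn once and never trained, and only a linear readout is fitted. Everything specific here is an assumption made for illustration: the tanh state units, the crude row-sum scaling of the reservoir matrix, the bias term, and fitting the readout by an iterative delta rule (in practice a closed-form least-squares fit is commonly used instead).

```python
import math
import random

def make_reservoir(n_state, n_in, scale=0.9, seed=0):
    rng = random.Random(seed)
    # Dividing by n_state keeps every absolute row sum below `scale` < 1,
    # a crude sufficient condition for a contractive state map.
    W = [[scale * rng.uniform(-1.0, 1.0) / n_state for _ in range(n_state)]
         for _ in range(n_state)]
    V = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_state)]
    return W, V

def run_reservoir(W, V, inputs):
    # Drive the fixed reservoir with the input stream and collect states.
    y = [0.0] * len(W)
    states = []
    for x in inputs:
        y = [math.tanh(sum(wji * yi for wji, yi in zip(W[j], y)) +
                       sum(vji * xi for vji, xi in zip(V[j], x)))
             for j in range(len(W))]
        states.append(y)
    return states

def fit_readout(states, targets, alpha=0.02, epochs=200):
    # Linear readout with a bias, trained by the delta rule; the
    # reservoir weights W and V are never modified.
    m = [0.0] * (len(states[0]) + 1)
    for _ in range(epochs):
        for y, d in zip(states, targets):
            ya = y + [1.0]
            o = sum(mj * yj for mj, yj in zip(m, ya))
            m = [mj + alpha * (d - o) * yj for mj, yj in zip(m, ya)]
    return m

def readout(m, y):
    return sum(mj * yj for mj, yj in zip(m, y + [1.0]))
```

A simple test task is to reproduce the previous input item: generate a random scalar sequence, collect the states with `run_reservoir`, and fit the readout to the sequence shifted by one step.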

Several books that are solely dedicated to RNNs 
have appeared, e.g., [27.26–28], and they contain
a much deeper elaboration on theory and practice of 
RNNs than we were able to provide here. 


27.5 Radial Basis Function ANN Models 


In this section we will introduce another implementation
of the idea of feed-forward ANN. The activations
of hidden neurons are again determined by the
closeness of inputs $x = (x_1, x_2, \ldots, x_n)$ to weights
$c = (c_1, c_2, \ldots, c_n)$. Whereas in the feed-forward ANN in
Sect. 27.3, the closeness is determined by the dot-product
of x and c, followed by the sigmoid activation
function, in radial basis function (RBF) networks the
closeness is determined by the squared Euclidean distance
of x and c, transferred through the inverse exponential.
The output of the j-th hidden unit with input
weight vector $c_j$ is given by


$$\varphi_j(x) = \exp\left(-\frac{\|x - c_j\|^2}{\sigma_j^2}\right), \qquad (27.40)$$


where $\sigma_j$ is the activation strength parameter of the j-th
hidden unit and determines the width of the spherical
(un-normalized) Gaussian. The output neurons are usually
linear (for regression tasks)

$$o_k(x) = \sum_{j=1}^{J} w_{kj}\, \varphi_j(x). \qquad (27.41)$$

The RBF network in this form can be simply viewed
as a form of kernel regression. The J functions $\varphi_j$
form a set of J linearly independent basis functions
(e.g., if all the centers $c_j$ are different) whose span
(the set of all their linear combinations) forms a linear
subspace of functions that are realizable by the
given RBF architecture (with given centers $c_j$ and kernel
widths $\sigma_j$).
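
The forward pass of such a network is short enough to sketch directly. The code below assumes a Gaussian hidden unit of the form exp(−‖x − c_j‖² / σ_j²); the exact scaling of the exponent is an assumption of this sketch, as are the function names `rbf_hidden` and `rbf_output`.

```python
import math

def rbf_hidden(x, centers, widths):
    # One Gaussian activation per hidden unit; the exponent uses the
    # squared Euclidean distance divided by the squared width (assumed form).
    return [math.exp(-sum((a - b) ** 2 for a, b in zip(x, c)) / s ** 2)
            for c, s in zip(centers, widths)]

def rbf_output(x, centers, widths, W):
    # Linear output units: weighted sums of the hidden activations.
    phi = rbf_hidden(x, centers, widths)
    return [sum(wkj * pj for wkj, pj in zip(wk, phi)) for wk in W]

centers = [[0.0, 0.0], [1.0, 0.0]]
widths = [1.0, 1.0]
print(rbf_hidden([0.0, 0.0], centers, widths))
print(rbf_output([0.0, 0.0], centers, widths, [[2.0, 0.0]]))
```

With the centers and widths held fixed, the output is linear in the weights `W`, which is why the output layer can be fitted by linear regression.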

For the training of RBF networks, it is important that
the basis functions $\varphi_j(x)$ cover the structure of the input
space faithfully. Given a set of training inputs $x^p$
from $A_{\mathrm{train}} = \{(x^1, d^1), (x^2, d^2), \ldots, (x^p, d^p), \ldots, (x^P, d^P)\}$,
many RBF-ANN training algorithms determine the centers
$c_j$ and widths $\sigma_j$ based on the inputs $\{x^1, x^2, \ldots, x^P\}$
only. One can employ different clustering algorithms,
e.g., k-means [27.29], which attempts to
position the centers among the training inputs so
that the overall sum of (Euclidean) distances between
the centers and the inputs they represent (i.e.,
the inputs falling in their respective Voronoi compartments,
the set of inputs for which the current
center is the closest among all the centers) is
minimized:

• Step 1: Set J, the number of hidden units. The optimum
  value of J can be obtained through a model
  selection method, e.g., cross-validation.
• Step 2: Randomly select J training inputs that will
  form the initial positions of the J centers $c_j$.
• Step 3: At time step t:
  a) Pick a training input x(t) and find the center c(t)
     closest to it.
  b) Shift the center c(t) towards x(t)

$$c(t) \leftarrow c(t) + \rho(t)\,(x(t) - c(t)), \quad \text{where } 0 < \rho(t) < 1. \qquad (27.42)$$

The learning rate $\rho(t)$ usually decreases in time
towards zero. The training is stopped once the centers
settle in their positions and move only slightly
(some norm of weight updates is below a certain
threshold). Since k-means is guaranteed to find only locally
optimal solutions, it is worth re-initializing the
centers and re-running the algorithm several times,
keeping the solution with the lowest quantization
error.
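
The on-line k-means steps above can be sketched as follows. The 1/(1 + t) learning-rate schedule, the fixed number of steps used instead of a movement threshold, and the function names are assumptions of this sketch.

```python
import math
import random

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def online_kmeans(inputs, J, steps=2000, rho0=0.5, seed=0):
    rng = random.Random(seed)
    # Step 2: J randomly selected training inputs are the initial centers.
    centers = [list(c) for c in rng.sample(inputs, J)]
    for t in range(steps):
        x = rng.choice(inputs)                          # Step 3a
        j = min(range(J), key=lambda m: sq_dist(x, centers[m]))
        rho = rho0 / (1.0 + t)                          # decays towards zero
        # Step 3b, update (27.42): shift the winning center towards x.
        centers[j] = [c + rho * (a - c) for a, c in zip(x, centers[j])]
    return centers

def quantization_error(inputs, centers):
    # Sum of Euclidean distances from each input to its closest center.
    return sum(math.sqrt(min(sq_dist(x, c) for c in centers)) for x in inputs)

def best_of_restarts(inputs, J, restarts=5):
    # Re-run from different initializations and keep the best solution.
    runs = [online_kmeans(inputs, J, seed=s) for s in range(restarts)]
    return min(runs, key=lambda cs: quantization_error(inputs, cs))
```

Because every update is a convex combination of a center and an input, the centers never leave the region spanned by the training inputs.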
Once the centers are in their positions, it is easy to 
determine the RBF widths, and once this is done, the 


27.6 Self-Organizing Maps 


In this section we will introduce ANN models that 
learn without any signal from a teacher, i.e., learning 
is based solely on training inputs — there are no out- 
put targets. The ANN architecture designed to operate 
in this setting was introduced by Kohonen under the 
name self-organizing map (SOM) [27.32]. This model 
is motivated by organization of neuron sensitivities in 
the brain cortex. 

In Fig. 27.12a we show a schematic illustration of
one of the principal organizations of biological neural 
networks. In the bottom layer (grid) there are recep- 
tors representing the inputs. Every element of the inputs 
(each receptor) has forward connections to all neurons 
in the upper layer representing the cortex. The neurons 
are organized spatially on a grid. Outputs of the neurons 
represent activation of the SOM network. The neurons, 
besides receiving connections from the input recep- 
tors, have a lateral interconnection structure among 
themselves, with connections that can be excitatory, or 
inhibitory. In Fig. 27.12b we show a formal SOM archi- 
tecture — neurons spatially organized on a grid receive 
inputs (elements of input vectors) through connections 
with synaptic weights. 


output weights can be solved using methods of linear 
regression. 

Of course, it is preferable to position the centers
with respect to both the inputs and target outputs in the 
training set. This can be formulated, e.g., as a gradient 
descent optimization. Furthermore, covering of the in- 
put space with spherical Gaussian kernels may not be 
optimal, and algorithms have been developed for learn- 
ing of general covariance structures. A comprehensive 
review of RBF networks can be found, e.g., in [27.30]. 

Recently, it was shown that if enough hidden 
units are used, their centers can be set randomly 
at very little cost, and determination of the only 
remaining free parameters — output weights — can 
be done cheaply and in a closed form through lin- 
ear regression. Such architectures, known as extreme 
learning machines [27.31] have shown surprisingly 
high performance levels. The idea of extreme learn- 
ing machines can be considered as being analo- 
gous to the idea of reservoir computation, but in 
the static setting. Of course, extreme learning ma- 
chines can be built using other implementations of 
feed-forward ANNs, such as the sigmoid networks of 
Sect. 27.3. 


A particular feature of the SOM is that it can map 
the training set on the neuron grid in a manner that 
preserves the training set's topology: two input patterns
close in the input space will most strongly activate neurons
that are close on the SOM grid. Such topological mapping
of inputs (feature mapping) has been observed in
biological neural networks [27.32] (e.g., visual maps, 
orientation maps of visual contrasts, or auditory maps, 
frequency maps of acoustic stimuli). 

Teuvo Kohonen presented one possible realization 
of the Hebb rule that is used to train SOM.

Fig. 27.12a,b Schematic representation of the SOM ANN architectures

Input weights of the neurons are initialized as small random
numbers. Consider a training set of inputs,
$A_{\mathrm{train}} = \{x^p\}_{p=1}^{P}$, and linear neurons

$$o_i = \sum_{j=1}^{m} w_{ij}\, x_j = w_i^{\top} x, \qquad (27.43)$$


where m is the input dimension andi = 1,...,n. Train- 
ing inputs are presented in random order. At each train- 
ing step, we find the (winner) neuron with the weight 
vector most similar to the current input x. The mea- 
sure of similarity can be based on the dot product, i. e., 
the index of the winner neuron is i* = arg max(w/ xX), 
or the (Euclidean) distance i* = arg min; ||x— W;||. Af- 
ter identifying the winner the learning continues by 
adapting the winner’s weights along with the weights 
all its neighbors on the neuron grid. This will ensure 
that nearby neurons on the grid will eventually repre- 
sent similar inputs in the input space. This is moderated 
by a neighborhood function h(i* , i) that, given a winner 
neuron index i*, quantifies how many other neurons on 
the grid should be adapted 


$$w_i(t+1) = w_i(t) + \alpha(t)\, h(i^*, i)\, \bigl(x(t) - w_i(t)\bigr) . \qquad (27.44)$$


The learning rate $\alpha(t) \in (0, 1)$ decays in time as 1/t, or
$\exp(-kt)$, where k is a positive time scale constant. This
ensures convergence of the training process. The sim- 
plest form of the neighborhood function operates with 
rectangular neighborhoods, 


$$h(i^*, i) = \begin{cases} 1, & \text{if } d_M(i^*, i) \le \lambda(t) , \\ 0, & \text{otherwise,} \end{cases} \qquad (27.45)$$


where $d_M(i^*, i)$ represents the (Manhattan) distance
between neurons $i^*$ and i on the map grid. The
neighborhood size $2\lambda(t)$ should decrease in time, e.g.,
through an exponential decay as $\exp(-qt)$, with time
scale q > 0. Another often used neighborhood function
is the Gaussian kernel 


$$h(i^*, i) = \exp\left( - \frac{d_E^2(i^*, i)}{\lambda^2(t)} \right) ,$$


where $d_E(i^*, i)$ is the Euclidean distance between $i^*$ and
i on the grid, i.e., $d_E(i^*, i) = \|r_{i^*} - r_i\|$, where $r_i$ is the
co-ordinate vector of the i-th neuron on the SOM grid.


Training of SOM networks can be summarized as fol- 
lows: 


• Step 1: Set $\alpha_0$, $\lambda_0$ and $t_{\max}$ (maximum number
  of iterations). Randomly (e.g., with uniform distribution)
  generate the synaptic weights (e.g., from (-0.5, 0.5)).
  Initialize the counters: t = 0, p = 1; t indexes time
  steps (iterations) and p is the input pattern index.

• Step 2: Take input $x^p$ and find the corresponding
  winner neuron.

• Step 3: Update the weights of the winner and its
  topological neighbors on the grid (as determined by
  the neighborhood function). Increment t.

• Step 4: Update $\alpha$ and $\lambda$.

• Step 5: If p < P, set p ← p + 1 and go to step 2 (we
  can also use randomized selection), otherwise go to
  step 6.

• Step 6: If t = $t_{\max}$, finish the training process.
  Otherwise set p = 1 and go to step 2. A new training
  epoch begins.
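
The training steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the chapter's reference implementation: the function names `train_som` and `quantization_error`, the default parameter values, the Gaussian neighborhood with width $\lambda(t)$, and the use of the same exponential decay constant `k` for both $\alpha$ and $\lambda$ are assumptions made for the example.

```python
import numpy as np

def train_som(X, grid_shape=(5, 5), t_max=2000, alpha0=0.5, lam0=3.0,
              k=2e-3, seed=0):
    """Online SOM training with a Gaussian neighborhood (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    # Step 1: small random weights from (-0.5, 0.5), one row per neuron.
    W = rng.uniform(-0.5, 0.5, size=(rows * cols, X.shape[1]))
    # Grid co-ordinates r_i of the neurons, shape (n_neurons, 2).
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)],
                      dtype=float)
    t = 0
    while t < t_max:
        # One epoch: present the patterns in random order.
        for p in rng.permutation(len(X)):
            x = X[p]
            # Step 2: winner = neuron whose weight vector is closest to x.
            winner = np.argmin(np.linalg.norm(W - x, axis=1))
            # Step 4 (computed from t): decayed learning rate and width.
            alpha = alpha0 * np.exp(-k * t)
            lam = max(lam0 * np.exp(-k * t), 1e-3)
            # Gaussian neighborhood based on squared grid distance.
            d2 = np.sum((coords - coords[winner]) ** 2, axis=1)
            h = np.exp(-d2 / lam ** 2)
            # Step 3: move the winner and its neighbors towards x, cf. (27.44).
            W += alpha * h[:, None] * (x - W)
            t += 1
            if t >= t_max:   # Step 6: stop after t_max updates.
                break
    return W, coords

def quantization_error(X, W):
    """Mean distance from each input to its closest weight vector."""
    d = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)
    return d.min(axis=1).mean()
```

Calling `train_som` on two-dimensional data and plotting the rows of `W` connected according to `coords` shows the grid unfolding over the data; `quantization_error` gives a rough check that the codebook vectors have moved towards the inputs.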


The SOM network can be used as a tool for non- 
linear data visualization (grid dimensions 1, 2, or 3). 
In general, SOM implements constrained vector quanti- 
zation, where the codebook vectors (vector quantization 
centers) cannot move freely in the data space dur- 
ing adaptation, but are constrained to lie on a lower 
dimensional manifold W in the data space. The dimen- 
sionality of W is equal to the dimensionality of the 
neural grid. The neural grid can be viewed as a dis- 
cretized version of the local co-ordinate system Y (e.g., 
computer screen) and the weight vectors in the data 
space (connected by the neighborhood structure on the 
neuron grid) as its image in the data space. In this in- 
terpretation, the neuron positions on the grid represent 
co-ordinate functions (in the sense of differential ge- 
ometry) mapping elements of the manifold W to the 
coordinate system Y. Hence, the SOM algorithm can 
also be viewed as one particular implementation of 
manifold learning. 

There have been numerous successful applications
of SOM in a wide variety of domains, e.g., in image
processing, computer vision, robotics, bioinformatics, 
process analysis, and telecommunications. A good sur- 
vey of SOM applications can be found, e.g., in [27.33]. 
SOMs have also been extended to temporal domains, 
mostly by the introduction of additional feedback
connections, e.g., [27.34-37]. Such models can be used for
topographic mapping or constrained clustering of time 
series data. 
27.7 Recursive Neural Networks


In many application domains, data are naturally or- 
ganized in structured form, where each data item is 
composed of several components related to each other 
in a non-trivial way, and the specific nature of the task to 
be performed is strictly related not only to the informa- 
tion stored at each component, but also to the structure 
connecting the components. Examples of structured 
data are parse trees obtained by parsing sentences in 
natural language, and the molecular graph describing 
a chemical compound. 

Recursive neural networks (RecNN) [27.38, 39] are 
neural network models that are able to directly pro- 
cess structured data, such as trees and graphs. For the 
sake of presentation, here we focus on positional trees. 
Positional trees are trees for which each child has an as- 
sociated index, its position, with respect to the siblings. 
Let us understand how RecNN is able to process a tree 
by analogy with what happens when unfolding an RNN
when processing a sequence, which can be understood 
as a special case of tree where each node v possesses 
a single child. 

Fig. 27.13a,b Generation of the encoding network (a) for a sequence and (b) a tree. Initial states are represented by
(Panel labels: Encoding network: time unfolding; Frontier states; Encoding network)

Fig. 27.15a,b Schematic illustration of a principal organization of
biological self-organizing neural network (a) and its formal
counterpart SOM ANN architecture (b)

Fig. 27.14 The causality style of computation induced by
the use of recursive networks is made explicit by using
nested boxes to represent the recursive dependencies of the
hidden variable associated to the root of the tree

In Fig. 27.13 (top) we show the unfolding in time
of a sequence when considering a graphical model
(recursive network) representing, for a generic node v, the
functional dependencies among the input information
$x_v$, the state variable (hidden node) $y_v$, and the output
variable $o_v$. The operator $q^{-1}$ represents the shift
operator in time (unit time delay), i.e., $q^{-1} y_t = y_{t-1}$, which
applied to node v in our framework returns the child of
node v. At the bottom of Fig. 27.13 we have reported
the unfolding of a binary tree, where the recursive
network uses a generalization of the shift operator, which
given an index i and a variable associated to a vertex
v returns the variable associated to the i-th child of v,
i.e., $q_i^{-1} y_v = y_{ch_i[v]}$. So, while in RNN the network is
unfolded in time, in RecNN the network is unfolded on
the structure. The result of unfolding, in both cases, is
the encoding network. The encoding network for the
sequence specifies how the components implementing the
different parts of the recurrent network (e.g., each node
of the recurrent network could be instantiated by a layer
of neurons or by a full feed-forward neural network
with hidden units) need to be interconnected. In the case
of the tree, the encoding network has the same
semantics: this time, however, a set of parameters (weights)
for each child should be considered, leading to a
network that, given a node v, can be described by the
equations


o0 =f 5 myy” f 
where d is the maximum number of children an input
node can have, and the weights $w^s$ are indexed on the s-th
child. Note that it is not difficult to generalize all the
learning algorithms devised for RNN to these extended
equations.

It should be remarked that recursive networks 
clearly introduce a causal style of computation, i.e.,
the computation of the hidden and output variables for 
a vertex v only depends on the information attached to v 
and the hidden variables of the children of v. This de- 
pendence is satisfied recursively by all v’s descendants 
and is clearly shown in Fig. 27.14. In the figure, nested 
boxes are used to make explicit the recursive depen- 
dencies among hidden variables that contribute to the 
determination of the hidden variable y, associated to the 
root of the tree. 
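
The causal, bottom-up computation described above can be illustrated with a short recursive function. This is a minimal sketch under stated assumptions rather than the chapter's exact equations: the tanh state transition, the hypothetical parameter names `V`, `W_child`, `U`, the zero frontier state, and the `(x_v, children)` tuple used to represent a positional tree are all choices made for the example.

```python
import numpy as np

def init_params(n_in, n_hidden, n_out, d, seed=0):
    """Random parameters: one state-to-state matrix per child position."""
    rng = np.random.default_rng(seed)
    V = rng.normal(scale=0.5, size=(n_hidden, n_in))          # input -> state
    W_child = rng.normal(scale=0.5, size=(d, n_hidden, n_hidden))
    U = rng.normal(scale=0.5, size=(n_out, n_hidden))         # state -> output
    y_frontier = np.zeros(n_hidden)   # state used in place of a missing child
    return V, W_child, U, y_frontier

def encode_tree(node, params, d):
    """Return (y_v, o_v) for the root of the subtree `node`.

    `node` is a tuple (x_v, children); `children` has length d and holds
    child nodes or None. The state of v depends only on x_v and on the
    states of its children, which is the causal style of computation.
    """
    x_v, children = node
    V, W_child, U, y_frontier = params
    pre = V @ x_v
    for s in range(d):
        child = children[s]
        y_child = y_frontier if child is None else encode_tree(child, params, d)[0]
        pre = pre + W_child[s] @ y_child   # weights indexed by child position s
    y_v = np.tanh(pre)
    o_v = np.tanh(U @ y_v)
    return y_v, o_v
```

With d = 1 the recursion reduces to unfolding an ordinary recurrent network along a sequence. Because each child position has its own weight matrix, swapping two different subtrees generally changes the state computed at the root.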

Although an encoding network can be generated 
for a directed acyclic graph (DAG), this style of com- 
putation limits the discriminative ability of RecNN 
to the class of trees. In fact, the hidden state is not 
able to encode information about the parents of nodes. 
The introduction of contextual processing, however, al- 
lows us to discriminate, with some specific exceptions, 
among DAGs [27.40]. Recently, Micheli [27.41] also 
showed how contextual processing can be used to ex- 
tend RecNN to the treatment of cyclic graphs. 

The same idea described above for supervised neu- 
ral networks can be adapted to unsupervised mod- 
els, where the output value of a neuron typically 
represents the similarity of the weight vector associ- 
ated to the neuron with the input vector. Specifically, 
in [27.37] SOMs were extended to the processing of 
structured data (SOM-SD). Moreover, a general frame- 
work for self-organized processing of structured data 
was proposed in [27.42]. The key concepts introduced
are: 


i) The explicit definition of a representation space R
   equipped with a similarity measure $d_R(\cdot,\cdot)$ to
   evaluate the similarity between two hidden states.

ii) The introduction of a general representation
   function, denoted rep(·), which transforms the activation
   of the map for a given input into a hidden state
   representation.


In these models, each node v of the input structure
is represented by a tuple $[x_v, r_{v_1}, \ldots, r_{v_d}]$, where
$x_v$ is a real-valued vectorial encoding of the
information attached to vertex v, and $r_{v_i}$ are real-valued
vectorial representations of hidden states returned by
the rep(·) function when processing the activation
of the map for the i-th neighbor of v. Each neuron
$n_j$ in the map is associated to a weight vector
$[w_j, c_j^1, \ldots, c_j^d]$. The computation of the winner
neuron is based on the joint contribution of the similarity
measures $d_x(\cdot,\cdot)$ for the input information, and $d_R(\cdot,\cdot)$
for the hidden states, i.e., the internal representations.
Some parts of a SOM-SD map trained on DAGs
representing visual patterns are shown in Fig. 27.15.
Even in this case the style of computation is causal,
ruling out the treatment of undirected and/or cyclic
graphs. In order to cope with general graphs, recently
a new model, named GraphSOM [27.43], was proposed.


27.8 Conclusion


The field of artificial neural networks (ANN) has
grown enormously in the past 60 years. There are
many journals and international conferences specifically
devoted to neural computation and neural network
related models and learning machines. The field
has gone a long way from its beginning in the form
of simple threshold units existing in isolation (e.g.,
the perceptron, Sect. 27.2) or connected in circuits.
Since then we have learnt how to generalize such
networks as parameterized differentiable models of various
sorts that can be fit to data (trained), usually
by transforming the learning task into an optimization
one.

ANN models have found numerous successful
practical applications in many diverse areas of science
and engineering, such as astronomy, biology,
finance, geology, etc. In fact, even though basic
feed-forward ANN architectures were introduced a long time
ago, they continue to surprise us with successful
applications, most recently in the form of deep networks
[27.44]. For example, a form of deep ANN
recently achieved the best performance on a well-known
benchmark problem: the recognition of handwritten
digits [27.45]. This is quite remarkable, since
such a simple ANN architecture trained in a purely
data-driven fashion was able to outperform the current
state-of-the-art techniques, formulated in more
sophisticated frameworks and possibly incorporating domain
knowledge.

ANN models have been formulated to operate in 
supervised (e.g., feed-forward ANN, Sect. 27.3; RBF 
networks, Sect. 27.5), unsupervised (e.g., SOM models, 
Sect. 27.6), semi-supervised, and reinforcement learn- 
ing scenarios and have been generalized to process 
inputs that are much more general than simple vector 
data of fixed dimensionality (e.g., the recurrent and re- 
cursive networks discussed in Sects. 27.4 and 27.7). Of 
course, we were not able to cover all important de- 
velopments in the field of ANNs. We can only hope 
that we have sufficiently motivated the interested reader 
with the variety of modeling possibilities based on the 
idea of interconnected networks of formal neurons, 
so that he/she will further consult some of the many 
(much more comprehensive) monographs on the topic, 
e.g., [27.3, 6, 7]. 

We believe that ANN models will continue to 
play an important role in modern computational in- 
telligence. Especially the inclusion of ANN-like mod- 
els in the field of probabilistic modeling can provide 
techniques that incorporate both explanatory model- 
based and data-driven approaches, while preserving 
a much fuller modeling capability through operat- 
ing with full distributions, instead of simple point 
estimates. 
28. Deep and Modular Neural Networks


Ke Chen 


In this chapter, we focus on two important areas
in neural computation, i.e., deep and modular
neural networks, given the fact that both deep
and modular neural networks are among the most
powerful machine learning and pattern recognition
techniques for complex AI problem solving.
We begin by providing a general overview of deep
and modular neural networks to describe the general
motivation behind such neural architectures
and fundamental requirements imposed by complex
AI problems. Next, we describe background
and motivation, methodologies, major building
blocks, and the state-of-the-art hybrid learning
strategy in context of deep neural architectures.
Then, we describe background and motivation,
taxonomy, and learning algorithms pertaining to
various typical modular neural networks in a wide
context. Furthermore, we also examine relevant
issues and discuss open problems in deep and
modular neural network research areas.


28.1 Overview


The human brain is a generic, effective, and efficient
system that solves complex and difficult problems and
generates the trait of intelligence and creation. Neural
computation has been inspired by brain-related research
in different disciplines, e.g., biology and neuroscience,
on various levels ranging from a simple single-neuron
to complex neuronal structure and organization [28.1].
Among many discoveries in brain-related sciences, two
of the most important properties are modularity and
hierarchy of neuronal organization in the human brain.
Neuroscientific research has revealed that the central
nervous system (CNS) in the human brain is
a distributed, massively parallel, and self-organizing
modular system [28.1-3]. The CNS is composed of several
regions such as the spinal cord, medulla oblongata,
pons, midbrain, diencephalon, cerebellum, and the two
cerebral hemispheres. Each such region forms a functional
module and all regions are interconnected with
other parts of the brain [28.1]. In particular, the cerebral
cortex consists of several regions attributed to main per- 
ceptual and cognitive tasks, where modularity emerges 
in two different aspects: i.e., structural and functional 
modularity. Structural modularity is observable from 
the fact that there are sparse connections between dif- 
ferent neuronal groups but neurons are often densely 
connected within a neuronal group, while functional 
modularity is evident from different response patterns 
produced by neural modules for different perceptual 
and cognitive tasks. Modularity evidence in the human 
brain strongly suggests that domain-specific modules 
are required by specific tasks and different modules 
can cooperate for high level, complex tasks, which pri- 
marily motivates the modular neural network (MNN) 
development in neural computation (NC) [28.4, 5]. 
Apart from modularity, the human brain also ex- 
hibits a functional and structural hierarchy given the fact 
that information processing in the human brain is done
in a hierarchical way. Previous studies [28.6,7] sug- 
gested that there are different cortical visual areas that 
lead to hierarchical information representations to carry 
out highly complicated visual tasks, e.g., object recog- 
nition. In general, hierarchical information processing 
enables the human brain to accomplish complex percep- 
tual and cognitive tasks in an effective and extremely 
efficient way, which mainly inspires the study of deep 
neural networks (DNNs) of multiple layers in NC. 

In general, both DNNs and MNNs can be categorized
into biologically plausible [28.8] and artificial
models [28.9] in NC. The main difference between
biologically plausible and artificial models lies in their
methodologies: a biologically plausible model often
takes both structural and functional resemblance
to its biological counterpart into account, while an
artificial model simply works towards modeling the
functionality of a biological system without considering
those bio-mimetic factors. Due to the limited space,
in this chapter we merely focus on artificial DNNs
and MNNs. Readers interested in biologically plausible
models are referred to the literature, e.g., [28.4], for
useful information.


28.2 Deep Neural Networks


In this section, we overview main deep neural network
(DNN) techniques with an emphasis on the latest
progress. We first review background and motivation
for DNN development. Then we describe major building
blocks and relevant learning algorithms for constructing
different DNNs. Next, we present a hybrid
learning strategy in the context of NC. Finally, we
examine relevant issues related to DNNs.


28.2.1 Background and Motivation


The study of NC dates back to the 1940s when
McCulloch and Pitts modeled a neuron mathematically.
After that NC was an active area in AI studies until
Minsky and Papert published their influential book,
Perceptrons [28.10], in 1969. In the book, they formally
proved the limited capacities of the single-layer perceptron
and further concluded that there is a slim chance
to expand its capacities with its multi-layer version,
which significantly slowed down NC research until the
back-propagation (BP) algorithm was invented (or
reinvented) to solve the learning problem in a multi-layer
perceptron (MLP) [28.11].

In theory, the BP algorithm enables one to train an
MLP of many hidden layers to form a powerful DNN.
Such an attractive technique has aroused tremendous
enthusiasm in applying DNNs in different fields [28.9].
Apart from a few exceptions, e.g., [28.12], researchers
soon found that an MLP of more than two hidden layers
often failed [28.13] due to the well-known fact
that MLP learning involves an extremely difficult
non-convex optimization problem, and the gradient-based
local search used in the BP algorithm easily gets stuck
in an unwanted local minimum. As a result, most
researchers gradually gave up deep architectures and
devoted their attention to shallow learning architectures 
of theoretical justification, e.g., the formal but non- 
constructive proof that an MLP of single hidden layer 
may be a universal function approximator [28.14] and a
support vector machine (SVM) [28.15], instead. It has 
been shown that shallow architectures often work well 
with support of effective feature extraction techniques 
(but these are often handcrafted). However, recent
theoretical justification suggests that learning models of
insufficient depth have a fundamental weakness as they 
cannot efficiently represent the very complicated func- 
tions often required in complex AI tasks [28.16, 17]. 

To solve the complex non-convex optimization
problem encountered in DNN learning, Hinton and his 
colleagues made a breakthrough by coming up with 
a hybrid learning strategy in 2006 [28.18]. The novel 
learning strategy combines unsupervised and super- 
vised learning paradigms where a layer-wise greedy 
unsupervised learning is first used to construct an ini- 
tial DNN with chosen building blocks (such an initial 
DNN alone can also be used for different purposes, 
e.g., unsupervised feature learning [28.19]), and super- 
vised learning is then fulfilled based on the pre-trained 
DNN. Their seminal work led to an emerging machine 
learning (ML) area, deep learning. As a result, differ- 
ent building blocks and learning algorithms have been 
developed to construct various DNNs. Both theoretical 
justification and empirical evidence suggest that the hy- 
brid learning strategy [28.18] greatly facilitates learning 
of DNNs [28.17]. 

Since 2006, DNNs trained with the hybrid learning
strategy have been successfully applied in different
and complex AI tasks, such as pattern recognition
[28.20-23], various computer vision tasks [28.24-26],
audio classification and speech information
processing [28.27-31], information retrieval [28.32-34],
natural language processing [28.35-37], and
robotics [28.38]. Thus, DNNs have become one of the
most promising ML and NC techniques to tackle
challenging AI problems [28.39].


28.2.2 Building Blocks 
and Learning Algorithms 


In general, a building block is composed of two para- 
metric models, encoder and decoder, as illustrated in 
Fig. 28.1. An encoder transforms a raw input or a low- 
level representation x into a high-level and abstract 
representation h(x), while a decoder generates an output
$\hat{x}$, a reconstructed version of x, from h(x).
Learning a building block is a self-supervised learning
task that minimizes an elaborate reconstruction cost 
function to find appropriate parameters in encoder and 
decoder. Thus, the distinction between two building 
blocks of different types lies in their encoder and de- 
coder mechanisms and reconstruction cost functions (as 
well as optimization algorithms used for parameter es- 
timation). Below we describe different building blocks 
and their learning algorithms in terms of the generic ar- 
chitecture shown in Fig. 28.1. 


Auto-Encoders 
The auto-encoder [28.40] and its variants are sim- 
ple building blocks used to build an MLP of many 
layers. An auto-encoder is implemented by an MLP of one
hidden layer. As depicted in Fig. 28.2, the input and the
hidden layers constitute an encoder to generate an
M-dimensional representation $h(x) = (h_1(x), \ldots, h_M(x))^T$
(hereinafter, we use the notation $h(x) = (h_m(x))_{m=1}^{M}$ to
indicate a vector-element relationship for simplifying
the presentation) for a given input $x = (x_n)_{n=1}^{N}$ in
N-dimensional space

$$h(x) = f(Wx + b_h) ,$$

where W is a connection weight matrix between the
input and the hidden layers, $b_h$ is the bias vector for
all hidden neurons, and f(·) is a transfer function, e.g.,
the sigmoid function [28.9]. Let $f(u) = (f(u_k))_{k=1}^{K}$ be
a collective notation for output of all K neurons in
a layer. Accordingly, the hidden and the output layers
form a decoder that yields a reconstructed version
$\hat{x} = (\hat{x}_n)_{n=1}^{N}$,

$$\hat{x} = f(W^T h(x) + b_o) ,$$

where $W^T$ is the transpose of the weight matrix W and
$b_o$ is the bias vector for all output neurons. Note that the
auto-encoder can be viewed as a special case of
auto-associator when the same weights are tied to be used
in connections between different layers, which will be
clearly seen in the learning algorithm later on. Doing
so avoids an unwanted solution when an over-complete
representation, i.e., M > N, is required [28.22].

Fig. 28.1 Schematic diagram of a generic building block
architecture

Further studies [28.41] suggest that the auto- 
encoder is unlikely to lead to the discovery of a more 
useful representation than the input despite the fact that 
a representation should encode much of the informa- 
tion conveyed in the input whenever the auto-encoder 
produces a good reconstruction of its input. As a re- 
sult, a variant named the denoising auto-encoder (DAE) 
was proposed to capture stable structures underlying 
the distribution of its observed input. The basic idea is 
as follows: instead of learning the auto-encoder from 
the intact input, the DAE will be trained to recover the 
original input from a partially destroyed (distorted)
version of it [28.41]. As is illustrated in Fig. 28.3, the DAE
leads to a more useful representation $h(\tilde{x})$ by restoring
the corrupted input $\tilde{x}$ to a reconstructed version $\hat{x}$ as
close to the clean input x as possible. Thus, the encoder
yields a representation as

$$h(\tilde{x}) = f(W\tilde{x} + b_h) ,$$

and the decoder produces a restored version $\hat{x}$ via the
representation $h(\tilde{x})$

$$\hat{x} = f(W^T h(\tilde{x}) + b_o) .$$


To produce a corrupted input, we need to distort a clean
input by corrupting it with appropriate noise. Depending
on the attribute nature of input, there are three
kinds of noise used in the corruption process: i.e.,
the isotropic Gaussian noise, $N(0, \sigma^2 I)$, masking noise
(by setting some randomly chosen elements of x to
zero) and salt-and-pepper noise (by flipping some
randomly chosen elements’ values of x to the maximum
or the minimum of a given range). Normally, Gaussian
noise is used for input of real or continuous
values, while masking and salt-and-pepper noise
are applied to input of discrete values, e.g., pixel
intensities of gray images. It is worth stating that the
variance $\sigma^2$ in Gaussian noise and the number of
randomly chosen elements in masking and salt-and-pepper
noise are hyper-parameters that affect DAE learning.
By corrupting a clean input with the chosen noise, we
obtain an example, $(\tilde{x}, x)$, for self-supervised
learning.
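
The three corruption processes can be sketched as a single helper. This is an illustrative sketch only: the function name `corrupt`, the `fraction` parameter (share of elements to corrupt), and the default range `[lo, hi]` are assumptions introduced here, not names from the chapter.

```python
import numpy as np

def corrupt(x, kind, rng, sigma=0.1, fraction=0.25, lo=0.0, hi=1.0):
    """Return a corrupted copy of x; x itself is left unchanged."""
    x = np.asarray(x, dtype=float)
    if kind == "gaussian":
        # Isotropic Gaussian noise N(0, sigma^2 I) added to every element.
        return x + rng.normal(0.0, sigma, size=x.shape)
    # Masking and salt-and-pepper act on a random subset of elements.
    n_corrupt = int(round(fraction * x.size))
    idx = rng.choice(x.size, size=n_corrupt, replace=False)
    x_tilde = x.copy().ravel()
    if kind == "masking":
        x_tilde[idx] = 0.0                       # chosen elements set to zero
    elif kind == "salt_and_pepper":
        # Chosen elements set to the minimum or maximum of the range.
        x_tilde[idx] = rng.choice([lo, hi], size=n_corrupt)
    else:
        raise ValueError(f"unknown noise kind: {kind}")
    return x_tilde.reshape(x.shape)
```

A training pair for the DAE is then `(corrupt(x, kind, rng), x)`; `sigma` and `fraction` play the role of the hyper-parameters mentioned above.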

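Given a training set of T examples $\{(x_t, x_t)\}_{t=1}^{T}$
(auto-encoder) or $\{(\tilde{x}_t, x_t)\}_{t=1}^{T}$ (DAE), two
reconstruction cost functions are commonly used for learning
auto-encoders as follows

$$L(W, b_h, b_o) = \frac{1}{2T} \sum_{t=1}^{T} \sum_{n=1}^{N} (x_{tn} - \hat{x}_{tn})^2 , \qquad (28.1a)$$

$$L(W, b_h, b_o) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{n=1}^{N} \bigl( x_{tn} \log \hat{x}_{tn} + (1 - x_{tn}) \log (1 - \hat{x}_{tn}) \bigr) . \qquad (28.1b)$$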


The cost function in (28.1a) is used for input of real 
or discrete values, while the cost function in (28.1b) is 
employed especially for input of binary values. 
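
As a small numerical illustration, the two reconstruction costs can be written as follows. This is a sketch assuming the squared-error cost is normalized by 2T and the cross-entropy cost by T, with rows of `X` and `X_hat` holding the T examples; the function names and the clipping constant `eps` (added only to avoid log(0)) are assumptions of the example.

```python
import numpy as np

def squared_error_cost(X, X_hat):
    """Squared-error reconstruction cost, cf. (28.1a); X, X_hat are (T, N)."""
    T = X.shape[0]
    return np.sum((X - X_hat) ** 2) / (2.0 * T)

def cross_entropy_cost(X, X_hat, eps=1e-12):
    """Cross-entropy reconstruction cost for binary inputs, cf. (28.1b)."""
    T = X.shape[0]
    X_hat = np.clip(X_hat, eps, 1.0 - eps)   # guard against log(0)
    return -np.sum(X * np.log(X_hat) + (1.0 - X) * np.log(1.0 - X_hat)) / T
```

For two binary examples of dimension two reconstructed as 0.5 everywhere, the squared-error cost is 4 · 0.25 / (2 · 2) = 0.25 and the cross-entropy cost is 2 log 2.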


Fig. 28.3 Denoising auto-encoder architecture 


To minimize reconstruction functions in (28.1), 
application of the stochastic gradient descent algo- 
rithm [28.12] leads to a generic learning algorithm for 
training the auto-encoder and its variant, summarized as 
follows: 


Auto-Encoder Learning Algorithm. Given a training
set of T examples, $\{(z_t, x_t)\}_{t=1}^{T}$, where $z_t = x_t$ for
the auto-encoder or $z_t = \tilde{x}_t$ for the DAE, and a transfer
function, f(·), randomly initialize all parameters, W, $b_h$
and $b_o$, in auto-encoders and pre-set a learning rate $\epsilon$.
Furthermore, the training set is randomly divided into
several batches of $T_B$ examples, $\{(z_t, x_t)\}_{t=1}^{T_B}$, and then
parameters are updated based on each batch:


• Forward computation
  For the input $z_t$ ($t = 1, \ldots, T_B$), output of the hidden
  layer is

  $$h(z_t) = f(u_h(z_t)) , \qquad u_h(z_t) = W z_t + b_h .$$

  And output of the output layer is

  $$\hat{x}_t = f(u_o(z_t)) , \qquad u_o(z_t) = W^T h(z_t) + b_o .$$

• Backward gradient computation
  For the cost function in (28.1a),

  $$\frac{\partial L(W, b_h, b_o)}{\partial u_o(z_t)} = \Bigl( (\hat{x}_{tn} - x_{tn})\, f'(u_{o,n}(z_t)) \Bigr)_{n=1}^{N} ,$$

  where f'(·) is the first-order derivative function of
  f(·). For the cost function in (28.1b) (with the sigmoid
  transfer function),

  $$\frac{\partial L(W, b_h, b_o)}{\partial u_o(z_t)} = \hat{x}_t - x_t .$$

  Then, the gradient with respect to $h(z_t)$ is

  $$\frac{\partial L(W, b_h, b_o)}{\partial h(z_t)} = W \frac{\partial L(W, b_h, b_o)}{\partial u_o(z_t)} .$$

  Applying the chain rule yields the gradient with
  respect to $u_h(z_t)$ as

  $$\frac{\partial L(W, b_h, b_o)}{\partial u_h(z_t)} = \Bigl( f'(u_{h,m}(z_t))\, \frac{\partial L(W, b_h, b_o)}{\partial h_m(z_t)} \Bigr)_{m=1}^{M} .$$

  Gradients with respect to biases are

  $$\frac{\partial L(W, b_h, b_o)}{\partial b_o} = \frac{\partial L(W, b_h, b_o)}{\partial u_o(z_t)} ,$$

  and

  $$\frac{\partial L(W, b_h, b_o)}{\partial b_h} = \frac{\partial L(W, b_h, b_o)}{\partial u_h(z_t)} .$$

• Parameter update
  Applying the gradient descent method and tied
  weights leads to update rules

  $$W \leftarrow W - \frac{\epsilon}{T_B} \sum_{t=1}^{T_B} \left( \frac{\partial L(W, b_h, b_o)}{\partial u_h(z_t)}\, z_t^T + h(z_t) \left( \frac{\partial L(W, b_h, b_o)}{\partial u_o(z_t)} \right)^T \right) ,$$

  $$b_o \leftarrow b_o - \frac{\epsilon}{T_B} \sum_{t=1}^{T_B} \frac{\partial L(W, b_h, b_o)}{\partial u_o(z_t)} ,$$

  and

  $$b_h \leftarrow b_h - \frac{\epsilon}{T_B} \sum_{t=1}^{T_B} \frac{\partial L(W, b_h, b_o)}{\partial u_h(z_t)} .$$


The above three steps repeat for all batches, which
leads to a training epoch. The learning algorithm runs
iteratively until a termination condition is met (typically
based on a cross-validation procedure [28.12]).
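
The three steps of the learning algorithm can be sketched in NumPy as follows. The sketch assumes a sigmoid transfer function and the cross-entropy cost (28.1b), for which the output-layer gradient is simply the reconstruction minus the target; the function names, the fixed number of epochs used instead of a validation-based stopping rule, and the default hyper-parameters are assumptions made for illustration.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def train_autoencoder(Z, X, M, epochs=200, batch_size=10, lr=0.5, seed=0):
    """Tied-weight auto-encoder trained by mini-batch gradient descent.

    Z holds the inputs (clean for the auto-encoder, corrupted for the DAE),
    X the reconstruction targets; both have shape (T, N). W has shape (M, N).
    """
    rng = np.random.default_rng(seed)
    T, N = X.shape
    W = rng.normal(scale=0.1, size=(M, N))
    b_h = np.zeros(M)
    b_o = np.zeros(N)
    for _ in range(epochs):
        order = rng.permutation(T)               # random split into batches
        for start in range(0, T, batch_size):
            idx = order[start:start + batch_size]
            Zb, Xb = Z[idx], X[idx]
            # Forward computation (examples are stored as rows).
            H = sigmoid(Zb @ W.T + b_h)          # h(z_t), shape (B, M)
            X_hat = sigmoid(H @ W + b_o)         # reconstruction, shape (B, N)
            # Backward gradient computation.
            d_o = X_hat - Xb                     # dL/du_o for cross-entropy
            d_h = (d_o @ W.T) * H * (1.0 - H)    # dL/du_h via the chain rule
            # Parameter update with tied weights (batch average).
            B = len(idx)
            W -= lr * (d_h.T @ Zb + H.T @ d_o) / B
            b_o -= lr * d_o.mean(axis=0)
            b_h -= lr * d_h.mean(axis=0)
    return W, b_h, b_o

def reconstruct(Z, W, b_h, b_o):
    """Encode and decode a batch of inputs with the trained parameters."""
    return sigmoid(sigmoid(Z @ W.T + b_h) @ W + b_o)
```

Passing corrupted inputs as `Z` and the clean data as `X` turns the same routine into DAE training; passing `Z = X` gives the plain auto-encoder.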


The Restricted Boltzmann Machine 
Strictly speaking, the restricted Boltzmann machine 
(RBM) [28.42] is an energy-based generative model, 
a simplified version of the generic Boltzmann machine. 
As illustrated in Fig. 28.4, an RBM can be viewed 
as a probabilistic NN of two layers, i.e., visible and 
hidden layers, with bi-directional connections. Unlike 
the Boltzmann machine, there are no lateral connec- 
tions among neurons in the same layer in an RBM. 
With the bottom-up connections from the visible to the
hidden layer, RBM forms an encoder that yields a
probabilistic representation $h = (h_m)_{m=1}^{M}$ for input data
$v = (v_n)_{n=1}^{N}$,

$$P(h|v) = \prod_{m=1}^{M} P(h_m|v) ,$$
$$P(h_m|v) = \phi\left( \sum_{n=1}^{N} W_{mn} v_n + b_{h,m} \right) , \qquad (28.2)$$

where $W_{mn}$ is the connection weight between the
visible neuron n and the hidden neuron m, and $b_{h,m}$
is the bias of the hidden neuron m. $\phi(u) = \frac{1}{1 + e^{-u}}$
is the sigmoid transfer function. As $h_m$ is assumed
to take a binary value, i.e., $h_m \in \{0, 1\}$, $P(h_m|v)$ is
interpreted as the probability of $h_m = 1$. Accordingly,
RBM performs a probabilistic decoder via the
top-down connections from the hidden to the visible
layer to reconstruct an input with the probability

$$P(v|h) = \prod_{n=1}^{N} P(v_n|h) ,$$
$$P(v_n|h) = \phi\left( \sum_{m=1}^{M} W_{nm} h_m + b_{v,n} \right) , \qquad (28.3)$$

where $W_{nm}$ is the connection weight between the hidden
neuron m and the visible neuron n, and $b_{v,n}$ is
the bias of visible neuron n. Like connection weights
in auto-encoders, bi-directional connection weights are
tied, i.e., $W_{mn} = W_{nm}$, as shown in Fig. 28.4. By
learning a parametric model of the data distribution
P(v) derived from the joint probability P(v, h) for
a given data set, RBM yields a probabilistic
representation that tends to reconstruct any data subject
to P(v).

Fig. 28.4 Restricted Boltzmann machine (RBM) architecture
(labels: hidden layer, visible layer, encoder)

The joint probability P(v, h) is defined based on the
following energy function for $v_n \in \{0, 1\}$

$$E(v, h) = - \sum_{m=1}^{M} \sum_{n=1}^{N} W_{mn} h_m v_n - \sum_{m=1}^{M} h_m b_{h,m} - \sum_{n=1}^{N} v_n b_{v,n} . \qquad (28.4)$$

As a result, the joint probability is subject to the
Boltzmann distribution

$$P(v, h) = \frac{e^{-E(v, h)}}{\sum_{v} \sum_{h} e^{-E(v, h)}} . \qquad (28.5)$$

Thus, we achieve the data probability by marginalizing
the joint probability as follows

$$P(v) = \sum_{h} P(v, h) = \sum_{h} P(v|h) P(h) .$$

In order to achieve the most likely reconstruction, we
need to maximize the log-likelihood of P(v). Therefore,
the reconstruction cost function of an RBM is its
negative log-likelihood function

$$L(W, b_h, b_v) = - \log P(v) = - \log \sum_{h} P(v|h) P(h) . \qquad (28.6)$$


From (28.5) and (28.6), it is observed that the direct 
use of a gradient descent method for optimal parame- 
ters often leads to intractable computation due to the 
fact that the exponential number of possible hidden- 
layer configurations needs to be summed over in (28.5) 
and then used in (28.6). Fortunately, an approximation 
algorithm named contrastive divergence (CD) has been 
proposed to solve this problem [28.42]. The key ideas 
behind the CD algorithm are (i) using Gibbs sampling 
based on the conditional distributions in (28.2) and 
(28.3), and (ii) running only a few iterations of Gibbs 
sampling by treating the data x input to an RBM as the 
initial state, i.e., $v^0 = x$, of the Markov chain at the
visible layer. Many studies have suggested that only the
use of one iteration of the Markov chain in the CD
algorithm works well for building up a deep belief network
(DBN) in practice [28.17, 18, 22], and hence the
algorithm is dubbed CD-1 in this situation, a special case
of the CD-k algorithm that executes k iterations of the 
Markov chain in the Gibbs sampling process. 
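
To make the intractability argument concrete, the following minimal sketch evaluates (28.4) to (28.6) by brute-force enumeration for a toy RBM. The function names and parameter values are illustrative assumptions, not part of the chapter; the point is only that the loops visit every one of the 2^(M+N) configurations, which is feasible for M = 2 and N = 3 but not for realistic layer sizes.

```python
import itertools
import math


def energy(v, h, W, b_h, b_v):
    """Energy of a joint configuration, following (28.4); W is M x N."""
    pair = sum(W[m][n] * h[m] * v[n] for m in range(len(h)) for n in range(len(v)))
    return -pair - sum(hm * b for hm, b in zip(h, b_h)) - sum(vn * b for vn, b in zip(v, b_v))


def marginal(v, W, b_h, b_v):
    """P(v) by explicit enumeration; the cost grows as 2**(M + N)."""
    M, N = len(b_h), len(b_v)
    hidden = list(itertools.product([0, 1], repeat=M))
    visible = list(itertools.product([0, 1], repeat=N))
    # Partition function: sum over every visible and hidden configuration.
    Z = sum(math.exp(-energy(vv, hh, W, b_h, b_v)) for vv in visible for hh in hidden)
    # Marginalize the hidden layer for the requested visible vector.
    return sum(math.exp(-energy(v, hh, W, b_h, b_v)) for hh in hidden) / Z


# Toy parameters (assumed values): M = 2 hidden units, N = 3 visible units.
W = [[0.5, -0.2, 0.1], [0.3, 0.4, -0.6]]
b_h, b_v = [0.1, -0.1], [0.0, 0.2, -0.3]
probs = {v: marginal(v, W, b_h, b_v) for v in itertools.product([0, 1], repeat=3)}
nll = -math.log(probs[(1, 0, 1)])   # cost (28.6) for a single instance
```

The dictionary `probs` holds P(v) for all eight visible vectors and sums to one, which is a quick sanity check on the normalization in (28.5).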


Fig. 28.5 Gibbs sampling process in the CD-1 algorithm (k = 0: data; k = 1: reconstruction)


Figure 28.5 illustrates a Gibbs sampling process
used in the CD-1 algorithm as follows:

i) Estimating probabilities $P(h_m^0|\mathbf{v}^0)$, for $m = 1, \dots, M$,
with the encoder defined in (28.2) and then
forming a realization of $\mathbf{h}^0$ by sampling with these
probabilities.

ii) Applying the decoder defined in (28.3) to estimate
probabilities $P(v_n^1|\mathbf{h}^0)$, for $n = 1, \dots, N$, and then
producing a reconstruction $\mathbf{v}^1$ via sampling.

iii) With the reconstruction, estimating probabilities
$P(h_m^1|\mathbf{v}^1)$, for $m = 1, \dots, M$, with the encoder.

With the above Gibbs sampling procedure, the
CD-1 algorithm is summarized as follows:


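Algorithm 28.1 RBM CD-1 Learning Algorithm
Given a training set of T instances, $\{\mathbf{x}_t\}_{t=1}^{T}$, randomly
initialize all parameters, $W$, $\mathbf{b}_h$ and $\mathbf{b}_v$, in an RBM and
pre-set a learning rate $\epsilon$:

• Positive phase
  - Present an instance to the visible layer, i.e.,
    $\mathbf{v}^0 = \mathbf{x}_t$.
  - Estimate probabilities with the encoder:
    $\hat{P}(\mathbf{h}^0|\mathbf{v}^0) = \big(P(h_m^0|\mathbf{v}^0)\big)_{m=1}^{M}$ by using (28.2).
• Negative phase
  - Form a realization of $\mathbf{h}^0$ by sampling with
    probabilities $\hat{P}(\mathbf{h}^0|\mathbf{v}^0)$.
  - With the realization of $\mathbf{h}^0$, apply the decoder to
    estimate probabilities:
    $\hat{P}(\mathbf{v}^1|\mathbf{h}^0) = \big(P(v_n^1|\mathbf{h}^0)\big)_{n=1}^{N}$ by using (28.3),
    and then produce a reconstruction $\mathbf{v}^1$ via sampling
    based on $\hat{P}(\mathbf{v}^1|\mathbf{h}^0)$.
  - With the encoder and the reconstruction, estimate
    probabilities: $\hat{P}(\mathbf{h}^1|\mathbf{v}^1) = \big(P(h_m^1|\mathbf{v}^1)\big)_{m=1}^{M}$
    by using (28.2).
• Parameter update
  Based on Gibbs sampling results in the positive and
  the negative phases, parameters are updated as follows:

  $$W \leftarrow W + \epsilon \left( \hat{P}(\mathbf{h}^0|\mathbf{v}^0)\,(\mathbf{v}^0)^T - \hat{P}(\mathbf{h}^1|\mathbf{v}^1)\,(\mathbf{v}^1)^T \right) ,$$
  $$\mathbf{b}_h \leftarrow \mathbf{b}_h + \epsilon \left( \hat{P}(\mathbf{h}^0|\mathbf{v}^0) - \hat{P}(\mathbf{h}^1|\mathbf{v}^1) \right) ,$$
  and
  $$\mathbf{b}_v \leftarrow \mathbf{b}_v + \epsilon \left( \mathbf{v}^0 - \mathbf{v}^1 \right) .$$

The above three steps repeat for all instances in the
given training set, which leads to a training epoch. The
learning algorithm runs iteratively until it converges.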
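
The three phases of Algorithm 28.1 can be sketched in a few lines of plain Python. Equations (28.2) and (28.3) are not reproduced in this part of the chapter, so the sketch assumes the usual sigmoid conditionals for a binary RBM with an M x N weight matrix; the helper names, the toy data, and the fixed number of epochs are likewise illustrative assumptions rather than the author's implementation.

```python
import math
import random


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def encode(W, b_h, v):
    """Assumed form of (28.2): P(h_m = 1 | v) for every hidden unit m."""
    return [sigmoid(sum(w * x for w, x in zip(row, v)) + b) for row, b in zip(W, b_h)]


def decode(W, b_v, h):
    """Assumed form of (28.3): P(v_n = 1 | h), using the transposed weights."""
    return [sigmoid(sum(W[m][n] * h[m] for m in range(len(h))) + b_v[n])
            for n in range(len(b_v))]


def sample(probs, rng):
    """Draw a binary vector from independent Bernoulli probabilities."""
    return [1.0 if rng.random() < p else 0.0 for p in probs]


def cd1_step(W, b_h, b_v, x, lr, rng):
    """One CD-1 update for a single binary instance x (parameters change in place)."""
    v0 = list(x)                              # positive phase
    p_h0 = encode(W, b_h, v0)
    h0 = sample(p_h0, rng)                    # negative phase
    v1 = sample(decode(W, b_v, h0), rng)
    p_h1 = encode(W, b_h, v1)
    for m in range(len(b_h)):                 # parameter update
        for n in range(len(b_v)):
            W[m][n] += lr * (p_h0[m] * v0[n] - p_h1[m] * v1[n])
        b_h[m] += lr * (p_h0[m] - p_h1[m])
    for n in range(len(b_v)):
        b_v[n] += lr * (v0[n] - v1[n])
    return v1


rng = random.Random(0)
M, N = 3, 4
W = [[rng.gauss(0.0, 0.01) for _ in range(N)] for _ in range(M)]
b_h, b_v = [0.0] * M, [0.0] * N
data = [[1, 1, 0, 0], [0, 0, 1, 1]]           # toy binary training set
for epoch in range(100):                       # fixed epoch count instead of a convergence test
    for x in data:
        cd1_step(W, b_h, b_v, x, lr=0.1, rng=rng)
```

A per-instance update is used here because the algorithm is stated that way; averaging the same statistics over a mini-batch is a common variation.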


Predictive Sparse Decomposition 
Predictive sparse decomposition (PSD) [28.43] is 
a building block obtained by combining sparse cod- 
ing [28.44] and auto-encoder ideas. In a PSD building 
block, the encoder is specified by 


$$\mathbf{h}(\mathbf{x}_t) = G \tanh(W_E \mathbf{x}_t + \mathbf{b}_h) , \tag{28.7}$$

where $W_E$ is the M×N connection matrix between input
and hidden neurons in the encoder and $G = \mathrm{diag}(g_{mm})$
is an M×M learnable diagonal gain matrix for an
M-dimensional representation of an N-dimensional input
$\mathbf{x}_t$, $\mathbf{b}_h$ are biases of hidden neurons, and tanh(·) is the
hyperbolic tangent transfer function [28.9]. Accordingly,
the decoder is implemented by a linear mapping used in
the sparse coding [28.44]

$$\hat{\mathbf{x}}_t = W_D \mathbf{h}(\mathbf{x}_t) , \tag{28.8}$$

where $W_D$ is an N×M connection matrix between hidden
and output neurons in the decoder, and each column of
$W_D$ always needs to be normalized to a unit vector to
avoid trivial solutions [28.44].

Given a training set of T instances, $\{\mathbf{x}_t\}_{t=1}^{T}$, the PSD
cost function is defined as

$$L_{PSD}\big(G, W_E, W_D, \mathbf{b}_h; \mathbf{h}^*(\mathbf{x}_t)\big) = \sum_{t=1}^{T} \|W_D \mathbf{h}^*(\mathbf{x}_t) - \mathbf{x}_t\|_2^2 + \alpha \|\mathbf{h}^*(\mathbf{x}_t)\|_1 + \beta \|\mathbf{h}^*(\mathbf{x}_t) - \mathbf{h}(\mathbf{x}_t)\|_2^2 , \tag{28.9}$$

where $\mathbf{h}^*(\mathbf{x}_t)$ is the optimal sparse hidden representation
of $\mathbf{x}_t$, while $\mathbf{h}(\mathbf{x}_t)$ is the output of the encoder in
(28.7) based on the current parameter values. In (28.9),
$\alpha$ and $\beta$ are two hyper-parameters to control
regularization strengths, and $\|\cdot\|_1$ and $\|\cdot\|_2$ are the
$\ell_1$ and $\ell_2$ norms, respectively. Intuitively, in the
multi-objective cost function defined in (28.9), the first
term specifies reconstruction errors, the second term
refers to the magnitude of non-sparse representations,
and the last term drives the encoder towards yielding
the optimal representation.

For learning a PSD building block, the cost function 
in (28.9) needs to be optimized simultaneously with 
respect to the hidden representation and all the param- 
eters. As a result, a learning algorithm of two alternate 
steps has been proposed to solve this problem [28.43] 
as follows: 


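Algorithm 28.2 PSD Learning algorithm

Given a training set of T instances $\{\mathbf{x}_t\}_{t=1}^{T}$, randomly
initialize all the parameters, $W_E$, $W_D$, $G$, $\mathbf{b}_h$, and the
optimal sparse representation $\{\mathbf{h}^*(\mathbf{x}_t)\}_{t=1}^{T}$ in a PSD
building block and pre-set hyper-parameters $\alpha$ and $\beta$
as well as learning rates $\epsilon_i$ $(i = 1, \dots, 4)$:

• Optimal representation update
  In this step, the gradient descent method is applied
  to find the optimal sparse representation based on
  the current parameter values of the encoder and the
  decoder, which leads to the following update rule

  $$\mathbf{h}^*(\mathbf{x}_t) \leftarrow \mathbf{h}^*(\mathbf{x}_t) - \epsilon_1 \left[ \alpha\, \mathrm{sign}\big(\mathbf{h}^*(\mathbf{x}_t)\big) + \beta \big(\mathbf{h}^*(\mathbf{x}_t) - \mathbf{h}(\mathbf{x}_t)\big) + (W_D)^T \big(W_D \mathbf{h}^*(\mathbf{x}_t) - \mathbf{x}_t\big) \right] ,$$

  where sign(·) is the sign function; sign(u) = +1 if
  u > 0, sign(u) = -1 if u < 0, and sign(u) = 0 if u = 0.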
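• Parameter update
  In this step, $\mathbf{h}^*(\mathbf{x}_t)$ achieved in the above step is
  fixed. Then the gradient descent method is applied
  to the cost function (28.9) with respect to all
  encoder and decoder parameters, which results in the
  following update rules

  $$W_E \leftarrow W_E - \epsilon_2\, \mathbf{g}(\mathbf{x}_t)\,(\mathbf{x}_t)^T ,$$
  $$\mathbf{b}_h \leftarrow \mathbf{b}_h - \epsilon_3\, \mathbf{g}(\mathbf{x}_t) ,$$
  $$G \leftarrow G - \epsilon_3\, \mathrm{diag}\!\left( \big[h_m(\mathbf{x}_t) - h_m^*(\mathbf{x}_t)\big] \times \tanh\!\Big( \sum_{n=1}^{N} [W_E]_{mn} x_{tn} + b_{h,m} \Big) \right) ,$$
  and
  $$W_D \leftarrow W_D - \epsilon_4 \big[W_D \mathbf{h}^*(\mathbf{x}_t) - \mathbf{x}_t\big] \big[\mathbf{h}^*(\mathbf{x}_t)\big]^T .$$

  Here $\mathbf{g}(\mathbf{x}_t)$ is obtained by

  $$\mathbf{g}(\mathbf{x}_t) = \Big[ g_{mm}^{-1} \big(h_m(\mathbf{x}_t) - h_m^*(\mathbf{x}_t)\big) \big(g_{mm}^2 - h_m^2(\mathbf{x}_t)\big) \Big]_{m=1}^{M} .$$

  Normalize each column of $W_D$ such that
  $\|[W_D]_m\|_2 = 1$ for $m = 1, \dots, M$.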


The above two steps repeat for all the instances in 
the given training set, which leads to a training epoch. 
The learning algorithm runs iteratively until it con- 
verges. 
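
As an illustration of the first alternating step, the sketch below evaluates the per-instance PSD cost from (28.9) and iterates the optimal-representation update rule with the encoder and decoder held fixed. All names and numeric values are assumptions chosen for the example. Note that the update rule, as stated, does not carry the factors of 2 that differentiating the squared norms would produce, so its fixed point is not exactly the minimizer of (28.9); in this toy case the cost still decreases from its starting value.

```python
import math


def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def encoder(W_E, b_h, gains, x):
    """h(x) = G tanh(W_E x + b_h) with G = diag(gains), as in (28.7)."""
    return [g * math.tanh(u + b) for g, u, b in zip(gains, matvec(W_E, x), b_h)]


def psd_cost(W_D, h_star, h, x, alpha, beta):
    """Per-instance cost: reconstruction + alpha * L1 + beta * prediction error."""
    recon = sum((r - xi) ** 2 for r, xi in zip(matvec(W_D, h_star), x))
    l1 = sum(abs(v) for v in h_star)
    pred = sum((a - b) ** 2 for a, b in zip(h_star, h))
    return recon + alpha * l1 + beta * pred


def sign(u):
    return (u > 0) - (u < 0)


def update_h_star(W_D, h_star, h, x, alpha, beta, lr):
    """One application of the optimal-representation update rule."""
    resid = [r - xi for r, xi in zip(matvec(W_D, h_star), x)]
    # (W_D)^T (W_D h* - x): back-project the reconstruction residual.
    back = [sum(W_D[n][m] * resid[n] for n in range(len(x))) for m in range(len(h_star))]
    return [hs - lr * (alpha * sign(hs) + beta * (hs - hm) + bm)
            for hs, hm, bm in zip(h_star, h, back)]


# Toy setting (assumed values): N = 3 inputs, M = 2 hidden units.
W_D = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]      # columns already unit-norm
W_E = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]
b_h, gains = [0.0, 0.0], [1.0, 1.0]
x = [1.0, -0.5, 0.0]
alpha, beta = 0.1, 0.5

h = encoder(W_E, b_h, gains, x)
h_star = list(h)                                 # start from the encoder prediction
cost_before = psd_cost(W_D, h_star, h, x, alpha, beta)
for _ in range(200):
    h_star = update_h_star(W_D, h_star, h, x, alpha, beta, lr=0.05)
cost_after = psd_cost(W_D, h_star, h, x, alpha, beta)
```

Initializing `h_star` from the encoder output is a convenience of this sketch; the algorithm as stated initializes it randomly.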


Other Building Blocks 
While the auto-encoders and the RBM are building 
blocks widely used to construct DNNs, there are other 
building blocks that are either derived from existing 
building blocks for performance improvement or are 
developed with an alternative principle. Such build- 
ing blocks include regularized auto-encoders and RBM 
variants. Due to the limited space, we briefly overview 
them below. 

Recently, a number of auto-encoder variants have
been developed by adding a regularization term to
the standard reconstruction cost function in (28.1) and
hence are dubbed regularized auto-encoders. The
contractive auto-encoder (CAE) is a typical regularized
version of the auto-encoder with the introduction of the
norm of the Jacobian matrix of the encoder evaluated at
each training example $\mathbf{x}_t$ into the standard
reconstruction cost function [28.45]

$$L_{CAE}(W, \mathbf{b}_h, \mathbf{b}_o) = L(W, \mathbf{b}_h, \mathbf{b}_o) + \alpha \sum_{t=1}^{T} \|J(\mathbf{x}_t)\|_F^2 , \tag{28.10}$$

where $\alpha$ is a trade-off parameter to control the
regularization strength and $\|J(\mathbf{x}_t)\|_F^2$ is the squared
Frobenius norm of the Jacobian matrix of the encoder,
calculated as follows

$$\|J(\mathbf{x}_t)\|_F^2 = \sum_{m=1}^{M} \sum_{n=1}^{N} \Big( f'\big[u_m(\mathbf{x}_t)\big]\, W_{mn} \Big)^2 .$$

Here, $f'(\cdot)$ is the first-order derivative of a transfer
function $f(\cdot)$, $u_m(\mathbf{x}_t)$ denotes the net input to hidden
neuron m, and $f'[u_m(\mathbf{x}_t)] = h_m(\mathbf{x}_t)[1 - h_m(\mathbf{x}_t)]$ when
$f(\cdot)$ is the sigmoid function [28.9]. It is straightforward
to apply the stochastic gradient method [28.12]
to the CAE cost function in (28.10) to derive a learning
algorithm used for training a CAE. Furthermore,
an improved version of CAE was also proposed by
penalizing additional higher order derivatives [28.46].
The sparse auto-encoder (SAE) is another class of
regularized auto-encoders. The basic idea underlying
SAEs is the introduction of a sparse regularization term
working on either hidden neuron biases, e.g., [28.47],
or their outputs, e.g., [28.48], into the standard
reconstruction cost function. Different forms of sparsity
penalties, e.g., the $\ell_1$ norm and Student-t, are employed
for regularization, and the learning algorithm is derived
by applying the coordinate descent optimization
procedure to a new reconstruction cost function [28.47,
48].
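
For a sigmoid encoder, the CAE penalty has a cheap closed form, because the Jacobian entry for hidden unit m and input n is simply h_m(1 - h_m) W_mn. The following sketch, with assumed helper names and toy parameter values, computes the squared Frobenius norm both from that closed form and from central finite differences of the encoder, so that the two can be compared.

```python
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def encode(W, b_h, x):
    """Sigmoid encoder h(x); W is M x N."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W, b_h)]


def jacobian_penalty(W, b_h, x):
    """Closed form: sum_m [h_m (1 - h_m)]^2 * sum_n W_mn^2."""
    h = encode(W, b_h, x)
    return sum((hm * (1.0 - hm)) ** 2 * sum(w * w for w in row) for hm, row in zip(h, W))


def numeric_penalty(W, b_h, x, eps=1e-6):
    """The same quantity from central finite differences of the encoder."""
    total = 0.0
    for n in range(len(x)):
        hi = encode(W, b_h, [xi + eps * (i == n) for i, xi in enumerate(x)])
        lo = encode(W, b_h, [xi - eps * (i == n) for i, xi in enumerate(x)])
        total += sum(((a - b) / (2.0 * eps)) ** 2 for a, b in zip(hi, lo))
    return total


# Toy encoder (assumed values): M = 2 hidden units, N = 3 inputs.
W = [[0.4, -0.7, 0.2], [1.1, 0.3, -0.5]]
b_h = [0.1, -0.2]
x = [0.5, -1.0, 2.0]
penalty = jacobian_penalty(W, b_h, x)
check = numeric_penalty(W, b_h, x)    # agrees with `penalty` up to rounding
```

The closed form costs O(MN) per example, which is why the penalty can be added to the reconstruction cost without changing the overall training complexity.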

The RBM described above works only for an input
of binary values. When an input has real values, a
variant named Gaussian RBM (GRBM) [28.49] has been
proposed with the following energy function

$$E(\mathbf{v}, \mathbf{h}) = -\sum_{m=1}^{M}\sum_{n=1}^{N} W_{mn} h_m \frac{v_n}{\sigma_n} - \sum_{m=1}^{M} h_m b_{h,m} + \sum_{n=1}^{N} \frac{(v_n - b_{v,n})^2}{2\sigma_n^2} , \tag{28.11}$$

where $\sigma_n$ is the standard deviation of the Gaussian
noise for the visible neuron n. In the CD learning
algorithm, the update rule for the hidden neurons remains
the same except that each $v_n$ is substituted by $v_n/\sigma_n$, and
the update rule for all visible neurons needs to use
reconstructions $v_n$ produced by sampling from a Gaussian
distribution with mean $\sigma_n \sum_{m=1}^{M} W_{mn} h_m + b_{v,n}$ and
variance $\sigma_n^2$ for $n = 1, \dots, N$. In addition, an improved
GRBM was also proposed by introducing an alternative
parameterization of the energy function in (28.11) and
incorporating it into the CD algorithm [28.50]. Other
RBM variants will be discussed later on, as they often
play a different role from being used to construct
a DNN.


28.2.3 Hybrid Learning Strategy 


Based on the building blocks described in Sect. 28.2.2,
we describe a systematic approach to establishing
a feed-forward DNN for supervised and semi-supervised
learning. This approach employs a hybrid
learning strategy that combines unsupervised and
supervised learning paradigms to overcome the
optimization difficulty in training DNNs. The hybrid learning
strategy [28.18, 40] first applies layer-wise greedy
unsupervised learning to set up a DNN and initialize
parameters with input data only and then uses a global
supervised learning algorithm with teachers' information
to train all the parameters in the initialized DNN
for a given task.


Layer-Wise Unsupervised Learning 
In the hybrid learning strategy, unsupervised learning 
is a layer-wise greedy learning process that constructs 
a DNN with a chosen building block and initializes pa- 
rameters in a layer-by-layer way. 

Suppose we want to establish a DNN of K (K > 1)
hidden layers and denote output of hidden layer k as
$\mathbf{h}_k(\mathbf{x})$ $(k = 1, \dots, K)$ for a given input x and output of
the output layer as o(x), respectively. To facilitate the
presentation, we stipulate $\mathbf{h}_0(\mathbf{x}) = \mathbf{x}$. Then, the generic
layer-wise greedy learning procedure can be summarized
as follows:


Algorithm 28.3 Layer-wise greedy learning procedure

Given a training set of T instances $\{\mathbf{x}_t\}_{t=1}^{T}$, randomly
initialize the parameters in a chosen building block and
pre-set all hyper-parameters required for learning such
a building block:

• Train a building block for hidden layer k
  - Set the number of neurons required by hidden
    layer k to be the dimension of the hidden
    representation in the chosen building block.
  - Use the training data set $\{\mathbf{h}_{k-1}(\mathbf{x}_t)\}_{t=1}^{T}$ to train
    the building block to achieve its optimal
    parameters.
• Construct a DNN up to hidden layer k
  With the trained building block in the above step,
  discard its decoder part, including all associated
  parameters, and stack its hidden layer on the existing
  DNN with connection weights of the encoder and
  biases of hidden neurons achieved in the above step
  (the input layer $\mathbf{h}_0(\mathbf{x}) = \mathbf{x}$ is viewed as the starting
  architecture of a DNN).

The above steps are repeated for $k = 1, \dots, K$.
Then, the output layer o(x) is stacked onto hidden layer
K with randomly initialized connection weights so as to
finalize the initial DNN construction and its parameter
initialization.
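
The data flow of the layer-wise procedure can be sketched independently of the building block that is chosen. In the code below, `train_block_stub` is a hypothetical placeholder that merely returns randomly initialized encoder parameters; in practice it would be replaced by a real trainer such as CD-1 or PSD learning. Sigmoid hidden units and all names are assumptions of this sketch.

```python
import math
import random


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def encode(W, b, x):
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bi) for row, bi in zip(W, b)]


def train_block_stub(data, n_hidden, rng):
    """Hypothetical stand-in for a building-block trainer.

    A real trainer would fit the block to `data`; this stub only draws
    random encoder parameters so that the stacking logic can be run.
    """
    n_in = len(data[0])
    W = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
    b = [0.0] * n_hidden
    decoder = None            # a real trainer would return decoder parameters
    return (W, b), decoder


def greedy_pretrain(data, layer_sizes, train_block, rng):
    """Stack one trained encoder per hidden layer, k = 1, ..., K."""
    layers = []
    reps = [list(x) for x in data]                 # h_0(x_t) = x_t
    for n_hidden in layer_sizes:
        (W, b), _decoder = train_block(reps, n_hidden, rng)   # decoder is discarded
        layers.append((W, b))
        reps = [encode(W, b, h) for h in reps]     # h_k(x_t) trains layer k + 1
    return layers


def forward(layers, x):
    h = list(x)
    for W, b in layers:
        h = encode(W, b, h)
    return h


data = [[0.0, 1.0, 1.0, 0.0], [1.0, 0.0, 0.0, 1.0]]
layers = greedy_pretrain(data, [3, 2], train_block_stub, random.Random(0))
top = forward(layers, data[0])                     # h_K(x) for the first instance
```

The output layer with randomly initialized weights would be appended after the loop, before global supervised fine-tuning.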


Figure 28.6 illustrates two typical instances for
constructing an initial DNN via the layer-wise greedy
learning procedure described above. Figure 28.6a
shows a schematic diagram of the layer-wise greedy
learning process with the auto-encoder or its variants;
to construct the hidden layer k, the output layer
and its associated parameters $W^T$ and $\mathbf{b}_o$ are
removed and the remaining part is stacked onto hidden
layer k - 1, and $W_o$ is a randomly initialized
weight matrix for the connection between the hidden
layer K and the output layer. When a DNN is
constructed with the RBM or its variants, all backward
connection weights in the decoder are abandoned
after training and only the hidden layer with its
forward connection weights and biases of hidden
neurons is used to construct the DNN, as depicted in
Fig. 28.6b.

Fig. 28.6a,b Construction of a DNN with a building block via
layer-wise greedy learning. (a) Auto-encoder or its variants.
(b) RBM or its variants


Global Supervised Learning
Once a DNN is constructed and initialized based on
the layer-wise greedy learning procedure, it is ready
to be further trained in a supervised fashion for
a classification or regression task. There are a variety
of optimization methods for supervised learning,
e.g., stochastic gradient descent and the second-order
Levenberg-Marquardt methods [28.9, 12]. There are also
cost functions of different forms used for various
supervised learning tasks and for regularization towards
improving the generalization of a DNN. Due to the
limited space, we only review the stochastic gradient
descent algorithm with a generic cost function for
global supervised learning.

For a generic cost function $L(\Theta, D)$, where $\Theta$
is a collective notation of all parameters in a DNN
(Fig. 28.6) and D is a training data set for a given
supervised learning task, applying the stochastic gradient
descent method [28.9, 12] to $L(\Theta, D)$ leads to the
following learning algorithm for fine-tuning parameters.


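Algorithm 28.4 Global supervised learning algorithm

Given a training set of T examples $D = \{(\mathbf{x}_t, \mathbf{y}_t)\}_{t=1}^{T}$,
pre-set a learning rate $\epsilon$ (and other hyper-parameters
if required). Furthermore, the training set is randomly
divided into many mini-batches of $T_B$ examples
$\{(\mathbf{x}_t, \mathbf{y}_t)\}_{t=1}^{T_B}$, and then parameters are updated based on
each mini-batch. $\Theta = \big(\{W_k\}_{k=1}^{K+1}, \{\mathbf{b}_k\}_{k=1}^{K+1}\big)$ are all
parameters in a DNN, where $W_k$ is the weight matrix
for the connection between the hidden layers k and
k - 1, and $\mathbf{b}_k$ is biases of neurons in layer k (Fig. 28.6).
Here, input and output layers are stipulated as layers 0
and K + 1, respectively, i.e., $\mathbf{h}_0(\mathbf{x}_t) = \mathbf{x}_t$, $W_o = W_{K+1}$,
$\mathbf{b}_o = \mathbf{b}_{K+1}$ and $\mathbf{o}(\mathbf{x}_t) = \mathbf{h}_{K+1}(\mathbf{x}_t)$:

• Forward computation
  Given the input $\mathbf{x}_t$, for $k = 1, \dots, K+1$, the output
  of layer k is

  $$\mathbf{h}_k(\mathbf{x}_t) = f\big(\mathbf{u}_k(\mathbf{x}_t)\big) , \quad \mathbf{u}_k(\mathbf{x}_t) = W_k \mathbf{h}_{k-1}(\mathbf{x}_t) + \mathbf{b}_k .$$

• Backward gradient computation
  Given a cost function on each mini-batch, $L_B(\Theta, D)$,
  calculate gradients at the output layer, i.e.,

  $$\frac{\partial L_B(\Theta, D)}{\partial \mathbf{h}_{K+1}(\mathbf{x}_t)} = \frac{\partial L_B(\Theta, D)}{\partial \mathbf{o}(\mathbf{x}_t)} ,$$
  $$\frac{\partial L_B(\Theta, D)}{\partial \mathbf{u}_{K+1}(\mathbf{x}_t)} = \left( \frac{\partial L_B(\Theta, D)}{\partial h_{(K+1)j}(\mathbf{x}_t)}\, f'\big(u_{(K+1)j}(\mathbf{x}_t)\big) \right)_{j=1}^{|\mathbf{o}|} ,$$

  where $f'(\cdot)$ is the first-order derivative of the transfer
  function $f(\cdot)$.

  For all hidden layers, i.e., $k = K, \dots, 1$, applying
  the chain rule leads to

  $$\frac{\partial L_B(\Theta, D)}{\partial \mathbf{u}_k(\mathbf{x}_t)} = \left( \frac{\partial L_B(\Theta, D)}{\partial h_{kj}(\mathbf{x}_t)}\, f'\big(u_{kj}(\mathbf{x}_t)\big) \right)_{j=1}^{|\mathbf{h}_k|}$$

  and

  $$\frac{\partial L_B(\Theta, D)}{\partial \mathbf{h}_k(\mathbf{x}_t)} = \big(W_{k+1}\big)^T \frac{\partial L_B(\Theta, D)}{\partial \mathbf{u}_{k+1}(\mathbf{x}_t)} .$$

  For $k = K+1, \dots, 1$, gradients with respect to
  biases of layer k are

  $$\frac{\partial L_B(\Theta, D)}{\partial \mathbf{b}_k} = \frac{\partial L_B(\Theta, D)}{\partial \mathbf{u}_k(\mathbf{x}_t)} .$$

• Parameter update
  Applying the gradient descent method results in the
  following update rules:
  For $k = K+1, \dots, 1$

  $$W_k \leftarrow W_k - \frac{\epsilon}{T_B} \sum_{t=1}^{T_B} \frac{\partial L_B(\Theta, D)}{\partial \mathbf{u}_k(\mathbf{x}_t)} \big(\mathbf{h}_{k-1}(\mathbf{x}_t)\big)^T ,$$
  $$\mathbf{b}_k \leftarrow \mathbf{b}_k - \frac{\epsilon}{T_B} \sum_{t=1}^{T_B} \frac{\partial L_B(\Theta, D)}{\partial \mathbf{u}_k(\mathbf{x}_t)} .$$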

The above three steps repeat for all mini-batches, 
which leads to a training epoch. The learning algorithm 
runs iteratively until a termination condition is met (typ- 
ically based on a cross-validation procedure [28.12]). 


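For the above learning algorithm, the BP algorithm
[28.11] is a special case when the transfer function
is the sigmoid function, i.e., $f(u) = \phi(u)$, and the
cost function is the mean square error (MSE) function,
i.e., for each mini-batch

$$L_B(\Theta, D) = \frac{1}{2T_B} \sum_{t=1}^{T_B} \|\mathbf{o}(\mathbf{x}_t) - \mathbf{y}_t\|_2^2 .$$

Thus, we have

$$\phi'(u) = \phi(u)\big(1 - \phi(u)\big) ,$$
$$\frac{\partial L_B(\Theta, D)}{\partial \mathbf{o}(\mathbf{x}_t)} = \frac{1}{T_B} \sum_{t=1}^{T_B} \big(\mathbf{o}(\mathbf{x}_t) - \mathbf{y}_t\big) ,$$

and

$$\frac{\partial L_B(\Theta, D)}{\partial \mathbf{u}_{K+1}(\mathbf{x}_t)} = \frac{1}{T_B} \sum_{t=1}^{T_B} \Big[ \big(o_j(\mathbf{x}_t) - y_{tj}\big)\, o_j(\mathbf{x}_t) \big(1 - o_j(\mathbf{x}_t)\big) \Big]_{j=1}^{|\mathbf{o}|} .$$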
J 


28.2.4 Relevant Issues 


In the literature, the hybrid learning strategy described 
in Sect. 28.2.3 is often called the semi-supervised learn- 
ing strategy [28.17, 39]. Nevertheless, semi-supervised 
learning implies the situation that there are few labeled 
examples but many unlabeled instances in a training set. 
Indeed, such a strategy works well in a situation where 
both unlabeled and labeled data in a training set are 
used for layer-wise greedy learning, and only labeled 
data are used for fine-tuning in global supervised learn- 
ing. However, other studies, e.g., [28.28, 29,51], also 
show that this strategy can considerably improve the 
generalization of a DNN even though there are abun- 
dant labeled examples in a training data set. Hence we 
would rather name it hybrid learning. On the other hand, 
our review focuses on only primary supervised learn- 
ing tasks in the context of NC. In a wider context, the 
unsupervised learning process itself develops a novel 
approach to automatic feature discovery/extraction via 
learning, which is an emerging ML area named
representation learning [28.39]. In such a context, some
DNNs can act as generative models. For instance,
the DBN [28.18] is an RBM-based DNN that retains
both forward and backward connections during
layer-wise greedy learning. To be a generative model, the
DBN needs an alternative learning algorithm, e.g., the
wake-sleep algorithm [28.18], for global unsupervised
learning. In general, the global unsupervised learning
for a generative DNN is still a challenging problem.
While the hybrid learning strategy has been success- 
fully applied to many complex AI tasks, in general, it is 
still not entirely clear why such a strategy works well 
empirically. A recent study attempted to provide some 
justification of the role played by layer-wise greedy 
learning for supervised learning [28.51]. The findings 
can be summarized as follows: such an unsupervised 
learning process brings about a regularization effect that 
initializes DNN parameters towards the basin of attrac- 
tion corresponding to a good local minimum, which 
facilitates global supervised learning in terms of gener- 
alization [28.51]. In general, a deeper understanding of 
such a learning strategy will be required in the future. 


On the other hand, a success story was recently
reported [28.52] where no unsupervised pre-training was
used in the DNN learning for a non-trivial task, which
poses another open problem as to when and where
such a learning strategy must be employed for training
a DNN to yield a satisfactory generalization
performance.

Recent studies also suggest that the use of arti- 
ficially distorted or deformed training data and un- 
supervised front-ends can considerably improve the 
performance of DNNs regardless of the hybrid learning 
strategy. As DNN learning is data-driven in nature,
augmenting training data with known input deformations
amounts to using more representative examples
that convey the intrinsic variations underlying a class
of data. For example, speech corrupted by
known channel noise and images deformed by
affine transformations and added noise have
significantly improved the DNN performance in various
speech and visual information processing tasks [28.22, 
28, 29,51,52]. On the other hand, the generic build- 
ing blocks reviewed in Sect. 28.2.2 can be extended 
to be specialist front-ends by exploiting intrinsic data 
structures. For instance, the RBM has several vari- 
ants, e.g., [28.53-55], to capture covariance and other 
statistical information underlying an image. After un- 
supervised learning, such front-ends generate power- 
ful representations that greatly facilitate further DNN 
learning in visual information processing. 

While our review focuses on only fully connected 
feed-forward DNNs, there are alternative and more 
effective DNN architectures for specific tasks.
Convolutional DNNs [28.12] make use of topological locality
constraints underlying images to form more effective
locally connected DNN architectures. Furthermore,
various pooling techniques [28.56] used in convolutional
DNNs facilitate learning invariant and robust features.
With appropriate building blocks, e.g., the PSD re- 
viewed in Sect. 28.2.2, convolutional DNNs work very 
well with the hybrid learning strategy [28.43,57]. In 
addition, novel DNN architectures need to be devel- 
oped by exploring the nature of a specific problem, e.g., 
a regularized Siamese DNN was recently developed for 
generic speaker-specific information extraction [28.28, 
29]. As a result, novel DNN architecture development 
and model selection are among important DNN re- 
search topics. 

Finally, theoretical justification of deep learning and
the hybrid learning strategy, along with other recently
developed techniques, e.g., parallel graphics processing
unit (GPU) computing, enable researchers to develop
large-scale DNNs to tackle very complex real world
problems. While some theoretical justification has been
provided in the literature, e.g., [28.16, 17, 39], to show
strengths in their potential capacity and efficient
representational schemes of DNNs, more and more
successful applications of DNNs, particularly working with
the hybrid learning strategy, lend evidence to support
the argument that DNNs are one of the most promising
learning systems for dealing with complex and
large-scale real world problems. For example, such
evidence can be found from one of the latest developments
in a DNN application to computer vision where it is
demonstrated that applying a DNN of nine layers
constructed with the SAE building block via layer-wise
greedy learning results in favorable performance in
object recognition of over 22 000 categories [28.58].


28.3 Modular Neural Networks


In this section, we review main modular neural
networks (MNN) and their learning algorithms with our
own taxonomy. We first review background and
motivation for MNN research and present our MNN
taxonomy. Then we describe major MNN architectures
and relevant learning algorithms. Finally, we examine
relevant issues related to MNNs in a broader
context.


28.3.1 Background and Motivation


Soon after neural network (NN) research resurged in the
middle of the 1980s, MNN studies emerged; they have
become an important area in NC since then. There are
a variety of motivations that inspire MNN research,
e.g., biological, psychological, computational, and
implementation motivations [28.4, 5, 9]. Here, we only
describe the background and motivation of MNN
research from learning and computational perspectives.

From the learning perspective, MNNs have several
advantages over monolithic NNs. First of all, MNNs
adopt an alternative methodology for learning, so that
a complex problem can be solved based on an ensemble of
simple NNs, which might avoid or alleviate the complex
optimization problems encountered in monolithic NN
learning without decreasing the learning capacity. Next,
modularity enables MNNs to use a priori knowledge
flexibly and facilitates knowledge integration and
update in learning. To a great extent, MNNs are immune
to temporal and spatial cross-talk, a problem faced by
monolithic NNs during learning [28.9]. Finally,
theoretical justification and abundant empirical evidence show
that an MNN often yields a better generalization than
its component networks [28.5, 59]. From the
computational perspective, modularization in MNNs leads to
more efficient and robust computation, given the fact
that MNNs often do not suffer from a high coupling
burden in a monolithic NN and hence tend to have
a lower overall structural complexity in tackling the
same problem [28.5]. This main computational merit
makes MNNs scalable and extensible to large-scale
MNN implementation.

There are two highly influential principles that are
often used in artificial MNN development, i.e.,
divide-and-conquer and diversity-promotion. The
divide-and-conquer principle refers to a generic methodology that
tackles a complex and difficult problem by dividing it
into several relatively simple and easy subproblems,
whose solutions can be combined seamlessly to yield
a final solution. On the other hand, theoretical
justification [28.60, 61] and abundant empirical studies [28.62]
suggest that, apart from the condition that component
networks need to reach a certain accuracy, the
success of MNNs is largely attributed to diversity among
them. Hence, the promotion of diversity in MNNs
becomes critical in their design and development. To
understand motivations and ideas underlying different
MNNs, we believe that it is crucial to examine how these
two principles are applied in their development.

There are different taxonomies of MNNs [28.4, 5, 
9]. In this chapter, we present an alternative taxonomy 
that highlights the interaction among component net- 
works in an MNN during learning. As a result, there 
is a dichotomy between tightly and loosely coupled 
models in MNNs. In a tightly coupled MNN, all com- 
ponent networks are jointly trained in a dependent way 
by taking their interaction into account during a single 
learning stage, and hence all parameters of different net- 
works (and combination mechanisms if there are any) 
need to be updated simultaneously by minimizing a cost 
function defined at the global level. In contrast, training 
of a loosely coupled MNN often undergoes multi- 
ple stages in a hierarchical or sequential way where 
learning undertaken in different stages may be either 
correlated or uncorrelated via different strategies. We
believe that such a taxonomy facilitates not only
understanding different MNNs, especially from a learning
perspective, but also relating MNNs to generic ensemble
learning in a broader context.


28.3.2 Tightly Coupled Models 


There are two typical tightly coupled MNNs: the mix- 
ture of experts (MoE) [28.63, 64] and MNNs trained via 
negative correlation learning (NCL) [28.65]. 


Mixture of Experts 
The MoE [28.63, 64] refers to a class of MNNs that 
dynamically partition input space to facilitate learn- 
ing in a complex and non-stationary environment. 
By applying the divide-and-conquer principle, a soft- 
competition idea was proposed to develop the MoE 
architecture. That is, at every input data point, multiple 
expert networks compete to take on a given supervised 
learning task. Instead of winner-take-all, all expert net- 
works may work together but the winner expert plays 
a more important role than the losers. 

Fig. 28.7 Architecture of the mixture of experts (expert
networks 1 to N and a gating network share the input x;
the overall output is o(x))

The MoE architecture is composed of N expert
networks and a gating network, as illustrated in Fig. 28.7.
The n-th expert network produces an output vector,
$\mathbf{o}_n(\mathbf{x})$, for an input, x. The gating network receives the
vector x as input and produces N scalar outputs that
form a partition of the input space at each point x. For
the input x, the gating network outputs N linear
combination coefficients used to weigh the importance of all
expert networks for a given supervised learning task.
The final output of MoE is a convex weighted sum of
all the outputs yielded by the N expert networks. Although
NNs of different types can be used as expert networks,
a class of generalized linear NNs are often employed
where such an NN is linear with a single output
nonlinearity [28.64]. As a result, output of the n-th expert
network is a generalized linear function of the input x

$$\mathbf{o}_n(\mathbf{x}) = f(W_n \mathbf{x}) ,$$

where $W_n$ is a parameter matrix, a collective notation
for both connection weights and biases, and $f(\cdot)$ is
a nonlinear transfer function. The gating network is also
a generalized linear model, and its n-th output $g(\mathbf{x}, \mathbf{v}_n)$
is the softmax function of $\mathbf{v}_n^T \mathbf{x}$

$$g(\mathbf{x}, \mathbf{v}_n) = \frac{e^{\mathbf{v}_n^T \mathbf{x}}}{\sum_{k=1}^{N} e^{\mathbf{v}_k^T \mathbf{x}}} ,$$

where $\mathbf{v}_n$ is the n-th column of the parameter matrix V
in the gating network and is responsible for the linear
combination coefficient regarding the expert network n.
The overall output of the MoE is the weighted sum
resulting from the soft-competition at the point x

$$\mathbf{o}(\mathbf{x}) = \sum_{n=1}^{N} g(\mathbf{x}, \mathbf{v}_n)\, \mathbf{o}_n(\mathbf{x}) .$$
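
The forward pass of an MoE with generalized linear experts can be sketched directly from these three formulas. The choice of tanh as the transfer function, the trailing constant 1 in the input that plays the role of the bias, and all names are assumptions made for this illustration.

```python
import math


def softmax(scores):
    top = max(scores)                       # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gating(V, x):
    """g(x, v_n) for n = 1..N; V is stored here as a list of the vectors v_n."""
    return softmax([dot(v_n, x) for v_n in V])


def expert(W_n, x, f=math.tanh):
    """o_n(x) = f(W_n x); tanh is an illustrative transfer function."""
    return [f(dot(row, x)) for row in W_n]


def moe_output(V, experts, x):
    """Convex combination of the expert outputs, weighted by the gates."""
    g = gating(V, x)
    outs = [expert(W_n, x) for W_n in experts]
    return [sum(g_n * o[j] for g_n, o in zip(g, outs)) for j in range(len(outs[0]))]


# Two experts with scalar outputs; the last input component is a constant bias term.
x = [0.5, -1.0, 1.0]
experts = [[[1.0, 0.0, 0.0]], [[0.0, 1.0, 0.0]]]
V = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
gates = gating(V, x)
y = moe_output(V, experts, x)
```

Because the gates are non-negative and sum to one, each component of `y` always lies between the smallest and largest corresponding expert output, which is the soft-competition behavior described above.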


There is a natural probabilistic interpretation of the
MoE [28.64]. For a training example (x, y), the values
of $g(\mathbf{x}, V) = \big(g(\mathbf{x}, \mathbf{v}_n)\big)_{n=1}^{N}$ are interpreted as the
multinomial probabilities associated with the decision that
terminates in a regressive process that maps x to y.
Once a decision has been made that leads to a choice
of regressive process n, the output y is chosen from
a probability distribution $P(\mathbf{y}|\mathbf{x}, W_n)$. Hence, the
overall probability of generating y from x is the mixture of
the probabilities of generating y from each component
distribution and the mixing proportions are subject to
a multinomial distribution

$$P(\mathbf{y}|\mathbf{x}, \Theta) = \sum_{n=1}^{N} g(\mathbf{x}, \mathbf{v}_n)\, P(\mathbf{y}|\mathbf{x}, W_n) , \tag{28.12}$$

where $\Theta$ is a collective notation of all the parameters in
the MoE, including both expert and gating network
parameters. For different learning tasks, specific
component distribution models are required. For example, the
probabilistic component model should be a Gaussian
distribution for a regression task, while a Bernoulli
distribution and multinomial distributions are required for
binary and multi-class classification tasks, respectively.
In general, MoE is viewed as a conditional mixture
model for supervised learning, a non-trivial extension
of the finite mixture model for unsupervised learning.

By means of the above probabilistic interpretation,
learning in the MoE is treated as a maximum likelihood
problem defined based on the model in (28.12).
An expectation-maximization (EM) algorithm was
proposed to update parameters in the MoE [28.64]. It is
summarized as follows:


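Algorithm 28.5 EM algorithm for MoE learning
Given a training set of T examples $D = \{(\mathbf{x}_t, \mathbf{y}_t)\}_{t=1}^{T}$,
pre-set the number of expert networks, N, and randomly
initialize all the parameters $\Theta = \{V, (W_n)_{n=1}^{N}\}$ in the
MoE:

• E-step
  For each example, $(\mathbf{x}_t, \mathbf{y}_t)$ in D, estimate posterior
  probabilities, $\mathbf{h}_t = (h_{nt})_{n=1}^{N}$, with the current
  parameter values, $\hat{V}$ and $\{\hat{W}_n\}_{n=1}^{N}$

  $$h_{nt} = \frac{g(\mathbf{x}_t, \hat{\mathbf{v}}_n)\, P(\mathbf{y}_t|\mathbf{x}_t, \hat{W}_n)}{\sum_{k=1}^{N} g(\mathbf{x}_t, \hat{\mathbf{v}}_k)\, P(\mathbf{y}_t|\mathbf{x}_t, \hat{W}_k)} .$$

• M-step
  - For expert network n $(n = 1, \dots, N)$, solve the
    maximization problems

    $$\hat{W}_n = \arg\max_{W_n} \sum_{t=1}^{T} h_{nt} \log P(\mathbf{y}_t|\mathbf{x}_t, W_n) ,$$

    with all examples in D and posterior probabilities
    $\{\mathbf{h}_t\}_{t=1}^{T}$ achieved in the E-step.
  - For the gating network, solve the maximization
    problem

    $$\hat{V} = \arg\max_{V} \sum_{t=1}^{T} \sum_{n=1}^{N} h_{nt} \log g(\mathbf{x}_t, \mathbf{v}_n) ,$$

    with training examples, $\{(\mathbf{x}_t, \mathbf{h}_t)\}_{t=1}^{T}$, derived
    from posterior probabilities $\{\mathbf{h}_t\}_{t=1}^{T}$.

Repeat the E-step and the M-step alternately until
the EM algorithm converges.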
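
The E-step is the part of the algorithm that is fully determined by (28.12), and it can be sketched in a few lines. The sketch below assumes a scalar regression task with linear experts and a unit-variance Gaussian component distribution; these modeling choices, the helper names, and the toy numbers are illustrative assumptions, and the M-step (which needs an inner optimizer) is not shown.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_pdf(y, mean, sigma=1.0):
    return math.exp(-0.5 * ((y - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def e_step(V, experts, data):
    """Posterior h_nt for every example t and expert n."""
    posteriors = []
    for x, y in data:
        # Prior responsibility of each expert from the gating network.
        g = softmax([sum(v * xi for v, xi in zip(v_n, x)) for v_n in V])
        # Likelihood of the target under each (linear-Gaussian) expert.
        lik = [gaussian_pdf(y, sum(w * xi for w, xi in zip(w_n, x))) for w_n in experts]
        joint = [g_n * p_n for g_n, p_n in zip(g, lik)]
        total = sum(joint)
        posteriors.append([j / total for j in joint])
    return posteriors


# Two experts predicting +x_1 and -x_1; the gate is uniform (all-zero V).
V = [[0.0, 0.0], [0.0, 0.0]]
experts = [[1.0, 0.0], [-1.0, 0.0]]
data = [([1.0, 1.0], 0.9), ([1.0, 1.0], -1.2)]
h = e_step(V, experts, data)
```

With a uniform gate, each posterior reduces to a normalized likelihood, so the expert whose prediction is closer to the target receives the larger responsibility; these responsibilities are the weights used in the two M-step problems.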


To solve optimization problems in the M-step, the
iteratively re-weighted least squares (IRLS) algorithm
was proposed [28.64]. Although the IRLS algorithm
is able to solve maximum likelihood problems
arising from MoE learning, it might result in
unstable performance owing to an incorrect assumption
it makes in multi-class classification [28.66]. As learning
in the gating network is a multi-class classification 


task in essence, the problem always exists if the IRLS 
algorithm is used in the EM algorithm. Fortunately, 
improved algorithms were proposed to remedy this 
problem in the EM learning [28.66]. In summary, nu- 
merous MoE variants and extensions have been de- 
veloped in the past 20 years [28.59], and the MoE 
architecture turns out to be one of the most successful 
MNNs. 


Negative Correlation Learning 
The NCL [28.65] is a learning algorithm to establish 
an MNN consisting of diverse neural networks (NNs) 
by promoting the diversity among component networks 
during learning. The NCL development was clearly in- 
spired by the bias-variance analysis of generalization 
errors [28.60, 61]. As a result, the NCL encourages co- 
operation among component networks via interaction 
during learning to manage the bias-variance trade-off. 

In the NCL, an unsupervised penalty term is in- 
troduced to the MSE cost function for each com- 
ponent NN so that the error diversity among com- 
ponent networks is explicitly managed via training 
towards negative correlation. Suppose that an NN ensemble
$F(x, \Theta)$ is established by simply taking the
average of N neural networks $f(x, W_n)$ (n = 1, ..., N),
where $W_n$ denotes all the parameters in the n-th component
network and $\Theta = \{W_n\}_{n=1}^{N}$. Given a training
set $D = \{(x_t, y_t)\}_{t=1}^{T}$, the NCL cost function for the
n-th component network [28.65] is defined as follows


$$L(D, W_n) = \frac{1}{2T} \sum_{t=1}^{T} \| f(x_t, W_n) - y_t \|_2^2 - \frac{\lambda}{2T} \sum_{t=1}^{T} \| f(x_t, W_n) - F(x_t, \Theta) \|_2^2 , \qquad (28.13)$$


where $F(x_t, \Theta) = \frac{1}{N} \sum_{n=1}^{N} f(x_t, W_n)$ and $\lambda$ is a trade-off
hyper-parameter. In (28.13), the first term is the
MSE cost for network n and the second term refers
to the negative correlation cost. By taking all component
networks into account, minimizing the second term
leads to maximum negative correlation among them.
Therefore, $\lambda$ needs to be set properly to control the
penalty strength [28.65].

For the NCL, all N cost functions specified in 
(28.13) need to be optimized together for parameter es- 
timation. Based on the stochastic descent method, the 
generic NCL algorithm is summarized as follows: 




Algorithm 28.6 Negative correlation learning algorithm

Given a training set of T examples $D = \{(x_t, y_t)\}_{t=1}^{T}$,
pre-set the number of component networks, N, and
learning rate, $\epsilon$, as well as randomly initialize all the
parameters $\Theta = \{W_n\}_{n=1}^{N}$ in component networks:


• Output computation
For each example, $(x_t, y_t)$ in D, calculate output of
each component network $f(x_t, W_n)$ and that of the
NN ensemble by

$$F(x_t, \Theta) = \frac{1}{N} \sum_{n=1}^{N} f(x_t, W_n) .$$

• Gradient computation
For component network n (n = 1, ..., N), calculate
the gradient of the NCL cost function in (28.13)
with respect to the parameters based on all training
examples in D

$$\frac{\partial L(D, W_n)}{\partial W_n} = \frac{1}{T} \sum_{t=1}^{T} \left[ \big( f(x_t, W_n) - y_t \big) - \frac{\lambda (N-1)}{N} \big( f(x_t, W_n) - F(x_t, \Theta) \big) \right]^{\top} \frac{\partial f(x_t, W_n)}{\partial W_n} .$$

• Parameter update
For component network n (n = 1, ..., N), update
the parameters

$$W_n \leftarrow W_n - \epsilon \, \frac{\partial L(D, W_n)}{\partial W_n} .$$


Repeat the above three steps until a pre-specified 
termination condition is met. 
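
The three steps can be sketched in Python as follows. This is only an illustration under simplifying assumptions: each component network is a linear model f(x, W_n) = x · W_n with a scalar output, so that the derivative of f with respect to W_n is simply x, and the data are synthetic. The names `ncl_epoch` and `ensemble_mse` and the values of the trade-off parameter and learning rate are hypothetical choices.

```python
import numpy as np

def ncl_epoch(X, y, weights, lam=0.5, lr=0.05):
    """One pass of output computation, gradient computation and parameter update."""
    n_nets, n_examples = len(weights), len(X)
    # Output computation: every component output and the ensemble average.
    outputs = np.stack([X @ w for w in weights])      # shape (N, T)
    ensemble = outputs.mean(axis=0)                   # shape (T,)
    coeff = lam * (n_nets - 1) / n_nets               # factor of the penalty gradient
    new_weights = []
    for w, out in zip(weights, outputs):
        # Gradient computation: MSE residual minus the negative correlation term.
        residual = (out - y) - coeff * (out - ensemble)
        grad = X.T @ residual / n_examples
        # Parameter update: plain gradient step with learning rate lr.
        new_weights.append(w - lr * grad)
    return new_weights

def ensemble_mse(X, y, weights):
    prediction = np.mean([X @ w for w in weights], axis=0)
    return float(np.mean((prediction - y) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5])                    # noise-free linear target
weights = [rng.normal(size=3) for _ in range(4)]      # N = 4 component networks

mse_before = ensemble_mse(X, y, weights)
for _ in range(500):                                  # fixed number of passes as termination condition
    weights = ncl_epoch(X, y, weights)
mse_after = ensemble_mse(X, y, weights)
```

Because the penalty terms of the N components cancel in the average, the ensemble mean in this linear setting follows ordinary gradient descent on the ensemble error, while the penalty only slows the rate at which individual components collapse onto the ensemble output.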


While the NCL was originally proposed based on 
the MSE cost function, the NCL idea can be extended to 
other cost functions without difficulty. Hence, applying 
appropriate optimization techniques on alternative cost 
functions leads to NCL algorithms of different forms 
accordingly. 


28.3.3 Loosely Coupled Models 


In a loosely coupled model, component networks are 
trained independently or there is no direct interaction 
among them during learning. There are a variety of 
MNNs that can be classified as loosely coupled mod- 


els. Below we review several typical loosely coupled 
MNNs. 


Neural Network Ensemble 
A neural network ensemble here refers to a committee
machine in which a number of NNs are trained independently
and their outputs are combined to reach
a consensus as the final solution. The development of
NN ensembles is explicitly motivated by the diversity- 
promotion principle [28.67, 68]. 

Intuitively, errors made by component NNs can be
corrected by taking their diversity or mutual complementarity
into account. For example, suppose three NNs, $NN_i$ (i =
1, 2, 3), trained on the same data set have different yet
imperfect performance on test data. Given three test
points, $x_t$ (t = 1, 2, 3), $NN_1$ yields the correct output for
$x_2$ and $x_3$ but not for $x_1$, $NN_2$ yields the correct output
for $x_1$ and $x_3$ but not for $x_2$, and $NN_3$ yields the
correct output for $x_1$ and $x_2$ but not for $x_3$.
In such circumstances, an error made by one NN
can be corrected by the other two NNs with a majority vote
so that the ensemble can outperform any single component
NN. Formally, there are various theoretical justifications
[28.60, 61] for NN ensembles. For example, it has
been proven for regression that the NN ensemble per- 
formance is never inferior to the average performance 
of all component NNs [28.60]. Furthermore, a theo- 
retical bias-variance analysis [28.61] suggests that the 
promotion of diversity can improve the performance of 
NN ensembles provided that there is an adequate trade- 
off between bias and variance. In general, there are 
two non-trivial issues in constructing NN ensembles; 
i.e., creating diverse component NNs and ensembling 
strategies. 
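
The three-network example above can be checked with a few lines of Python. The class labels and the wrong answer "z" are made-up placeholders; only the error pattern (each network wrong on a different test point) follows the text.

```python
from collections import Counter

truth = ["a", "b", "c"]              # correct outputs for x_1, x_2, x_3
predictions = [
    ["z", "b", "c"],                 # NN_1 is wrong on x_1 only
    ["a", "z", "c"],                 # NN_2 is wrong on x_2 only
    ["a", "b", "z"],                 # NN_3 is wrong on x_3 only
]

def majority_vote(votes):
    """Return the label that most component networks agree on."""
    return Counter(votes).most_common(1)[0][0]

def accuracy(pred):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

# zip(*predictions) regroups the outputs per test point.
ensemble = [majority_vote(votes) for votes in zip(*predictions)]
component_accuracies = [accuracy(p) for p in predictions]
ensemble_accuracy = accuracy(ensemble)
```

Each component network is correct on two of the three points, whereas the vote is correct on all three, because the errors never coincide on the same test point.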

Depending on the nature of a given problem [28.5, 
62], there are several methodologies for creating diverse 
component NNs. First of all, an NN learning process
itself can be exploited. For instance, learning in a mono- 
lithic NN often needs to solve a complex non-convex 
optimization problem [28.9]. Hence, a local-search- 
based learning algorithm, e.g., BP [28.11], may end up 
with various solutions corresponding to local minima 
due to random initialization. In addition, model selec- 
tion is required to find out an appropriate NN structure 
for a given problem. Such properties can be exploited 
to create component networks in a homogeneous NN 
ensemble [28.67]. Next, NNs of different types trained 
on the same data may also yield different performance 
and hence are candidates in a heterogeneous NN en- 
semble [28.5]. Finally, exploration/exploitation of input 
space and different representations is an alternative 
methodology for creating different component NNs. Instead
of training an NN on the whole input space, NNs can be
trained on different input subspaces obtained by a partitioning
method, e.g., random partitioning [28.69],
which results in a subspace NN ensemble. Whenever
raw data can be characterized by different representations,
NNs trained on different feature sets
constitute a multi-view NN ensemble [28.70].

Ensembling strategies are required for different
tasks. For regression, some optimal fusion rules have
been developed for NN ensembles, e.g., [28.68], which
are supported by theoretical justification, e.g., [28.60,
61]. For classification, ensembling strategies are more
complicated but have been well studied in a wider context,
known as the combination of multiple classifiers. As
shown in Fig. 28.8, ensembling strategies are generally
divided into two categories: learnable and non-learnable.
Learnable strategies use a parametric model
to learn an optimal fusion rule, while non-learnable
strategies fulfil the combination by directly using the
statistics of all component network outputs along with
simple measures. As depicted in Fig. 28.8, there are
six main non-learnable fusion rules: sum, product, min,
max, median, and majority vote; details of such non-learnable
rules can be found in [28.71]. Below, we focus
on the main learnable ensembling strategies in terms of
classification.
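
For illustration, the six non-learnable rules can be sketched in Python as below. It is assumed here that every component NN outputs one score per class (for instance, an estimated class probability); the function name `fuse` and the toy scores are hypothetical, and [28.71] should be consulted for the precise definitions.

```python
import numpy as np

def fuse(scores, rule):
    """Return the winning class index under a non-learnable fusion rule.

    scores[n, l] is the score that component NN n assigns to class l.
    """
    if rule == "majority":
        votes = scores.argmax(axis=1)                  # each NN votes for its top class
        return int(np.bincount(votes, minlength=scores.shape[1]).argmax())
    reducers = {"sum": np.sum, "product": np.prod, "min": np.min,
                "max": np.max, "median": np.median}
    combined = reducers[rule](scores, axis=0)          # one fused score per class
    return int(combined.argmax())

# Three component NNs, two classes (made-up scores).
scores = np.array([[0.6, 0.4],
                   [0.7, 0.3],
                   [0.1, 0.9]])
rules = ["sum", "product", "min", "max", "median", "majority"]
decisions = {rule: fuse(scores, rule) for rule in rules}
```

On these scores the sum, product, min and max rules pick the second class because of the one very confident network, while the median and majority vote follow the two moderately confident networks and pick the first class. The rules can therefore disagree, which is one reason the choice of fusion rule matters.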

In general, learnable ensembling strategies are
viewed as an application of the stacked generalization
principle [28.72]. In light of stacked generalization,
all component NNs serve as level 0 generalizers, and
a learnable ensembling strategy carried out by a combination
mechanism acts as a level 1 generalizer
working on the output space of all component NNs to
improve the overall generalization. In this sense, such
a combination mechanism is trained on a validation
data set that is different from the training data set used
in component NN learning. As shown in Fig. 28.8,
combination mechanisms have been developed from
two different perspectives, i.e., input-dependent and
input-independent.

An input-dependent mechanism combines component
NNs based on test data; i.e., given two inputs,
$x_1$ and $x_2$, there is the property $c(x_1|\Theta) \neq c(x_2|\Theta)$
if $x_1 \neq x_2$, where $c(x|\Theta) = (c_n(x|\Theta))_{n=1}^{N}$ is an input-dependent
mechanism used to combine an ensemble of
N component NNs and $\Theta$ collectively denotes all learnable
parameters in this parametric model. As a result,
output of an NN ensemble with the input-dependent ensembling
strategy is of the following form


$$o(x) = \Omega\big( o_1(x), \cdots, o_N(x) \,|\, c(x|\Theta) \big) ,$$


where $o_n(x)$ is output of the n-th component NN for
n = 1, ..., N and $\Omega$ indicates a method on how to apply
$c(x|\Theta)$ to component NNs. For example, $\Omega$ may be
a linear combination scheme such that


$$o(x) = \sum_{n=1}^{N} c_n(x|\Theta) \, o_n(x) . \qquad (28.14)$$


As listed in Fig. 28.8, soft-competition and associative 
switch are two commonly used input-dependent com- 
bination mechanisms. The soft-competition mechanism 
can be regarded as a special case of the MoE described 
earlier when all expert networks were trained on a data 
set independently in advance. In this case, the gating 
network plays the role of the combination mechanism 


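Fig. 28.8 A taxonomy of ensembling strategies. Learnable strategies are input-dependent (soft competition, associative switch) or input-independent (Bayesian fusion, evidence reasoning, linear combination); non-learnable strategies are sum, product, min, max, median, and majority vote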
by deciding the importance of component NNs via soft-competition.
Although various learning models may be
used as such a gating network, an RBF-like (radial basis
function) parametric model [28.73] trained with the
EM algorithm has been widely used for this purpose.
Unlike a soft-competition mechanism that produces the
continuous-value weight vector $c(x|\Theta)$ used in (28.14),
the associative switch [28.74] adopts a winner-take-all
strategy, i.e., $\sum_{n=1}^{N} c_n(x|\Theta) = 1$ and $c_n(x|\Theta) \in \{0, 1\}$.
Thus, an associative switch yields a specific code for
a given input so that the output of the best-performing
component NN can be selected as the final output of
the NN ensemble according to (28.14). The associative
switch learning is a multi-class classification problem,
and an MLP is often used to carry it out [28.74].
Although an input-dependent ensembling strategy is ap- 
plicable to most NN ensembles, it is difficult to apply it 
to multi-view NN ensembles, since different represen- 
tations need to be considered simultaneously in training 
a combination mechanism. Fortunately, such issues 
have been explored in a wider context on how to use dif- 
ferent representations simultaneously for ML [28.70, 
75-79] so that both soft-competition and associative 
switch mechanisms can be extended to multi-view NN 
ensembles. 
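
The difference between the two input-dependent mechanisms can be made concrete with a short Python sketch. It assumes that some trained model has already produced one gating score per component NN for the current input; the softmax used to turn scores into soft weights is an illustrative stand-in rather than the RBF-like model of [28.73], and all names and numbers are hypothetical.

```python
import numpy as np

def soft_competition(scores):
    """Continuous weights: non-negative and summing to 1 (softmax of the scores)."""
    e = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return e / e.sum()

def associative_switch(scores):
    """Winner-take-all code: exactly one weight is 1, all others are 0."""
    code = np.zeros_like(scores)
    code[scores.argmax()] = 1.0
    return code

def combine(weights, outputs):
    """Linear combination of component outputs, as in (28.14)."""
    return weights @ outputs

outputs = np.array([[0.9, 0.1],         # output of component NN 1 for input x
                    [0.2, 0.8],         # output of component NN 2
                    [0.6, 0.4]])        # output of component NN 3
scores = np.array([2.0, 0.0, 1.0])      # gating scores for the same input x

soft_output = combine(soft_competition(scores), outputs)
hard_output = combine(associative_switch(scores), outputs)
```

With the winner-take-all code the ensemble output is exactly the output of the highest-scoring component, whereas soft competition blends all three outputs in proportion to their weights.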

In contrast, an input-independent mechanism combines
component NNs based on the dependence of their
outputs without considering input data directly. Given
two inputs $x_1$ and $x_2$ with $x_1 \neq x_2$, the same $c(\Theta)$
may be applied to outputs of component NNs, where
$c(\Theta) = (c_n(\Theta))_{n=1}^{N}$ is an input-independent combination
mechanism used to combine an ensemble of N
component NNs. Several input-independent mechanisms
have been developed [28.62], which often fall
into one of three categories, i.e., Bayesian fusion, evidence
reasoning, and a linear combination scheme, as
shown in Fig. 28.8. Bayesian fusion [28.80] refers to
a class of combination schemes that use the information
collected from errors made by component NNs
on a validation set in order to find out the optimal
output of the maximum a posteriori probability,
$C^* = \arg\max_{1 \le l \le L} P(C_l \,|\, o_1(x), \cdots, o_N(x), \Theta)$, via Bayesian
reasoning, where $C_l$ is the label for the l-th class
in a classification task of L classes, and $\Theta$ here encodes
the information gathered, e.g., a confusion matrix
achieved during learning [28.80]. Similarly, evidence
reasoning mechanisms make use of alternative reasoning
theories [28.80], e.g., the Dempster-Shafer theory,
to yield the best output for NN ensembles via an evidence
reasoning process that works on all outputs of
component NNs in an ensemble. Finally, linear combination
schemes of different forms are also popular
as input-independent combination mechanisms [28.62].
For instance, the work presented in [28.68] exemplifies
how to achieve optimal linear combination weights in
a linear combination scheme.
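
One simple way to realize Bayesian fusion with confusion matrices is sketched below in Python. It rests on a strong assumption that is not stated in the text, namely that the decisions of the component NNs are conditionally independent, so that the per-network posteriors can be multiplied; the scheme in [28.80] may differ in detail. The function name and the confusion matrices are made up for illustration.

```python
import numpy as np

def bayesian_fusion(confusions, decisions):
    """Fuse hard decisions of N component NNs using validation confusion matrices.

    confusions[n][i, j] : number of validation examples of true class i that
                          component NN n labeled as class j
    decisions[n]        : class index output by component NN n for the test input
    """
    n_classes = confusions[0].shape[0]
    belief = np.ones(n_classes)
    for cm, j in zip(confusions, decisions):
        column = cm[:, j]
        belief *= column / column.sum()   # estimate of P(true class = i | NN n says j)
    belief /= belief.sum()                # normalize to a posterior over classes
    return int(belief.argmax()), belief

confusions = [
    np.array([[19, 1], [1, 19]]),         # NN 1: very reliable on the validation set
    np.array([[6, 4], [3, 7]]),           # NN 2: weak
    np.array([[9, 1], [5, 5]]),           # NN 3: often confuses class 1 with class 0
]
decisions = [0, 1, 1]                     # NN 1 says class 0, NNs 2 and 3 say class 1
label, belief = bayesian_fusion(confusions, decisions)
```

In this toy case a plain majority vote would choose class 1, but the fused posterior favors class 0 because the single dissenting network has been far more reliable on the validation set. This is the kind of error information that the combination parameters are meant to encode.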


Constructive Modularization Learning 
Efforts have also been made towards constructive mod- 
ularization learning for a given supervised learning 
task. In such work, the divide-and-conquer principle 
is explicitly applied in order to develop a constructive 
learning strategy for modularization. The basic idea be- 
hind such methods is to divide a difficult and complex 
problem into a number of subproblems that are eas- 
ily solvable by NNs of proper capacities, matching the 
requirements of the subproblems, and then the solu- 
tions to subproblems are combined seamlessly to form 
a solution to the original problem. On the other hand, 
constructive modularization learning may alleviate the 
model selection problem encountered by a monolithic 
NN. As NNs of simple and even different architectures 
may be used to solve subproblems, empirical stud- 
ies suggest that an MNN generated via constructive 
modularization learning is often insensitive to compo- 
nent NN architectures and hence is less likely to suffer 
from overall overfitting or underfitting [28.81]. Below 
we describe two constructive modularization learning
strategies [28.81-83] for exemplification.

The partitioning-based strategy [28.81, 82] performs
the constructive modularization learning by applying
the divide-and-conquer principle explicitly. For
a given supervised learning task, the strategy consists
of two learning stages: dividing and conquering. In
the dividing stage, it first recursively partitions the
input space into overlapping subspaces, which facilitates
dealing with various uncertainties, by taking
supervision information into account until the nature
of each subproblem defined in generated subspaces


Fig. 28.9 A self-generated tree-structured MNN 
matches the capacity of one pre-selected NN. In the
conquering stage, an NN works on a given input subspace
to complete the corresponding learning subtask.
As a result, a tree-structured MNN is self-generated,
where a learnable partitioning mechanism $P_i$ is situated
at intermediate levels and NNs work at the leaves
of the tree, as illustrated in Fig. 28.9. To enable the
partitioning-based constructive modularization learning,
two generic algorithms have been proposed, i.e., growing
and credit-assignment algorithms [28.81, 82], as
summarized below.


Algorithm 28.7 Growing algorithm 

Given a training set D, set X ← D. Randomly initialize
parameters in all component NNs in a given repository
and pre-set hyper-parameters in a learnable partitioning
mechanism and compatibility criteria, respectively:


• Compatibility test
For a training (sub)set X, apply the compatibility
criteria to X to examine whether the learning task
defined on X matches the capacity of a component
NN in the repository.

• Partitioning space
If no NN in the repository can solve the problem defined
on X, then train the partitioning mechanism
on the current X to partition it into two overlapping
subsets, $X_l$ and $X_r$. Set X ← $X_l$, then go to the compatibility
test step. Set X ← $X_r$, then go to the compatibility
test step.
Otherwise, go to the subproblem solving step.

• Subproblem solving
Train the matching NN on X with an appropriate learning
algorithm. The trained NN resides at the current leaf
node.


The growing algorithm expands a tree-structured
MNN until learning problems defined on all partitioned
subspaces are solvable with NNs in the repository.
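
The recursive structure of the growing algorithm can be sketched in Python as follows. The compatibility criterion, the partitioning mechanism and the learner are passed in as functions because the text leaves them open; the toy versions used here (a size threshold, a midpoint split with a fixed overlap margin, and a mean predictor) are hypothetical stand-ins and not the criteria of [28.81, 82].

```python
def grow(X, is_compatible, partition, train):
    """Recursively build a tree: partitioners at internal nodes, trained models at leaves."""
    if is_compatible(X):                          # compatibility test
        return {"leaf": train(X)}                 # subproblem solving
    splitter, X_left, X_right = partition(X)      # partitioning space (subsets may overlap)
    return {"split": splitter,
            "left": grow(X_left, is_compatible, partition, train),
            "right": grow(X_right, is_compatible, partition, train)}

def count_leaves(node):
    if "leaf" in node:
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])

# Toy stand-ins on one-dimensional data (all assumptions for illustration).
def is_compatible(X):
    return len(X) <= 4                            # "simple enough" if at most 4 points

def partition(X, margin=0.5):
    threshold = (min(X) + max(X)) / 2
    left = [x for x in X if x <= threshold + margin]
    right = [x for x in X if x >= threshold - margin]
    return threshold, left, right                 # points near the threshold go to both sides

def train(X):
    return sum(X) / len(X)                        # "model" = mean of the subset

tree = grow(list(range(10)), is_compatible, partition, train)
n_leaves = count_leaves(tree)
```

On the ten points 0 to 9 the recursion stops after two levels with four leaves, each holding four points; neighbouring leaves share two points because of the overlap margin.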


For a given test data point, output of such a tree- 
structured MNN may depend on several component 
NNs at the leaves of the tree since the input space 
is partitioned into overlapping subspaces. A credit- 
assignment algorithm [28.81, 82] has been developed to 
weight the importance of component NNs contributing
to the overall output, which is summarized as follows:


Algorithm 28.8 Credit-assignment algorithm 
P(x) is a trained partitioning mechanism that resides
at a nonterminal node and partitions the current input
(sub)space into two subspaces with an overlap defined
by $-\tau \le P(x) \le \tau$ ($\tau > 0$). $C_L(\cdot)$ and $C_R(\cdot)$ are
two credit assignment functions for the two subspaces, respectively.
For a test data point x:


• Initialization
Set $\alpha(x) \leftarrow 1$ and Pointer ← Root.
• Credit assignment
As a recursive credit propagation process to assign
credits to all the component NNs at leaf nodes that
x can reach, CR[$\alpha(x)$, Pointer] consists of three
steps:
- If Pointer points to a leaf node, then output
$\alpha(x)$ and stop.
- If $P(x) \le \tau$, $\alpha(x) \leftarrow \alpha(x) \times C_L(P(x))$ and invoke
CR[$\alpha(x)$, Pointer.Leftchild].
- If $P(x) \ge -\tau$, $\alpha(x) \leftarrow \alpha(x) \times C_R(P(x))$ and invoke
CR[$\alpha(x)$, Pointer.Rightchild].


Thus, the output of a self-generated MNN is


$$o(x) = \sum_{n \in \mathcal{N}} \alpha_n(x) \times o_n(x) ,$$


where $\mathcal{N}$ denotes all the component NNs that x can
reach, and $\alpha_n(x)$ and $o_n(x)$ are the credit assigned and
the output of the n-th component NN in $\mathcal{N}$ for x, respectively.
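
A compact Python sketch of the credit propagation is given below. The tree is represented with plain dictionaries, the credit is passed down by value, and the left credit function is a hypothetical piece-wise linear ramp over the overlap region, with the right credit as its complement; the exact functions used in [28.81, 82] are not reproduced here.

```python
def credit_assignment(node, x, tau, credit=1.0):
    """Return a list of (credit, output) pairs, one per leaf that x can reach."""
    if "model" in node:                                  # leaf: emit the accumulated credit
        return [(credit, node["model"](x))]
    p = node["P"](x)                                     # partitioning mechanism at this node
    c_left = min(1.0, max(0.0, (tau - p) / (2 * tau)))   # assumed piece-wise linear credit
    c_right = 1.0 - c_left                               # credits of the two sides sum to 1
    reached = []
    if p <= tau:                                         # x lies in the left subspace
        reached += credit_assignment(node["left"], x, tau, credit * c_left)
    if p >= -tau:                                        # x lies in the right subspace
        reached += credit_assignment(node["right"], x, tau, credit * c_right)
    return reached

def mnn_output(root, x, tau):
    """Credit-weighted sum of the outputs of all reachable component NNs."""
    return sum(credit * output for credit, output in credit_assignment(root, x, tau))

# Toy tree: one partitioner P(x) = x and two constant "component NNs".
root = {"P": lambda x: x,
        "left": {"model": lambda x: -1.0},
        "right": {"model": lambda x: 1.0}}

inside_overlap = mnn_output(root, 0.5, tau=1.0)
far_left = mnn_output(root, -3.0, tau=1.0)
far_right = mnn_output(root, 3.0, tau=1.0)
```

A point inside the overlap region reaches both leaves and receives a blend of their outputs, whereas points outside the overlap reach a single leaf with full credit.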


To implement such a strategy, hyper-planes placed
with heuristics [28.81] or linear classifiers trained with
Fisher discriminant analysis [28.82] were first
used as the partitioning mechanism, and NNs such as
MLP or RBF networks can be employed to solve subproblems.
Accordingly, two piece-wise linear credit assignment
functions [28.81, 82] were designed for the hyper-plane
partitioning mechanism, so that $C_L(x) + C_R(x) = 1$.
Heuristic compatibility criteria were developed by
considering learning errors and efficiency [28.81, 82].
By using the same constructive learning algorithms
described above, an alternative implementation was
also proposed in which the self-organizing map serves as
a partitioning mechanism and SVMs are used for subproblem
solving [28.84]. Empirical studies suggest that
the partitioning-based strategy leads to favorable results
in various supervised learning tasks despite different
implementations [28.81, 82, 84].

By applying the divide-and-conquer principle, task 
decomposition [28.83] is yet another constructive mod- 
ularization learning strategy for classification. Unlike 
the partitioning-based strategy, the task decomposition 
strategy converts a multi-class classification task into 
a number of binary classification subtasks in a brute-force
way and each binary classification subtask is
expected to be fulfilled by a simple NN. If a subtask
is too difficult for a given NN to carry out, the subtask is
allowed to be further decomposed into simpler binary
classification subtasks. For a multi-class classification
task of M categories, the task decomposition strategy
first exhaustively decomposes it into $\frac{1}{2}M(M-1)$ different
primary binary subtasks where each subtask merely
concerns classification between two different classes
without taking the remaining $M-2$ classes into account,
which differs from the commonly used one-against-rest
decomposition method. In general, the original multi-class
classification task may be decomposed into more
binary subtasks if some primary subtasks are too difficult.
Once the decomposition is completed, all the
subtasks are undertaken by pre-selected simple NNs,
e.g., MLP of one hidden layer, in parallel. For a final
solution to the original problem, three non-learnable operations,
min, max, and inv, were proposed to combine
individual binary classification results achieved by all
the component NNs. By applying these three operations properly,
all the component NNs are integrated together to
form a min-max MNN [28.83].
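
The primary decomposition step can be illustrated with a short Python sketch that enumerates all class pairs and collects, for each pair, only the training examples of those two classes. The function name and the toy samples are hypothetical, and the further decomposition of difficult subtasks and the min, max, and inv combination of [28.83] are not shown.

```python
from itertools import combinations

def primary_subtasks(samples):
    """Map every unordered pair of classes to the examples belonging to that pair."""
    classes = sorted({label for _, label in samples})
    return {pair: [(x, y) for x, y in samples if y in pair]
            for pair in combinations(classes, 2)}

# Five labeled examples from M = 4 classes (made-up values).
samples = [(0.1, "a"), (0.2, "b"), (0.3, "c"), (0.4, "d"), (0.5, "a")]
subtasks = primary_subtasks(samples)
n_subtasks = len(subtasks)
```

For M = 4 classes this gives M(M - 1)/2 = 6 binary subtasks, and each subtask sees only the examples of its own two classes, in contrast to a one-against-rest decomposition where every subtask would see all examples.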


28.3.4 Relevant Issues 


In general, studies of MNNs closely relate to several 
areas in different disciplines, e.g., ML and statistics. We 
here examine several important issues related to MNNs 
in a wider context. 

As described above, a tightly coupled MNN leads 
to an optimal solution to a given supervised learn- 
ing problem. The MoE is rooted in the finite mixture 
model (FMM) studied in probability and statistics and 
becomes a non-trivial extension to conditional models 
where each expert is a parametric conditional prob- 
abilistic model and the mixture coefficients also de- 
pend on input [28.64]. While the MoE has been well 
studied for 20 years [28.59] in different disciplines, 
there still exist some open problems in general, e.g., 
model selection, global optimal solution, and conver- 
gence of its learning algorithms for arbitrary component 
models. Different from the FMM, the product of experts
(PoE) [28.42] was also proposed to combine
a number of experts (parametric probabilistic models)
by taking their product and normalizing the result.
The PoE has been argued to have
some advantages over the MoE [28.42] but has so
far merely been developed in the context of unsupervised
learning. As a result, extending the PoE to


conditional models for supervised learning would be 
a non-trivial topic in tightly coupled MNN studies. 
On the other hand, the NCL [28.65] directly applies 
the bias-variance analysis [28.60,61] to construction 
of an MNN. This implies that MNNs could also be
built up via alternative loss functions that properly
promote diversity among component networks during
learning.

Almost all existing NN ensemble methods are now 
included in ensemble learning [28.85], which is an 
important area in ML, or the multiple classifier sys- 
tem [28.62] in the context of pattern recognition. In 
statistical ensemble learning, generic frameworks, e.g., 
boosting [28.86] and bootstrapping [28.87], were devel- 
oped to construct ensemble learners where any learning 
models including NNs may be used as component 
learners. Hence, most of the common issues raised for
ensemble learning are applicable to NN ensembles.
Nevertheless, ensemble learning research suggests that
the behavior of component learners may considerably affect
the stability and overall performance of ensemble
ferent NN ensembles are worth investigating from both 
theoretical and application perspectives. 

While constructive modularization learning provides
an alternative way of model selection, it is generally
a less developed area in MNNs, and existing methods
are subject to limitations due to a lack of theoretical
justification and underpinning techniques. For example,
a critical issue in the partitioning-based strategy [28.81,
82] is how to measure the nature of a subproblem to
decide if any further partitioning is required and the 
appropriateness of a pre-selected NN to a subproblem 
in terms of its capacity. In previous studies [28.81, 82], 
a number of heuristic and simple criteria were proposed 
based on learning errors and efficiency. Although such 
heuristic criteria work practically, there is no theoretical 
justification. As a result, more sophisticated compatibil- 
ity criteria need to be developed for such a constructive 
learning strategy based on the latest ML development, 
e.g., manifold and adaptive kernel learning. Fortunately, 
the partitioning-based strategy has inspired the latest 
developments in ML [28.89]. In general, constructive 
modularization learning is still a non-trivial topic in 
MNN research. 

Finally, it is worth stating that our MNN review here 
only focuses on supervised learning due to the limited 
space. Most MNNs described above may be extended 
to other learning paradigms, e.g., semi-supervised and 
unsupervised learning. More details on such topics are 
available in the literature, e.g., [28.90, 91]. 
28.4 Concluding Remarks


In this chapter, we have reviewed two important ar- 
eas, DNNs and MNNs, in NC. While we have pre- 
sented several sophisticated techniques that are ready 
for applications, we have discussed several challeng- 
ing problems in both deep and modular neural net- 
work research as well. Apart from other non-trivial 
issues discussed in the chapter, it is worth empha- 
sizing that it is still an open problem to develop 
large-scale DNNs and MNNs and integrate them for 
James T. Kwok, Zhi-Hua Zhou, Lei Xu


This tutorial provides a brief overview of a num- 
ber of important tools that form the crux of the 
modern machine learning toolbox. These tools 
can be used for supervised learning, unsupervised 
learning, reinforcement learning and their numer- 
ous variants developed over the years. Because 
of the lack of space, this survey is not intended 
to be comprehensive. Interested readers are re- 
ferred to conference proceedings such as Neural 
Information Processing Systems (NIPS) and the In- 
ternational Conference on Machine Learning (ICML) 
for the most recent advances. 
29.1 Overview


Machine learning represents one of the most prolific 
developments in modern artificial intelligence. It pro- 
vides a new generation of computational techniques and 
tools that support understanding and extraction of use- 
ful knowledge from complicated data sets. So what is 
machine learning? Simon [29.1] defined machine learn- 
ing as: 


changes in the system that are adaptive in the sense 
that they enable the system to do the same task or 
tasks drawn from the same population more effec- 
tively the next time. 
Hence, fundamentally, the emphasis of machine learning
is on the system’s ability to adapt or change.
Typically, this is in response to some form of experience 
provided to the system. After learning or adaptation, the 
system is expected to have better future performance on 
the same or a related task. 

Over the past decades, machine learning has grown
from a few toy applications to being almost everywhere.
It is now being applied to numerous real-world
applications, for example, the control of autonomous
robots that can navigate on their own, the filtering of
spam from mailboxes, the recognition of characters
from handwriting, the recognition of speech on mobile
devices, the detection of faces in digital cameras,
and so on. Indeed, one can find applications of machine
learning from everyday consumer products to
advanced information systems in corporations. Studies
of machine learning may be overviewed from either
the perspective of learning intelligent systems or that
of a machine learning toolbox. For the former, learning
is considered as a process by which an intelligent system
coordinately solves three levels of inverse problems,
namely problem solving for pattern recognition
and various other tasks, parameter learning for estimating
unknown parameters in the system, and model
selection for shaping the system configuration with an appropriate
scale or complexity to describe regularities
underlying a finite number of samples. Different learning
approaches are characterized by differences in one or more
of three ingredients, namely a learner that has an
appropriate system configuration, a theory that guides
learning, and an algorithm or dynamic procedure that
implements learning. Examples of studies from this
perspective are overviewed in [29.2, 3], and will
not be further addressed in this chapter. Instead, this
chapter aims at a tutorial on studies of machine learning
from the second perspective, that is, on those important
collections pooled in the machine learning toolbox for
decades. In fact, the current prosperity of machine
learning comes not only from further developments of
classical statistical modeling and neural network
learning, but also from emerging achievements of machine
learning and data mining in recent decades. Due to limited
space, the focus of this tutorial will be particularly
placed on those advancements made in the last two
decades or so.

Classically, there are three basic learning 
paradigms, namely, supervised learning (Sect. 29.2), 
unsupervised learning (Sect. 29.3), and reinforcement 
learning (Sect. 29.4). In supervised learning, the 
learner is provided with a set of inputs together with 
the corresponding desired outputs. This is similar 
to the familiar human learning process for pattern 
recognition, in which a teacher provides examples to 
teach children to recognize different objects (say, for 
example, animals). Such a pattern recognition task is
characterized by data in which each input sample is associated
with a label, namely labeled data. In the current literature
on machine learning, the term labeled data is used more
generally to refer to data in which each input is associated
with an output beyond simply a label, a convention that is also
adopted in this chapter. Section 29.2 provides not only
a tutorial on basic issues of supervised learning but 


also an overview on a number of interesting topics 
developed in recent years. The coverage of this section 
is not complete, e.g., it does not cover the supervised 
learning studies in the literature on neural networks. 
Interested readers are referred to a number of survey 
papers, e.g., especially those on multi-layer perceptron 
and radial basis functions [29.4, 5]. 

Unlike supervised learning, the tasks of unsupervised
learning are characterized by data that consist of only
inputs, namely, the data are unlabeled and there is no
longer a teacher. Unsupervised learning
aims at finding a certain dependence structure underlying
data via optimizing a learning principle. Considering
different types of structures, studies include not only
classic topics of data clustering, subspace, and topological
maps, but also emerging topics of learning latent
factor models, hidden state-space models, and hierarchical
structures. Section 29.3 also consists of two
parts. The first part provides a tutorial on three classic
topics, while the second part gives an overview
of emerging topics. Extensive studies have been made
on unsupervised learning for many decades. Instead of
seeking a complete coverage, this section focuses on
a tutorial on fundamentals and an overview of interesting
developments of recent years, mainly based on
a more systematic overview [29.6]. Further, readers are
referred to several recent survey papers, e.g., [29.7] for 
an overview on 50 years of studies beyond k-means 
for data clustering, [29.8, 9] for subspace and manifold 
learning, and [29.10] for topological maps. 

The third paradigm is reinforcement learning. Upon
observing the current environment and obtaining some
input (if any), the learner takes an action and moves
to a new environment, receiving an evaluation (reward
or punishment) value about the action. A learning process
makes a series of actions so that the total received
reward is maximized. Unlike unsupervised learning,
the learner gets guidance from an external evaluation.
Also unlike supervised learning in which the teacher
clearly specifies the output that corresponds to an input,
in reinforcement learning the learner is only provided
with an evaluative value about the action made. Section 29.4
starts by giving a tutorial on basic issues of
reinforcement learning, especially temporal difference
(TD) learning and Q-learning, plus improvements on
Q-learning with the help of some unsupervised learning
methods.

Besides these three basic learning paradigms, many 
more variants have been developed in recent years be- 
cause of the advances in machine learning. Some of 
these will also be described in this tutorial. They are often
a hybrid of the previous learning paradigms. A very
popular variant is semi-supervised learning (Sect. 29.5), 
which uses both labeled data (as in supervised learning) 
and unlabeled data (as in unsupervised learning) for 
training. This is advantageous as labeled data typically 
are expensive and involve tedious human effort, while 
a large amount of unlabeled data can often be obtained 
in an inexpensive manner (e.g., simply downloadable 
from the web). Another hybrid of supervised learning 
and unsupervised learning is discriminative clustering. 
Here, one adopts a cost function originally used for 
supervised learning as a clustering criterion. A
well-known example in this category is maximum
margin clustering [29.11-13], which tries to maximize
the margin (used as a criterion in constructing the highly
successful supervised learning model: support vector
machine) between clusters.

Moreover, instead of just constructing one learner
from training data, one can construct a set of learners
and combine them to solve the problem. This approach,
known as ensemble learning [29.14], has become very
popular in recent years and will be discussed in more
detail in Sect. 29.6. Finally, before learning can
proceed, the data need to be appropriately represented by
a set of features. In many real-world data sets, there
are often a large number of features, many of which
are redundant or irrelevant. Feature selection and
extraction aim at automatically extracting the good features
and removing the bad ones, and this will be covered in
Sect. 29.7.


29.2 Supervised Learning


A supervised learner is provided with some labeled data
(often called training data). This consists of a set of
training samples, each of which is an input together
with the corresponding desired output. Hence, the first
step in machine learning is to collect these training
samples. Moreover, as each training sample needs to
be represented in a form amenable to the computer
algorithm, one has to define a set of features. As an
example, consider the task of recognizing handwritten
characters on an envelope. To construct the training
samples, one first has to collect a number
of envelopes with handwritten characters on them. Then,
the characters on each envelope have to be separated from
each other. This can be performed either manually or
automatically by some image segmentation algorithm.
Afterwards, each character is a block of pixels
(typically rectangular). A simple feature representation is
to use the intensities of these raw pixels. Each
input is represented as a vector of feature values, and this
vector is called the feature vector. Obviously, it is
important to have a good set of features to work with.
The presence of bad features may confuse the learning
algorithm and make learning more difficult. For
example, in the context of character recognition, the
color of the ink is not relevant to the identity of the
character and so can be considered a bad feature.
Depending on the domain knowledge, more sophisticated
features can be manually defined. It is desirable that
good features can be automatically extracted and bad
features automatically removed. More details on these
feature selection/extraction algorithms will be covered
in Sect. 29.7. Finally, each character on the envelope
has to be manually labeled.

In practice, as the real-world data are often dirty, 
a significant amount of time may have to be spent on 
data pre-processing in order to create the training data. 
There are many forms of dirty data. For example, the data
can be incomplete in that certain attribute values (e.g.,
occupation) may be missing; they can contain outliers or
errors (e.g., the salary is negative); parts of them may be
inconsistent (e.g., the customer's age is 42 but his/her
birthday is 03/07/2012); they may also be redundant in
that there are duplicate records or unnecessary attributes.
All these problems may be due to faulty or careless data
collection, human/hardware/software problems, or errors
in data transmission, or may arise because the data come
from a number of different data sources. In all cases, data
pre-processing can have a significant impact on the
resultant machine learning system, as no quality data
implies no quality learning results!
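
The kinds of dirty data listed above can be illustrated with a small sketch. The record layout (`occupation`, `salary`, `age`, `birthday`), the function name `find_issues`, and the reference date are hypothetical choices made only for this illustration; real pre-processing pipelines are usually far more elaborate.

```python
from datetime import date

def find_issues(records, today):
    """Return (record index, issue) pairs for a few simple data-quality checks."""
    issues = []
    seen = set()
    for i, r in enumerate(records):
        # Incomplete: a required attribute value is missing.
        if r.get("occupation") is None:
            issues.append((i, "missing occupation"))
        # Error/outlier: a salary cannot be negative.
        if r.get("salary") is not None and r["salary"] < 0:
            issues.append((i, "negative salary"))
        # Inconsistent: the stated age disagrees with the birthday.
        b = r.get("birthday")
        if b is not None and r.get("age") is not None:
            implied = today.year - b.year - ((today.month, today.day) < (b.month, b.day))
            if implied != r["age"]:
                issues.append((i, "age/birthday mismatch"))
        # Redundant: exact duplicate of an earlier record.
        key = tuple(sorted(r.items()))
        if key in seen:
            issues.append((i, "duplicate record"))
        seen.add(key)
    return issues

# Hypothetical records; the first mirrors the age-42 / 2012-birthday example.
records = [
    {"name": "A", "occupation": "clerk", "salary": 3000, "age": 42, "birthday": date(2012, 3, 7)},
    {"name": "B", "occupation": None, "salary": -50, "age": 30, "birthday": date(1984, 1, 1)},
    {"name": "B", "occupation": None, "salary": -50, "age": 30, "birthday": date(1984, 1, 1)},
]
issues = find_issues(records, today=date(2014, 6, 1))
for index, issue in issues:
    print(index, issue)
```

Whether a flagged record should be repaired, imputed, or dropped depends on the application and is not decided by such checks.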


29.2.1 Classification and Regression 


The two main goals in supervised learning are (i) clas- 
sification, which aims at assigning the input pattern 
to different categories (also called classes or labels); 
and (ii) regression, which aims at predicting a real 
value or vector associated with the input. The ba- 
sic idea and the training/testing procedures in regres- 
sion are similar to those in classification. Hence, we 
will mainly focus on the classification problem in the 
sequel.

The simplest case for classification is binary
classification, in which there are only two classes. Examples
include classifying an email as spam or non-spam; and 
classifying an image as face or non-face. For each sam- 
ple, the supervised learner examines the feature values 
in that sample and predicts the class that the sample be- 
longs to. Essentially, the supervised learner partitions 
the whole feature space (the space of all possible fea- 
ture value combinations) into two regions, one for each 
class. The boundary is called the decision boundary. 
A wide variety of models can be used to construct 
this decision boundary. A simple example is the linear 
classifier, which creates a linear boundary. Depending 
on the task, the linear classifier may be too simple 
to differentiate the two classes. Then, one can also 
use a more complicated decision boundary, such as 
a quadratic surface, leading to a quadratic classifier. 
In machine learning, a large number of various mod- 
els that are capable of producing nonlinear decision 
boundaries have been proposed. The most popular ones 
include the decision tree classifier, nearest neighbor 
classifiers, neural network classifiers, Bayesian clas- 
sifiers, and support vector machines. Each of these 
models has some parameters that have to be adapted 
to the particular data set. For example, the parameters 
of a linear classifier include the weight on each fea- 
ture (which controls the slope of the linear boundary) 
and a bias (which controls the offset). To estimate or 
train these parameters, one has to provide a training set,
where the i-th training pattern $(x_i, y_i)$ consists of an
input $x_i$ and the corresponding target output label $y_i$
(for regression problems, this $y_i$ is a real value or
vector). Intuitively, the greater the amount of training
data, the more accurate the learned model. However, since
the training data in supervised learning are labeled,
obtaining these output labels typically involves expensive
and tedious human effort. Hence, recent machine learning
algorithms also try to utilize data that are unlabeled,
leading to the development of semi-supervised learning
algorithms in Sect. 29.5.
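
As a minimal sketch of the linear classifier described above, the following code evaluates the weighted sum of the features plus the bias and assigns a class by the sign of the result. The weights, the bias, and the convention of labeling the two classes +1 and -1 are assumptions made for illustration; in practice the parameters would be estimated from a training set.

```python
def linear_classify(x, w, b):
    """Return +1 or -1 according to the side of the boundary w.x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

w = [1.0, -1.0]  # hypothetical weights: orientation of the boundary
b = 0.5          # hypothetical bias: offset of the boundary

pred_a = linear_classify([2.0, 1.0], w, b)  # score = 2 - 1 + 0.5 = 1.5
pred_b = linear_classify([0.0, 2.0], w, b)  # score = 0 - 2 + 0.5 = -1.5
print(pred_a, pred_b)
```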

Given the model, different strategies can be used
to learn the model parameters so that it fits the training
set (i.e., train the model). Parameter estimation and
feature selection (Sect. 29.7) can sometimes be performed
together. However, note that there is the danger
of overfitting, which occurs when the model performs
better than other models on the training data, but worse
on the entire data distribution because it has captured
the noise in the data rather than the underlying trend.
Often this happens when the model is excessively complex,
such as when it has many more parameters than can be
reliably estimated from the limited number of training
patterns. To combat overfitting, one can constrain the
model's freedom during training by adding a regularizer,
or by placing a Bayesian prior on the parameters or model
beforehand. Alternatively, one can stop the learning
procedure before convergence (early stopping) or remove
part of the model when training is complete (pruning). If
there are noisy training samples that significantly deviate
from the underlying input-output trend, one can also
perform outlier detection to first remove these outlying
samples.

There are two general approaches to training the model
parameters. The first approach treats the model as a
generative model that defines how the data are generated
(typically by using a probabilistic model). One can then
maximize the likelihood by varying the parameters, or
maximize the posterior probability of the parameters
given the training data. Alternatively, one can take
a discriminative approach that directly considers how
the output is related to the input. The parameters can be
obtained by empirical risk minimization, which seeks
the parameters that best fit the training data. The risk
is dependent on the loss function, which measures the
difference between the prediction and the target output.
Let $y_i$ be the target output for sample $i$, and
$\hat{y}_i$ be the predicted output from the supervised
learner. For classification problems (with labels
$y_i \in \{-1, +1\}$), commonly used loss functions include
the logistic loss $\ln(1 + \exp(-y_i \hat{y}_i))$ and
the hinge loss $\max(0, 1 - y_i \hat{y}_i)$; and for
regression problems, the most common loss function is the
square loss $(y_i - \hat{y}_i)^2$. However, in order to
combat overfitting, it is better to perform regularized
risk minimization instead of empirical risk minimization.
Regularized risk consists of two components. The first
component is the loss as in empirical risk minimization.
The second component is a regularizer, which helps to
control the model complexity and prevents overfitting.
Various regularizers have been proposed. Let
$w = [w_1, w_2, \ldots, w_d]^T$ be the vector of
parameters. A popular regularizer is the (squared)
$\ell_2$-norm of $w$, i.e.,

$$\|w\|_2^2 = \sum_{i=1}^{d} w_i^2 .$$

This leads to ridge regression when the linear model is
used with the square loss, and is commonly called weight
decay in the neural networks literature. Instead of using
the $\ell_2$-norm, one can use the $\ell_0$-norm
$\|w\|_0$, which counts the number of nonzero $w_i$ in the
model. However, this is nonconvex and the associated
optimization is more difficult. A common way to alleviate
this problem is by using the $\ell_1$-norm
$\|w\|_1 = \sum_{i=1}^{d} |w_i|$,
which is still convex (as for the $\ell_2$-norm) but can
still lead to a sparse parameter solution (as for the
$\ell_0$-norm). When used with the square loss on the
linear model, this leads to the well-known lasso model.
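
The loss functions and regularizers named above can be written down directly. The sketch below is illustrative only: it assumes a linear model, labels in {-1, +1} for the classification losses, an averaged (rather than summed) loss, and a trade-off weight `lam`, none of which is fixed by the text.

```python
import math

def logistic_loss(y, y_hat):
    return math.log(1.0 + math.exp(-y * y_hat))

def hinge_loss(y, y_hat):
    return max(0.0, 1.0 - y * y_hat)

def square_loss(y, y_hat):
    return (y - y_hat) ** 2

def l2_squared(w):
    return sum(wi * wi for wi in w)

def l1(w):
    return sum(abs(wi) for wi in w)

def l0(w):
    # Number of nonzero parameters.
    return sum(1 for wi in w if wi != 0)

def regularized_risk(w, b, data, loss, reg, lam):
    """Average loss of the linear model w.x + b, plus lam times the regularizer."""
    total = 0.0
    for x, y in data:
        y_hat = sum(wi * xi for wi, xi in zip(w, x)) + b
        total += loss(y, y_hat)
    return total / len(data) + lam * reg(w)

# Hypothetical two-sample training set and parameters.
data = [([1.0, 0.0], 1), ([0.0, 1.0], -1)]
w, b = [2.0, 0.0], 0.0
risk_l2 = regularized_risk(w, b, data, hinge_loss, l2_squared, lam=0.1)
risk_l1 = regularized_risk(w, b, data, hinge_loss, l1, lam=0.1)
print(risk_l2, risk_l1)
```

Swapping `reg` between `l2_squared` and `l1` with the square loss corresponds to the ridge and lasso objectives, respectively; minimizing them requires an optimizer that is not shown here.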

Once trained, the classifier can be used to predict 
the label of an unseen test sample. The underlying as- 
sumption is that this test sample comes from the same 
distribution as that of the training samples. In this case, 
we expect the trained classifier to be able to generalize 
well to this new sample. This can also be formally de- 
scribed by generalization error bounds in computational 
learning theory. 

There are multiple ways to measure the perfor- 
mance of a trained classifier. An obvious performance 
evaluation criterion is classification accuracy, which is 
the fraction of test samples that are correctly classi- 
fied (by comparing the prediction obtained from the 
classifier and the true class output of the test sample). 
As mentioned above, because of the issue of overfit- 
ting, it can be misleading to simply gauge classification 
accuracy on the training set. Instead, one can measure
classification accuracy on a separate validation set
(which is used as a proxy for the underlying data
distribution), or use cross-validation. Moreover, when the
sample sizes of the two classes differ significantly,
this accuracy may again be misleading, as
one may attain an apparently high accuracy by simply
predicting the test sample to belong to the majority
class. In these cases, other measures such as precision,
recall, and F-measure may be more useful. Finally,
while the classifier's accuracy is often an important
criterion, other aspects may also be important, such as the
computational complexities of training and testing
(including both time and space), user-friendliness (e.g.,
is the trained model considered as a black-box or can it be
easily conveyed and explained to the users), etc.
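
The following sketch shows why accuracy can mislead on imbalanced classes. It uses the standard definitions of precision, recall, and the balanced F-measure (F1); returning 0.0 when a ratio is undefined is a convention assumed here, and the 95/5 class split is a made-up example.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, F1) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# 95 negatives and 5 positives; the classifier always predicts the majority class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
acc, prec, rec, f1 = binary_metrics(y_true, y_pred)
print(acc, prec, rec, f1)
```

The always-negative classifier reaches high accuracy yet never finds a positive sample, which recall and the F-measure expose.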

While binary classification assumes the presence of
only two output classes, many real-world applications
have more than two (say, K), leading to a multi-class
classification problem. There are two common approaches
to reduce a multi-class classification problem
to binary classification problems, namely, the one-vs-rest
(also called one-vs-all) approach and the one-vs-one
approach. In the one-vs-rest approach, K binary
classifiers are constructed, the i-th one separating the
samples belonging to the i-th class from those that do not.
On prediction, the test sample is sent to all the K
classifiers, and its label corresponds to the classifier
with the highest output. In the one-vs-one approach, one
binary classifier is built for each pair of outputs (e.g.,
outputs i and j), and each classifier tries to discriminate
samples belonging to the i-th class from those belonging to
the j-th class. Thus, there are a total of
$\frac{1}{2}K(K-1)$ classifiers.
On prediction, the test sample is again sent to all the
binary classifiers, and the class that receives the largest
number of votes is output.
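
The two prediction rules can be sketched as follows. The binary classifiers are stand-ins built from hypothetical one-dimensional class centers; only the aggregation logic (highest output for one-vs-rest, majority vote for one-vs-one) is the point of the example, and ties are broken in favor of the lowest class index as an assumption.

```python
from itertools import combinations

def predict_one_vs_rest(scorers, x):
    """scorers[k](x) is the real-valued output of the k-th classifier."""
    return max(range(len(scorers)), key=lambda k: scorers[k](x))

def predict_one_vs_one(pair_classifiers, num_classes, x):
    """pair_classifiers[(i, j)](x) returns i or j; the most-voted class wins."""
    votes = [0] * num_classes
    for i, j in combinations(range(num_classes), 2):
        votes[pair_classifiers[(i, j)](x)] += 1
    return max(range(num_classes), key=lambda k: votes[k])

def make_scorer(center):
    # Hypothetical scorer: larger output when x is closer to the class center.
    return lambda x: -abs(x - center)

def make_pair_classifier(i, j, centers):
    # Hypothetical pairwise classifier: picks the nearer of the two centers.
    return lambda x: i if abs(x - centers[i]) <= abs(x - centers[j]) else j

centers = [0.0, 1.0, 2.0]  # K = 3 classes
scorers = [make_scorer(c) for c in centers]
pairs = {}
for i, j in combinations(range(len(centers)), 2):
    pairs[(i, j)] = make_pair_classifier(i, j, centers)

ovr = predict_one_vs_rest(scorers, 1.8)
ovo = predict_one_vs_one(pairs, len(centers), 1.8)
print(ovr, ovo, len(pairs))
```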


29.2.2 Other Variants of Supervised Learning


Multi-Label Classification 
While an instance can only belong to one and only one 
class in multi-class classification, an instance in multi- 
label classification can belong to multiple classes. Many 
real-world applications involve multi-label classifica- 
tion. For example, in text categorization, a document 
can belong to more than one category, such as gov- 
ernment and health; in bioinformatics, a gene may 
be associated with more than one function, such as 
metabolism, transcription, and protein synthesis; and 
in image classification, an image may belong to multi- 
ple semantic categories, such as beach and urban. Note 
that the number of labels associated with an unseen 
instance is unknown and can also vary from instance 
to instance. Hence, this makes the multi-label classifi- 
cation problem more complicated than the multi-class 
classification problem. In the special case where the 
number of labels associated with each instance is al- 
ways equal to one, obviously multi-label classification 
reduces to multi-class classification. 

In general, multi-label classification algorithms can 
be divided into two categories: problem transformation 
and algorithm adaptation [29.15]. Problem transfor- 
mation methods transform a multi-label classification 
problem into one or more single-label classification 
problems. The basic approach (called binary relevance) 
simply decomposes a multi-label problem with K la- 
bels into K binary classification problems, one for 
each label. In other words, the i-th classifier is a bi- 
nary classifier that tries to decide whether the sample 
belongs to the i-th class. However, since this consid- 
ers the labels independently, any possible correlations 
among labels will be ignored, leading to inferior per- 
formance in problems with highly correlated labels. 
More refined variants thus take the label correlation
into account during training, an idea that is also
exploited in multi-task learning (Sect. 29.2.2). On the
other hand, algorithm adaptation methods extend a
specific learning algorithm for multi-label classification.
The specific extension is thus tailor-made for each in- 
dividual learning algorithm and less general. Example 
learning algorithms that have been extended in this 
way include boosting, decision trees, ensemble meth- 
ods, neural networks, support vector machines, genetic 
algorithms, and the nearest-neighbor classifier. Recent 
surveys on the progress of multi-label classification and 
its use in different applications can be found in [29.15, 
16]. 
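
The binary relevance transformation described above can be sketched in a few lines. The document names, labels, and keyword-based stand-in classifiers are hypothetical; the sketch shows only how a multi-label data set is split into K binary problems and how the K binary decisions are collected back into a label set.

```python
def binary_relevance_split(samples, labels):
    """Turn a multi-label data set into one binary data set per label."""
    problems = {}
    for label in labels:
        problems[label] = [(x, 1 if label in y else 0) for x, y in samples]
    return problems

def predict_labels(binary_classifiers, x):
    """Collect every label whose binary classifier answers 1."""
    return {label for label, clf in binary_classifiers.items() if clf(x) == 1}

samples = [
    ("doc1", {"government", "health"}),
    ("doc2", {"health"}),
    ("doc3", set()),
]
problems = binary_relevance_split(samples, ["government", "health"])

# Hypothetical stand-ins for trained binary classifiers.
classifiers = {
    "government": lambda text: 1 if "tax" in text else 0,
    "health": lambda text: 1 if "clinic" in text else 0,
}
predicted = predict_labels(classifiers, "tax relief for clinic visits")
print(problems["government"], predicted)
```

Because each label is handled by its own classifier, correlations among labels are ignored, which is the limitation noted in the text.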

In many applications, the labels are often organized 
in a hierarchy, either in the form of a tree (such as 
documents in Wikipedia) or as a directed acyclic graph 
(such as gene ontology). An instance is associated 
with a label only if it is also associated with the label’s 
parent(s) in the hierarchy. Recently, progress has
also been made in multi-label classification in these
structured label hierarchies [29.17-19].


Multi-Instance Learning 
In multi-instance learning (MIL), the training set is 
composed of many bags each containing multiple in- 
stances, where a positive bag contains at least one 
positive instance, whereas a negative bag contains 
only negative instances; labels of the training bags are 
known, but labels of the instances are unknown. The 
task is to make predictions for labels of unseen bags. 
The multi-instance learning framework is illustrated in 
Fig. 29.1. Notice that the instances are described by the 
same feature set, rather than different feature sets. 
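
A minimal sketch of the bag-labeling rule follows. The function `bag_label` encodes the assumption stated above (a bag is positive if and only if it contains at least one positive instance). Scoring an unseen bag by its highest-scoring instance in `predict_bag` is one common choice assumed here for illustration, and the instance scorer is hypothetical.

```python
def bag_label(instance_labels):
    """A bag is positive iff at least one of its instances is positive."""
    return 1 if any(label == 1 for label in instance_labels) else 0

def predict_bag(instance_scorer, bag, threshold=0.0):
    """Predict a bag label from its highest-scoring instance."""
    return 1 if max(instance_scorer(x) for x in bag) > threshold else 0

# Hypothetical instance scorer over one-dimensional feature vectors.
def scorer(x):
    return x[0] - 1.0

positive_bag = [[0.2], [1.5]]  # one instance scores above the threshold
negative_bag = [[0.1], [0.3]]  # no instance does

label_a = bag_label([0, 0, 1])
label_b = bag_label([0, 0, 0])
pred_pos = predict_bag(scorer, positive_bag)
pred_neg = predict_bag(scorer, negative_bag)
print(label_a, label_b, pred_pos, pred_neg)
```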

The MIL framework originated from the
study of drug activity [29.20], where a molecule with
multiple low-energy shapes is known to be useful for
making a drug, whereas it is unknown which shape is
crucial. Later, many real tasks were found to be natural
multi-instance learning problems. For example, in image
retrieval, if we regard each image patch as an instance,
then the fact that a user is interested (or not interested)
in an image implies that there is at least one patch (or no
patch) that contains objects of interest to him/her.


Fig. 29.1 Illustration of multi-instance learning


Most MIL methods attempt to adapt single-instance 
supervised learning algorithms to the multi-instance 
representation by shifting their focus from discrimi- 
nation on instances to discrimination on bags; there 
are also methods that try to adapt the multi-instance 
representation to single-instance algorithms by repre- 
sentation transformation [29.21]. Recently, it has been 
recognized that the instances in the bags should not be 
treated independently [29.22]; otherwise MIL is a spe- 
cial case of semi-supervised learning [29.23]. 

In addition to classification, multi-instance regres- 
sion and clustering have been studied, and different 
versions of generalized multi-instance learning have 
been defined [29.24, 25]. To deal with complicated data 
objects that are associated with multiple labels simulta- 
neously, a new framework, multi-instance, multi-label 
learning (MIML) [29.26], was developed recently. 

Notice that the original MIL assumption implies 
that there exists a key instance in a positive bag; 
later, some other assumptions were introduced [29.27]. 
For example, some methods assumed that there is no 
key instance and every instance contributes to the bag 
label. 


Multi-View Learning 
In many real tasks there is more than one feature set. 
For example, a video film can be described by au- 
dio features, image features, etc.; a web page can be 
described by features characterizing its own content, 
features characterizing its linked pages, etc. A classical 
routine is to take these features together and represent 
each instance using a concatenated feature vector. The 
different feature sets, however, usually convey informa- 
tion from different channels, and therefore, it may be 
better to consider the difference explicitly. This mo- 
tivates multi-view learning, where each feature set is 
called a view. 

Each instance in multi-view learning is represented
by multiple feature vectors, each in a different, usually
non-overlapping feature set. Multi-view learning methods
in the supervised learning setting are closely related
to studies of information fusion, combining classifiers
[29.28-30], and ensemble methods [29.14]. A popular
approach is to construct a model from each
view, and then combine their predictions using voting or
averaging. The models are often assigned different
weights, reflecting their different strength, reliability,
and/or importance.
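
Combining per-view models by weighted averaging can be sketched as below. The two view models, their outputs (read here as the probability of the positive class), the weights, and the 0.5 decision threshold are all hypothetical.

```python
def combine_views(view_models, weights, views):
    """Weighted average of the per-view outputs for one instance."""
    total = sum(weights)
    return sum(w * m(v) for w, m, v in zip(weights, view_models, views)) / total

# Hypothetical per-view models for a video clip.
def audio_model(audio):
    return 0.9 if audio["loudness"] > 0.5 else 0.2

def image_model(image):
    return 0.8 if image["brightness"] > 0.5 else 0.3

# The audio view is trusted twice as much as the image view.
p = combine_views(
    [audio_model, image_model],
    [2.0, 1.0],
    [{"loudness": 0.7}, {"brightness": 0.1}],
)
label = 1 if p >= 0.5 else 0
print(p, label)
```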

Multiple views are particularly useful when unlabeled
data are considered. For example, it has been proved that
when there are sufficient and redundant views (that is,
each view contains sufficient information for constructing
a good model, and the two views are conditionally
independent given the class label), co-training is able
to boost the performance of any initial weak learner to
arbitrarily high accuracy using unlabeled data [29.31].
Later, it was found that such a process is beneficial 
even when the two views satisfy weaker assumptions, 
such as weak dependence, expansion, or large diver- 
sity [29.32-34], and when there are really sufficient and 
redundant views, even semi-supervised learning with 
a single labeled example is possible [29.35]. Moreover, 
in active learning where the learner actively selects 
some unlabeled instances to query their labels from an 
oracle (such as a human expert), it has been proved 
that multi-view learning enables exponential improve- 
ment of sample complexity in a setting close to real 
tasks [29.36], whereas previously it was believed that 
only polynomial improvement is possible. 


Multi-Task Learning 
Many real-world problems involve the learning of
a number of similar tasks. Consider the simple example
of learning to recognize the numeric digits 0-9. One
can build ten separate classifiers, one for each digit.
However, these ten classifiers apparently share some
common features, e.g., many of the digits consist of
loops and strokes. Hence, the ability to detect these
higher-level latent features is of common interest to all
these classifiers, and learning all these tasks together
will allow them to borrow strength from each other.
Moreover, when the number of training examples for
each task is small, most single-task learning methods may
fail. By learning the tasks together, better generalization
performance can be obtained by harnessing the intrinsic
task relationships. Consequently, this leads to the
development of multi-task learning (MTL) [29.37]. These
different tasks have different output spaces and can also
have different input features. Quite often, however,
the different tasks share the same set of input features.
In this case, the problem is similar to multi-label
classification (Sect. 29.2.2).

A popular MTL approach is regularized multi-task
learning (RMTL) [29.38, 39]. It assumes that the tasks
are highly related, and encourages the parameters of all
the tasks to be close. More specifically, let there be T
tasks and denote the parameter associated with the t-th
task by $w_t$. RMTL assumes that all the $w_t$'s are close
to some shared parameter $w$, and that the $w_t$'s differ
from each other only in a term $\Delta w_t$, as
$w_t = w + \Delta w_t$. Hence,
$w$ represents the component that is shared by all the
tasks, and thus can benefit from learning all the tasks
together; while $\Delta w_t$ is the component that is
specific to each individual task, and can be used to
capture the individual variations. Alternatively, other
MTL methods, such as multi-task feature learning (MTFL)
[29.40], assume that all the tasks lie in a shared
low-dimensional space.
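
The decomposition of each task's parameter vector into a shared part plus a task-specific offset can be illustrated numerically. The penalty below (squared norm of the shared part plus the mean squared norm of the offsets, weighted by `lam_shared` and `lam_task`) is an assumed form chosen to convey the idea of keeping the tasks close; it is not taken from the text, and all numbers are hypothetical.

```python
def squared_norm(v):
    return sum(vi * vi for vi in v)

def task_parameters(w_shared, deltas):
    """Per-task parameters: shared vector plus each task-specific offset."""
    return [[ws + d for ws, d in zip(w_shared, delta)] for delta in deltas]

def rmtl_penalty(w_shared, deltas, lam_shared, lam_task):
    """Assumed penalty: small offsets keep all task parameters near w_shared."""
    offsets = sum(squared_norm(d) for d in deltas) / len(deltas)
    return lam_shared * squared_norm(w_shared) + lam_task * offsets

w_shared = [1.0, -1.0]               # component shared by all tasks
deltas = [[0.5, 0.0], [-0.5, 0.0]]   # task-specific components for T = 2 tasks

w_tasks = task_parameters(w_shared, deltas)
penalty = rmtl_penalty(w_shared, deltas, lam_shared=0.5, lam_task=2.0)
print(w_tasks, penalty)
```

A larger `lam_task` pushes the offsets toward zero, so the tasks share more; a smaller value lets each task deviate more freely.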

Moreover, tasks may form several clusters rather than all
belonging to the same group. If such a task
clustering structure is known, then a simple remedy is
to constrain task sharing to be just within the same
cluster [29.39, 41]. More generally, all the tasks are
related in different degrees, which can be represented by
a network of task relationships [29.42]. In this case, MTL
can also be performed. In practice, however, such
explicit knowledge of task clusters/network may not be
readily available.

A number of efforts have been made towards identifying
task relationships simultaneously during parameter
learning, e.g., learning a low-dimensional subspace
shared by most of the tasks [29.43], finding the
correlations between tasks [29.44], and inferring the
clustering structure [29.45, 46], as well as integrating
low-rank and group-sparse structures for robust multi-task
learning [29.47].


Transfer Learning 
As discussed in Sect. 29.1, traditionally, machine learn- 
ing is defined as: 


changes in the system that are adaptive in the sense 
that they enable the system to do the same task or 
tasks drawn from the same population more effec- 
tively the next time. 


However, recently, there has been increasing interest in 
adapting a classifier/regressor trained in one task for 
use in another. This so-called transfer learning is par- 
ticularly crucial when the target application is in short 
supply of labeled data. For example, it is very expensive 
to calibrate a WiFi localization model in a large-scale 
environment. To reduce re-calibration effort, we might 
want to adapt the localization model trained in one 
time period (source domain) for a new time period (tar- 
get domain), or to adapt the localization model trained 
on a mobile device (source domain) for a new mo- 
bile device (target domain). However, the WiFi signal 
strength is a function of time, device, and other dynamic 
factors. Thus, transfer learning is used to adapt the dis- 
tributions of WiFi data collected over time or across 
devices. 

In general, transfer learning addresses the problem
of how to utilize plentiful labeled data in a source
domain to solve related but different problems in a target
domain, even when the training and testing problems
have different distributions or features. The success of
transferring learning from one context to another
depends on how similar the learning task is to the
transferred task. There are two main approaches to transfer
learning. The first approach tries to learn a common set
of features from both domains, which can then be used
for knowledge transfer [29.48-50]. Intuitively, a good
feature representation should be able to reduce the
difference in distributions between domains as much as
possible, while at the same time preserving important
(geometric or statistical) properties of the original data.
With a good feature representation, we can apply standard
machine learning methods to train classifiers or
regression models in the source domain for use in the
target domain. The second approach to transfer learning
is based on instances [29.51-53]. It tries to learn
different weights on the source examples for better
adaptation in the target domain. For example, in the kernel
mean matching algorithm [29.52], instances in a reproducing
kernel Hilbert space are re-weighted based on the theory
of maximum mean discrepancy.


Cost-Sensitive Learning 
In many real tasks, the costs of making different
types of mistakes are usually unequal. In such
situations, maximizing the accuracy (or equivalently,
minimizing the number of mistakes) may not provide
the optimal decision. For example, two mistakes
that each cost 10 dollars are less serious than one
mistake that costs 50 dollars. Cost-sensitive learning
methods attempt to minimize the total cost by
reducing serious mistakes at the expense of more minor
mistakes.
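
The gap between accuracy and total cost can be checked with a small computation. The cost table below (10 for one type of mistake, 50 for the other) mirrors the dollar amounts in the example above; the labels and the two sets of predictions are hypothetical.

```python
def total_cost(y_true, y_pred, cost):
    """cost[(t, p)] is the cost of predicting p when the true class is t."""
    return sum(cost[(t, p)] for t, p in zip(y_true, y_pred) if t != p)

def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

cost = {(0, 1): 10, (1, 0): 50}  # a missed class-1 sample is the serious mistake
y_true = [0, 0, 1, 1]
pred_a = [1, 1, 1, 1]  # two minor mistakes
pred_b = [0, 0, 0, 1]  # one serious mistake

cost_a, acc_a = total_cost(y_true, pred_a, cost), accuracy(y_true, pred_a)
cost_b, acc_b = total_cost(y_true, pred_b, cost), accuracy(y_true, pred_b)
print(cost_a, acc_a, cost_b, acc_b)
```

The second set of predictions is more accurate but incurs the higher total cost, so the accuracy-optimal choice and the cost-optimal choice differ.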

There are two types of misclassification costs, i.e.,
example-dependent and class-dependent costs. The former
assumes that every example has its own misclassification
cost, whereas the latter assumes that every class
has its own misclassification cost. Obtaining
example-dependent costs is usually much more difficult in
real practice, and therefore, most studies focus on
class-dependent costs.


The essence of most cost-sensitive learning methods
is rescaling (or rebalancing), which tries to rebalance
the classes such that the influence of each class in the
learning process is in proportion to its cost. Suppose
the cost of misclassifying the i-th class to the j-th class
is $\mathrm{cost}_{ij}$. For binary classification, it can
be derived from the Bayes risk theory that the optimal
rescaling ratio of the i-th class against the j-th class is
$\tau_{ij} = \mathrm{cost}_{ij} / \mathrm{cost}_{ji}$ [29.54].
For multi-class problems, however, there is no direct
solution to obtain the optimal rescaling ratios [29.55],
and one may want to decompose a multi-class problem into
a series of binary problems to solve.

Rescaling can be implemented in different ways,
e.g., re-weighting or re-sampling the training examples
of different classes, or even moving the decision
threshold directly towards the cheaper class. It can be
easily incorporated into existing supervised learning
algorithms. For example, for support vector machines,
the corresponding optimization problem can be written
as

$$
\begin{aligned}
\min_{w, b, \xi} \quad & \frac{1}{2}\|w\|_{\mathcal{H}}^2
  + C \sum_{i=1}^{m} \mathrm{cost}(x_i)\, \xi_i \\
\text{s.t.} \quad & y_i \left( w^T \phi(x_i) + b \right) \ge 1 - \xi_i , \\
& \xi_i \ge 0, \quad i = 1, \ldots, m ,
\end{aligned}
\tag{29.1}
$$

where $\phi$ is the feature mapping induced by a kernel
function and $\mathrm{cost}(x_i)$ is the cost for
misclassifying $x_i$. The only difference from the
classical support vector machine is the insertion of
$\mathrm{cost}(x_i)$.

It is often difficult to know precise costs in real
practice, and some recent studies have tried to address
this issue [29.56]. Notice that a learning process
may involve various costs such as the test cost,
teacher cost, intervention cost, etc. [29.57], and these
costs can also be considered in cost-sensitive learning.
Last but not least, it should be noted that the
variants introduced in this section already go beyond
the classic paradigm of supervised learning. Many of
them are integrated with unsupervised learning. Some
further issues will also be addressed in the following
sections.


29.3 Unsupervised Learning


Given a set $X_N = \{x_t\}_{t=1}^{N}$ of unlabeled data
samples, unsupervised learning aims at finding a certain
dependence structure underlying the data $X_N$ with the
help of a learning principle. The simplest one is the
structure of merely a point $\mu$ in a vector space as
illustrated in Fig. 29.2a(A(2)). It represents each sample
$x_t$ with an error measure
$\varepsilon_t = \|x_t - \mu\|^2$. The best $\mu$ may be
obtained under a learning principle, e.g., minimizing the
following error

$$E = \sum_{t=1}^{N} \varepsilon_t , \tag{29.2}$$

which results in $\mu$ simply being the mean of the
samples.
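
That the mean minimizes the summed squared error follows from setting the gradient of the error with respect to the point to zero. The sketch below checks this numerically on three hypothetical two-dimensional samples by comparing the error at the mean with the error at a nearby point.

```python
def total_error(samples, mu):
    """Sum over samples of the squared Euclidean distance to mu."""
    return sum(sum((xi - mi) ** 2 for xi, mi in zip(x, mu)) for x in samples)

def mean(samples):
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]

samples = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]  # hypothetical data
mu = mean(samples)
e_mean = total_error(samples, mu)
e_other = total_error(samples, [1.5, 1.0])      # any other point does worse
print(mu, e_mean, e_other)
```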

Research efforts have gone far beyond a simple point
structure. As illustrated in Fig. 29.2, these efforts are
roughly grouped into several closely related streams.
One consists of those listed in Fig. 29.2a(A),
characterized by increasing the dimensionality of the
modeling structure from a single point to a line, plane,
and subspace. The second stream consists of those listed in
Fig. 29.2a(B), with multiple structures replacing their
counterparts listed in Fig. 29.2a(A). The third stream
consists of those listed in Fig. 29.2b(C), based on
matrix/graph representation of underlying dependence
structures. Moreover, another stream is characterized by
underlying dependencies in tree structures, such as
temporal modeling, hierarchical learning, and causal tree
structuring, as illustrated in Fig. 29.2b(D). This section
will provide a tutorial on the basic structures listed in
Fig. 29.2. Also, an overview will be given of a number
of emerging topics, mainly drawn from a recent
systematic overview [29.6].

Additionally, there is also a stream of studies 
that not only consist of unsupervised learning as 
a major ingredient but also include features of su- 
pervised learning and reinforcement learning, some 
of which are referred to under the term semi- 
supervised learning, while others are referred to un- 
der the names of semi-unsupervised learning, hy- 
brid learning, mixture of experts, etc. Among them, 
semi-supervised learning has become a well-adopted 
name in the literature of machine learning and will 
be further introduced in Sect. 29.5. Moreover, read- 
ers are further referred to Sect. 4.3 of [29.58] 
and [29.59] for a general formulation called semi-blind 
learning. 


29.3.1 Principal Subspaces 
and Independent Factor Analysis 


When the point $\mu$ is replaced with a line structure, as
illustrated in Fig. 29.2a(A(3)), e.g., represented by a vector
$w_\mu = w - \mu$ of unit length, we consider the error $\varepsilon_t$
as the shortest distance from $x_t$ to the line. Then, minimizing
$E_2$ in (29.2) results in $w_\mu$ being the principal component
direction of the sample set $X_N$, that is, we have

$$S w_\mu = \lambda w_\mu , \quad S = \frac{1}{N} \sum_{t=1}^{N} (x_t - \mu)(x_t - \mu)^T , \tag{29.3}$$

where $\lambda$ is the largest eigenvalue of the sample covariance
matrix $S$, and $w_\mu$ is the corresponding eigenvector.

Fig. 29.2a,b Four streams of unsupervised learning studies
featured by types of underlying dependence structures

Moreover, we consider a plane or subspace, as illustrated
in Fig. 29.2a(A(4)), resulting in a principal plane or
subspace, i.e., a subspace spanned by the $m$ eigenvectors
that correspond to the $m$ largest eigenvalues of $S$.
Usually, the tasks in Fig. 29.2a(A(3)) and Fig. 29.2a(A(4))
are called principal component analysis (PCA) and principal
subspace analysis (PSA).
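The eigen-decomposition in (29.3) can be sketched in a few lines. The following is only an illustrative sketch, assuming NumPy is available; the function name `principal_subspace` and the synthetic data are hypothetical and not part of the original text.

```python
import numpy as np

def principal_subspace(X, m):
    """Return (mean, eigenvalues, eigenvectors) for the m leading
    principal directions. X has shape (N, d), one sample per row."""
    mu = X.mean(axis=0)
    Xc = X - mu
    S = Xc.T @ Xc / X.shape[0]            # sample covariance S of (29.3)
    evals, evecs = np.linalg.eigh(S)      # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:m]   # indices of the m largest
    return mu, evals[order], evecs[:, order]

# Synthetic samples scattered along the direction (1, 1), shifted by a mean.
rng = np.random.default_rng(0)
t = rng.normal(size=500)
X = (np.column_stack([3.0 * t, 3.0 * t])
     + 0.1 * rng.normal(size=(500, 2))
     + np.array([1.0, -2.0]))

mu, lam, W = principal_subspace(X, m=1)
w = W[:, 0]   # principal component direction (unit length, sign arbitrary)
```

With `m = 1` this corresponds to PCA in the sense of Fig. 29.2a(A(3)); a larger `m` returns a basis of the principal subspace (PSA).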

Considering a subspace spanned by the column vectors of $A$,
each sample $x_t$ is represented by $\hat{x}_t = A y_t$ from
a vector $y_t = [y_t^{(1)}, \ldots, y_t^{(m)}]^T$ in the subspace,
with the mutually independent elements of $y_t$ being the
coordinates along these column vectors, subject to an error
$e_t = x_t - \hat{x}_t$ that is uncorrelated with or independent
of $y_t$. Thus, as illustrated in Fig. 29.2a(A(5)), $x_t$ comes
from an underlying subspace as follows

$$x_t = \hat{x}_t + e_t = A y_t + e_t , \quad E e_t y_t^T = 0 \ \text{or} \ p(e_t | y_t) = p(e_t) , \tag{29.4}$$

featured with the following independence

$$p(y_t) = p(y_t^{(1)}) \cdots p(y_t^{(m)}) . \tag{29.5}$$

Particularly, it is called factor analysis (FA) if we consider

$$p(y_t) = G(y_t | 0, I) , \quad p(e_t) = G(e_t | 0, D) \ \text{for a diagonal} \ D , \tag{29.6}$$

where $G(x | \mu, \Sigma)$ denotes a Gaussian density with the
mean vector $\mu$ and the covariance matrix $\Sigma$.

In general, the matrix $A$ and other unknowns in (29.4) are
estimated by maximum likelihood learning on

$$q(x | \theta) = G(x | 0, A A^T + D) , \tag{29.7}$$

with the help of the expectation maximization (EM) algorithm.
In the special case $D = \sigma^2 I$, the space spanned by the
column vectors of $A$ is the same subspace spanned by the $m$
eigenvectors that correspond to the $m$ largest eigenvalues
of $S$ in (29.3), which has been a well-known fact since
Anderson's work in 1959. In the last two decades, there has
been a renewed interest in the machine learning literature
under the new name of probabilistic PCA.

Classically, the principal subspace is obtained via
computing the eigenvectors of the sample covariance $S$.
However, $S$ is usually poor in accuracy when the sample
size $N$ is small while the dimensionality of $x_t$ is high.
Alternatively, Oja's rule and variants thereof have been
proposed to learn the eigenvectors adaptively per sample
without directly computing $S$. Also, extensions have been
made to adaptive robust PCA learning on data with outliers
and to the adaptive principal curve, with the line in
Fig. 29.2a(A(3)) extended to a curve.
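As an illustration of such per-sample adaptive learning, the sketch below uses the basic single-unit form of Oja's rule, $w \leftarrow w + \eta\, y\,(x - y w)$ with $y = w^T x$, on zero-mean data. The step size, the number of passes, and the synthetic data are assumptions chosen for this example only.

```python
import random

def oja_first_component(samples, eta=0.005, epochs=20, seed=0):
    """Estimate the first principal direction of zero-mean samples
    with Oja's single-unit rule, one sample at a time."""
    rng = random.Random(seed)
    d = len(samples[0])
    w = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]                      # start from a unit vector
    for _ in range(epochs):
        for x in samples:
            y = sum(wi * xi for wi, xi in zip(w, x))          # y = w^T x
            # Hebbian term y*x with the decay y^2*w that keeps |w| near 1.
            w = [wi + eta * y * (xi - y * wi) for wi, xi in zip(w, x)]
    return w

# Synthetic zero-mean samples spread mainly along the direction (2, 1).
data_rng = random.Random(1)
samples = []
for _ in range(300):
    t = data_rng.gauss(0.0, 1.0)
    samples.append((2.0 * t + 0.1 * data_rng.gauss(0.0, 1.0),
                    1.0 * t + 0.1 * data_rng.gauss(0.0, 1.0)))

w = oja_first_component(samples)
```

No covariance matrix is formed at any point; the estimate `w` is refined each time a sample arrives, which is the property the text refers to.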

A hyperplane has two dual representations. One is spanned
by several one-dimensional unit vectors, while the other is
represented by a unit length normal vector $w$ that is
orthogonal to this subspace. In the latter case, minimizing
$E_2$ results in $w_\mu$ still being a solution of (29.3) but
with $\lambda$ becoming the smallest eigenvalue instead of the
largest one. Accordingly, the problem is called minor
component analysis (MCA). In general, an $m$-dimensional
subspace in $R^d$ may also be represented by the spanning
vectors of a $(d-m)$-dimensional complementary subspace, for
which minimizing $E_2$ results in $d-m$ eigenvectors that
correspond to the $d-m$ smallest eigenvalues of $S$. The
problem is called minor subspace analysis (MSA). When
$m > d/2$, the minor subspace needs fewer free parameters
than the principal subspace does. In a dual subspace pattern
classification, each class is represented by either
a principal subspace or a minor subspace. Unlike PCA and
PSA, MCA and MSA are more sensitive to noise. For further
details about these topics the interested reader is referred
to Sect. 3.2.1 of [29.58] and a recent overview [29.60]. In
the following, we add brief summaries on three typical
methods.


PCA versus ICA 
Independent component analysis (ICA) has been widely
studied in the past two decades. The key point is to seek
a linear mapping $y_t = W x_t$ such that the components
$y_t^{(1)}, \ldots, y_t^{(m)}$ of $y_t$ become mutually independent,
as shown in (29.5), as an extension of PCA, by which the
components $[y_t^{(1)}, \ldots, y_t^{(m)}]$ of $y_t = W x_t$ become
mutually de-correlated when the rows of $W$ consist of the
eigenvectors of the $m$ largest eigenvalues of $S$ in (29.3).
Strictly speaking, this is inexact since the counterpart of
ICA should be called de-correlated component analysis (DCA),
with independence among $y_t^{(1)}, \ldots, y_t^{(m)}$ in the second
order of statistics. PCA is one extreme case of DCA that
chooses those de-correlated components with the $m$ largest
variances/eigenvalues, while MCA is another extreme case
that chooses those with the $m$ smallest
variances/eigenvalues. Correspondingly, the extended
counterpart of PCA/MCA should be the principal/minor ICA
that chooses the independent components with the $m$
largest/smallest variances. Readers are referred to
Sect. 2.4 of [29.61] for further details. Several adaptive
learning algorithms have been developed for implementing
ICA, but they cannot be guaranteed to achieve (29.5).
Theoretically, theorems have been proved that such
a guarantee can be reached as long as one bit of information
is provided to each component $y_t^{(j)}$ [29.62].


FA-a, FA-b, and Model Selection

In the literature of statistics and machine learning, the
model (29.4) with (29.6) is conventionally referred to as
FA. Actually, we have $A y = \tilde{A} \tilde{y}$ with
$\tilde{A} = A \phi^{-1}$, $\tilde{y} = \phi y$ for any unknown
nonsingular matrix $\phi$. Among the different choices to handle
this indeterminacy, the standard one, denoted by FA-a for
short, imposes (29.6) on $y_t$, which reduces the
indeterminacy from a general nonsingular $\phi$ to an
orthonormal matrix. Another choice is given by (29.4) with

$$A^T A = I , \quad p(y_t) = G(y_t | 0, \Lambda) \ \text{for a diagonal} \ \Lambda , \tag{29.8}$$

denoted by FA-b for short. We have
$\hat{x}_t = A y = A \Lambda^{0.5} \Lambda^{-0.5} y = A \Lambda^{0.5} \phi^T \phi \Lambda^{-0.5} y = \tilde{A} \tilde{y}$
with $\tilde{A} = A \Lambda^{0.5} \phi^T$, $\tilde{y} = \phi \Lambda^{-0.5} y$,
and $\phi^T \phi = I$, i.e., its $\hat{x}_t$ is equivalent to the
one by FA-a for a given $m$ with an invertible $\Lambda$. In
other words, FA-a and FA-b are equivalent for a learning
principle based on $e_t = x_t - \hat{x}_t$, e.g., minimizing
$E_2$ by (29.2) or maximizing the likelihood on $x_t$.
Moreover, FA-a and FA-b are still equivalent when model
selection is used for determining an appropriate value $m$
by one of the classic model selection criteria to be
introduced later in (29.14). However, FA-a and FA-b become
considerably different under the Bayesian Yin-Yang (BYY)
harmony learning (see Sect. 3.2.1 of [29.58]) and also under
automatic model selection in general. Empirically,
experiments show that not only BYY harmony learning but also
the variational Bayes method performs considerably better on
FA-b than on FA-a [29.63].


Non-Gaussian FA 
Both FA-a and FA-b still suffer from an indeterminacy of
an $m \times m$ orthonormal matrix $\phi$, which can be further
removed when at most one of the components
$y_t^{(1)}, \ldots, y_t^{(m)}$ is Gaussian. Accordingly, (29.4) with
non-Gaussian components $p(y_t^{(j)})$ in (29.5) is called
non-Gaussian FA (NFA). It is also referred to as independent
FA (IFA), although NFA is the better name, since the concept
of IFA covers not only NFA but also FA-a and FA-b. One
useful special case of NFA is called binary FA (BFA) when
$y_t$ is a binary vector. Moreover, in the degenerate case
$e_t = 0$, obtaining $A$ of $x_t = A y_t$ subject to (29.5) is
equivalent to getting $W = A^{-1}$ such that $W x_t = y_t$
satisfies (29.5). For this reason, NFA with $e_t \neq 0$ is
also sometimes referred to as noisy ICA. Strictly speaking,
the map $x_t \to y_t$ towards (29.5), being an inverse of NFA,
should be nonlinear instead of a linear $y_t = W x_t$. Maximum
likelihood learning is implemented with the help of the EM
algorithm, which was developed in the middle to end of the
1990s for BFA/NFA, respectively. Also, learning algorithms
have been proposed for implementing BYY harmony learning
with automatic model selection on $m$. Recently, both BFA
and NFA were used for transcription regulatory networks in
gene analysis; for further details the reader is referred
to the overview [29.60] and especially its Roadmap.


29.3.2 Multi-Model-Based Learning: 
Data Clustering, Object Detection, 
and Local Regression 


The task of data clustering is partitioning a set of
samples into several clusters such that samples within the
same cluster are similar while samples from different
clusters are as different as possible. An indicator matrix
$P = [p_{\ell,t}]$ with $P P^T = I$ is used to represent one
possible partition of a sample set $X_N = \{x_t\}_{t=1}^{N}$
into $\ell = 1, \ldots, k$ clusters, i.e., $p_{\ell,t} = 1$ if $x_t$
belongs to the $\ell$-th cluster, otherwise $p_{\ell,t} = 0$. For
multi-model-based clustering, each cluster is modeled by one
structure, with $p_{\ell,t}$ obtained by a competition of using
the structure of each cluster to represent a sample $x_t$.
The structure of each cluster could be one of the ones
listed on the left-hand side of Fig. 29.2; multiple clusters
are thus represented by multiple structures listed on the
right-hand side of Fig. 29.2, which feature the basic topics
of the second stream of studies.

We still start from the simplest point structure
illustrated in Fig. 29.2a(A(1)), extending to the structure
of multi-points illustrated in Fig. 29.2a(B(1)). With data
already divided into $k$ clusters, it is easy to obtain the
mean $\mu_j$ of each cluster. Given $\{\mu_j\}_{j=1}^{k}$ fixed, it
is also easy to divide $X_N$ into $k$ clusters by

$$p_{\ell,t} = \begin{cases} 1 , & \ell = \arg\min_j \varepsilon_{j,t} , \\ 0 , & \text{otherwise} , \end{cases} \tag{29.9}$$

where $\varepsilon_{j,t} = \|x_t - \mu_j\|^2$, i.e., $x_t$ is assigned to
the $\ell$-th cluster if $p_{\ell,t} = 1$. The key idea of the
k-means algorithm is alternately getting $p_{\ell,t}$ and
computing $\mu_j$ from an initialization. Although it aims at
minimizing $E_2$ by (29.2) with
$\varepsilon_t = \sum_{\ell=1}^{k} p_{\ell,t} \varepsilon_{\ell,t}$, k-means typically
results in a local minimum of $E_2$, depending on the
initialization.
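The alternation described above can be sketched as follows. This is a minimal illustration only; the function names, the stopping rule, and the toy data are assumptions made for the example.

```python
def sq_dist(x, c):
    """Squared Euclidean distance, the error measure used in (29.9)."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def kmeans(samples, init_centers, max_iter=100):
    centers = [tuple(c) for c in init_centers]
    k = len(centers)
    labels = []
    for _ in range(max_iter):
        # Assignment step (29.9): winner-take-all on the squared distance.
        labels = [min(range(k), key=lambda j: sq_dist(x, centers[j]))
                  for x in samples]
        # Update step: each center becomes the mean of its cluster.
        new_centers = []
        for j in range(k):
            members = [x for x, lab in zip(samples, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(col) / len(members)
                                         for col in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:              # no change: a local minimum
            break
        centers = new_centers
    return centers, labels

samples = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (11, 11)]
centers, labels = kmeans(samples, init_centers=[(0, 0), (10, 10)])
```

Different choices of `init_centers` may end in different local minima, which is the dependence on initialization noted in the text.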

Merely using the mean $\mu_j$ is not good for describing
a cluster beyond a ball shape. Instead, it is extended to
considering the Gaussian illustrated in Fig. 29.2a(A(2))
and thus its counterpart in Fig. 29.2a(B(2)), i.e., the
following Gaussian mixture

$$q(x | \theta) = \sum_{j=1}^{k} \alpha_j G(x | \mu_j, \Sigma_j) . \tag{29.10}$$

K-means can be extended to getting $p_{\ell,t}$ by (29.9) with

$$\varepsilon_{j,t} = -\ln [\alpha_j G(x_t | \mu_j, \Sigma_j)] \tag{29.11}$$

and computing each Gaussian by

$$\alpha_\ell = \frac{\sum_t p_{\ell,t}}{N} , \quad \mu_\ell = \frac{1}{N \alpha_\ell} \sum_t p_{\ell,t} x_t , \quad \Sigma_\ell = \frac{1}{N \alpha_\ell} \sum_t p_{\ell,t} (x_t - \mu_\ell)(x_t - \mu_\ell)^T , \tag{29.12}$$

which actually performs a type of elliptic clustering.
Instead of getting $p_{\ell,t}$ by (29.9), we compute

$$p_{\ell,t} = q(\ell | x_t, \theta^{old}) , \quad q(\ell | x_t, \theta) = \frac{e^{-\varepsilon_{\ell,t}}}{\sum_{j=1}^{k} e^{-\varepsilon_{j,t}}} . \tag{29.13}$$

Actually, alternately iterating (29.13) and (29.12) is
the well-known EM algorithm for carrying out maximum
likelihood learning on the Gaussian mixture.
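The EM alternation can be sketched for the one-dimensional case, where each covariance reduces to a scalar variance. This is an illustrative sketch under that simplifying assumption; the function names, the variance floor of 1e-6, and the synthetic data are not from the original text.

```python
import math
import random

def gauss(x, mu, var):
    """Density of a one-dimensional Gaussian G(x | mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(xs, alphas, mus, variances, n_iter=100):
    alphas, mus, variances = list(alphas), list(mus), list(variances)
    N, k = len(xs), len(alphas)
    for _ in range(n_iter):
        # E-step: posterior q(l | x_t, theta_old), i.e. exp(-eps_l) / sum_j exp(-eps_j).
        post = []
        for x in xs:
            w = [alphas[j] * gauss(x, mus[j], variances[j]) for j in range(k)]
            s = sum(w)
            post.append([wj / s for wj in w])
        # M-step: weighted proportion, mean, and variance of each component.
        for j in range(k):
            nj = sum(p[j] for p in post)
            alphas[j] = nj / N
            mus[j] = sum(p[j] * x for p, x in zip(post, xs)) / nj
            var = sum(p[j] * (x - mus[j]) ** 2 for p, x in zip(post, xs)) / nj
            variances[j] = max(var, 1e-6)   # floor to avoid a collapsing component
    return alphas, mus, variances

rng = random.Random(0)
xs = ([rng.gauss(0.0, 1.0) for _ in range(200)]
      + [rng.gauss(5.0, 1.0) for _ in range(200)])
alphas, mus, variances = em_gmm_1d(xs, [0.5, 0.5], [1.0, 4.0], [1.0, 1.0])
```

Replacing the soft posterior in the E-step by a hard winner-take-all assignment would give the elliptic-clustering extension of k-means mentioned above.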
Another important topic is to determine an appropriate
$k$ (model selection), i.e., how many clusters are needed.
Classic model selection seeks a best $k^* = \arg\min_k J(k)$
with a criterion $J(k)$ in a format as follows

$$J(k) = -L(k, \theta^*) + \omega(k, N) , \quad \theta^* = \arg\max_\theta L(k, \theta) , \tag{29.14}$$

where $L(k, \theta)$ is the likelihood function of $q(x | \theta)$, and
$\omega(k, N) > 0$ increases with $k$ and decreases with $N$. One
typical example is called the Bayesian information criterion
(BIC) or minimum description length (MDL). To obtain $k^*$
one needs to enumerate a set of $k$ values and estimate
$\theta^*$ for each $k$ value, which incurs extensive
computation and is thus difficult to scale up to a large
number of clusters.

Alternatively, automatic model selection aims at
obtaining $k^*$ during learning $\theta^*$ by a mechanism or
principle that is different from the maximum likelihood.
This learning drives away an extra cluster via a certain
indicator $\rho_j \to 0$, e.g., $\rho_j = \alpha_j$ or
$\rho_j = \alpha_j \mathrm{Tr}[\Sigma_j]$. One early effort is rival
penalized competitive learning (RPCL). RPCL learning
implements (29.12) with $p_{\ell,t}$ given not by (29.9) or
(29.13) but as follows

$$p_{\ell,t} = \begin{cases} 1 , & \ell = \ell^* = \arg\min_j \varepsilon_{j,t} , \\ -\gamma , & \ell = \arg\min_{j \neq \ell^*} \varepsilon_{j,t} , \\ 0 , & \text{otherwise} , \end{cases} \tag{29.15}$$

by which learning is made on a cluster when $p_{\ell,t} = 1$,
and penalizing or de-learning is made on a cluster when
$p_{\ell,t} = -\gamma$, with a heuristic penalizing strength of
roughly $\gamma \approx 0.005 \sim 0.05$.

The BYY harmony learning gets rid of the difficulty of
finding an appropriate penalizing strength, with both
parameter learning and model selection made under the
Yin-Yang best harmony principle. The algorithm obtained
still implements (29.12), but with $p_{\ell,t}$ given as follows

$$p_{\ell,t} = q(\ell | x_t, \theta^{old}) [1 + \Delta_{\ell,t}(\theta^{old})] , \quad \Delta_{\ell,t}(\theta) = \sum_j q(j | x_t, \theta) \varepsilon_{j,t} - \varepsilon_{\ell,t} , \tag{29.16}$$

where $\Delta_{\ell,t} > 0$ means that the $\ell$-th component is
better than the average of all the components for
describing the sample $x_t$. We further update the $\ell$-th
component by (29.12) to enhance the description. If
$0 > \Delta_{\ell,t} > -1$, i.e., the fitness by the $\ell$-th
component to $x_t$ is below the average but still not too
far away, updating of the $\ell$-th component remains the same
trend as in (29.12) but with reduced strength. Moreover,
when $-1 > \Delta_{\ell,t}$, the updating on the $\ell$-th component
reverses direction to become de-learning, similar to
updating the rival in RPCL learning.

RPCL learning, which was proposed in 1992, and BYY
harmony learning, which was developed in 1995, are similar
in nature to the popular sparse learning method, which was
developed in 1995 [29.64, 65], and to prior-based automatic
model selection approaches [29.66-68], that is, extra parts
in a model are removed as some parameters are pushed towards
zero. Without any priors on the parameters, these
prior-based approaches degenerate to maximum likelihood
learning, while RPCL learning and BYY harmony learning can
be further improved by incorporating appropriate priors.

For further details about automatic model selection, 
prior-aided learning, and model selection criteria the 
reader is referred to Sect. 2.2 of [29.58, 69] and [29.70] 
for recent overviews. Also, readers are referred to 
Sect. 2.1 and Table 1 of [29.58] for a tutorial on several 
algorithms for learning Gaussian mixture, including the 
ones introduced above. In the following, only three typ- 
ical ones are briefly summarized. 


Local Subspaces and Local Factor Analysis 

As illustrated in Fig. 29.2a(B(3)-(5)), the structure of
multi-points illustrated in Fig. 29.2a(B(1)) can be extended
into multiple subspaces and FA models. Still, we can obtain
$p_{\ell,t}$ by (29.9), (29.15), and (29.13) with $\varepsilon_{j,t}$
given by either (29.11) with $\Sigma_j = A_j A_j^T + D_j$ or
simply the shortest square distance from $x_t$ to the $j$-th
subspace. Given data divided into $k$ clusters, we may
estimate the subspace or FA of each cluster as introduced
in Sect. 29.3.1, which leads to extensions of the k-means
algorithm, the EM algorithm, and the BYY harmony learning
algorithm for learning FAs or subspaces located at
different $\mu_j$. Moreover, readers are referred to [29.71]
and [29.2] for learning local FAs with both the number $k$
and the dimensions $\{m_j\}$ determined automatically during
BYY harmony learning.


Object Detection and Pattern-Based Clustering 
The structures in Fig. 29.2a(B(3),(4)) are applicable to
the tasks of detecting lines and subspaces among image
data, which are topics that are widely studied in the
literature of pattern recognition and handled by the
well-known Hough transform (HT) and randomized HT (RHT)
[29.70]. Extensions can be made to detect multiple objects
such as circles, ellipses, lines and other shapes, as well
as so-called pattern-based clustering, still obtaining
$p_{\ell,t}$ by (29.9), (29.15), and (29.13) but with
$\varepsilon_{j,t}$ being the shortest square distance from $x_t$
to each shape. However, it is no longer possible to use
(29.12) for updating the parameters $\theta_j$ of each shape.
Instead, learning is done by

$$\theta_j^{new} = \theta_j^{old} - \eta\, p_{j,t} \nabla_{\theta_j} \varepsilon_{j,t} , \tag{29.17}$$

where $\eta > 0$ is a learning step size; for further details
the reader is referred to [29.69, 70] and [29.8].


Mixture of Experts, RBF Networks,
SBF Functions
Letting each Gaussian be associated with a function
$f(x | \phi_j)$ for a mapping $x \to z$, we consider the task of
learning

$$q(z_t | x_t, \psi) = \sum_{j=1}^{k} q(j | x_t, \theta)\, G(z_t | f(x_t, \phi_j), \Gamma_j) \tag{29.18}$$

from a set $D_N = \{x_t, z_t\}_{t=1}^{N}$ of labeled data. The
above $q(z_t | x_t, \psi)$ is actually the alternative mixture
of experts [29.72], featured by a combination of
unsupervised learning for the Gaussian mixture by (29.10)
and supervised learning for every $f(x | \phi_j)$. For
a regression task, typically we consider
$f(x | \phi_j) = w_j^T x + c_j$ with
$E[z | x] = \sum_{j=1}^{k} q(j | x, \theta) f(x, \phi_j)$ implementing
a type of piecewise linear regression. In implementation,
we still obtain $p_{\ell,t}$ by (29.9), (29.15), and (29.13) but
with the following $\varepsilon_{j,t}$

$$\varepsilon_{j,t} = -\ln [\alpha_j G(x_t | \mu_j, \Sigma_j)\, G(z_t | f(x_t, \phi_j), \Gamma_j)] , \tag{29.19}$$

and then compute each Gaussian by (29.12), as well as
update $G(z_t | f(x_t, \phi_j), \Gamma_j)$. When
$\alpha_j = |\Sigma_j|^{0.5} / \sum_i |\Sigma_i|^{0.5}$, it becomes
equivalent to an extended normalized radial basis function
(RBF) network, and to a normalized RBF network simply with
$w_j = 0$. Moreover, letting each subspace be associated
with $f(x | \phi_j)$ will lead to subspace-based functions
(SBFs). For further details readers are referred to [29.5]
and Sect. 7 of [29.69].


29.3.3 Matrix-Based Learning: 
Similarity Graph, Nonnegative 
Matrix Factorization, 
and Manifold Learning 


We proceed to the third stream, featured with
graph/matrix structures. We start with the sample
similarity graph, with each node for a sample and each edge
attached with a similarity measure between two samples, as
illustrated in Fig. 29.2b(C(1)). Such a graph is also
equivalently represented by a symmetric matrix
$W = [w_{ij}]$.

One similarity measure is simply the inner product
$w_{ij} = x_i^T x_j$ of two samples. Given a data matrix
$X = [x_1, \ldots, x_N]$, we simply have $W = X^T X$. We seek an
indicator matrix $P = [p_{\ell,t}]$ that divides
$X_N = \{x_t\}_{t=1}^{N}$ into $k$ clusters, with the help of the
following maximization

$$\max_{H H^T = I,\ H \ge 0} \mathrm{Tr}[H^T W H] , \quad H = \mathrm{diag}[n_1^{-0.5}, \ldots, n_k^{-0.5}] P , \tag{29.20}$$

where $H \ge 0$ is a nonnegative matrix with each element
$h_{ij} \ge 0$, and $n_\ell$ is the number of samples in the
$\ell$-th cluster. It can be shown that this problem is
equivalent to minimizing $E_2$ by (29.2) with
$\varepsilon_t = \sum_{\ell=1}^{k} p_{\ell,t} \|x_t - \mu_\ell\|^2$, i.e., the same
target that k-means aims at.

Computationally, (29.20) is a typical intractable binary
quadratic programming problem, for which various
approximate methods have been proposed. The simplest one is
dropping the constraint $H \ge 0$ to do a PCA analysis of the
matrix $W$. That is, the columns of $H$ consist of the $k$
eigenvectors of $W$ that correspond to the $k$ largest
eigenvalues. Then, each element of the matrix
$\mathrm{diag}[n_1^{0.5}, \ldots, n_k^{0.5}] H$ is chopped into 1 or 0
by a rule of thumb.

Another similarity measure is
$w_{ij} = \exp(-\|x_i - x_j\|^2)$, based on which we consider
dividing the nodes of a graph into two balanced sets $A, B$
such that the total sum of $w_{ij}$ associated with edges
connecting the two sets becomes as small as possible. Using
a vector $f = [f_1, \ldots, f_N]^T$ with $f_t = 1$ if $x_t \in A$
and $f_t = -1$ if $x_t \in B$, the problem is formulated as
follows

$$\min_f f^T L f , \quad \text{s.t.} \ f_t \in \{1, -1\} , \ [1, \ldots, 1] f = 0 ,$$
$$L = D - W , \quad D = \mathrm{diag}[w_1, \ldots, w_N] , \tag{29.21}$$

where $L$ is the graph Laplacian (in its standard
definition, $w_i = \sum_j w_{ij}$ is the degree of node $i$).
Again, it is an intractable combinatorial problem and needs
some approximation. A typical one is given as follows

$$\min_f \frac{f^T L f}{f^T f} , \quad \text{s.t.} \ [1, \ldots, 1] f = 0 . \tag{29.22}$$

Its solution $f$ is the eigenvector of $L$ that corresponds
to the second smallest eigenvalue. Moreover, this idea has
been extended to cutting a graph into multiple clusters,
which leads to approximately finding $H$ with its columns
being the eigenvectors of $\tilde{W} = D^{-0.5} W D^{-0.5}$,
corresponding to the $k$ largest eigenvalues.

Moreover, the above studies are closely related to
nonnegative matrix factorization (NMF) problems [29.73].
For example, the above problem can be equivalently
expressed as the factorization $W \approx H^T H$ by

$$\min_{H H^T = I,\ H \ge 0} \|W - H^T H\|^2 . \tag{29.23}$$

More generally, the NMF problem considers that

$$X \approx F H , \quad X \ge 0 , \ F \ge 0 , \ H \ge 0 , \tag{29.24}$$

as illustrated in Fig. 29.2b(C(1)). One typical method is
to iterate the following multiplicative update rule

$$H_{ij}^{new} = H_{ij}^{old} \frac{(F^T X)_{ij}}{(F^T F H)_{ij}} , \quad F_{ij}^{new} = F_{ij}^{old} \frac{(X H^T)_{ij}}{(F H H^T)_{ij}} , \tag{29.25}$$


which guarantees nonnegativity and is supposed to converge
to a local solution of the following minimization

$$\min_{H H^T = I,\ H \ge 0,\ F \ge 0} \|X - F H\|^2 . \tag{29.26}$$

Particularly, if we also impose the constraint $F^T F = I$,
the resulting $H$ divides the columns of $X$ into $k$
clusters, while the resulting $F$ also divides the rows of
$X$ into $k$ clusters; this is thus called bi-clustering
[29.74].
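The multiplicative update rule for $H$ and $F$ can be sketched as follows, assuming NumPy is available. The sketch deliberately ignores the orthogonality constraints on $H$ and $F$ and only enforces nonnegativity; the function name `nmf`, the small constant `eps` guarding against division by zero, and the random test matrix are assumptions made for this example.

```python
import numpy as np

def nmf(X, k, n_iter=500, eps=1e-12, seed=0):
    """Approximate a nonnegative X (d x N) by F (d x k) times H (k x N)
    using elementwise multiplicative updates."""
    rng = np.random.default_rng(seed)
    d, N = X.shape
    F = rng.random((d, k)) + 0.1    # strictly positive initialization
    H = rng.random((k, N)) + 0.1
    for _ in range(n_iter):
        # Each factor is rescaled elementwise, so it can never become negative.
        H *= (F.T @ X) / (F.T @ F @ H + eps)
        F *= (X @ H.T) / (F @ H @ H.T + eps)
    return F, H

# A nonnegative matrix of exact rank 2 as a toy input.
data_rng = np.random.default_rng(1)
X = data_rng.random((4, 2)) @ data_rng.random((2, 6))
F, H = nmf(X, k=2)
residual = np.linalg.norm(X - F @ H)
```

Because every update multiplies by a ratio of nonnegative quantities, nonnegativity of `F` and `H` is preserved automatically, which is the property noted in the text.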

Several NMF learning algorithms have been de- 
veloped in the literature. In [29.75], a binary matrix 
factorization (BMF) algorithm was developed under 
BYY harmony learning for clustering proteins that 
share similar interactions, featured with the nature of 
automatically determining the cluster number, while 
this number has to be pre-given for most existing BMF 
algorithms. 

In the past decade, the similarity graph and espe- 
cially the graph Laplacian L have also taken important 
roles in another popular topic called manifold learn- 
ing [29.76, 77]. Considering a mapping Y ~ WX, a lo- 
cality preserving projection is made to minimize the 
sum of each distance between two mapped points on the 
graph, subject to a unity L, norm of this projection WX. 

Alternatively, we may also regard that X is gener- 
ated via X = AY + E such that the topological depen- 
dence among Y is preserved by considering 


$$q(Y) \propto \exp\left(-\tfrac{1}{2} \mathrm{Tr}[Y L Y^T \Lambda^{-1}]\right) , \tag{29.27}$$

where $\Lambda$ is a positive diagonal matrix. Learning is
implemented by BYY harmony learning, during which automatic
model selection is made via updating $q(Y)$ to drive some
diagonal elements of $\Lambda$ towards zero.
For further details readers are referred to the end part of 
Sect. 5 in [29.58]. 


29.3.4 Tree-Based Learning: Temporal 
Ladder, Hierarchical Mixture, 
and Causal Tree 


Unsupervised learning also includes learning tempo- 
ral and hierarchical underlying dependence structures, 
as illustrated in Fig. 29.2b(D). Instead of directly 
modeling temporal dependence underlying data X = 
[x,,...,Xy], its structure is typically represented in 
a hidden space, while non-temporal or spatial depen- 
dence is represented by a relation from the hidden space 
to the space where X is observed, in the ladder structure 
illustrated in Fig. 29.2b(D(1)). 

One typical example is the classic hidden Markov model
(HMM). Its hidden space is featured by a discrete variable
that jumps between a set of discrete values or states
$\{s_i\}$, with temporal dependence described by the jumping
probabilities between the states, typically considering
$p(s_j | s_i)$ of jumping from one state $s_i$ to another
$s_j$. The relation from the hidden space to the space of
$X$ is described by $p(x_t | s_i)$, the probability that the
value of $x_t$ is emitted from the state $s_i$. Classically,
the values of $x_t$ are also a set of labels. The task is
learning from $X = [x_1, \ldots, x_N]$ two probability matrices
$Q = [p(s_j | s_i)]$ and $E = [p(x_t | s_i)]$. Given the number
of states, learning is typically implemented to maximize
the likelihood $p(X | Q, E)$ by the well-known Baum-Welch
algorithm.
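The likelihood $p(X | Q, E)$ that Baum-Welch maximizes can be evaluated with the standard forward recursion. The sketch below shows only this evaluation step, not the full Baum-Welch re-estimation; the initial state distribution `init` is an additional assumption not named in the text, and the numbers are made up for illustration.

```python
def hmm_likelihood(obs, init, Q, E):
    """Forward recursion for a discrete HMM.
    Q[i][j] = p(s_j | s_i), E[i][x] = p(x | s_i), init[i] = p(first state is s_i)."""
    n = len(init)
    # alpha[i] = p(observations so far, current state = s_i)
    alpha = [init[i] * E[i][obs[0]] for i in range(n)]
    for x in obs[1:]:
        alpha = [sum(alpha[i] * Q[i][j] for i in range(n)) * E[j][x]
                 for j in range(n)]
    return sum(alpha)

init = [0.6, 0.4]
Q = [[0.7, 0.3],
     [0.4, 0.6]]
E = [[0.9, 0.1],
     [0.2, 0.8]]
likelihood = hmm_likelihood([0, 1, 0], init, Q, E)
```

The recursion costs on the order of $N$ times the squared number of states, instead of summing over all state sequences explicitly.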

Another example is the classic state-space model (SSM),
which has been widely studied in the literature of control
theory and signal processing since the 1960s; it is also
called a linear dynamical system and has been considered
with renewed interest since the beginning of the 2000s. As
illustrated in Fig. 29.2b(D(2)), its hidden space is
featured by an $m$-dimensional subspace and temporal
dependence is described by one first-order vector
autoregressive model as follows

$$y_t = B y_{t-1} + \varepsilon_t , \quad E y_{t-1} \varepsilon_t^T = 0 ,$$
$$\varepsilon_t \sim G(\varepsilon_t | 0, \Lambda) , \quad \Lambda \ \text{is diagonal} , \tag{29.28}$$

while the spatial dependence is represented by a relation
between the coordinates of the state-space and the
coordinates of the space of $X$, e.g., typically by (29.4).

Though the EM algorithm has also been suggested for
learning the SSM parameters, the performance is usually
unsatisfactory because an SSM is generally not identifiable
due to an indeterminacy of any unknown nonsingular matrix,
similar to what was discussed previously with respect to
the FA in (29.5). Favorably, it has been shown that the
indeterminacy of not only any unknown nonsingular matrix
but also an unknown orthonormal matrix is usually removed
by additionally requiring a diagonal matrix $B$, which
leads to temporal factor analysis (TFA).

TFA is an extension of the FA by (29.4) with (29.6)
replaced by (29.28). As introduced in Sect. 29.3.1, the FA
is generalized into NFA when (29.6) becomes (29.5) with
each $p(y_t^{(j)})$ being non-Gaussian. The NFA with a real
vector $y_t$ can be further extended into a temporal NFA
when (29.5) is also extended by (29.28) with

$$p(\varepsilon_t) = p(\varepsilon_t^{(1)}) \cdots p(\varepsilon_t^{(m)}) .$$

Moreover, the BFA (i.e., NFA with a binary vector $y_t$)
can be extended into a temporal BFA. Also, TFA has been
extended into an integration of several TFA models
coordinated by an HMM. For further details readers are
referred to Sect. 5.2 of [29.58] for a recent overview on
TFA and its extensions.

A ladder is merely a special type of tree structure.
Hierarchical modeling is another type of tree structure,
as illustrated in Fig. 29.2b(D(3)). Again, the EM
algorithm has been extended to implement learning on
a hierarchical or tree mixture of Gaussians [29.78]. Also,
a learning algorithm is available for implementing BYY
harmony learning with the tree configuration determined
during learning. A learning algorithm for a three-level
hierarchical mixture of Gaussians is shown in Fig. 12
of [29.3], featured by a hierarchical learning flow
circling from bottom up as one step and then top down as
the other step. Similar to (29.16), where there is a term
$\Delta$ featuring the difference of BYY harmony learning
from EM learning, there is also such a $\Delta$ term on each
level of the hierarchy. If these $\Delta$ terms are set to
zero, the algorithm degenerates back to the EM algorithm.
For further details readers are referred to Sect. 5.1
in [29.3] and especially equation (55) therein.

Many applications consider several sets of samples. 
Each set is known to come from one model or pat- 
tern class. Typically, one does unsupervised learning on 
each set of samples by a hierarchical mixture of Gaus- 
sians, and then integrates individual hierarchical models 
in a supervised way to form a classifier. 
Alternatively, we may put together all the individual
hierarchical mixtures with each as a branch of one higher
level root of a tree, and then do learning as shown in
Fig. 12 of [29.3]. The BYY harmony learning algorithm
(including the EM algorithm as its degenerate case) for
learning a two-level hierarchical mixture of Gaussians is
shown in Sect. 5.3 of [29.58], and especially Fig. 11
therein. This type of learning can be regarded as
semi-supervised learning in the sense that each sample has
two teaching labels. One is known, indicating which
individual hierarchy $x_t$ comes from, while the other is
unknown and to be determined, indicating which Gaussian
component $x_t$ comes from. More generally, this type of
learning provides a general formulation that involves the
multi-label classification of Sect. 29.2.2 (especially
labels with a hierarchy).

There are also real applications that consider a com- 
bination of ladder structures and hierarchical structures. 
For example, what is widely used in speech processing 
is an HMM model with each hidden state associated 
with a two-level hierarchical Gaussian mixture as illus- 
trated in Fig. 11 of [29.58]. Also, extensions are made 
with each Gaussian mixture replaced by a mixture of lo- 
cal subspaces or FA or NFA models. For further details 
readers are referred to Sect. 5.3 and Fig. 14 of [29.3]. 
Another example is considering a two-level hierarchi- 
cal model with both HMM for modeling nonstationary 
temporal dependence and TFA for modeling stationary 
temporal dependence. For further details readers are re- 
ferred to Sect. 5.2.2 of [29.58]. 


Another typical tree structure, as illustrated in
Fig. 29.2b(D(4)), is learning a probabilistic tree, i.e.,
a joint distribution of a set of variables on a tree with
one node per variable. The most well-known study is
structuring such tree models for a given set of bi-valued
variables, as done by Pearl in 1986 [29.79]. Following
this, one study in 1987 [29.80] extends this to construct
tree representations of continuous variables. It has been
proved that the tree can be structured from the
correlations observed between pairs of variables if the
visible variables are governed by a tree decomposable joint
normal distribution. Moreover, the conditions for a tree
decomposable normal distribution are less restrictive than
those of bi-valued variables.

Nowadays, many advances have been made along this line.
Some of the basic results, e.g., (29.15) and (29.17)
in [29.80], have become a widely used technique in network
construction for detecting whether an edge describes
a direct link or a duplicated indirect link. For example,
considering the association between two nodes $i, j$ linked
to a third node $w$ with the correlation coefficients
$\rho_{iw}$ and $\rho_{wj}$, we can remove the link $i, j$ if its
correlation coefficient $\rho_{ij}$ fails to satisfy

$$\rho_{ij} > \rho_{iw} \rho_{wj} . \tag{29.29}$$

Otherwise we may either choose to keep the link $i, j$, or
let the three nodes be linked to a newly added node and
then remove all the original links among the three nodes.


29.4 Reinforcement Learning


Unlike unsupervised learning, reinforcement learning gets
guidance from external evaluation. Also, unlike supervised
learning, in which the teacher clearly specifies the output
that corresponds to an input, reinforcement learning is
only provided with an evaluative value about the action
made. Furthermore, reinforcement learning is featured by
a dynamic process in discrete time steps. At each step,
upon observing the current environment and getting some
input (if any), the learner makes an action and moves to
a new state, receiving a reward or punishment value for the
action. The aim is to maximize the total reward received.

This section provides a brief tutorial on the basic issues
of reinforcement learning, especially TD learning and
Q-learning. Then, improvements on Q-learning are proposed
by replacing its built-in winner-take-all competition
mechanism with some unsupervised learning methods. For
further reading readers are referred to tutorials and
reviews in [29.6, 81, 82].


29.4.1 Markov Decision Processes 


Reinforcement learning is closely related to Markov
decision processes (MDPs), which consist of a series of
states $s_0, s_1, \ldots, s_t, s_{t+1}, \ldots$. At a state $s_t$, an
action $a_t = \pi(s_t)$ is selected from the set $A$ of
actions according to a policy $\pi$, which makes the
environment move to a new state $s_{t+1}$, and the reward
$r_{t+1}$ associated with the transition
$(s_t, a_t, s_{t+1})$ is received. The goal is to collect as
much reward as possible, that is, to maximize the total
reward or return

$$R = \sum_{t=0}^{N-1} r_{t+1} ,$$

where $N$ denotes the random time when a terminal state is
reached. In the case of nonepisodic problems the return
$R = \sum_{t=0}^{\infty} \gamma^t r_{t+1}$ is considered with
a discount factor $0 < \gamma < 1$.

Given an initial distribution based on which the
initial state is sampled at random, we can assign the
expected return $E[R|\pi]$ to policy $\pi$. Since the actions
are selected according to $\pi$, the task is to specify
an algorithm that can be used to find a policy $\pi$ that
maximizes $E[R]$. If we know the state transition
probability $p_a(s'|s) = P(s_{t+1} = s'|s_t = s)$ and the corresponding
reward $r_{t+1} = R_a(s'|s)$, the standard family
of algorithms for calculating this optimal policy
iterates the following two steps

$$\begin{aligned}
(1)\quad & \pi(s) = \arg\max_{\pi} V^{\pi}(s), \\
(2)\quad & V^{\pi}(s) = \sum_{s'} p_{\pi(s)}(s'|s)\,\bigl[R_{\pi(s)}(s'|s) + \gamma V^{\pi}(s')\bigr],
\end{aligned} \tag{29.30}$$

with $V^{\pi}(s)$ estimating $E[R|s, \pi]$. The iteration can be
made in one of several variants as follows:


• Doing step (1) once and then repeating step (2) several
  times or until it converges. Then step (1) is done
  once again, and so on.

• Doing step (2) by solving a set of linear equations.

• Substituting the calculation of $\pi(s)$ into the calculation
  of $V^*(s) = \max_{\pi} V^{\pi}(s)$, resulting in a combined
  step

  $$V^*(s) = \max_a \sum_{s'} p_a(s'|s)\,\bigl[R_a(s'|s) + \gamma V^*(s')\bigr], \tag{29.31}$$

  which is called backward induction and is iterated
  for all states until it converges to what is called the
  Bellman equation.

• Preferentially applying the steps to states that are in
  some way of importance.


Under some mild regularity conditions, all the implementations
will reach a policy that achieves these
optimal values of $V^*(s) = \max_{\pi} V^{\pi}(s)$ and thus also
maximizes the expected return $E[V^{\pi}(s)]$, where s is
a state that is randomly sampled from the underlying
distribution.
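
The combined step (29.31) can be sketched as follows. This is a minimal illustration on a two-state toy problem; the nested-dictionary layout for $p_a(s'|s)$ and $R_a(s'|s)$, the function names, and the numbers are all assumptions made for the example.

```python
def value_iteration(states, actions, p, r, gamma, tol=1e-10):
    """Iterate the combined step (29.31) until the values stop changing.

    p[a][s] is a dict {s_next: probability}; r[a][s][s_next] is the reward.
    """
    v = {s: 0.0 for s in states}
    while True:
        new_v = {
            s: max(
                sum(prob * (r[a][s][s2] + gamma * v[s2])
                    for s2, prob in p[a][s].items())
                for a in actions
            )
            for s in states
        }
        if max(abs(new_v[s] - v[s]) for s in states) < tol:
            return new_v
        v = new_v


def greedy_policy(states, actions, p, r, gamma, v):
    """Read off the action that attains the maximum in (29.31) at each state."""
    def backup(s, a):
        return sum(prob * (r[a][s][s2] + gamma * v[s2])
                   for s2, prob in p[a][s].items())
    return {s: max(actions, key=lambda a: backup(s, a)) for s in states}


# Toy deterministic MDP (hypothetical numbers).
states, actions = ["A", "B"], ["stay", "move"]
p = {"stay": {"A": {"A": 1.0}, "B": {"B": 1.0}},
     "move": {"A": {"B": 1.0}, "B": {"A": 1.0}}}
r = {"stay": {"A": {"A": 0.0}, "B": {"B": 2.0}},
     "move": {"A": {"B": 1.0}, "B": {"A": 0.0}}}
v_star = value_iteration(states, actions, p, r, gamma=0.5)
policy = greedy_policy(states, actions, p, r, 0.5, v_star)
```

For this toy problem the fixed point can be checked by hand: staying at B gives $V^*(B) = 2 + 0.5\,V^*(B) = 4$, and moving from A gives $V^*(A) = 1 + 0.5 \cdot 4 = 3$.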

In the implementation of MDPs we need to know
the probability $p_a(s'|s)$ per action a. Reinforcement
learning avoids obtaining this $p_a(s'|s)$ with the help
of stochastic approximation. The two most popular
examples are temporal difference (TD) learning and Q- 
learning, respectively. The name TD derives from its 
use of differences in predictions over successive time 
steps to drive learning, while the name Q comes from 
its use of a function that calculates the quality of a state- 
action combination. 


29.4.2 TD Learning and Q-Learning 


TD learning aims at predicting a measure of the total
amount of reward expected over the future. At time
t, we seek an estimate $\Pi_t$ of $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$ with
$0 < \gamma < 1$. Each estimate is a prediction because it involves
future values of r. We can write $\Pi_t = \Pi(s_t)$,
where $\Pi$ is a prediction function. The prediction at any
given time step is updated to bring it closer to the prediction
of the same quantity at the next step, based on
the error correction $\delta_{t+1} = R_t - \Pi_t(s_t)$. To obtain $R_t$ exactly
requires waiting for the arrival of all the future
values of r. Instead, we use $R_t = r_{t+1} + \gamma R_{t+1}$ with
$\Pi_t(s_{t+1})$ as an estimate of $R_{t+1}$ available at step t, that
is, we have

$$\delta_{t+1} = r_{t+1} + \gamma\,\Pi_t(s_{t+1}) - \Pi_t(s_t), \tag{29.32}$$

which is termed the temporal difference error (or TD
error).

The simplest TD algorithm updates the prediction
function $\Pi_t$ at step t into a new prediction function
$\Pi_{t+1}$ as follows

$$\Pi_{t+1}(x) = \begin{cases} \Pi_t(x) + \eta\,\delta_{t+1} & \text{if } x = s_t, \\ \Pi_t(x) & \text{otherwise}, \end{cases} \tag{29.33}$$

where $\eta$ is a learning step size and x denotes any possible
input signal. The simplest format is a prediction
function implemented as a lookup table. Suppose that
$s_t$ takes only a finite number of values and that there
is an entry in a lookup table to store a prediction for
each of these values. At step t, the state $s_t$ moves to the
next $s_{t+1}$ based on the current status of the table, e.g.,
the table entry for $s_{t+1}$ is the largest across the table, or
$s_{t+1}$ is selected according to a fixed policy. When $r_{t+1}$
is observed, only the table entry for $s_t$ changes from its
current value of $\Pi_t = \Pi_t(s_t)$ to $\Pi_t(s_t) + \eta\,\delta_{t+1}$.

The algorithm uses a prediction of a later quantity
$\Pi_t(s_{t+1})$ to update a prediction of an earlier quantity
$\Pi_t(s_t)$. As learning proceeds, later predictions tend to
become accurate sooner than earlier ones, resulting in
an overall error reduction. This depends on whether an
input sequence has sufficient regularity to make predicting
possible. When $s_t$ comes from the states of
a Markov chain, on which the r values are given by
a function of these states, a prediction function may
exist that accurately gives the expected value of the
quantity $R_t$ for each t.

Another view of the TD algorithm is that it operates
to maintain the following consistency condition

$$\Pi(s_t) = r_{t+1} + \gamma\,\Pi(s_{t+1}),$$

which must be satisfied by correct predictions. By the
theory of Markov decision processes, any function that
satisfies $R_t = r_{t+1} + \gamma R_{t+1}$ for all t must actually give
the correct predictions. The TD error indicates how
far the current prediction function deviates from this
condition, and the algorithm acts to reduce this error.
Actually, $\Pi_t(s_t) + \eta\,\delta_{t+1} = (1 - \eta)\,\Pi_t(s_t) + \eta\,[r_{t+1} + \gamma\,\Pi_t(s_{t+1})]$
is a type of stochastic
approximation to the value function in (29.30), without
requiring direct knowledge of the probability $p_a(s'|s)$.
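
A lookup-table version of the update (29.33) driven by the TD error (29.32) can be sketched as follows. The three-state chain, its rewards, and the step size are hypothetical values chosen so that the correct predictions are easy to verify by hand.

```python
def td0_update(table, s, r_next, s_next, eta, gamma):
    """Apply (29.33) to the entry for s and return the TD error (29.32)."""
    delta = r_next + gamma * table[s_next] - table[s]
    table[s] += eta * delta
    return delta


# Hypothetical deterministic chain: a -> b -> end, reward 1 on the last step.
episode = [("a", 0.0, "b"), ("b", 1.0, "end")]
table = {"a": 0.0, "b": 0.0, "end": 0.0}
for _ in range(500):
    for s, r_next, s_next in episode:
        td0_update(table, s, r_next, s_next, eta=0.1, gamma=0.5)
```

With a discount factor of 0.5 the consistent predictions are 1 for state b and 0.5 for state a. The entry for b settles first and the entry for a follows, which mirrors the remark that later predictions tend to become accurate sooner than earlier ones.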

Alternatively, Q-learning calculates the quality of
a state-action combination, i.e., estimating $Q(s_t, a_t)$ of
$R_t$ conditionally on the action $a_t$ at $s_t$. The implementation
of Q-learning consists of

$$a_t = \arg\max_{a \in A} Q_t(s_t, a), \tag{29.34}$$

$$\begin{aligned}
Q_{t+1}(x, a) &= \begin{cases} Q_t(x, a) + \eta\,\delta_{t+1}(a) & \text{if } x = s_t,\ a = a_t, \\ Q_t(x, a) & \text{otherwise}, \end{cases} \\
\delta_{t+1}(a) &= r(s_t, a) + \gamma \max_{c \in A} Q_t(s_{t+1}, c) - Q_t(s_t, a).
\end{aligned} \tag{29.35}$$

At $s_t$, an action $a_t \in A$ is obtained by an easy computation,
and the system then moves to a new state $s_{t+1}$.
On receiving the reward $r_{t+1} = r(s_t, a_t)$ associated with
the transition $(s_t, a_t, s_{t+1})$, only the table entry for $s_t$
and $a_t$ is updated.

The format of $\delta_{t+1}$ is similar to the one in (29.32)
with the prediction $\Pi_t(s_t)$ replaced by $Q_t(s_t, a_t)$ and
$\Pi_t(s_{t+1})$ replaced by $\max_a Q_t(s_{t+1}, a)$. Alternatively,
we may select $a_{t+1}$ by a fixed policy and
then obtain $\delta_{t+1}$ with $\max_a Q_t(s_{t+1}, a)$ replaced
by $Q_t(s_{t+1}, a_{t+1})$, which leads to a variant of the
Q-learning rule called state-action-reward-state-action
(SARSA). Under some mild regularity conditions, similarly
to TD learning, both Q-learning and SARSA
converge to prediction functions that make optimal action
choices.
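
The difference between the two error terms can be made concrete with a short sketch. The table is a plain dictionary keyed by (state, action); the state and action names, the reward, and the step size are hypothetical values for illustration.

```python
def greedy_action(q, s, actions):
    """Action selection of (29.34): pick the action with the largest Q-value."""
    return max(actions, key=lambda a: q[(s, a)])


def q_learning_delta(q, s, a, reward, s_next, actions, gamma):
    """TD error of (29.35), which looks ahead with the maximum over actions."""
    return reward + gamma * max(q[(s_next, c)] for c in actions) - q[(s, a)]


def sarsa_delta(q, s, a, reward, s_next, a_next, gamma):
    """SARSA variant: look ahead with the action actually chosen at s_next."""
    return reward + gamma * q[(s_next, a_next)] - q[(s, a)]


actions = ["L", "R"]
q = {("s0", "L"): 0.0, ("s0", "R"): 0.0, ("s1", "L"): 1.0, ("s1", "R"): 3.0}
best_at_s1 = greedy_action(q, "s1", actions)
d_q = q_learning_delta(q, "s0", "R", 1.0, "s1", actions, gamma=0.5)
d_sarsa = sarsa_delta(q, "s0", "R", 1.0, "s1", "L", gamma=0.5)
q[("s0", "R")] += 0.2 * d_q  # only the entry for (s_t, a_t) changes
```

Here Q-learning uses the larger look-ahead value 3.0, whereas SARSA uses the value 1.0 of the action that the (assumed) fixed policy picks at the next state.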

Both TD learning and Q-learning have variants and 
extensions. In the following, we briefly summarize two 
typical streams: 


• In (29.33) and (29.35), only the table entry for $s_t$
  is modified, though $r_{t+1}$ provides useful information
  for learning earlier predictions as well. Under
  the name of eligibility traces, an exponentially decaying
  memory trace is provided on a number of
  previous input signals so that each new observation
  can update the parameters related to these
  signals.

• Instead of a lookup table, the prediction function
  can be replaced by a more advanced prediction
  function. It could be a linear or nonlinear
  regression function $F_t(\omega_t, \theta)$ with input
  signals $\omega_t = \{x^{(\tau)}\}$. Each $x^{(\tau)}$ could be
  either a state or an action or even one additional
  feature around a state in one eligibility
  trace, where $\tau$ can be different from t. Then,
  learning adjusts $\theta$ to reduce the error $\delta_{t+1}$ or
  $\delta_{t+1}(a_t)$.


29.4.3 Improving Q-Learning 
by Unsupervised Methods 


Examining Q-learning by (29.35), we observe that
it shares some common features with the multi-model-based
learning introduced in Sect. 29.3. For a set A of
finitely many actions, we use the index $\ell = 1, \ldots, k$ to
denote each action. Obtaining $a_t$ in (29.35) is equivalent
to obtaining $p_{\ell,t}$ in (29.9) with $\varepsilon_{j,t} = -Q_t(s_t, j)$,
that is, a selection is made by winner-take-all (WTA)
competition. Then, updating $Q_t(s_t, a)$ in (29.35) can be
rewritten as follows

$$\begin{aligned}
Q_{t+1}(s_t, \ell) &= Q_t(s_t, \ell) + \eta\, p_{\ell,t}\, \delta_{t+1}(\ell), \\
Q_{t+1}(s, \ell) &= Q_t(s, \ell), \quad \text{for } s \ne s_t,
\end{aligned} \tag{29.36}$$

which is similar to the general updating rule by (29.17),
with $p_{\ell,t}$ selecting which column of the Q table to update.
This motivates the following improvements on
Q-learning, based on the multi-model-based learning
methods in Sect. 29.3.2.

First, the WTA selection of the above $p_{\ell,t}$ can be
replaced by an estimation of the posteriori probabilities
as follows

$$p_{\ell,t} = q(\ell|s_t), \qquad q(\ell|s) = \frac{e^{Q_t(s,\ell)}}{\sum_{j=1}^{k} e^{Q_t(s,j)}}. \tag{29.37}$$
Putting this into (29.36), we improve the weak points
incurred from a WTA competition by updating all the
columns of the Q table with the weights by $p_{\ell,t}$ as
a counterpart of (29.12) of the well-known EM algorithm
for learning a finite mixture.
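
The weighted update of all columns can be sketched as follows, under the assumption that the posteriors take the softmax form over the Q-values of the current state. The function names and the error values passed in are hypothetical; in particular, how an error term is obtained for actions that were not taken is left open here.

```python
import math


def action_posteriors(q_row):
    """Softmax weights over the Q-values of one state (assumed form of q(l|s))."""
    top = max(q_row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in q_row]
    total = sum(exps)
    return [e / total for e in exps]


def soft_q_update(q_row, deltas, eta):
    """Update every column of one row, weighted by its posterior.

    deltas[l] is the error term for action l; supplying one for each
    action is an assumption of this sketch.
    """
    weights = action_posteriors(q_row)
    return [q + eta * w * d for q, w, d in zip(q_row, weights, deltas)]


weights = action_posteriors([0.0, 0.0])
new_row = soft_q_update([0.0, 0.0], [1.0, -1.0], eta=0.2)
```

With equal Q-values both actions receive weight 0.5, so each column moves by half of the step it would receive under a WTA selection.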

Second, $\delta_{t+1}(a)$ in (29.35) uses $\max_a Q_t(s_{t+1}, a)$
as a prediction of the Q-value at $s_{t+1}$, also by a WTA
competition that gives an optimistic choice. Alternatively,
we can use the following more reliable one

$$\begin{aligned}
\delta_{t+1}(s_t, a_t) &= r(s_t, a_t) + \gamma\,\Delta_{\pi,t}(s_{t+1}), \\
\Delta_{\pi,t}(s) &= \sum_j q(j|s)\,Q_t(s, j) - Q_t(s_t, a_t),
\end{aligned} \tag{29.38}$$

where $q(j|s_{t+1})$ is given by (29.37) with $s = s_{t+1}$ instead
of $s = s_t$, to obtain $\Delta_{\pi,t}(s_{t+1})$ with $q(\ell|s)$.

Third, instead of $p_{\ell,t}$ given by (29.37), we may also
use a counterpart of (29.16) to implement Q-learning
with the help of BYY harmony learning. That is, we consider

$$p_{\ell,t} = q(\ell|s_t)\,[1 + \Delta_{\pi,t}(s_t)], \tag{29.39}$$


by which an action is encouraged when its value is 
higher than the average of all actions, while an action 
is discouraged when its value is below the average but 
still not too far away, and then is repelled when its value 
is far below this average. 


29.5 Semi-Supervised Learning 


In many real tasks it is easy to obtain a large amount 
of unlabeled training data but labeling them is expen- 
sive because of the requirement of great human effort 
and expertise or high execution cost. Semi-supervised 
learning [29.83—-86] attempts to exploit unlabeled data 
to help improve the learning performance without as- 
suming human intervention. In situations where the 
unlabeled data are exactly the test data, it is also called 
transductive learning [29.87]. 

Figure 29.3 illustrates why unlabeled data (gray 
points) can be helpful. It can be seen that although both 
classification boundaries are consistent with labeled 
data, the boundary obtained by considering unlabeled 
data is better in generalization. One reason is that the 
unlabeled data can disclose some information about 
data distribution which is helpful for model construc- 
tion. 

There are two popular assumptions connecting the 
distribution information disclosed by unlabeled data 
with label information. The cluster assumption assumes 


Moreover, we may simplify the above $p_{\ell,t}$ by focusing
on a few major actions, e.g., the winning action
$a_t = \arg\max_{a \in A} Q_t(s_t, a)$ and its rival action, similarly
to rival penalized competitive learning (RPCL), with $p_{\ell,t}$
given as follows

$$p_{\ell,t} = \begin{cases} 1, & \ell = \ell^* = \arg\max_j Q_t(s_t, j), \\ -\gamma, & \ell = \arg\max_{j \ne \ell^*} Q_t(s_t, j), \\ 0, & \text{otherwise}, \end{cases} \tag{29.40}$$

i.e., the winning action is encouraged while its rival is
repelled.

BYY harmony learning and RPCL learning lead to
discriminative Q-learning, by which actions at each state
become more discriminative and thus easier to select.
As a result, confusing branches in a search
tree will be pruned away. Moreover, we may discard
one extra action if we observe that its corresponding


$$\alpha_\ell^{\text{new}} = (1 - \eta)\,\alpha_\ell^{\text{old}} + \eta\, q(\ell|s_t) \tag{29.41}$$


is pushed to zero. Actually, this is the nature of auto- 
matic model selection, which controls the complexity 
of function Q(s, j). 


that data with similar inputs have similar class la- 
bels; the manifold assumption assumes that data live 
in a low-dimensional manifold, whereas unlabeled data 
can help to identify that manifold. The latter can be re- 
garded as a generalization of the former because it is 
usually assumed that the cluster structure of the data 
will be more easily found in the lower-dimensional 


a) Without unlabeled data b) With unlabeled data 


Fig. 29.3a,b Illustration of the usefulness of unlabeled 
data 




manifold. These assumptions are closely related to low- 
density separation, which specifies that the boundary 
should not go across high-density regions in the in- 
stance space. 

Many semi-supervised learning methods have been
developed. Roughly speaking, they can be grouped
into four categories. In generative methods, both
labeled and unlabeled data are assumed to be generated
by the same model, and thus, the unlabeled
data can be exploited to model the label estimation
or parameter estimation process. For example, if we
assume the data come from a mixture model with T
components, i.e.,

$$f(x|\theta) = \sum_{t=1}^{T} \alpha_t f(x|\theta_t), \tag{29.42}$$

where $\alpha_t$ is the mixing coefficient and $\theta = \{\theta_t\}$ are the
model parameters, then label $c_i$ can be determined by
the mixture component $m_i$ and the instance $x_i$ according
to the maximum a posteriori criterion

$$\arg\max_k \sum_j P(c_i = k|m_i = j, x_i)\, P(m_i = j|x_i), \tag{29.43}$$

where estimating $P(c_i = k|m_i = j, x_i)$ requires label information,
but unlabeled data can be used to help
estimate $P(m_i = j|x_i)$, and hence improve the learning
performance. Actually, the posteriori probability is
equivalently given by a mixture of experts that will be
further addressed next in Sect. 29.6.1.
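
The criterion (29.43) amounts to a weighted vote over mixture components, which a short sketch makes explicit. The probability tables below are made-up numbers, and the list-of-lists layout is an assumption of this example rather than anything prescribed by the text.

```python
def map_label(p_label_given_comp, p_comp_given_x):
    """Evaluate the sum in (29.43) for each label and return the best one.

    p_label_given_comp[j][k] stands for P(c = k | m = j, x);
    p_comp_given_x[j]        stands for P(m = j | x).
    """
    n_labels = len(p_label_given_comp[0])
    scores = [
        sum(p_comp_given_x[j] * p_label_given_comp[j][k]
            for j in range(len(p_comp_given_x)))
        for k in range(n_labels)
    ]
    best = max(range(n_labels), key=lambda k: scores[k])
    return best, scores


# Two components, two labels (hypothetical probabilities).
p_label_given_comp = [[0.9, 0.1],   # component 0 mostly carries label 0
                      [0.2, 0.8]]   # component 1 mostly carries label 1
p_comp_given_x = [0.3, 0.7]         # this part can be estimated from unlabeled data
label, scores = map_label(p_label_given_comp, p_comp_given_x)
```

The instance is assigned label 1, since 0.3 x 0.1 + 0.7 x 0.8 = 0.59 exceeds 0.3 x 0.9 + 0.7 x 0.2 = 0.41.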

In semi-supervised support vector machines 
(S3VM), unlabeled data are used directly to help 
adjust the decision boundary, as illustrated in Fig. 29.3. 
Given $l$ labeled examples and $u$ unlabeled instances,
the goal is usually accomplished by minimizing an 


29.6 Ensemble Methods 
29.6.1 Basic Concepts 


Ordinary learning methods try to construct one learner 
from training data, whereas ensemble methods [29.14] 
try to construct a set of learners and combine them 
to solve the problem. Such kinds of learning meth- 
ods are also called committee-based learning, meta- 
learning, or multiple classifier systems, although en- 
semble methods have also been found to be helpful in 


objective

$$\frac{1}{2}\|w\|_{\mathcal{H}}^2 + C_1 \sum_{i=1}^{l} \ell(y_i, f(x_i)) + C_2 \sum_{j=1}^{u} \ell(\hat{y}_j, f(\hat{x}_j)), \tag{29.44}$$

where the first term is structural risk, the second
term is empirical risk on the labeled data $(x_i, y_i)$, the
third term is empirical risk on the unlabeled instances
$\hat{x}_j$ ($j = 1, \ldots, u$) and the estimated outputs $\hat{y}_j$, whereas
$C_1$/$C_2$ balance the contribution of labeled/unlabeled
data.

Graph-based methods construct a graph whose 
nodes are the training instances (both labeled and un- 
labeled), and the edges between nodes reflect a cer- 
tain relation, such as similarity, between the corre- 
sponding examples. Then, the learning process is ac- 
complished by propagating label information on the 
graph. 
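
One common way of propagating label information on a graph is to clamp the labeled nodes and repeatedly replace each unlabeled node's score by the weighted average of its neighbors. The sketch below shows this particular scheme on a four-node chain; it is one instance of the graph-based family, not the only one, and the graph, weights, and function name are assumptions made for illustration.

```python
def propagate_labels(weights, labels, n_iter=100):
    """Iteratively average neighbor scores while keeping labeled nodes fixed.

    weights : dict node -> dict neighbor -> similarity (symmetric graph)
    labels  : dict node -> +1 or -1 for the labeled nodes only
    """
    scores = {n: float(labels.get(n, 0.0)) for n in weights}
    for _ in range(n_iter):
        new_scores = {}
        for node, nbrs in weights.items():
            if node in labels:
                new_scores[node] = float(labels[node])  # clamp known labels
            else:
                total = sum(nbrs.values())
                new_scores[node] = sum(w * scores[m] for m, w in nbrs.items()) / total
        scores = new_scores
    return scores


# Chain a - b - c - d with unit similarities; only the end points are labeled.
weights = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 1.0},
           "c": {"b": 1.0, "d": 1.0}, "d": {"c": 1.0}}
scores = propagate_labels(weights, {"a": +1, "d": -1})
predicted = {n: (1 if s >= 0 else -1) for n, s in scores.items()}
```

On this chain the scores settle at 1/3 for b and -1/3 for c, so each unlabeled node takes the label of the labeled node it is closer to.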

Disagreement-based methods generate multiple 
learners and exploit the disagreements among the learn- 
ers, where unlabeled data serve as a kind of platform 
for information exchange; if one learner is much more 
confident on a disagreed unlabeled instance than other 
learner(s), then it will teach other(s) by assigning a pre- 
dicted pseudo-label to the instance. A representative of 
this category is co-training [29.31], which constructs 
two learners from two different views, and thus is 
closely related to multi-view learning. 

In addition to classification, semi-supervised regres- 
sion, dimension reduction, clustering, etc., have also 
been well studied. It is worth mentioning that exploiting 
unlabeled data does not always improve the perfor- 
mance, and sometimes the performance may be even 
worse than using only the labeled data. Some recent 
studies have tried to address this issue under the name 
of safe semi-supervised learning [29.88]. 


clustering [29.14, 89,90] and various tasks other than 
classification. 

Figure 29.4 shows a common ensemble architecture.
An ensemble contains a number of base learners,
also called individual learners, component learners, or weak
learners, because the main purpose of ensemble methods
is to generate strong learners by combining learners
whose generalization performances are not strong.
Base learners can be generated by a base learning
algorithm, such as a decision tree algorithm, a neural
network algorithm, etc., and such ensembles are
called homogeneous ensembles because they contain
homogeneous base learners. An ensemble can also be
heterogeneous if multiple types of base learners are
included.

The generalization ability of an ensemble is often
much stronger than that of base learners. Roughly, there
are three threads of studies that lead to the state of
the art of ensemble methods. The combining classifiers
thread was mostly studied in the pattern recognition
community, where researchers usually focused on the
design of powerful combining rules to obtain a strong
combined classifier [29.28, 29]. The mixture of experts
thread generally considered a divide-and-conquer strategy,
trying to learn a mixture of parametric models
jointly [29.91]. Equation (29.43) is actually a mixture of
experts for classification, with $P(c_i = k|m_i = j, x_i)$ being
the individual expert and $P(m_i = j|x_i)$ the gating net,
especially for the one given in [29.72] where the gating
net is given by the posteriori of $\alpha_t f(x|\theta_t)$ in (29.42). The
ensembles of weak learners thread often works on weak
learners and tries to design powerful algorithms to boost
performance from weak to strong. Readers are referred
to [29.30] for a recent survey on combining classifiers
and mixture-of-experts as well as their relations, and to
Sect. 5 of [29.86] for a brief overview on all the three
threads.

Generally, an ensemble is built in two steps; that 
is, generating the base learners and then combining 
them. It is worth noting that the computational cost 
of constructing an ensemble is often not much larger 
than creating a single learner. This is because when 
using a single learner, one usually has to generate 
multiple versions of the learner for model selection 
or parameter tuning; this is comparable to generat- 
ing multiple base learners in ensembles, whereas the 
computational cost for combining base learners is 
often small because most combining rules are sim- 
ple. 

The term boosting refers to a family of algorithms
originated in [29.92], with AdaBoost [29.93] as its representative.
This kind of algorithm is provably
able to convert weak learners that are just slightly better
than random guess to strong learners that have nearly
perfect performance.

Fig. 29.4 A common ensemble architecture

Algorithm 29.1 shows the pseudo-code of 
AdaBoost. Roughly speaking, the basic idea of 
boosting is to let later learners try to correct the mis- 
takes made by earlier learners, and this is accomplished 
by deriving in each round a new data distribution which 
makes the earlier mistakes more evident. The base 
learners should be able to learn with specific distribu- 
tions; this is usually accomplished by re-weighting or 
re-sampling the training examples according to the data 
distribution in each round. Such a learning process is 
very similar to residual minimization, and it has a close 
relation to additive models, inspiring an interpretation 
that AdaBoost is a stagewise estimation procedure for 
fitting an additive logistic regression model with an 
exponential loss [29.94]. Notice that AdaBoost was 
designed for binary classification, but it has many 
variants for multi-class problems [29.93, 95, 96]. 

It has been proved [29.93] that the generalization
error of AdaBoost is upper bounded by

$$\epsilon \le \epsilon_D + \tilde{O}\left(\sqrt{\frac{dT}{m}}\right), \tag{29.45}$$

with probability at least $1 - \delta$, where $\epsilon_D$ is the error on
the training sample D, d is the VC-dimension of base
learners, m is the number of training samples, and $\tilde{O}(\cdot)$
is used instead of $O(\cdot)$ to hide logarithmic terms and
constant factors. This generalization bound implies that
the complexity d of base learners and the number T of
learning rounds need to be constrained; otherwise AdaBoost
will overfit. Empirical studies, however, show
that AdaBoost often seems resistant to overfitting; that
is, the test error often tends to decrease even after the
training error reaches zero.


Algorithm 29.1 The AdaBoost Algorithm
Input: Data set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$;
       Base learning algorithm $\mathfrak{L}$;
       Number of learning rounds T.
Process:
1: $D_1(x) = 1/m$. % initialize the weight distribution
2: for t = 1, ..., T:
3:   $h_t = \mathfrak{L}(D, D_t)$; % train a classifier $h_t$ from D under distribution $D_t$
4:   $\epsilon_t = P_{x \sim D_t}(h_t(x) \ne f(x))$; % evaluate the error of $h_t$
5:   if $\epsilon_t > 0.5$ then break
6:   $\alpha_t = \frac{1}{2}\ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$; % determine the weight of $h_t$
7:   $D_{t+1}(x) = \frac{D_t(x)\exp(-\alpha_t f(x) h_t(x))}{Z_t}$; % update the distribution, where $Z_t$ is
     % a normalization factor which enables $D_{t+1}$ to be a distribution
8: end
Output: $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
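
The steps of Algorithm 29.1 can be traced in a small runnable sketch. The base learner here is a one-dimensional threshold classifier (a decision stump) chosen for brevity; that choice, the function names, the toy data set, and the small guard against a zero error are assumptions of this example and not part of the algorithm statement.

```python
import math


def train_stump(xs, ys, dist):
    """Pick the threshold and sign with the smallest weighted error under dist."""
    values = sorted(set(xs))
    thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for thr in thresholds:
        for sign in (+1, -1):
            err = sum(w for x, y, w in zip(xs, ys, dist)
                      if (sign if x > thr else -sign) != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    _, thr, sign = best
    return lambda x: sign if x > thr else -sign


def adaboost(xs, ys, rounds):
    m = len(xs)
    dist = [1.0 / m] * m                                  # step 1
    ensemble = []
    for _ in range(rounds):                               # step 2
        h = train_stump(xs, ys, dist)                     # step 3
        eps = sum(w for x, y, w in zip(xs, ys, dist) if h(x) != y)  # step 4
        if eps > 0.5:                                     # step 5
            break
        eps = max(eps, 1e-12)  # guard added here so the logarithm stays finite
        alpha = 0.5 * math.log((1 - eps) / eps)           # step 6
        ensemble.append((alpha, h))
        dist = [w * math.exp(-alpha * y * h(x)) for x, y, w in zip(xs, ys, dist)]
        z = sum(dist)                                     # step 7: normalize
        dist = [w / z for w in dist]
    return ensemble


def predict(ensemble, x):
    return 1 if sum(alpha * h(x) for alpha, h in ensemble) >= 0 else -1


# Toy data that no single stump can classify correctly.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, rounds=3)
train_errors = sum(predict(ensemble, x) != y for x, y in zip(xs, ys))
```

On this toy set each individual stump misclassifies at least three of the ten points under the uniform distribution, yet the weighted vote of three stumps fits all of them, which illustrates how re-weighting lets later learners correct earlier mistakes.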


29.6.2 Boosting 


For binary classification, formally, $f(x) \in \{-1, +1\}$, the
margin of the classifier h on the instance x is defined
as $f(x)h(x)$, and similarly, the margin of the ensemble
$H(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$ is $f(x)H(x) = \sum_{t=1}^{T} \alpha_t f(x) h_t(x)$,
whereas the normalized margin is

$$f(x)H(x) = \frac{\sum_{t=1}^{T} \alpha_t f(x) h_t(x)}{\sum_{t=1}^{T} \alpha_t}, \tag{29.46}$$

where $\alpha_t$ are the weights of base learners. Given any
threshold $\theta > 0$ of margin over the training sample D,
it was proved in [29.97] that the generalization error of
the ensemble is bounded with probability at least $1 - \delta$
by

$$\epsilon \le 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t^{1-\theta}(1 - \epsilon_t)^{1+\theta}} + \tilde{O}\left(\sqrt{\frac{d}{m\theta^2} + \ln\frac{1}{\delta}}\right), \tag{29.47}$$

where $\epsilon_t$ is the training error of the base learner $h_t$.
This bound implies that when other variables are fixed,
the larger the margin over the training set, the smaller
the generalization error. Thus, [29.97] argued that AdaBoost
tends to be resistant to overfitting because it can
increase the ensemble margin even after the training error
reaches zero.

This margin-based explanation seems reasonable;
however, it was later questioned [29.98] by the fact
that (29.47) depends heavily on the minimum margin,
whereas there are counterexamples where an algorithm
is able to produce uniformly larger minimum margins
than AdaBoost, but the generalization performance
drastically decreases. From then on, there was much debate
about whether the margin-based explanation holds;
more details can be found in [29.14].


One drawback of AdaBoost is that it
is very sensitive to noise. Great efforts have been devoted
to addressing this issue [29.99, 100]. For example,
RobustBoost [29.101] tries to improve the noise toler- 
ance ability by boosting the normalized classification 
margin, which was believed to be closely related to the 
generalization error. 


29.6.3 Bagging 


In contrast to sequential ensemble methods such as 
boosting where the base learners are generated in a se- 
quential style to exploit the dependence between the 
base learners, bagging [29.99] is a kind of parallel en- 
semble method where the base learners are generated 
in parallel, attempting to exploit the independence be- 
tween the base learners. 

The name bagging came from the abbreviation of 
Bootstrap AGGregatING. Algorithm 29.2 shows its 
pseudo-code, where I is the indicator function. Bagging 
applies bootstrap sampling [29.102] to obtain multi- 
ple different data subsets for training the base learners. 
Given m training examples, a set of m training examples
is generated by sampling with replacement; some original
examples appear more than once, whereas some do
not appear at all. By applying the process T times, T such
sets are obtained, and then each set is used for training 
a base learner. Bagging can deal with binary as well as 
multi-class problems by using majority voting to com- 
bine base learners; it can also be applied to regression 
by using averaging for combination. 


Algorithm 29.2 The Bagging Algorithm
Input: Data set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$;
       Base learning algorithm $\mathfrak{L}$;
       Number of base learners T.
Process:
1: for t = 1, ..., T:
2:   $h_t = \mathfrak{L}(D, D_{bs})$ % $D_{bs}$ is the bootstrap distribution
3: end
Output: $H(x) = \arg\max_{y \in \mathcal{Y}} \sum_{t=1}^{T} \mathbb{I}(h_t(x) = y)$
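
Algorithm 29.2 translates almost line by line into code. In the sketch below the base learner is a deliberately trivial nearest-class-mean rule on one-dimensional inputs, used only to keep the example short; the function names, the seed, and the toy data are assumptions of this illustration.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) examples with replacement."""
    return [rng.choice(data) for _ in data]


def bagging(data, base_learner, n_learners, seed=0):
    """Train each base learner on its own bootstrap sample (steps 1 to 3)."""
    rng = random.Random(seed)
    return [base_learner(bootstrap_sample(data, rng)) for _ in range(n_learners)]


def vote(learners, x):
    """Majority voting, i.e., the arg max over label counts in the output step."""
    return Counter(h(x) for h in learners).most_common(1)[0][0]


def nearest_mean_learner(sample):
    """Stand-in base learner: predict the class whose mean input is closest."""
    sums, counts = {}, Counter()
    for x, y in sample:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] += 1
    means = {y: sums[y] / counts[y] for y in sums}
    return lambda x: min(means, key=lambda y: abs(x - means[y]))


data = [(0, -1), (1, -1), (2, -1), (3, -1), (10, 1), (11, 1), (12, 1), (13, 1)]
learners = bagging(data, nearest_mean_learner, n_learners=25)
low, high = vote(learners, -5), vote(learners, 20)
```

A nearest-mean rule is a stable learner, so bagging brings little benefit here; an unstable learner such as a decision tree would be the more natural choice in practice.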


Theoretical analysis [29.99, 103, 104] shows that 
Bagging is particularly effective with unstable base 
learners (whose performance will change significantly 
with even slight variation of the training sample), such 
as decision trees, because it has a tremendous vari- 
ance reduction effect, whereas it is not wise to apply 
Bagging to stable learners, such as nearest neighbor 
classifiers.

A prominent extension of Bagging is the random
forest method [29.105], which has been successfully 
deployed in many real tasks. Random forest incorpo- 
rates a randomized feature selection process in con- 
structing the individual decision trees. For each indi- 
vidual decision tree, at each step of split selection, it 
randomly selects a feature subset, and then executes 
conventional split selection within the feature subset. 
The recommended size of feature subsets is the loga- 
rithm of the number of all features [29.105]. 


29.6.4 Stacking 


Stacking [29.106-108] trains a meta-learner (or 
second-level learner), to combine the individual learn- 
ers (or first-level learners). First-level learners are often 
generated by different learning algorithms, and there- 
fore, stacked ensembles are often heterogeneous, al- 
though it is also possible to construct homogeneous 
stacked ensembles. A similar approach was proposed
at IJCNN 1991 [29.109], in which the meta-learner is
called an associative switch
and is learned from examples for combining multiple
classifiers.

Stacking can be viewed as a generalized framework 
of many ensemble methods, and can also be regarded 
as a specific combination method, i.e., combining by 
learning. It uses the original training examples to con- 
struct the first-level learners, and then generates a new 
data set to train the meta-learner, where the first-level 
learners’ outputs are used as input features whereas the 
original labels are used as labels. Notice that there will 
be a high risk of overfitting if the exact data that are 
used to train the first-level learners are also used to gen- 
erate the new data set for the meta-learner. Hence, it is 
recommended to exclude the training examples for the 
first-level learners from the data that are used for the 
meta-learner, and a cross-validation procedure is usu- 
ally used. 
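
The cross-validation procedure for building the meta-learner's training set can be sketched as follows. Two toy first-level learners are used: one that memorizes its training examples and one that always predicts the majority label. Both learners, the fold assignment by index, and the data are assumptions made to expose the leakage issue, not a recommended configuration.

```python
def out_of_fold_features(data, fit_functions, n_folds):
    """Build meta-level examples from models that never saw the example itself.

    data          : list of (x, y) pairs
    fit_functions : each takes a training list and returns a predictor
    """
    meta = [None] * len(data)
    for k in range(n_folds):
        held_out = set(range(k, len(data), n_folds))
        train = [d for i, d in enumerate(data) if i not in held_out]
        models = [fit(train) for fit in fit_functions]
        for i in held_out:
            x, y = data[i]
            # First-level outputs become features; the original label is kept.
            meta[i] = ([m(x) for m in models], y)
    return meta


def fit_memorizer(train):
    """Returns the stored label for seen inputs and 0 for unseen ones."""
    table = dict(train)
    return lambda x: table.get(x, 0)


def fit_majority(train):
    labels = [y for _, y in train]
    guess = max(set(labels), key=labels.count)
    return lambda x: guess


data = [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]
meta = out_of_fold_features(data, [fit_memorizer, fit_majority], n_folds=3)
leaky = [fit_memorizer(data)(x) for x, _ in data]
```

Evaluated on its own training data, the memorizer looks perfect (`leaky` is all ones), whereas its out-of-fold outputs are all zeros. A meta-learner trained on the leaky features would trust the memorizer far too much, which is the overfitting risk described above.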

It is crucial to consider the types of features for 
the new training data, and the types of learning al- 
gorithms for the meta-learner [29.106]. It has been 
suggested [29.110] to use class probabilities instead 
of crisp class labels as features for the new data, and 
to use multi-response linear regression (MLR) for the 
meta-learner. It has also been suggested [29.111] to use 
different sets of features for the linear regression prob- 
lems in MLR. 

If stacking (and many other ensemble methods) is 
simply viewed as assigning weights to combine differ- 
ent models, then it is closely related to Bayes model 


averaging (BMA), which assigns weights to models 
based on posterior probabilities. In theory, if the cor- 
rect data generation model is in consideration and if 
the noise level is low, BMA is never worse and of- 
ten better than stacking. In practice, however, BMA 
rarely performs better than stacking, because the correct 
data generation model is usually unknown, whereas
BMA is quite sensitive to model approximation er- 
ror [29.112]. 


29.6.5 Diversity 


If the base learners are independent, an amazing combination
effect will occur. Taking binary classification as
an example, suppose each base learner has an independent
generalization error $\epsilon$ and T learners are combined
via majority voting. Then, the ensemble makes an error
only when at least half of its base learners make errors.
Thus, by Hoeffding inequality, the generalization error
of the ensemble is

$$\sum_{k=0}^{\lfloor T/2 \rfloor} \binom{T}{k} (1 - \epsilon)^k \epsilon^{T-k} \le \exp\left(-\frac{1}{2}\, T\, (2\epsilon - 1)^2\right), \tag{29.48}$$

which implies that the generalization error decreases
exponentially with the ensemble size T, and ultimately approaches
zero as T approaches infinity.
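
Both sides of (29.48) are easy to evaluate numerically. The sketch below computes the exact binomial sum and the exponential bound for a base error of 0.3, a value chosen only for illustration; the function names are likewise assumptions of this example.

```python
import math


def majority_vote_error(eps, t):
    """Left-hand side of (29.48): at most half of the t learners are correct."""
    return sum(math.comb(t, k) * (1 - eps) ** k * eps ** (t - k)
               for k in range(t // 2 + 1))


def hoeffding_bound(eps, t):
    """Right-hand side of (29.48)."""
    return math.exp(-0.5 * t * (2 * eps - 1) ** 2)


eps = 0.3
sizes = [1, 11, 51]
exact = [majority_vote_error(eps, t) for t in sizes]
bound = [hoeffding_bound(eps, t) for t in sizes]
```

With a single learner the sum reduces to the base error 0.3, and for larger odd T the exact error falls quickly while staying below the bound. The comparison presumes a base error below 0.5 and independent errors, as in the text.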

It is practically impossible to obtain really indepen- 
dent base learners, but it is generally accepted that to 
construct a good ensemble, the base learners should 
be as accurate as possible, and as diverse as possi- 
ble. This has also been confirmed by error-ambiguity 
decomposition and bias-variance-covariance decompo- 
sition [29.113-115]. Generating diverse base learners, 
however, is not easy, because these learners are gener- 
ated from the same training data for the same learning 
problem, and thus they are usually highly correlated. 
Actually, we need to require that the base learners must 
not be very poor; otherwise their combination may even 
worsen the performance. 

Combining only accurate learners is often
worse than combining some accurate ones together
with some relatively weak ones, because
complementarity is more important than pure accuracy. Notice
that it is possible to do some selection to construct 
a smaller but stronger ensemble after obtaining all base 
learners [29.116], possibly because this way makes it 
easier to trade off between individual performance and 
diversity.

Unfortunately, there is not yet a clear understanding
about diversity although it is crucial for ensemble
methods. Many efforts have been devoted to design- 
ing diversity measures, however, none of them is well- 


29.7 Feature Selection and Extraction 


Real-world data are often high-dimensional and contain 
many spurious features. For example, in face recognition,
an image of size $m \times n$ is often represented as
a vector in $\mathbb{R}^{mn}$, which can be very high-dimensional
for typical values of m and n. Similarly, biological 
databases such as microarray data can have thousands 
or even tens of thousands of genes as features. Such 
a large number of features can easily lead to the curse 
of dimensionality and severe overfitting. A simple ap- 
proach is to manually remove irrelevant features from 
the data. However, this may not be feasible in practice. 
Hence, automatic dimensionality reduction techniques, 
in the form of either feature selection or feature extrac- 
tion, play a fundamental role in many machine learning 
problems. 

Feature selection selects only a relevant subset of 
features for use with the model. In feature selection, 
the features may be scored either individually or as 
a subset. Not only can feature selection improve the 
generalization performance of the resultant classifier, 
the use of fewer features is also less computationally 
expensive and thus implies faster testing. Moreover, it 
can eliminate the need to collect a large number of ir- 
relevant and redundant features, and thus reduces cost. 
The discovery of a small set of highly predictive vari- 
ables also enhances our understanding of the underlying 
physical, biological, or natural processes, beyond just 
the building of accurate black-box predictors. 

Feature selection and extraction have been a classic
topic in the literature of pattern recognition for several
decades; many results obtained before the 1980s
are systematically summarized in [29.118]. For reviews of
further studies over the last three decades, readers are referred
to [29.119-121]. Roughly, feature selection methods
can be classified into three main paradigms: filters, 
wrappers, and the embedded approach [29.120]. Filters 
score the usefulness of the feature subset obtained as 
a pre-processing step. Commonly used scores include 
mutual information and the inter/intra class distance. 
This filtering step is performed independently of the 
classifier and is typically the least computationally expensive
among the three paradigms. Wrappers, on the other


accepted [29.14, 117]. In practice, heuristics are usually 
employed to generate diversity, and popular strategies in- 
clude manipulating data samples, input features, learning 
parameters, and output representations [29.14]. 


hand, score the feature subsets according to their pre- 
diction performance when used with the classifier. In 
other words, the classifier is trained on each of the can- 
didate feature subsets, and the one with the best score 
is then selected. However, as the number of candidate 
feature subsets can be very large, this approach is com- 
putationally expensive, though it is also expected to 
perform better than filters. Both filters and wrappers 
rely on search strategies to guide the search for the 
best feature subset. While a large number of search 
strategies can be used, one is often limited to the com- 
putationally simple greedy strategies: (i) forward, in 
which features are added to the candidate set one by 
one; or (ii) backward, in which one starts with the full 
feature set and deletes features one by one. Finally, 
embedded methods combine feature selection with the 
classifier to create a sparse model. For example, one can 
use the $\ell_1$ regularizer, which shrinks the coefficients of
the useless features to zero, essentially removing them 
from the model. Another popular algorithm is called 
recursive feature elimination [29.122] for use with sup- 
port vector machines. It repeatedly constructs a model 
and then removes those features with low weights. Em- 
pirically, embedded methods are often more efficient 
than filters and wrappers [29.120]. 
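
The greedy forward strategy described above can be sketched as follows. This is a minimal illustration, not a method taken from the cited references: the least-squares scorer, the hold-out split, the synthetic data, and the names `holdout_error` and `forward_selection` are all assumptions made for the example.

```python
import numpy as np

def holdout_error(X_train, y_train, X_val, y_val, subset):
    """Validation MSE of a least-squares fit restricted to the given feature subset."""
    w, *_ = np.linalg.lstsq(X_train[:, subset], y_train, rcond=None)
    residual = X_val[:, subset] @ w - y_val
    return float(np.mean(residual ** 2))

def forward_selection(X_train, y_train, X_val, y_val, max_features):
    """Greedy forward search: add one feature at a time while the score improves."""
    selected, best_err = [], np.inf
    remaining = list(range(X_train.shape[1]))
    while remaining and len(selected) < max_features:
        errs = [holdout_error(X_train, y_train, X_val, y_val, selected + [j])
                for j in remaining]
        k = int(np.argmin(errs))
        if errs[k] >= best_err:       # no candidate improves the score: stop
            break
        best_err = errs[k]
        selected.append(remaining.pop(k))
    return selected, best_err

# Synthetic data (assumed): only features 2 and 5 carry signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((400, 8))
y = 3.0 * X[:, 2] - 2.0 * X[:, 5] + 0.1 * rng.standard_normal(400)
selected, err = forward_selection(X[:300], y[:300], X[300:], y[300:], max_features=4)
print(selected, err)
```

Because the model is refitted for every candidate subset, the cost grows with the number of features times the number of selection rounds, which is the computational burden of wrappers noted in the text.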

While most feature selection methods are super- 
vised, there are also recent works on feature selection 
in the unsupervised learning setting. However, unsuper- 
vised feature selection is much more difficult due to the 
lack of label information to guide the search for relevant 
features. Most unsupervised feature selection methods 
are based on the filter approach [29.123-125], though 
there are also some studies on wrappers [29.126] and 
embedded approaches [29.124, 127-129]. 

Recently, feature selection in multi-task learning 
has been receiving increasing attention. Recall that 
the $\ell_1$ regularizer is commonly used to induce feature
selection in single-task learning; this is extended
to the mixed norms in MTL. Specifically, let
$W = [w_1, w_2, \dots, w_T]$, where $w_t \in \mathbb{R}^d$ is the parameter
associated with the t-th task. To enforce joint sparsity across
the T tasks, the $\ell_{\infty,1}$ norm of W is used as the regularizer,
i.e., $\|W\|_{\infty,1} = \sum_i \max_{1 \le t \le T} |W_{it}|$ [29.130]. In
other words, one uses an $\ell_\infty$ norm on the rows of W
to combine the contributions of each row (feature) from
all the tasks, and then combines the features by using
the $\ell_1$ norm, which, because of its sparsity-encouraging
property, leads to only a few nonzero rows of W.
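
As a small numerical illustration of this mixed norm, the sketch below evaluates it for a toy weight matrix. The matrix values and the helper name `linf_l1_norm` are assumptions for the example; rows index features and columns index tasks, as in the text.

```python
import numpy as np

def linf_l1_norm(W):
    """Sum over rows (features) of the largest absolute entry across columns (tasks)."""
    return float(np.sum(np.max(np.abs(W), axis=1)))

# Toy weight matrix (assumed): 4 features, 3 tasks, two all-zero rows.
W = np.array([[0.0, 0.0, 0.0],
              [1.5, -2.0, 0.5],
              [0.0, 0.0, 0.0],
              [-0.3, 0.1, 0.2]])

norm_value = linf_l1_norm(W)
# Rows with a nonzero maximum are the features kept jointly by all tasks.
active_rows = np.flatnonzero(np.max(np.abs(W), axis=1) > 0)
print(norm_value, active_rows)
```

A zero row contributes nothing to the norm, so penalizing it pushes whole rows to zero and removes the corresponding feature from every task at once.
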
Instead of only selecting a subset from the exist- 
ing set of features, feature extraction aims at extracting 
a set of new features from the original features. This can 
be viewed as performing dimensionality reduction that 
maps the original features to a new lower-dimensional 
feature space, while ensuring that the overall struc- 
ture of the data points remains intact. The unsupervised 
methods previously introduced in Sect. 29.3.1 can all 
be used for feature extraction. The classic ones consist 
of principal component analysis (PCA) and principal
subspace analysis (PSA), and their complementary
counterparts minor component analysis (MCA)
and minor subspace analysis (MSA), as well as the
closely related factor analysis (FA), while independent
component analysis (ICA) and non-Gaussian fac-

is well established but it entails several approximations.
The theory of Hilbert spaces, which is
principled and well established, helps solve the 
representation problem in machine learning by 
providing a rich (universal) class of functions where 
the optimization can be conducted. Working with 
functions is cumbersome, but for the class of reproducing
kernel Hilbert spaces (RKHSs) it is still

square of the number of samples, must fit in computer
memory. The computation in this best-case
scenario is also proportional to the square of the
number of samples. This is not specific to the SVM algorithm
and is shared by kernel regression. There are also 
other relevant data processing scenarios such as 
streaming data (also called a time series) where 
the size of the data is unbounded and potentially 
nonstationary, therefore batch mode is not directly 
applicable and brings added difficulties. 

Online learning in kernel space is more efficient 
in many practical large scale data applications. As 
the training data are sequentially presented to the 
learning system, online kernel learning, in general, 
requires much less memory and computational 
bandwidth. The drawback is that online algo- 
rithms only converge weakly (in mean square) to 
the optimal solution, i.e., they only have guaranteed
convergence within a ball of radius $\epsilon$ around
the optimum ($\epsilon$ is controlled by the user). But because
the theoretical optimal ML solution has many
approximations, this is one more approximation 


that is worth exploring practically. The most important
recent advance in this field is the development
of the kernel adaptive filters (KAFs). The KAF algo- 
rithms are developed in reproducing kernel Hilbert 
space (RKHS), by using the linear structure of this 
space to implement well-established linear adap- 
tive algorithms (e.g., LMS, RLS, APA, etc.) and to 

obtain nonlinear filters in the original input space. 
The main goal of this chapter is to bring closer to 
readers, from both machine learning and signal 

processing communities, these new online learn- 
ing techniques. In this chapter, we focus mainly 

on the kernel least mean square (KLMS), kernel recursive
least squares (KRLS), and kernel affine
projection algorithms (KAPA). The derivation of

the algorithms and some key aspects, such as the 
mean-square convergence and the sparsification 

of the solutions, are discussed. Several illustration 
examples are also presented to demonstrate the 

learning performance.

30.1 Background Overview


The general goal of machine learning is to build a model 
from data with the goal of extracting useful structure 
contained in the data. More specifically, machine learn- 
ing can be defined as a process by which the topology 
and the free parameters of a neural network (i. e., the 
learning machine) are adapted through a process of 
stimulation by the environment in which the network 
is embedded [30.1]. There is a wide variety of machine 
learning algorithms. Based on the desired response of 
the algorithm or the type of input available during 
training, machine learning algorithms can be divided 
into several categories: supervised learning, unsuper- 
vised learning, semi-supervised learning, reinforcement 
learning, and so on [30.2]. In this chapter, however, we 
focus mainly on supervised learning, and in particular, 
on the regression tasks. The goal of supervised learn- 
ing is, in general, to infer a function that maps inputs 
to desired outputs or labels that should have the gen- 
eralization property, that is, it should perform well on 
unseen data instances. 

In supervised machine learning problems, we assume
that the data pairs $\{x_i, z_i\}$ collected from real-world
experiments are stochastic and drawn independently
from an unknown probability distribution
P(X, Z) that represents the underlying phenomenon we
wish to model. The optimization problem is normally
formulated in terms of the expected risk R(f) defined
as $R(f) = \int L(f(x), z)\, dP(x, z)$, where a loss function
L(f(x), z) translates the goal of the analysis, and f belongs
to a functional space. The optimization problem
is to find the minimal expected risk R(f) among all
possible functions, i.e., $f^* = \arg\min_f R(f)$. Unfortunately,
we cannot work with arbitrary functions in our model,
so we restrict f to a mapper class F, and very likely
$f^* \notin F$. For instance, if our mapper is linear, then the
functional class is the linear set which is small, albeit
important, and so we will approximate $f^*$ by the closest
linear function, committing sometimes a large error.
But even if the mapper is a multilayer perceptron with
fixed topology, the same problem exists although the
error will likely be smaller. The best solution is therefore
$f_F^* = \arg\min_{f \in F} R(f)$ and it represents the first source
of implementation error experimenters face. But this is
not the only problem. We also normally do not know
P(X, Z) in advance (indeed in machine learning this
is normally the goal of the analysis). Therefore, we
resort to the law of large numbers and approximate
the expected risk by $R_N(f) = \frac{1}{N} \sum_i L(f(x_i), z_i)$, which
we call the empirical risk. Therefore, our optimization
goal becomes $f_N = \arg\min_{f \in F} R_N(f)$, which is normally
achieved by optimization algorithms. The difference
between the optimal solution $R(f^*)$ and the solution
achieved $R_N(f_N)$ with the finite number of samples and
the chosen mapper can be written as

$$R_N(f_N) - R(f^*) = [R(f_F^*) - R(f^*)] + [R_N(f_N) - R(f_F^*)],$$


where the first term is the approximation error while the 
second term is the estimation error. The optimization 
itself is also subject to constraints as we can imag- 
ine. The major compromise is how to treat the two 
terms. Statisticians favor algorithms to decrease as fast 
as possible the second term (estimation error), while 
optimization experts concentrate on supra-linear algo- 
rithms to minimize the first term (approximation error). 
But in large-scale data problems, one major consid- 
eration is the optimization time under these optimal 
assumptions, which can become prohibitively large. 
This paper among others [30.3] defines a third error $\rho$
called the optimization error to approximate the practical
optimal solution $R_N(f_N)$ by $R_N(\tilde{f}_N)$, provided one
can find $R_N(\tilde{f}_N)$ simply with algorithms that are O(N)
in time and memory usage. Basically, the final solution
$R_N(\tilde{f}_N)$ will exist in a neighborhood of the optimal
solution of radius $\rho$.
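
To make the two error terms concrete, the sketch below evaluates them numerically for a linear mapper class and a squared loss. Everything specific here is an assumption for illustration: the target $\sin(\pi x)$, the noise level, the sample sizes, and the use of a very large sample as a stand-in for the unknown distribution P(X, Z).

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Draw n pairs (x, z) with z = sin(pi x) + noise (assumed data model)."""
    x = rng.uniform(-1.0, 1.0, n)
    z = np.sin(np.pi * x) + 0.1 * rng.standard_normal(n)
    return x, z

def fit_linear(x, z):
    """Empirical risk minimizer over F = {f(x) = a*x + b} for the squared loss."""
    A = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(A, z, rcond=None)
    return coef

def risk(coef, x, z):
    """Average squared loss of the linear map on the given sample."""
    return float(np.mean((coef[0] * x + coef[1] - z) ** 2))

x_big, z_big = sample(200_000)             # large sample standing in for P(X, Z)
x_small, z_small = sample(20)              # the N samples actually available

R_star = 0.1 ** 2                          # R(f*): only the noise variance remains
R_best = risk(fit_linear(x_big, z_big), x_big, z_big)        # approximates R(f_F^*)
R_emp = risk(fit_linear(x_small, z_small), x_small, z_small)  # R_N(f_N)

approximation_error = R_best - R_star
estimation_error = R_emp - R_best
total = R_emp - R_star
print(approximation_error, estimation_error, total)
```

With the estimation term defined through the empirical risk, as in the decomposition above, its value on a small sample can be negative, because the empirical risk of the fitted model is optimistic.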

Let us now explain how the powerful mathematical 
tool called the RKHS has been widely utilized in the 
areas of machine learning [30.4,5]. It is well known 
that the probability of linearly shattering data tends 
to one with the increase in dimensionality of the data 
space. However, the main bottleneck of this technique 
was the large number of free parameters of the high- 
dimensional classifiers, which results in two difficult 
issues: expensive computation and the need to regu- 
larize the solutions. The RKHS (also kernel space or 
feature space) provides a nice way to simplify the com- 
putation. The dimension of an RKHS can be very high 
(even infinite), but by the kernel trick the calculation in 
RKHS can still be done efficiently in the original in- 
put space if the algorithms can be expressed in terms of 
the inner products. Vapnik proposed a robust regularizer 
in support vector machine (SVM), which promoted the 
application of RKHS in pattern recognition [30.4, 5]. 

Kernel-based learning algorithms have been suc- 
cessfully applied in batch settings (say SVM). The 
batch kernel learning algorithms, however, usually impose
a significant memory and computational burden
due to the necessity of retaining all the training data
and calculating a large Gram matrix. In many practical 
situations, the online kernel learning (OKL) is more ef- 
ficient. Since the training data are sequentially (one by 
one) presented to the learning system, OKL in general 
requires much less memory and computational cost. 
Another key advantage of OKL algorithms is that they 
can easily deal with nonstationary (time varying) en- 
vironments (i. e., where the data statistics change over 
time). 

Traditional linear adaptive filtering algorithms like 
the least mean square (LMS) and recursive least squares
(RLS) are the best known and simplest online
learning algorithms, especially in the signal processing
community [30.6-8]. In recent years, many researchers
have used the RKHS to design optimal nonlinear
adaptive filters, namely, the kernel adaptive filters
(KAF) [30.9]. The KAF algorithms are developed in 
RKHS, by using the linear structure (inner product) 
of this space to implement the well-established linear 
adaptive algorithms and to obtain (by kernel trick) non- 
linear filters in the original input space. Up to now, there 
have been many KAF algorithms. Typical examples in- 
clude the KLMS [30.10], kernel affine projection algo- 
rithms (KAPA) [30.11], kernel recursive least squares 
(KRLS) [30.12], and the extended kernel recursive least 
squares (EX-KRLSs) [30.13]. If the kernel is a Gaus- 
sian, these nonlinear filters build a radial basis function 
(RBF) network with a growing structure, where centers 
are placed at the projected samples and the weights are 
directly related to the errors at each sample. 

The main bottleneck of KAF algorithms (and many 
other OKL algorithms) is their growing structure. This 
drawback will result in increasing memory and com- 
putational requirements, especially in continuous adap- 
tation situation where the number of centers grows 
unbounded. In order to make the KAF algorithms prac- 
tically useful, it is crucial to find a way to curb the 
network growth and to obtain a compact representa- 
tion. Some sparsification rules can be applied to address 
this issue [30.9]. According to these sparsification rules, 


the new input is accepted as a new center (i.e., in- 
serted into the center dictionary) only if it is judged 
as an important input under a certain criterion. Popu- 
lar sparsification criteria include the novelty criterion 
(NC) [30.14], coherence criterion (CC) [30.15], ap- 
proximate linear dependency (ALD) criterion [30.12], 
surprise criterion (SC) [30.16], and so on. In addition, 
the quantization approach can also be used to sparsify 
the solution and produce a compact network with desir- 
able accuracy [30.17]. 

Besides the RKHS, fundamental concepts and prin- 
ciples from information theory can also be applied 
in the areas of signal processing and machine learn- 
ing. For example, information theoretic descriptors like 
entropy and divergence have been widely used as sim- 
ilarity metrics and optimization criteria in information 
theoretic learning (ITL) [30.18]. These descriptors are 
particularly useful for nonlinear and non-Gaussian sit- 
uations since they capture the information content and 
higher order statistics of signals rather than simply their 
energy (i.e., second-order statistics like variance and 
correlation). Recent studies show that the ITL is closely 
related to RKHS. The quantity of correntropy in ITL 
is in essence a correlation measure in RKHS [30.19]. 
Many ITL costs can also be formulated in an RKHS 
induced by the Mercer kernel function defined as the 
cross information potential (CIP) [30.20]. The popular 
quadratic information potential (QIP) can be expressed 
as a squared norm in this RKHS. The estimators of in- 
formation theoretic quantities can also be reinterpreted 
in RKHS. For example, the nonparametric kernel esti- 
mator of the QIP can be expressed as a squared norm of 
the mean vector of the data in kernel space [30.20]. 

The focus of the present chapter is mainly on a large 
family of online kernel learning algorithms, the kernel 
adaptive filters. Several basic learning algorithms are 
introduced. Some key aspects about these algorithms 
are discussed, and several illustration examples are pre- 
sented. Although our focus is on the kernel adaptive 
filtering, the basic ideas will be applicable to many 
other online learning methods. 


30.2 Reproducing Kernel Hilbert Spaces 


A Hilbert space is a linear, complete, and normed space 
equipped with an inner product. A reproducing ker- 
nel Hilbert space is a special Hilbert space associated 
with a kernel k such that it reproduces (via an inner
product) each function f in the space. Let X be
a set (usually a compact subset of $\mathbb{R}^d$) and k(x, y) be
a real-valued bivariate function on $X \times X$. Then the
function k(x, y) is said to be nonnegative definite if for
any finite point set $\{x_i \in X\}_{i=1}^N$ and for any real number
set $\{\alpha_i \in \mathbb{R}\}_{i=1}^N$,

$$\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j k(x_i, x_j) \ge 0. \tag{30.1}$$


If the above inequality is strict for all nonzero vectors
$\alpha = [\alpha_1, \dots, \alpha_N]$, the function k(x, y) is said
to be strictly positive definite (SPD). The following 
theorem shows that any symmetric and nonnegative 
definite bivariate function k(x,y) is a reproducing 
kernel. 


Theorem 30.1 (Moore-Aronszajn [30.21, 22]) 

Any symmetric, nonnegative definite function k(x, y)
defines implicitly a Hilbert space $H_k$ that consists of
functions on X such that:


1) $\forall x \in X$, $k(\cdot, x) \in H_k$,
2) $\forall x \in X$, $\forall f \in H_k$, $f(x) = \langle f, k(\cdot, x) \rangle_{H_k}$,


where $\langle \cdot, \cdot \rangle_{H_k}$ denotes the inner product in $H_k$.


Property 2) is the so-called reproducing property.
In this case, we call k(x, y) a reproducing kernel, and
$H_k$ an RKHS defined by the reproducing kernel k(x, y).
Usually, the space X is also called the input space.
Property 1) indicates that each point in the input space
X can be mapped onto a function in a potentially much
higher dimensional RKHS $H_k$. The nonlinear mapping
from the input space to RKHS is defined as
$\Phi(x) = k(\cdot, x)$. In particular, we have

$$\langle \Phi(x), \Phi(y) \rangle_{H_k} = \langle k(\cdot, x), k(\cdot, y) \rangle_{H_k} = k(x, y). \tag{30.2}$$


Thus the inner products in high-dimensional RKHS can 
be simply calculated via kernel evaluation. This is nor- 
mally called the kernel trick. Note that the RKHS is 
defined by the selected kernel function, and the simi- 
larity between functions in the RKHS is also defined 
by the kernel since it defines the inner product of 
functions. 

The next theorem guarantees the existence of a non- 
linear mapping between the input space and a high- 
dimensional feature space (a vector space in which the 
training data are embedded). 


Theorem 30.2 (Mercer's [30.23])
Let $k \in L_\infty(X \times X)$ be a symmetric bivariate kernel
function. If k is the kernel of a positive integral operator
in $L_2(X)$, and X is a compact subset of $\mathbb{R}^d$, then

$$\forall \psi \in L_2(X): \int_{X \times X} k(x, y)\, \psi(x)\, \psi(y)\, dx\, dy \ge 0. \tag{30.3}$$

Let $\varphi_i \in L_2(X)$ be the normalized orthogonal eigenfunctions
of the integral operator, and $\lambda_i$ the corresponding
positive eigenvalues. Then

$$k(x, y) = \sum_{i=1}^{M} \lambda_i \varphi_i(x) \varphi_i(y), \tag{30.4}$$

where $M \le \infty$. Since the eigenvalues are positive, one
can readily construct a nonlinear mapping $\varphi$ from the
input space X to a feature space F

$$\varphi: X \to F, \qquad \varphi(x) = [\sqrt{\lambda_1}\, \varphi_1(x), \sqrt{\lambda_2}\, \varphi_2(x), \dots]. \tag{30.5}$$

The dimension of F is M, i.e., the number of the positive
eigenvalues (which can be infinite in the strictly
positive definite kernel case).


Fig. 30.1 Nonlinear map $\varphi(\cdot)$ between the input space and
the feature space


Table 30.1 Some well-known kernels defined over $X \times X$,
$X \subset \mathbb{R}^d$ ($c > 0$, $\sigma > 0$, $p \in \mathbb{N}$)

Kernels                  Expressions
Polynomial               $k(x, y) = (c + x^T y)^p$
Exponential              $k(x, y) = \exp(x^T y / 2\sigma^2)$
Sigmoid                  $k(x, y) = \tanh(x^T y / \sigma + c)$
Gaussian                 $k(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$
Laplacian                $k(x, y) = \exp(-\|x - y\| / 2\sigma^2)$
Cosine                   $k(x, y) = \exp(\angle(x, y))$
Multiquadratic           $k(x, y) = \sqrt{\|x - y\|^2 + c}$
Inverse multiquadratic   $k(x, y) = 1 / \sqrt{\|x - y\|^2 + c}$

The feature space F is isometric-isomorphic to the
RKHS $H_k$ induced by the kernel. This can be easily
recognized by identifying $\varphi(x) = \Phi(x) = k(\cdot, x)$. In
general, we do not distinguish these two spaces if no 
confusion arises. 

Now the basic idea of kernel-based learning algorithms
can be simply described as follows: Via
a nonlinear mapping $\varphi: X \to F$ (or $\Phi: X \to H_k$),
the data $\{x_i \in X\}_{i=1}^N$ are mapped into a high-dimensional
(usually $M \gg d$) feature space F with a linear
structure (Fig. 30.1). Then a learning problem in
X is solved in F instead, by working with
$\{\varphi(x_i) \in F\}$. As long as an algorithm can be formulated in
terms of the inner products in F, all the operations 
can be done in the input space (via kernel evalua- 


tions). Because F is high dimensional, a simple linear 
learning algorithm (preferably one expressed solely 
in terms of inner products) in F can solve arbitrar- 
ily nonlinear problems in the input space, provided 
that F is rich enough to represent the mapping (the 
feature space can be universal if it is infinite dimen- 
sional). 
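
The equivalence between an inner product in the feature space and a kernel evaluation in the input space can be checked numerically. The sketch below does this for the polynomial kernel with p = 2 on two-dimensional inputs; the explicit feature map `poly_features` and the test points are assumptions chosen for the illustration, and this finite-dimensional map is only one of several that realize the same kernel.

```python
import numpy as np

def poly_kernel(x, y, c=1.0):
    """Polynomial kernel (c + x^T y)^2, evaluated in the input space."""
    return (c + x @ y) ** 2

def poly_features(x, c=1.0):
    """Explicit 6-dimensional feature map whose inner product equals poly_kernel."""
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2.0) * x1 * x2,
                     np.sqrt(2.0 * c) * x1, np.sqrt(2.0 * c) * x2, c])

x = np.array([0.5, -1.2])
y = np.array([2.0, 0.3])

inner_in_feature_space = poly_features(x) @ poly_features(y)
kernel_in_input_space = poly_kernel(x, y)
print(inner_in_feature_space, kernel_in_input_space)
```

The kernel evaluation costs one inner product in two dimensions, whereas the explicit route needs six features here and many more as p or d grows, which is why algorithms are written in terms of inner products.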

The kernel function k is a crucial factor in all kernel
methods because it defines the similarity between
data points. Some well-known kernels are listed in Ta- 
ble 30.1. Among these kernels, the Gaussian kernel is 
most popular and is, in general, a default choice due to 
its universal approximating capability (Gaussian kernel 
is strictly positive definite), desirable smoothness and 
numerical stability. 
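
The properties mentioned for the Gaussian kernel can be probed numerically on a finite sample. The following sketch builds a Gram matrix and checks the quadratic form of (30.1); the random points, the kernel size, and the helper name `gaussian_gram` are assumptions for the example, and a numerical check on one sample is of course not a proof.

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))  # clamp rounding noise

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 3))          # 30 distinct points in R^3 (assumed)
K = gaussian_gram(X, sigma=0.8)

eigvals = np.linalg.eigvalsh(K)           # eigenvalues of the symmetric Gram matrix
alpha = rng.standard_normal(30)
quad = float(alpha @ K @ alpha)           # the double sum in (30.1)
print(eigvals.min(), quad)
```

The diagonal of the Gram matrix equals one, a fact used later when the KLMS update is normalized.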


30.3 Online Learning with Kernel Adaptive Filters 


In this section, we discuss several important online 
kernel learning algorithms, i.e., the kernel adaptive 
filtering algorithms. Suppose our goal is to learn a continuous
input-output mapping $f: U \to D$ based on a sequence
of input-output examples (the so-called training
data) $\{u(i), d(i)\}$, $i = 1, 2, \dots$, where $U \subset \mathbb{R}^m$ is the
input domain and $D \subset \mathbb{R}$ is the desired output space. This
supervised learning problem can be solved online (se- 
quentially) using an adaptive filter. Figure 30.2 shows 
a general scheme of an adaptive filter. Usually, an adap- 
tive filter consists of three elements: 


1) The input-output training data.

2) The structure (or topology) of the filter, with a set 
of unknown parameters (or weights) w. 

3) An optimization criterion J (or cost function). 


An adaptive filtering algorithm will adjust the filter
parameters so as to minimize the disparity (measured by
the cost function) between the filter output and the desired
output. The filter topology can be a simple linear structure
(e.g., the FIR filter) or any nonlinear network structure
(e.g., MLPs, RBF, etc.). The cost function is, in general,
the mean square error (MSE) or the least-squares
(LS) cost. The adaptive filtering algorithm is usually
a gradient-based algorithm.

Fig. 30.2 General configuration of an adaptive filter
(blocks: input u(i), adaptive filter W(i), desired d(i),
cost function $J = E[e^2(i)]$)

The great appeal of developing adaptive filters in 
RKHS is to utilize the linear structure of this space to 
implement well-established linear adaptive filtering al- 
gorithms and to achieve nonlinear filters in the input 
space. Compared with other nonlinear adaptive filters, 
the KAFs have several desirable features: 


1) If a universal kernel (e.g., the Gaussian kernel) is
chosen, they are universal approximators.

2) Under the MSE criterion, the performance surface is
still quadratic, so gradient descent learning does not
suffer from local minima.

3) If redundant features are pruned, they have moderate
complexity in terms of computation and
memory.


Table 30.2 gives the comparison of different adap- 
tive filters [30.9]. 


30.3.1 Kernel Least Mean Square (KLMS) Algorithm


Among the family of the KAF, the KLMS is the simplest,
which is derived by directly mapping the linear
least mean square (LMS) algorithm into RKHS [30.10].
Before proceeding, we briefly discuss the well-known
LMS algorithm.


Table 30.2 Comparison of different adaptive filters

Adaptive filters              Modeling capacity      Convexity  Complexity
Linear adaptive filters       Linear only            Yes        Very simple
Hammerstein, Wiener models    Limited nonlinearity   No         Simple
Volterra, Wiener series       Universal              Yes        Very high
Time-lagged neural networks   Universal              No         Modest
Recurrent neural networks     Universal              No         High
Kernel adaptive filters       Universal              Yes        Modest
Recursive Bayesian filters    Universal              No         Very high


LMS Algorithm
Usually, the LMS algorithm assumes a linear finite impulse
response (FIR) filter (or transversal filter), whose
output, at iteration i, is simply a linear combination of
the input

$$y(i) = w(i-1)^T u(i), \tag{30.6}$$

where w(i-1) denotes the estimated weight vector at
iteration (i-1). With the above linear model, the LMS
algorithm can be given as follows

$$
\begin{aligned}
w(0) &= 0, \\
e(i) &= d(i) - w(i-1)^T u(i), \\
w(i) &= w(i-1) + \eta e(i) u(i),
\end{aligned}
\tag{30.7}
$$

where e(i) = d(i) - y(i) is the prediction error, and $\eta > 0$
is the step size. The LMS algorithm is in essence
a stochastic gradient-based algorithm under the instantaneous
MSE cost $J(i) = e^2(i)/2$. In fact, the weight
update equation of the LMS can be simply derived as

$$
\begin{aligned}
w(i) &= w(i-1) - \eta \frac{\partial}{\partial w(i-1)} \left( \frac{1}{2} e^2(i) \right) \\
&= w(i-1) - \eta e(i) \frac{\partial}{\partial w(i-1)} \left( d(i) - w(i-1)^T u(i) \right) \\
&= w(i-1) + \eta e(i) u(i).
\end{aligned}
\tag{30.8}
$$

The LMS algorithm has been widely applied in adaptive
signal processing due to its simplicity and efficiency
[30.6-8]. The robustness of the LMS has been
proven in [30.24], and it has been shown that a single
realization of the LMS is optimal in the $H^\infty$ sense.
The step size $\eta$ is a crucial parameter and has significant
influence on the learning performance. It controls
the compromise between convergence speed and misadjustment.
In practice, the selection of step size should
guarantee the stability and convergence rate of the algorithm.
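
The LMS recursion in (30.7) translates almost line by line into code. The sketch below is a minimal illustration on a synthetic linear system identification task; the true weights, noise level, step size, and the function name `lms` are assumptions for the example.

```python
import numpy as np

def lms(U, d, eta):
    """Run LMS over the rows of U; return final weights and prediction errors."""
    w = np.zeros(U.shape[1])              # w(0) = 0
    errors = np.empty(len(d))
    for i, (u, target) in enumerate(zip(U, d)):
        e = target - w @ u                # e(i) = d(i) - w(i-1)^T u(i)
        w = w + eta * e * u               # w(i) = w(i-1) + eta e(i) u(i)
        errors[i] = e
    return w, errors

# Synthetic linear system (assumed): three taps, white Gaussian input.
rng = np.random.default_rng(0)
w_true = np.array([0.8, -0.5, 0.2])
U = rng.standard_normal((2000, 3))
d = U @ w_true + 0.01 * rng.standard_normal(2000)

w_hat, errors = lms(U, d, eta=0.05)
print(w_hat)
```

Raising `eta` speeds up the initial convergence but leaves a larger residual fluctuation of the weights around the solution, which is the compromise between convergence speed and misadjustment described above.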

The LMS algorithm is sensitive to the input power.
In order to guarantee the stability and improve the performance,
one often uses the normalized LMS (NLMS)
algorithm, which is a variant of the LMS algorithm,
where the step-size parameter is normalized by the input
power, that is

$$w(i) = w(i-1) + \frac{\eta}{\|u(i)\|^2} e(i) u(i). \tag{30.9}$$

KLMS Algorithm 
The LMS algorithm assumes a linear FIR filter, and 
hence the performance will become very poor if the un- 
known mapping is highly nonlinear. To overcome this 
limitation, we are motivated to formulate a similar algo- 
rithm in a high-dimensional feature space (or equivalent 
RKHS), which is capable of learning arbitrary nonlin- 
ear mapping. This is the motivation of the development 
of the kernel adaptive filtering algorithms. 

Let us come back to the previous nonlinear learning
problem, i.e., learning a continuous arbitrary input-output
mapping f based on a sequence of input-output
examples $\{u(i), d(i)\}$, $i = 1, 2, \dots$ Online learning finds
sequentially an estimate of f such that $f_i$ (the estimate at
iteration i) is updated based on the last estimate $f_{i-1}$ and
the current example $\{u(i), d(i)\}$. This recursive process
can be done in the feature space. First, we transform
the input u(i) into a high-dimensional feature space F
by a kernel-induced nonlinear mapping $\varphi(\cdot)$. Second,
we assume a linear model in the feature space, which is
in the form similar to the linear model in (30.6)

$$y(i) = \Omega(i-1)^T \varphi(i), \tag{30.10}$$

where $\varphi(i) = \varphi(u(i))$ is the mapped feature vector from
the input space to the feature space, and $\Omega(i-1)$ denotes
a high-dimensional weight vector in feature space.




Third, we develop a linear adaptive filtering algorithm
based on the model (30.10) and the transformed training
data $\{\varphi(i), d(i)\}$, $i = 1, 2, \dots$ If we can formulate this
linear adaptive algorithm in terms of the inner products,
we will obtain a nonlinear adaptive algorithm in input
space, namely the kernel adaptive filtering algorithm.

Performing the LMS algorithm on the model
(30.10) with the new example sequence $\{\varphi(i), d(i)\}$ yields
the KLMS algorithm [30.10]

$$
\begin{aligned}
\Omega(0) &= 0, \\
e(i) &= d(i) - \Omega(i-1)^T \varphi(i), \\
\Omega(i) &= \Omega(i-1) + \eta e(i) \varphi(i).
\end{aligned}
\tag{30.11}
$$

The KLMS is very similar to the LMS algorithm,
except for the dimensionality (or richness) of the projection
space. By identifying $\varphi(u) = k(u, \cdot)$, one can
easily obtain the learning rule in the input space

$$
\begin{aligned}
f_0 &= 0, \\
e(i) &= d(i) - f_{i-1}(u(i)), \\
f_i &= f_{i-1} + \eta e(i) k(u(i), \cdot).
\end{aligned}
\tag{30.12}
$$

The KLMS can be viewed as the solution of the following
regularized least squares problem

$$\min_{f} \; (d(i) - f(u(i)))^2 + \gamma \|f - f_{i-1}\|_{H_k}^2. \tag{30.13}$$

The above formula can be rewritten as

$$\min_{\Delta f_i \in H_k} \; (e(i) - \Delta f_i(u(i)))^2 + \gamma \|\Delta f_i\|_{H_k}^2, \tag{30.14}$$

where $\Delta f_i = f_i - f_{i-1}$. From (30.14), we observe:


1) The learning of KLMS at iteration i is equivalent
to solving a regularized least squares problem, in
which the previous estimate $f_{i-1}$ is frozen, and only
the adjustment term $\Delta f_i$ is solved.

2) In this least squares problem, there is only one training
example involved, i.e., $\{u(i), e(i)\}$.

3) The regularization factor is directly related to the
step size via $\gamma = (1 - \eta)/\eta$.


It has been proven in [30.10] that the KLMS has
a self-regularization property, i.e., the step size plays
a similar role as the regularization parameter.
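
The relation between the regularization factor and the step size in observation 3) can be checked directly. The following is a sketch under two assumptions not spelled out at this point in the text: the kernel satisfies $k(u(i), u(i)) = 1$ (as the Gaussian kernel does), and the adjustment is sought in the form $\Delta f_i = a\, k(u(i), \cdot)$, where the scalar $a$ is a symbol introduced here for illustration. Any component of $\Delta f_i$ orthogonal to $k(u(i), \cdot)$ would increase the norm without changing $\Delta f_i(u(i))$, so this restriction loses nothing.

$$
\begin{aligned}
J(a) &= \bigl(e(i) - a\, k(u(i), u(i))\bigr)^2 + \gamma a^2 k(u(i), u(i)) = (e(i) - a)^2 + \gamma a^2, \\
\frac{dJ}{da} &= -2\,(e(i) - a) + 2 \gamma a = 0 \;\Rightarrow\; a = \frac{e(i)}{1 + \gamma}, \\
\gamma &= \frac{1 - \eta}{\eta} \;\Rightarrow\; a = \eta e(i), \qquad \Delta f_i = \eta e(i)\, k(u(i), \cdot).
\end{aligned}
$$

Under these assumptions the minimizer of (30.14) coincides with the KLMS update in (30.12).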

Given an input u, the output of the KLMS filter, at
iteration i, will be

$$f_i(u) = \eta \sum_{j=1}^{i} e(j)\, k(u(j), u). \tag{30.15}$$


If the kernel is a radial kernel (e.g., Gaussian kernel), 
the KLMS creates a growing RBF network by allo- 
cating a new kernel unit for every new example with 
input u(i) as the center and ne(i) as the coefficient. 
The network topology of the KLMS filter is shown in 
Fig. 30.3. The procedure of KLMS is summarized in 
Algorithm 30.1. 

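It is also straightforward to derive the normalized
KLMS algorithm. The weight update equation of normalized
KLMS will be

$$
\begin{aligned}
\Omega(i) &= \Omega(i-1) + \frac{\eta}{\|\varphi(i)\|^2} e(i) \varphi(i) \\
&= \Omega(i-1) + \frac{\eta}{k(u(i), u(i))} e(i) \varphi(i).
\end{aligned}
\tag{30.16}
$$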


If the kernel function is the Gaussian kernel (Ta- 
ble 30.1), we have k(u(i),u(i)) = 1. In this case, the 
KLMS is automatically normalized. 


Algorithm 30.1 Kernel Least Mean Square Algorithm

Initialization:
    Choose kernel k and step size $\eta$
    $a_1 = \eta d(1)$, $C(1) = \{u(1)\}$, $f_1 = a_1 k(u(1), \cdot)$
Computation:
    while $\{u(i), d(i)\}$ (i > 1) available do
        1) Compute the filter output: $f_{i-1}(u(i)) = \sum_{j=1}^{i-1} a_j k(u(i), u(j))$
        2) Compute the error: $e(i) = d(i) - f_{i-1}(u(i))$
        3) Store the new center: $C(i) = \{C(i-1), u(i)\}$
        4) Compute and store the new coefficient: $a_i = \eta e(i)$
    end while
(where a denotes the coefficient vector and C(i) denotes
the dictionary at iteration i)

Fig. 30.3 The network topology of the KLMS filter
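
The steps of Algorithm 30.1 can be sketched in a few lines of code. This is a minimal illustration with a Gaussian kernel and no sparsification, so the dictionary grows by one center per sample; the target function, noise level, step size, kernel size, and the names `klms` and `klms_predict` are assumptions for the example.

```python
import numpy as np

def gaussian_kernel(centers, u, sigma):
    """Kernel values k(c_j, u) between every stored center and the input u."""
    diff = centers - u
    return np.exp(-np.sum(diff ** 2, axis=1) / (2.0 * sigma ** 2))

def klms(U, d, eta, sigma):
    """KLMS without sparsification: one new center and coefficient per sample."""
    centers = [U[0]]                       # C(1) = {u(1)}
    coeffs = [eta * d[0]]                  # a_1 = eta d(1)
    errors = [d[0]]                        # f_0 = 0, so e(1) = d(1)
    for u, target in zip(U[1:], d[1:]):
        k = gaussian_kernel(np.asarray(centers), u, sigma)
        y = float(np.dot(coeffs, k))       # filter output f_{i-1}(u(i))
        e = target - y                     # prediction error e(i)
        centers.append(u)                  # store the new center
        coeffs.append(eta * e)             # a_i = eta e(i)
        errors.append(e)
    return np.asarray(centers), np.asarray(coeffs), np.asarray(errors)

def klms_predict(centers, coeffs, u, sigma):
    """Evaluate the learned function at a new input u."""
    return float(coeffs @ gaussian_kernel(centers, u, sigma))

# Synthetic nonlinear regression task (assumed).
rng = np.random.default_rng(0)
U = rng.uniform(-1.0, 1.0, size=(1000, 1))
d = np.sin(3.0 * U[:, 0]) + 0.05 * rng.standard_normal(1000)

centers, coeffs, errors = klms(U, d, eta=0.5, sigma=0.3)
print(len(centers), klms_predict(centers, coeffs, np.array([0.2]), 0.3))
```

Each iteration evaluates the kernel against all stored centers, so the cost per sample grows linearly with the number of samples seen, which motivates the sparsification criteria mentioned in Sect. 30.1.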


The KLMS is a simple kernel learning algorithm, 
which requires O(i) operations per iteration. The role 
of the step size $\eta$ in KLMS remains, in principle, the
same as the step size in traditional LMS. Specifically, 
it controls the compromise between convergence speed 
and misadjustment. The step-size parameter in KLMS 
is also directly related to the optimization error intro- 
duced in [30.3]. 

In KLMS, the kernel is usually chosen to be a Gaussian
kernel. The kernel size (or kernel bandwidth) $\sigma$
in the Gaussian kernel is a crucial parameter that 
controls the degree of smoothing and consequently 
has significant influence on the learning performance. 
In practice, the kernel size can be set manually, or 
estimated by rule-based methods (e.g., Silverman’s 
rule [30.25]), or determined automatically using cross- 
validation. 


Mean Square Convergence Performance 
The mean square convergence analysis is very impor- 
tant for adaptive filters. For linear adaptive filters, much 
research has been done in this area and significant re- 
sults have been achieved. For nonlinear adaptive filters, 
the mean square convergence analysis is, in general, 
rather complicated and little studied. The mean square 
convergence analysis of the KLMS is, however, rela- 
tively tractable since it is a simple linear algorithm in 
high-dimensional feature space, and hence its convergence
analysis is very similar to that of the classical
linear adaptive filters [30.26]. In the following, we dis- 
cuss the mean square convergence performance of the 
KLMS. 

Let us consider the case of nonlinear system identification, where the output data $\{d(i)\}$ are related to the input vectors $\{u(i)\}$ via

$$d(i) = f^*(u(i)) + v(i), \tag{30.17}$$

where $f^*(\cdot)$ denotes the unknown nonlinear mapping that needs to be estimated and $v(i)$ stands for the measurement noise. Suppose the selected kernel is a universal kernel (i.e., a strictly positive definite kernel). Then, according to the universal approximation property [30.27], there is a weight vector $\Omega^* \in F$ such that

$$d(i) = \Omega^{*T} \varphi(i) + v(i). \tag{30.18}$$

The prediction error $e(i)$ can thus be expressed as

$$e(i) = \tilde{\Omega}(i-1)^T \varphi(i) + v(i) = e_a(i) + v(i), \tag{30.19}$$

where $\tilde{\Omega}(i-1) = \Omega^* - \Omega(i-1)$ is the weight error vector in $F$, and $e_a(i) \triangleq \tilde{\Omega}(i-1)^T \varphi(i)$ is the a priori error at iteration $i$.

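Subtracting both sides of the weight update equation $\Omega(i) = \Omega(i-1) + \eta e(i)\varphi(i)$ from $\Omega^*$, we get

$$\tilde{\Omega}(i) = \tilde{\Omega}(i-1) - \eta e(i)\varphi(i). \tag{30.20}$$

Define the a posteriori error $e_p(i) \triangleq \tilde{\Omega}(i)^T \varphi(i)$. Then we have

$$e_p(i) = e_a(i) + \left(\tilde{\Omega}(i) - \tilde{\Omega}(i-1)\right)^T \varphi(i). \tag{30.21}$$

By incorporating (30.20),

$$e_p(i) = e_a(i) - \eta e(i)\varphi(i)^T\varphi(i) = e_a(i) - \eta e(i)\kappa(u(i), u(i)). \tag{30.22}$$

Combining (30.20) and (30.22), and eliminating the prediction error $e(i)$, yields

$$\tilde{\Omega}(i) = \tilde{\Omega}(i-1) + \frac{\left(e_p(i) - e_a(i)\right)\varphi(i)}{\kappa(u(i), u(i))}. \tag{30.23}$$

Squaring both sides of (30.23), and after some straightforward manipulations, we obtain

$$\|\tilde{\Omega}(i)\|^2 + \frac{e_a^2(i)}{\kappa(u(i), u(i))} = \|\tilde{\Omega}(i-1)\|^2 + \frac{e_p^2(i)}{\kappa(u(i), u(i))}, \tag{30.24}$$

where $\|\tilde{\Omega}(i)\|^2 = \tilde{\Omega}(i)^T\tilde{\Omega}(i)$ is the weight error power (WEP) in feature space $F$. Further, taking expectations of both sides of (30.24) yields

$$E\left[\|\tilde{\Omega}(i)\|^2\right] + E\left[\frac{e_a^2(i)}{\kappa(u(i), u(i))}\right] = E\left[\|\tilde{\Omega}(i-1)\|^2\right] + E\left[\frac{e_p^2(i)}{\kappa(u(i), u(i))}\right]. \tag{30.25}$$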
The above equation is referred to as the energy conservation relation in feature space [30.26]; it shows how the WEP in feature space evolves in time. This fundamental relation has the same form as the energy conservation relation for classical linear adaptive filters [30.28-30]. This is not surprising, since the KLMS is a linear (but high-dimensional) adaptive algorithm in feature space.
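The per-sample relation (30.24) can be checked numerically when the feature map is explicit. The sketch below is only an illustration under stated assumptions: it uses a hypothetical three-dimensional feature map `feature_map` (so that the kernel is simply the inner product of features), a made-up target weight vector, and inputs drawn uniformly from $[-1, 1]$. It runs the LMS update in that feature space and reports the largest deviation between the two sides of (30.24).

```python
import numpy as np

def feature_map(u):
    # Hypothetical explicit feature map; the induced kernel is phi(u) @ phi(u')
    return np.array([1.0, u, u ** 2])

def energy_gap(eta=0.1, n_steps=200, seed=0):
    """Largest absolute difference between both sides of (30.24)."""
    rng = np.random.default_rng(seed)
    omega_star = np.array([0.5, -1.0, 0.3])   # illustrative "true" weights
    omega = np.zeros(3)
    worst = 0.0
    for _ in range(n_steps):
        phi = feature_map(rng.uniform(-1.0, 1.0))
        kappa = phi @ phi                      # kappa(u(i), u(i))
        d = omega_star @ phi + 0.1 * rng.normal()
        e = d - omega @ phi                    # prediction error
        err_prev = omega_star - omega          # weight error before the update
        e_a = err_prev @ phi                   # a priori error
        omega = omega + eta * e * phi          # LMS update in feature space
        err_new = omega_star - omega           # weight error after the update
        e_p = err_new @ phi                    # a posteriori error
        lhs = err_new @ err_new + e_a ** 2 / kappa
        rhs = err_prev @ err_prev + e_p ** 2 / kappa
        worst = max(worst, abs(lhs - rhs))
    return worst
```

The identity holds for every sample regardless of the noise or the step size, so the returned gap should be at the level of floating-point rounding.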

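Substituting $e_p(i) = e_a(i) - \eta e(i)\kappa(u(i), u(i))$ into the energy conservation relation (30.25) yields

$$E\left[\|\tilde{\Omega}(i)\|^2\right] = E\left[\|\tilde{\Omega}(i-1)\|^2\right] - 2\eta E\left[e_a(i)e(i)\right] + \eta^2 E\left[e^2(i)\kappa(u(i), u(i))\right]. \tag{30.26}$$

When the Gaussian kernel is chosen, we have $\kappa(u(i), u(i)) = 1$, and hence

$$E\left[\|\tilde{\Omega}(i)\|^2\right] = E\left[\|\tilde{\Omega}(i-1)\|^2\right] - 2\eta E\left[e_a(i)e(i)\right] + \eta^2 E\left[e^2(i)\right]. \tag{30.27}$$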


The Gaussian kernel is a normalized and shift-invariant kernel, which makes the analysis much simpler. Since the Gaussian kernel is also the default kernel in KLMS, we focus on it in the following and use (30.27) to analyze the mean square convergence behavior of the KLMS. It is straightforward to generalize the discussion to arbitrary shift-invariant kernels. We first state an assumption that will be used in the analysis.


Assumption 30.1 (A1)

The noise $v(i)$ is zero-mean, independent, identically distributed (i.i.d.), and independent of the a priori estimation error $e_a(i)$.


The above assumption is commonly used in convergence analysis for classical linear adaptive filtering algorithms [30.8]. A sufficient condition for the independence between $v(i)$ and $e_a(i)$ is the independence between $v(i)$ and the input sequence $\{u(i)\}$.

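Combining (30.27) and assumption A1 (so that $E[e_a(i)e(i)] = E[e_a^2(i)]$ and $E[e^2(i)] = E[e_a^2(i)] + \xi_v^2$), we have

$$E\left[\|\tilde{\Omega}(i)\|^2\right] = E\left[\|\tilde{\Omega}(i-1)\|^2\right] - 2\eta E\left[e_a^2(i)\right] + \eta^2\left(E\left[e_a^2(i)\right] + \xi_v^2\right), \tag{30.28}$$

where $\xi_v^2$ denotes the noise power (variance). It is worth noting that (30.28) depends on the noise $v(i)$ through $\xi_v^2$ only.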


A Sufficient Condition for Mean Square Convergence. From (30.28), one can easily derive

$$E\left[\|\tilde{\Omega}(i)\|^2\right] \le E\left[\|\tilde{\Omega}(i-1)\|^2\right] \iff -2\eta E\left[e_a^2(i)\right] + \eta^2\left(E\left[e_a^2(i)\right] + \xi_v^2\right) \le 0 \iff \eta \le \frac{2E\left[e_a^2(i)\right]}{E\left[e_a^2(i)\right] + \xi_v^2}. \tag{30.29}$$

Thus, if we choose the step size such that, for all $i$, $\eta \le 2E[e_a^2(i)]/(E[e_a^2(i)] + \xi_v^2)$, the WEP in feature space will be monotonically decreasing (and hence convergent). This sufficient condition for mean square convergence is, interestingly, identical to that of the normalized LMS algorithm. The essential reason is that the Gaussian kernel is a shift-invariant and normalized kernel ($\kappa(u, u) = 1$). From (30.29), one can also observe that, when the noise power $\xi_v^2$ is very small, the upper bound on the step size is approximately equal to 2.0.


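Steady-State Mean Square Performance. Taking the limit of (30.28) as $i \to \infty$ gives

$$\lim_{i\to\infty} E\left[\|\tilde{\Omega}(i)\|^2\right] = \lim_{i\to\infty} E\left[\|\tilde{\Omega}(i-1)\|^2\right] - 2\eta \lim_{i\to\infty} E\left[e_a^2(i)\right] + \eta^2\left(\lim_{i\to\infty} E\left[e_a^2(i)\right] + \xi_v^2\right). \tag{30.30}$$

If the WEP in feature space reaches a steady-state value, i.e., $\lim_{i\to\infty} E[\|\tilde{\Omega}(i)\|^2] = \lim_{i\to\infty} E[\|\tilde{\Omega}(i-1)\|^2]$, we have

$$-2\eta \lim_{i\to\infty} E\left[e_a^2(i)\right] + \eta^2\left(\lim_{i\to\infty} E\left[e_a^2(i)\right] + \xi_v^2\right) = 0. \tag{30.31}$$

It follows that

$$\lim_{i\to\infty} E\left[e_a^2(i)\right] = \frac{\eta\,\xi_v^2}{2 - \eta}. \tag{30.32}$$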

The a priori error power $E[e_a^2(i)]$ is also referred to as the excess mean square error (EMSE) in the adaptive filtering community. From (30.32), we see that the steady-state EMSE of KLMS depends only on the step size and the noise variance, and is not related to the kernel size or the unknown nonlinear mapping. We should point out here that, although the kernel size does not affect the KLMS steady-state accuracy, it has a crucial influence on the convergence rate. In most practical situations, the training data are finite and the algorithm can never reach the steady state. In these cases, the kernel size also has a significant influence on the final accuracy (not the steady-state accuracy).

Fig. 30.4 Simulated and theoretical EMSE versus step size
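The two closed-form results of this subsection, the step-size bound in (30.29) and the steady-state EMSE in (30.32), are simple enough to evaluate directly. The helper names below (`step_size_bound`, `steady_state_emse`) are illustrative and not taken from the source; the formulas assume the Gaussian kernel and assumption A1.

```python
def step_size_bound(apriori_error_power, noise_var):
    """Upper bound on eta from (30.29): 2 E[e_a^2] / (E[e_a^2] + xi_v^2)."""
    return 2.0 * apriori_error_power / (apriori_error_power + noise_var)

def steady_state_emse(eta, noise_var):
    """Steady-state EMSE from (30.32): eta * xi_v^2 / (2 - eta), for 0 < eta < 2."""
    if not 0.0 < eta < 2.0:
        raise ValueError("eta must lie in (0, 2)")
    return eta * noise_var / (2.0 - eta)
```

For a step size of 0.5 and a noise variance of 0.01, `steady_state_emse(0.5, 0.01)` evaluates to $0.5 \times 0.01 / 1.5 \approx 0.0033$. The bound tends to 2 as the noise variance goes to zero, as noted above.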

We present here a simple simulation example to verify the obtained theoretical results. Suppose that the training data are generated by the following nonlinear system [30.26]

$$d(i) = \sin(u(i)) + 0.5u(i-1) - 0.1u^2(i-2) + v(i). \tag{30.33}$$

The input sequence $\{u(i)\}$ is assumed to be a white Gaussian process with variance 1.0, and $\{v(i)\}$ is a zero-mean white noise that is independent of $\{u(i)\}$. In the simulation, unless otherwise stated, the step size is set at $\eta = 0.5$, the noise variance is $\xi_v^2 = 0.01$, and the kernel size is $\sigma = 1.0$. For different values of the step size, noise variance, and kernel size, the simulated and theoretical EMSE are illustrated in Figs. 30.4-30.6. Evidently, the experimental and theoretical results agree very well.


Network Size Control 
The KLMS filter network grows linearly with the number of samples, following a nonparametric approach. Because resources are finite, one must curb the growth of the filter structure and constrain the network size (the number of centers). Sparsification methods can be applied to cope with this issue: new samples are inserted into the dictionary only if they satisfy a certain sparsification criterion. In the following, we briefly discuss several useful sparsification criteria.

Fig. 30.5 Simulated and theoretical EMSE versus noise variance

Fig. 30.6 Simulated and theoretical EMSE versus kernel size

Suppose that at iteration $i$ the dictionary is $C(i) = \{c_1, c_2, \ldots, c_{m_i}\}$ and the coefficient vector is $a(i) = [a_1, a_2, \ldots, a_{m_i}]$, where $c_j$ is the $j$th center, $a_j$ is the $j$th coefficient, and $m_i$ is the dictionary size (or network size) at iteration $i$. In this case, the learned mapping is

$$f_i(\cdot) = \sum_{j=1}^{m_i} a_j \kappa(c_j, \cdot). \tag{30.34}$$

When a new example $\{u(i+1), d(i+1)\}$ is presented, the learning system needs to decide whether $u(i+1)$ should be inserted into the dictionary. This decision is, in general, based on some sparsification criterion.


Novelty Criterion. Platt's NC [30.14] first computes the distance of $u(i+1)$ to the present dictionary

$$\mathrm{dis}_1 = \min_{c_j \in C(i)} \|u(i+1) - c_j\|. \tag{30.35}$$

If $\mathrm{dis}_1$ is smaller than some preset threshold $\delta_1$ ($\delta_1 > 0$), $u(i+1)$ will not be added to the dictionary. Otherwise, the NC computes the prediction error $e(i+1) = d(i+1) - f_i(u(i+1))$. Only if the magnitude of the prediction error is larger than another preset threshold $\delta_2$ ($\delta_2 > 0$) will $u(i+1)$ be accepted as a new center. If the input domain $U$ is a compact set, the NC always produces a dictionary with a finite number of elements.


Coherence Criterion. According to the CC [30.15], the input $u(i+1)$ will be inserted into the dictionary if its coherence remains below a given threshold $\mu_0$, that is

$$\max_{c_j \in C(i)} |\kappa(u(i+1), c_j)| \le \mu_0. \tag{30.36}$$


ALD Criterion. The ALD uses the distance of the new input to the linear span of the current dictionary in feature space, that is [30.12]

$$\mathrm{dis}_2 = \min_{b} \left\|\varphi(u(i+1)) - \sum_{c_j \in C(i)} b_j \varphi(c_j)\right\|. \tag{30.37}$$

ALD is computationally expensive, especially when the dictionary size $m_i$ is very large. To simplify the computation, one can use the following approximate distance

$$\mathrm{dis}_3 = \min_{b,\; c_j \in C(i)} \left\|\varphi(u(i+1)) - b\,\varphi(c_j)\right\|. \tag{30.38}$$
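The three tests above (NC, CC, and ALD) can be sketched as small helper functions. This is an illustration only, assuming a Gaussian kernel (so that $\kappa(u, u) = 1$); the function names, argument order, and the small `jitter` term added for numerical stability are assumptions of this sketch, not part of the cited methods. For ALD, the squared distance in (30.37) is computed with the standard closed form $\kappa(u, u) - k^T K^{-1} k$, where $K$ is the Gram matrix of the dictionary and $k$ the vector of kernel values between the new input and the centers.

```python
import numpy as np

def gauss(a, b, sigma=1.0):
    """Gaussian kernel between the rows of a and a single vector b."""
    return np.exp(-np.sum((np.atleast_2d(a) - b) ** 2, axis=1) / (2.0 * sigma ** 2))

def novelty_accepts(u_new, pred_error, centers, delta1, delta2):
    """NC: reject if too close to the dictionary, else accept only on large error."""
    dist = np.min(np.linalg.norm(np.atleast_2d(centers) - u_new, axis=1))
    if dist < delta1:
        return False
    return abs(pred_error) > delta2

def coherence_accepts(u_new, centers, mu0, sigma=1.0):
    """CC: accept if the largest kernel value to any center is at most mu0."""
    return bool(np.max(np.abs(gauss(centers, u_new, sigma))) <= mu0)

def ald_distance_sq(u_new, centers, sigma=1.0, jitter=1e-10):
    """Squared ALD distance kappa(u,u) - k^T K^{-1} k for the Gaussian kernel."""
    centers = np.atleast_2d(centers)
    K = np.array([gauss(centers, c, sigma) for c in centers])
    k = gauss(centers, u_new, sigma)
    b = np.linalg.solve(K + jitter * np.eye(len(centers)), k)
    return max(1.0 - k @ b, 0.0)     # kappa(u, u) = 1 for the Gaussian kernel
```

The ALD helper solves a linear system whose size equals the dictionary size, which is the cost the approximate distance (30.38) is meant to avoid.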


Surprise Criterion. Surprise is a subjective information measure of an example $\{u, d\}$ with respect to a learning system $\mathcal{L}$. It is defined as the negative log likelihood of the example given the learning system's estimate of the data distribution [30.16]

$$S_{\mathcal{L}}(u, d) = -\ln p(u, d \mid \mathcal{L}), \tag{30.39}$$

where $p(u, d \mid \mathcal{L})$ is the subjective probability of $(u, d)$ hypothesized by $\mathcal{L}$. The surprise $S_{\mathcal{L}}(u, d)$ measures how surprising the exemplar is to the learning system. The surprise of the new example $\{u(i+1), d(i+1)\}$ is

$$S_{\mathcal{L}(i)}(u(i+1), d(i+1)) = -\ln p(u(i+1), d(i+1) \mid \mathcal{L}(i)), \tag{30.40}$$

where $\mathcal{L}(i)$ denotes the present learning system. To simplify notation, one can write $S_{\mathcal{L}(i)}(u(i+1), d(i+1))$ as $S(i+1)$.

By definition, if the surprise $S(i+1)$ is large, the new example $\{u(i+1), d(i+1)\}$ either contains something new for the system to learn or is suspicious. Otherwise, if the surprise $S(i+1)$ is very small, the new datum is well expected by the learning system $\mathcal{L}(i)$ and thus contains little information to be learned. Usually one can classify the new example into three categories

$$\begin{aligned} \text{abnormal:} &\quad S(i+1) > T_1, \\ \text{learnable:} &\quad T_1 \ge S(i+1) \ge T_2, \\ \text{redundant:} &\quad S(i+1) < T_2, \end{aligned} \tag{30.41}$$

where $T_1$ and $T_2$ are threshold parameters. The choice of the thresholds and learning strategies defines the characteristics of the learning system. In general, a new center will be added only if the example is learnable, i.e., $T_1 \ge S(i+1) \ge T_2$.
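The three-way classification in (30.41) amounts to two threshold comparisons. The sketch below assumes the surprise has already been obtained as the negative log of some probability density value; how that density is estimated is outside this example, and the function names are illustrative.

```python
import math

def surprise(prob_density):
    """Surprise as the negative log likelihood, following (30.39)."""
    return -math.log(prob_density)

def classify_example(s, t1, t2):
    """Classify a surprise value s using thresholds t1 >= t2 as in (30.41)."""
    if s > t1:
        return "abnormal"
    if s >= t2:
        return "learnable"
    return "redundant"
```

Only examples classified as learnable would be added to the dictionary; abnormal examples are treated as suspicious and redundant ones are discarded.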

Besides the aforementioned sparsification methods, there is another technique for reducing the network size of KLMS, called the quantization approach. In this approach, the input space is quantized; if the quantized value of the new input has already been assigned a center, no new center is added, and the coefficient of that center is updated instead. The resulting algorithm is called the quantized KLMS (QKLMS) algorithm [30.17]. The mapping update equation of the QKLMS can be simply expressed as

$$\begin{aligned} f_0 &= 0, \\ e(i) &= d(i) - f_{i-1}(u(i)), \\ f_i &= f_{i-1} + \eta e(i)\kappa(Q[u(i)], \cdot), \end{aligned} \tag{30.42}$$

where $Q[\cdot]$ is a quantization operator over the input space. A simple online vector quantization (VQ) method has also been proposed in [30.17]. The QKLMS algorithm (with simple online VQ) is described in Algorithm 30.2.
Algorithm 30.2 Quantized Kernel Least Mean Square Algorithm

Initialization:
Choose kernel $\kappa$, step size $\eta$, and quantization size $\varepsilon$
$a_1 = \eta d(1)$, $C(1) = \{u(1)\}$, $f_1 = a_1 \kappa(u(1), \cdot)$
Computation:
while $\{u(i), d(i)\}$ $(i > 1)$ available do

1) Compute the prediction error
$$e(i) = d(i) - \sum_{j=1}^{\mathrm{size}(C(i-1))} a_j(i-1)\,\kappa(C_j(i-1), u(i))$$

2) Compute the distance between $u(i)$ and $C(i-1)$
$$\mathrm{dis}(u(i), C(i-1)) = \|u(i) - C_{j^*}(i-1)\|,$$
where
$$j^* = \arg\min_{1 \le j \le \mathrm{size}(C(i-1))} \|u(i) - C_j(i-1)\|$$

3) if $\mathrm{dis}(u(i), C(i-1)) \le \varepsilon$, then
$C(i) = C(i-1)$, $a_{j^*}(i) = a_{j^*}(i-1) + \eta e(i)$
else
$C(i) = \{C(i-1), u(i)\}$, $a(i) = [a(i-1), \eta e(i)]$
end if
end while
(where $C_j(i-1)$ denotes the $j$th element of the dictionary $C(i-1)$).
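Algorithm 30.2 can be sketched in a few lines of Python/NumPy. This is a minimal illustration assuming a Gaussian kernel; the function name `qklms`, the parameter names, and the defaults are choices made for this example.

```python
import numpy as np

def qklms(inputs, desired, eta=0.5, sigma=1.0, eps=0.5):
    """Sketch of QKLMS with simple online vector quantization."""
    U = np.asarray(inputs, dtype=float)
    if U.ndim == 1:
        U = U[:, None]
    centers = [U[0]]
    coeffs = [eta * desired[0]]
    for u, d in zip(U[1:], desired[1:]):
        dist = np.linalg.norm(np.array(centers) - u, axis=1)
        k = np.exp(-dist ** 2 / (2.0 * sigma ** 2))
        e = d - np.dot(coeffs, k)            # step 1: prediction error
        j = int(np.argmin(dist))             # step 2: index of the nearest center
        if dist[j] <= eps:                   # step 3: quantize to that center
            coeffs[j] += eta * e
        else:                                # otherwise grow the dictionary
            centers.append(u)
            coeffs.append(eta * e)
    return np.array(centers), np.array(coeffs)
```

Because a new center is only added when it is farther than `eps` from every existing center, the dictionary stays bounded on a bounded input domain; with `eps = 0` and distinct inputs the sketch reduces to plain KLMS.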


Kernel Maximum Correntropy (KMC) Algorithm 
Like most conventional adaptive filtering algorithms, the KLMS adopts the MSE as the optimality cost function. The MSE is mathematically tractable, computationally simple, and optimal for linear Gaussian systems. However, the MSE may be a poor cost for nonlinear and/or non-Gaussian (e.g., heavy-tailed) situations, since it constrains only the second-order statistics. To cope with this problem, one may use a non-MSE cost, such as a higher order statistic or an information theoretic criterion (entropy, correntropy, divergence, etc.). In particular, the kernel maximum correntropy (KMC) algorithm has been developed in [30.31]; it is derived by applying the maximum correntropy criterion (MCC) to KLMS.

The correntropy defines a new correlation function between two random variables. Let $X$ and $Y$ be two random variables with the same dimensions. The correntropy is defined by [30.19]

$$V(X, Y) = E_{XY}\left[\kappa_{\mathrm{corr}}(X, Y)\right] = \int \kappa_{\mathrm{corr}}(x, y)\, dF_{XY}(x, y), \tag{30.43}$$

where $\kappa_{\mathrm{corr}}(\cdot, \cdot)$ is a Mercer kernel (usually the Gaussian kernel), and $F_{XY}(x, y)$ denotes the joint distribution function of $X$ and $Y$. Any Mercer kernel induces a nonlinear mapping $\varphi(\cdot)$ from the input space to a high-dimensional (possibly infinite-dimensional) feature space, and the inner product of two points $\varphi(X)$ and $\varphi(Y)$ in feature space can be implicitly computed by using the Mercer kernel. The correntropy (30.43) can therefore alternatively be expressed as

$$V(X, Y) = E\left[\langle \varphi(X), \varphi(Y) \rangle\right], \tag{30.44}$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product in the feature space induced by $\kappa_{\mathrm{corr}}(\cdot, \cdot)$. Clearly, correntropy is a generalized correlation function, and it is also positive definite, i.e., it defines a new RKHS for inference. By a simple Taylor series expansion of the kernel, one can see that correntropy provides a number that is the sum of all the statistical moments expressed by the kernel. In many applications, this sum may quantify the relationships of interest better than correlation, and it is much simpler to estimate than the higher order statistical moments. Therefore, it can be considered a new type of statistical descriptor and a new cost function for adaptive system training.

Under the MCC criterion, the learning cost function is $V(d(i), y(i)) = E[\kappa_{\mathrm{corr}}(d(i), y(i))]$. Dropping the expectation operator, one obtains the instantaneous cost function $\hat{V}(d(i), y(i)) = \kappa_{\mathrm{corr}}(d(i), y(i))$. Thus, a stochastic gradient algorithm in the RKHS $H_\kappa$ (which is induced by $\kappa$, not by $\kappa_{\mathrm{corr}}$) can be readily derived as follows

$$f_i = f_{i-1} + \eta \frac{\partial}{\partial f_{i-1}} \hat{V}(d(i), y(i)), \tag{30.45}$$

where $\partial/\partial f_{i-1}$ denotes the Fréchet differential. This algorithm is called the KMC algorithm. If $\kappa_{\mathrm{corr}}(\cdot, \cdot)$ is a Gaussian kernel, i.e., $\kappa_{\mathrm{corr}}(d(i), y(i)) = \exp(-e^2(i)/(2\sigma^2))$ with $e(i) = d(i) - y(i)$, then KMC (30.45) becomes

$$f_i = f_{i-1} + \eta \frac{\partial}{\partial f_{i-1}} \exp\left(-\frac{e^2(i)}{2\sigma^2}\right) = f_{i-1} + \frac{\eta}{\sigma^2} \exp\left(-\frac{e^2(i)}{2\sigma^2}\right) e(i)\,\kappa(u(i), \cdot). \tag{30.46}$$

The algorithm of (30.46) is, in fact, a KLMS algorithm with the variable step size $(\eta/\sigma^2)\exp(-e^2(i)/(2\sigma^2))$.
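Since (30.46) is a KLMS update with an error-dependent step size, a sketch differs from KLMS in a single line. The code below is an illustration under stated assumptions: both kernels are Gaussian, `sigma` is the kernel size of the filter, and `sigma_c` is a name introduced here for the kernel size of the correntropy cost (the text uses $\sigma$ for the latter).

```python
import numpy as np

def kmc(inputs, desired, eta=0.5, sigma=1.0, sigma_c=1.0):
    """Sketch of the KMC recursion (30.46) with Gaussian kernels."""
    U = np.asarray(inputs, dtype=float)
    if U.ndim == 1:
        U = U[:, None]
    d = np.asarray(desired, dtype=float)
    coeffs = np.zeros(len(U))
    for i in range(len(U)):
        k = np.exp(-np.sum((U[:i] - U[i]) ** 2, axis=1) / (2.0 * sigma ** 2))
        e = d[i] - coeffs[:i] @ k
        # error-dependent step size: large errors are strongly down-weighted
        step = (eta / sigma_c ** 2) * np.exp(-e ** 2 / (2.0 * sigma_c ** 2))
        coeffs[i] = step * e
    return U, coeffs
```

A sample with a very large error (for example an impulsive outlier) receives an almost zero coefficient, which is the mechanism behind the robustness of KMC to heavy-tailed noise.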

To achieve a better performance, one should select 
a suitable kernel size for correntropy. Note that there 
may be two kernel sizes in KMC: the kernel size for the 
RKHS of filter and the kernel size for the cost function. 
Here we talk about the latter. A kernel size update rule 
has been proposed in [30.32], which is 


o(i+ 1) =ao(i)+ (1-a) fies ; 


where $\sigma(i)$ denotes the kernel size of correntropy at iteration $i$, $0 < \alpha < 1$ is a forgetting factor, $\beta_G$ is the kurtosis of the Gaussian distribution (i.e., $\beta_G = 3$), and $\beta_e$ and $\sigma_e^2$ are, respectively, the kurtosis and variance of the prediction error.


(30.47) 


30.3.2 Kernel Recursive Least Squares (KRLS) Algorithm


The recursive least squares (RLS) is another popu- 
lar algorithm in the traditional linear adaptive filtering 
literature, which recursively updates the estimated au- 
tocorrelation matrix of the input signal vector and the 
cross-correlation vector between the input vector and 
the desired response. The convergence rate of RLS is, 
in general, much faster than the LMS algorithm. This 
improvement in performance, however, is achieved at 
the expense of an increase in computational complex- 
ity. Similar to the LMS algorithm, the RLS algorithm 
can also be kernelized. Next, we will discuss the KRLS 
algorithm [30.12]. The derivation of KRLS is based on 
a least squares formulation in the feature space. 

Based on a sequence of available examples (up to and including time $i-1$) $\{u(j), d(j)\}_{j=1}^{i-1}$, the regularized least squares regression in $H_\kappa$ can be formulated as

$$\min_{f \in H_\kappa} \sum_{j=1}^{i-1} |d(j) - f(u(j))|^2 + \gamma \|f\|_{H_\kappa}^2, \tag{30.48}$$

where $\gamma > 0$ is the regularization factor that controls the smoothness of the solution (to avoid overfitting). Note that in KLMS, the step size plays a role similar to that of the regularization factor (self-regularization property), and hence there is no need to add an explicit regularization factor in KLMS.

By the representer theorem [30.33], the function $f$ in $H_\kappa$ minimizing (30.48) can be expressed as a linear combination of the kernels in terms of the available data

$$f(\cdot) = \sum_{j=1}^{i-1} a_j \kappa(u(j), \cdot). \tag{30.49}$$

The learning problem can also be defined as finding $a(i-1) \in \mathbb{R}^{i-1}$ that minimizes

$$\min_{a(i-1) \in \mathbb{R}^{i-1}} \|\mathbf{d}(i-1) - K(i-1)a(i-1)\|^2 + \gamma\, a(i-1)^T K(i-1) a(i-1), \tag{30.50}$$

where $a(i-1) = [a_1, \ldots, a_{i-1}]^T$, $\mathbf{d}(i-1) = [d(1), \ldots, d(i-1)]^T$, and $K(i-1) \in \mathbb{R}^{(i-1)\times(i-1)}$ is the Gram matrix with elements $K_{jk} = \kappa(u(j), u(k))$, $j, k = 1, 2, \ldots, i-1$. The solution of (30.50) is

$$a^* = (\gamma I + K(i-1))^{-1}\mathbf{d}(i-1), \tag{30.51}$$

where $I$ denotes an identity matrix of appropriate dimension. The above least squares problem can alternatively be formulated in feature space $F$

$$\min_{\Omega \in F} \sum_{j=1}^{i-1} |d(j) - \Omega^T\varphi(j)|^2 + \gamma\|\Omega\|_F^2. \tag{30.52}$$

The solution of (30.52) can be derived as [30.9]

$$\Omega^* = \Phi(i-1)a^* = \Phi(i-1)(\gamma I + K(i-1))^{-1}\mathbf{d}(i-1), \tag{30.53}$$

where $\Phi(i-1) = [\varphi(1), \ldots, \varphi(i-1)]$ (hence the Gram matrix $K$ can also be expressed as $K = \Phi^T\Phi$). The KRLS algorithm updates this solution recursively as new data $(u(i), d(i))$ become available.

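When the new data $(u(i), d(i))$ are available, the optimal solution of (30.53) becomes

$$\Omega^* = \Phi(i)(\gamma I + K(i))^{-1}\mathbf{d}(i). \tag{30.54}$$

Denote

$$Q(i) = (\gamma I + K(i))^{-1} = (\gamma I + \Phi(i)^T\Phi(i))^{-1}. \tag{30.55}$$

It is easy to see that

$$Q(i)^{-1} = \begin{bmatrix} Q(i-1)^{-1} & h(i) \\ h(i)^T & \gamma + \varphi(i)^T\varphi(i) \end{bmatrix}, \tag{30.56}$$

where $h(i) = \Phi(i-1)^T\varphi(i)$. Using the block matrix inversion identity [30.9], one can derive

$$Q(i) = r(i)^{-1}\begin{bmatrix} Q(i-1)r(i) + z(i)z(i)^T & -z(i) \\ -z(i)^T & 1 \end{bmatrix}, \tag{30.57}$$

where $z(i) = Q(i-1)h(i)$ and $r(i) = \gamma + \kappa(u(i), u(i)) - z(i)^T h(i)$. Then the coefficient vector can be updated as

$$a^*(i) = Q(i)\mathbf{d}(i) = r(i)^{-1}\begin{bmatrix} Q(i-1)r(i) + z(i)z(i)^T & -z(i) \\ -z(i)^T & 1 \end{bmatrix}\begin{bmatrix} \mathbf{d}(i-1) \\ d(i) \end{bmatrix} = \begin{bmatrix} a^*(i-1) - z(i)r(i)^{-1}e(i) \\ r(i)^{-1}e(i) \end{bmatrix}, \tag{30.58}$$

where $e(i) = d(i) - h(i)^T a^*(i-1)$ is the prediction error.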

We have now obtained a recursive algorithm for solving the kernel least squares problem, namely, the KRLS algorithm (see Algorithm 30.3). The computational cost of KRLS is $O(i^2)$ per iteration, dominated by the update of the $i \times i$ matrix $Q(i)$. The KRLS also produces a network with linear growth. All the previously mentioned sparsification or quantization approaches can still be applied to curb the network growth. Note that the algorithm presented here is just the basic KRLS algorithm. There are many variants or extensions of KRLS, including the exponentially weighted KRLS (EW-KRLS) [30.9], sliding window KRLS (SW-KRLS) [30.34], fixed-budget KRLS (FB-KRLS) [30.35], extended KRLS (EX-KRLS) [30.13], and so on.


Algorithm 30.3 Kernel Recursive Least Squares Algorithm

Initialization:
Set the regularization parameter $\gamma > 0$
$C(1) = \{u(1)\}$, $Q(1) = [\kappa(u(1), u(1)) + \gamma]^{-1}$, $a^*(1) = Q(1)d(1)$
Computation:
while $\{u(i), d(i)\}$ $(i > 1)$ available do
$h(i) = [\kappa(u(i), u(1)), \ldots, \kappa(u(i), u(i-1))]^T$
$z(i) = Q(i-1)h(i)$
$r(i) = \gamma + \kappa(u(i), u(i)) - z(i)^T h(i)$
$Q(i) = r(i)^{-1}\begin{bmatrix} Q(i-1)r(i) + z(i)z(i)^T & -z(i) \\ -z(i)^T & 1 \end{bmatrix}$
$e(i) = d(i) - h(i)^T a^*(i-1)$
$a^*(i) = \begin{bmatrix} a^*(i-1) - z(i)r(i)^{-1}e(i) \\ r(i)^{-1}e(i) \end{bmatrix}$
end while
(where $C_j(i-1)$ denotes the $j$th element of the dictionary $C(i-1)$).
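The recursion of Algorithm 30.3 can be sketched as follows. This is a minimal illustration assuming a Gaussian kernel (so that $\kappa(u, u) = 1$); the function name `krls` and its defaults are choices made here. A convenient sanity check is that the recursive coefficients agree with the batch solution $(\gamma I + K)^{-1}\mathbf{d}$ of (30.51).

```python
import numpy as np

def krls(inputs, desired, gamma=0.1, sigma=1.0):
    """Sketch of basic KRLS; returns the coefficient vector and Q."""
    U = np.asarray(inputs, dtype=float)
    if U.ndim == 1:
        U = U[:, None]
    d = np.asarray(desired, dtype=float)
    Q = np.array([[1.0 / (gamma + 1.0)]])        # Q(1) = [kappa(u1, u1) + gamma]^-1
    a = Q[:, 0] * d[0]                           # a*(1) = Q(1) d(1)
    for i in range(1, len(U)):
        h = np.exp(-np.sum((U[:i] - U[i]) ** 2, axis=1) / (2.0 * sigma ** 2))
        z = Q @ h
        r = gamma + 1.0 - z @ h                  # kappa(u, u) = 1 for the Gaussian kernel
        e = d[i] - h @ a                         # prediction error
        Q = np.block([[Q * r + np.outer(z, z), -z[:, None]],
                      [-z[None, :], np.ones((1, 1))]]) / r
        a = np.concatenate([a - z * e / r, [e / r]])
    return a, Q
```

The matrix `Q` grows by one row and one column per sample, which is where the quadratic per-iteration cost comes from.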


30.3.3 Kernel Affine Projection Algorithms (KAPA)


The KAPA algorithms are nonlinear extensions of the 
conventional affine projection algorithms (APAs) in 
kernel space, which include the KLMS and KRLS as 
special cases [30.9]. Before presenting the KAPA al- 
gorithms, we give a brief introduction of the APA 
algorithms. 

Let $d$ be a zero-mean scalar-valued random variable, and $u$ be a zero-mean $m \times 1$ random variable with a positive-definite covariance matrix $R_u = E[uu^T]$. Denote by $r_{du}$ the cross-covariance vector $r_{du} = E[du]$. Then the weight vector $w$ that solves

$$\min_{w \in \mathbb{R}^m} E|d - w^T u|^2 + \lambda\|w\|^2 \tag{30.59}$$

is given by $w^* = (\lambda I + R_u)^{-1} r_{du}$.

The solution $w^*$ of (30.59) can also be obtained recursively using a common gradient-based method

$$w(i) = w(i-1) + \eta\left[r_{du} - (\lambda I + R_u)w(i-1)\right] = (1 - \eta\lambda)w(i-1) + \eta\left[r_{du} - R_u w(i-1)\right], \tag{30.60}$$
Table 30.3 Weight update equations of four KAPA algorithms

Algorithm   Update equation
KAPA-1      $\Omega(i) = \Omega(i-1) + \eta\Phi(i)[\mathbf{d}(i) - \Phi(i)^T\Omega(i-1)]$
KAPA-2      $\Omega(i) = \Omega(i-1) + \eta\Phi(i)(\varepsilon I + \Phi(i)^T\Phi(i))^{-1}[\mathbf{d}(i) - \Phi(i)^T\Omega(i-1)]$
KAPA-3      $\Omega(i) = (1 - \eta\lambda)\Omega(i-1) + \eta\Phi(i)[\mathbf{d}(i) - \Phi(i)^T\Omega(i-1)]$
KAPA-4      $\Omega(i) = (1 - \eta)\Omega(i-1) + \eta\Phi(i)(\lambda I + \Phi(i)^T\Phi(i))^{-1}\mathbf{d}(i)$


Table 30.4 Several kernel learning algorithms related to KAPA

Algorithm                 Relation to KAPA
KLMS [30.10]              KAPA-1 (L = 1)
NKLMS [30.9]              KAPA-2 (L = 1)
NORMA [30.36]             KAPA-3 (L = 1)
Kernel Adaline [30.37]    KAPA-1 (L = N)
RA-RBF [30.38]            KAPA-3 (ηλ = 1, L = N)
SW-KRLS [30.34]           KAPA-4 (η = 1)
RegNet [30.39]            KAPA-4 (η = 1, L = N)

or Newton's recursion (for the case $\lambda \ne 0$)

$$w(i) = w(i-1) + \eta(\lambda I + R_u)^{-1}\left[r_{du} - (\lambda I + R_u)w(i-1)\right] = (1 - \eta)w(i-1) + \eta(\lambda I + R_u)^{-1} r_{du}. \tag{30.61}$$

If the regularization factor $\lambda = 0$, Newton's recursion should be

$$w(i) = w(i-1) + \eta(\varepsilon I + R_u)^{-1}\left[r_{du} - R_u w(i-1)\right], \tag{30.62}$$

where $\varepsilon$ is a small positive number to avoid numerical instability.

Suppose we have access to the observations (training examples) of $u$ and $d$: $\{u(i), d(i)\}$, $i = 1, 2, \ldots$ Then the APA algorithms can be easily derived by approximating $R_u$ and $r_{du}$ in algorithms (30.60)-(30.62). Based on the $L$ most recent observations, the covariance matrix $R_u$ and the cross-covariance vector $r_{du}$ can be simply approximated by

$$\hat{R}_u = \frac{1}{L}U(i)U(i)^T, \qquad \hat{r}_{du} = \frac{1}{L}U(i)\mathbf{d}(i), \tag{30.63}$$

where

$$U(i) = [u(i-L+1), \ldots, u(i)]_{m \times L}, \qquad \mathbf{d}(i) = [d(i-L+1), \ldots, d(i)]^T.$$

Combining (30.60) and (30.63) yields (the constant factor $1/L$ is absorbed into the step size and the regularization parameter)

$$w(i) = (1 - \eta\lambda)w(i-1) + \eta U(i)\left[\mathbf{d}(i) - U(i)^T w(i-1)\right]. \tag{30.64}$$

When $\lambda = 0$, (30.64) becomes

$$w(i) = w(i-1) + \eta U(i)\left[\mathbf{d}(i) - U(i)^T w(i-1)\right]. \tag{30.65}$$

Similarly, combining (30.61) and (30.63), we have

$$w(i) = (1 - \eta)w(i-1) + \eta\left(\lambda I + U(i)U(i)^T\right)^{-1}U(i)\mathbf{d}(i) = (1 - \eta)w(i-1) + \eta U(i)\left(\lambda I + U(i)^T U(i)\right)^{-1}\mathbf{d}(i), \tag{30.66}$$

where the second equality comes from the matrix inversion lemma.

Further, using the approximations of (30.63), algorithm (30.62) becomes

$$w(i) = w(i-1) + \eta\left(\varepsilon I + U(i)U(i)^T\right)^{-1}U(i)\left[\mathbf{d}(i) - U(i)^T w(i-1)\right]. \tag{30.67}$$

Algorithm (30.67) is equivalent to (by the matrix inversion lemma)

$$w(i) = w(i-1) + \eta U(i)\left(\varepsilon I + U(i)^T U(i)\right)^{-1}\left[\mathbf{d}(i) - U(i)^T w(i-1)\right]. \tag{30.68}$$

Algorithms (30.65), (30.68), (30.64), and (30.66) are, respectively, referred to as the APA-1, APA-2, APA-3, and APA-4 algorithms.
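The linear APA updates translate directly into matrix code. The sketch below shows APA-1 (30.65) and both forms of APA-2, (30.67) and (30.68); the function names and the default value of `eps` are illustrative. In each function, `U` is the $m \times L$ matrix of the $L$ most recent inputs and `d` the corresponding vector of desired values.

```python
import numpy as np

def apa1_step(w, U, d, eta):
    """APA-1 update (30.65)."""
    return w + eta * U @ (d - U.T @ w)

def apa2_step(w, U, d, eta, eps=1e-3):
    """APA-2 update in the form (30.68): solves an L x L system."""
    L = U.shape[1]
    return w + eta * U @ np.linalg.solve(eps * np.eye(L) + U.T @ U, d - U.T @ w)

def apa2_step_primal(w, U, d, eta, eps=1e-3):
    """APA-2 update in the form (30.67): solves an m x m system."""
    m = U.shape[0]
    return w + eta * np.linalg.solve(eps * np.eye(m) + U @ U.T, U @ (d - U.T @ w))
```

The two APA-2 forms give the same result, but (30.68) only involves the $L \times L$ matrix $U^T U$ of inner products between inputs. That is the property that allows the kernel versions in Table 30.3 to be computed through the Gram matrix.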

Reformulating the above APA algorithms in fea- 
ture space yields the KAPA algorithms [30.11], whose 
weight update equations are summarized in Table 30.3. 

The KAPA algorithms are directly related to many 
other OKL algorithms [30.9]. Typical examples are pre- 
sented in Table 30.4. 
30.4 Illustration Examples


30.4.1 Chaotic Time Series Prediction 


Mackey-Glass (MG) Time Series 
First, we consider the Mackey-Glass time series. The time sequence is generated (with a sampling period $T = 6$ s) from the following time-delay differential equation [30.9]

$$\frac{dx(t)}{dt} = -bx(t) + \frac{ax(t-\tau)}{1 + x(t-\tau)^{10}}, \tag{30.69}$$

where $b = 0.1$, $a = 0.2$, and $\tau = 30$. The 10 most recent values ($u(i) = [x(i-10), \ldots, x(i-1)]^T$) are used as the input to predict the present value $x(i)$. A segment of 500 samples is used as the training data and another 100 as the testing data. The data are corrupted by additive Gaussian noise with zero mean and variance 0.0016. Figure 30.7 shows the learning curves of the LMS and KLMS algorithms. Evidently, the KLMS converges to a much smaller value of the testing MSE. This is an expected result, as the MG time series is a nonlinear system. In the simulation, the Gaussian kernel is used, and the kernel parameter ($a = 1/(2\sigma^2)$) is set at $a = 1.0$. The step sizes of the LMS and KLMS are both set at 0.2. Table 30.5 presents the performance comparison among LMS, KLMS with different step sizes, and the regularization network (RN) with different regularization parameters. The performance of KLMS is much better than that of LMS and is comparable to that of RN with the best regularization. This is indeed surprising, since RN is a batch-mode kernel regression method while KLMS is a simple stochastic gradient algorithm in RKHS.

Fig. 30.7 Learning curves of LMS and KLMS in MG time series prediction (adopted from [30.9])

Table 30.5 Performance comparison among LMS, KLMS, and RN

Algorithm         Training MSE       Testing MSE
LMS               0.021 ± 0.002      0.026 ± 0.007
KLMS (η = 0.1)    0.0074 ± 0.0003    0.0069 ± 0.0008
KLMS (η = 0.2)    0.0054 ± 0.0004    0.0056 ± 0.0008
KLMS (η = 0.6)    0.0062 ± 0.0012    0.0058 ± 0.0017
RN (λ = 0)        0 ± 0              0.012 ± 0.004
RN (λ = 1)        0.0038 ± 0.0002    0.0039 ± 0.0008
RN (λ = 10)       0.011 ± 0.0001     0.010 ± 0.0003


Lorenz Time Series 
Next, we consider the Lorenz chaotic time series, generated from a nonlinear, three-dimensional dynamic system [30.17]

$$\begin{aligned} \frac{dx}{dt} &= -\beta x + yz, \\ \frac{dy}{dt} &= \delta(z - y), \\ \frac{dz}{dt} &= -xy + \rho y - z, \end{aligned} \tag{30.70}$$

where the parameters are $\beta = 8/3$, $\delta = 10$, and $\rho = 28$. The sample data are obtained using the first-order approximation with step size 0.01. The state $x$ is picked for the short-term prediction task. The signal is preprocessed to have zero mean and unit variance (Fig. 30.8).
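A first-order (Euler) approximation of the Lorenz system with step size 0.01 can be sketched as follows. The sketch assumes the standard Lorenz equations written in the variable ordering $dx/dt = -\beta x + yz$, $dy/dt = \delta(z - y)$, $dz/dt = -xy + \rho y - z$, with $\beta = 8/3$, $\delta = 10$, $\rho = 28$; the initial state and the function name `lorenz_series` are arbitrary choices for this example.

```python
import numpy as np

def lorenz_series(n, dt=0.01, beta=8.0 / 3.0, delta=10.0, rho=28.0,
                  state=(1.0, 1.0, 1.0)):
    """Euler-integrate the Lorenz system and return the normalized x state."""
    x, y, z = state                      # arbitrary illustrative initial state
    out = np.empty(n)
    for k in range(n):
        dx = -beta * x + y * z
        dy = delta * (z - y)
        dz = -x * y + rho * y - z
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
        out[k] = x
    return (out - out.mean()) / out.std()   # zero mean, unit variance
```

The returned sequence can then be cut into input vectors of consecutive past samples and the sample to be predicted.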


Fig. 30.8 A segment of the processed Lorenz time series

We use the previous five consecutive samples $u(i) = [x(i-5), \ldots, x(i-1)]^T$ to predict the current sample $x(i)$. The performances of QKLMS, KLMS-NC, KLMS-SC, and the standard KLMS are compared. Here, KLMS-NC and KLMS-SC denote the sparsified KLMS with, respectively, the novelty and surprise criterion. The Gaussian kernel with kernel parameter $a = 1.0$ is used. The step sizes are all set at $\eta = 0.1$, and the other parameters are tuned such that all the algorithms except KLMS yield almost the same final network size (Fig. 30.9). Figure 30.10 shows the average learning curves over 100 simulation runs with different segments of the signal, where the testing MSE is calculated based on 200 test data (the filter is fixed in the testing phase). Simulation results clearly indicate that the QKLMS exhibits much better performance, achieving almost the same testing MSE as the KLMS but with a much smaller network size.

Fig. 30.9 Network sizes of QKLMS, KLMS-NC, and KLMS-SC in Lorenz time series prediction

Fig. 30.10 Learning curves of QKLMS, KLMS-NC, KLMS-SC, and KLMS in Lorenz time series prediction


30.4.2 Frequency Doubling 


In frequency doubling, the input and desired data for the learning system are sine waves with frequencies $f_0$ and $2f_0$, respectively (Fig. 30.11). In this example, 1500 samples are used as the training data and another 200 as the testing data. The data are corrupted by an impulsive Gaussian mixture noise, with probability density function $p(x) = 0.9\,\mathcal{N}(0, 0.01) + 0.1\,\mathcal{N}(2, 0.01)$ [30.31]. Let the dimension of the input vector be 2. Figure 30.12 shows the average learning curves of KLMS, KMC, and MCC (adaptive FIR filter under the MCC criterion). It is clear that the KMC algorithm outperforms both the KLMS and MCC algorithms. Simulation results suggest that the KMC algorithm performs well in impulsive noise environments.

Fig. 30.11 Simulation data in frequency doubling (adopted from [30.31])

Fig. 30.12 Learning curves of KLMS, KMC, and MCC in frequency doubling (adopted from [30.31])

Fig. 30.13 Basic structure of the noise cancellation system

Fig. 30.14 Average learning curves of NLMS, KLMS-NC, and KAPA-2-NC in noise cancellation (adopted from [30.9])

Fig. 30.15 Basic structure of the nonlinear channel

30.4.3 Noise Cancellation 


Noise cancellation is very important in signal processing, where an unknown interference has to be removed based on some reference measurement. Figure 30.13 shows the basic structure of a noise cancellation system. The goal of noise cancellation is to use the reference measurement $u(i)$ as the input to the adaptive filter and to obtain the filter output $y(i)$ as an estimate of the unknown noise source $n(i)$, such that the noise can be subtracted from the noisy measurement $d(i)$ to improve the signal-to-noise ratio (SNR).

In this example, the noise source is assumed to be white and uniformly distributed over $[-0.5, 0.5]$. Further, the nonlinear interference distortion function is

$$u(i) = n(i) - 0.2u(i-1) - u(i-1)n(i-1) + 0.1n(i-1) + 0.4u(i-2).$$

During the training phase the primary signal is assumed to be $s(i) = 0$, that is, the system simply tries to reconstruct the noise source from the reference measurement. We use the NLMS, KLMS-NC, and KAPA-2-NC ($L = 10$) algorithms. The average learning curves over 200 Monte Carlo simulations are illustrated in Fig. 30.14. In the simulation, the step sizes for NLMS, KLMS-NC, and KAPA-2-NC are 0.2, 0.5, and 0.2, respectively. The Gaussian kernel is used for both KLMS-NC and KAPA-2-NC with kernel parameter $a = 1.0$. The tolerance parameters for KLMS-NC and KAPA-2-NC are $\delta_1 = 0.15$ and $\delta_2 = 0.01$. The noise reduction (NR) factor, defined as

$$\mathrm{NR} = 10\log_{10}\frac{E[n^2(i)]}{E[(n(i) - y(i))^2]}, \tag{30.71}$$

and the corresponding final network sizes are listed in Table 30.6. The performance improvement of KAPA over KLMS is obvious.
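The NR factor in (30.71) can be estimated from sample averages. The helper below is a small sketch; the function name and the use of sample means in place of expectations are choices made for this example.

```python
import numpy as np

def noise_reduction_db(noise, estimate):
    """Sample estimate of NR = 10 log10(E[n^2] / E[(n - y)^2]) in dB."""
    noise = np.asarray(noise, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    residual_power = np.mean((noise - estimate) ** 2)
    return 10.0 * np.log10(np.mean(noise ** 2) / residual_power)
```

For instance, an estimate that recovers 90% of the noise amplitude leaves a residual of 10% of the amplitude, i.e., 1% of the power, which corresponds to an NR of 20 dB.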


30.4.4 Nonlinear Channel Equalization 


The final example is on nonlinear channel equalization, 
where the nonlinear channel consists of a serial con- 
nection of a linear filter H(z) and a static nonlinearity 
(Fig. 30.15). The problem setting is as follows: 


Table 30.6 Performance comparison of NLMS, KLMS, 
and KAPA-2 in noise cancellation 


Algorithms Network size NR (dB) 

NLMS N/A 9.09 + 0.45 
KLMS-NC 407+ 14 15.58 + 0.48 
KAPA-2-NC 370 14 21.99 + 0.80 
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Fig. 30.16 Learning curves of LMS, APA-1, KLMS- 
NC, KAPA-1-NC, and KAPA-2-NC in nonlinear channel 
equalization (adopted from [30.9]) 


A binary signal {s(1), s(2), ..., s(N)} is fed into
a nonlinear channel. At the receiver end of the channel,
the signal is further corrupted by additive i.i.d. Gaussian
noise and is then observed as {r(1), r(2), ..., r(N)}.
The objective of channel equalization is to learn an inverse
filter that recovers the original signal with as low
an error rate as possible. This problem can be formulated
as a regression problem with input-output training
data {(r(t+D), r(t+D-1), ..., r(t+D-l+1)), s(t)},
where l is the time embedding length and D is the
equalization time lag. In this example, the nonlinear
channel is defined by $x(t) = s(t) + 0.5 s(t-1)$, $r(t) =
x(t) - 0.9 x(t)^2 + n(t)$, where n(t) is white Gaussian
noise with variance $\sigma^2$.
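The problem setting above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `make_equalization_data`, the choice of the binary alphabet {-1, +1}, and the zero initial condition for s(0) are assumptions not stated in the text.

```python
import numpy as np

def make_equalization_data(n_samples, noise_var=0.01, embed_len=3, lag=2, seed=0):
    """Generate (input, target) pairs for the nonlinear channel equalization task.

    Assumptions (not from the text): symbols are drawn from {-1, +1} and the
    symbol preceding the first one is taken to be zero.
    """
    rng = np.random.default_rng(seed)
    s = rng.choice([-1.0, 1.0], size=n_samples)           # binary source signal s(t)
    s_prev = np.concatenate(([0.0], s[:-1]))              # s(t-1)
    x = s + 0.5 * s_prev                                   # linear filter stage
    noise = rng.normal(0.0, np.sqrt(noise_var), size=n_samples)
    r = x - 0.9 * x**2 + noise                             # static nonlinearity plus noise

    # Input at time t is (r(t+D), r(t+D-1), ..., r(t+D-l+1)); the target is s(t).
    first = max(0, embed_len - 1 - lag)
    t_idx = np.arange(first, n_samples - lag)
    inputs = np.stack([r[t_idx + lag - k] for k in range(embed_len)], axis=1)
    targets = s[t_idx]
    return inputs, targets

inputs, targets = make_equalization_data(1000)
```

Any regression learner, linear or kernel-based, can then be trained on `inputs` and `targets`; a hard decision on the sign of its output yields the recovered symbols.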

We compare the performance of LMS, APA-1,
KLMS-NC, KAPA-1-NC (L = 10) and KAPA-2-NC
(L = 10). The noise variance is set to 0.01, with
l = 3 and D = 2 in the equalizer. For KLMS-NC,
KAPA-1-NC and KAPA-2-NC, the Gaussian kernel
with kernel parameter 1.0 is used, and the NC is employed
with $\delta_1 = 0.26$, $\delta_2 = 0.08$. Figure 30.16 shows
the average learning curves over 50 Monte Carlo simulations,
where the MSE is calculated between the
continuous output (i.e., before taking the hard decision)
and the desired signal. Figure 30.17 plots the
dynamic changes of the network sizes during training.
In addition, different noise variances are tested. To
Fig. 30.17 Network sizes of KLMS-NC, KAPA-1-NC, and
KAPA-2-NC over training in nonlinear channel equalization
(adopted from [30.9])
Fig. 30.18 Performance comparison of LMS, APA-1,
KLMS-NC, KAPA-1-NC, and KAPA-2-NC with different
SNR in nonlinear channel equalization (adopted
from [30.9])


make the comparison fair, we tune the NC parameters
($\delta_1$ and $\delta_2$) to make the network size almost the
same (around 100) in each scenario. The simulation results
in terms of bit error rate (BER) are presented in
Fig. 30.18, where the normalized SNR is defined as
$10 \log_{10}(1/\sigma^2)$.
30.5 Conclusion


Online learning has found its place in a wide range of 
applications, especially in situations where the number 
of training data is extremely large or the data statistics 
change fast over time. Recent studies suggest that many 
online learning algorithms can be efficiently extended 
to kernel space, provided that these algorithms can be 
expressed in terms of inner products, since the inner 
products in high-dimensional kernel space can be sim- 
ply calculated using the kernel function in input space. 
At present, most of the well-known linear adaptive filter- 
ing algorithms, such as the LMS, RLS, and APA, have 
been kernelized. These new algorithms, namely the ker- 
nel adaptive filtering algorithms, can solve incremen- 
tally arbitrary nonlinear problems in the input space, if 
the kernel space is rich (high-dimensional) enough to 
31. Probabilistic Modeling in Machine Learning


Davide Bacciu, Paulo J.G. Lisboa, Alessandro Sperduti, Thomas Villmann 


Probabilistic methods are the heart of machine 
learning. This chapter shows links between core 
principles of information theory and probabilistic 
methods, with a short overview of historical and 
current examples of unsupervised and inferen- 
tial models. Probabilistic models are introduced as 
a powerful idiom to describe the world, using ran- 
dom variables as building blocks held together by 
probabilistic relationships. The chapter discusses 
how such probabilistic interactions can be mapped 
to directed and undirected graph structures, 
which are the Bayesian and Markov networks. We 
show how these networks are subsumed by the 
broader class of the probabilistic graphical mod- 
els, a general framework that provides concepts 
and methodological tools to encode, manipulate 
and process probabilistic knowledge in a computa- 
tionally efficient way. The chapter then introduces, 
in more detail, two topical methodologies that 
are central to probabilistic modeling in machine 
learning. First, it discusses latent variable mod- 
els, a probabilistic approach to capture complex 
relationships between a large number of observ- 
able and measurable events (data, in general), 
under the assumption that these are generated 
by an unknown, nonobservable process. We show 
how the parameters of a probabilistic model in- 
volving such nonobservable information can be 
efficiently estimated using the concepts underlying
the expectation-maximization algorithms.
Second, the chapter introduces a notable example
of a latent variable model that is of particular
relevance for representing the time evolution of
sequence data, namely the hidden Markov model.
The chapter ends with a discussion on advanced 
approaches for modeling complex data-generating 
processes comprising nontrivial probabilistic in- 
teractions between latent variables and observed 
information. 


31.1 Probabilistic and Information-Theoretic Methods 


Information theory is closely connected to probability 
theory and statistics. In particular, the standard defi- 
nition of information contained in a random variable 
X with a probability density function P(X) is well
known to be $I(X) = -\log(P(X))$, with the corresponding
Shannon entropy, in differential form, given by the
average information

$$H(P) = -\int P(X) \log(P(X)) \, dx .$$ (31.1)
One of the fundamental theorems of information theory,
the second Gibbs theorem, states that the normal
distribution achieves maximum entropy, hence maximal
average information, among all distributions with known
variance. To show this in the univariate case, consider
the normal distribution in the standard form

$$P(X) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(X - \mu)^2}{2\sigma^2} \right) .$$

It is straightforward to show that, for the natural logarithm,

$$-\int P(X) \log(P(X)) \, dx = \frac{1}{2} + \log\left( \sqrt{2\pi\sigma^2} \right) = -\int G(X) \log(P(X)) \, dx ,$$

where G(X) is any arbitrary density function with variance
$\int G(X) (X - \mu)^2 \, dx = \sigma^2$. Therefore, the difference
in average information between the two density functions
necessarily observes the following:

$$-\int P(X) \log(P(X)) \, dx + \int G(X) \log(G(X)) \, dx$$

$$= -\int G(X) \log(P(X)) \, dx + \int G(X) \log(G(X)) \, dx$$

$$= -\int G(X) \log\left( \frac{P(X)}{G(X)} \right) dx \ge -\int G(X) \left( \frac{P(X)}{G(X)} - 1 \right) dx = 0 ,$$

using the inequality $\log(x) \le x - 1$, which follows from the
concavity of the logarithm, and the normalization
property $\int P(X) \, dx = \int G(X) \, dx = 1$. This is a particular
instance of Gibbs' inequality and proves that the
asymptotic distribution of the central limit theorem also
maximizes entropy.
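As a quick numerical illustration of the maximum-entropy property, the following sketch compares the closed-form differential entropies of three densities that share the same variance. The choice of the Laplace and uniform densities as comparison cases, and the function names, are assumptions made for illustration only.

```python
import math

def gaussian_entropy(sigma):
    # Differential entropy of N(mu, sigma^2): 1/2 + log(sqrt(2*pi*sigma^2)).
    return 0.5 + math.log(math.sqrt(2.0 * math.pi * sigma**2))

def laplace_entropy(sigma):
    # A Laplace density with scale b has variance 2*b^2 and entropy 1 + log(2*b).
    b = sigma / math.sqrt(2.0)
    return 1.0 + math.log(2.0 * b)

def uniform_entropy(sigma):
    # A uniform density of width w has variance w^2/12 and entropy log(w).
    width = sigma * math.sqrt(12.0)
    return math.log(width)

sigma = 1.0
entropies = {
    "gaussian": gaussian_entropy(sigma),
    "laplace": laplace_entropy(sigma),
    "uniform": uniform_entropy(sigma),
}
```

For any common variance, the Gaussian entry is the largest of the three, in line with the inequality derived above.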

This led, in probability theory, to the definition 
of natural measures of dissimilarity closely related 
to the expectation of information difference, e.g., the 
Kullback-Leibler (KL) divergence [31.1]

$$D_{KL}(P \| Q) = \int P(X) \log\left( \frac{P(X)}{Q(X)} \right) dx ,$$ (31.2)

as generalized distances between probability distributions
P and Q.
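For discrete distributions, the integral in (31.2) becomes a sum. The following minimal sketch, with the hypothetical helper name `kl_divergence`, shows that the divergence vanishes for identical distributions and is not symmetric in its arguments, which is why it is only a generalized distance.

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P||Q) in nats.

    Assumes p and q are probability vectors on the same support and that
    q > 0 wherever p > 0; terms with p = 0 contribute zero.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.5]
q = [0.9, 0.1]
forward = kl_divergence(p, q)   # D_KL(P||Q)
reverse = kl_divergence(q, p)   # D_KL(Q||P), generally a different value
```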

The KL divergence occurs frequently in machine 
learning, where the development of learning strategies 
links information theory with statistical and biologically
motivated concepts. For instance, the perceptron
model was established as a simple but mathematically
tractable model of a biological neuron as the smallest
information processing unit in brains [31.2]. Recognition
that gradient descent provided a pragmatic but
effective solution to the credit assignment problem,
namely which values the hidden nodes should have,
led to the multilayer perceptron as a powerful computational
tool for classification and regression. Initially
maximum likelihood optimization was used for parameter
estimation, following the tried and tested statistical
concepts of normally distributed errors leading to
a sum-of-squares loss function in regression and, for 
classification, the Bernoulli distribution for binary data 
and the so-called cross-entropy (31.2) for multinomial 
class assignments, the latter two likelihood functions 
measuring information divergence averaged over the 
true distribution given by the empirical class labels. 

Information theoretic aspects (e.g., mutual informa- 
tion) were also considered in neural models in order 
to avoid overtraining [31.3], for instance in Boltzmann 
networks which directly mirror information princi- 
ples in statistical mechanics [31.4]. Related approaches 
are used currently for deep learning models, where 
information principles drive the feature representa- 
tions [31.5]. 

The correspondence between maximum entropy 
and maximum likelihood outlined above is just one 
aspect of the application of information-theoretic con- 
cepts in machine learning. The next section outlines fur- 
ther developments linked first to source identification 
through blind signal separation and matrix factoriza- 
tion methods. These concepts from signal processing 
identify important degrees of freedom that may be used 
as hidden variables in probabilistic models, discussed 
later in the chapter. Furthermore, the application of 
information-theoretic methods extends also to the au- 
tomatic identification of prototypes for use in compact 
data representations that include dictionaries defined by 
methods such as vector quantization, typically with un- 
supervised approaches. 

Supervised methods are introduced as probabilis- 
tic models, focusing first on discriminative methods. 
This indicates that the maximum likelihood approach 
is limited in its predictive power in generalization to 
out-of-sample data, because it allows models to be
generated with very little bias but with considerable
variance; for a more detailed discussion of this point,
refer to [31.6]. What this means in practice is that flexible
models such as neural networks are prone to overfitting 
unless the complexity of the model is controlled along 
with the extent to which the model fits the data. The 
latter is described by the likelihood, but the model complexity
can be controlled in a number of different ways.
In probabilistic models an efficient framework to max- 
imize the generality of probabilistic inference models 
is to apply the maximum a posteriori (MAP) frame- 
work which optimizes the posterior probability of the 
model parameters given the data but also given prior 
distributions for the parameters, typically limiting their 
size by assuming a zero-centred normal distribution as 
the prior. This is the basis of the method of automatic 
relevance determination, explained in Sect. 31.2. 

While discriminative models are efficient approxi- 
mators for nonlinear response functions, both in regres- 
sion and in the estimation of class conditional density 
functions, they are difficult to interpret and can gener- 
ally be considered as black boxes, meaning that they 
are not readily interpreted to give insights about the 
data. A topical and widely used alternative approach 
is to model the joint distribution of the data directly. 
This is ideally done by factorization into subgraphs into 
which the multivariate structure of the data is broken-up 
using strict conditional independence requirements, as 
discussed in Sect. 31.2. Inference can then proceed us- 
ing Bayes theorem introduced in (31.6). 

An alternative approach to modeling the joint distri- 
bution of the covariates is to use the mutual correlation 
in the data to identify important degrees of freedom that 
may be hidden in the sense that they are not directly ob- 
served. This generates latent variable representations 
that naturally fit into the framework of probabilistic 
modeling. However, the introduction of additional vari- 
ables also introduces complexity into the optimization 
process for estimating their values. This leads naturally 
to the introduction of expectation maximization (EM), 
a general approach of particular value for estimating 
mixture models, discussed in Sect. 31.3. 

So far the modeling methodologies focus on snap- 
shots of the data, without taking into consideration the 
time evolution of the covariates. To do this requires 
explicit parametrization, for which arguably the most 
widely used probabilistic approach is hidden Markov 
models (HMM). These models are built on the concepts
of conditional independence, latent variables, and expectation
maximization to model the time evolution of
sequences of covariate measurements, as discussed in the
last substantive section, Sect. 31.4.


31.1.1 Information-Theoretic Methods 


While the statistical properties of perceptrons are
widely investigated [31.6], the more difficult problem
of establishing statistical independence is becoming
increasingly important and novel algorithms have
been presented during the last decade [31.7]. Their 
applicability is enormous, ranging from variable selec- 
tion, to blind source separation (BSS) and statistical 
causality. Frequently, the difficult question of statistical 
dependence in data is replaced by the easier consid- 
eration of estimation and application of data correla- 
tions for learning strategies. A recent approach tries 
to determine independence by generalized correlation 
functions [31.8]. In this context of decorrelation and 
independence, BSS and nonnegative matrix factoriza- 
tions [31.9] of data channels are based on statistical 
deconvolution. A comprehensive overview for BSS, 
independent component analysis (ICA) and nonneg- 
ative matrix and tensor factorization (NMF) can be 
found in [31.10-12], respectively. Different aspects can 
be investigated, like ICA and BSS maximizing con- 
ditional probabilities [31.11]. A relevant connection 
exists between NMF and probabilistic graphical models 
comprising hidden variables [31.13], which is briefly 
discussed in Sect. 31.3.4. 

Other recent approaches in this field incorporate in- 
formation theoretic principles directly: Pham [31.14] 
investigated BSS based on mutual information, 
whereas [31.15] applied /-divergences. The infomax 
principle for ICA was considered in depth [31.16], as 
was the problem of learning overcomplete data rep- 
resentations and performing overcomplete noisy blind 
source separation, e.g., the sparse coding neural gas 
(SCNG) [31.17]. Results including modern divergences
(generalized $\alpha$-$\beta$-divergences) were recently
published [31.18]. Information theoretic divergence
measures like Rényi divergences (belonging
to the family of $\alpha$-divergences) directly capture the
statistical information contained in the data, as expressed
by the probability density function [31.19, 20]. This
property can be used for unsupervised model estimation 
for instance in vector quantization, when divergences 
are used as dissimilarity measure [31.21]. 

Information optimum vector quantization by prototypes
is a widely investigated topic in clustering and
data compression, based on the optimization of the
$\gamma$-reconstruction error

$$E_{VQ} = \int \| v - w(v) \|_E^{\gamma} \, P(V = v) \, dv ,$$

where P(V = v) is the data density of the vector data v
and $\| v - w(v) \|_E$ is the Euclidean distance between the data
vector and the prototype w(v) representing it. One of
the key results concerning information theoretic principles
for vector quantization is Zador's magnification
law [31.22]: if the data vectors v are given in q-dimensional
Euclidean space, then the magnification
law $\rho \sim P^{\alpha}$ holds. Here, $\rho(w)$ is the prototype density
with the magnification factor

$$\alpha = \frac{q}{q + \gamma} .$$


This is the basic principle of vector quantization based
on Euclidean distances. For different schemes like self-organizing
maps or neural gas variants, slightly
different magnification factors are obtained depending
on the choice of neighborhood cooperation scheme
applied during prototype adaptation [31.23-25]. Information
optimum magnification, $\alpha = 1$, is equivalent
to maximum mutual information [31.22]. Yet, it is possible
to control the magnification for most of these
algorithms by different strategies like localized or frequency
sensitive competitive learning. For an overview,
we refer to [31.23]. If the Euclidean distance is replaced
by divergence measures, optimum magnification
$\alpha = 1$ can also be achieved by maximum entropy learning
[31.26], or by the utilization of correntropy [31.27].
Vector quantization algorithms directly derived from information
theoretic principles based on Rényi entropies
are intensively studied in [31.28], which also highlights their
connection to graph clustering and Mercer kernel-based
learning [31.29].
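The two quantities discussed above, the reconstruction error and the magnification factor, can be made concrete with a small sketch. The empirical error below replaces the integral over the data density by an average over a finite sample, with each data vector represented by its nearest prototype; the function names are illustrative assumptions.

```python
import numpy as np

def magnification_exponent(q, gamma):
    # Zador's magnification factor alpha = q / (q + gamma) for q-dimensional data.
    return q / (q + gamma)

def reconstruction_error(data, prototypes, gamma=2.0):
    """Empirical gamma-reconstruction error of a nearest-prototype quantizer.

    data: array of shape (n_samples, q); prototypes: array of shape (n_prototypes, q).
    """
    data = np.asarray(data, dtype=float)
    prototypes = np.asarray(prototypes, dtype=float)
    # Euclidean distance of every data vector to every prototype.
    dists = np.linalg.norm(data[:, None, :] - prototypes[None, :, :], axis=2)
    # Each vector is represented by its closest prototype w(v).
    return float(np.mean(dists.min(axis=1) ** gamma))

data = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
error_one_prototype = reconstruction_error(data, data[:1], gamma=2.0)
alpha = magnification_exponent(q=2, gamma=2.0)
```

For squared Euclidean error (gamma = 2) in two dimensions the exponent is 1/2, so the prototype density follows the data density only sublinearly, which is why the information optimum value of 1 requires modified learning schemes.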

Other information theoretic vector quantizers opti- 
mize the mutual information between data and proto- 
types, or the respective KL divergence, instead of min- 
imizing a reconstruction error [31.30]. Based on this 
principle, several data embedding, or dimensionality re- 
duction techniques, have been developed as alternatives 
to multidimensional scaling. These approaches are frequently
used to visualize data. Prominent examples are
stochastic neighbor embedding (SNE) [31.31] and
variants thereof: for instance, t-SNE uses outlier-robust
Student-t distributions for data characterization instead
of Gaussians [31.32]. The generalization to
divergences other than KL can be found in [31.33].

Another role for information theory in machine 
learning is in feature selection. Removing irrelevant or 
redundant features not only leads to a simplification of 
the model and a reduced requirement for data acquisi- 
tion, but it is also central for maximizing the generality 
of the model when it is applied to future data. Most 
feature selection approaches are supervised schemes, 
hence using class information or expected regression
values. Strategies to achieve this goal can be classical
Bayesian inference schemes, of which automatic
relevance determination (ARD) is a good example (described
further in Sect. 31.2), or statistical approaches
based on mutual correlation or covariances [31.34, 35].
An alternative approach to feature selection is to use
mutual information

$$I(X, Y) = D_{KL}\left( J(X, Y) \,\|\, P(X) Q(Y) \right)$$

between random variables X and Y with probability
densities P and Q, respectively, and joint density
J [31.36]. Here, the features are treated as random variables
to be compared and mutual information measures
the information loss resulting from removal of variables 
from the model. Learning classification together with 
feature weighting in vector quantization is known as 
relevance learning [31.37]. Recent developments to in- 
troduce sparseness according to information theoretic 
constraints are discussed in [31.38, 39]. 
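For discrete variables, the mutual information defined above reduces to a sum over the joint probability table. The sketch below, with the hypothetical helper name `mutual_information`, computes it directly as the KL divergence between the joint distribution and the product of its marginals.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information I(X, Y) in nats from a discrete joint probability table.

    joint[i, j] is assumed to hold J(X = i, Y = j) and to sum to one.
    """
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1, keepdims=True)   # marginal P(X)
    q_y = joint.sum(axis=0, keepdims=True)   # marginal Q(Y)
    product = p_x * q_y                      # P(X) Q(Y)
    mask = joint > 0                         # zero-probability cells contribute nothing
    return float(np.sum(joint[mask] * np.log(joint[mask] / product[mask])))

independent = np.outer([0.3, 0.7], [0.4, 0.6])     # J = P * Q, so I(X, Y) = 0
dependent = np.array([[0.5, 0.0], [0.0, 0.5]])     # Y determined by X, so I(X, Y) = log 2
mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
```

In a feature selection setting, a feature whose mutual information with the target is close to zero carries little information about it and is a candidate for removal.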

Information-theoretic measures, such as mutual
information, can be explicitly estimated from
data [31.40]. This is used in the context of vectorial
data analysis to obtain consistent and reliable estimators
with topographic maps or kernels [31.41]. Further
applications of information theoretic learning also use
the Rényi entropy

$$H_{\alpha}(P) = \frac{1}{1 - \alpha} \log\left( \int P(X)^{\alpha} \, dx \right)$$

as a cost function instead of the mean squared error,
resorting, for computational efficiency, to Parzen estimators
[31.42] or nearest neighbor entropy estimation
models. For effective computation of an approximation
of the mutual information I(X, Y), the quadratic Rényi
entropy $H_2(P)$ or the closely related information energy
are common choices [31.43]. Parzen window-based estimators
for some information theoretic cost functions
have also been shown to be cost functions in a corresponding
Mercer kernel space [31.44]. In particular,
a classification rule based on an information theoretic
criterion has been shown to correspond to a linear classifier
in the kernel space. This leads to the formulation
of the support vector machine (SVM) from information
theory principles.
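The computational appeal of the quadratic Rényi entropy can be illustrated with a Parzen estimator. With Gaussian windows of width sigma, the integral of the squared density estimate has a closed form: it is the average of Gaussian kernels of variance 2 sigma^2 evaluated on all pairwise sample differences. The sketch below covers the one-dimensional case; the function name and the default window width are assumptions for illustration.

```python
import numpy as np

def quadratic_renyi_entropy(samples, sigma=0.5):
    """Parzen estimate of the quadratic Renyi entropy H_2 for 1-D samples (in nats).

    Uses Gaussian windows of standard deviation sigma; the convolution of two
    such windows is a Gaussian of variance 2 * sigma^2.
    """
    x = np.asarray(samples, dtype=float).ravel()
    diffs = x[:, None] - x[None, :]                       # all pairwise differences
    var = 2.0 * sigma**2
    kernel = np.exp(-diffs**2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)
    information_potential = kernel.mean()                 # estimate of the integral of P^2
    return float(-np.log(information_potential))

narrow = quadratic_renyi_entropy(np.linspace(-1.0, 1.0, 50))
wide = quadratic_renyi_entropy(np.linspace(-5.0, 5.0, 50))
```

No numerical integration is required, only a double sum over sample pairs, and samples that are spread more widely yield a larger entropy estimate.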


31.1.2 Probabilistic Models 


Kernel models are known for having excellent discrim- 
ination performance, but they are typically not well 
calibrated. This is because they are designed to be
efficient binary class allocation models rather than estimators
of the posterior probability for membership of
each class C. As an example, SVMs allocate inputs to
classes on the basis of a binary-valued indicator variable
that generally does not have a link function to
a probability density estimate. Models of this type are
known as discriminative models, a well-known variant
being Fisher's linear discriminant. As the name implies,
the central model is linear in the covariates,

$$y = w^T x ,$$

optimizing, for binary classification, a discriminant
function derived from the mean $m_i$ and variance $s_i^2$ of
each class (i.e., i = 1, 2), namely

$$J(w) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2} .$$
In general, given the two data cohorts, the covariance
matrix of the data S has a strict decomposition into
within- and between-class covariance matrices as $S = S_W + S_B$.
For an overall data mean vector m and a total
of $N_j$ data points in each class, these matrices are given
by

$$S = \sum_{i=1}^{N} (x_i - m)(x_i - m)^T ,$$

$$S_W = \sum_{j=1}^{2} \sum_{i=1}^{N_j} (x_i - m_j)(x_i - m_j)^T .$$

The solution to the optimization of J(w) is

$$w \propto S_W^{-1} (m_2 - m_1) ,$$

where the inverse of the within-class covariance matrix
$S_W$ positions the discriminant hyperplane so as to minimize
the overlap between the projections of the data
points in each class onto the direction of the weight
w. This illustrates the observation that, in general, this
projection will not be calibrated with a probabilistic estimate
such as the logit

$$\mathrm{logit}(P(C|X)) = \log\left( \frac{P(C|X)}{1 - P(C|X)} \right) .$$
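The closed-form solution for Fisher's discriminant can be sketched as follows. Data points are stored as rows, the within-class matrix is accumulated as a sum of outer products of the centred points, and the criterion is evaluated with within-class scatter sums in the denominator; the function names and the synthetic data are assumptions for illustration.

```python
import numpy as np

def fisher_direction(x1, x2):
    """Fisher discriminant direction w proportional to S_W^{-1} (m2 - m1).

    x1, x2: arrays of shape (n_samples, n_features), one per class.
    """
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    # Within-class scatter: sum over both classes of centred outer products.
    s_w = (x1 - m1).T @ (x1 - m1) + (x2 - m2).T @ (x2 - m2)
    w = np.linalg.solve(s_w, m2 - m1)
    return w / np.linalg.norm(w)

def fisher_criterion(w, x1, x2):
    """J(w): squared distance of projected means over summed projected scatter."""
    p1, p2 = x1 @ w, x2 @ w
    scatter = np.sum((p1 - p1.mean()) ** 2) + np.sum((p2 - p2.mean()) ** 2)
    return float((p2.mean() - p1.mean()) ** 2 / scatter)

rng = np.random.default_rng(0)
class1 = rng.normal([0.0, 0.0], [1.0, 3.0], size=(200, 2))
class2 = rng.normal([2.0, 1.0], [1.0, 3.0], size=(200, 2))
w_fisher = fisher_direction(class1, class2)
j_fisher = fisher_criterion(w_fisher, class1, class2)
```

The projection `x @ w_fisher` separates the classes well, but its raw value is a score on an arbitrary scale and not a probability, which is the calibration issue raised in the text.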


The correct calibration is found in a class of generalized 
linear models of the form 


$$y(x) = f(w^T x + w_0) ,$$


where f(-) is known as the activation function in ma- 
chine learning and its inverse is called a link function 
by statisticians [31.6]. Perhaps the best-known choice 
of activation is the sigmoid function, where the probabilistic
model becomes logistic regression and the linear
index $w^T x$ represents exactly the logit(P(C|X)). This is
very widely used and a generally well-calibrated model, 
even when severe class imbalance is present. 

It is often quoted that generalized linear mod- 
els are limited by the discriminant forms determined 
by the linear scores, which must therefore be hyper- 
planes. However, this ignores the observation that, in 
most practical applications, suitable attribute represen- 
tations are defined using domain knowledge, typically 
by binning variables into discrete states. This turns 
the probabilistic estimators into linear-in-the-parameter 
models with significant discrimination potential for 
nonlinearly separable data. In effect, if the link func- 
tion is properly tuned to the noise structure of the data 
and in particular when there are larger numbers of in- 
dependent covariates, well-designed generalized linear 
models are competitive with flexible machine learning 
models, the more so as the limitation of using a linear- 
in-the-parameters scoring index now works as a form 
of regularization limiting the complexity of the model. 
Moreover, the linear index provides a strong element 
of interpretability whose importance to application do- 
main experts cannot be overestimated. Notwithstanding 
the power of machine learning, generalized linear mod- 
els should always be used as benchmarks to set against 
nonlinear models. 

An alternative to probabilistic linear models is the 
wide range of flexible direct estimators of P(C|X) 
among which arguably the most widely used model re- 
mains the multilayer perceptron (MLP). Similarly to 
linear statistical models, however, it is important to 
note that the estimation of class conditional probabil- 
ities with an MLP is contingent on using a correct 
activation function at the output node together with 
a suitable choice of loss function, which must be one 
of the entropy functions outlined in the previous sec- 
tion. So, in binary classification, the log-likelihood 
function with a Bernoulli distribution should be used 
in conjunction with a sigmoid activation function. In 
the multinomial case, we would need an extension of 
the sigmoid function, the softmax activation, together 
with the cross-entropy as the loss function, since this
is the correct measure of the divergence between the 
estimated and observed probability density functions. 
Similarly, for nonlinear regression, the activation func- 
tion should be linear with the usual sum-of-squares 
error function, provided the inherent noise in the data 
can be assumed to be normally distributed with zero 
mean, since this is where the loss function is derived 
from. In the event where the noise variance, for in- 
stance, is dependent on the covariates, heteroscedastic 
noise models must be used to derive appropriate loss 
functions [31.6]. 

While the strength of neural networks is their uni- 
versal approximation capability, in the sense of fitting 
any multivariate surface to an arbitrarily small error, 
this flexibility also makes them prone to overfitting, po- 
tentially resulting in data models with little bias but 
large variance, in direct contrast to generalized linear 
models. In both cases, it is necessary to control the com- 
plexity of the model and this is best done by adding 
a penalty term to enforce the principle of parsimony, 
colloquially known as Occam’s razor (lex parsimo- 
niae). Arguably, the most commonly used and effective 
scheme is to apply Bayes’ theory at the level of fitting 
the model parameters, then to the regularization hyper- 
parameters, and finally to model selection itself. 

As we saw previously, the output of the MLP rep- 
resents a direct estimate of the posterior probability of 
class membership P(C|X). This approach can be generalized
for the analysis of longitudinal data where each
individual subject is followed up over a period of time
starting with a defined recruitment point and ending ei- 
ther at the end of a defined observation period or when 
an event of interest is observed, whichever occurs first. 
This is often called survival modeling and is typically 
used to estimate event rates in the presence of censor- 
ship, e.g., where the outcome of interest, for instance 
recovery from an illness, is observed in some subjects 
for only part of the allowed period of follow-up due 
to other events taking over, such as another condition 
setting-in, which prevent the observer from ever know- 
ing whether or not the subject would have recovered 
from the original illness, which is the event of interest. 
For discrete time, these models can be estimated using 
the standard MLP with an additional input node coding 
the time intervals. The output of the MLP again represents
a conditional probability, but now the probability
of the event occurring within each time interval given that
the subject survived until the start of the time interval.
This defines the hazard function $h_l(x_i)$, for the subject with
covariate vector $x_i$ and predictions over the lth discrete
time interval, which is given by

$$h_l(x_i) = P(T \le t_l \mid T > t_{l-1}, x_i) .$$

For a single event of interest, i.e., a single risk factor,
the log-likelihood function exactly mirrors that used in
binary classification, treating as independent the probability
estimates for each of the N subjects and over the
discrete time intervals where the subject was observed,
i.e., up to the end of the follow-up period or until censorship.
This leads to the following likelihood function

$$L_S = \prod_{i=1}^{N} \prod_{l=1}^{n_i} [h_l(x_i)]^{d_{il}} [1 - h_l(x_i)]^{1 - d_{il}} ,$$ (31.3)

where $n_i$ is the number of time intervals over which
subject i was observed, and the binary indicator variable
$d_{il} = 0$ if the event of interest was not observed for
the subject during the specific time interval, and is 1
otherwise. This function is known as a partial likelihood,
since it is measured
only over time periods where the outcomes for each
subject are observed, an approach that has been extended
to the multinomial case to provide a rigorous
treatment of censorship with flexible models in the context
known as competing risks [31.45].
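The partial likelihood (31.3) is straightforward to evaluate once the per-interval hazards are available. The sketch below computes its negative logarithm, which is the quantity one would minimize in practice; the function name, the ragged list-of-lists input format, and the clipping constant are assumptions for illustration, and the hazards would normally be outputs of the fitted model.

```python
import numpy as np

def neg_log_partial_likelihood(hazards, events):
    """Negative logarithm of the discrete-time partial likelihood (31.3).

    hazards[i][l]: predicted hazard for subject i in the l-th observed interval.
    events[i][l]:  1 if the event was observed in that interval, 0 otherwise.
    Subjects may be observed over different numbers of intervals.
    """
    total = 0.0
    for h, d in zip(hazards, events):
        h = np.clip(np.asarray(h, dtype=float), 1e-12, 1.0 - 1e-12)  # avoid log(0)
        d = np.asarray(d, dtype=float)
        total -= np.sum(d * np.log(h) + (1.0 - d) * np.log(1.0 - h))
    return float(total)

# Subject 0 has the event in its third interval; subject 1 is censored after two.
hazards = [[0.1, 0.2, 0.5], [0.3, 0.3]]
events = [[0, 0, 1], [0, 0]]
loss = neg_log_partial_likelihood(hazards, events)
```

The censored subject contributes only survival terms for the intervals actually observed, which is how censorship is handled without discarding the subject.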

Application of the Bayesian regularization framework
consists in maximizing the posterior probability
for the model parameters w, given the data set D, the
regularization hyperparameters $\alpha$ and the choice of the
model structure, e.g., selected covariates H, namely

$$P(w | D, \alpha, H) = \frac{P(D | w, \alpha, H) P(w | \alpha, H)}{P(D | \alpha, H)} .$$ (31.4)

The first term on the right-hand side of (31.4)
denotes the probability of the model fitting the data,
represented by the exponential of the negative entropy
term discussed in the introduction, which for longitudinal
data is the negative logarithm of (31.3), hence

$$P(D | w, \alpha, H) = e^{\log L_S} = L_S .$$


The second term in (31.4) represents a prior distribution
of the model parameters, typically with a quadratic
loss term corresponding to independent zero-mean univariate
Gaussian distributions, sometimes called weight
decay terms. A particularly efficient implementation of
Bayesian regularization is to assign a separate weight
decay term to each covariate, indexing the covariates by
m, of which there are $N_\alpha$, with the $N_m$ hidden nodes indexed
by n. This allows each covariate to be separately
turned on or off depending on how informative it is
for fitting the observations about the outcome variable,
a process known as automatic relevance determination
(ARD) [31.4]. Expressed in full, this gives

$$P(w | \alpha, H) = \frac{e^{-G(w, \alpha)}}{Z_W(\alpha)} , \quad \text{where} \quad G(w, \alpha) = \frac{1}{2} \sum_{m=1}^{N_\alpha} \alpha_m \sum_{n=1}^{N_m} w_{mn}^2 , \quad Z_W(\alpha) = \prod_{m=1}^{N_\alpha} \left( \frac{2\pi}{\alpha_m} \right)^{N_m / 2} .$$

In principle, the best values for the regularization hyperparameters,
i.e., the weight decay parameters $\alpha$, are
those which maximize their posterior probability

$$P(\alpha | D, H) = \frac{P(D | \alpha, H) P(\alpha | H)}{P(D | H)} .$$

However, the denominator of (31.4) cannot be obtained
in closed form, so a Laplace approximation is typically
applied around a stationary point in the loss function as
a function of the weights. This amounts to a local Taylor
expansion of

$$P(D | \alpha, H) = \int P(D | w, \alpha, H) P(w | \alpha, H) \, dw = \int \frac{e^{-S(w, \alpha)}}{Z_W(\alpha)} \, dw ,$$

where the linear term in the weights vanishes because
of stationarity, leading to

$$S^*(w, \alpha) \approx S(w^{MP}, \alpha) + \frac{1}{2} (w - w^{MP})^T A (w - w^{MP}) ,$$

from which the posterior probability for the hyperparameter
results

$$P(\alpha | D, H) \propto \frac{e^{-S(w^{MP}, \alpha)}}{Z_W(\alpha)} (2\pi)^{N_w / 2} \det(A)^{-1/2} .$$
In practice, what this means is that the log-odds ratio,
given by the activation of the output node of the MLP,
can be assumed to have a univariate normal distribution
whose variance $s^2$ is determined by the Hessian A of
S with respect to the weights and by the gradient g of the
activation a with respect to the weights, namely

$$P(a | X, D) = \frac{1}{(2\pi s^2)^{1/2}} \exp\left( -\frac{(a - a_{MP})^2}{2 s^2} \right) ,$$

with $a_{MP}$ denoting the most probable value of the activation
function, i.e., the direct output of the MLP
without marginalization, and

$$s^2 = g^T A^{-1} g .$$


The so-called marginalized estimate of the MLP output
is now the posterior distribution integrated over the
activation a. In the above expression, g is the gradient
of the activation with respect to the network weights
and A is the corresponding Hessian, i.e., the matrix of
second partial derivatives. For binary classification and
single-risks modeling, this is given by a neat analytical
expression

$$h(x_i, t_l) = \int g(a) P(a | X, D) \, da \approx g\left( \frac{a_{MP}(x_i)}{\sqrt{1 + (\pi/8) \, g^T A^{-1} g}} \right) ,$$ (31.5)

with g(·) denoting the sigmoid function. This adjustment
to the original MLP output, i.e., $a_{MP}$, shows the
regularization process in operation: stationary points
where the weights are well defined have small variance
$s^2$, and therefore their value remains almost unchanged.
Conversely, flat valleys in the loss function,
where stationary points for the weights have broad
Gaussian distributions, are penalized by reducing the
value of the argument of the sigmoid function in
(31.5) toward zero, reflecting an increase in uncertainty
by shifting the MLP output toward the don't know
threshold.
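The effect of marginalization in (31.5) can be illustrated numerically. In the sketch below, the gradient vector and the two Hessians are made-up toy values standing in for quantities that would come from a trained network, and the function names are assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def marginalized_output(a_mp, grad, hessian):
    """Moderated MLP output following (31.5).

    a_mp:    most probable activation of the output node.
    grad:    gradient of the activation with respect to the weights.
    hessian: Hessian A of the regularized loss with respect to the weights.
    Returns the moderated probability and the activation variance s^2.
    """
    s2 = float(grad @ np.linalg.solve(hessian, grad))     # s^2 = g^T A^{-1} g
    return float(sigmoid(a_mp / np.sqrt(1.0 + np.pi * s2 / 8.0))), s2

a_mp = 2.0
grad = np.array([1.0, 1.0])
# Sharply curved loss: weights are well determined, so the output barely changes.
p_sharp, s2_sharp = marginalized_output(a_mp, grad, 100.0 * np.eye(2))
# Flat valley: weights are poorly determined, so the output moves toward 0.5.
p_flat, s2_flat = marginalized_output(a_mp, grad, 0.1 * np.eye(2))
```

With these toy values the sharply curved case stays close to the unmoderated output sigmoid(2), while the flat case is pulled noticeably toward the 0.5 threshold.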

A probabilistic alternative to discriminative approaches
consists of generative models, where Bayes'
theorem is once again put into practice to estimate
the posterior probability of class membership $P(C_k|X)$
from the class conditional density functions $P(X|C_k)$
and prior probabilities for the classes $P(C_k)$, that
is

$$P(C_k | X) = \frac{P(X | C_k) P(C_k)}{\sum_j P(X | C_j) P(C_j)} ,$$ (31.6)


where classes are indexed by k and the sum-rule has 
been used to expand the denominator. Suitable mod- 
els for the probability density functions (pdf) of the 
data given each class will depend on the nature of the 
data. However, it is straightforward to show for two 
classes that if the pdfs are normal distributions with 
equal variance, then the posterior probability will have 
exactly the functional form of the logistic regression 
model. This can be taken as an explanation in proba- 
bilistic terms of the potential limitations of this linear 
model, since different classes in practice tend to have 
distinct variances, even when the data sets for each
class are approximately normally distributed. A natural
extension of this approach is to use a mixture of Gaussian
distributions. This is a very flexible model that can
parameterize also multimodal density functions. In the 
interest of space, we refer the interested reader to a stan- 
dard textbook [31.6]. 
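
The claim that two normal class-conditional densities with equal variance yield a logistic posterior can be checked numerically. The following is a minimal sketch for one-dimensional data; the function names and the parameter values used to exercise it are illustrative assumptions, not part of the chapter.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def posterior_bayes(x, mu1, mu2, var, prior1):
    """P(C_1 | x) for two classes, computed directly from Bayes' theorem (31.6)."""
    joint1 = gaussian_pdf(x, mu1, var) * prior1
    joint2 = gaussian_pdf(x, mu2, var) * (1.0 - prior1)
    return joint1 / (joint1 + joint2)

def posterior_logistic(x, mu1, mu2, var, prior1):
    """The same posterior written as a sigmoid of a linear score w*x + b."""
    w = (mu1 - mu2) / var
    b = (mu2 ** 2 - mu1 ** 2) / (2.0 * var) + math.log(prior1 / (1.0 - prior1))
    return 1.0 / (1.0 + math.exp(-(w * x + b)))
```

The weight w and bias b follow from expanding the log-odds of the two joint terms: the quadratic terms in x cancel only because the variances are equal, which is why unequal variances break the linear form.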

The two approaches of discriminative and generative
models may be combined by using generative
models to build kernels. These kernels define similarity
between two covariate vectors x and x' by correlation
between the respective pdfs, with the values of the kernel
function given by k(x, x') = P(X = x) · P(X' = x')
for suitable choices of the probability functions. A kernel
so designed will naturally form a Gram matrix.
Such kernels lead naturally to the use of latent variables

\[
k(x, x') = \sum_i P(X = x \mid Z = i)\, P(X' = x' \mid Z = i)\, P(Z = i) ,
\]

with weighting coefficients P(Z) reflecting the strength
of the latent variable Z indexed by i. An example of this
approach in practice will be seen in the HMMs later in
this chapter (see Sect. 31.4.2).


31.2 Graphical Models


In this section, we give a basic introduction to graphical
models, a general framework for dealing with uncertainty
in a computationally efficient way. Probabilistic
models that we treat in the next sections belong to this
framework. Here, we introduce the two main classes
of graphical models, Bayesian and Markov networks,
discussing different methods for performing probabilistic
inference. Specific instances of learning within this
framework are given in the next two sections. For the
sake of presentation, here we limit our discussion to
discrete random variables; however, graphical models
can be defined on continuous variables or mixed variables.
The material covered in this section is based
on [31.6, 46, 47].

A graphical model allows us to represent a family
of joint probability distributions in terms of a directed
or undirected graph, where nodes are associated
with random variables, and edges represent some
form of direct probabilistic interaction between variables.
Being able to compactly represent the joint
probability distribution of a set of random variables
X = {X_1, ..., X_n} is very important: any probabilistic
query involving the variables X_1, ..., X_n can be answered
by knowing their joint probability distribution
P(X_1, ..., X_n). For example, assume variables to be
discrete, and suppose we want to know the posterior
probability of X_1 and X_2 given all the other variables,
i.e., P(X_1, X_2|X_3, ..., X_n). We can easily answer this
query by computing

\[
P(X_1, X_2 \mid X_3, \dots, X_n)
  = \frac{P(X_1, \dots, X_n)}
         {\sum_{x_1 \in \mathrm{dom}(X_1),\; x_2 \in \mathrm{dom}(X_2)} P(X_1 = x_1, X_2 = x_2, X_3, \dots, X_n)} .
\]


Unfortunately, storing the joint probability values associated
with all the different assignments x_1, ..., x_n is not
feasible: if d_i is the size of dom(X_i), the number of different
assignments is d_1 × d_2 × ... × d_n, i.e., an exponential number of
entries. This situation, however, constitutes the worst
case. In fact, in many application domains, indepen- 
dence properties allow us to factorize the joint distribu- 
tion into compact parts which can be stored efficiently. 
Graphical models provide the language to compactly 
represent these factors, enabling in many cases infer- 
ence and learning over a compact parameterization of 
the joint distribution as graphical manipulations. 
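
To make the role of the full joint table concrete, the sketch below answers a conditional query by summing and renormalizing table entries, as in the formula for P(X_1, X_2|X_3, ..., X_n). The three-variable table and the helper name conditional_query are hypothetical, chosen only for illustration.

```python
from itertools import product

def conditional_query(joint, query_idx, evidence):
    """Return P(query variables | evidence) from a full joint table.

    joint: dict mapping assignment tuples to probabilities.
    query_idx: positions of the query variables in each tuple.
    evidence: dict mapping variable position -> observed value.
    """
    unnormalized = {}
    for assignment, p in joint.items():
        # keep only the entries that agree with the evidence
        if all(assignment[i] == v for i, v in evidence.items()):
            key = tuple(assignment[i] for i in query_idx)
            unnormalized[key] = unnormalized.get(key, 0.0) + p
    total = sum(unnormalized.values())
    return {k: v / total for k, v in unnormalized.items()}

# Hypothetical joint over three Boolean variables (X1, X2, X3): 2**3 entries.
assignments = list(product([0, 1], repeat=3))
joint = {a: w / 36.0 for a, w in zip(assignments, [1, 2, 3, 4, 5, 6, 7, 8])}
```

With Boolean variables the table doubles in size with every additional variable, which is the exponential growth that the factorizations discussed next are meant to avoid.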

Graphical models can be characterized accord- 
ing to the type of probabilistic interaction between 
variables they model. Directed graphs (Bayesian networks)
are used to express causal relationships between
random variables (i.e., cause → effect relationships),
while undirected graphs (Markov networks) are better
suited to express probabilistic constraints among subsets
of variables to which it is difficult to ascribe a directionality
(graphical models containing both directed and
undirected edges are possible; however, they will not 
be covered here). In both cases, the joint distribution is 
factorized according to the notion of conditional inde- 
pendence. 


Definition 31.1 Conditional Independence 

Let X, Y, Z be sets of random variables with X_i ∈ X,
Y_i ∈ Y, Z_i ∈ Z. X is conditionally independent of Y
given Z (denoted as X ⊥ Y | Z) in a distribution P if, for
all values x_i ∈ dom(X_i), y_i ∈ dom(Y_i), z_i ∈ dom(Z_i)

\[
P(X = x, Y = y \mid Z = z) = P(X = x \mid Z = z)\, P(Y = y \mid Z = z) ,
\]

where X = x denotes X_1 = x_1, ..., X_{n_X} = x_{n_X}, Y = y
denotes Y_1 = y_1, ..., Y_{n_Y} = y_{n_Y}, Z = z denotes Z_1 =
z_1, ..., Z_{n_Z} = z_{n_Z}, and n_X = |X|, n_Y = |Y|, n_Z = |Z|.


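It is not difficult to see that if X ⊥ Y | Z, then it is
also true that P(X|Y, Z) = P(X|Z). In fact, using the
product rule for probabilities, we have P(X, Y|Z) =
P(X|Y, Z)P(Y|Z); comparing this with the definition,
which gives P(X, Y|Z) = P(X|Z)P(Y|Z), yields
P(X|Y, Z) = P(X|Z) whenever P(Y|Z) > 0.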

In the following, we will discuss how conditional 
independence is used within Bayesian and Markov net- 
works to factorize the joint distribution. Inference and 
learning will be discussed as well. 


31.2.1 Bayesian Networks 


Bayesian networks are directed acyclic graphs used to
model causal relationships between random variables:
an edge X_1 → X_2 is used to express the fact that variable
X_1 (cause) influences variable X_2 (effect).
Combining this interpretation with
the exploitation of conditional independence, where
applicable, allows the efficient probabilistic modeling
of many relevant application domains. In general, the
product rule can be used to factorize the joint distribution
of variables X_1, X_2, X_3, ..., X_n as


\[
P(X_1, X_2, X_3, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_{i-1}, \dots, X_2, X_1) . \tag{31.7}
\]


The conditional independence relationships can be used 
to simplify the form of each factor in (31.7), i.e., 
by eliminating variables from the conditioning part, 
thus drastically reducing the number of probability 
values that need to be specified to define the factor. 
For example, if we assume that all the variables are
Boolean, then the number of entries needed to define
P(X_n|X_1, X_2, ..., X_{n-1}) would be 2^{n-1}. If we consider
a simple scenario in which the variable X_n is dependent
only on X_{n-1}, the corresponding simplified factor
becomes P(X_n|X_1, X_2, ..., X_{n-1}) = P(X_n|X_{n-1}), which
only requires two entries. 
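
The saving can be quantified with a short calculation. The sketch below counts table entries for Boolean variables, assuming (as an illustrative case not spelled out in the text) that every variable X_i depends only on X_{i-1}, i.e., that the network is a chain.

```python
def full_factor_entries(n):
    """Entries for P(X_n | X_1, ..., X_{n-1}): one per assignment of the n-1 parents."""
    return 2 ** (n - 1)

def total_entries_product_rule(n):
    """All factors of (31.7) without simplification: 1 + 2 + ... + 2**(n-1) = 2**n - 1."""
    return sum(2 ** (i - 1) for i in range(1, n + 1))

def total_entries_chain(n):
    """Chain assumption: X_1 needs one entry, every other X_i needs two."""
    return 1 + 2 * (n - 1)
```

For n = 10 this gives 1023 entries for the unsimplified factorization against 19 for the chain.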

The naive Bayes model used in classification tasks 
can be understood as a Bayesian network, where the 
variable associated with the class label C is the cause 
and the variables X_1, ..., X_n used to describe the attributes
of the current input are the effects. The underlying
conditional independence assumption is fairly
simplistic, but allows a very parsimonious factorization
of the joint distribution. By assuming that the class
label does not depend on the attributes, and that the attributes
are conditionally independent with respect to
each other given the class label, i.e., ∀ i ≠ j, P(X_i, X_j|C) =
P(X_i|C)P(X_j|C), naive Bayes factorizes the joint distribution
as


\[
P(C, X_1, X_2, X_3, \dots, X_n) = P(C) \prod_{i=1}^{n} P(X_i \mid C) .
\]


The details of this model are not discussed in this chap- 
ter, but a good didactic reference is [31.6]. 
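
As an illustration of how the naive Bayes factorization is used for classification, the following sketch computes P(C|x) by multiplying the prior by the per-attribute factors and normalizing. The class names, the Boolean attributes, and all probability values are hypothetical.

```python
def naive_bayes_posterior(prior, cpts, x):
    """Posterior over classes for Boolean attributes.

    prior: dict class -> P(C).
    cpts: one dict per attribute, mapping class -> P(X_i = 1 | C).
    x: tuple of observed 0/1 attribute values.
    """
    joint = {}
    for c, p_c in prior.items():
        p = p_c
        for cpt, value in zip(cpts, x):
            p_one = cpt[c]
            p *= p_one if value == 1 else 1.0 - p_one
        joint[c] = p  # P(C = c, X = x) under the naive Bayes factorization
    evidence = sum(joint.values())
    return {c: p / evidence for c, p in joint.items()}

# Hypothetical two-class problem with two Boolean attributes.
prior = {"spam": 0.4, "ham": 0.6}
cpts = [{"spam": 0.8, "ham": 0.1}, {"spam": 0.3, "ham": 0.5}]
```

For the observation x = (1, 0) the two joint terms are 0.4 · 0.8 · 0.7 = 0.224 and 0.6 · 0.1 · 0.5 = 0.03, so the posterior of the first class is 0.224 / 0.254.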

In general, after simplification via conditional independence,
factors are in the form P(X_i|X_{j_1}, ..., X_{j_k}),
where X_{j_1}, ..., X_{j_k} are denoted as parents of X_i, and
the notation pa(X_i) is used with the following meaning:
pa(X_i) = {X_{j_1}, ..., X_{j_k}}. The factor associated with variable
X_i can thus be rewritten as P(X_i|pa(X_i)) and the
joint distribution as


\[
P(X_1, X_2, X_3, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid pa(X_i)) . \tag{31.8}
\]


The graphical representation of a Bayesian network is
shown in Fig. 31.1. The graphical model includes one
node for each involved variable. Moreover, a variable
that is conditioned (effect) with respect to a parent
one (cause) receives a directed edge from that variable.
For example, in the Bayesian network represented in
Fig. 31.1, we have pa(X_7) = {X_2, X_3}, i.e., the set constituted
by the two nodes from which X_7 receives an
edge. This means that the factor associated with X_7 is
P(X_7|X_2, X_3). In Fig. 31.1, we have reported one popular
way to specify the parameters of P(X_7|X_2, X_3) when
the involved variables are discrete, i.e., the conditional
probability distribution table (CPD table). The CPD of
X_7 in Fig. 31.1, for instance, reports the probability of
X_7 = t, given each possible assignment of values to its
parents. The CPD table associated with X_5 is reported
as well. By using the CPD tables associated with all nodes,
the joint distribution can be rewritten as

\[
P(X_1, \dots, X_7) = P(X_1)\, P(X_3)\, P(X_2 \mid X_1)\, P(X_7 \mid X_2, X_3)\,
                     P(X_5 \mid X_7)\, P(X_6 \mid X_7)\, P(X_4 \mid X_6) .
\]

Fig. 31.1 An example of Bayesian network. Conditional probability
tables are shown only for variables X_5 and X_7. Different types of
probabilistic influence among variables are highlighted


Note that different distributions can be obtained by 
using different values for the entries of the CPD ta- 
bles. Thus, a Bayesian network is actually representing 
a family of distributions: all the distributions that are 
consistent with the conditional independence assump- 
tions used to simplify the factors. In fact, up to now, we 
have discussed how starting from a universal decompo- 
sition of the joint distribution via the product rule (note 
that such decomposition is not unique as it depends on 
the presentation order assigned to the variables), a set 
of conditional independence assumptions can be used 
to simplify the factors, leading to the corresponding 
graphical representation given by the Bayesian net- 
work. An important question, however, is whether the 
topological structure of a Bayesian network allows for 
the direct identification of other (conditional) inde- 
pendence relationships, i.e., whether there exist other 
(conditional) independence relationships that must hold 
for any joint distribution P that is compatible with the 
structure of a specific Bayesian network (note that additional
relationships may hold only for some specific
distributions, i.e., some specific assignment of values to
the entries of the CPD tables). As we will see later, the
answer to this question is important to devise general- 
purpose inference algorithms on Bayesian networks. 
A general procedure, called d-separation (directed sep- 
aration), can answer the question. It is based on the 
observation that two variables are not independent if 
one can influence the other via one or more paths in the 
graph. Let us exemplify this concept on the Bayesian 
network reported in Fig. 31.1, where we have high- 
lighted four different basic cases: 


1. Indirect causal effect: X_1 can influence X_7 via X_2 if
and only if X_2 is not observed (a variable is said to
be observed if the value assigned to that variable is
known).

2. Indirect evidential effect: X_4 can influence X_7 via X_6
if and only if X_6 is not observed.

3. Common cause: X_5 can influence X_6 (and vice versa)
via X_7 if and only if X_7 is not observed.

4. Common effect: X_2 can influence X_3 (and vice versa)
if and only if either X_7 or one of X_7's descendants
(in this case, X_5, X_6, X_4) is observed.


The topological structure encountered in the com- 
mon effect is called v-structure and it plays a relevant 
role in the d-separation procedure. In general, it is clear 
from above that probabilistic influence does not follow
edge direction. Thus, when considering a longer
trail, e.g., the path from X_1 to X_4, we have to consider
whether each part of the trail allows probabilistic influ- 
ence to flow or not (according to the four basic cases 
described above). 


Definition 31.2 Active Trail 

Let X_1, ..., X_n be a trail in a Bayesian network G,
and E be a subset of observed variables in G. The trail
X_1, ..., X_n is active given E if:


• Whenever a v-structure X_{i-1} → X_i ← X_{i+1} does occur,
X_i or one of its descendants belongs to E;
• No other node along the trail belongs to E.


Of course, by definition, if X_1 ∈ E or X_n ∈ E the trail is
not active. Examples of active/not active trails from the
Bayesian network represented in Fig. 31.1 are: the trail
X_1, X_2, X_7, X_6, X_4 is active given the set E = {X_3, X_5},
while it is not active whenever either X_2 or X_7 or X_6 belongs
to E; on the other hand, the trail X_1, X_2, X_7, X_3 is
active if X_2 ∉ E and either X_7 or X_5 or X_6 or X_4 belongs
to E.

The Bayesian network represented in Fig. 31.1
does not allow more than one trail between any pair
of nodes. In general, however, two nodes may
have several trails connecting them and one node
can influence the other one as long as there exists
at least one active trail between them. This intuition
is captured by the definition of the concept of
d-separation.


Definition 31.3 d-Separation 

Let X, Y, Z be nonintersecting sets of nodes of
a Bayesian network. X and Y are d-separated given Z
if there is no active trail between any node X ∈ X and
Y ∈ Y given Z.


The d-separation test can be used to precisely char- 
acterize the independence relationships which hold for 
probabilistic distributions that factorize according to the 
given Bayesian network. 
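
Definitions 31.2 and 31.3 translate almost literally into code. The sketch below is a brute-force check that enumerates every trail between two nodes, which is only practical for small graphs; it is not an efficient d-separation algorithm. The edge list encodes the structure of Fig. 31.1 as implied by the factorization given above, and the function names are illustrative.

```python
# Parents of each node X_i in Fig. 31.1 (nodes are identified by their index).
PARENTS = {1: [], 3: [], 2: [1], 7: [2, 3], 5: [7], 6: [7], 4: [6]}

def children(node):
    return [c for c, ps in PARENTS.items() if node in ps]

def descendants(node):
    found, stack = set(), [node]
    while stack:
        for c in children(stack.pop()):
            if c not in found:
                found.add(c)
                stack.append(c)
    return found

def is_active(trail, evidence):
    """Definition 31.2: is the trail active given the observed set?"""
    evidence = set(evidence)
    if trail[0] in evidence or trail[-1] in evidence:
        return False
    for prev, node, nxt in zip(trail, trail[1:], trail[2:]):
        if prev in PARENTS[node] and nxt in PARENTS[node]:
            # v-structure: the node or one of its descendants must be observed
            if not (({node} | descendants(node)) & evidence):
                return False
        elif node in evidence:
            return False
    return True

def trails(start, end, path=None):
    """Enumerate all simple trails, ignoring edge direction."""
    path = [start] if path is None else path
    if start == end:
        yield list(path)
        return
    for nb in PARENTS[start] + children(start):
        if nb not in path:
            yield from trails(nb, end, path + [nb])

def d_separated(xs, ys, zs):
    """Definition 31.3: no active trail between any x in xs and y in ys given zs."""
    return not any(is_active(t, zs) for x in xs for y in ys for t in trails(x, y))
```

For instance, d_separated([1], [3], []) holds because the only trail passes through the unobserved v-structure at X_7, while observing X_4, a descendant of X_7, activates it.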
In the following, we introduce another class of
graphical models, i.e., Markov networks, which are
described by undirected graphs.


31.2.2 Markov Networks 


Directed edges in Bayesian networks are suited to de- 
scribe causal relationships between random variables. 
In many cases, however, the probabilistic interaction 
between two variables is not directional. In these cases, 
it is natural to consider undirected graphs, i. e., Markov 
networks. An undirected edge between variables X 
and Y represents a probabilistic constraint between the 
two variables. On the other hand, if X and Y are not 
connected, then we can state a conditional indepen- 
dence assertion involving them if and only if there are 
no active trails connecting them in the graph. Note 
that, since edges are now undirected, a trail is not ac- 
tive if and only if any of the variables in the trail 
is observed. This leads us to discuss which kind of 
joint distribution factorization a Markov network does 
represent. 

If we go back to the concept of active trail, it is clear 
that if we consider a subset S of fully connected nodes 
in the undirected graph, i.e., nodes in S are connected 
to each other, then any X, Y € S will be connected by so 
many trails involving nodes in S \ {X, Y} that it is wise 
to consider a single factor φ_S involving all nodes in S.
Technically, S is called a clique, and we are actually in- 
terested in maximal cliques, i.e., cliques which cannot 
be extended in size by considering another node of the 
graph. For example, the maximal cliques of the Markov 
network given in Fig. 31.2 are 


\[
c_1 = \{X_1, X_3, X_5\} , \quad c_2 = \{X_1, X_2\} , \quad
c_3 = \{X_2, X_4\} , \quad c_4 = \{X_3, X_4\} .
\]


Note that, while {X_1, X_5} is a clique, it is not maximal
since we can add X_3 obtaining a larger clique.

A different factor can be associated with each maximal
clique c_i. By using a global normalization constant
for the joint distribution factorization, a factor associated
with a clique c_i can be modeled by a potential function
φ_{c_i}(·), i.e., any nonnegative function (see Fig. 31.2
for an example involving Boolean variables). Thus, the
factorization of the joint distribution for the example in Fig. 31.2
is


\[
P(X_1, X_2, X_3, X_4, X_5) = \frac{1}{Z}\, \phi_{c_1}(X_1, X_3, X_5)\, \phi_{c_2}(X_1, X_2)\,
                             \phi_{c_3}(X_2, X_4)\, \phi_{c_4}(X_3, X_4) ,
\]


where the normalization constant 


\[
Z = \sum_{\forall i,\; x_i \in \mathrm{dom}(X_i)} \phi_{c_1}(X_1, X_3, X_5)\, \phi_{c_2}(X_1, X_2)\,
    \phi_{c_3}(X_2, X_4)\, \phi_{c_4}(X_3, X_4)
\]


is called the partition function. If with x we denote
an assignment of values to the variables X_1, ..., X_n
and with x_{c_i} the corresponding assignments associated
with variables in the clique c_i, the general formulas for
a Markov network are

\[
P(X_1, \dots, X_n) = \frac{1}{Z} \prod_{i} \phi_{c_i}(x_{c_i}) ,
\]

where

\[
Z = \sum_{x} \prod_{i} \phi_{c_i}(x_{c_i}) .
\]


If the potential functions are restricted to be strictly pos- 
itive, then it is possible to find a precise correspondence 
between factorization and conditional independence. 
In fact, if we consider the set of all possible dis- 
tributions defined over variables of a given Markov 
network, then the set of such distributions that are con- 
sistent with the conditional independence statements 
that can be derived by using the adapted concept of 
active trails and d-separation coincides with the set 
of distributions that can be expressed as a factoriza- 
tion of the form given above with respect to maximal 
cliques of the network (Hammersley—Clifford theo- 
rem). 
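
The correspondence between factorization and conditional independence can be observed numerically on the network of Fig. 31.2. The sketch below builds the joint distribution by brute force from strictly positive potentials whose values are drawn at random (a hypothetical choice, since the chapter does not give numerical potentials) and exposes a helper for computing marginals; the names are illustrative.

```python
from itertools import product
import random

# Maximal cliques of the Markov network in Fig. 31.2 (variables by index).
CLIQUES = [(1, 3, 5), (1, 2), (2, 4), (3, 4)]

def random_potentials(seed=0):
    """Hypothetical strictly positive potential tables over Boolean variables."""
    rng = random.Random(seed)
    return {c: {vals: rng.uniform(0.5, 2.0)
                for vals in product([0, 1], repeat=len(c))}
            for c in CLIQUES}

def joint_table(potentials):
    """Return (Z, joint), with joint keyed by (x1, x2, x3, x4, x5)."""
    weights = {}
    for vals in product([0, 1], repeat=5):
        x = dict(zip(range(1, 6), vals))
        w = 1.0
        for c, table in potentials.items():
            w *= table[tuple(x[v] for v in c)]
        weights[vals] = w
    z = sum(weights.values())  # partition function
    return z, {vals: w / z for vals, w in weights.items()}

def marginal(joint, variables):
    """Marginal over the listed variable indices (1-based)."""
    out = {}
    for assignment, p in joint.items():
        key = tuple(assignment[v - 1] for v in variables)
        out[key] = out.get(key, 0.0) + p
    return out
```

Every trail from X_5 to X_2 passes through X_1 or X_3, so observing both blocks all of them; accordingly, P(X_5, X_2|X_1, X_3) = P(X_5|X_1, X_3) P(X_2|X_1, X_3) holds for any choice of positive potentials, which can be verified from the marginals.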


Fig. 31.2 An example of Markov network involving five variables.
Maximal cliques and corresponding potential functions are
highlighted. An example of potential function is given for clique
{X_2, X_4}, where we have assumed that X_2 and X_4 are Boolean variables

For practical reasons, it is convenient to express
a strictly positive potential function as a Boltzmann
distribution, i.e.,

\[
\phi_{c_i}(x_{c_i}) = e^{-E(x_{c_i})} ,
\]

where E(x_{c_i}) is called an energy function. Since the
joint distribution is the product of potentials, the total 
energy is obtained by adding the energy functions of 
each of the maximal cliques. Energy functions are very 
useful since, in the absence of a specific probabilistic in- 
terpretation for the potential functions, assignments of 
values that have high probability can be given low en- 
ergies, while less probable assignments will correspond 
to high energies. 

Let us give an example of application of Markov
networks: image de-noising. The task is to remove noise
from a binary image Y where the pixels Y_i are -1 or
+1. Each observed pixel Y_i is obtained from a noise-free
image X with pixels X_i where, with some small probability,
the sign of the pixel is flipped. Since neighboring
pixels in the noise-free image are strongly correlated,
as well as the two variables Y_i and X_i, due to the small
flipping probability, we can use a Markov network like
the one depicted in Fig. 31.3 to capture this knowledge.
The total energy function encoding such prior knowledge
would be

\[
E(X, Y) = -\beta \sum_{X_i, X_j \in X} X_i X_j - \eta \sum_{X_i \in X,\; Y_i \in Y} X_i Y_i ,
\]

where all the maximal cliques are considered and pairs
of pixels with the same sign get lower energy
values. Since we are interested in removing noise from
the observed pixels Y_i, we add a bias toward pixel values
that have one particular sign, by summing a term
h X_i to the energy function for each pixel in the
noise-free image

\[
E(X, Y) = h \sum_{X_i \in X} X_i - \beta \sum_{X_i, X_j \in X} X_i X_j
          - \eta \sum_{X_i \in X,\; Y_i \in Y} X_i Y_i .
\]

Note that this operation is legal since it corresponds to
multiplying the potential functions, which are arbitrary
nonnegative functions, by a nonnegative function.
The factorized joint distribution over Y and X is
then defined as

\[
P(X, Y) = \frac{1}{Z}\, e^{-E(X, Y)} .
\]

Fig. 31.3 A Markov network for image de-noising. Y_i is
the binary variable representing the state of pixel i in the
noisy observed image, while X_i refers to the noise-free image


Probabilistic inference can now be performed by clamping
the value of Y to the observed image, which implicitly
corresponds to a conditional distribution P(X|Y)
over noise-free images, and by computing the assignment
to X that minimizes the total energy of the Markov
model, i.e., the assignment of values to pixels of X
with highest probability given the observed image Y.
The resulting assignment of values to X will return the 
(presumed) noise-free version of Y. 
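
The text does not say how the energy is minimized. One simple possibility, used here purely as an illustration, is iterated conditional modes (ICM): each pixel in turn is set to the sign that lowers the energy while all other pixels are held fixed. ICM only reaches a local minimum, so the sketch below is not guaranteed to find the most probable assignment; the 4-neighbour grid and the default values of h, beta, and eta are assumptions.

```python
def total_energy(x, y, h=0.0, beta=1.0, eta=2.1):
    """E(X, Y) on a 4-neighbour grid; each neighbouring pair is counted once."""
    rows, cols = len(x), len(x[0])
    e = 0.0
    for i in range(rows):
        for j in range(cols):
            e += h * x[i][j] - eta * x[i][j] * y[i][j]
            if i + 1 < rows:
                e -= beta * x[i][j] * x[i + 1][j]
            if j + 1 < cols:
                e -= beta * x[i][j] * x[i][j + 1]
    return e

def icm_denoise(y, h=0.0, beta=1.0, eta=2.1, sweeps=10):
    """Greedy coordinate-wise energy minimization, starting from X = Y."""
    rows, cols = len(y), len(y[0])
    x = [row[:] for row in y]
    for _ in range(sweeps):
        changed = False
        for i in range(rows):
            for j in range(cols):
                nb = 0
                if i > 0:
                    nb += x[i - 1][j]
                if i + 1 < rows:
                    nb += x[i + 1][j]
                if j > 0:
                    nb += x[i][j - 1]
                if j + 1 < cols:
                    nb += x[i][j + 1]
                # all energy terms containing x[i][j] equal x[i][j] * coef
                coef = h - beta * nb - eta * y[i][j]
                new = -1 if coef > 0 else (1 if coef < 0 else x[i][j])
                if new != x[i][j]:
                    x[i][j] = new
                    changed = True
        if not changed:
            break
    return x
```

Because every update can only lower the energy or leave it unchanged, the energy of the returned image never exceeds that of the starting point X = Y.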

In the following, we briefly present different ap- 
proaches to perform probabilistic inference in Bayesian 
and Markov networks. 


31.2.3 Inference 


Performing probabilistic inference in a graphical model 
over a set of random variables X means being able 
to answer any probabilistic query involving X. Since 
a graphical model, either a Bayesian or a Markov net- 
work, describes a factorization of the joint distribution, 
any probabilistic query can be answered, so the problem
reduces to finding efficient procedures to perform inference.
In the following, we report some of the most
typical forms of queries:


• Conditional: In this case, we are interested in computing
P(Y|E = e), where Y, E ⊆ X, with Y ∩ E = ∅,
where Y are the query variables and E =
{E_1, ..., E_k} are the evidence variables for which
specific values e = {e_1, ..., e_k} have been observed.
• Most probable assignment: Given evidence E = e,
we are interested in computing the most likely assignment
y* to Y ⊆ X \ E. There are two main
variants for this kind of query: most probable explanation
(MPE) and maximum a posteriori (MAP).
An MPE query must solve the problem

\[
y^* = \arg\max_{y} P(Y = y, E = e) ,
\]

where Y = X \ E, while a MAP query must solve
the problem

\[
y^* = \arg\max_{y} \sum_{z} P(Y = y, Z = z \mid E = e) ,
\]



where Z=X\E\Y. 
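
MPE and MAP queries can give different answers for the same variable, because MAP first sums out the variables that are not of interest. The sketch below illustrates this on a hypothetical distribution over two Boolean variables (Y, Z), assumed to be already conditioned on the evidence; the numbers and function names are illustrative.

```python
def mpe(joint):
    """Most probable joint assignment to all non-evidence variables."""
    return max(joint, key=joint.get)

def map_query(joint, query_idx):
    """Most probable assignment to the query variables after summing out the rest."""
    marg = {}
    for assignment, p in joint.items():
        key = tuple(assignment[i] for i in query_idx)
        marg[key] = marg.get(key, 0.0) + p
    return max(marg, key=marg.get)

# Hypothetical P(Y, Z | E = e), keyed by (y, z).
joint = {(0, 0): 0.35, (0, 1): 0.05, (1, 0): 0.30, (1, 1): 0.30}
```

Here the MPE assignment is (y, z) = (0, 0), with probability 0.35, whereas the MAP assignment for Y alone is y = 1, since P(Y = 1) = 0.6 exceeds P(Y = 0) = 0.4.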


From the point of view of inference, both directed 
and undirected networks can be treated in the same way. 
In fact, directed networks can be converted to undi- 
rected networks. This is done by observing that factors 
in directed networks can be understood as factors cor- 
responding to cliques in an undirected graph obtained 
by mutually connecting all the parents of each node by 
new undirected edges and by dropping direction from 
the original directed edges. This procedure is known as 
moralization and the resulting undirected graph is the 
moral graph. By this means, all the variables involved 
in factors of the directed graph (e.g., CPTs) will be 
contained in corresponding cliques of the moral graph. 
Thus, we can focus on undirected graphs. 

From a computational point of view, in the worst 
case, probabilistic inference is difficult: every type of 
probabilistic inference in graphical models is NP-hard 
or harder. Specifically, the complexity of inference is re- 
lated to a topological property of the graphical network 
called treewidth. Approximate inference methods have 
been devised to deal with such computational complexity.
Unfortunately, approximate inference also turns out to
be hard in the worst case. Nevertheless, if the treewidth
of the graphical network is not too large (e.g., in poly- 
trees), exact inference can be performed in a reasonable 
amount of time. Moreover, in many practical cases, ap- 
proximate inference is efficient and adequate. 

There are three major approaches to perform in- 
ference: exact algorithms, sampling algorithms, and 
variational algorithms. The first tries to compute the
exact probabilities while avoiding repeated computa- 
tions. The second approach aims to efficiently approx- 
imate probabilities by sampling, in a smart way, the 
universe of events. Finally, the third approach allows 
us to treat both exact and approximate inference within 
the same conceptual framework. In the following, we 
briefly sketch the main ideas underpinning these ap- 
proaches. 


Exact Algorithms 

Let us illustrate one of the basic ideas of exact algo- 
rithms, i. e., variable elimination, by using the Markov 
network shown in Fig. 31.2, where we assume all vari- 
ables to be Boolean. Suppose we are interested in 
computing the marginal probability P(X_1). We can get
it by summing the factorized joint distribution over the 
remaining variables 


\[
P(X_1) = \frac{1}{Z} \sum_{X_2, X_3, X_4, X_5} \phi(X_1, X_3, X_5)\, \phi(X_1, X_2)\,
         \phi(X_2, X_4)\, \phi(X_3, X_4) .
\]


Naïve computation of the above equation would require
O(2^5) operations, since each summand involves five
Boolean variables. However, we can rearrange the summands
in a smarter way


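\[
\begin{aligned}
P(X_1) &= \frac{1}{Z} \sum_{X_2} \phi(X_1, X_2) \sum_{X_4} \phi(X_2, X_4) \sum_{X_3} \phi(X_3, X_4)
          \sum_{X_5} \phi(X_1, X_3, X_5) \\
       &= \frac{1}{Z} \sum_{X_2} \phi(X_1, X_2) \sum_{X_4} \phi(X_2, X_4)
          \sum_{X_3} \phi(X_3, X_4)\, m_5(X_1, X_3) \\
       &= \frac{1}{Z} \sum_{X_2} \phi(X_1, X_2) \sum_{X_4} \phi(X_2, X_4)\, m_3(X_1, X_4) \\
       &= \frac{1}{Z} \sum_{X_2} \phi(X_1, X_2)\, m_4(X_1, X_2) \\
       &= \frac{1}{Z}\, m_2(X_1) ,
\end{aligned}
\]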


where the m_i terms are the intermediate factors obtained
by summation on variable X_i. Note that Z can be computed
by summing m_2(X_1) over X_1. Moreover, the total
computational complexity reduces to O(2^3) since no
more than three variables occur together in any summand.
In general, the maximal number of variables that
occur in any summand is determined by the elimina- 
tion order. Since many different elimination orders may 
be used, the lowest complexity is obtained by the order 
that minimizes this maximal number, which is related 
to the treewidth of the graph. Unfortunately, finding the 
optimal elimination order is NP-hard. 
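
The elimination steps above can be reproduced with a few generic operations on factor tables. The sketch below is a minimal illustration for Boolean variables, in which a factor is a pair (scope, table); the representation, the function names, and the brute-force reference are assumptions made for this example, and no potential values from the chapter are used.

```python
from itertools import product

def multiply(f, g):
    """Product of two factors; the scope is the union of the two scopes."""
    (fv, ft), (gv, gt) = f, g
    scope = tuple(dict.fromkeys(fv + gv))
    table = {}
    for vals in product([0, 1], repeat=len(scope)):
        a = dict(zip(scope, vals))
        table[vals] = ft[tuple(a[v] for v in fv)] * gt[tuple(a[v] for v in gv)]
    return scope, table

def sum_out(f, var):
    """Eliminate var from factor f by summation, producing a message m_var."""
    fv, ft = f
    table = {}
    for vals, p in ft.items():
        key = tuple(val for v, val in zip(fv, vals) if v != var)
        table[key] = table.get(key, 0.0) + p
    return tuple(v for v in fv if v != var), table

def eliminate(factors, order):
    """Variable elimination; returns the normalized marginal of what remains."""
    factors = list(factors)
    for var in order:
        involved = [f for f in factors if var in f[0]]
        if not involved:
            continue
        factors = [f for f in factors if var not in f[0]]
        prod = involved[0]
        for f in involved[1:]:
            prod = multiply(prod, f)
        factors.append(sum_out(prod, var))
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    z = sum(result[1].values())
    return result[0], {k: v / z for k, v in result[1].items()}

def brute_force_marginal(factors, target, all_vars):
    """Reference: sum the full unnormalized joint over every assignment."""
    weights = {}
    for vals in product([0, 1], repeat=len(all_vars)):
        a = dict(zip(all_vars, vals))
        w = 1.0
        for fv, ft in factors:
            w *= ft[tuple(a[v] for v in fv)]
        weights[(a[target],)] = weights.get((a[target],), 0.0) + w
    z = sum(weights.values())
    return {k: v / z for k, v in weights.items()}
```

Calling eliminate with the four clique potentials of Fig. 31.2 and the order [5, 3, 4, 2] produces the messages m_5, m_3, m_4, and m_2 in the same sequence as the derivation, and the largest intermediate table has three variables in its scope.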

Fig. 31.4 Example of cluster graph, where the direction of the flow of computation is shown under each edge, while the
scope of the computed factor transmitted to the other node after variable elimination is shown over each edge

One positive aspect of the elimination approach
is that it also works for continuous variables since
it is only based on the topology of the graph. However,
the elimination procedure returns only a single
marginal probability, while it is often of interest to
compute more than one marginal probability. Luckily,
we can generalize the idea to efficiently compute all
the single marginals. Here we give some hints on
how to do it. Consider the sequence of intermediate
factors generated in the example above. They can be
indexed by the variables in their scope, i.e.,
ψ_{1,3,5} = φ(X_1, X_3, X_5), ψ_{1,3,4} = φ(X_3, X_4) m_5(X_1, X_3),
ψ_{1,2,4} = φ(X_2, X_4) m_3(X_1, X_4), ψ_{1,2} = φ(X_1, X_2) m_4(X_1, X_2).
Graphically, we can represent them via a cluster
graph, where each node is associated with a subset of
variables (i.e., the scope of intermediate factors) and
the undirected edges support the flow of computation
of the elimination process. In our example, the cluster
graph is shown in Fig. 31.4, where we have shown the
direction of the flow of computation under each edge,
and the scope of the computed factor transmitted to the
other node after variable elimination over each edge.
The variable X_1 in the rightmost node is underlined to
remark that it is the target of the flow of computation.
In general, since each edge is associated with a variable 
elimination, it is not difficult to realize that the cluster 
graph is in fact a tree (called clique tree or junction 
tree). This structure can also be used for computing 
other marginals. In order to see that, we have to observe 
that the scope of the rightmost node is a subset of the 
scope of the node at its left, so it can be merged with 
this last node; moreover, each initial potential must be 
associated with a node with consistent scope, e.g., 


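[Diagram: clique tree {X_1, X_3, X_5} - {X_1, X_3, X_4} - {X_1, X_2, X_4}, with φ(X_1, X_3, X_5) associated with the first node, φ(X_3, X_4) with the second, and φ(X_1, X_2), φ(X_2, X_4) with the third]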


Now, suppose we want to compute P(X_3) by eliminating
all the other variables. We have to select a node
which contains X_3 in its scope, e.g., the middle node.
The flow of computation should now converge toward
that node, as shown in


{X_1, X_3, X_5} --m_5(X_1, X_3)--> {X_1, X_3, X_4} <--m_2(X_1, X_4)-- {X_1, X_2, X_4}


Any elimination order consistent with the above flow 
will do the work, e.g., we first consider the leftmost 
node and eliminate X5 by transmitting the message 


\[
m_5(X_1, X_3) = \sum_{X_5} \phi(X_1, X_3, X_5)
\]


to the middle node. Then, we do the same for the 
rightmost node, by eliminating X2 and transmitting the 
message 


\[
m_2(X_1, X_4) = \sum_{X_2} \phi(X_1, X_2)\, \phi(X_2, X_4) .
\]


Finally, the middle node can merge the two received 
messages with the local potential obtaining 


\[
\phi(X_3, X_4)\, m_5(X_1, X_3)\, m_2(X_1, X_4) ,
\]


which is an unnormalized version of the joint distribution
P(X_1, X_3, X_4). Marginal P(X_3) can then be
computed by summing out X_1 and X_4 and normalizing
the result. Note that the same flow can be used to compute
P(X_1) and P(X_4): in the first case, the final stage
will sum out X_3 and X_4, while in the second case it will
sum out X_1 and X_3.

In general, all the factors needed by all the nodes to 
compute the marginals of the variables in their scope, 
can be computed by a sum-product message passing 
scheme where, having selected an arbitrary node as 
root, messages are transmitted from the leaves up to the 
root and then back from the root to the leaves. If ev- 
idence is present, restricted potentials (i.e., potentials 
where evidence variables are bound to the observed values)
are used. MPE and MAP queries can be answered
by using a max-sum algorithm, which is a variation
of the sum-product algorithm exploiting a trellis over
all the values the variables can take. The message
passing scheme sketched above can also be implemented
using division, giving rise to the Belief Update
algorithm.


Sampling Algorithms 
The strategy adopted by sampling algorithms to perform
(approximate) inference is to approximate the
joint distribution via estimates computed on a set of
representative instantiations of all, or some of, the variables
of the graphical model. Unlike exact inference,
some techniques are specialized for directed networks.
For example, a simple approach to estimate the joint
probability in a Bayesian network is Forward Sampling.
It starts by considering any topological ordering
of the variables, e.g., for the network in Fig. 31.1 the
order X_1, X_3, X_2, X_7, X_5, X_6, X_4 will do the job. Then
random samples are generated by following the order 
and by picking a value for each variable according to 
its distribution. Note that variables with conditional dis- 
tributions will be considered only when specific values 
for their parents have already been generated, so that the 
conditional probability for those variables is fully spec- 
ified. Once M full samples are generated in this way, the 
probability of a specific event P(E = e) is estimated as 
the fraction of samples where variables in £ take val- 
ues e. If the query is of the form P(Y|£ = e), samples 
which are not consistent with the evidence are rejected 
(rejection sampling) and the remaining samples used 
to estimate the conditional distribution on variables Y. 
With this approach, however, a large amount of gener- 
ated samples are discarded. 
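
Forward sampling with rejection can be sketched as follows for the structure of Fig. 31.1. The CPD values below are hypothetical, since the tables of the figure are not reproduced in the text, and the function names are illustrative.

```python
import random

ORDER = [1, 3, 2, 7, 5, 6, 4]  # a topological ordering of Fig. 31.1
PARENTS = {1: (), 3: (), 2: (1,), 7: (2, 3), 5: (7,), 6: (7,), 4: (6,)}
# Hypothetical CPD tables: P(X_i = 1 | parent values).
CPT = {
    1: {(): 0.6},
    3: {(): 0.3},
    2: {(0,): 0.2, (1,): 0.9},
    7: {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.95},
    5: {(0,): 0.3, (1,): 0.8},
    6: {(0,): 0.1, (1,): 0.7},
    4: {(0,): 0.2, (1,): 0.6},
}

def forward_sample(rng):
    """Sample every variable after its parents, following ORDER."""
    sample = {}
    for v in ORDER:
        p_one = CPT[v][tuple(sample[p] for p in PARENTS[v])]
        sample[v] = 1 if rng.random() < p_one else 0
    return sample

def estimate(event, evidence, n, seed=0):
    """Estimate P(event | evidence); samples inconsistent with evidence are rejected.

    Returns the estimate and the number of samples that were kept.
    """
    rng = random.Random(seed)
    kept = hits = 0
    for _ in range(n):
        s = forward_sample(rng)
        if all(s[v] == val for v, val in evidence.items()):
            kept += 1
            if all(s[v] == val for v, val in event.items()):
                hits += 1
    return hits / kept, kept
```

With these tables the exact values are P(X_2 = 1) = 0.6 · 0.9 + 0.4 · 0.2 = 0.62 and P(X_1 = 1|X_2 = 1) = 0.54 / 0.62; the second estimate uses only the samples in which X_2 = 1, which shows how rejection wastes part of the generated samples.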

An improvement on this aspect is given by the 
likelihood weighting algorithm, which is based on the 
observation that evidence variables can be forced to as- 
sume only the observed values in a sample as long as the 
sample is weighted by the likelihood of the evidence. 
This means that a weight is associated with each sample
and the weight is given by the product of all the conditional
probabilities corresponding to the observed values
for the evidence variables, i.e.,

\[
w_{sample} = \prod_{E_i \in E} P(E_i = e_i \mid pa(E_i)) .
\]


Estimates are then computed considering weighted 
samples. Likelihood weighting turns out to be a spe- 
cial case of a more general approach called importance 
sampling which aims at estimating the expectation of 
a function relative to some distribution. 
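
A minimal sketch of likelihood weighting on the structure of Fig. 31.1 is given below. As before, the CPD values are hypothetical and the function names are illustrative; evidence variables are clamped rather than sampled, and each sample carries the weight defined above.

```python
import random

ORDER = [1, 3, 2, 7, 5, 6, 4]  # a topological ordering of Fig. 31.1
PARENTS = {1: (), 3: (), 2: (1,), 7: (2, 3), 5: (7,), 6: (7,), 4: (6,)}
# Hypothetical CPD tables: P(X_i = 1 | parent values).
CPT = {
    1: {(): 0.6},
    3: {(): 0.3},
    2: {(0,): 0.2, (1,): 0.9},
    7: {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.95},
    5: {(0,): 0.3, (1,): 0.8},
    6: {(0,): 0.1, (1,): 0.7},
    4: {(0,): 0.2, (1,): 0.6},
}

def prob(v, value, sample):
    """P(X_v = value | parents), read from the CPD table."""
    p_one = CPT[v][tuple(sample[p] for p in PARENTS[v])]
    return p_one if value == 1 else 1.0 - p_one

def weighted_sample(evidence, rng):
    """Clamp evidence variables and accumulate their conditional probabilities."""
    sample, weight = {}, 1.0
    for v in ORDER:
        if v in evidence:
            sample[v] = evidence[v]
            weight *= prob(v, evidence[v], sample)
        else:
            sample[v] = 1 if rng.random() < prob(v, 1, sample) else 0
    return sample, weight

def lw_estimate(event, evidence, n, seed=0):
    """Weighted estimate of P(event | evidence); no sample is discarded."""
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n):
        s, w = weighted_sample(evidence, rng)
        den += w
        if all(s[v] == val for v, val in event.items()):
            num += w
    return num / den
```

With these tables, P(X_1 = 1|X_2 = 1) = 0.54 / 0.62 and P(X_2 = 1|X_7 = 1) = 0.4371 / 0.5207, and both are recovered without rejecting any sample.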

Improved sampling methods, which can also be ap- 
plied to Markov networks, are given by Markov chain 
Monte Carlo methods. Unlike the methods described 
so far, these methods generate a sequence of samples, 
in such a way that later samples are generated by dis- 
tributions that provably approximate with increasing 
precision the target posterior probability (i. e., the query 
P(Y|E = e)). 

The simplest method is Gibbs sampling: an initial
assignment of values for the unobserved variables


is generated from an initial distribution; subsequently, 
in turn, each unobserved variable is sampled using 
the posterior probability given the current sample for 
all other variables. This distribution can be computed 
efficiently by using only factors associated with the 
Markov blanket, i.e., the neighbors of the variable to 
be resampled in the Markov network (in Bayesian net- 
works, the Markov blanket of a node is given by the set 
of its parents, its children and the parents of its chil- 
dren). Using the theory of Markov chains (discussed in 
Sect. 31.4.1), it is possible to show that, under some 
assumptions, the sequence of generated distributions 
converges to a stationary distribution, where the frac- 
tion of time in which a specific assignment of values 
to variables (sample) does occur in the sequence is ex- 
actly proportional to the posterior probability of that 
assignment. 
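The Gibbs procedure can be sketched as follows, again on a hypothetical toy network (Rain and Sprinkler as parents of WetGrass) with invented tables. The burn-in period and the function names are assumptions of the example, not prescriptions of the text. Each unobserved variable is resampled from its conditional given the current value of the other variable and the evidence, which only requires the factors in its Markov blanket.

```python
import random

# Hypothetical toy network: Rain -> WetGrass <- Sprinkler (all Boolean).
P_RAIN = 0.2
P_SPRINKLER = 0.3
# P(WetGrass = 1 | rain, sprinkler), keyed by (rain, sprinkler).
P_WET = {(0, 0): 0.01, (0, 1): 0.8, (1, 0): 0.9, (1, 1): 0.99}

def gibbs_estimate(num_sweeps, burn_in=500, seed=0):
    """Estimate P(Rain=1 | WetGrass=1) as the fraction of time Rain=1 in the chain."""
    rng = random.Random(seed)
    rain, sprinkler = rng.randint(0, 1), rng.randint(0, 1)  # initial assignment
    rain_count = 0
    for sweep in range(burn_in + num_sweeps):
        # Resample Rain given Sprinkler and the evidence WetGrass=1.
        w1 = P_RAIN * P_WET[(1, sprinkler)]
        w0 = (1 - P_RAIN) * P_WET[(0, sprinkler)]
        rain = int(rng.random() < w1 / (w0 + w1))
        # Resample Sprinkler given Rain and the evidence WetGrass=1.
        w1 = P_SPRINKLER * P_WET[(rain, 1)]
        w0 = (1 - P_SPRINKLER) * P_WET[(rain, 0)]
        sprinkler = int(rng.random() < w1 / (w0 + w1))
        if sweep >= burn_in:  # discard the early, initialization-dependent part
            rain_count += rain
    return rain_count / num_sweeps

gibbs_posterior = gibbs_estimate(20000)
```

Unlike rejection sampling, no sample is discarded because of the evidence, but consecutive samples are correlated.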

A drawback of Gibbs sampling is that it uses only 
local moves (i. e., resampling of a single variable), lead- 
ing to very slow convergence for assignments with 
low probability. More effective methods, based on the
Metropolis-Hastings approach, allow for a broader
range of moves. Further, more advanced approaches allow
us to consider partial assignments in conjunction
with a closed-form distribution for unassigned vari- 
ables. Others use deterministic methods to explicitly 
search for high-probability assignments to approximate 
the joint distribution. 


Variational Algorithms 
Probabilistic inference can be formulated as a constrained
optimization problem. This formulation allows us both to
recover exact inference algorithms, such as the ones
we have briefly discussed above, and to design approximate
inference algorithms, by simplifying the objective function
to optimize, the admissible region for optimization, or both.
The possibility of devising theoretically founded approximation
algorithms is particularly appealing in cases where the joint
distribution is characterized by a factorization with large
treewidth. Research in this area has recently been very
active, yielding several interesting results. Here we
do not have the space for a proper technical treatment,
so we give only a brief introduction to the main
ideas.

Variational approaches are based on the idea of 
approximating an intractable probabilistic distribu- 
tion with a simpler one, which allows for inference. 
This simpler distribution is selected from a family of 
tractable distributions, as the distribution that is the best 
approximation to the desired one. Can we define a measure
of the quality of the approximation that can be
used for the minimization process? A good measure is
the KL-divergence introduced in (31.2). Let us denote
a distribution that factorizes according to the graphical
model $\mathcal{G}$ as

$$ P_{\mathcal{G}}(X) = \frac{1}{Z} \prod_{\forall i, c_i} \phi_{c_i} \tag{31.9} $$

and let $Q(X)$ be a member of the family of tractable distributions
we use to approximate $P_{\mathcal{G}}(X)$. Then, a nice feature of the
KL-divergence is that it allows us to efficiently solve
the optimization problem

$$ \arg\min_{Q} D_{\mathrm{KL}}\big(Q(X) \,\|\, P_{\mathcal{G}}(X)\big) $$

without performing inference in $P_{\mathcal{G}}(X)$. In
fact, using the factorization of $P_{\mathcal{G}}(X)$ in (31.9), it is not
difficult to show that

$$ D_{\mathrm{KL}}\big(Q(X) \,\|\, P_{\mathcal{G}}(X)\big) = \log Z - \sum_{\forall i, c_i} E_Q[\log \phi_{c_i}] + E_Q[\log Q(X)] , \tag{31.10} $$

and, since $\log Z$ does not depend on $Q(X)$, minimizing
$D_{\mathrm{KL}}(Q(X) \,\|\, P_{\mathcal{G}}(X))$ is equivalent to maximizing the
energy functional

$$ \sum_{\forall i, c_i} E_Q[\log \phi_{c_i}] - E_Q[\log Q(X)] . $$

Following from the definition in (31.1), $H_Q(X) = -E_Q[\log Q(X)]$
is the entropy of $Q$, while the first term of the energy
functional is referred to as the energy term.
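The decomposition in (31.10) can be checked numerically on a very small model. The sketch below assumes a hypothetical model over two binary variables with two invented factors and a fully factorized Q; it is only meant to illustrate the identity, since the enumeration used here is exactly what variational methods avoid on large models.

```python
import itertools
import math

# Hypothetical factors over two binary variables (values chosen arbitrarily).
def phi_a(x1, x2):
    return [1.0, 3.0][x1]                      # factor over the clique {X1}

def phi_b(x1, x2):
    return [[2.0, 1.0], [1.0, 4.0]][x1][x2]    # factor over the clique {X1, X2}

FACTORS = [phi_a, phi_b]
STATES = list(itertools.product((0, 1), repeat=2))

def unnormalized(x):
    return math.prod(f(*x) for f in FACTORS)

Z = sum(unnormalized(x) for x in STATES)       # normalization constant
P = {x: unnormalized(x) / Z for x in STATES}

# A fully factorized (tractable) approximation Q(X) = Q1(X1) Q2(X2).
q1, q2 = 0.7, 0.4                              # Q(X1 = 1), Q(X2 = 1)
Q = {(a, b): (q1 if a else 1 - q1) * (q2 if b else 1 - q2) for a, b in STATES}

# Left-hand side of (31.10): KL-divergence computed from its definition.
kl_direct = sum(Q[x] * math.log(Q[x] / P[x]) for x in STATES)

# Right-hand side of (31.10): log Z - energy term - entropy of Q.
energy = sum(Q[x] * math.log(f(*x)) for f in FACTORS for x in STATES)
entropy = -sum(Q[x] * math.log(Q[x]) for x in STATES)
kl_from_functional = math.log(Z) - energy - entropy
```

Since `log Z` is constant in Q, increasing `energy + entropy` over the choice of `q1` and `q2` decreases the divergence by the same amount.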

Different variational methods correspond to different
strategies for optimizing the energy functional. The
name variational is used since all of them adopt the
general strategy of reformulating the optimization problem
by introducing new variational parameters to be
used for optimization. In particular, each specific choice
of values for the variational parameters expresses one
member, i.e., $Q(X)$, of the family of tractable distributions
we want to use. The optimization procedure
searches the space of variational parameters to find the
$Q^{*}(X)$ that best approximates $P_{\mathcal{G}}(X)$. It is important to
understand that the family of tractable distributions
actually corresponds to a set of constraints, involving
the variational parameters, that must be satisfied while
maximizing the energy functional. By using Lagrange
multipliers these constraints can be merged
with the energy functional, giving rise to a Lagrangian
function that must be maximized. By taking the partial
derivatives with respect to the variational parameters
and the Lagrange multipliers, the solution to the
optimization problem can be characterized by a set of
fixed-point equations. These equations can then be used
to straightforwardly devise an iterative solution.

Different variational methods work with different
types of approximations. There are two main sources
of approximation, which can be used individually or in
conjunction. One source is the energy functional, which
can be substituted by a functional that is easier to manipulate
while preserving a good degree of approximation. Another
source of approximation is the set of constraints, i.e.,
the definition of the family of tractable distributions,
whose members may not be fully consistent with the
factorization represented by the graphical model (in this case,
they are denoted as pseudo-distributions).

We do not have space here to give more details;
however, it is worth mentioning that while convergence
proofs of several variational methods are available, it
is not so common to find theoretical guarantees on the
approximation error made by the specific method.


Knowledge hidden in the complex relations between
a large number of observable variables can be surfaced
under the assumption that a simpler and unobservable
process exists, which is responsible for generating the
complex behavior of manifest data. Such an unobservable
generative process can be modeled through the use
of latent variables, as opposed to observable variables,
that are not directly measurable, but can be inferred
from observations and can explain the relation between
manifest data. Intuitively, latent variables can be understood
as an attempt to model the unknown physical
process generating the observations or as an abstraction
providing a simplified representation of the manifest
data, e.g., clusters.

Probabilistic models that attempt to explain obser- 
vations in terms of latent variables are called latent 
variable models. In probabilistic terms, the simplification
introduced by latent variables results in conditional
independence assumptions, such that (subsets of)
observable variables can be considered conditionally
independent when their hidden explanation, i.e., the latent
variable assignment, is given. Similarly to observed
variables, latent variables can be discrete or continuous:
their nature, together with that of the observations,
determines different types of probabilistic models. Nevertheless,
parameter estimation in the different latent
variable models can be achieved through a general iterative
principle, known as expectation-maximization.


31.3.1 Latent Space Representation 


To understand the intuition at the basis of latent space
representation, consider a joint distribution
$P(X) = P(X_1, \dots, X_N)$ defined over $N$ observed random
variables $X_i$. As discussed in Sect. 31.2.1, without
any simplifying assumption, the number of free parameters
of this model grows as $O(2^N)$ for
Boolean variables, which quickly becomes unmanageable
for large $N$. One way to control the number of free
parameters of a model, without making overly simplistic
assumptions (e.g., the $X_i$ being i.i.d.), is to introduce a collection
of latent, or hidden, variables $Z = \{Z_1, \dots, Z_K\}$.
The latent variables are unobserved but can be used
to factorize the joint distribution $P(X)$ while still
capturing (some of) the correlations between the
observed variables $X = \{X_1, \dots, X_N\}$. More formally,
latent variables are such that

$$ P(X) = \int_z P(X \mid Z = z)\, P(Z = z)\, dz , \tag{31.11} $$


which is the general formulation for the likelihood of a latent
variable model. The details of the latent variable
model, and the tractability of the integral in (31.11), are
determined by the form of the conditional distribution
$P(X|Z)$ and by the marginal probability $P(Z)$. A common
approach in latent variable models is to assume
that observed variables become conditionally independent
given the latent variables, that is

$$ P(X) = \int_z \prod_{i=1}^{N} P(X_i \mid Z = z)\, P(Z = z)\, dz . \tag{31.12} $$


A basic assumption for this latent model to be effective
is that the conditional and marginal distributions should
be more tractable than the joint distribution $P(X)$. For
instance, in a simple scenario with discrete observations
and latent variables, this entails that $K \ll N$. Not
surprisingly, the same intuition is applied, in a deter- 
ministic context, for dimensionality reduction (cf. the 
number of projection directions in PCA) and clustering. 
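As a rough worked example of this saving (a sketch under assumptions that are not stated in the text: all observed and latent variables are Boolean, $P(Z)$ is an unrestricted table over the $K$ latent variables, and each $P(X_i \mid Z)$ is an unrestricted conditional table), the number of free parameters for $N = 20$ and $K = 3$ compares as follows:

$$ \text{full joint: } 2^{N} - 1 = 2^{20} - 1 = 1\,048\,575 , $$

$$ \text{latent model: } (2^{K} - 1) + N \cdot 2^{K} = 7 + 20 \cdot 8 = 167 . $$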


Different types of latent variable models are defined 
based on the nature of the latent and observed variables, 
as well as depending on the form of the conditional 
and marginal probabilities. In the following, we discuss 
two general classes of latent variable models with con- 
tinuous and discrete hidden variables, which are factor 
analysis and mixture models, respectively. 


31.3.2 Learning with Latent Variables: 
The Expectation-Maximization 
Algorithm 


Learning, in a probabilistic setting, entails working with 
the model likelihood. In latent variable models, the like- 
lihood in (31.11) might be difficult to treat due to the 
marginalization inside the logarithm, which can po- 
tentially couple all the model parameters. Despite the 
diversity of the models that can be designed, based on 
the general expression in (31.11), there exists a general
principle to estimate their parameters.

The expectation-maximization (EM) algorithm
[31.48] is a general iterative method for the maximization
of the likelihood under latent variables. The key
intuition of the EM algorithm is to define an alternative
objective function where the parameter coupling
introduced by the marginalization of the hidden variables
is removed. The EM algorithm maximizes the
marginal data likelihood $P(X|\theta)$, where $\theta$ are the model
parameters, through a tractable lower bound defined
by introducing a function of the latent variables, i.e.,
$Q(Z)$, into the data likelihood through marginalization.
For notational simplicity, consider the case of discrete
latent variables. For any nonzero distribution $Q(Z)$, it
holds

$$ \begin{aligned}
\mathcal{L}(\theta) &= \log P(X|\theta) = \log \sum_{z} P(X, Z = z|\theta) \\
&= \log \sum_{z} Q(z) \frac{P(X, Z = z|\theta)}{Q(z)} \\
&\geq \sum_{z} Q(z) \log P(X, Z = z|\theta) - \sum_{z} Q(z) \log Q(z) = \mathcal{L}(Q, \theta) ,
\end{aligned} \tag{31.13} $$

where the lower bound $\mathcal{L}(Q, \theta) \leq \mathcal{L}(\theta)$ is obtained by
the application of the Jensen inequality to the concave
log function. The joint distribution $P(X, Z|\theta)$ is
known as the complete data likelihood, where the term
complete refers to the fact that the marginal data likelihood
$P(X|\theta)$ is completed with the observations $z$ for
the latent variables.

The expectation-maximization algorithm defines
an alternating optimization process where the bound
$\mathcal{L}(Q, \theta)$ is maximized with respect to $Q(\cdot)$ and $\theta$. In
general, this is performed by two separate maximization
steps that are repeated until convergence:

• Expectation (E) Step: For $\theta^{(t)}$ fixed, find the distribution
  $Q^{(t+1)}(z)$ that maximizes the bound $\mathcal{L}(Q, \theta^{(t)})$;

• Maximization (M) Step: Given the distribution
  $Q^{(t+1)}(z)$, estimate the model parameters $\theta^{(t+1)}$
  that maximize the bound $\mathcal{L}(Q^{(t+1)}, \theta)$;

where the superscript denotes the estimate at time $t$.
Clearly, the optimal solution for the E-step is attained
when

$$ Q^{(t+1)}(z) = P(Z = z \mid X, \theta^{(t)}) , \tag{31.14} $$

that is, when the lower bound in (31.13) becomes an
equality. In practice, to explicitly evaluate the complete
likelihood in $\mathcal{L}(Q, \theta^{(t)})$, we would need to observe the $z$
assignments. These are unknown, since latent variables
are unobservable. However, given the marginalization
of $z$ in (31.13), we can substitute the unavailable $z$ observations
with their expected values, by considering
them as another random variable. To this end, it suffices
that the E-step computes the expected value of
the complete log-likelihood $\log P(X, Z|\theta)$ with respect
to $Z$. These observations provide the final form of the
classical EM algorithm:


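• E-step: Given the current estimate of the model parameters
  $\theta^{(t)}$, compute

$$ \mathcal{Q}^{(t+1)}(\theta|\theta^{(t)}) = E_{Z|X,\theta^{(t)}}\left[\log P(X, Z|\theta)\right] ; \tag{31.15} $$

• M-step: Find the new estimate of the model parameters

$$ \theta^{(t+1)} = \arg\max_{\theta} \mathcal{Q}^{(t+1)}(\theta|\theta^{(t)}) . \tag{31.16} $$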


In other words, the E-step estimates the value of the
otherwise unobserved latent variables, while the M-step
finds the parameters that maximize the current estimate
of the log-likelihood. In practice, the E-step often reduces
to estimating the expectation of $Z$ from its posterior
$P(Z|X, \theta^{(t)})$, while the M-step uses these values as sufficient
statistics to update the model parameters $\theta^{(t+1)}$.
This alternating optimization is typically iterated until the
log-likelihood does not change much between consecutive
estimates, or until a maximum number of iterations
is reached. Note that the two-step EM optimization process
is prone to local optima: its convergence
can be slow and, often, its solutions tend to depend
on the initialization.
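As an illustration of the two steps and of the stopping rule, the following sketch applies EM to a two-component univariate Gaussian mixture, one of the latent variable models mentioned in this chapter. The data, the initial parameters, the tolerance, and all function names are assumptions of the example.

```python
import math
import random

def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian N(x | mean, var)."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(data, weights, means, variances):
    """Marginal log-likelihood log P(X | theta) of the mixture."""
    return sum(
        math.log(sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in data
    )

def em_step(data, weights, means, variances):
    """One EM iteration: E-step posteriors, then closed-form M-step updates."""
    # E-step: posterior P(Z = k | x, theta) for every observation x.
    posteriors = []
    for x in data:
        joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        posteriors.append([j / total for j in joint])
    # M-step: the posteriors act as soft counts (sufficient statistics).
    new_weights, new_means, new_variances = [], [], []
    for k in range(len(weights)):
        soft_count = sum(p[k] for p in posteriors)
        mean = sum(p[k] * x for p, x in zip(posteriors, data)) / soft_count
        var = sum(p[k] * (x - mean) ** 2 for p, x in zip(posteriors, data)) / soft_count
        new_weights.append(soft_count / len(data))
        new_means.append(mean)
        new_variances.append(var)
    return new_weights, new_means, new_variances

def fit(data, params, max_iters=200, tol=1e-6):
    """Iterate EM until the log-likelihood gain drops below tol or max_iters is hit."""
    history = [log_likelihood(data, *params)]
    for _ in range(max_iters):
        params = em_step(data, *params)
        history.append(log_likelihood(data, *params))
        if history[-1] - history[-2] < tol:
            break
    return params, history

rng = random.Random(0)
data = [rng.gauss(-2.0, 1.0) for _ in range(300)] + [rng.gauss(3.0, 1.0) for _ in range(300)]
params, history = fit(data, ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]))
```

The `history` list records the log-likelihood after each iteration, which should never decrease; a different initialization may lead to a different local optimum.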

The EM algorithm assumes that we can calculate
the expected value of the complete log-likelihood.
However, there are cases in which the required summation
is not computationally feasible (e.g., infinite
summations, or integrals with no closed-form solution):
in these cases, the approximate inference methods
described in Sect. 31.2.3 can be used to define nonexact
EM algorithms. For instance, stochastic versions
of the EM algorithm are obtained by approximating
the infeasible summation using (e.g., Gibbs) sampling
from the posterior distribution $P(Z|X, \theta)$. The classical
EM algorithm is a maximum likelihood (ML) method providing point
estimates of the model parameters $\theta$. The variational
Bayes (VB) [31.6] method has been introduced to obtain
a fully Bayesian solution that returns a posterior
distribution of the parameters $P(\theta)$, instead of their
point estimate. VB is based on an analytical approximation
of the joint posterior of the latent variables and
model parameters that yields a generalization of the
EM alternating optimization, where the maximization at
the M-step is taken over possible distributions $Q(\theta)$, instead
of over $\theta$ itself.


31.3.3 Linear Factor Analysis 


Factor analysis (FA) is an example of a latent variable 
model for continuous hidden and manifest variables. 
In its simplest linear form, it is a classical statistical 
model widely used for generative dimensionality re- 
duction. Similarly to its deterministic counterparts, e.g., 
PCA, it forms a low-dimensional embedding of a set
of observations $\mathcal{D} = (x_1, \dots, x_N)$, where each observation
$x_n$ is a $D$-dimensional vector of reals. FA finds
a lower dimensional probabilistic representation of $\mathcal{D}$,
by assuming that the features of each $x_n$ are independently
generated by $K$ real-valued latent variables
$Z = \{Z_1, \dots, Z_K\}$, with $K < D$ (see the associated graphical
model in Fig. 31.5).

Fig. 31.5 Linear factor analysis: the observed $D$-dimensional
variable $X$ is related to the $K$ latent variables
$Z = \{Z_1, \dots, Z_K\}$ through a linear mapping

The FA model assumes that observations are linked
to the latent vectors through a linear model

$$ x = F z + b + \epsilon , \tag{31.17} $$

where $\epsilon \sim \mathcal{N}(\epsilon|0, \Psi)$ is the Gaussian distributed noise
with zero mean and covariance $\Psi$, $b$ is a bias vector
and $F$ is the factor loading matrix. The latent variables
are the factors and are generally assumed to be
distributed as $Z \sim \mathcal{N}(z|0, I_K) = P(Z)$, where $I_K$ is the
$K$-dimensional identity matrix. Under such Gaussian
assumptions, and given the linear model in (31.17), the
conditional distribution of the observations is

$$ P(X = x|Z = z) = \mathcal{N}(x|Fz + b, \Psi) , \tag{31.18} $$
which, inserted in (31.11), provides the distribution for
the FA marginal likelihood

$$ P(X) = \int P(X|Z)\, P(Z)\, dz = \mathcal{N}(x|b, F F^{T} + \Psi) . \tag{31.19} $$


The form of the noise covariance $\Psi$ determines the type
of FA model: in general, this is chosen as a diagonal
matrix with a vector of values $(\psi_1, \dots, \psi_D)$ on the main
diagonal. When the diagonal elements are all equal to
a single value $\sigma^2 \in \mathbb{R}$, the FA reduces to the special case
of probabilistic PCA [31.49].

Learning of the FA parameters $\theta = (\Psi, F)$ ($b$ is usually
set a priori to the mean of the data) is obtained
by maximum likelihood estimation. The most popular
approach to obtain such estimates is based on solving
an eigen-decomposition problem. Given the nature of
FA as a latent variable model, its parameters $\theta$ can also
be estimated by applying EM to the logarithm of the
likelihood in (31.19). The latter approach is,
however, less used in general, given its slower convergence.
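The generative reading of (31.17) and the Gaussian marginal in (31.19) can be illustrated by sampling. The sketch below uses invented values for the loading matrix, the bias, and the diagonal noise covariance (with D = 3 and K = 2), draws observations from the linear model, and compares their empirical covariance with the covariance of the marginal. It does not perform any learning.

```python
import random

# Hypothetical FA parameters with D = 3 observed and K = 2 latent dimensions.
F = [[1.0, 0.0], [0.5, 1.0], [-1.0, 0.5]]   # D x K factor loading matrix
B = [1.0, 2.0, 3.0]                          # bias vector b
PSI_DIAG = [0.1, 0.2, 0.3]                   # diagonal of the noise covariance Psi
D, K = len(F), len(F[0])

def sample_observation(rng):
    """Draw x = F z + b + eps with z ~ N(0, I_K) and eps ~ N(0, Psi)."""
    z = [rng.gauss(0.0, 1.0) for _ in range(K)]
    return [
        sum(F[d][k] * z[k] for k in range(K)) + B[d] + rng.gauss(0.0, PSI_DIAG[d] ** 0.5)
        for d in range(D)
    ]

def model_covariance():
    """Covariance F F^T + Psi of the Gaussian marginal in (31.19)."""
    return [
        [sum(F[r][k] * F[c][k] for k in range(K)) + (PSI_DIAG[r] if r == c else 0.0)
         for c in range(D)]
        for r in range(D)
    ]

def empirical_moments(samples):
    """Sample mean vector and (biased) sample covariance matrix."""
    n = len(samples)
    mean = [sum(x[d] for x in samples) / n for d in range(D)]
    cov = [
        [sum((x[r] - mean[r]) * (x[c] - mean[c]) for x in samples) / n for c in range(D)]
        for r in range(D)
    ]
    return mean, cov

rng = random.Random(0)
samples = [sample_observation(rng) for _ in range(20000)]
mean, cov = empirical_moments(samples)
target = model_covariance()
max_cov_error = max(abs(cov[r][c] - target[r][c]) for r in range(D) for c in range(D))
```

The off-diagonal entries of `target` come only from the shared factors, which is how a diagonal noise covariance still allows correlated observations.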


31.3.4 Mixture Models 


The term mixture models identifies a large family 
of latent variable models comprising discrete hidden 


variables and generic manifest variables. A mixture 
model assumes that each observation is generated by 
a weighted contribution of a number of simple distri- 
butions, selected by the hidden variables. The simplest 
form of mixture model assumes that an observation 
is independently generated by a single mixture com- 
ponent. Widely popular elements of this family are 
the Gaussian mixture model for continuous observa- 
tions and the mixture of unigrams for multinomial data. 
In the following, we discuss an example of more ar- 
ticulated generative processes comprising observations 
with mixed component memberships. 


Probabilistic Latent Semantic Analysis 
Probabilistic latent semantic analysis (pLSA) [31.50] 
has been introduced to model mixed membership observations,
where a manifest sample is allowed to be generated
by multiple latent variables. Its primary application
is in document analysis, where latent variables are
interpreted as topics to be identified in a collection of 
documents. Intuitively, in the mixture of unigrams, each 
document is assigned to a unique topic and, as a conse- 
quence, all the words in a document are constrained to 
belong to a single topic. The pLSA model relaxes this 
assumption by allowing words in a document to belong 
to different topics, obtaining a multitopic representation 
for the documents in the collection. 

The typical pLSA setting includes a dataset of
multinomial samples, which are the documents
$\mathcal{D} = \{d_1, \dots, d_N\}$. Each document is an $L$-dimensional vector
of word counts, where $L$ is the size of the
reference dictionary. In other words, the $i$th observed
sample is a vector $d_i = (w_{i1}, \dots, w_{iL})$, where $w_{ij}$ is the
number of occurrences of the $j$th word of the vocabulary
in the $i$th document. This data is typically summarized
in a rectangular $L \times N$ integer matrix $n$, such that each
column $n(\cdot, d_i)$ contains the word counts for document $d_i$.
The variables identifying words and documents, i.e., $W_j$
and $D_i$, are observed, in contrast with the set of topics
$Z = \{Z_1, \dots, Z_K\}$, which are the latent variables.
In pLSA, every observation $n(w_j, d_i)$ is associated with
a latent topic $z_k$ by means of the hidden variable $Z_k$.

The fundamental probabilities associated with this
model are $P(D = d_i)$, that is, the document probability,
$P(W = w_j|Z = z_k)$, that is, the probability of word
$w_j$ conditioned on topic $z_k$, and $P(Z = z_k|D = d_i)$, that
is, the conditional probability of topic $z_k$ given document
$d_i$. Given the nature of the manifest and hidden
variables, all probabilities involved in pLSA are multinomials.
The pLSA defines a (quasi) generative model
for the word/document co-occurrences whose generative
process is described by Fig. 31.6, using plate
notation. This is a concise representation for graphical
models involving replications: rectangular plates denote
replication of their content for a number of times given
by the term at the bottom right (e.g., $N$ and $L_d$ for the outer
and inner plates in Fig. 31.6, respectively); each shaded
circular item denotes an observed variable, while empty
circles identify latent variables.

The conditional independence relationships in
Fig. 31.6 allow us to factorize the joint word-document
distribution: by using the parent decomposition rule introduced
in (31.8), we obtain

$$ P(W_j, D_i) = P(D_i)\, P(W_j|D_i) = P(D_i) \sum_{k=1}^{K} P(Z_k|D_i)\, P(W_j|Z_k) , \tag{31.20} $$

which is the specific pLSA form of the general latent variable
factorization in (31.12). The second equality in (31.20)
is given by the marginalization of the latent topics $Z$
and by the conditional independence assumption of the
pLSA model, stating that word $w_j$ and document $d_i$ can
be considered independent given the state of the latent
variable $Z$. In other words, the word distribution
of a document is modeled as a convex combination of
$K$ topic-specific distributions $P(W_j|Z_k)$. Such decomposition
has a well-known characterization in terms of
nonnegative matrix factorization [31.13].

Fig. 31.6 Graphical model for the probabilistic latent semantic
analysis: indices for the random variables $D$, $Z$, and
$W$ are omitted in the plate notation. The term $L_d$ denotes
replication for the $L_d$ words present in the $d$th document

Estimation of the pLSA parameters
$\theta = \{P(W_j|Z_k), P(Z_k|D_i)\}$ is obtained by maximization
of the log-likelihood

$$ \mathcal{L}(\theta) = \log \prod_{i=1}^{N} \prod_{j=1}^{L} P(W_j, D_i)^{n(w_j, d_i)} = \sum_{i=1}^{N} \sum_{j=1}^{L} n(w_j, d_i) \log \left[ P(D_i) \sum_{k=1}^{K} P(Z_k|D_i)\, P(W_j|Z_k) \right] , \tag{31.21} $$

where $P(W_j, D_i)$ has been expanded using the formulation
in (31.20). As with other latent variable models,
this maximization problem can be solved through the
iterative EM algorithm discussed in Sect. 31.3.2. Following
(31.15), the E-step computes the expectation
of the complete likelihood $P(Z, W, D)$ with respect to
the pLSA latent topics, assuming observed documents
and words. It is easy to show that the resulting E-step
computes

$$ P(Z_k|W_j, D_i) = \frac{P(Z_k|D_i)^{(t)}\, P(W_j|Z_k)^{(t)}}{\sum_{k'=1}^{K} P(Z_{k'}|D_i)^{(t)}\, P(W_j|Z_{k'})^{(t)}} , \tag{31.22} $$

that is, the probability of the topic $Z_k$ given word $W_j$
in document $D_i$, estimated using the current values (at
time $t$) of the model parameters
$\theta^{(t)} = \{P(W_j|Z_k)^{(t)}, P(Z_k|D_i)^{(t)}\}$. Note that the decomposition
on the right-hand side of (31.22) has been obtained by
factorization of the posterior $P(Z_k|W_j, D_i)$ using the Bayes
theorem.

The M-step equations (31.16) are obtained by differentiating
the pLSA log-likelihood, extended with appropriate
Lagrange multipliers for normalization, with
respect to the $P(Z_k|D_i)$ and $P(W_j|Z_k)$ parameters. The
resulting update equations are

$$ P(Z_k|D_i)^{(t+1)} = \frac{\sum_{j=1}^{L} n(w_j, d_i)\, P(Z_k|W_j, D_i)}{\sum_{j=1}^{L} n(w_j, d_i)} , \tag{31.23} $$

$$ P(W_j|Z_k)^{(t+1)} = \frac{\sum_{i=1}^{N} n(w_j, d_i)\, P(Z_k|W_j, D_i)}{\sum_{j'=1}^{L} \sum_{i=1}^{N} n(w_{j'}, d_i)\, P(Z_k|W_{j'}, D_i)} . \tag{31.24} $$

The two-step optimization is iterated until a likelihood
convergence criterion is met: often a validation set, or
a tempered version of EM, is used in order to avoid
model overfitting [31.50].
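The E-step (31.22) and the M-step updates (31.23) and (31.24) can be sketched in a few lines. The toy count matrix, the random initialization, the fixed number of iterations, and the function names below are assumptions of the example. For convenience the counts are stored per document, i.e., `counts[i][j]` holds n(w_j, d_i), and the constant P(D_i) term is left out of the monitored log-likelihood.

```python
import math
import random

def normalize(values):
    """Rescale nonnegative values so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]

def plsa_em(counts, num_topics, num_iters=50, seed=0):
    """Fit P(Z_k | D_i) and P(W_j | Z_k) by EM; counts[i][j] = n(w_j, d_i)."""
    rng = random.Random(seed)
    num_docs, vocab = len(counts), len(counts[0])
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(num_topics)]) for _ in range(num_docs)]
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(vocab)]) for _ in range(num_topics)]
    trace = []
    for _ in range(num_iters):
        new_z_d = [[0.0] * num_topics for _ in range(num_docs)]
        new_w_z = [[0.0] * vocab for _ in range(num_topics)]
        loglik = 0.0
        for i in range(num_docs):
            for j in range(vocab):
                if counts[i][j] == 0:
                    continue  # zero counts contribute nothing to the updates
                joint = [p_z_d[i][k] * p_w_z[k][j] for k in range(num_topics)]
                total = sum(joint)
                loglik += counts[i][j] * math.log(total)
                for k in range(num_topics):
                    # E-step (31.22) posterior, weighted by the count n(w_j, d_i).
                    r = counts[i][j] * joint[k] / total
                    new_z_d[i][k] += r   # numerator of (31.23)
                    new_w_z[k][j] += r   # numerator of (31.24)
        # M-step: normalizing the accumulated sums yields (31.23) and (31.24).
        p_z_d = [normalize(row) for row in new_z_d]
        p_w_z = [normalize(row) for row in new_w_z]
        trace.append(loglik)  # log-likelihood of the parameters before this update
    return p_z_d, p_w_z, trace

# Hypothetical corpus: 5 documents over a 6-word vocabulary.
counts = [
    [5, 4, 6, 0, 1, 0],
    [4, 6, 5, 1, 0, 0],
    [0, 1, 0, 5, 6, 4],
    [1, 0, 0, 6, 4, 5],
    [3, 3, 2, 3, 2, 3],
]
p_z_d, p_w_z, trace = plsa_em(counts, num_topics=2)
```

Each row of `p_z_d` is the topic mixture of one training document, which is exactly the quantity that pLSA cannot produce for an unseen document without further heuristics.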


Advanced Topic Models 
The pLSA was the first mixed membership model 
allowing a single observed sample to be generated 
by multiple latent topics at the same time. How- 
ever, pLSA cannot be considered a fully generative 
model. In fact, the document-specific mixing weights 
for the topics are not sampled from a distribution;
rather, they are selected from $P(z_k|d_i)$ based on the index
of document $d_i$. Hence, pLSA indexes only those
documents that are in the training set $\mathcal{D}$ and cannot
directly model the generative process of unseen
test documents. In other words, the pLSA is basi- 
cally assigning null probabilities to all inputs that are 
not in the training set. The folding-in heuristic has 
been proposed to opportunistically solve this limitation, 
by assigning latent variables in the test-data to their 
MAP values before computing the test-set perplexity. 
However, the folding-in approach has been shown to 
lead to overly optimistic estimates of the test-set log- 
likelihood [31.51]. 

The latent Dirichlet allocation (LDA) [31.52] has 
been proposed as a Bayesian approach to address such 
modeling limitation of pLSA. It extends pLSA by treat- 
ing the multinomial weights P(Z|D) as additional latent 
random variables, sampling them from a Dirichlet dis- 
tribution, that is the conjugate prior of a multinomial 
distribution. Using a conjugate prior eases inference,
as it ensures that the posterior distribution has the
same form as the prior. The latent variable decomposition
of the LDA likelihood is

$$ P(W = w|\phi, \alpha, \beta) = \int \sum_{z} P(W = w|Z = z, \phi)\, P(Z = z|\theta)\, P(\theta|\alpha)\, P(\phi|\beta)\, d\theta , \tag{31.25} $$

where $P(W|Z, \phi)$ is the multinomial word-topic distribution
with parameters $\phi$ sampled from the Dirichlet
distribution $P(\phi|\beta)$. The term $P(Z|\theta)$ is the topic distribution
having $\theta$ as document-specific multinomial
parameter, sampled from the Dirichlet $P(\theta|\alpha)$.
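The generative process behind this decomposition can be sketched by sampling a small synthetic corpus. The sketch assumes symmetric Dirichlet distributions with scalar concentrations `alpha` and `beta`, a fixed document length, and invented sizes; the function names are hypothetical. It only generates data and performs no inference.

```python
import random

def sample_dirichlet(rng, concentration, dim):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(num_docs, doc_length, num_topics, vocab_size, alpha, beta, seed=0):
    """Sample topic-word parameters phi and a corpus of word-index lists."""
    rng = random.Random(seed)
    # One word distribution per topic, phi_k ~ Dirichlet(beta).
    phi = [sample_dirichlet(rng, beta, vocab_size) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # Document-specific topic proportions, theta ~ Dirichlet(alpha).
        theta = sample_dirichlet(rng, alpha, num_topics)
        doc = []
        for _ in range(doc_length):
            z = rng.choices(range(num_topics), weights=theta)[0]   # topic of this word
            w = rng.choices(range(vocab_size), weights=phi[z])[0]  # word given the topic
            doc.append(w)
        corpus.append(doc)
    return phi, corpus

phi, corpus = generate_corpus(num_docs=4, doc_length=30, num_topics=3,
                              vocab_size=10, alpha=0.5, beta=0.5)
```

Because `theta` is drawn afresh for every document, the same procedure applies unchanged to documents outside the training set, which is the property that pLSA lacks.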


The terms $\alpha$ and $\beta$ are the hyperparameters of
the Dirichlet distributions (see Fig. 31.7 for the model
in plate notation). Direct EM inference is impossible for
LDA, since the integral in (31.25) is intractable due
to the couplings between the parameters within the
topic marginalization. Again, approximate and stochastic
Bayesian inference methods, such as those in
Sect. 31.2.3, are used to fit the LDA parameters, including
VB [31.52], expectation propagation [31.53], and
Gibbs sampling [31.54].

Fig. 31.7 Graphical model for the latent Dirichlet allocation

The principles underlying pLSA and LDA have
inspired the development of latent topic models that account
for more articulated assumptions on the form of
the hidden generative process. For instance, hierarchical
LDA [31.55] proposes a generative process where observations
are generated by a topic tree instead of being
drawn from a flat topic collection. Further, specialized
latent variable models have been developed for specific
applications, such as author-topic analysis in scientific
literature [31.56] and image understanding [31.57].


31.4 Markov Models


Time series and, more generally, sequences are a form
of structured data that represents a list of observations
for which a complete order can be defined, e.g., time
in a temporal sequence. Let a sequence of length $T$ be
$\mathbf{y} = y_1, \dots, y_T$, where the bold notation is used to denote
the fact that $\mathbf{y}$ is a compound object (in practice,
however, this can be treated as a set of random variables).
The term $y_t$ is used to denote the $t$th observation
with respect to the total order. Position $t$ is often referred
to as time when dealing with time-series data.

Two sequences are generally the results of independent
trials, hence they can be considered i.i.d. samples.
However, the elements composing a sequence fail to
meet such i.i.d. property. Therefore, in principle, a probabilistic
model for $\mathbf{y}$ would be required to specify the
joint distribution $P(Y_1, \dots, Y_T)$. For discrete valued observations
$y_t$, the size of the joint distribution grows exponentially
with the length of the sequence. Clearly, this
would make the use of the probabilistic model fairly
impractical due to the exponential size of the parameter
space. To reduce such parameterization, Markov
chains make the simplifying assumption that an observation
occurring at some position $t$ of the sequence
only depends on a limited number of its predecessors
with respect to the complete order. In a time series,
this entails that an observation at the present time only
depends on the history of a limited number of past
observations. Markov chains allow us to model such
history dependence and are the heart of the hidden
Markov model (HMM), which is the most popular approach
to model the generative process of sequential
data.

The HMM is a notable example of latent variable 
model: in the following, we provide an overview of 
the associated learning and inference problems. For 
simplicity, the presentation focuses on sequences of finite
length $T$ and discrete time $t$. Sequence elements $y_t$ can
be either discrete valued or defined over reals, without
major impact on the model. The section also discusses
how the HMM causation assumption can be modified
to give rise to alternative approaches, with interesting
applications that go beyond simple sequence modeling.


31.4.1 Markov Chains 


A Markov chain is a simple stochastic process for
sequences. It assumes that an observation $y_t$ at time
(position) $t$ only depends on a finite set of $L \geq 1$ predecessors
in the sequence. The number of predecessors
$L$ influencing the new observation is the order of the
Markov chain.


Definition 31.4 Markov Chain
An $L$-order Markov chain is a sequence of random variables
$Y = Y_1, \dots, Y_T$ such that for every $t \in \{1, \dots, T\}$,
it holds

$$ P(Y_t = y_t \mid Y_1, \dots, Y_{t-1}) = P(Y_t = y_t \mid Y_{t-L}, \dots, Y_{t-1}) . \tag{31.26} $$


Following from the discussions in Sect. 31.2.1, (31.26)
states that the $L$ predecessors of $Y_t$ define the set
of its Bayesian parents $pa(Y_t) = \{Y_{t-L}, \dots, Y_{t-1}\}$. For
a first-order Markov chain, i.e., $L = 1$, (31.26) reduces
to $P(Y_t = y_t|Y_{t-1} = y_{t-1})$. Such conditional independence
assumption formally encodes the intuition that
the current observation can be predicted from the
sole knowledge of the preceding sample. The graphical
model of a first-order Markov chain is shown in
Fig. 31.8, whose joint distribution decomposes as

$$ P(Y_1, \dots, Y_T) = P(Y_1)\, P(Y_2|Y_1)\, P(Y_3|Y_2) \cdots P(Y_T|Y_{T-1}) = P(Y_1) \prod_{t=2}^{T} P(Y_t|Y_{t-1}) . \tag{31.27} $$

The first element $Y_1$ has an empty conditioning part
given that it has no predecessor. Its probability $P(Y_1)$
is referred to as marginal or prior probability, while the
term $P(Y_t|Y_{t-1})$ is the transition probability.

Fig. 31.8 Graphical model for a first-order Markov chain
of length $T$, where $pa(Y_t) = \{Y_{t-1}\}$

A Markov chain is stationary, or homogeneous, if
the transition probability does not depend on the time
(position) $t$. In other words, the parameterization of the
Markov chain is such that

$$ P(Y_t = y \mid Y_{t-1} = y') = f(y', y) , $$

where the transition distribution is a function $f(y', y)$
of the sole observations $y$, $y'$. An interesting stationary
first-order Markov chain is that whose random variables
take values from a finite alphabet of discrete symbols
$i, j \in \{1, \dots, M\}$. In these chains, the transition probability

$$ A_{ij} = P(Y_t = i \mid Y_{t-1} = j) \tag{31.28} $$

denotes the probability of occurrence of the $i$th symbol
preceded by symbol $j$. For convenience, such probability
is represented by the element $A_{ij}$ of the $M \times M$
transition matrix $A = [A_{ij}]_{i,j=1}^{M}$. Similarly, the marginal
distribution defines the elements

$$ \pi_i = P(Y_1 = i) \tag{31.29} $$

of the $M \times 1$ initial state vector $\pi = [\pi_i]_{i=1}^{M}$. These
Markov chains can be straightforwardly interpreted as
state-transition systems, where each symbol $i$ of the
alphabet is a state and a state-transition arrow exists from
state $j$ to state $i$ whenever the entry $A_{ij}$ of the
transition matrix is nonzero.
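A minimal sketch of such a chain follows, with invented values for the transition matrix and the initial state vector over an alphabet of M = 3 symbols (indexed from 0 in the code). Following the convention of (31.28), the entry `A[i][j]` is the probability of moving to symbol i from symbol j, so it is the columns of `A` that sum to one. The function names are hypothetical.

```python
import math
import random

# Hypothetical parameters for an alphabet of M = 3 symbols.
PI = [0.6, 0.3, 0.1]            # PI[i] = P(Y_1 = i)
A = [[0.7, 0.2, 0.1],           # A[i][j] = P(Y_t = i | Y_{t-1} = j)
     [0.2, 0.6, 0.3],
     [0.1, 0.2, 0.6]]
M = len(PI)

def sample_chain(length, rng):
    """Draw a sequence: first symbol from PI, then repeated transitions."""
    y = rng.choices(range(M), weights=PI)
    for _ in range(length - 1):
        prev = y[-1]
        column = [A[i][prev] for i in range(M)]  # distribution of the next symbol
        y.append(rng.choices(range(M), weights=column)[0])
    return y

def log_prob(y):
    """Log of the first-order factorization: prior times transitions."""
    value = math.log(PI[y[0]])
    for prev, cur in zip(y, y[1:]):
        value += math.log(A[cur][prev])
    return value

rng = random.Random(0)
sequence = sample_chain(10, rng)
sequence_log_prob = log_prob(sequence)
```

For instance, `log_prob([0, 0, 1])` adds the logarithms of `PI[0]`, `A[0][0]` and `A[1][0]`.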

The Markov chains described by (31.28) and
(31.29), despite their simplicity, have found wide application,
e.g., in modeling of physical phenomena,
economic time series, and information retrieval. Learning
Markov chains requires fitting the $M^2$ parameters
of the transition matrix plus an $M$-dimensional prior,
where $M$ is the size of the observation alphabet. Efficient
methods exist to fit stationary first-order Markov
chains by maximum likelihood (ML). By using the decomposition
in (31.27), substituting the definitions in
(31.28) and (31.29), the Markov chain log-likelihood
for a generic sequence $\mathbf{y}$ is

$$ \mathcal{L}(\theta) = \log P(Y = \mathbf{y}|\theta) = \log \left[ \prod_{i'=1}^{M} \pi_{i'}^{\delta(y_1 = i')} \times \prod_{t=2}^{T} \prod_{i,j=1}^{M} A_{ij}^{\delta(y_t = i,\, y_{t-1} = j)} \right] , \tag{31.30} $$

where $\theta = (A, \pi)$ are the model parameters and
$\delta(y_t = i, y_{t-1} = j)$ is the indicator function: it
equals 1 if a transition from $y_{t-1} = j$ to $y_t = i$ is
observed at position $t$ of the sequence, and 0 otherwise.
Similarly, $\delta(y_1 = i') = 1$ if and only if the first symbol
of the sequence is $i'$. The final expression of the log-likelihood
is obtained by taking the log into the products and
adding appropriate Lagrange multipliers for normalization.
The ML estimate is obtained by differentiating this
final expression with respect to parameters $A_{ij}$ and $\pi_i$,
yielding


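$$
A_{ij} = \frac{\sum_{t=2}^{T} \delta(y_t = i, y_{t-1} = j)}{\sum_{t=2}^{T} \sum_{i'=1}^{M} \delta(y_t = i', y_{t-1} = j)}, \tag{31.31}
$$

$$
\pi_i = \frac{\delta(y_1 = i)}{\sum_{i'=1}^{M} \delta(y_1 = i')}. \tag{31.32}
$$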


Intuitively, the ML estimate corresponds to counting the 
number of transitions from symbol j to i across time 
(similarly for the initial state). Generalization to a set of
$N$ sample sequences $y^n$ is straightforward: it suffices to
count transitions both in time and across samples, and
similarly for the initial symbols $y_1^n$.
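
The counting interpretation of the ML estimate can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the chapter: the function name `fit_markov_chain`, the 0-based symbol encoding, the uniform fallback for symbols never seen as predecessors, and the toy sequences are all assumptions made here.

```python
import numpy as np

def fit_markov_chain(sequences, M):
    """ML estimate of (A, pi) by counting, in the spirit of (31.31)-(31.32).

    sequences: list of integer sequences with symbols in {0, ..., M-1}
    (0-based here, whereas the text uses 1, ..., M).
    Convention of (31.28): A[i, j] = P(Y_t = i | Y_{t-1} = j),
    so each column of A sums to one.
    """
    trans = np.zeros((M, M))
    init = np.zeros(M)
    for y in sequences:
        init[y[0]] += 1                          # count initial symbols
        for prev, cur in zip(y[:-1], y[1:]):
            trans[cur, prev] += 1                # count transitions prev -> cur
    col = trans.sum(axis=0, keepdims=True)
    # Assumption: a symbol never observed as predecessor gets a uniform column.
    A = np.where(col > 0, trans / np.where(col > 0, col, 1.0), 1.0 / M)
    pi = init / init.sum()
    return A, pi

# Two toy sequences over an alphabet of M = 3 symbols (made-up data).
seqs = [[0, 1, 1, 2, 0], [1, 2, 0, 0, 1]]
A, pi = fit_markov_chain(seqs, M=3)
```

With several sequences, the counts are simply pooled across samples before normalizing, which is the generalization described above.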


31.4.2 Hidden Markov Models 


Markov chains model sequential data assuming that se- 
quence elements are generated by a fully observable 
stochastic process. In the discrete-state Markov chain, 
this requires each state of the process to correspond to 
an observable element of the sequence, i.e., an event.
On the other hand, most real-world systems generate 
observable events that are correlated, but not coinci- 
dent, with the state of the generating process. More 
importantly, the only available information can be the 
outcome of the stochastic process at each time, i.e., 
event $y_t$, while the state of the system remains
unobservable, i.e., hidden. The HMM allows modeling more
general stochastic processes where the state transition
dynamics is disentangled from the observable information
generated by the process. The state-transition
dynamics is assumed to be nonobservable and is modeled
by a Markov chain of discrete and finite latent
variables, i.e., the hidden states. The observable information
is then generated by such hidden states similarly
to how latent variables generate observations in mixture
models (see Sect. 31.3.4).

Fig. 31.9 A first-order HMM with hidden states $S_t$ chosen
on the discrete domain $\{1, \dots, C\}$, for $t = 1, \dots, T$

The graphical model of an HMM is exemplified in
Fig. 31.9: the hidden states are latent variables $S_t$, while
the sequence elements $Y_t$ are observed.

The conditional dependence expressed by the arrow
$S_t \to Y_t$ indicates that the observed element of the
sequence at time $t$ is generated by the corresponding
hidden state $S_t$ through the emission distribution
$b_{s_t}(y_t) = P(Y_t = y_t \mid S_t = s_t)$. The unknown state-transition
dynamics is modeled by the first-order Markov chain
of discrete and finite hidden states $S_t$. By applying
the Markovian decomposition in (31.27) to the hidden
states chain, the joint distribution of the observed
sequence $y = y_1, \dots, y_T$ and associated hidden states
$s = s_1, \dots, s_T$ is written as

$$
P(Y = y, S = s) = P(S_1)\, P(Y_1 \mid S_1) \prod_{t=2}^{T} P(S_t \mid S_{t-1})\, P(Y_t \mid S_t). \tag{31.33}
$$


The actual parameterization of the probabilities in
(31.33) depends on the form of the observation and
hidden state variables. From Sect. 31.8, a stationary
hidden states chain is known to be regulated by the
$C \times C$ matrix of state transitions $A_{ij} = P(S_t = i \mid S_{t-1} = j)$
and by the $C$-dimensional vector of initial state
probabilities $\pi_i = P(S_1 = i)$, where $i, j$ are drawn from
$\{1, \dots, C\}$. For discrete sequence observations
$y_t \in \{1, \dots, M\}$, the emission distribution is an $M \times C$
emission matrix $B$ such that its elements are

$$b_i(k) = B_{ki} = P(Y_t = k \mid S_t = i). \tag{31.34}$$
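
To make the generative reading of the discrete HMM concrete, the following sketch samples a hidden state path and an observation sequence from given parameters. It is a hedged illustration only: the helper `sample_hmm`, the 0-based indexing, and the toy values of `pi`, `A` and `B` are assumptions introduced here, not taken from the chapter.

```python
import numpy as np

def sample_hmm(pi, A, B, T, rng):
    """Draw one (states, observations) pair of length T from a discrete HMM.

    Conventions follow the text, with 0-based indices:
      pi[i]   = P(S_1 = i)
      A[i, j] = P(S_t = i | S_{t-1} = j)   (columns sum to one)
      B[k, i] = P(Y_t = k | S_t = i)       (columns sum to one)
    """
    C, M = len(pi), B.shape[0]
    s = np.empty(T, dtype=int)
    y = np.empty(T, dtype=int)
    s[0] = rng.choice(C, p=pi)                   # initial hidden state
    y[0] = rng.choice(M, p=B[:, s[0]])           # first emission
    for t in range(1, T):
        s[t] = rng.choice(C, p=A[:, s[t - 1]])   # hidden state transition
        y[t] = rng.choice(M, p=B[:, s[t]])       # emission from current state
    return s, y

# Toy parameters (made up for illustration): C = 2 states, M = 3 symbols.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.2],
              [0.3, 0.8]])
B = np.array([[0.5, 0.1],
              [0.4, 0.3],
              [0.1, 0.6]])
states, obs = sample_hmm(pi, A, B, T=10, rng=np.random.default_rng(0))
```

Only `obs` would be available to an observer; `states` plays the role of the hidden assignment.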


For continuous observations $y_t$, the state assignment
$S_t = i$ selects the $i$th emission distribution $b_i(y_t) =
P(Y_t \mid S_t = i)$ from a mixture of $C$ candidates.

An HMM is a latent variable model defined by
the $\theta = (\pi, A, B)$ parameters and, implicitly, by the
(unknown) number of hidden states $C$. In [31.58],
three notable inference problems are identified for an
HMM.

Definition 31.5 Evaluation Problem
Given a model $\theta$ and an observed sequence $y$, determine
the likelihood $P(Y = y \mid \theta)$ of the sequence being
generated by the model.

Definition 31.6 Learning Problem
Given a dataset of $N$ observed sequences $D =
\{y^1, \dots, y^N\}$ and the number of hidden states $C$, find
the parameters $\pi$, $A$ and $B$ that maximize the probability
of model $\theta = \{\pi, A, B\}$ having generated the sequences
in $D$.

Definition 31.7 Optimal States Problem
Given a model $\theta$ and an observed sequence $y$, find an
optimal state assignment $s^* = s_1^*, \dots, s_T^*$ for the
underlying hidden Markov chain.

These classical inference problems are addressed using
efficient and numerically stable recursive algorithms
that exploit message passing on the HMM
junction tree (Sect. 31.2.3) to factorize the otherwise
hardly tractable joint maximization problems.
The underlying intuition is a recursive computation
of intermediate probabilities (messages) that are
passed forward and backward along the sequence
(the junction tree, in practice) to accumulate evidence
for solving the joint problem. A discussion of
the key aspects of these solutions is provided in the
following.


Evaluation

The evaluation problem refers to measuring how well
a given HMM matches an observed sequence. Let
the model be $\theta = (\pi, A, B)$ and the observed sequence
$y = y_1, \dots, y_T$; the objective is to find $P(Y = y \mid \theta)$.
To compute this probability under the
HMM assumption, it is necessary to introduce the hidden
state assignment corresponding to the observed
sequence $y$. Following the general approach for latent
variable models in Eq. (31.11), this is introduced
through marginalization over the joint assignment $s =
s_1, \dots, s_T$:

$$
P(Y \mid \theta) = \sum_{s} P(Y, S = s \mid \theta)
= \sum_{s_1, \dots, s_T} P(S_1)\, P(Y_1 \mid S_1) \prod_{t=2}^{T} P(S_t \mid S_{t-1})\, P(Y_t \mid S_t), \tag{31.35}
$$

where the joint probability $P(Y, S \mid \theta)$ has been factorized
according to the HMM assumption in (31.33).
Direct computation of (31.35) is generally infeasible,
as it would require $O(T C^T)$ operations. This
probability can be efficiently computed, with $O(T C^2)$
operations, through accumulation of a recursive term
that is computed by scanning the sequence from left to
right. The procedure is known as the forward algorithm: let
$Y_{1:t}$ be the observed subsequence from position 1 to $t$, and
define the forward probability as

$$\alpha_t(i) = P(Y_{1:t} = y_{1:t}, S_t = i \mid \theta), \tag{31.36}$$

that is, the probability of observing a partial sequence up
to position $t$ and the underlying hidden process being in
state $i$ at time $t$. A recursive formulation of $\alpha_t(i)$
is obtained by introducing the hidden state $S_{t-1}$
by marginalization, yielding

$$
\begin{aligned}
\alpha_t(i) &= \sum_{j=1}^{C} P(Y_{1:t} = y_{1:t}, S_t = i, S_{t-1} = j \mid \theta) \\
&= \sum_{j=1}^{C} P(Y_t = y_t \mid S_t = i, \theta)\, P(S_t = i \mid S_{t-1} = j, \theta)\, P(Y_{1:t-1} = y_{1:t-1}, S_{t-1} = j \mid \theta) \\
&= b_i(y_t) \sum_{j=1}^{C} A_{ij}\, \alpha_{t-1}(j), \tag{31.37}
\end{aligned}
$$


where the second equality follows from the conditional
independence assumptions of the model. Since
$pa(S_t) = \{S_{t-1}\}$, the chain element $S_t$ depends only
on the hidden state at the previous time, $S_{t-1}$;
similarly, emission $Y_t$ is conditionally independent of
the rest, given the hidden state $S_t$.

The forward recursion scans the observed sequence
from left to right and recursively computes the $\alpha_t(i)$
values in each position $t = 1, \dots, T$ using (31.37). At each
observed position $t$, the $\alpha_t(i)$ values are computed for
each $i \in \{1, \dots, C\}$, since the hidden states are not
observed. The basis of the recursion is at $t = 1$, where
(31.37) reduces to $\alpha_1(i) = b_i(y_1)\pi_i$, with $y_1$ the
first element of the observed sequence. The likelihood
of the full sequence $y = y_{1:T}$ is computed at the end of
the forward recursion as

$$
P(Y \mid \theta) = \sum_{i=1}^{C} P(Y_{1:T}, S_T = i \mid \theta) = \sum_{i=1}^{C} \alpha_T(i). \tag{31.38}
$$
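
The forward recursion translates almost literally into array operations. The sketch below is an unscaled illustration for a discrete HMM, assuming 0-based indices and toy parameters invented here; the brute-force function is included only to check the recursion on a tiny example by enumerating all $C^T$ state paths.

```python
import itertools
import numpy as np

def forward(pi, A, B, y):
    """Forward probabilities alpha[t, i] = P(y_1..y_t, S_t = i), 0-based t.

    A[i, j] = P(S_t = i | S_{t-1} = j), B[k, i] = P(Y_t = k | S_t = i).
    """
    T, C = len(y), len(pi)
    alpha = np.zeros((T, C))
    alpha[0] = B[y[0]] * pi                      # basis: b_i(y_1) * pi_i
    for t in range(1, T):
        alpha[t] = B[y[t]] * (A @ alpha[t - 1])  # recursion (31.37)
    return alpha

def brute_force_likelihood(pi, A, B, y):
    """Sum the joint over all C**T state paths; only viable for tiny T."""
    C = len(pi)
    total = 0.0
    for s in itertools.product(range(C), repeat=len(y)):
        p = pi[s[0]] * B[y[0], s[0]]
        for t in range(1, len(y)):
            p *= A[s[t], s[t - 1]] * B[y[t], s[t]]
        total += p
    return total

# Toy parameters (made up for illustration): C = 2 states, M = 3 symbols.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.2],
              [0.3, 0.8]])
B = np.array([[0.5, 0.1],
              [0.4, 0.3],
              [0.1, 0.6]])
y = [0, 2, 1, 2]
likelihood = forward(pi, A, B, y)[-1].sum()      # final sum as in (31.38)
```

Each step costs one $C \times C$ matrix-vector product, which is where the $O(TC^2)$ total comes from.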


Learning 

Learning of an HMM $\theta = (\pi, A, B)$ amounts to finding
the values of the parameters $\pi$, $A$ and $B$ that are
most likely to have generated a dataset of observed
i.i.d. sequences $D = \{y^1, \dots, y^N\}$. From the evaluation
problem, we know how to measure the quality of the
matching between a sequence $y$ and a model $\theta$ using the
likelihood $P(Y \mid \theta)$. The HMM learning problem can be
solved through ML estimation of the $\theta$ parameters, considering
the hidden states as latent variables. As discussed
in Sect. 31.3.2, this problem can be solved through
application of the EM algorithm, whose HMM version is
referred to as the Baum-Welch algorithm [31.59], which is
a form of the sum-product inference algorithm introduced
in Sect. 31.2.3. Marginalization of the hidden states, as
in (31.35), yields the HMM log-likelihood on the
dataset $D$

$$
\mathcal{L}(\theta) = \log \prod_{n=1}^{N} P(y^n \mid \theta)
= \sum_{n=1}^{N} \log \left\{ \sum_{s_1^n, \dots, s_{T_n}^n} P(S_1^n)\, P(Y_1^n \mid S_1^n)
\prod_{t=2}^{T_n} P(S_t^n \mid S_{t-1}^n)\, P(Y_t^n \mid S_t^n) \right\}, \tag{31.39}
$$


where superscript $n$ refers to the $n$th sequence $y^n$ and $T_n$
is the corresponding length. The likelihood in (31.39)
is intractable due to the nonobservable state assignment
that introduces the marginalization term. Following the
principles of the EM algorithm, we assume the
unobserved state assignment to be known, as in (31.30). This can be
achieved by introducing indicator variables $z_{ti}^n$ for the
unknown assignment, such that $z_{ti}^n = 1$ if the chain is
in state $i$ at position $t$ of the $n$th sequence, and 0
otherwise. Given this (assumed) knowledge about the
hidden state assignments, it is possible to write the
corresponding completed likelihood


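$$
\begin{aligned}
\mathcal{L}_c(\theta) &= \log P(D, Z \mid \theta) \\
&= \log \prod_{n=1}^{N} \left\{ \prod_{i=1}^{C} P(S_1^n = i)^{z_{1i}^n}
\times \prod_{t=2}^{T_n} \prod_{i,j=1}^{C} P(S_t^n = i \mid S_{t-1}^n = j)^{z_{ti}^n z_{(t-1)j}^n}
\times \prod_{t=1}^{T_n} \prod_{i=1}^{C} P(Y_t^n \mid S_t^n = i)^{z_{ti}^n} \right\} \\
&= \sum_{n=1}^{N} \left\{ \sum_{i=1}^{C} z_{1i}^n \log \pi_i
+ \sum_{t=2}^{T_n} \sum_{i,j=1}^{C} z_{ti}^n z_{(t-1)j}^n \log A_{ij}
+ \sum_{t=1}^{T_n} \sum_{i=1}^{C} z_{ti}^n \log b_i(y_t^n) \right\}, \tag{31.40}
\end{aligned}
$$


where the latter equality introduces the parameters $\theta$ in
place of the corresponding probabilities and brings the
logarithms into the products.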

The EM procedure is applied to the complete log-likelihood
in (31.40). Following (31.15), the E-step
computes the expected value of $\mathcal{L}_c(\theta)$ with respect
to the distribution of the indicator variables $Z = \{z_{ti}^n\}$,
conditional on the observed sequences $D$ and the current
estimate of the parameters $\theta^{(k)}$. Given $\mathcal{L}_c(\theta)$ as in
(31.40), taking its conditional expectation with respect
to the hidden variables $Z$ yields the following posterior
probability:

$$E_{Z \mid Y, \theta^{(k)}}[z_{ti}] = P(S_t = i \mid y), \tag{31.41}$$

where superscript $n$ is omitted for notational simplicity.
The estimation of this posterior is known as the
smoothing problem. In the Baum-Welch algorithm, this
is efficiently solved by a double recursion that exploits
the following decomposition of the joint probability


$$
\begin{aligned}
P(S_t = i, y) &= P(S_t = i, Y_{1:t}, Y_{t+1:T}) \\
&= P(S_t = i, Y_{1:t})\, P(Y_{t+1:T} \mid S_t = i) = \alpha_t(i)\, \beta_t(i), \tag{31.42}
\end{aligned}
$$


where the observed contribution from the predecessors
of $t$ (i.e., $Y_{1:t}$) is separated from that of its successors
(i.e., $Y_{t+1:T}$). The cancelations in (31.42) follow
from the fact that $S_t$ d-separates (see definition in
Sect. 31.2.1) the elements of the two subsequences, i.e.,
$Y_{1:t}$ and $Y_{t+1:T}$.

The first term in (31.42) is the $\alpha_t(i)$ probability
defined in (31.36), which can be computed through
the forward algorithm. The $\beta_t(i)$ term can also be
computed through a recursive procedure known as the
backward algorithm, due to the inverted direction with
respect to the forward recursion. Consider the following
recursive decomposition

$$
\begin{aligned}
\beta_{t-1}(j) &= P(Y_{t:T} \mid S_{t-1} = j) \\
&= \sum_{i=1}^{C} P(Y_{t:T}, S_t = i \mid S_{t-1} = j) \\
&= \sum_{i=1}^{C} P(Y_t \mid S_t = i)\, P(Y_{t+1:T} \mid S_t = i)\, P(S_t = i \mid S_{t-1} = j) \\
&= \sum_{i=1}^{C} b_i(y_t)\, \beta_t(i)\, A_{ij}; \tag{31.43}
\end{aligned}
$$

it can be computed for $2 \le t \le T$ by scanning the
sequence backward, assuming $\beta_T(j) = 1$ for each
$j \in \{1, \dots, C\}$.

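The final expression of the smoothed posterior in
(31.41) is given by the joint $\alpha$-$\beta$ recursions, known as
the forward-backward algorithm, that is

$$
\gamma_t(i) = P(S_t = i \mid Y) = \frac{P(S_t = i, Y)}{P(Y)}
= \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^{C} \alpha_t(j)\, \beta_t(j)}. \tag{31.44}
$$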
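
A compact sketch of the forward-backward computation of the smoothed posterior follows. It is an unscaled version meant for short sequences only, and the function names, 0-based indexing and toy parameters are assumptions made for this illustration.

```python
import numpy as np

def forward(pi, A, B, y):
    """alpha[t, i] = P(y_1..y_t, S_t = i); A[i, j] = P(S_t=i | S_{t-1}=j)."""
    T, C = len(y), len(pi)
    alpha = np.zeros((T, C))
    alpha[0] = B[y[0]] * pi
    for t in range(1, T):
        alpha[t] = B[y[t]] * (A @ alpha[t - 1])
    return alpha

def backward(A, B, y):
    """beta[t, j] = P(y_{t+1}..y_T | S_t = j), filled from the last position."""
    T, C = len(y), A.shape[0]
    beta = np.ones((T, C))                       # basis: beta_T(j) = 1
    for t in range(T - 1, 0, -1):
        # beta_{t-1}(j) = sum_i b_i(y_t) beta_t(i) A_ij, as in (31.43)
        beta[t - 1] = A.T @ (B[y[t]] * beta[t])
    return beta

# Toy parameters (made up for illustration): C = 2 states, M = 3 symbols.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.2],
              [0.3, 0.8]])
B = np.array([[0.5, 0.1],
              [0.4, 0.3],
              [0.1, 0.6]])
y = [0, 2, 1, 2]

alpha = forward(pi, A, B, y)
beta = backward(A, B, y)
# Smoothed posterior gamma[t, i] = P(S_t = i | y), normalized as in (31.44).
gamma = alpha * beta / (alpha * beta).sum(axis=1, keepdims=True)
```

A useful sanity check is that the normalizer in (31.44) is the same at every position, since it always equals the sequence likelihood.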


Note that the forward and backward recursions can be
run in parallel, since the values of $\alpha$ and $\beta$ do not
depend on each other. To complete the derivation of the
sufficient statistics for the M-step, it is also necessary to
estimate the joint posterior

$$E_{Z \mid Y, \theta^{(k)}}[z_{ti}\, z_{(t-1)j}] = P(S_t = i, S_{t-1} = j \mid Y), \tag{31.45}$$

which can be straightforwardly factorized into known
probabilities along the lines of (31.42). It turns out that
such joint posterior can be estimated using the $\alpha$-$\beta$
probabilities computed by the forward-backward algorithm,
that is

$$
\gamma_{t,t-1}(i, j) = P(S_t = i, S_{t-1} = j \mid Y)
= \frac{\alpha_{t-1}(j)\, A_{ij}\, b_i(y_t)\, \beta_t(i)}{\sum_{m,l=1}^{C} \alpha_{t-1}(m)\, A_{lm}\, b_l(y_t)\, \beta_t(l)}. \tag{31.46}
$$


Parameters $\theta = (\pi, A, B)$ are re-estimated at the M-step,
with update equations that follow straightforwardly
from the maximization problem in (31.16). It
suffices to differentiate (31.40), extended with appropriate
Lagrange multipliers to account for the sum-to-one
constraints. Intuitively, the update equations can be
written directly from the ML estimates for
observable Markov chains in (31.31) and (31.32). It suffices
to substitute the observed state counts, obtained
through the indicator function $\delta(\cdot)$, with the virtual
counts $\gamma(\cdot)$ estimated by (31.44) and (31.46) at the E-step.
For the hidden state transition and initial state
distributions this yields

$$
A_{ij} = \frac{\sum_{n=1}^{N} \sum_{t=2}^{T_n} \gamma_{t,t-1}^{n}(i, j)}{\sum_{n=1}^{N} \sum_{t=2}^{T_n} \gamma_{t-1}^{n}(j)}
\quad \text{and} \quad
\pi_i = \frac{\sum_{n=1}^{N} \gamma_1^{n}(i)}{N}. \tag{31.47}
$$


The estimate of the parameters $B$ depends on the form
of the emission distribution: if the observed sequences
take values $k$ from a finite alphabet $\{1, \dots, M\}$, the
corresponding multinomial emission in (31.34) is updated
by

$$
B_{ki} = \frac{\sum_{n=1}^{N} \sum_{t=1}^{T_n} \gamma_t^{n}(i)\, \delta(y_t^n = k)}{\sum_{n=1}^{N} \sum_{t=1}^{T_n} \gamma_t^{n}(i)}, \tag{31.48}
$$

where $\delta(\cdot)$ is the indicator function counting the
occurrences of the symbol $k$ in the observed sequences.
Real-valued sequences are usually modeled through
Gaussian emissions, whose parameters are fit as usual
through maximization of the complete log-likelihood.
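
The E-step and M-step can be combined into a single EM iteration. The sketch below handles one discrete sequence ($N = 1$) without scaling, so it is suitable only for short sequences; the function name `em_step`, the 0-based indexing and the toy starting parameters are assumptions made for illustration.

```python
import numpy as np

def em_step(pi, A, B, y):
    """One Baum-Welch iteration on a single sequence (N = 1).

    Returns updated (pi, A, B) and the likelihood P(y | old parameters).
    A[i, j] = P(S_t = i | S_{t-1} = j), B[k, i] = P(Y_t = k | S_t = i).
    """
    y = np.asarray(y)
    T, C, M = len(y), len(pi), B.shape[0]
    # E-step: unscaled forward-backward recursions.
    alpha = np.zeros((T, C))
    beta = np.ones((T, C))
    alpha[0] = B[y[0]] * pi
    for t in range(1, T):
        alpha[t] = B[y[t]] * (A @ alpha[t - 1])
    for t in range(T - 1, 0, -1):
        beta[t - 1] = A.T @ (B[y[t]] * beta[t])
    lik = alpha[-1].sum()
    gamma = alpha * beta / lik                   # state posteriors (31.44)
    # xi[t-1, i, j] = P(S_t = i, S_{t-1} = j | y) for t = 2..T, as in (31.46)
    xi = (B[y[1:]] * beta[1:])[:, :, None] * A[None] * alpha[:-1, None, :] / lik
    # M-step: observed counts replaced by expected counts (31.47)-(31.48).
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[None, :]
    new_B = np.stack([gamma[y == k].sum(axis=0) for k in range(M)])
    new_B /= gamma.sum(axis=0)[None, :]
    return new_pi, new_A, new_B, lik

# Toy starting point (made up): C = 2 states, M = 3 symbols.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.2],
              [0.3, 0.8]])
B = np.array([[0.5, 0.1],
              [0.4, 0.3],
              [0.1, 0.6]])
y = [0, 2, 1, 2, 2, 0, 1, 2]
liks = []
for _ in range(5):
    pi, A, B, lik = em_step(pi, A, B, y)
    liks.append(lik)
```

Because each EM iteration cannot decrease the likelihood, the values collected in `liks` should be nondecreasing, which gives a simple check on an implementation.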

Particular care must be taken to avoid numerical 
problems when implementing the forward-backward 
algorithm. Both recursions work with multiplications 
of small numbers: hence, the values of $\alpha$ and $\beta$ can
underflow for long sequences. To avoid this, it is advisable
to perform them in log-space or to work with scaled
versions of the $\alpha$ and $\beta$ probabilities [31.60]. A sequential
version of the smoothing algorithm exists [31.61]
that directly computes the smoothed posterior $\gamma_t(i) =
P(S_t = i \mid Y)$ through a $\gamma$-recursion that uses the $\alpha$
values generated by the forward algorithm.
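
The underflow issue and the scaling remedy can be illustrated as follows. This is one common scaling scheme (renormalizing $\alpha$ at every step and accumulating the logarithms of the normalizers), offered as a sketch rather than as the specific method of [31.60]; function names and toy parameters are assumptions.

```python
import numpy as np

def naive_likelihood(pi, A, B, y):
    """Unscaled forward pass; the result underflows to 0.0 for long sequences."""
    a = B[y[0]] * pi
    for t in range(1, len(y)):
        a = B[y[t]] * (A @ a)
    return a.sum()

def log_likelihood_scaled(pi, A, B, y):
    """log P(y) via a forward pass that renormalizes alpha at every step."""
    a = B[y[0]] * pi
    log_lik = 0.0
    for t in range(len(y)):
        if t > 0:
            a = B[y[t]] * (A @ a)
        c = a.sum()              # scaling factor c_t = P(y_t | y_1..y_{t-1})
        log_lik += np.log(c)
        a = a / c                # scaled alpha sums to one at every position
    return log_lik

# Toy parameters (made up for illustration): C = 2 states, M = 3 symbols.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.2],
              [0.3, 0.8]])
B = np.array([[0.5, 0.1],
              [0.4, 0.3],
              [0.1, 0.6]])
y_short = [0, 2, 1, 2]
y_long = [0, 2, 1] * 2000        # 6000 observations

short_ll = log_likelihood_scaled(pi, A, B, y_short)
long_ll = log_likelihood_scaled(pi, A, B, y_long)
naive_long = naive_likelihood(pi, A, B, y_long)
```

On the short sequence the two computations agree; on the long one the unscaled product collapses to zero in double precision while the scaled version still returns a finite log-likelihood.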


Optimal State 
Once a model $\theta$ has been trained, it can be interesting
to determine the most likely hidden state assignment
$s^*$ that has generated an observed sequence $y$. This
inference problem, also known as decoding, has different
solutions, since several optimal assignments exist,
depending on the interpretation of what an optimal
assignment is. For instance, the optimal hidden sequence
can be the one maximizing the expected count of correct
states. On the other hand, an optimal assignment might
be the sequence of hidden states $s^*$ with the maximum
joint probability $P(Y = y, S = s^*)$.

The former optimality condition is solved by selecting,
at each position $t$, the most likely state given the
sequence, i.e.,

$$s_t^* = \arg\max_{i = 1, \dots, C} P(S_t = i \mid Y). \tag{31.49}$$

Clearly, this amounts to selecting the most likely state for
each position independently, using the posterior computed
by the Baum-Welch algorithm. Conversely, the
latter optimality condition estimates the joint hidden
state assignment

$$s^* = \arg\max_{s} P(Y, S = s). \tag{31.50}$$


This is a complex inference problem that can be
efficiently solved through a dynamic programming approach,
known as the Viterbi algorithm. Note that the
two optimality definitions generally lead to different
solutions. For instance, the Viterbi solution is constrained
to provide only state transitions allowed by the generating
distribution, while this is not the case for the
Baum-Welch solution, given that hidden states are selected
independently.

The Viterbi algorithm is based on a backward recursion
that exploits a factorization of the maximization
problem in (31.50). Consider the restricted problem of
determining the hidden state of the tail element $T$

$$
\begin{aligned}
\max_{s_T} P(Y, S_{1:T-1}, S_T = s_T) &= \max_{s_T} \prod_{t=1}^{T} P(Y_t \mid S_t)\, P(S_t \mid S_{t-1}) \\
&= \prod_{t=1}^{T-1} P(Y_t \mid S_t)\, P(S_t \mid S_{t-1})\; \max_{s_T} P(Y_T \mid S_T)\, P(S_T \mid S_{T-1}), \tag{31.51}
\end{aligned}
$$

where the joint probability factorizes according to the
Markov chain assumption, with the convention
$P(S_1 \mid S_0) = P(S_1)$. We can isolate the maximization
problem in the rightmost term

$$
\epsilon_{T-1}(s_{T-1}) = \max_{s_T} P(Y_T \mid S_T = s_T)\, P(S_T = s_T \mid S_{T-1} = s_{T-1}), \tag{31.52}
$$

that is a message conveying information on the maximization
of the tail element to the penultimate position.
Substituting the definition of $\epsilon_{T-1}(s_{T-1})$ back
in (31.51) and adding the maximization with respect
to $s_{T-1}$ suggests the recursive formulation of $\epsilon_t(\cdot)$ for
a generic position $t - 1$, i.e.,

$$
\epsilon_{t-1}(s_{t-1}) = \max_{s_t} P(Y_t \mid S_t = s_t)\, P(S_t = s_t \mid S_{t-1} = s_{t-1})\, \epsilon_t(s_t), \tag{31.53}
$$

for $2 \le t \le T$, where $\epsilon_T(s_T) = 1$ is the basis of the
recursion. At each step $t$ of the backward recursion, the
Viterbi algorithm computes the $\epsilon$-message for each possible
assignment of the hidden state of $t$ and propagates
it to the predecessor $t - 1$. The recursion ends at the initial
element of the sequence, where the initial optimal
state is obtained as

$$s_1^* = \arg\max_{s} P(Y_1 \mid S_1 = s)\, P(S_1 = s)\, \epsilon_1(s). \tag{31.54}$$

The assignment of the remaining hidden states is obtained
by backtracking through the forward recursion

$$s_t^* = \arg\max_{s} P(Y_t \mid S_t = s)\, P(S_t = s \mid S_{t-1} = s_{t-1}^*)\, \epsilon_t(s). \tag{31.55}$$

Note that the Viterbi algorithm is a special case 
of a max-sum inference algorithm introduced in 
Sect. 31.2.3. 
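
The backward-message formulation of the Viterbi algorithm described above can be sketched as follows. The code works with plain probabilities, which is adequate for short sequences only (a practical version would use log-probabilities); function names, 0-based indexing and toy parameters are assumptions, and the brute-force search is there only to check the result on a tiny example.

```python
import itertools
import numpy as np

def viterbi_backward(pi, A, B, y):
    """Most likely state path via backward max-messages and a forward pass.

    A[i, j] = P(S_t = i | S_{t-1} = j), B[k, i] = P(Y_t = k | S_t = i).
    """
    T, C = len(y), len(pi)
    eps = np.ones((T, C))                        # basis: eps_T(s) = 1
    for t in range(T - 1, 0, -1):
        # eps_{t-1}(j) = max_i b_i(y_t) A_ij eps_t(i), as in (31.53)
        eps[t - 1] = np.max((B[y[t]] * eps[t])[:, None] * A, axis=0)
    s = np.empty(T, dtype=int)
    s[0] = np.argmax(B[y[0]] * pi * eps[0])      # initial state, as in (31.54)
    for t in range(1, T):
        # remaining states given the previous optimal state, as in (31.55)
        s[t] = np.argmax(B[y[t]] * A[:, s[t - 1]] * eps[t])
    return s

def joint(pi, A, B, y, s):
    """Joint probability P(Y = y, S = s) of one state path."""
    p = pi[s[0]] * B[y[0], s[0]]
    for t in range(1, len(y)):
        p *= A[s[t], s[t - 1]] * B[y[t], s[t]]
    return p

# Toy parameters (made up for illustration): C = 2 states, M = 3 symbols.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.2],
              [0.3, 0.8]])
B = np.array([[0.5, 0.1],
              [0.4, 0.3],
              [0.1, 0.6]])
y = [0, 2, 1, 2, 0]
path = viterbi_backward(pi, A, B, y)
best = max(joint(pi, A, B, y, s)
           for s in itertools.product(range(len(pi)), repeat=len(y)))
```

The decoded `path` attains the same joint probability as the exhaustive maximum `best`, at a cost that grows linearly rather than exponentially with the sequence length.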


31.4.3 Related Models 


Higher Order Markov Models 

Hidden Markov models serve as a starting point for the 
design of more complex Markov generative processes,
besides the obvious extension to higher order hidden 
chains [31.62]. Factorial HMMs [31.63] generalize the 
original model by defining super states that are collec- 
tions of K discrete hidden states, each being part of an 
independent Markov chain (see Fig. 31.10). This facto- 
rial model results in K hidden Markov chains running 
in parallel: at each time step, the emission depends on 
the K-dimensional super state, but each state variable is 
decoupled from those of the other chains and evolves 
according to its own dynamics. By this means, it is 
possible to efficiently encode the state dynamics of K 
objects evolving independently that interact to jointly 
determine the observation (e.g., K cars moving in the 
traffic and jointly determining traffic jams).

Fig. 31.10 Factorial HMM with K = 3 independent hidden
Markov chains 


Fig. 31.11 A bottom-up hidden tree Markov model for 
a simple structure with five nodes: the generative process 
follows the direction of the arrows, i.e., from the leaves to 
the root (t = 1) 


Nonhomogeneous HMMs
Relaxation of the homogeneity assumption led to the
input/output hidden Markov model (IO-HMM) [31.64],
which allows modeling the causal dependence of the hidden
generative process on an additional input sequence
$x$. Basically, the IO-HMM enables nonhomogeneous
transition and emission distributions that are explicitly
dependent (i.e., parameterized) on the currently
observed label of the input sequence. An IO-HMM implements
a mapping, referred to as transduction, from
an observed input sequence $x$ into an output (target)
sequence $y$, realized by the input-conditional hidden
process $P(Y \mid X)$. Interesting applications of IO-HMMs
are in learning transformations between modalities in
multimedia data [31.65], exploratory analysis of financial
time series [31.66] and gene data analysis [31.67].


HMMs for Structured Data

Hidden tree Markov models represent the generative
process of more complex, tree-structured information
(see Fig. 31.11). Differently from the sequential domain,
the direction of the generative process leads
to different representational capabilities when dealing
with trees. Top-down approaches [31.68] model all possible
paths from the root to the leaves of the tree.
Bottom-up models [31.69] propose a generative process
from the leaves to the root, where complex structures
are generated by composition of simpler substructures.
Recently, an extension of the IO-HMM has been proposed
to learn transductions between trees [31.70].


Bayesian and Nonparametric Extensions
HMMs have been extended to allow a countably infinite
number of hidden states through a Bayesian approach
where state distributions are modeled by Dirichlet processes
[31.71]. Abstracting from the direction of the
arrows in Fig. 31.9 leads to a discriminative probabilistic
model known as linear-chain conditional random
fields [31.72], whose capability to model long-term dependences
is widely used in natural language parsing
and computer vision.


31.5 Conclusion and Further Reading


Graphical models have been discussed as an excellent
framework for probabilistic modeling of articulated
processes that can be described by a static set of random
variables tied up by probabilistic relationships.
Such relationships need not necessarily be known
a priori. Several approaches exist to infer them from
data, i.e., to determine the presence of a corresponding
edge in the graphical model. However, the same
approaches tend to fix the structure of the graphical
model, once this is determined from the data. In other
words, these graphical models represent a static picture
of the process, where the set of random variables and
associated relationships is held fixed from a point onward.
The nature of sequence data calls for the ability to
model more dynamic phenomena. Processing of video
information requires Markov networks that can unfold
their structure across the video sequence. Even classic
text analysis needs to account for novel generative dynamics,
where texts are produced as dynamic streams
instead of being static collections of words, e.g., consider
blog posts and associated comments, or the stream
of social network status updates. Therefore, the horizon
of current research is pushing graphical models to
more dynamic formulations where, on the one hand, the
structure is allowed to change over time and, on the
other hand, the model is allowed to dynamically self-tune
the number of parameters that is most adequate
to represent the process at each time. Following the
intuitions underlying the HMM approach, dynamical
graphical models are being proposed that are capable
of unfolding their structure across time, to better model
the dynamics of complex time-varying processes. At
the same time, concepts from nonparametric Bayesian
statistics are being used to develop models where latent
variables can be dynamically adjusted to sample from


Marco Signoretto, Johan A. K. Suykens


This chapter addresses the study of kernel meth- 
ods, a class of techniques that play a major role 
in machine learning and nonparametric statistics. 
Among others, these methods include support vec- 
tor machines (SVMs) and least squares SVMs, kernel 
principal component analysis, kernel Fisher dis- 
criminant analysis, and Gaussian processes. The 
use of kernel methods is systematic and prop- 
erly motivated by statistical principles. In practical 
applications, kernel methods lead to flexible pre- 
dictive models that often outperform competing 
approaches in terms of generalization perfor- 
mance. The core idea consists of mapping data 
into a high-dimensional space by means of a fea- 
ture map. Since the feature map is normally chosen 
to be nonlinear, a linear model in the feature space 
corresponds to a nonlinear rule in the original do- 
main. This fact suits many real world data analysis 
problems that often require nonlinear models to 
describe their structure. 

In Sect. 32.1 we present historical notes and 
summarize the main ingredients of kernel meth- 
ods. In Sect. 32.2 we present the core ideas of 
statistical learning and show how regularization 
can be employed to devise practical learning al- 
gorithms. In Sect. 32.3 we show a selection of 
techniques that are representative of a large class 
of kernel methods; these techniques — termed 
primal—dual methods — use Lagrange duality 
as the main mathematical tool. Section 32.4
discusses Gaussian processes, a class of kernel 
methods that uses a Bayesian approach to per- 
form inference and learning. Section 32.5 recalls 
different approaches for the tuning of parame- 
ters. In Sect. 32.6 we review the mathematical 
properties of different yet equivalent notions of 
kernels and recall a number of specialized kernels 
for learning problems involving structured data. 
We conclude the chapter by presenting applications
in Sect. 32.7.


32.1 Background


This chapter addresses the study of kernel methods, 
a class of techniques that play a major role in machine 
learning and nonparametric statistics. 

The development of kernel-based techniques [32.1, 
2] has been an important activity within machine 
learning in the last two decades. In this period, 
a number of powerful kernel-based learning algo- 
rithms were proposed. Among others, these methods 
include support vector machines (SVMs) and least 
squares SVMs, kernel principal component analysis, 
kernel Fisher discriminant analysis, and Gaussian pro- 
cesses. The use of kernel methods is systematic and 
properly motivated by statistical principles. In practi- 
cal applications, kernel methods lead to flexible pre- 
dictive models that often outperform competing ap- 
proaches in terms of generalization performance. The 
core idea consists of mapping data into a high-di- 
mensional space by means of a feature map. Since 
the feature map is normally chosen to be nonlin- 
ear, a linear model in the feature space corresponds 
to a nonlinear rule in the original domain. This 
fact suits many real world data analysis problems 
that often require nonlinear models to describe their 
structure. 


32.1.1 Summary of the Chapter 


In the rest of this section we present historical notes 
and summarize the main ingredients of kernel meth- 
ods. In Sect. 32.2 we present the core ideas of sta- 
tistical learning and show how regularization can be 
employed to devise practical learning algorithms. In 
Sect. 32.3 we show a selection of techniques that 
are representative of a large class of kernel meth- 
ods; these techniques — termed primal—dual methods — 
use Lagrange duality as the main mathematical tool.
Section 32.4 discusses Gaussian processes, a class 
of kernel methods that uses a Bayesian approach to 
perform inference and learning. Section 32.5 recalls 
different approaches for the tuning of parameters. In 
Sect. 32.6 we review the mathematical properties of 
different yet equivalent notions of kernels and recall 
a number of specialized kernels for learning prob- 
lems involving structured data. We conclude the chapter 
by presenting applications in Sect. 32.7. Additional
information can be found in a number of existing tutorials
on SVMs and kernel methods, including [32.3-7].


32.1.2 Historical Background 


The study of the mathematical foundation of kernels 
can be traced back at least to the beginning of the nine- 
teenth century in connection with a general theory of 
integral equations [32.8,9]. According to [32.10] the 
theory of reproducing kernel Hilbert spaces (RKHS) 
was first applied to detection and estimation prob- 
lems by Parzen [32.11]. Properties of (reproducing) 
kernels are thoroughly presented in [32.12]. A first 
systematic treatment in the domain of nonparametric 
statistics can be found in [32.13]. Modern mathematical
reviews include [32.14, 15]. The first use of kernels
in the context of machine learning is generally attributed
to [32.16]. The linear support vector algorithm,
which undoubtedly had a prominent role in the history
of kernel methods, made its first appearance in
Russia in the 1960s [32.17, 18], in the framework of
the statistical learning theory developed by Vapnik and
Chervonenkis [32.19, 20]. Later, the idea was developed
in connection to kernels by Vapnik and co-workers at
AT&T labs [32.21-24]. The novel approach was rooted
in a solid theoretical foundation. Additionally, studies
began to report state-of-the-art performances in a num- 
ber of applications, which further stimulated research 
on kernel-based techniques. 


32.1.3 The Main Ingredients 


Before delving into the details we now present the 
general setting for statistical learning problems and 
then briefly review the main ingredients of a sub- 
stantial part of kernel methods used in machine 
learning. 


Setting for Statistical Learning 
The setting of learning from examples comprises three 
components [32.25]: 


1. A generator of input data. We shall assume that data
can be represented as vectors of $\mathbb{R}^D$. These vectors
are independently and identically distributed (i.i.d.) 
according to a fixed but unknown probability distri- 
bution p(x). 

2. A supervisor that, given input data x, returns an out- 
put value y according to a conditional distribution 
p(y|x) also fixed and unknown. Note that the supervisor
might or might not be present.

3. A learning machine (or learning algorithm) able to
choose a hypothesis

$$f(x; \theta). \tag{32.1}$$

Note that the hypothesis $f$ is a function of $x$ and depends
upon a parameter vector $\theta$ belonging to a set
$\Theta$. The corresponding hypothesis space is then

$$S = \{f(x; \theta) : \theta \in \Theta\}, \tag{32.2}$$

which is one-to-one with the parameter space $\Theta$.


When the supervisor is present the learning problem 
is called supervised. The goal is to find that hypothesis 
that best mimics the supervisor response. When the su- 
pervisor is not present, the learning problem is called 
unsupervised. In this case, the aim is to find a hypothesis
that provides the best concise representation of
the data produced by the generator. 

In both cases we might be interested either in the 
whole domain or we might be concerned only with 
a specific subset of points. 


Feature Mapping and Kernel Trick 
Kernel methods are a special class of learning al- 
gorithms. Their main idea consists of mapping input 
points, generally represented as elements of R?, into 
an high-dimensional inner product space F, called the 
feature space. The mapping is performed by means of 
a feature map ¢ 


0) :R? >F, 
xe p(x). (32.3) 


One then approaches the learning task of interest by 
finding a linear model in the features space according 
to training points ¢ (x1), ..., (xv) € F. Since the fea- 
ture map is normally chosen to be nonlinear, a linear 
model in the feature space corresponds to a nonlinear 
rule in R?. Alternative kernel methods differ in the way 
the linear model in the feature space is found. Nonethe- 
less, a common feature across different techniques is 
the following. If the algorithm can be expressed solely 
in terms of inner products, one can restate the problem 
in terms of evaluations of a kernel function 


k:RP xR? SR, 
(x,y) = k(x, y), (32.4) 


by letting 


k(x, y) = o(x) 60). (32.5) 


This fact, usually referred to as the kernel trick, is of
particular interest for the cases where the feature space
is infinite dimensional, which prevents direct computation
in the feature space. In practice, one often starts
by designing a positive definite kernel, which guarantees
the existence of a feature map $\phi$ satisfying (32.5).
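
As a small illustration of the kernel trick, the sketch below checks numerically that a kernel evaluation in the input space agrees with an inner product in the feature space. The homogeneous quadratic kernel on $\mathbb{R}^2$ and its explicit feature map are illustrative choices, not taken from the text; the names `phi` and `k` are assumptions.

```python
import math

def phi(x):
    """Explicit feature map for k(x, y) = (x^T y)^2 with x in R^2 (illustrative choice)."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]

def dot(u, v):
    """Standard inner product of two equally long sequences."""
    return sum(a * b for a, b in zip(u, v))

def k(x, y):
    """Homogeneous polynomial kernel of degree 2, evaluated in the input space."""
    return dot(x, y) ** 2

x = [1.0, 2.0]
y = [3.0, -1.0]
lhs = k(x, y)              # kernel evaluation, no feature vectors needed
rhs = dot(phi(x), phi(y))  # inner product in the 3-dimensional feature space
```

Here the feature space is only three dimensional, so both sides can be computed; for kernels with an infinite-dimensional feature space only the left-hand side is available.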


Primal-Dual Estimation Techniques
As we shall see, an important class of machine learning
methods consists of primal-dual learning techniques
[32.1, 2, 26]. In this case, one starts from a primal
model representation of the type

$$f(x; w, b) = w^\top \phi(x) + b = \sum_i w_i \phi_i(x) + b. \tag{32.6}$$

With reference to (32.1), note that here we have the
tuple $\theta = (w, b)$. The primal problem is then a mathematical
optimization problem aimed at finding optimal
$w \in F$ and $b \in \mathbb{R}$. Notably, the right-hand side of (32.6)
is affine in $\phi(x)$; however, since $\phi$ is in general a nonlinear
mapping, $f$ is a nonlinear function of $x$.

A first approach consists of solving the primal
problem. The information content of the training data is
absorbed into the primal model’s parameters during
training; the evaluation of the model (32.6) on new
patterns (out-of-sample extension) no longer requires
the training data, which can therefore be discarded after
training.

A second approach relies on Lagrangian duality arguments.
In this case, the solution is represented in
terms of dual variables $\alpha_1, \alpha_2, \ldots, \alpha_N$, and the problem
is solved in $\alpha \in \mathbb{R}^N$ and $b \in \mathbb{R}$. The dual model
representation is then

$$f(x; \alpha, b) = \sum_{n=1}^{N} \alpha_n k(x_n, x) + b \tag{32.7}$$

and depends upon the training patterns $x_1, x_2, \ldots, x_N \in \mathbb{R}^D$.
The representation in (32.6) is usually called parametric,
while (32.7) is the nonparametric representation [32.26].
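
The link between (32.6) and (32.7) can be sketched numerically. Assuming the primal weights are the expansion $w = \sum_n \alpha_n \phi(x_n)$ (the relation that makes the two representations coincide when $k(x, z) = \phi(x)^\top \phi(z)$), both models return the same value. The kernel, the data and the values of `alpha` and `b` below are hypothetical.

```python
import math

def phi(x):
    """Explicit feature map of k(x, z) = (x^T z)^2 on R^2 (illustrative choice)."""
    return [x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def k(x, z):
    return dot(x, z) ** 2

X_train = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
alpha = [0.5, -0.25, 1.0]   # hypothetical dual variables
b = 0.1                     # hypothetical bias term

# Assumed link between the two representations: w = sum_n alpha_n * phi(x_n)
w = [sum(a * phi(xn)[i] for a, xn in zip(alpha, X_train)) for i in range(3)]

def f_primal(x):
    """Parametric model (32.6): needs only w and b."""
    return dot(w, phi(x)) + b

def f_dual(x):
    """Nonparametric model (32.7): needs the training patterns."""
    return sum(a * k(xn, x) for a, xn in zip(alpha, X_train)) + b

x_new = [0.5, -1.5]
```

Note that `f_primal` never touches `X_train` after `w` has been formed, whereas `f_dual` evaluates the kernel against every training pattern.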
32.2 Foundations of Statistical Learning


In this section, we briefly recall the main nomenclature
and give a basic introduction to statistical learning theory.
Historically, statistical learning theory constituted
the theoretical foundation upon which the main meth- 
ods of support vector machines were grounded. The 
theory is similar in spirit to a number of alternative 
complexity criteria and bias-variance trade-off curves. 
Nowadays, it remains a powerful framework for the de- 
sign of learning algorithms. 


32.2.1 Supervised and Unsupervised 
Inductive Learning 


We have already introduced the distinction between supervised
and unsupervised learning. Three important learning
tasks are found within this categorization: regression,
classification, and density estimation. In regression the
supervisor’s response takes values in the real numbers.
In classification the supervisor’s output takes values in
a discrete finite set of possible labels $Y$. In particular,
in the binary classification problem $Y$ consists of two
elements, e.g., $Y = \{-1, 1\}$. Density estimation is an instance
of unsupervised learning: there is no supervisor.
The functional relation to be learned from examples is
the probability density $p(x)$ (the generator). Supervised
and unsupervised learning are concerned with estimating
a function (an optimal hypothesis) over the whole
input domain $\mathbb{R}^D$ based upon a finite set of training
points. Therefore, they are inductive approaches aiming
at the general picture.


32.2.2 Semi-Supervised and Transductive 
Learning 


Semi-Supervised Inductive Learning 
In supervised learning the $N$ training data are i.i.d. pairs

$$\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\} \subset \mathbb{R}^D \times Y, \tag{32.8}$$

each of which is assumed to be drawn according to

$$p(x, y) = p(y \mid x)\, p(x). \tag{32.9}$$

There is yet another inductive approach, namely semi-supervised
learning. In semi-supervised learning one
has a set of labeled pairs (32.8), as in supervised learning,
as well as a set of unlabeled data

$$\{x_{N+1}, x_{N+2}, \ldots, x_{N+T}\} \subset \mathbb{R}^D \tag{32.10}$$

drawn i.i.d. from the generator $p(x)$, as in unsupervised learning.
The purpose is the same as in supervised learning:
to find an approximation of the supervisor response.
However, this goal is achieved by a learning algorithm
that takes into account the additional information coming
from the unlabeled data. According to [32.27],
semi-supervised learning was popularized for the first
time in the mid-1970s, although similar ideas appeared
earlier. Alternative semi-supervised learning algorithms
differ in the way they exploit the information from the
unlabeled set. One popular idea is to assume that the
(possibly high-dimensional) input data lie (roughly) on
a low-dimensional manifold [32.28-31].


Transductive Learning 

In induction one seeks the general picture with
the purpose of making out-of-sample predictions.
This is an ambitious goal that might be unmotivated
in certain settings. What if all the (unlabeled) data
are given in advance? Suppose that one is only interested
in prediction at given (finitely many) points. It is
expected that this less ambitious task results in simpler
inference problems. These ideas are reflected in
the approach found in transductive learning formulations.
As in semi-supervised learning, in transductive
learning one has training pairs (32.8) as well as test
(unlabeled) data (32.10). However, differently from
semi-supervised learning, one is only interested in making
predictions for the test data (32.10).


32.2.3 Bounds on Generalization Error 


Transductive and inductive inference share the common 
goal of achieving the lowest possible error on test data. 
In contrast with induction, transduction assumes that in- 
put test data are given in advance and consist of a finite 
discrete set of patterns drawn from the same distribution 
as the training set. From this perspective, it is clear that 
both transductive and inductive learning are concerned 
with generalization. In turn, a powerful framework to 
study the problem of generalization is the structural risk 
minimization (SRM) principle. 


Expected and Empirical Risk 
The starting point is the definition of a loss $L(y, f(x; \theta))$,
or discrepancy, between the response $y$ of the supervisor
to a given input $x$ and the response $f(x; \theta)$ of the
learning algorithm (which can be transductive or inductive).
Formally, the generalization error can be defined
as the expected risk

$$R(\theta) = \int L(y, f(x; \theta))\, p(x, y)\, \mathrm{d}x\, \mathrm{d}y. \tag{32.11}$$

From a mathematical perspective the goal of learning
is the minimization of this quantity. However, $p(x, y)$ is
unknown and one can rely only on the sample version
of (32.11), namely the empirical risk

$$R_{\mathrm{emp}}^{N}(\theta) = \frac{1}{N} \sum_{n=1}^{N} L(y_n, f(x_n; \theta)). \tag{32.12}$$

A possible learning approach is based on empirical risk
minimization (ERM) and encompasses maximum likelihood
(ML) inference [32.25]. It consists of finding

$$\hat{\theta}_N := \arg\min_{\theta \in \Theta} R_{\mathrm{emp}}^{N}(\theta). \tag{32.13}$$


Consistency 


Definition 32.1
The ERM approach is said to be consistent if

$$R_{\mathrm{emp}}^{N}(\hat{\theta}_N) \xrightarrow{N} \inf_{\theta \in \Theta} R(\theta), \qquad R(\hat{\theta}_N) \xrightarrow{N} \inf_{\theta \in \Theta} R(\theta),$$

where $\xrightarrow{N}$ denotes convergence in probability for
$N \to \infty$.


In words: the ERM is consistent if, as the number of
training patterns $N$ increases, both the expected risk
$R(\hat{\theta}_N)$ and the empirical risk $R_{\mathrm{emp}}^{N}(\hat{\theta}_N)$ converge to the
minimal possible risk $\min_{\theta \in \Theta} R(\theta)$; see Fig. 32.1 for
an illustration.


Fig. 32.1 Consistency of ERM 


It was shown in [32.32] that the necessary and sufficient
condition for consistency is that

$$P\left\{ \sup_{\theta \in \Theta} \left| R(\theta) - R_{\mathrm{emp}}^{N}(\theta) \right| \geq \epsilon \right\} \xrightarrow{N} 0, \qquad \forall \epsilon > 0. \tag{32.14}$$

In turn, the necessary and sufficient conditions
for (32.14) to hold true were established in 1968 by
Vapnik [32.33, 34] and are based on capacity factors.


Capacity Factors 

Consistency is one of the main theoretical questions in
statistics. From a learning perspective, however, it does
not address the most important aspect. The aspect that
one should be mostly concerned with is how to control
the generalization of a certain learning algorithm.
Whereas consistency is an asymptotic result, we want to
minimize the expected risk given that we have available
only finitely many observations to train the learning
algorithm. It turns out, however, that consistency is also
central to addressing this aspect [32.25]. Additionally,
a crucial role in answering this question is played by
capacity factors, which, roughly speaking, are all measures
of how well the set of functions $\{f(x; \theta) : \theta \in \Theta\}$ can
separate data. A more detailed description is given in
the following (precise definitions and formulas can be
found in [32.25, Chap. 2]). In general, the theory states
that without restricting the set of admissible functions,
the ERM is not consistent. The interested reader is referred
to [32.25, 35].


VC Entropy. The first capacity factor (here and below
VC is used as an abbreviation for Vapnik-Chervonenkis)
relates to the expected number of equivalence
classes according to which the training patterns divide
the set of functions $\{f(x; \theta) : \theta \in \Theta\}$ (an equivalence
class is a subset of $\{f(x; \theta) : \theta \in \Theta\}$ consisting of functions
that attribute the same labels to the input patterns in
the training set). We denote the VC entropy by $\mathrm{En}(p, N)$,
where the symbols emphasize the dependence of the VC
entropy on the underlying joint probability $p$ and the
number of training patterns $N$. The condition

$$\lim_{N \to \infty} \frac{\mathrm{En}(p, N)}{N} = 0$$

forms the necessary and sufficient condition for (32.14)
to hold true with respect to the fixed probability density $p$.


Growth Function. It corresponds to the maximal
number of equivalence classes with respect to all the
possible training samples of cardinality $N$. As such, it
is a distribution-independent version of the VC entropy
obtained via a worst-case approach. We denote it by
$\mathrm{Gr}(N)$. The condition

$$\lim_{N \to \infty} \frac{\ln \mathrm{Gr}(N)}{N} = 0$$

forms the necessary and sufficient condition for (32.14)
to hold true for all the probability densities $p$.


VC Dimension. This is the cardinality of the largest
set of points that the algorithm can shatter; we denote
it by $\dim_{\mathrm{VC}}$. Note that $\dim_{\mathrm{VC}}$ is a property of
$\{f(x; \theta) : \theta \in \Theta\}$, which depends neither on $N$ nor on $p$.
Roughly speaking, it tells how flexible the set of functions
is. A finite value of $\dim_{\mathrm{VC}}$ forms the necessary
and sufficient condition for (32.14) to hold true for all
the probability densities $p$.
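
The notions of equivalence classes and shattering can be made concrete on a toy family. The sketch below enumerates the labelings that one-dimensional threshold classifiers $f(x; s, t) = s \cdot \mathrm{sign}(x - t)$, with $s \in \{-1, 1\}$, can realize on a given set of points. The family and the function names are illustrative assumptions, not part of the text.

```python
from itertools import product

def dichotomies(points):
    """Distinct labelings of `points` realized by f(x; s, t) = s * sign(x - t)."""
    pts = sorted(points)
    # One threshold below all points, one between each consecutive pair, one above all.
    cuts = [pts[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1.0]
    labelings = set()
    for s, t in product((-1, 1), cuts):
        labelings.add(tuple(s * (1 if x > t else -1) for x in points))
    return labelings

def is_shattered(points):
    """True if every one of the 2^N labelings is realized."""
    return len(dichotomies(points)) == 2 ** len(points)

two = [0.0, 1.0]
three = [0.0, 1.0, 2.0]
n_two = len(dichotomies(two))      # all 4 labelings are realized
n_three = len(dichotomies(three))  # only 6 of the 8 labelings are realized
```

Two points are shattered, while for three distinct points the alternating labelings cannot be produced by a single threshold. Since only the ordering of the points matters for this family, this suggests a VC dimension of 2 and a growth function of $2N$ for $N \geq 2$.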

The three capacities are related by the chain of inequalities [32.33, 34]

$$\mathrm{En}(p, N) \leq \ln \mathrm{Gr}(N) \leq \dim_{\mathrm{VC}} \left( \ln \frac{N}{\dim_{\mathrm{VC}}} + 1 \right). \tag{32.15}$$


Finite-Sample Bounds 
One of the key results of the theory developed by Vapnik
and Chervonenkis is the following probabilistic bound.
With probability $1 - \eta$, simultaneously for all $\theta \in \Theta$, it
holds that [32.25]

$$R(\theta) \leq R_{\mathrm{emp}}^{N}(\theta) + \sqrt{\frac{\mathrm{En}(p, 2N) - \ln \eta}{N}}. \tag{32.16}$$

Note that the latter depends on $p$. The result says that,
for a fixed set of functions $\{f(x; \theta) : \theta \in \Theta\}$, one can
pick the $\theta \in \Theta$ that minimizes $R_{\mathrm{emp}}^{N}(\theta)$ and in this way
obtain the best guarantee on $R(\theta)$. Now, taking into account
(32.15), one can formulate the following bound
based on the growth function

$$R(\theta) \leq R_{\mathrm{emp}}^{N}(\theta) + \sqrt{\frac{\ln \mathrm{Gr}(2N) - \ln \eta}{N}}. \tag{32.17}$$

In the same way one has

$$R(\theta) \leq R_{\mathrm{emp}}^{N}(\theta) + \sqrt{\frac{\dim_{\mathrm{VC}} \left( \ln \frac{2N}{\dim_{\mathrm{VC}}} + 1 \right) - \ln \eta}{N}}. \tag{32.18}$$

Figure 32.2 illustrates the main idea.


Note that both (32.17) and (32.18) are distribution-independent.
Additionally, (32.18) only depends upon
the VC dimension (which, contrary to $\mathrm{Gr}$, is independent
of $N$). Unfortunately there is no free
lunch: (32.17) is less tight than (32.16) and (32.18) is
less tight than (32.17).
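
To get a feeling for the orders of magnitude involved, the following sketch evaluates the capacity term of (32.18), read here as $\sqrt{(\dim_{\mathrm{VC}}(\ln(2N/\dim_{\mathrm{VC}}) + 1) - \ln \eta)/N}$. The function names and the numerical values are hypothetical.

```python
import math

def vc_capacity_term(n_samples, vc_dim, eta):
    """Second term on the right-hand side of (32.18), as read above."""
    if not 0.0 < eta < 1.0:
        raise ValueError("eta must lie in (0, 1)")
    if n_samples <= 0 or vc_dim <= 0:
        raise ValueError("n_samples and vc_dim must be positive")
    capacity = vc_dim * (math.log(2.0 * n_samples / vc_dim) + 1.0)
    return math.sqrt((capacity - math.log(eta)) / n_samples)

def vc_bound(emp_risk, n_samples, vc_dim, eta=0.05):
    """Upper bound on the expected risk, holding with probability 1 - eta."""
    return emp_risk + vc_capacity_term(n_samples, vc_dim, eta)

term_small_sample = vc_capacity_term(1_000, 10, 0.05)
term_large_sample = vc_capacity_term(100_000, 10, 0.05)
term_rich_class = vc_capacity_term(1_000, 50, 0.05)
```

The term shrinks as $N$ grows and increases with the VC dimension, which is the trade-off sketched in Fig. 32.2. For moderate sample sizes the term is often comparable to, or larger than, the empirical risk itself.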

So far we gave a flavor of the theoretical framework 
in which the support vector algorithms were originally 
conceived. Recent research reinterpreted and signifi- 
cantly improved the error bounds using mathematical 
tools from approximation and learning theory, func- 
tional analysis, and statistics. The interested reader is 
referred to [32.36] and [32.37]. Although tighter bounds 
exist, the study of sharper bounds remains a challenge 
for future research. In fact, existing bounds are nor- 
mally too loose to lead to practical model selection 
techniques, i.e., strategies for tuning the parameters 
that control the capacity of the model’s class. Nonethe- 
less, the theory provides important guidelines for the 
derivation of algorithms. 


The Role of Transduction 
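It turns out that a key step in obtaining the
bound (32.16) is based upon the symmetrization lemma

$$P\left\{ \sup_{\theta} \left| R(\theta) - R_{\mathrm{emp}}^{N}(\theta) \right| > \epsilon \right\} \leq 2 P\left\{ \sup_{\theta} \left| R_{\mathrm{emp}}^{\prime N}(\theta) - R_{\mathrm{emp}}^{\prime\prime N}(\theta) \right| > \frac{\epsilon}{2} \right\}, \tag{32.19}$$

where $R_{\mathrm{emp}}^{\prime N}$ and $R_{\mathrm{emp}}^{\prime\prime N}$ are constructed upon two different
i.i.d. samples, precisely as in transduction. More
specifically, (32.16) comes from upper-bounding the
right-hand side of (32.19) [32.38]. More generally, it is
apparent that to obtain all bounds of this type the key
element remains the symmetrization lemma [32.25].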


Fig. 32.2 Illustration of the generalization bound depending
on capacity factors (error versus VC dimension; curves:
bound on test error, training error)
Notably, starting from the latter, one can derive bounds
explicitly designed for the transductive case, where one
of the two samples plays the role of the training set
and the other that of the test set. In light of this, Vapnik argues
that transductive inference is a fundamental step
in machine learning. Additionally, since the bounds for
transduction are tighter than those for induction, the
theory suggests that, whenever possible, transductive
inference should be preferred over inductive inference.
Practical algorithms can take advantage of this fact by
implementing the adaptive version of the structural risk
minimization (SRM) principle that we discuss next.


32.2.4 Structural Risk Minimization 
and Regularization 


The Structural Risk Minimization Principle 
The structure of the bounds above suggests that one
should minimize the empirical risk while controlling
some measure of complexity. The idea behind the SRM,
introduced in the 1970s, is to construct nested subsets of
functions

$$S_1 \subset S_2 \subset \cdots \subset S_L = S = \{ f(x; \theta) : \theta \in \Theta \}, \tag{32.20}$$

where each subset $S_l$ has capacity $h_l$ (VC entropy,
growth function, or VC dimension) with $h_1 \leq h_2 \leq
\cdots \leq h_L$ and $S$ is the entire hypothesis space. Then one
chooses an element of the nested subsets so that the second
term in the right-hand side of the bounds is kept
under control; within that subset one then picks the
specific function that minimizes the empirical risk. As
Vapnik points out in [32.38]:


[...] to find a good solution using a finite (limited) 
number of training examples one has to construct 
a (smart) structure which reflects prior knowledge 
about the problem of interest. 


In practice one can use the information coming from the 
unlabeled data to define a smart structure to improve the 
learning. In other words, the side information coming 
from unlabeled data can serve the purpose of devising 
a data-dependent set of functions. On top of this, one 
should use additional side information over the struc- 
ture of the problem, whenever available. Indeed, using 
informative representations for the input data is also 
a way to construct a smart set of functions. In fact, representing
the data in a suitable form implies a mapping
from the input space to a more convenient set of features.
We will discuss this aspect more extensively in
Sect. 32.6.


Learning Through Regularization 

So far we have addressed the theory but we have not
discussed how to implement it in practice. The
essential idea of SRM is to find the best
trade-off between the empirical risk and some measure
of complexity (the capacity) of the hypothesis space.
This ensures that the left-hand side of the VC bounds
(the expected risk, which we want to keep small to achieve
generalization) is minimized. In practice there are different
ways to define the sets in the sequence (32.20).
The generic set $S_l$ could be the set of polynomials of
degree $l$ or a set of splines with $l$ nodes. However, it is
in connection with regularization theory that practical
implementations of the SRM principle find their natural
domain.


Tikhonov Theory 
Regularization theory was introduced by Andrey
Tikhonov [32.39-41] as a way to solve ill-posed problems.
Ill-posed problems are problems that are not well
posed in the sense of Hadamard [32.42]. Consider solving
in $f$ a linear operator equation of the type

$$Af = b. \tag{32.21}$$

In the general case, $f$ is an element of a Hilbert space,
$A$ is a compact operator, and $b$ is an element of its range.
Even if a solution exists, it is often observed that a slight
perturbation of the right-hand side $b$ causes large deviations
in the solution $f$. Tikhonov proposed to solve this
problem by minimizing a functional of the type

$$\|Af - b\|^2 + \lambda \Gamma(f),$$

where $\|\cdot\|$ is a suitable norm on the range of $A$, $\lambda$ is
some hyperparameter, and $\Gamma$ is a regularization functional
(sometimes called stabilizer). The theory of such
an approach was developed by Tikhonov and Ivanov;
in particular, it was shown that there exists a strategy to
choose $\lambda$ depending on the accuracy of $b$ that asymptotically
leads to the desired solution $f^\star$. This was shown
under the assumption that there exists $c^\star$ such that
$\Gamma(f^\star) \leq c^\star$.
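
A finite-dimensional sketch of this effect follows. It assumes the stabilizer $\Gamma(f) = \|f\|^2$ and uses a Hilbert matrix as a stand-in for an ill-conditioned operator $A$; both choices, as well as the noise level and the value of `lam`, are illustrative assumptions rather than part of the theory above.

```python
import numpy as np

def tikhonov_solve(A, b, lam):
    """Minimize ||A f - b||^2 + lam * ||f||^2 via the regularized normal equations."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)

# Hilbert matrix: a classical ill-conditioned example (illustrative choice).
n = 8
A = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])
f_true = np.ones(n)
b = A @ f_true

rng = np.random.default_rng(0)
b_noisy = b + 1e-6 * rng.standard_normal(n)    # slight perturbation of b

f_naive = np.linalg.solve(A, b_noisy)          # unregularized solution
f_reg = tikhonov_solve(A, b_noisy, lam=1e-8)   # regularized solution

err_naive = np.linalg.norm(f_naive - f_true)
err_reg = np.linalg.norm(f_reg - f_true)
```

The perturbation of `b` is tiny, yet the unregularized solution is expected to deviate strongly from `f_true`, whereas the penalized solution stays close to it. How `lam` should shrink with the noise level is exactly the question addressed by the theory.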

According to Vapnik [32.20], the theory of
Tikhonov regularization differs from statistical learning
theory in a number of ways. To begin with, Tikhonov
regularization considers specific structures in the nested
sequence (32.20) (depending on the way $\Gamma$ is defined);
secondly, it requires the solution to be in the hypothesis
space; finally, the theory developed by Tikhonov and
Ivanov was not concerned with guarantees for a finite
number of observations.

When $f$ is an element of a reproducing kernel
Hilbert space (RKHS) (Sect. 32.6), the theory is best
known through the work of Wahba [32.13, 43].


SRM and Regularization in RKHSs 
SVMs, and more generally primal-dual learning algorithms,
represent an important class of kernel methods.
The primal-dual approach emphasizes the geometrical
aspects of the problem and it is particularly insightful
when (32.7) is used to define a discriminative rule
arising in a classification problem. We will consider
this class of learning algorithms in later sections. Alternatively,
the setting of RKHSs provides a convenient
way to define the sequence (32.20). When the hypothesis
space $S$ coincides with an RKHS of functions $H$,
a nested sequence can be constructed by bounding the
norm in $H$, used as a proxy for the complexity of
models

$$S_l = \{ f \in H : \|f\| \leq a_l \}. \tag{32.22}$$

It turns out that there is a measure of capacity of $S_l$,
which is an increasing function of $a_l$ [32.44]. This
capacity measure can be used to derive probabilistic
bounds in line with (32.16), (32.17), and (32.18). In
practice, instead of solving the constrained problem

$$\min_{f \in H} R_{\mathrm{emp}}^{N}(f) \quad \text{subject to} \quad \|f\| \leq a_l \tag{32.23}$$

for any $l$, one normally solves the provably equivalent
penalized problem

$$\min_{f \in H} R_{\mathrm{emp}}^{N}(f) + \lambda_l \|f\|^2 \tag{32.24}$$

and picks the optimal $\lambda_l$ appropriately. Note that
in (32.23) and (32.24) we wrote $R_{\mathrm{emp}}^{N}(f)$ instead of
$R_{\mathrm{emp}}^{N}(\theta)$, as before. In fact, in this case, the solution of
the learning problem is found by formulating a convex
variational problem where the function $f$ itself plays the
role of the optimization variable $\theta$. In practice, however,
the representer theorem [32.13, 43, 45, 46] shows
that a representation of the optimal $f$ only depends upon
an expansion of kernel functions centered at the training
patterns. This result leads to a representation for $f$ in
line with (32.7). More specifically, it holds that
$f(x) = \sum_{n=1}^{N} \alpha_n k(x_n, x)$, where $\alpha \in \mathbb{R}^N$ is found by solving a finite
dimensional optimization problem. The latter is convex
[32.47] provided that $L$ in the empirical risk (32.12)
is a convex loss function.
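
For the squared loss, the finite dimensional problem has a closed-form solution, which the sketch below uses. It assumes the objective $\frac{1}{N}\|y - K\alpha\|^2 + \lambda\, \alpha^\top K \alpha$, with $K$ the kernel matrix on the training patterns, for which $\alpha = (K + N\lambda I)^{-1} y$ is a minimizer. The Gaussian kernel, the toy data and all function names are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    """Gaussian RBF kernel matrix between the rows of X and the rows of Z."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def fit_alpha(K, y, lam):
    """Coefficients minimizing (1/N) * ||y - K a||^2 + lam * a^T K a."""
    N = K.shape[0]
    return np.linalg.solve(K + N * lam * np.eye(N), y)

def predict(alpha, X_train, X_new, sigma=1.0):
    """Evaluate f(x) = sum_n alpha_n k(x_n, x) at the rows of X_new."""
    return gaussian_kernel(X_new, X_train, sigma) @ alpha

X = np.linspace(0.0, 1.0, 20).reshape(-1, 1)   # toy one-dimensional inputs
y = np.sin(2.0 * np.pi * X[:, 0])              # toy targets
lam = 1e-6
K = gaussian_kernel(X, X, sigma=0.2)
alpha = fit_alpha(K, y, lam)
y_fit = predict(alpha, X, X, sigma=0.2)
```

Although the RKHS may be infinite dimensional, only the $N$ coefficients in `alpha` are estimated, and predictions at new points require the training inputs, in line with the nonparametric representation (32.7).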


Abstract Penalized Empirical Risk Minimization Problems
The penalized empirical risk minimization problem was
introduced in (32.24) in the setting of an RKHS of functions.
However, it should be noted that it is a very general
idea. Ultimately this can be related to the generality
of (32.21). The latter can either refer to infinite
dimensional problems or to a finite system of linear
equations involving objects living in some finite dimensional
space. Therefore, for the sake of generality, one
can consider in place of (32.24) the problem

$$\min_{\theta \in \Theta} R_{\mathrm{emp}}^{N}(\theta) + \lambda \Gamma(\theta), \tag{32.25}$$

where $\Theta$ (which is one-to-one with the hypothesis
space) either coincides with some abstract vector
space or is a subset of it; $R_{\mathrm{emp}}^{N}$ is the empirical risk
and $\Gamma : \Theta \to \mathbb{R}$ is a suitable penalty function. This, in
particular, includes the situations where $\theta$ is a vector,
a matrix, or a higher-order tensor (i.e., a higher-order
array generalizing the notion of matrices).


32.2.5 Types of Regularization 


A penalty frequently used in practice is $\Gamma(\theta) = \|\theta\|^2$,
where $\|\theta\|$ is the Hilbertian norm defined upon the
space’s inner product

$$\|\theta\|^2 = \langle \theta, \theta \rangle. \tag{32.26}$$

This choice leads to ridge regression [32.48, 49].
Note that, in this case, $\|\theta\| = 0$ if and only if $\theta$ is
the zero vector of the space. A more general class
of quadratic penalties is represented by seminorms.
A seminorm is allowed to assign zero length to some
nonzero vectors (in addition to the zero vector). Seminorms
are commonly used in smoothing splines [32.13, 50],
where the unknown is decomposed into an unpenalized
parametric component and a penalized nonparametric
part.


LASSO and Non-Hilbertian Norms 
The methods that we present in the next sections are 
all instances of the problem class in (32.25); although 
this is not necessarily emphasized in the presentation, 
they all employ a simple quadratic penalty. This is
central to relying on Lagrange duality theory [32.47,
51], which, in turn, constitutes the main technical tool 
for the derivation of a large class of kernel methods. 
However, it is important to mention that in the last 
decade much research effort has been expended on the 
design of alternative penalties (correspondingly, there 
has been increased interest in other notions of du- 
ality, such as Fenchel duality). This arises from the 
realization that using a certain penalty is also a way 
to convey prior knowledge. This fact is best under- 
stood within a Bayesian setting, in light of a maxi- 
mum a posteriori (MAP) interpretation of (32.25), see, 
e.g., [32.44]. Penalty terms based on the space’s
inner product have been replaced with various types
of non-Hilbertian norms. These are norms that, contrary
to (32.26), do not arise from inner products.
LASSO (least absolute shrinkage and selection operator,
[32.52]) is perhaps the most prominent example
of such cases. In LASSO one considers linear functions

$$f(x; \theta) = \langle \theta, x \rangle = \sum_{d=1}^{D} \theta_d x_d$$

and uses the $l_1$ norm

$$\|\theta\|_1 = \sum_{d=1}^{D} |\theta_d| \tag{32.27}$$

to promote the sparsity of the parameter vector $\theta$. Note
that this corresponds to defining the structure (32.20)
according to

$$S_l = \{ f(x; \theta) : \|\theta\|_1 \leq a_l \}. \tag{32.28}$$

Like ridge regression, LASSO is a continuous shrinkage
method that achieves good prediction performance
via a bias-variance trade-off. Since the estimated
coefficient vector usually has many entries equal to zero,
the approach has the further advantage over ridge regression
of giving rise to interpretable models.
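
The different behavior of the two penalties is easiest to see in a special case. Assuming a design matrix $X$ with orthonormal columns, squared loss, and writing $z = X^\top y$, the problems $\frac{1}{2}\|y - X\theta\|^2 + \lambda\|\theta\|_1$ and $\frac{1}{2}\|y - X\theta\|^2 + \frac{\lambda}{2}\|\theta\|^2$ decouple across coordinates and have the closed-form solutions sketched below. The values of `z` and `lam` are hypothetical.

```python
import numpy as np

def soft_threshold(z, lam):
    """Entrywise minimizer of 0.5 * (z - t)^2 + lam * |t| (LASSO, orthonormal design)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ridge_shrink(z, lam):
    """Entrywise minimizer of 0.5 * (z - t)^2 + 0.5 * lam * t^2 (ridge, orthonormal design)."""
    return z / (1.0 + lam)

z = np.array([3.0, -0.4, 0.2, -2.5, 0.05])   # hypothetical values of X^T y
lam = 0.5
theta_lasso = soft_threshold(z, lam)   # small entries are set exactly to zero
theta_ridge = ridge_shrink(z, lam)     # all entries are scaled, none vanish
```

Both estimators shrink the coefficients, but only the $l_1$ penalty sets the entries with $|z_d| \leq \lambda$ exactly to zero, which is the source of the interpretability mentioned above.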

More recently, different structure-inducing penalties
have been proposed as a promising alternative
[32.53-55]. The general idea is to convey structural
assumptions on the problem, such as grouping or hierarchies
over the set of input variables, by suitably crafting
the penalty. In this way, the users are permitted to customize
the regularization approach according to their
subjective knowledge of the task. Correspondingly, as
in (32.28), one (implicitly) forms a smart structure of
nested subsets of functions, in agreement with the SRM
principle.

These ideas have been generalized to the case where
$\Theta$ is infinite dimensional, in particular in the framework
of the multiple kernel learning (MKL) problem.
This was investigated both from a functional viewpoint
[32.56, 57] and from the more pragmatic point
of view of optimization [32.58-60].


Spectral Regularization 
Yet another generalization of (32.25) arises in the
context of multitask learning [32.61-63]. In this setting
one approaches simultaneously different learning
tasks under some common constraint(s). The general
idea, sometimes also known as collaborative filtering,
is that one can take advantage of shared features
across tasks. In practical applications it was shown that
one can significantly gain in terms of generalization
performance from exploiting such prior knowledge.
From the point of view of learning through regularization,
a sensible approach is given in [32.64, 65].
Suppose one has $T$ datasets, one for each task; the $t$-th
dataset has $N_t$ observations. Note that, in general,
$N_1 \neq N_2 \neq \cdots \neq N_T$. In this setting one has to learn
vectors $\theta_t$, $t = 1, 2, \ldots, T$, one per task; the parameter
space is, therefore, a space of matrices, $\Theta = \mathbb{R}^{F \times T}$,
where $F$ is possibly infinite. The idea translates into
penalized empirical risk minimization problems of the
type

$$\min_{\theta \in \Theta} \sum_{t=1}^{T} R_{\mathrm{emp}}^{N_t}(\theta_t) + \lambda \|\theta\|_*, \tag{32.29}$$

where $\|\theta\|_*$ is the nuclear norm

$$\|\theta\|_* = \sum_{r=1}^{R} \sigma_r(\theta) \tag{32.30}$$

and $\sigma_1(\theta), \sigma_2(\theta), \ldots, \sigma_R(\theta)$ are the $R \leq \min(T, F)$
nonzero singular values of the $F \times T$ matrix $\theta$. Note
that (32.30) corresponds to the $l_1$ norm of the vector
of singular values. The definition also remains valid
in the infinite dimensional case under some regularity
assumptions [32.66]. The nuclear norm is the convex
envelope of the rank function on the spectral-norm unit
ball [32.67]; roughly speaking, it represents the best
convex relaxation of the rank function.
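
As a small numerical illustration of (32.30), the sketch below builds a parameter matrix whose columns (one per task) lie in a two-dimensional subspace and computes its nuclear norm from the singular values. The sizes, the random seed and the rank tolerance are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
F, T, rank = 6, 4, 2
# Parameter matrix whose T columns (one per task) span only a 2-dimensional subspace.
theta = rng.standard_normal((F, rank)) @ rng.standard_normal((rank, T))

sigma = np.linalg.svd(theta, compute_uv=False)   # singular values in descending order
nuclear = sigma.sum()                            # l1 norm of the vector of singular values
numerical_rank = int(np.sum(sigma > 1e-10 * sigma[0]))
nuclear_numpy = np.linalg.norm(theta, ord="nuc") # NumPy's built-in nuclear norm
```

Only two singular values are numerically nonzero, so the sum in (32.30) has $R = 2$ terms here. Penalizing this sum favors parameter matrices of this kind, that is, task vectors that are (approximately) linearly dependent.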

The use of the nuclear norm in (32.29) is motivated
by the assumption that the parameter vectors of
related tasks should be approximately linearly dependent.
This assumption is meaningful for a number of
cases of interest. Other uses of the nuclear norm exist;
ultimately, this is due to the fact that notions of rank
are ubiquitous in the mathematical formulations stemming
from real-life problems. As a consequence, the
nuclear norm is a very versatile mathematical tool to
impose structure on (seemingly) very diverse settings.
This includes the identification of linear time-invariant
systems [32.68, 69] and the analysis of nonstationary
cointegrated systems [32.70]. Finally we mention
that, in place of (32.30), one can consider spectral
penalties [32.71, 72] that include the nuclear norm as
a special case.


32.3 Primal-Dual Methods


The purpose of this section is to introduce the methods
that have served as the archetypal approaches for a large
class of kernel methods. In the process, we detail the
Lagrange duality argument underlying general primal-dual
techniques. We begin by giving a short overview of
the formulations of SVMs introduced by Vapnik;
subsequently, we discuss a number of modifications and
extensions.


32.3.1 SVMs for Classification


Margin
The problem of pattern recognition amounts to finding
the label $y \in \{-1, 1\}$ that corresponds to a generic input
point $x \in \mathbb{R}^D$. This task can be approached by assigning
a label $\hat{y}$ according to the model

$$\hat{y} = \mathrm{sign}\left[ w^\top \phi(x) + b \right], \tag{32.31}$$

where the hyperplane $w^\top \phi(x) + b = 0$ is found by a learning
algorithm based on training data

$$\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\} \subset \mathbb{R}^D \times \{-1, 1\}. \tag{32.32}$$

The concept of feature map $\phi$ was briefly presented
in Sect. 32.1.3. Later we will discuss the role of $\phi$ in
more detail. For now, it suffices to say that $\phi$ is expected
to capture features that are important for the discrimination
of points. In the simplest case, $\phi$ is the identity
map, i.e., $\phi : x \mapsto x$. Note that $w^\top \phi(x) + b$ is a primal
model of the type (32.6), with $w \in F$ and $b \in \mathbb{R}$.

In general, one can see that there are several possible
separating hyperplanes, see Fig. 32.3.


Fig. 32.3 Several possible separating hyperplanes exist

The solution picked by the support vector classification
(SVC) algorithm is the one that separates the
data with the maximal margin. More precisely, Vapnik
considered a rescaling of the problem so that points
closest to the separating hyperplane satisfy the normalizing
condition

$$\left| w^\top \phi(x) + b \right| = 1. \tag{32.33}$$

The two hyperplanes $w^\top \phi(x) + b = 1$ and $w^\top \phi(x) +
b = -1$ are called canonical hyperplanes, and the
distance between them is called the margin, see
Fig. 32.4.


Fig. 32.4 SVC finds the solution that maximizes the
margin
Assuming that the classification problem is separable,
i.e., that there exists at least one hyperplane separating
the training data (32.32), one obtains a canonical
representation $(w, b)$ satisfying

$$y_n \left( w^\top \phi(x_n) + b \right) \geq 1, \qquad n = 1, \ldots, N. \tag{32.34}$$

Let us assume, without loss of generality, that $y_1 = 1$
and $y_2 = -1$. If the corresponding patterns $x_1$ and $x_2$ are
among the closest points to the separating hyperplane,
the scaling imposed by Vapnik implies

$$w^\top \phi(x_1) + b = 1, \qquad w^\top \phi(x_2) + b = -1, \tag{32.35}$$

which, in turn, leads to

$$w^\top \left( \phi(x_1) - \phi(x_2) \right) = 2. \tag{32.36}$$

Now the unit normal vector to the separating hyperplane
$w^\top \phi(x) + b = 0$ is $(1/\|w\|) w$. The margin is equal to the
component of the vector $\phi(x_1) - \phi(x_2)$ along $(1/\|w\|) w$,
i.e., the projection

$$\frac{1}{\|w\|} w^\top \left( \phi(x_1) - \phi(x_2) \right).$$

Using (32.36) one obtains that the margin is equal
to $2/\|w\|$; correspondingly, the distance between the
points satisfying (32.33) and the separating hyperplane
is $1/\|w\|$. By minimizing $\|w\|$, subject to the set of constraints
(32.34), one obtains a maximal margin classifier
that maximizes the margin between the two classes.
This hyperplane, in turn, can be naturally envisioned as
the simplest solution given the observed data.
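
The geometric argument can be checked on a small example. The sketch below takes $\phi$ to be the identity map on $\mathbb{R}^2$ and a hypothetical pair $(w, b)$ for which the two chosen points lie on the canonical hyperplanes; all numerical values are illustrative.

```python
import math

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

# Hypothetical canonical hyperplane in R^2, with phi the identity map.
w = [2.0, 0.0]
b = -1.0
x1 = [1.0, 3.0]    # label +1, lies on w^T x + b = +1
x2 = [0.0, -2.0]   # label -1, lies on w^T x + b = -1

norm_w = math.sqrt(dot(w, w))
diff = [a - c for a, c in zip(x1, x2)]
margin_projection = dot(w, diff) / norm_w         # projection of x1 - x2 on the unit normal
margin_formula = 2.0 / norm_w                     # margin according to the text
distance_x1 = abs(dot(w, x1) + b) / norm_w        # distance of x1 from the separating hyperplane
```

Although `x1` and `x2` are far apart along the second coordinate, only their separation along the normal direction counts, and it equals $2/\|w\|$.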


Primal Problem 
In practice, for computational reasons it is more convenient
to minimize $\frac{1}{2}\|w\|^2 = \frac{1}{2} w^\top w$ rather than $\|w\|$.
Additionally, it is in general unrealistic to assume that
the classification problem is separable. In practical applications,
one should try to find a set of features (in
fact, a feature mapping from the input domain to a more
convenient representation) that allows the two classes
to be separated as much as possible. Nonetheless, there
might be no boundary that can perfectly separate the
data; therefore one should tolerate misclassifications.
Taking this requirement into account leads to the primal
problem for the SVC algorithm [32.24]. This is the
quadratic programming (QP) problem

$$\begin{aligned}
\min_{w, b, \xi} \quad & J_P(w, \xi) = \frac{1}{2} w^\top w + c \sum_{n=1}^{N} \xi_n \\
\text{subject to} \quad & y_n \left( w^\top \phi(x_n) + b \right) \geq 1 - \xi_n, \qquad n = 1, \ldots, N, \\
& \xi_n \geq 0, \qquad n = 1, \ldots, N,
\end{aligned} \tag{32.37}$$

where $c > 0$ is a user-defined parameter. In this problem,
one accounts for misclassifications by replacing
the set of constraints (32.34) with the set of constraints

$$y_n \left( w^\top \phi(x_n) + b \right) \geq 1 - \xi_n, \qquad n = 1, \ldots, N, \tag{32.38}$$

where $\xi_1, \xi_2, \ldots, \xi_N$ are nonnegative slack variables. It is
clear that for higher values of $c$ one penalizes more the
violations of the conditions in (32.34).


Dual Problem 
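The Lagrangian corresponding to (32.37) is

$$\mathcal{L}(w, b, \xi; \alpha, \nu) = J_P(w, \xi) - \sum_{n=1}^{N} \alpha_n \left( y_n \left( w^\top \phi(x_n) + b \right) - 1 + \xi_n \right) - \sum_{n=1}^{N} \nu_n \xi_n, \tag{32.39}$$

with Lagrangian multipliers $\alpha_n \geq 0$, $\nu_n \geq 0$ for $n =
1, \ldots, N$. The solution is given by the saddle point of
the Lagrangian

$$\max_{\alpha, \nu} \min_{w, b, \xi} \mathcal{L}(w, b, \xi; \alpha, \nu). \tag{32.40}$$

One obtains

$$\begin{aligned}
\frac{\partial \mathcal{L}}{\partial w} = 0 \;&\rightarrow\; w = \sum_{n=1}^{N} \alpha_n y_n \phi(x_n), \\
\frac{\partial \mathcal{L}}{\partial b} = 0 \;&\rightarrow\; \sum_{n=1}^{N} \alpha_n y_n = 0, \\
\frac{\partial \mathcal{L}}{\partial \xi_n} = 0 \;&\rightarrow\; 0 \leq \alpha_n \leq c, \qquad n = 1, \ldots, N.
\end{aligned} \tag{32.41}$$

The dual problem is then the QP problem

$$\begin{aligned}
\max_{\alpha} \quad & J_D(\alpha) \\
\text{subject to} \quad & \sum_{n=1}^{N} \alpha_n y_n = 0, \\
& 0 \leq \alpha_n \leq c, \qquad n = 1, \ldots, N,
\end{aligned} \tag{32.42}$$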
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where 


N N 
1 
Jp(a) = 3 S YmYnk(Xm, Xp )Um On T > An , 


mn=1 n=1 
(32.43) 
and we used the kernel trick 
km Xn) = (Xm)! Ean) a m,n = Wi sisctund Vs 
(32.44) 


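The classifier based on the dual model representation is

$$
\mathrm{sign}\left( \sum_{n=1}^{N} \alpha_n y_n k(x, x_n) + b \right), \tag{32.45}
$$

where $\alpha_n$ are nonnegative real numbers obtained by
solving (32.42) and $b$ is obtained based upon the
Karush-Kuhn-Tucker (KKT) optimality conditions, i.e., the
set of conditions that must be satisfied at the optimum of
a constrained optimization problem. These are

$$
\begin{aligned}
&\frac{\partial \mathcal{L}}{\partial w} = 0 \ \rightarrow\ w = \sum_{n=1}^{N} \alpha_n y_n \phi(x_n), \\
&\frac{\partial \mathcal{L}}{\partial b} = 0 \ \rightarrow\ \sum_{n=1}^{N} \alpha_n y_n = 0, \\
&\frac{\partial \mathcal{L}}{\partial \xi_n} = 0 \ \rightarrow\ c - \alpha_n - \nu_n = 0, \quad n = 1,\dots,N, \\
&\alpha_n \left( y_n\left(w^\top \phi(x_n) + b\right) - 1 + \xi_n \right) = 0, \quad n = 1,\dots,N, \\
&\nu_n \xi_n = 0, \quad n = 1,\dots,N, \\
&\alpha_n \ge 0, \quad n = 1,\dots,N, \\
&\nu_n \ge 0, \quad n = 1,\dots,N.
\end{aligned}
\tag{32.46}
$$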


From these equations it can be seen that, at optimum, 
we have 


$$
y_n\left(w^\top \phi(x_n) + b\right) - 1 = 0 \quad \text{if } 0 < \alpha_n < c, \tag{32.47}
$$


from which one can compute b. 
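
As a quick check of how (32.42), (32.46), and (32.47) fit
together, consider a toy problem that is not part of the
original text: two scalar training points $x_1 = 1$, $y_1 = +1$
and $x_2 = -1$, $y_2 = -1$, the identity feature map
$\phi(x) = x$ (so that $k(x, x') = x x'$), and any $c > 1/2$.
Under these assumptions the dual can be solved by hand:

$$
\begin{aligned}
&\alpha_1 y_1 + \alpha_2 y_2 = 0 \ \Rightarrow\ \alpha_1 = \alpha_2 = \alpha, \\
&J_D(\alpha) = -\tfrac{1}{2}\left(\alpha^2 + \alpha^2 + 2\alpha^2\right) + 2\alpha = -2\alpha^2 + 2\alpha
 \ \Rightarrow\ \alpha = \tfrac{1}{2}, \\
&w = \alpha y_1 x_1 + \alpha y_2 x_2 = 1, \\
&y_1 (w x_1 + b) - 1 = 0 \ \Rightarrow\ b = 0 .
\end{aligned}
$$

Both multipliers satisfy $0 < \alpha_n < c$, so both points
lie exactly on the margin, and the resulting classifier is
simply $\mathrm{sign}(x)$.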


SVC as a Penalized Empirical Risk Minimization 
Problem 
The derivation so far followed the classical approach 
due to Vapnik; the main argument comes from geo- 
metrical insights on the pattern recognition problem. 


Whenever the feature space is finite dimensional, one 
can approach learning either by solving the primal prob- 
lem or by solving the dual one; when this is not the case, 
one can still use the dual problem and rely on the dual 
representation obtained. 

Before proceeding, we highlight a different, yet
equivalent problem formulation. For the primal problem
this reads

$$
\min_{w,b}\ \sum_{n=1}^{N} \left[ 1 - y_n\left(w^\top \phi(x_n) + b\right) \right]_+ + \lambda\, w^\top w, \tag{32.48}
$$

where we let $\lambda = 1/(2c)$ and we define $[\cdot]_+$ by

$$
[a]_+ = \begin{cases} a, & \text{if } a > 0, \\ 0, & \text{otherwise.} \end{cases} \tag{32.49}
$$

Problem (32.37) is an instance of (32.25) obtained by
letting $\Theta = \mathcal{F} \times \mathbb{R}$ and taking as penalty the
seminorm

$$
\Gamma : (w,b) \mapsto w^\top w. \tag{32.50}
$$

Note that $\mathcal{F} \times \mathbb{R}$ is naturally equipped with the
inner product $\langle (w_1, b_1), (w_2, b_2) \rangle = w_1^\top w_2 + b_1 b_2$ and
is a (finite-dimensional) Hilbert space (HS). This shows
that (32.37) is essentially a regularized empirical risk
minimization problem that can be analyzed in the
framework of the SRM principle presented in
Sect. 32.2.
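
The penalized formulation (32.48) is easy to evaluate
directly. The following is a minimal sketch, assuming the
inputs have already been mapped to finite-dimensional
feature vectors; the function names and the toy data are
illustrative and do not come from the text.

```python
def hinge(a):
    """Positive part [a]_+ as in (32.49)."""
    return a if a > 0 else 0.0


def decision_value(w, b, phi_x):
    """Evaluate w^T phi(x) + b for one already-mapped input phi_x."""
    return sum(w_i * p_i for w_i, p_i in zip(w, phi_x)) + b


def penalized_hinge_risk(w, b, features, labels, lam):
    """Objective of (32.48): sum of hinge losses plus lam * w^T w."""
    data_term = sum(
        hinge(1.0 - y_n * decision_value(w, b, phi_n))
        for phi_n, y_n in zip(features, labels)
    )
    return data_term + lam * sum(w_i * w_i for w_i in w)


# Toy data (hypothetical): one-dimensional features, labels in {+1, -1}.
features = [[2.0], [-2.0], [0.5]]
labels = [1, -1, -1]
c = 1.0
value = penalized_hinge_risk([1.0], 0.0, features, labels, lam=1.0 / (2.0 * c))
```

Multiplying `value` by `c` gives the primal objective
(32.37) evaluated at the smallest feasible slacks, since
at the optimum each slack equals the corresponding hinge
term.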


VC Bounds for Classification 

In Sect. 32.2.3 we already discussed bounds on the
generalization error in terms of capacity factors. In
particular, (32.18) states a bound for the case of VC
dimension. The larger this VC dimension, the smaller
the training error (empirical risk) can become, but the
confidence term (second term on the right-hand side
of (32.18)) will grow. The minimum of the sum of these
two terms is then a good compromise solution. For
SVM classifiers, Vapnik has shown that hyperplanes
satisfying $\|w\| \le a$ have a VC dimension $h$ that is
upper-bounded by

$$
h \le \min\left([r^2 a^2], N\right) + 1, \tag{32.51}
$$

where $[\cdot]$ represents the integer part and $r$ is the
radius of the smallest ball containing the points
$\phi(x_1), \phi(x_2), \dots, \phi(x_N)$ in the feature space $\mathcal{F}$.

Note that for each value of $a$ there exists a
corresponding value of $\lambda$ in (32.48) and, correspondingly,
a value of $c$ in (32.37) or (32.42). Additionally, the
radius $r$ can also be computed by solving a QP problem,
see, e.g., [32.26]. It follows that one could compute
solutions corresponding to multiple values of the
hyperparameters, find the corresponding empirical risk
and radius, and then pick the model corresponding to the
least value of the right-hand side of the bound (32.18).
As we have already remarked, however, the bound (32.18)
is often too conservative. Sharper bounds and frameworks
alternative to VC theory have been derived, see, e.g.,
[32.73]. In practice, however, the choice of parameters
is often guided by data-driven model selection criteria,
see Sect. 32.5.


Relative Margin and Data-Dependent Regularization
Although maximum margin classifiers have proved to 
be very effective, alternative notions of data separation 
have been proposed. 

The authors of [32.74, 75], for instance, argue that
maximum margin classifiers might be misled by
directions of large variation. They propose a way to
correct this drawback by measuring the margin not in an
absolute sense but rather relative to the spread of data
in any projection direction. Note that this can be seen
as a way to conveniently craft the hypothesis space, an
important aspect that we discussed in Sect. 32.2.


32.3.2 SVMs for Function Estimation 


In addition to classification, the support vector
methodology has also been introduced for linear and
nonlinear function estimation problems [32.25]. For the
general nonlinear case, output values are assigned
according to the primal model

$$
\hat{y} = w^\top \phi(x) + b. \tag{32.52}
$$

In order to estimate the model's parameters $w$ and $b$
from training data consisting of $N$ input-output pairs,
Vapnik proposed to evaluate the empirical risk according
to

$$
R_{\mathrm{emp}} = \frac{1}{N} \sum_{n=1}^{N} \left| y_n - w^\top \phi(x_n) - b \right|_\epsilon \tag{32.53}
$$

with the so-called Vapnik's $\epsilon$-insensitive loss
function defined as

$$
|y - f(x)|_\epsilon = \begin{cases} 0, & \text{if } |y - f(x)| \le \epsilon, \\ |y - f(x)| - \epsilon, & \text{otherwise.} \end{cases} \tag{32.54}
$$


The idea is illustrated in Fig. 32.5. 
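
The loss (32.54) and the empirical risk (32.53) can be
written in a few lines. This is only an illustrative
sketch; the function names and the sample values are
assumptions made for the example.

```python
def eps_insensitive_loss(y, f_x, eps):
    """|y - f(x)|_eps: zero inside the tube of half-width eps, linear outside."""
    residual = abs(y - f_x)
    return 0.0 if residual <= eps else residual - eps


def empirical_risk(targets, predictions, eps):
    """R_emp from (32.53): average eps-insensitive loss over the sample."""
    losses = [eps_insensitive_loss(y, f, eps) for y, f in zip(targets, predictions)]
    return sum(losses) / len(losses)


# Hypothetical targets and model outputs.
risk = empirical_risk([1.0, 2.0, 3.0], [1.05, 2.5, 2.0], eps=0.1)
```

Only the second and third predictions fall outside the
tube, so they alone contribute to `risk`.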


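The corresponding primal optimization problem is
the QP problem

$$
\begin{aligned}
\min_{w,b,\xi,\xi^*}\quad & J_P(w,\xi,\xi^*) = \frac{1}{2} w^\top w + c \sum_{n=1}^{N} \left(\xi_n + \xi_n^*\right) \\
\text{subject to}\quad & y_n - w^\top \phi(x_n) - b \le \epsilon + \xi_n, \quad n = 1,\dots,N, \\
& w^\top \phi(x_n) + b - y_n \le \epsilon + \xi_n^*, \quad n = 1,\dots,N, \\
& \xi_n, \xi_n^* \ge 0, \quad n = 1,\dots,N,
\end{aligned}
\tag{32.55}
$$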


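where $c > 0$ is a user-defined parameter that determines
the amount up to which deviations from the desired
accuracy $\epsilon$ are tolerated. Following the same approach
considered above for (32.37), one obtains the dual QP
problem

$$
\begin{aligned}
\max_{\alpha,\alpha^*}\quad & J_D(\alpha, \alpha^*) \\
\text{subject to}\quad & \sum_{n=1}^{N} \left(\alpha_n - \alpha_n^*\right) = 0, \\
& 0 \le \alpha_n \le c, \quad n = 1,\dots,N, \\
& 0 \le \alpha_n^* \le c, \quad n = 1,\dots,N,
\end{aligned}
\tag{32.56}
$$

where

$$
\begin{aligned}
J_D(\alpha, \alpha^*) ={}& -\frac{1}{2} \sum_{m,n=1}^{N} \left(\alpha_m - \alpha_m^*\right)\left(\alpha_n - \alpha_n^*\right) k(x_m, x_n) \\
& - \epsilon \sum_{n=1}^{N} \left(\alpha_n + \alpha_n^*\right) + \sum_{n=1}^{N} y_n \left(\alpha_n - \alpha_n^*\right).
\end{aligned}
\tag{32.57}
$$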


Note that, whereas (32.37) and (32.42) have tuning pa- 
rameter c, in (32.55) and (32.56) one has the additional 
parameter €. 


Fig. 32.5 $\epsilon$-insensitive loss

Before continuing, we note that a number of interesting
modifications of the original SVR primal-dual
formulations exist. In particular, [32.76] proposed the
$\nu$-tube support vector regression. In this method, the
objective $J_P(w, \xi, \xi^*)$ in (32.55) is replaced by

$$
J_P(w, \xi, \xi^*, \epsilon) = \frac{1}{2} w^\top w + c \left( \nu \epsilon + \frac{1}{N} \sum_{n=1}^{N} \left(\xi_n + \xi_n^*\right) \right). \tag{32.58}
$$

In the latter, $\epsilon$ is an optimization variable rather than
a hyperparameter, as in (32.55); $\nu$, on the other hand,
is fixed by the user and controls the fraction of support
vectors that is allowed outside the tube.


32.3.3 Main Features of SVMs 


Here we briefly highlight the main features of support 
vector algorithms, making a direct comparison with 
classical neural networks. 


Choice of Kernel 
A number of possible kernels, such as the Gaussian 
radial basis function (RBF) kernel, can be chosen in 
(32.44). Some examples are included in Table 32.1. 

In general, it is clear from (32.44) that a valid kernel
function must preserve the fundamental properties of the
inner product. That is, for the equality to hold, the
bivariate function $k : \mathbb{R}^D \times \mathbb{R}^D \to \mathbb{R}$ is required to
be symmetric and positive definite. Note that this, in
particular, imposes a restriction on t in the polynomial
kernel. A more in-depth discussion of kernels is
postponed to Sect. 32.6.
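
The kernels of Table 32.1 and the associated Gram matrix
can be sketched as follows. The function names, default
parameter values, and sample points are illustrative
assumptions; the Gaussian RBF follows the table's
convention of dividing by the squared width.

```python
import math


def dot(x, y):
    """Euclidean inner product x^T y."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, d=2, t=1.0):
    """(t + x^T y)^d with integer degree d > 0 and offset t >= 0."""
    return (t + dot(x, y)) ** d


def gaussian_rbf_kernel(x, y, sigma=1.0):
    """exp(-||x - y||^2 / sigma^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / sigma ** 2)


def gram_matrix(kernel, points):
    """Matrix with entries k(x_n, x_m) for all pairs of points."""
    return [[kernel(x_n, x_m) for x_m in points] for x_n in points]


points = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]]
K = gram_matrix(gaussian_rbf_kernel, points)
```

Symmetry of `K` follows directly from the symmetry of each
kernel; positive definiteness is what restricts the
admissible parameter values.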


Global Solution 
(32.37) and its dual (32.42) are convex problems. This 
means that any local minimum must also be global. 
Therefore, even though SVCs share similarities with 
neural network schemes (see below), they do not suf- 
fer from the well-known issue of local minima. 


Sparseness 
The dual model is parsimonious: typically, many $\alpha$'s
are zero at the solution, with the nonzero ones located
in the proximity of the decision boundary. This is also
desirable in all those settings where one requires fast
on-line out-of-sample evaluations of models.


Table 32.1 Some examples of kernel functions

Kernel name                   k(x, y)
Linear                        $x^\top y$
Polynomial of degree d > 0    $(t + x^\top y)^d$, for $t \ge 0$
Gaussian RBF                  $\exp(-\|x - y\|^2/\sigma^2)$


Neural Network Interpretation 

Both primal (parametric) and dual (nonparametric) 
problems admit neural network representations [32.26], 
see Fig. 32.6. Note that in the dual problem the size of
the QP problem is not influenced by the dimension D of
the input space, nor is it influenced by the dimension of
the feature space. Notably, in classical multilayer
perceptrons one has to fix the number of hidden units in
advance; in contrast, in SVMs the number of hidden 
units follows from the QP problem and corresponds to 
the number of support vectors. 


SVM Solvers 

The primal and dual formulations presented above 
are all QP problems. This means that one can rely 
on general purpose QP solvers for the training of 
models. Additionally, a number of specialized
decomposition methods have been developed, including
the sequential minimal optimization (SMO) algorithm
[32.77]. Publicly available software packages such as
libSVM [32.78, 79] and SVMlight [32.80] include
implementations of efficient solvers.


32.3.4 The Class of Least-Squares SVMs 


We discuss here the class of least-squares SVMs
(LS-SVMs) obtained by simple modifications to the SVM
formulations. The arising problems relate to a number
of existing methods and entail certain advantages with
respect to the original SVM formulations.


Fig. 32.6a,b Primal-dual network interpretations of
SVMs [32.26]. (a) Primal problem: the number F of hidden
units in the primal weight space corresponds to the
dimensionality of the feature space. (b) Dual problem:
the number #SV of hidden units in the dual weight space
corresponds to the number of nonzero $\alpha$'s


LS-SVMs for Classification
We illustrate the idea with respect to the formulation
for classification (LS-SVC). The approach, originally
proposed in [32.81], considers the primal problem

$$
\begin{aligned}
\min_{w,b,e}\quad & J_P(w,e) = \frac{1}{2} w^\top w + \frac{\gamma}{2} \sum_{n=1}^{N} e_n^2 \\
\text{subject to}\quad & y_n\left(w^\top \phi(x_n) + b\right) = 1 - e_n, \quad n = 1,\dots,N.
\end{aligned}
\tag{32.59}
$$

This formulation simplifies the primal problem (32.37)
in two ways. First, the inequality constraints are
replaced by equality constraints; the 1's on the
right-hand side in the constraints are regarded as target
values rather than being treated as a threshold. An error
$e_n$ is allowed so that misclassifications are tolerated
in the case of overlapping distributions. Secondly, a
squared loss function is taken for these error variables.
The Lagrangian for (32.59) is

$$
\mathcal{L}(w,b,e;\alpha) = J_P(w,e) - \sum_{n=1}^{N} \alpha_n \left( y_n\left(w^\top \phi(x_n) + b\right) - 1 + e_n \right), \tag{32.60}
$$


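where $\alpha$'s are Lagrange multipliers that can be positive
or negative since the problem now only entails equality
constraints. The KKT conditions for optimality yield

$$
\begin{aligned}
&\frac{\partial \mathcal{L}}{\partial w} = 0 \ \rightarrow\ w = \sum_{n=1}^{N} \alpha_n y_n \phi(x_n), \\
&\frac{\partial \mathcal{L}}{\partial b} = 0 \ \rightarrow\ \sum_{n=1}^{N} \alpha_n y_n = 0, \\
&\frac{\partial \mathcal{L}}{\partial e_n} = 0 \ \rightarrow\ \gamma e_n = \alpha_n, \quad n = 1,\dots,N, \\
&\frac{\partial \mathcal{L}}{\partial \alpha_n} = 0 \ \rightarrow\ y_n\left(w^\top \phi(x_n) + b\right) - 1 + e_n = 0, \quad n = 1,\dots,N.
\end{aligned}
\tag{32.61}
$$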


By eliminating the primal variables $w$ and $e$, one
obtains the KKT system [32.82]

$$
\begin{bmatrix} 0 & y^\top \\ y & \Omega + \frac{1}{\gamma} I_N \end{bmatrix}
\begin{bmatrix} b \\ \alpha \end{bmatrix}
=
\begin{bmatrix} 0 \\ 1_N \end{bmatrix},
\tag{32.62}
$$

where

$$
y = [y_1, y_2, \dots, y_N]^\top
$$

and

$$
1_N = [1, 1, \dots, 1]^\top
$$

are N-dimensional vectors and $\Omega$ is defined entry-wise
by

$$
(\Omega)_{mn} = y_m y_n \phi(x_m)^\top \phi(x_n) = y_m y_n k(x_m, x_n). \tag{32.63}
$$

In the latter, we used the kernel trick introduced before.
The dual model obtained corresponds to (32.45), where
$\alpha$ and $b$ are now obtained by solving the linear system
(32.62), rather than a more complex QP problem, as in
(32.42). Notably, for LS-SVMs (and related methods)
one can exploit a number of computational shortcuts
related to spectral properties of the kernel matrix; for
instance, one can compute solutions for different values
of $\gamma$ at the price of computing the solution of a single
problem, which cannot be done for QPs [32.83-85].
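
To make the contrast with the QP-based SVC concrete, the
sketch below trains an LS-SVC by assembling and solving
the linear system (32.62) and then classifies with
(32.45). It assumes NumPy and a Gaussian RBF kernel; the
function names, the toy data, and the values of `gamma`
and `sigma` are illustrative choices, not prescriptions
from the text.

```python
import numpy as np


def rbf_gram(A, B, sigma):
    """Gaussian RBF kernel exp(-||a - b||^2 / sigma^2) for all row pairs."""
    sq_dist = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-np.maximum(sq_dist, 0.0) / sigma ** 2)


def lssvc_fit(X, y, gamma, sigma):
    """Solve the KKT system (32.62) for the bias b and multipliers alpha."""
    N = X.shape[0]
    Omega = np.outer(y, y) * rbf_gram(X, X, sigma)   # entries as in (32.63)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(N) / gamma
    rhs = np.concatenate(([0.0], np.ones(N)))
    solution = np.linalg.solve(A, rhs)
    return solution[0], solution[1:]


def lssvc_predict(X_train, y_train, alpha, b, X_new, sigma):
    """Dual classifier (32.45): sign(sum_n alpha_n y_n k(x, x_n) + b)."""
    K = rbf_gram(X_new, X_train, sigma)
    return np.sign(K @ (alpha * y_train) + b)


# Toy two-cluster data (hypothetical).
X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 3.0], [3.0, 4.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
b, alpha = lssvc_fit(X, y, gamma=10.0, sigma=2.0)
predictions = lssvc_predict(X, y, alpha, b, X, sigma=2.0)
```

Note that every entry of `alpha` is generally nonzero,
which reflects the lack of sparsity of LS-SVM solutions
compared with the SVC dual.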

The LS-SVC is easily extended to handle multiclass
problems [32.26]. Extensive comparisons with alternative
techniques (including SVC) for binary and multiclass
classification are considered in [32.26, 86]. The results
show that, in general, LS-SVC either outperforms or
performs comparably to the alternative techniques.
Interestingly, it is clear from the primal problem (32.59)
that LS-SVC maximizes the margin while minimizing
the within-class scattering from targets $\{+1, -1\}$. As
such, LS-SVC is naturally related to Fisher discriminant
analysis in the feature space [32.26]; see also [32.2, 87,
88].


Alternative Formulations 
Besides classification, a primal—dual approach simi- 
lar to the one introduced above has been considered 
for function estimation [32.26]; in this case too the 
dual model representation is obtained by solving a lin- 
ear system of equations rather than a QP problem, 
as in SVR. This approach to function estimation is 
similar to a number of techniques, including smooth- 
ing splines [32.13], regularization networks [32.44, 89], 
kernel ridge regression [32.90], and Kriging [32.91]. 
LS-SVM solutions also share similarities with Gaussian 
processes [32.92, 93], which we discuss in more detail 
in the next section. 

Other formulations have been considered within
a primal-dual framework. These include principal
component analysis [32.94], which we discuss next,
spectral clustering [32.95], canonical correlation
analysis [32.96], dimensionality reduction and data
visualization [32.97], recurrent networks [32.98], and
optimal control [32.99]; see also [32.100]. In all these cases the
estimation problem of interest is conceived at the primal 
level as an optimization problem with equality con- 
straints, rather than inequality constraints as in SVMs. 
The constraints relate to the model which is expressed 
in terms of the feature map. From the KKT optimality 
conditions one jointly finds the optimal model repre- 
sentation and the model estimate. As for the case of 
classification the dual model representation is expressed 
in terms of kernel functions. 


Sparsity and Robustness 
An important difference with SVMs is that the dual
model found via LS-SVMs depends upon all the training
data. Reduction and pruning techniques have been
used to achieve a sparse representation in a second
stage [32.26, 101]. A different approach, which makes
use of the primal-dual setting, leads to fixed-size tech- 
niques [32.26], which relate to the Nyström method 
proposed in [32.102] but lead to estimation in the pri- 
mal setting. Optimized versions of fixed-size LS-SVMs 
are currently applicable to large data sets with millions 
of data points for training and tuning on a personal com- 
puter [32.103]. 

In LS-SVM the estimation of the support values is 
only optimal in the case of a Gaussian distribution of the 
error variables [32.26]; [32.101] shows how to obtain 
robust estimates for regression by applying a weighted 
version of LS-SVM. The approach is suitable in the 
case of outliers or non-Gaussian error distributions with 
heavy tails. 


32.3.5 Kernel Principal Component Analysis 


Principal component analysis (PCA) is one of the 
most important techniques in the class of unsupervised 
learning algorithms. PCA linearly transforms a num- 
ber of possibly correlated variables into uncorrelated 
features called principal components. The transforma- 
tion is performed to find directions of maximal vari- 
ation. Often, few principal components can account 
for most of the structure in the original dataset. PCA 
is not suitable for discovering nonlinear relationships 
among the original variables. To overcome this limita- 
tion [32.104] originally proposed the idea of perform- 
ing PCA in a feature space rather than in the input 
space. 

Regardless of the space where the transformation is
performed, there are a number of different ways to
characterize the derivation of the PCA problem [32.105].
Ultimately, PCA is readily performed by solving an
eigenvalue problem. Here we consider a primal-dual
formulation similar to the one introduced above
for LS-SVC [32.106]. In this way the eigenvalue problem
is seen to arise from optimality conditions. Notably,
the approach emphasizes the underlying model, which
is important for finding the projection of out-of-sample
points along the direction of maximal variation. The
analysis assumes the knowledge of N training data
points

$$
\{x_1, x_2, \dots, x_N\} \subset \mathbb{R}^D \tag{32.64}
$$

drawn i.i.d. according to the generator $p(x)$. The
starting point is to define the generic score variable $z$
as

$$
z(x) = w^\top \left(\phi(x) - \hat{\mu}_\phi\right). \tag{32.65}
$$

The latter represents one projection of $\phi(x) - \hat{\mu}_\phi$ into
the target space. Note that we considered data centered
in the feature space with

$$
\hat{\mu}_\phi = \frac{1}{N} \sum_{n=1}^{N} \phi(x_n) \tag{32.66}
$$

corresponding to the center of the empirical
distribution. The primal problem consists of the
following constrained formulation [32.94]

$$
\begin{aligned}
\max_{w,z}\quad & -\frac{1}{2} w^\top w + \frac{\gamma}{2} \sum_{n=1}^{N} z_n^2 \\
\text{subject to}\quad & z_n = w^\top \left(\phi(x_n) - \hat{\mu}_\phi\right), \quad n = 1,\dots,N,
\end{aligned}
\tag{32.67}
$$

where $\gamma > 0$. The latter maximizes the empirical
variance of $z$ while keeping the norm of the corresponding
parameter vector $w$ small by the regularization term.
One can also include a bias term, see [32.26] for
a derivation.

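The Lagrangian corresponding to (32.67) is

$$
\mathcal{L}(w,z;\alpha) = -\frac{1}{2} w^\top w + \frac{\gamma}{2} \sum_{n=1}^{N} z_n^2
 - \sum_{n=1}^{N} \alpha_n \left( z_n - w^\top \left(\phi(x_n) - \hat{\mu}_\phi\right) \right),
\tag{32.68}
$$

with conditions for optimality given by

$$
\begin{aligned}
&\frac{\partial \mathcal{L}}{\partial w} = 0 \ \rightarrow\ w = \sum_{n=1}^{N} \alpha_n \left(\phi(x_n) - \hat{\mu}_\phi\right), \\
&\frac{\partial \mathcal{L}}{\partial z_n} = 0 \ \rightarrow\ \alpha_n = \gamma z_n, \quad n = 1,\dots,N, \\
&\frac{\partial \mathcal{L}}{\partial \alpha_n} = 0 \ \rightarrow\ z_n = w^\top \left(\phi(x_n) - \hat{\mu}_\phi\right), \quad n = 1,\dots,N.
\end{aligned}
\tag{32.69}
$$

By eliminating the primal variables $w$ and $z$, one
obtains for $n = 1,\dots,N$

$$
\frac{1}{\gamma} \alpha_n - \sum_{m=1}^{N} \alpha_m \left(\phi(x_n) - \hat{\mu}_\phi\right)^\top \left(\phi(x_m) - \hat{\mu}_\phi\right) = 0. \tag{32.70}
$$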


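The latter is an eigenvalue decomposition that can be
stated in matrix notation as

$$
\Omega_c \alpha = \lambda \alpha, \tag{32.71}
$$

where $\lambda = 1/\gamma$ and $\Omega_c$ is the centered Gram matrix
defined entry-wise by

$$
\begin{aligned}
[\Omega_c]_{nm} ={}& k(x_n, x_m) - \frac{1}{N} \sum_{l=1}^{N} k(x_m, x_l) \\
& - \frac{1}{N} \sum_{l=1}^{N} k(x_n, x_l) + \frac{1}{N^2} \sum_{l=1}^{N} \sum_{i=1}^{N} k(x_l, x_i).
\end{aligned}
\tag{32.72}
$$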
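
The centering of the Gram matrix and the eigenvalue
problem (32.71) can be sketched in a few lines. The code
assumes NumPy; the function names and the collinear toy
data with a linear kernel are illustrative choices made so
that the result is easy to verify by hand.

```python
import numpy as np


def center_gram(K):
    """Centered Gram matrix: K - 1K/N - K1/N + 1K1/N^2 (1 = all-ones matrix)."""
    N = K.shape[0]
    ones = np.full((N, N), 1.0 / N)
    return K - ones @ K - K @ ones + ones @ K @ ones


def kernel_pca(K, n_components):
    """Leading eigenpairs of the centered Gram matrix, largest eigenvalue first."""
    K_c = center_gram(K)
    eigvals, eigvecs = np.linalg.eigh(K_c)          # ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvals[order], eigvecs[:, order], K_c


# Four collinear points with a linear kernel (hypothetical data):
# all the variance lies along a single direction.
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
K = X @ X.T
eigvals, alphas, K_c = kernel_pca(K, n_components=2)
```

Because the points are collinear, only the first
eigenvalue is nonzero; it equals the total squared
distance of the points from their mean.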


As before, one may choose any positive definite kernel;
a typical choice corresponds to the Gaussian RBF
kernel. By solving the eigenvalue problem (32.71) one
finds N pairs of eigenvalues and eigenvectors

$$
\left(\lambda_m, \alpha^{(m)}\right), \quad m = 1, 2, \dots, N.
$$

Correspondingly, one finds N score variables with dual
representation

$$
\begin{aligned}
z_m(x) = \sum_{n=1}^{N} \alpha_n^{(m)} \Bigg( & k(x_n, x) - \frac{1}{N} \sum_{i=1}^{N} k(x, x_i) \\
& - \frac{1}{N} \sum_{i=1}^{N} k(x_n, x_i) + \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} k(x_i, x_j) \Bigg),
\end{aligned}
\tag{32.73}
$$

in which $\alpha^{(m)}$ is the eigenvector associated to the
eigenvalue $\lambda_m$. Note that all eigenvalues are real and
nonnegative because $\Omega_c$ is symmetric and positive
semidefinite; the eigenvectors are mutually orthogonal,
i.e., $(\alpha^{(l)})^\top \alpha^{(m)} = 0$ for $l \ne m$. Note that when the
feature map is nonlinear, the number of score variables
associated to nonzero eigenvalues might exceed the
dimensionality D of the input space. Typically, one then
selects the minimal number of score variables that
preserve a certain reconstruction accuracy, see [32.26,
104, 105].

Finally, observe that by the second optimality
condition in (32.69), one has $z^{(l)} = \lambda_l \alpha^{(l)}$ for
$l = 1, 2, \dots, N$. From this, we obtain that the score
variables are empirically uncorrelated. Indeed, we have
for $l \ne m$

$$
\sum_{n=1}^{N} z_l(x_n) z_m(x_n) = \lambda_l \lambda_m \sum_{n=1}^{N} \alpha_n^{(l)} \alpha_n^{(m)} = 0. \tag{32.74}
$$


32.4 Gaussian Processes


So far we have dealt with primal-dual kernel methods;
regularization was motivated by the SRM principle,
which achieves generalization by trading off empirical
risk with the complexity of the model class. This
is representative of a large number of procedures.
However, it leaves out an important class of kernel-based
probabilistic techniques that goes under the name of
Gaussian processes. In Gaussian processes one uses
a Bayesian approach to perform inference and learning.
The main idea goes back at least to the work of
Wiener [32.107] and Kolmogorov [32.108] on time-series
analysis.

As a first step, one poses a probabilistic model
which serves as a prior. This prior is updated in the light
of training data so as to obtain a predictive distribution.
The latter represents a spectrum of possible answers. In
contrast, in the standard SVM/LS-SVM framework one
obtains only point-wise estimates. The approach,
however, is analytically tractable only for a limited
number of cases of interest. In the following, we summarize the
main ideas in the context of regression where tractabil- 
ity is ensured by Gaussian posteriors; the interested 
reader is referred to [32.109] for an in-depth review. 


32.4.1 Definition 


A real-valued stochastic process $f$ is a Gaussian process
(GP) if for every finite set of indices $x_1, x_2, \dots, x_N$ in
an index set $\mathcal{X}$, the tuple

$$
f_x = \left(f(x_1), f(x_2), \dots, f(x_N)\right) \tag{32.75}
$$

is a multivariate Gaussian random variable taking values
in $\mathbb{R}^N$. Note that the index set $\mathcal{X}$ represents the set of
all possible inputs. This might be a countable set, such
as $\mathbb{N}$ (e.g., a discrete time index) or, more commonly in
machine learning, the Euclidean space $\mathbb{R}^D$. A GP $f$ is
fully specified by a mean function $m : \mathcal{X} \to \mathbb{R}$ and a
covariance function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ defined by

$$
m(x) = \mathbb{E}[f(x)], \tag{32.76}
$$

$$
k(x, x') = \mathbb{E}\left[\left(f(x) - m(x)\right)\left(f(x') - m(x')\right)\right]. \tag{32.77}
$$

In light of this, one writes

$$
f \sim \mathcal{GP}(m, k). \tag{32.78}
$$

Usually, for notational simplicity one takes the mean
function to be zero, which we consider here; however,
this need not be the case.

Note that the specification of the covariance function
implies a distribution over any finite collection of
random variables obtained by sampling the process $f$ at
given locations. Specifically, we can write for (32.75)

$$
f_x \sim \mathcal{N}(0, K), \tag{32.79}
$$

which means that $f_x$ follows a multivariate zero-mean
Gaussian distribution with $N \times N$ covariance matrix $K$
defined entry-wise by

$$
[K]_{nm} = k(x_n, x_m).
$$


The typical use of a GP is in a regression context, which 
we consider next. 


32.4.2 GPs for Regression 


In regression one observes a dataset of input-output
pairs $(x_n, y_n)$, $n = 1, 2, \dots, N$ and wants to make a
prediction at one or more test points. In the following,
we call $y$ the vector obtained by stacking the target
observations and denote by $X$ the collection of input
training patterns (32.64). In order to carry out the
Bayesian inference, one needs a model for the generating
process. It is generally assumed that function values are
observed in noise, that is,

$$
y_n = f(x_n) + \epsilon_n, \quad n = 1, 2, \dots, N. \tag{32.80}
$$

One further assumes that $\epsilon_n$ are i.i.d. zero-mean
Gaussian random variables independent of the process $f$
and with variance $\sigma^2$. Under these circumstances, the
noisy observations are Gaussian with mean zero and
covariance function

$$
c(x_m, x_n) = \mathbb{E}[y_m y_n] = k(x_m, x_n) + \sigma^2 \delta_{nm}, \tag{32.81}
$$

where the Kronecker delta function $\delta_{nm}$ is 1 if $n = m$
and 0 otherwise.

Suppose now that we are interested in the value $f_*$
of the process at a single test point $x_*$ (the approach
that we discuss below extends to multiple test points
in a straightforward manner). By relying on properties
of Gaussian probabilities, we can readily write the
joint distribution of the test function value and the noisy
training observations. This reads

$$
\begin{bmatrix} y \\ f_* \end{bmatrix} \sim \mathcal{N}\left( 0, \begin{bmatrix} K + \sigma^2 I_N & k_* \\ k_*^\top & k_{**} \end{bmatrix} \right), \tag{32.82}
$$

where $I_N$ is the $N \times N$ identity matrix, $k_{**} = k(x_*, x_*)$,
and finally

$$
k_* = \left[k(x_1, x_*), k(x_2, x_*), \dots, k(x_N, x_*)\right]^\top. \tag{32.83}
$$
Prediction with Noisy Observations 


Using the conditioning rule for multivariate Gaussian
distributions, namely

$$
\begin{bmatrix} x \\ y \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} a \\ b \end{bmatrix}, \begin{bmatrix} A & C \\ C^\top & B \end{bmatrix} \right)
\ \Rightarrow\
y \mid x \sim \mathcal{N}\left( b + C^\top A^{-1}(x - a),\ B - C^\top A^{-1} C \right),
$$

one arrives at the key predictive equation for GP
regression

$$
f_* \mid y, X, x_* \sim \mathcal{N}\left(m_*, \sigma_*^2\right), \tag{32.84}
$$

where

$$
m_* = k_*^\top \left(K + \sigma^2 I_N\right)^{-1} y, \tag{32.85}
$$

$$
\sigma_*^2 = k_{**} - k_*^\top \left(K + \sigma^2 I_N\right)^{-1} k_*. \tag{32.86}
$$

Note that, by letting $\alpha = (K + \sigma^2 I_N)^{-1} y$, one obtains
for the mean value

$$
m_* = \sum_{n=1}^{N} \alpha_n k(x_n, x_*). \tag{32.87}
$$

Up to the bias term $b$, the latter coincides with the
typical dual model representation (32.7), in which $x_*$
plays the role of a test point. Therefore, one can see
that, in the framework of GPs, the covariance function
plays the same role as the kernel function. The variance
$\sigma_*^2$, on the other hand, is seen to be obtained from the
prior covariance by subtracting a positive term which
accounts for the information about the process conveyed
by training data.
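
The predictive equations (32.85) and (32.86) translate
almost literally into code. The sketch below assumes
NumPy and a Gaussian RBF covariance function; the
function names, the kernel `width` parameter, and the
one-point training set are illustrative assumptions chosen
so that the numbers can be checked by hand.

```python
import numpy as np


def rbf_kernel(A, B, width=1.0):
    """Gaussian RBF covariance exp(-||a - b||^2 / width^2) for all row pairs."""
    sq_dist = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-np.maximum(sq_dist, 0.0) / width ** 2)


def gp_predict(X, y, x_star, noise_var, kernel=rbf_kernel):
    """Predictive mean (32.85) and variance (32.86) at a single test point."""
    x_star = np.atleast_2d(x_star)
    K = kernel(X, X)
    k_star = kernel(X, x_star)[:, 0]
    k_star_star = kernel(x_star, x_star)[0, 0]
    A = K + noise_var * np.eye(X.shape[0])
    alpha = np.linalg.solve(A, y)            # alpha = (K + sigma^2 I)^{-1} y
    mean = k_star @ alpha                    # dual form (32.87)
    var = k_star_star - k_star @ np.linalg.solve(A, k_star)
    return mean, var


# One noisy observation y = 2 at x = 0 (hypothetical data).
X_train = np.array([[0.0]])
y_train = np.array([2.0])
mean_near, var_near = gp_predict(X_train, y_train, np.array([0.0]), noise_var=0.5)
mean_far, var_far = gp_predict(X_train, y_train, np.array([100.0]), noise_var=0.5)
```

At the training input the variance shrinks below the
prior value of 1, while far from the data the prediction
reverts to the zero prior mean with the full prior
variance.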


Weight-Space View 

We have presented GPs through the so-called function 
space view [32.109], which ultimately captures the dis- 
tinctive nature of this class of methodologies. Here we 
illustrate a different view, which allows one to achieve 
three objectives: 1) it is seen that Bayesian linear
models are a special instance of GPs; 2) the role of Bayes'
rule is highlighted; 3) one obtains additional insight on
the relationship with the feature map and kernel func- 
tion used before within primal—dual techniques. 


Bayesian Regression. The starting point is to
characterize $f$ as a parametric model involving a set of
basis functions $\psi_1, \psi_2, \dots, \psi_F$

$$
f(x) = \sum_{i=1}^{F} w_i \psi_i(x) = w^\top \psi(x). \tag{32.88}
$$

Note that $F$ might be infinity. For the special case where
$\psi$ is the identity mapping, one recognizes in (32.80)
the standard modeling assumptions for Bayesian linear
regression analysis. Inference is based on the posterior
distribution over the weights, computed by Bayes' rule

$$
\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{marginal likelihood}}, \qquad
p(w \mid y, X) = \frac{p(y \mid X, w)\, p(w)}{p(y \mid X)}, \tag{32.89}
$$

where the marginal likelihood (a.k.a. normalizing
constant) is independent of the weights

$$
p(y \mid X) = \int p(y \mid X, w)\, p(w)\, \mathrm{d}w. \tag{32.90}
$$


Explicit Feature Space Formulation. To make a
prediction for a test pattern $x_*$ we average over all
possible parameter values with weights corresponding to
the posterior probability

$$
p(f_* \mid y, X, x_*) = \int p(f_* \mid w, x_*)\, p(w \mid y, X)\, \mathrm{d}w. \tag{32.91}
$$

One can see that computing the posterior $p(w \mid y, X)$
based upon the prior

$$
p(w) = \mathcal{N}(0, \Sigma_p) \tag{32.92}
$$

gives the predictive model

$$
f_* \mid y, X, x_* \sim \mathcal{N}\Big( \psi(x_*)^\top \Sigma_p \Psi A y,\
 \psi(x_*)^\top \Sigma_p \psi(x_*) - \psi(x_*)^\top \Sigma_p \Psi A \Psi^\top \Sigma_p \psi(x_*) \Big),
\tag{32.93}
$$

where $A = (\Psi^\top \Sigma_p \Psi + \sigma^2 I_N)^{-1}$ and we denoted by

$$
\Psi = \left[\psi(x_1), \psi(x_2), \dots, \psi(x_N)\right]
$$

the feature representation of the training patterns. It is
not difficult to see that (32.93) is (32.84) in disguise. In
particular, one has

$$
k(x, y) = \psi(x)^\top \Sigma_p \psi(y). \tag{32.94}
$$

The positive definiteness of $\Sigma_p$ ensures the existence
and uniqueness of the square root $\Sigma_p^{1/2}$. If we now
define

$$
\phi(x) = \Sigma_p^{1/2} \psi(x), \tag{32.95}
$$

we retrieve the relationship in (32.44). We conclude that
the kernel function considered in the previous sections
can be interpreted as the covariance function of a GP.


32.4.3 Bayesian Decision Theory


Bayesian inference is particularly appealing when
prediction is intended for supporting decisions. In this
case, one requires a loss function $L(f_{\mathrm{true}}, f_{\mathrm{guess}})$ which
specifies the penalty obtained by guessing $f_{\mathrm{guess}}$ when
the true value is $f_{\mathrm{true}}$. Note that the predictive
distribution (32.84) or, equivalently, (32.93) was derived
without reference to the loss function. This is a ma- 
jor difference with respect to the techniques developed 
within the framework of statistical learning. Indeed, 
in the non-Bayesian framework of penalized empirical 
risk minimization, prediction and loss are entangled; 
one tackles learning in a somewhat more direct way. 
In contrast, in the Bayesian setting there is a clear dis- 
tinction between 1) the model that generated the data 
and 2) capturing the consequences of making guesses. 
In light of this, [32.109] advises one to beware of
arguments like "a Gaussian likelihood implies a squared
error loss". In order to find the point prediction that
incurs the minimal expected loss, one can define the merit
function

$$
R(f_{\mathrm{guess}} \mid y, X, x_*) = \int L(f_*, f_{\mathrm{guess}})\, p(f_* \mid y, X, x_*)\, \mathrm{d}f_*. \tag{32.96}
$$

Note that, since the true value $f_{\mathrm{true}}$ is unknown, the
latter averages with respect to the model's opinion
$p(f_* \mid y, X, x_*)$ on what the truth might be. The
corresponding best guess is

$$
f_{\mathrm{opt}} = \arg\min_{f_{\mathrm{guess}}} R(f_{\mathrm{guess}} \mid y, X, x_*). \tag{32.97}
$$

Since $p(f_* \mid y, X, x_*)$ is Gaussian and hence symmetric,
$f_{\mathrm{opt}}$ always coincides with the mean $m_*$ whenever
the loss is also symmetric. However, in many practical
problems, such as safety-critical applications, the loss
can be asymmetrical. In these cases, one must solve the
optimization problem in (32.97). Similar considerations
hold for classification, see [32.109]. For an account of
decision theory see [32.110].


32.5 Model Selection


Kernel-based models depend upon a number of parameters
which are determined during training by numerical
procedures. Still, one or more hyperparameters usually
need to be tuned by the user. In SVC, for instance, one
has to fix the value of c. The choice of the kernel
function, and of the corresponding parameters, also needs
to be properly addressed.

In general, performance measures used for model
selection include k-fold cross-validation, leave-one-out
(LOO) cross-validation, generalized approximate
cross-validation (GACV), approximate span bounds,
VC bounds, and radius-margin bounds. For discussions
and comparisons see [32.111, 112]. Another approach
found in the literature is kernel-target alignment
[32.113].


32.5.1 Cross-Validation


In practice, model selection based on cross-validation
is usually preferred over generalization error bounds.
Criticism of cross-validation approaches is related
to the high computational load involved; [32.114]
presents an efficient methodology for hyperparameter
tuning and model building using LS-SVMs. The
approach is based on the closed-form LOO
cross-validation computation for LS-SVMs, only requiring
the same computational cost as one single LS-SVM
training. Leave-one-out cross-validation-based
estimates of performance, however, generally exhibit
a relatively high variance and are, therefore, prone to
over-fitting. To amend this, [32.115] proposed the use
of Bayesian regularization at the second level of
inference.
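
For reference, plain k-fold cross-validation can be
sketched as below. This is a generic illustration, not the
closed-form LOO procedure of [32.114]; the function names,
the contiguous fold layout, and the mean-predictor toy
model are assumptions made for the example.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        stop = start + base + (1 if fold < extra else 0)
        yield indices[:start] + indices[stop:], indices[start:stop]
        start = stop


def cv_score(fit, loss, X, y, k):
    """Average validation loss; fit(X_train, y_train) must return a predictor."""
    total = 0.0
    for train, val in k_fold_splits(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        total += sum(loss(model(X[i]), y[i]) for i in val)
    return total / len(X)


def fit_mean(X_train, y_train):
    """Toy model (hypothetical): always predict the training mean."""
    mean = sum(y_train) / len(y_train)
    return lambda x: mean


def squared_loss(prediction, target):
    return (prediction - target) ** 2


score = cv_score(fit_mean, squared_loss, [0, 1, 2, 3], [0.0, 0.0, 2.0, 2.0], k=2)
```

To tune a hyperparameter such as c, one would evaluate
`cv_score` for each candidate value, with `fit` closing
over that value, and keep the candidate with the smallest
score.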


32.5.2 Bayesian Inference of Hyperparameters


Many authors have proposed a full Bayesian framework 
for kernel-based algorithms in the spirit of the meth- 
ods developed by MacKay for classical MLPs [32.116- 
118]. In particular, [32.26] discusses the case of LS- 
SVMs. It is shown that, besides leading to tuning 
strategies, the approach allows us to take probabilis- 
tic interpretations of the outputs; [32.109] discusses 
the Bayesian model selection for GPs. In general, the 
Bayesian framework consists of multiple levels of infer- 
ence. The parameters (i. e., with reference to the primal 
model, w and b) are inferred at level 1. Contrary to 
MLPs, this usually entails the solution of a convex op- 
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timization problem or even solving a linear system, as 
in LS-SVMs and GPs. The regularization parameter(s) 
and the kernel parameter(s) are inferred at higher levels. 


32.6 More on Kernels 


We have already seen that a kernel arising from an inner 
product can be interpreted as the covariance function 
of a Gaussian process. In this section we further study 
the mathematical properties of different yet equivalent 
notions of kernels. In particular, we will discuss that 
positive definite kernels are reproducing, in a sense that 
we are about to clarify. We will then review a number 
of specialized kernels for learning problems involving 
structured data. 


32.6.1 Positive Definite Kernels 


Denote by X a nonempty index set. A symmetric func- 
tion k: X xX —> R is a positive definite kernel if for 
any N € N and for any tuple (x1, x2,..., xy) € X”, the 
Gram matrix K defined entry-wise by Kym = k(%n,Xm); 
satisfies (note that, by definition, a positive definite ker- 
nel satisfies k(x, x) > 0 for any x € X) 


N N 
al! Ka = > > Kym nm Z 0 Yg € R“. 


n=1m=1 


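In particular, suppose $F$ is some Hilbert space (HS)
(for an elementary introduction see [32.66]) with inner
product $\langle \cdot, \cdot \rangle$. Then for any function $\phi : X \to F$ one has

\[
\begin{aligned}
\sum_{n=1}^{N} \sum_{m=1}^{N} \langle \phi(x_n), \phi(x_m) \rangle \alpha_n \alpha_m
&= \sum_{n=1}^{N} \sum_{m=1}^{N} \langle \alpha_n \phi(x_n), \alpha_m \phi(x_m) \rangle \\
&= \left\| \sum_{n=1}^{N} \alpha_n \phi(x_n) \right\|^2 \geq 0 .
\end{aligned}
\tag{32.98}
\]

From the first line one can then see that

\[
k : (x, y) \mapsto \langle \phi(x), \phi(y) \rangle \tag{32.99}
\]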


is a positive definite kernel in the sense specified above. 
A continuous positive definite kernel k is often called 
a Mercer kernel [32.36]. 


The method progressively integrates out the parameters 
by using the evidence at a certain level of inference as 
the likelihood at the successive level. 


Note that in Sect. 32.1.3 we denoted $\langle \phi(x), \phi(y) \rangle$ by
$\phi(x)^\top \phi(y)$, implicitly making the assumption that the
feature space $F$ is a finite dimensional Euclidean space.
However, one can show that the feature space associated
to certain positive definite kernels (such as the Gaussian
RBF [32.119]) is an infinite dimensional HS; in turn,
the inner product in such a space is commonly denoted
as $\langle \cdot, \cdot \rangle$.

32.6.2 Reproducing Kernels 


Evaluation Functional 
Let $(H, \langle \cdot, \cdot \rangle)$ be a HS of real-valued functions (the
theory of RKHSs generally deals with complex-valued
functions [32.12, 14]; here we stick to the real setting
for simplicity) on $X$ equipped with the norm
$\|f\| = \sqrt{\langle f, f \rangle}$. For $x \in X$ we denote by $L_x$ the
evaluation functional

\[
L_x : H \to \mathbb{R}, \qquad f \mapsto f(x) . \tag{32.100}
\]

$L_x$ is said to be bounded if there exists $c > 0$ such that
$|L_x f| = |f(x)| \leq c \|f\|$ for all $f \in H$. By the Riesz
representation theorem [32.120], if $L_x$ is bounded then there
exists a unique $\eta_x \in H$ such that for any $f \in H$

\[
L_x f = \langle f, \eta_x \rangle . \tag{32.101}
\]

Reproducing Kernel 
A function

\[
k : X \times X \to \mathbb{R}, \qquad (x, y) \mapsto k(x, y) \tag{32.102}
\]

is said to be a reproducing kernel of $H$ if and only if

\[
\forall x \in X, \quad k(\cdot, x) \in H , \tag{32.103}
\]
\[
\forall x \in X, \ \forall f \in H, \quad \langle f, k(\cdot, x) \rangle = f(x) . \tag{32.104}
\]

Note that by $k(\cdot, x)$ we mean the function $k(\cdot, x) : t \mapsto k(t, x)$.
The definition of reproducing kernel (r.k.) implies
that $k(\cdot, x) = \eta_x$, i.e., $k(\cdot, x)$ is the representer of the
evaluation functional $L_x$; (32.104) goes under the name
of reproducing property. From (32.103) and (32.104) it
is clear that

\[
k(x, y) = \langle k(\cdot, x), k(\cdot, y) \rangle, \quad \forall x, y \in X ; \tag{32.105}
\]

since $\langle \cdot, \cdot \rangle$ is symmetric, it follows that $k(x, y) = k(y, x)$.
A HS of functions that possesses a reproducing kernel is
called a reproducing kernel Hilbert space (RKHS).

Finally, notice that the reproducing kernel (r.k.) of
a HS of functions corresponds in a one-to-one
manner with the definition of the inner product $\langle \cdot, \cdot \rangle$;
changing the inner product implies a change in the
reproducing kernel.


Basic Properties of RKHSs 
Let $(G, \langle \cdot, \cdot \rangle)$ be a HS of functions. If, for any $x$, the
evaluation functional $L_x$ is bounded, then it is clear that
$G$ is a RKHS with reproducing kernel

\[
k(x, y) = \langle \eta_x, \eta_y \rangle . \tag{32.106}
\]

Vice versa, if $G$ admits a reproducing kernel $k$, then all
evaluation functionals are bounded. Indeed, we have

\[
|f(x)| = |\langle f, k(\cdot, x) \rangle| \leq \|k(\cdot, x)\| \, \|f\| = \sqrt{k(x, x)} \, \|f\| , \tag{32.107}
\]


where we simply relied on the Cauchy—Schwarz in- 
equality. Boundedness of evaluation functionals means 
that all the functions in the space are well defined for 
all x. Note that this is not the case, for instance, of the 
space of square-integrable functions. 
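
The bound (32.107) can be illustrated on a function built as a finite kernel expansion. The sketch below is only an illustration under stated assumptions: it uses a Gaussian RBF kernel, and the expansion points `centers` and coefficients `coeffs` are hypothetical. For $f = \sum_i c_i k(\cdot, x_i)$ the reproducing property gives $\|f\|^2 = \sum_{i,j} c_i c_j k(x_i, x_j)$, which is what the code uses for the norm.

```python
import math

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel on real numbers (sigma is an illustrative choice)."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))

centers = [-1.0, 0.0, 2.0]   # hypothetical expansion points x_i
coeffs = [0.5, -1.2, 0.8]    # hypothetical coefficients c_i

def f(x):
    """f = sum_i c_i k(., x_i), a function in the RKHS of the kernel."""
    return sum(c * rbf_kernel(x, xi) for c, xi in zip(coeffs, centers))

# RKHS norm of f via the reproducing property: ||f||^2 = sum_ij c_i c_j k(x_i, x_j).
norm_f = math.sqrt(
    sum(
        ci * cj * rbf_kernel(xi, xj)
        for ci, xi in zip(coeffs, centers)
        for cj, xj in zip(coeffs, centers)
    )
)

# Check |f(x)| <= sqrt(k(x, x)) * ||f|| on a grid of evaluation points.
test_points = [-3.0 + 0.25 * i for i in range(25)]
bound_holds = all(
    abs(f(x)) <= math.sqrt(rbf_kernel(x, x)) * norm_f + 1e-12 for x in test_points
)
```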

It is not difficult to prove that, in a RKHS $H$, the
representation of a bounded linear functional $A$ is simply
$A k(\cdot, x)$, i.e., it is obtained by applying $A$ to the
representer of $L_x$. As an example, take the functional
evaluating the derivative of $f$ at $x$

\[
D_x : f \mapsto f'(x) .
\]

If $D_x$ is bounded on $H$, then the property implies that

\[
f'(x) = \langle f, k'(\cdot, x) \rangle, \quad \forall f \in H ,
\]

where $k'(\cdot, x)$ is the derivative of the function $k(\cdot, x)$.


32.6.3 Equivalence Between 
the Two Notions 


Moore-Aronszajn Theorem 
If we let

\[
\phi : X \to H, \qquad x \mapsto k(x, \cdot) , \tag{32.108}
\]

one can see that, in light of (32.98), the reproducing
kernel $k$ is also a positive definite kernel. The converse
result, stating that a positive definite kernel is the
reproducing kernel of a HS of functions $(H, \langle \cdot, \cdot \rangle)$, is found
in the Moore-Aronszajn theorem [32.12]. This completes
the equivalence between positive definite kernels
and reproducing kernels.


Feature Maps and the Mercer Theorem 

Note that (32.108) is a first feature map associated to the 
kernel function k. Correspondingly, this shows that the 
RKHS H is a possible instance of the feature space. 
A different feature map can be given in view of the 
Mercer theorem [32.8, 121], which historically played 
a major role in the development of kernel methods. The 
theorem states that every positive definite kernel can be 
written as 


\[
k(x, y) = \sum_{i=1}^{\infty} \mu_i e_i(x) e_i(y) , \tag{32.109}
\]

where the series in the right-hand side of (32.109)
converges absolutely and uniformly, $(e_i)_i$ is an orthonormal
sequence of eigenfunctions, and $(\mu_i)_i$ is the corresponding
sequence of nonnegative eigenvalues such that for
some measure $\nu$

\[
\int k(x, y) e_i(y) \, d\nu(y) = \mu_i e_i(x) \quad \forall x \in X , \tag{32.110}
\]
\[
\int e_i(y) e_j(y) \, d\nu(y) = \delta_{ij} . \tag{32.111}
\]


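The eigenfunctions $(e_i)_i$ belong to the RKHS $(H, \langle \cdot, \cdot \rangle)$
associated to $k$. In fact, by (32.110) one has

\[
e_i(x) = \frac{1}{\mu_i} \int k(x, y) e_i(y) \, d\nu(y) , \tag{32.112}
\]

and therefore $e_i$ can be approximated by elements
in the span of $(k_x)_{x \in X}$ [32.36]. One can further see
that $(\sqrt{\mu_i} e_i)_i$ is an orthonormal basis for $H$; indeed one
has

\[
\begin{aligned}
\langle \sqrt{\mu_i} e_i, \sqrt{\mu_j} e_j \rangle
&\overset{\text{by (32.110)}}{=} \left\langle \frac{1}{\sqrt{\mu_i}} \int k(\cdot, y) e_i(y) \, d\nu(y), \sqrt{\mu_j} e_j \right\rangle \\
&= \frac{\sqrt{\mu_j}}{\sqrt{\mu_i}} \int \langle k(\cdot, y), e_j \rangle e_i(y) \, d\nu(y) \\
&= \frac{\sqrt{\mu_j}}{\sqrt{\mu_i}} \int e_j(y) e_i(y) \, d\nu(y) \\
&\overset{\text{by (32.111)}}{=} \frac{\sqrt{\mu_j}}{\sqrt{\mu_i}} \delta_{ij} = \delta_{ij} .
\end{aligned}
\tag{32.113}
\]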


Note that we considered the general case in which 
the expansion (32.109) involves infinitely many terms. 
However, there are positive definite kernels (e.g., the 
polynomial kernel) for which only finitely many eigen- 
values are nonzero. 

In light of the Mercer theorem one can see that a
different feature map is given by

\[
\phi : X \to F, \qquad x \mapsto \left( \sqrt{\mu_i} e_i(x) \right)_i . \tag{32.114}
\]

Note that $\phi$ maps $x$ into an infinite dimensional vector
with $i$-th entry $\phi_i(x) = \sqrt{\mu_i} e_i(x)$.
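
For kernels with a finite-dimensional feature space, such as the polynomial kernel mentioned above, a feature map can be written down explicitly. The sketch below is an illustration under stated assumptions: it uses the degree-2 kernel $k(x, y) = (xy + 1)^2$ on the real line and the explicit map $\phi(x) = (1, \sqrt{2}x, x^2)$, which is one valid feature map but not the Mercer eigen-expansion (32.114) itself. The function names and test pairs are hypothetical.

```python
import math

def poly_kernel(x, y):
    """Inhomogeneous polynomial kernel of degree 2 on the real line."""
    return (x * y + 1.0) ** 2

def feature_map(x):
    """Explicit 3-dimensional feature map with <phi(x), phi(y)> = (xy + 1)^2."""
    return (1.0, math.sqrt(2.0) * x, x * x)

def inner(u, v):
    """Euclidean inner product of two equally long tuples."""
    return sum(a * b for a, b in zip(u, v))

pairs = [(-1.5, 0.3), (0.0, 2.0), (1.2, 1.2), (-0.7, -2.4)]

# Largest discrepancy between the kernel and the feature-space inner product.
max_gap = max(
    abs(poly_kernel(x, y) - inner(feature_map(x), feature_map(y))) for x, y in pairs
)
```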

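Connecting Functional and Parametric View
One can see now that $\sum_i w_i \phi_i(x)$ in the primal model
(32.6) corresponds to the evaluation in $x$ of a function
$f$ in a RKHS. To see this, we start from decomposing $f$
according to the orthonormal basis $(\sqrt{\mu_i} e_i)_i$

\[
f = \sum_i \langle f, \sqrt{\mu_i} e_i \rangle \sqrt{\mu_i} e_i = \sum_i w_i \sqrt{\mu_i} e_i , \tag{32.115}
\]

where we let $w_i = \langle f, \sqrt{\mu_i} e_i \rangle$. Now one has

\[
\begin{aligned}
L_x f = \langle f, k(\cdot, x) \rangle
&= \left\langle \sum_i w_i \sqrt{\mu_i} e_i, k(\cdot, x) \right\rangle \\
&= \sum_i w_i \sqrt{\mu_i} \langle e_i, k(\cdot, x) \rangle = \sum_i w_i \sqrt{\mu_i} e_i(x) \\
&= \sum_i w_i \phi_i(x) ,
\end{aligned}
\tag{32.116}
\]

where we applied the reproducing property on $e_i$ and
used the definition of feature map (32.114). Additionally,
notice that one has

\[
\begin{aligned}
\|f\|^2 = \langle f, f \rangle
&= \left\langle \sum_i w_i \sqrt{\mu_i} e_i, \sum_j w_j \sqrt{\mu_j} e_j \right\rangle \\
&= \sum_i \sum_j w_i w_j \langle \sqrt{\mu_i} e_i, \sqrt{\mu_j} e_j \rangle \\
&\overset{\text{by (32.113)}}{=} \sum_i w_i^2 .
\end{aligned}
\tag{32.117}
\]

This shows that the penalty $w^\top w$, used within the
primal problems of Sect. 32.3, can be connected to
the squared norm of a function. The interested reader
is referred to [32.14, 36] for additional properties of
kernels.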


32.6.4 Kernels for Structured Data 


In applications where data are well represented by vec- 
tors in a Euclidean space one usually uses the Gaussian 
RBF kernel, which is universal [32.122]. Nonetheless, 
there exists an entire set of rules according to which 
one can design new kernels from elementary positive 
definite functions [32.1]. Although the idea of kernels 
has been around for a long time, it was only in the 
1990s that the machine learning community started 
to realize that the index set X does not need to be 
(a subset of) some Euclidean space. This significantly 
improved the applicability of kernel-based algorithms 
to a broad range of data types, including sequence 
data, graphs and trees, and XML and HTML docu- 
ments [32.123]. 


Probabilistic Kernels 

One powerful approach consists of applying a kernel 
that brings generative models into a (possibly discrimi- 
native) kernel-based method [32.124, 125]. Generative 
models can deal naturally with missing data and in 
the case of hidden Markov models can handle se- 
quences of varying length. A popular probabilistic 
similarity measure is the Fisher kernel [32.126, 127].
The key intuition behind this approach is that similarly 
structured objects should induce similar log-likelihood 
gradients in the parameters of a predefined class of 
generative models [32.126]. Different instances exist, 
depending on the generative model of interest, see 
also [32.1].

Graph Kernels and Dynamical Systems

Graphs can very naturally represent entities, their
attributes, and their relationships to other entities; this
makes them one of the most widely used tools for
modeling structured data. Various types of kernels for
graphs have been proposed, see [32.128, 129] and
references therein. The approach can be extended to
recognition and decision tasks involving dynamical
systems; in fact, kernels on dynamical systems are
related to graph kernels through the dynamics of
random walks [32.128, 130].


Tensors and Kernels
Tensors are multidimensional arrays that represent
higher-order generalizations of vectors and matrices.
Tensor-based methods are often particularly effective
in low signal-to-noise ratios and when the number
of observations is small in comparison with the
dimensionality of the data. They are used in domains
ranging from neuroscience to vision and chemometrics,
where tensors best capture the multiway nature
of the data [32.131]. The authors of [32.132] proposed
a family of kernels that exploit the algebraic
structure of data tensors. The approach is related to
a generalization of the singular value decomposition
(SVD) to higher-order tensors [32.133]. The essential
idea is to measure similarity based upon a Grassmannian
distance of the subspaces spanned by matrix
unfolding of data tensors. It can be shown that
the approach leads to perfect separation of tensors
generated by different sets of rank-1 atoms [32.132].
Within this framework, [32.134] proposed a kernel
function for multichannel signals; the idea exploits
the spectral information of tensors of fourth-order
cross-cumulants associated to each multichannel
signal.


32.7 Applications


Kernel methods have been shown to be successful in
many different applications. In this section we mention
only a few examples.


32.7.1 Text Categorization


Recognition of objects and handwritten digits is studied
in [32.135-137]; natural language text categorization is
discussed in [32.138, 139]. The task consists of
classifying documents based on their content. Attribute value
representation of text is used to adequately represent the
document text; typically, each distinct word in a
document represents a feature with values corresponding to
the number of occurrences.


32.7.2 Time-Series Analysis


The use of kernel methods for time-series prediction has
been discussed in a number of papers [32.140-144],
with applications ranging from electric load
forecasting [32.145] to financial time series prediction [32.146].
Nonlinear system identification by LS-SVMs is
discussed in [32.26] and references therein; [32.134]
studies the problem of training a discriminative classifier
given a set of labeled multivariate time series.
Applications include brain decoding tasks based on
magnetoencephalography (MEG) recordings.


32.7.3 Bioinformatics and Biomedical 
Applications 


Gene expression analysis performed by SVMs is dis- 
cussed in [32.147]. Applications in metabolomics, ge- 
netics, and proteomics are presented in the tutorial 
paper [32.148]; [32.149] discussed different techniques 
for the integration of side information in models based 
on gene expression data to improve the accuracy of 
diagnosis and prognosis in cancer; [32.150] provides 
an introduction to general data fusion problems using 
SVMs with application to computational biology prob- 
lems. Detection of remote protein homologies by SVMs 
is discussed in [32.151], which combines discrimi- 
native methods with generative models. Bioengineer- 
ing and bioinformatics applications can also be found 
in [32.152-154]. Survival analysis based on primal- 
dual techniques is discussed in [32.155, 156].


33. Neurodynamics


This chapter introduces basic concepts, phenomena,
and properties of neurodynamic systems.
It consists of four sections, the first two
on various neurodynamic behaviors of general
neurodynamics and the last two on two
types of specific neurodynamic systems. The
neurodynamic behaviors discussed in the first
two sections include attractivity, oscillation,
synchronization, and chaos. The two specific
neurodynamic systems are memristive neurodynamic
systems and neurodynamic optimization
systems.


33.1 Dynamics of Attractor and Analog Networks
33.1.1 Phase Space and Attractors
33.1.2 Single Attractors of Dynamical Systems
33.1.3 Multiple Attractors of Dynamical Systems
33.1.4 Conclusion
33.2 Synchrony, Oscillations, and Chaos in Neural Networks
33.2.1 Synchronization
33.2.2 Oscillations in Neural Networks
33.2.3 Chaotic Neural Networks
33.3 Memristive Neurodynamics
33.3.1 Memristor-Based Synapses
33.3.2 Memristor-Based Neural Networks
33.3.3 Conclusion
33.4 Neurodynamic Optimization
33.4.1 Neurodynamic Models
33.4.2 Design Methods
33.4.3 Selected Applications
33.4.4 Concluding Remarks
References


33.1 Dynamics of Attractor and Analog Networks 


An attractor, a well-known mathematical object,
is central to the theory of nonlinear dynamical
systems (NDS), which is one of the indispensable
conceptual underpinnings of complexity science. An
attractor is a set towards which the state of a nonlinear
dynamical system moves as it evolves over time:
points that get close enough to the attractor remain
close even if they are slightly disturbed. To appreciate
what an attractor is, some related NDS notions are
needed, such as phase or state space, phase portraits,
basins of attraction, initial conditions, transients,
bifurcations, chaos, and strange attractors; these notions
help to tame some of the unruliness of complex systems.

Most of us have at least some inkling of what non- 
linear means, which can be illustrated by the most 
well-known and vivid example of the butterfly effect 
of a chaotic system that is nonlinear. It has prompted 


the use of the image of tiny air currents produced 
by a butterfly flapping its wing in Brazil, which are 
then amplified to the extent that they may influence 
the building up of a thunderhead in Kansas. Although 
no one can actually claim that there is such a linkage 
between Brazilian lepidopterological dynamics and cli- 
matology in the Midwest of the USA, it does serve to 
vividly portray nonlinearity in the extreme. 

Owing to both the nonlinearity and the capacity to
pass through different regimes of stability and
instability, the outcomes of a nonlinear dynamical
system are unpredictable. These different regimes of
a dynamical system are understood as different phases
governed by different attractors, which means that the
dynamics of each phase of a dynamical system are
constrained within the circumscribed range allowable
by that phase's attractors.


33.1.1 Phase Space and Attractors


To better grasp the idea of phase space, both a time
series chart and a phase portrait can be used to represent
the data points. A time series chart displays the values
of the variables on the y-axis (or the z-axis) and time
on the x-axis. The phase portrait, in contrast, plots the
variables against each other and leaves time as an
implicit dimension that is not explicitly plotted.
Attractors appear in phase portraits as the long-term
stable sets of points of the dynamical system, that is,
the locations in the phase portrait towards which the
system's dynamics are attracted after transient
phenomena have died down. To illustrate phase space
and attractors, two examples are employed.


[Figure labels: frictionless pivot, massless rod, amplitude]


Imagine a child on a swing and a parent pulling
the swing back and giving it a good push to make the
child move forward. When the child is not moving
forward, he moves backward on the swing, as shown
in Fig. 33.1. Left unpushed, the swing will come to rest,
as shown in both the time series chart and the phase
space. The time series shows an oscillation of the speed
of the swing, which slows down and eventually stops,
i.e., the curve flattens out. In phase space, the swing's
speed is plotted against the distance of the swing from
the central point. This central point is called a fixed
point attractor, since it attracts the system's dynamics
in the long run. The fixed point attractor in the center
of Fig. 33.2 is equivalent to the flat line in Fig. 33.3.
The fixed point attractor is another way to see and say
that an unpushed swing will come to a state of rest in
the long term. The curved lines with arrows spiraling
down to the center point in Fig. 33.2 display what is
called the basin of attraction for the unpushed swing.
The basin of attraction comprises the various initial
conditions of the unpushed swing, such as starting
heights and initial velocities.

Fig. 33.2 Phase portrait and fixed point attractor of an
unpushed swing
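
The behavior of the unpushed swing can be reproduced numerically. The following is a minimal sketch, assuming the swing is modeled as a damped pendulum; the damping coefficient, the ratio `g_over_l`, the step size, and the three initial conditions are illustrative assumptions that do not come from the text. Each run starts from a different point in the basin of attraction and ends near the fixed point (angle 0, angular velocity 0).

```python
import math

def simulate_swing(theta0, omega0, damping=0.5, g_over_l=9.81, dt=0.001, t_end=60.0):
    """Integrate theta'' = -damping * theta' - g_over_l * sin(theta)."""
    theta, omega = theta0, omega0
    for _ in range(int(round(t_end / dt))):
        # Semi-implicit Euler: update the velocity first, then the angle.
        omega += dt * (-damping * omega - g_over_l * math.sin(theta))
        theta += dt * omega
    return theta, omega

# Different starting heights and initial velocities (hypothetical values).
initial_conditions = [(0.5, 0.0), (-1.0, 0.0), (0.2, 1.5)]
final_states = [simulate_swing(t0, w0) for t0, w0 in initial_conditions]
```

Plotting the intermediate `(theta, omega)` pairs of one run against each other would give a spiral of the kind described for the phase portrait.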


Fig. 33.4 Time series chart of the pushed swing 




Now consider a similar dynamical system in which
the swing is pushed each time it comes back to where
the parent is standing. The time series chart of the
pushed swing is shown in Fig. 33.4 as a continuing
oscillation. This oscillation is around a zero value for y;
it is positive when the swing is going in one direction
and negative when the swing is going in the other
direction. The corresponding phase space diagram, in
which the state variables are plotted against each other,
is shown in Fig. 33.5. The unbroken oval in Fig. 33.5
is a different kind of attractor from the fixed point one
in Fig. 33.2. This attractor is well known as a limit cycle
or periodic attractor of a pushed swing. It is called
a limit cycle because it represents the cyclical behavior
of the oscillations of the pushed swing as a limit to
which the dynamical system adheres under the sway of
this attractor. It is periodic because the system
oscillates around the same values: the swing keeps
going up and down, each time reaching the same height
above its lowest point. Such a dynamical system can be
called periodic, for it has a repeating cycle or pattern.
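
A limit cycle can be observed numerically in a simple self-sustained oscillator. The sketch below uses the Van der Pol oscillator as a stand-in; this is an assumption made for illustration and not a model of the pushed swing itself, and the parameter `mu`, the step size, and the initial conditions are hypothetical. One trajectory starts inside the cycle and one outside, and both settle on oscillations of the same amplitude.

```python
def van_der_pol(state, mu=1.0):
    """Right-hand side of x'' - mu * (1 - x^2) * x' + x = 0 as a first-order system."""
    x, v = state
    return (v, mu * (1.0 - x * x) * v - x)

def rk4_step(f, state, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(
        s + dt * (a + 2.0 * b + 2.0 * c + d) / 6.0
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )

def late_amplitude(x0, v0, dt=0.01, t_end=100.0, t_measure=80.0):
    """Largest |x| observed after the transient, i.e., for t >= t_measure."""
    state = (x0, v0)
    amplitude = 0.0
    for i in range(int(round(t_end / dt))):
        state = rk4_step(van_der_pol, state, dt)
        if (i + 1) * dt >= t_measure:
            amplitude = max(amplitude, abs(state[0]))
    return amplitude

amp_inside = late_amplitude(0.1, 0.0)   # starts inside the limit cycle
amp_outside = late_amplitude(4.0, 0.0)  # starts outside the limit cycle
```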

What we have learned about attractors so far can
be summarized as follows. Attractors are spatially
displayed in phase portraits of a dynamical system as
it changes over the course of time; they represent the
long-term dynamics of the system, so that, whatever
the initial conditions represented as data points are,
trajectories in phase space that start within the basin
of attraction are attracted to the attractor. In spite of
wide usage in mathematics and science, as Robinson
points out, there is still no precise definition of an
attractor, although many have been offered [33.2].
He therefore suggests thinking about an attractor as
a phase portrait that attracts a large set of initial
conditions and has some sort of minimality property,
i.e., it is the smallest such portrait in the phase space of
the system. The attractor has the property of attracting
the initial conditions after any initial transient behavior
has died down. The minimality requirement implies the
invariance or stability of the attractor. As a minimal
object, the attractor cannot be split up into smaller
subsets, and it retains its role as what dominates
a dynamical system during a particular phase of its
evolution.

Fig. 33.5 Phase portrait and limit cycle attractor of
a pushed swing (after [33.1])


33.1.2 Single Attractors 
of Dynamical Systems 


Standard methods for the study of stability of dynamical
systems with a unique attractor include the Lyapunov
method, LaSalle's invariance principle, and
combinations thereof. Usually, given the properties of
a (unique) attractor, we can realize a dynamical system
with such an attractor.

Since the creation of the fundamental theorems of
Lyapunov stability, many researchers have gone further
and proved that most of the fundamental Lyapunov
theorems admit converses. In theory, this demonstrates
that the theorems are efficacious; i.e., a corresponding
Lyapunov function necessarily exists if the solution has
some kind of stability. However, the construction of an
appropriate V function for determining stability remains
of interest to researchers, because the gap between the
existence of a Lyapunov function and its construction
is large, and there is no general rule for the
construction. Different researchers use different
methods for constructing Lyapunov functions, based on
their experience and technique. Those who can
construct a good Lyapunov function obtain more
useful information to demonstrate the effectiveness of
their theories. Certainly, many successful Lyapunov
functions have a practical background. For example,
some equations derived from a physical model have
a clear physical meaning, such as a conservative
mechanical system, in which the sum of the kinetic
energy and potential energy is an appropriate
V function. The linear approximation method can also
be used: for a nonlinear differential equation, first find
a positive definite quadratic-form V function for the
corresponding linear differential equation, and then
take the nonlinear terms into account for
the construction of a similar V function.
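
The use of total energy as a V function can be checked numerically on a simple mechanical system. The sketch below is an illustration under stated assumptions: it takes a unit-mass spring with added viscous damping, where the stiffness `k`, the damping `c`, and the initial state are hypothetical values. Along the trajectory, $V = \tfrac{1}{2} v^2 + \tfrac{1}{2} k x^2$ satisfies $\dot{V} = -c v^2 \leq 0$, so the recorded energies should not increase.

```python
def energy(x, v, k=1.0):
    """Candidate Lyapunov function: kinetic plus potential energy (unit mass)."""
    return 0.5 * v * v + 0.5 * k * x * x

def deriv(x, v, k=1.0, c=0.4):
    """Unit-mass spring with stiffness k and viscous damping c (both illustrative)."""
    return v, -k * x - c * v

def rk4_step(x, v, dt):
    """One classical fourth-order Runge-Kutta step for the state (x, v)."""
    k1x, k1v = deriv(x, v)
    k2x, k2v = deriv(x + 0.5 * dt * k1x, v + 0.5 * dt * k1v)
    k3x, k3v = deriv(x + 0.5 * dt * k2x, v + 0.5 * dt * k2v)
    k4x, k4v = deriv(x + dt * k3x, v + dt * k3v)
    x_new = x + dt * (k1x + 2.0 * k2x + 2.0 * k3x + k4x) / 6.0
    v_new = v + dt * (k1v + 2.0 * k2v + 2.0 * k3v + k4v) / 6.0
    return x_new, v_new

x, v, dt = 1.5, 0.0, 0.01
energies = [energy(x, v)]
for _ in range(5000):
    x, v = rk4_step(x, v, dt)
    energies.append(energy(x, v))

# Largest one-step change of V along the trajectory; it should not be positive.
max_increase = max(b - a for a, b in zip(energies, energies[1:]))
```

Without damping (`c = 0`) the same V would stay constant along trajectories, which corresponds to stability but not asymptotic stability.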

Grossberg proposed and studied additive neural net-
works because they add nonlinear contributions to the
neuron activity. The additive neural network has been
used for many applications since the 1960s [33.3, 4], 
including the introduction of self-organizing maps. In 
the past decades, neural networks as a special kind 
of nonlinear systems have received considerable at- 
tention. The study of recurrent neural networks with 
their various generalizations has been an active research 
area [33.5-17]. The stability of recurrent neural net- 
works is a prerequisite for almost all neural network 
applications. Stability analysis is primarily concerned 
with the existence and uniqueness of equilibrium points 
and global asymptotic stability, global exponential sta- 
bility, and global robust stability of neural networks at 
equilibria. In recent years, the stability analysis of re- 
current neural networks with time delays has received 
much attention [33.18, 19]. Single attractors of dynam- 
ical systems are shown in Fig. 33.6. 


33.1.3 Multiple Attractors 
of Dynamical Systems 


Multistable systems have attracted extensive interest in
both modeling studies and neurobiological research in
recent years due to their ability to emulate and explain
biological behavior [33.20-34]. Mathematically,
multistability allows the system to have multiple fixed
points and periodic orbits. As noted in [33.35], more
than 25 years of experimental and theoretical work has
indicated that the onset of oscillations in neurons and
in neuron populations is characterized by
multistability.

Fig. 33.7 Two limit cycle attractors of dynamical systems

Multistability analysis is different from monosta- 
bility analysis. In monostability analysis, the objective 
is to derive conditions that guarantee that each nonlin- 
ear system contains only one equilibrium point, and all 
the trajectories of the neural network converge to it. 
Whereas in multistability analysis, nonlinear systems 
are allowed to have multiple equilibrium points. Sta- 
ble and unstable equilibrium points, and even periodic 
trajectories may co-exist in a multistable system. 

The methods for studying the stability of dynamical
systems with a unique attractor include the Lyapunov
method, LaSalle's invariance principle, and the
combination of the two methods. A unique attractor can
be realized by one dynamical system, but realizing
multiple attractors with one dynamical system, or with
several dynamical systems, is much more complicated
because of the compatibility, agreement, and behavior
optimization required among the systems. Generally,
the usual global stability conditions are not adequately
applicable to multistable systems. The latest results on
multistability of neural networks can be found in
[33.36-52]. It is shown in [33.45, 46] that n-neuron
recurrent neural networks with a one-step piecewise
linear activation function can have $2^n$ locally
exponentially stable equilibrium points located in
saturation regions by partitioning
the state space into $2^n$ subspaces. In [33.47],
multistability of almost periodic solutions of recurrently
connected neural networks with delays is investigated.

Fig. 33.8 $2^n$ equilibrium point attractors of dynamical
systems

In [33.48], by constructing a Lyapunov functional and
using matrix inequality techniques, a delay-dependent
multistability criterion for recurrent neural networks is
derived. In [33.49], neural networks with a class
of nondecreasing piecewise linear activation functions
with 2r corner points are considered. It is proved that,
under some conditions, the n-neuron dynamical
systems have exactly $(2r+1)^n$ equilibria, of which
$(r+1)^n$ are locally exponentially stable and the others
are unstable. In [33.50], some multistability properties
for a class of bidirectional associative memory
recurrent neural networks with unsaturating piecewise
linear transfer functions are studied based on local
inhibition. In [33.51], for two classes of general
activation functions, multistability of competitive
neural networks with time-varying and distributed
delays is investigated by formulating parameter
conditions and using inequality techniques. In [33.52],
the existence of $2^n$ stable stationary solutions for
general n-dimensional delayed neural networks with
several classes of activation functions is presented
through formulating parameter conditions motivated by
a geometrical observation. Two limit cycle attractors
and $2^n$ equilibrium point attractors of dynamical
systems are shown in Figs. 33.7 and 33.8,
respectively.
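
The coexistence of several stable equilibria can be seen in a very small example. The sketch below is a toy illustration and not one of the networks analyzed in the cited references: it assumes n decoupled neurons obeying $\dot{x}_i = -x_i + w f(x_i)$ with a saturating piecewise linear activation $f$ with two corner points (r = 1) and a hypothetical self-weight $w = 2$. Each neuron then has the equilibria $-2$, $0$, and $2$, so the network has $3^n$ equilibria, of which the $2^n$ located in the saturation regions are stable.

```python
import itertools

def activation(x):
    """Saturating piecewise linear activation with two corner points (r = 1)."""
    return max(-1.0, min(1.0, x))

def simulate(x0, weight=2.0, dt=0.01, t_end=20.0):
    """Euler integration of dx_i/dt = -x_i + weight * f(x_i) for each neuron."""
    x = list(x0)
    for _ in range(int(round(t_end / dt))):
        x = [xi + dt * (-xi + weight * activation(xi)) for xi in x]
    return tuple(round(xi, 6) for xi in x)

n = 2
# Initial conditions on a grid that avoids the unstable equilibrium at 0.
starts = itertools.product([-1.5, -0.5, 0.5, 1.5], repeat=n)
attractors = {simulate(x0) for x0 in starts}
```

For `n = 2` the set `attractors` collects the four stable equilibria with coordinates $\pm 2$, one per saturation region.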


33.1.4 Conclusion 


In this section, we briefly introduced phase space and
attractors and summarized the main properties of
attractors. Furthermore, dynamical systems with
a single attractor and with multiple attractors were
discussed.


33.2 Synchrony, Oscillations, and Chaos in Neural Networks 


33.2.1 Synchronization 


Biological Significance of Synchronization 
Neurodynamics deals with dynamic changes of neu- 
ral properties and behaviors in time and space at 
different levels of hierarchy in neural systems. The 
characteristic spiking dynamics of individual neurons 
is of fundamental importance. In large-scale systems, 
such as biological neural networks and brains with
billions of neurons, the interaction among the
connected neural components is crucial in determining
collective properties. In particular, synchronization
plays a critical role in higher cognition and conscious
experience [33.53-57]. Large-scale synchronization
of neuronal activity arising from intrinsic
asynchronous oscillations in local electrical circuitries
of neurons is at the root of cognition. Synchronization
at the level of neural populations is characterized
next.

There are various dynamic behaviors of potential 
interest for neural systems. In the simplest case, the 
system behavior converges to a fixed point, when all 
major variables remain unchanged. A more interesting 
dynamic behavior emerges when the system behavior 
periodically repeats itself at period T, which will be 
described first. Such periodic oscillations are common 
in neural networks and are often caused by the pres- 
ence of inhibitory neurons and inhibitory neural pop- 
ulations. Another behavior emerges when the system 


neither converges to a fixed point nor exhibits
periodic oscillations; rather, it maintains highly complex,
chaotic dynamics. Chaos can be a microscopic effect at
the cellular level or a mesoscopic dynamics of neural
populations or cortical regions. At the highest level
of hierarchy, chaos can emerge as the result of
large-scale, macroscopic effects across cortical areas in
the brain.

Considering the temporal dynamics of a system 
of interacting neural units, limit cycle oscillations and 
chaotic dynamics are of importance. Synchronization 
in limit cycle oscillations is considered first, which il- 
lustrates the basic principles of synchronization. The 
extension to more complex (chaotic) dynamics is de- 
scribed in Sect. 33.2.3. Limit cycle dynamics is de- 
scribed as a cyclic repetition of the system’s behavior 
at a given time period T. The cyclic repetition covers 
all characteristics of the system, e.g., microscopic cur- 
rents, potentials, and dynamic variables; see, e.g., the 
Hodgkin—Huxley model of neurons [33.58]. Limit cy- 
cle oscillations can be described as a cyclic loop of the 
system trajectory in the space of all variables. The state 
of the system is given as a point on this trajectory at any 
given time instant. As time evolves, the point belong- 
ing to the system traverses along the trajectory. Due to 
the periodic nature of the movement, the points
describing the system at time $t$ and $t + T$ coincide fully. We
can define a convenient reference system by selecting
a center point of the trajectory and describe the
motion as the vector pointing from the center to the actual
state on the trajectory. This vector has an amplitude and
phase in a suitable coordinate system, denoted as $\xi(t)$
and $\Phi(t)$, respectively. The evolution of the phase in an
isolated oscillator with frequency $\omega_0$ can be given as
follows

\[
\frac{d\Phi(t)}{dt} = \omega_0 . \tag{33.1}
\]


Several types of synchronization can be defined.
The strongest synchronization takes place when two
(or multiple) units have identical behaviors. Considering
limit cycle dynamics, strong synchronization means
that the oscillation amplitude and phase are the same
for all units. This means complete synchrony. An
example of two periodic oscillators is given by the clocks
shown in Fig. 33.9a-c [33.59]. Strong synchronization
means that the two pendulums are connected with
a rigid object forcing them to move together. The lack of
connection between the two pendulums means the
absence of synchronization, i.e., they move completely
independently. An intermediate level of synchrony may
arise with weak coupling between the pendulums, such
as a spring or a flexible band. Phase synchrony takes
place when the amplitudes are not the same, but the
phases of the oscillations still coincide.
Figure 33.9b-d depicts the case of out-of-phase synchrony,
when the phases of the two oscillators are exactly the
opposite.


Amplitude Measures of Synchrony 
Denote by $a_j(t)$ the time signal produced by the individual
units (neurons), $j = 1, \ldots, N$. The overall signal
of the interacting units, $A(t)$, is determined as

$$A(t) = \frac{1}{N} \sum_{j=1}^{N} a_j(t) . \tag{33.2}$$

The variance of the time series $A(t)$ is given as follows

$$\sigma_A^2 = \langle A^2(t) \rangle - \langle A(t) \rangle^2 . \tag{33.3}$$

Here $\langle f(t) \rangle$ denotes time averaging over a given time
window. After determining the variance $\sigma_{a_i}^2$ of the individual
channels based on (33.3), the synchrony $\chi_N$ in
the system with $N$ components is defined as follows

$$\chi_N^2 = \frac{\sigma_A^2}{\frac{1}{N} \sum_{i=1}^{N} \sigma_{a_i}^2} . \tag{33.4}$$


Fig. 33.9a-d Synchronization in pendulums, in phase and out of phase (after [33.59]). Bottom plots: Illustration of 
periodic trajectories, case of in-phase (a-c) and out-of-phase oscillations (b-d) 
This synchrony measure has a nonzero value in synchronized
and partially synchronized systems, $0 < \chi_N \le 1$,
while $\chi_N = 0$ means the complete absence of synchrony
in neural networks [33.60].
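The amplitude measure of (33.2)-(33.4) can be sketched in a few lines. The following is a minimal illustration, assuming equally long, uniformly sampled signals stored as Python lists; the function name `synchrony_chi` and the sine test signals are our own choices for the example, not part of the cited work.

```python
import math
from statistics import fmean, pvariance


def synchrony_chi(signals):
    """Amplitude synchrony chi_N of equally long time series (one list per unit)."""
    n_units = len(signals)
    # Population signal A(t): average over the units at each time step, as in (33.2).
    population = [fmean(column) for column in zip(*signals)]
    # Time variance of A(t), as in (33.3), and the mean variance of the single units.
    var_population = pvariance(population)
    mean_var_units = sum(pvariance(s) for s in signals) / n_units
    # Square root of the variance ratio in (33.4).
    return math.sqrt(var_population / mean_var_units)


# Illustrative data: identical units versus two units that cancel each other.
times = [0.01 * k for k in range(1000)]
in_phase = [[math.sin(2 * math.pi * t) for t in times] for _ in range(4)]
anti_phase = [[sign * math.sin(2 * math.pi * t) for t in times] for sign in (1, -1)]
print(synchrony_chi(in_phase), synchrony_chi(anti_phase))
```

Identical units give a value close to 1, whereas two units oscillating in exact anti-phase cancel in $A(t)$ and give a value close to 0, which shows why an amplitude measure alone can miss phase relationships.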

Fourier transform-based signal processing methods
are very useful for the characterization of synchrony
in time series, and they are widely used in neural network
analysis. The Fourier transform makes important
assumptions about the analyzed time series, including
stationary or slowly changing statistical characteristics and
ergodicity. In many applications these approximations
are appropriate. In analyzing large-scale synchrony in
brain signals, however, alternative methods are also
justified. Relevant approaches include the Hilbert transform
for rapidly changing brain signals [33.61, 62].
Here both Fourier and Hilbert-based methods are outlined
and avenues for their applications in neural networks
are indicated. Define the cross correlation function
(CCF) between discretely sampled time series $x_i(t)$
and $x_j(t)$, $t = 1, \ldots, T$, as follows

$$\mathrm{CCF}_{ij}(\tau) = \frac{1}{T - \tau} \sum_{t=1}^{T-\tau} \left[ x_i(t + \tau) - \langle x_i \rangle \right] \left[ x_j(t) - \langle x_j \rangle \right] . \tag{33.5}$$


Here $\langle x_i \rangle$ is the mean of the signal over period $T$,
and it is assumed that $x_i(t)$ is normalized to unit variance.
For completely correlated pairs of signals, the
maximum of the cross correlation is 1; for uncorrelated
signals it equals 0. The cross power spectral density
$\mathrm{CPSD}_{ij}(\omega)$, cross spectrum for short, is defined as the
Fourier transform of the cross correlation:
$\mathrm{CPSD}_{ij}(\omega) = \mathcal{F}(\mathrm{CCF}_{ij}(\tau))$. If $i = j$, i.e., the two channels
coincide, then we talk about autocorrelation and
auto power spectral density $\mathrm{APSD}_{ii}(\omega)$; for details of
Fourier analysis, see [33.63]. Coherence $\gamma^2$ is defined
by normalizing the cross spectrum by the autospectra

$$\gamma_{ij}^2(\omega) = \frac{|\mathrm{CPSD}_{ij}(\omega)|^2}{|\mathrm{APSD}_{ii}(\omega)| \, |\mathrm{APSD}_{jj}(\omega)|} . \tag{33.6}$$

The coherence satisfies $0 \le \gamma_{ij}^2(\omega) \le 1$ and it contains
useful information on the frequency content of
the synchronization between signals. If coherence is
close to unity at some frequencies, it means that the
two signals are closely related or synchronized; a coherence
near zero means the absence of synchrony at
those frequencies. Coherence functions provide useful
information on synchrony in brain signals at various
frequency bands [33.64]. Information-theoretical
characterizations are also available, including mutual information
and entropy measures.
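The cross correlation in (33.5) can be sketched as follows. This is a minimal illustration under our own assumptions: signals are plain Python lists, only non-negative lags are handled, and the helper names `normalize` and `cross_correlation` are hypothetical. The sum is averaged over the number of overlapping samples so that fully correlated signals give a maximum of 1, as stated in the text.

```python
import math


def normalize(x):
    """Shift a series to zero mean and scale it to unit variance."""
    mean = sum(x) / len(x)
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))
    return [(v - mean) / std for v in x]


def cross_correlation(x_i, x_j, lag):
    """CCF at a non-negative lag: average of x_i(t + lag) * x_j(t) over the overlap."""
    a, b = normalize(x_i), normalize(x_j)
    n_terms = len(a) - lag
    return sum(a[t + lag] * b[t] for t in range(n_terms)) / n_terms
```

Scanning `lag` over a range of values and locating the maximum gives a simple estimate of the delay between two channels.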


Phase Synchronization 
If the components of the neural network are weakly
interacting, the synchrony evaluated using the amplitude
measure $\chi_N$ in (33.4) may be low. There can still
be a meaningful synchronization effect in the system, 
based on phase measures. Phase synchronization is de- 
fined as the global entrainment of the phases [33.65], 
which means a collective adjustment of their rhythms 
due to their weak interaction. At the same time, in sys- 
tems with phase synchronization the amplitudes need 
not be synchronized. Phase synchronization is often 
observed in complex chaotic systems and it has been 
identified in biological neural networks [33.61, 65]. 

In complex systems, the trajectory of the system
in the phase space is often very convoluted. The approach
described in (33.1), i.e., choosing a center point
for the oscillating cycle in the phase space with natural
frequency $\omega_0$, can be nontrivial in chaotic systems. In
such cases, the Hilbert transform-based approach can
provide a useful tool for the characterization of phase
synchrony. Hilbert analysis determines the analytic signal
and its instantaneous frequency, which can be used
to describe phase synchronization effects. Considering
a time series $s(t)$, its analytic signal $z(t)$ is defined as
follows [33.62]

$$z(t) = s(t) + i \hat{s}(t) = A(t) e^{i \Phi(t)} . \tag{33.7}$$

Here $A(t)$ is the analytic amplitude, $\Phi(t)$ is the analytic
phase, while $\hat{s}(t)$ is the Hilbert transform of $s(t)$,
given by

$$\hat{s}(t) = \frac{1}{\pi} \, \mathrm{PV} \int_{-\infty}^{+\infty} \frac{s(\tau)}{t - \tau} \, d\tau , \tag{33.8}$$

where PV stands for the Cauchy principal value of the integral.
The analytic signal
and its instantaneous phase can be determined for an
arbitrary broadband signal. However, the analytic signal
has a clear meaning only in a narrow frequency band;
therefore, a bandpass filter should precede the evaluation
of the analytic signal in data with broad frequency
content.
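A minimal sketch of the analytic signal in (33.7), assuming a uniformly sampled and already band-limited series. It uses the common discrete construction (zeroing the negative-frequency half of the spectrum) rather than evaluating the integral (33.8) directly. The names `dft` and `analytic_signal` are illustrative, and the naive transform is only meant for short demonstrations.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (adequate for a short demo)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[m] * cmath.exp(sign * 2j * math.pi * k * m / n) for m in range(n))
        out.append(acc / n if inverse else acc)
    return out


def analytic_signal(s):
    """Return z = s + i*H[s] by removing the negative-frequency half of the spectrum."""
    n = len(s)
    spectrum = dft(s)
    weights = [0.0] * n
    weights[0] = 1.0                          # keep the mean value
    if n % 2 == 0:
        weights[n // 2] = 1.0                 # keep the Nyquist bin
        weights[1:n // 2] = [2.0] * (n // 2 - 1)
    else:
        weights[1:(n + 1) // 2] = [2.0] * ((n - 1) // 2)
    return dft([w * c for w, c in zip(weights, spectrum)], inverse=True)


def amplitude_and_phase(z):
    """Analytic amplitude A(t) and wrapped analytic phase Phi(t) of each sample."""
    return [abs(v) for v in z], [cmath.phase(v) for v in z]
```

For a pure cosine with an integer number of cycles in the window, the imaginary part of `analytic_signal` is the corresponding sine and the amplitude is constant; for measured data, the bandpass filtering mentioned above would have to be applied first.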

The Hilbert method of analytic signals is illustrated
using actual local field potentials measured in
rabbits with an array of chronically implanted intracranial
electrodes [33.67]. The signals have been filtered
in the theta band (3–7 Hz). An example of time series
$s(t)$ is shown in Fig. 33.10a. The Hilbert transform
$\hat{s}(t)$ is depicted in Fig. 33.10b in red, while blue
color shows $s(t)$. Figure 33.10c shows the analytic
phase $\Phi(t)$, and Fig. 33.10d depicts the analytic signal $z(t)$
in the complex plane. Figure 33.11 shows the
unwrapped instantaneous phase with bifurcating phase
curves indicating desynchronization at specific time
instances −1.3 s, −0.4 s, and 1 s. The plot on the right-hand
side of Fig. 33.11 depicts the evolution of the
instantaneous frequency in time. The frequency is around
5 Hz most of the time, indicating phase synchronization.
However, it has very large dispersion at a few
specific instances (desynchronization).

Fig. 33.10a-d Demonstration of the Hilbert analytic signal approach on electroencephalogram (EEG) signals (after [33.66]); (a) signal $s(t)$; (b) Hilbert transform $\hat{s}(t)$ (red) of signal $s(t)$ (blue); (c) instantaneous phase $\Phi(t)$; and (d) analytic signal $z(t)$ in the complex plane

Synchronization between channels $x$ and $y$ can be
measured using the phase lock value (PLV) defined as
follows [33.61]

$$\mathrm{PLV}_{xy}(t) = \left| \frac{1}{T} \int_{t-T/2}^{t+T/2} e^{i[\Phi_x(\tau) - \Phi_y(\tau)]} \, d\tau \right| . \tag{33.9}$$

PLV ranges from 0 to 1, where 1 indicates complete
phase locking. PLV defined in (33.9) determines an average
value over a time window of length $T$. Note that
PLV is a function of $t$ through the sliding window.
PLV is also a function of the frequency, which
is selected by the bandpass filter during the preprocessing
phase. By changing the frequency band and
time, the synchronization can be monitored under various
conditions. This method has been applied productively
in cognitive experiments [33.68].
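A discrete version of (33.9) replaces the integral by an average over the samples in the window. The sketch below assumes that the two phase series have already been extracted (for instance from analytic signals) and are equally sampled; the function names and the window handling are our own illustrative choices.

```python
import cmath


def phase_locking_value(phase_x, phase_y):
    """PLV of two equally sampled phase series over one analysis window."""
    unit_vectors = [cmath.exp(1j * (px - py)) for px, py in zip(phase_x, phase_y)]
    return abs(sum(unit_vectors) / len(unit_vectors))


def sliding_plv(phase_x, phase_y, window):
    """PLV as a function of time, using a sliding window of `window` samples."""
    return [
        phase_locking_value(phase_x[k:k + window], phase_y[k:k + window])
        for k in range(len(phase_x) - window + 1)
    ]
```

A constant phase difference yields a PLV of 1 regardless of the size of the offset, while a phase difference that drifts uniformly around the circle averages out to a value near 0.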


Synchronization-Desynchronization Transitions
Transitions between neurodynamic regimes with and
without synchronization have been observed and exploited
for cognitive monitoring. The Haken–Kelso–Bunz
(HKB) model is one of the prominent and elegant
approaches providing a theoretical framework for synchrony
switching, based on observations related
to bimanual coordination [33.69]. The HKB model
invokes the concepts of metastability and multistability
as fundamental properties of cognition. In the
experiment, the subjects were instructed to follow the
rhythm of a metronome with their index fingers in an
anti-phase manner. It was observed that, as the metronome
frequency increased, the subjects spontaneously
switched their anti-phase movement to in-phase at a certain
oscillation frequency and maintained it thereafter,
even if the metronome frequency was decreased again
below the given threshold.

Fig. 33.11a,b Illustration of instantaneous phases; (a) unwrapped phase with bifurcating phase curves indicating desynchronization at specific time instances −1.3 s, −0.4 s, and 1 s; (b) evolution of instantaneous frequency in time

The following simple equation is introduced to describe
the dynamics observed: $d\Delta\Phi/dt = -\sin(\Delta\Phi) - 2\varepsilon \sin(2\Delta\Phi)$.
Here $\Delta\Phi = \phi_1 - \phi_2$ is the phase difference
between the two finger movements, and the control parameter $\varepsilon$
is related to the inverse of the introduced oscillatory frequency.
The system dynamics is illustrated in Fig. 33.12
by the potential surface $V$, where stable fixed points
correspond to local minima. For low oscillatory frequencies
(high $\varepsilon$), there are stable equilibria at anti-phase
conditions. As the oscillatory frequency increases
(low $\varepsilon$), the dynamics transits to a state where only the
in-phase equilibrium is stable.
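The loss of the anti-phase state can be checked numerically. Linearizing the phase equation above around $\Delta\Phi = \pi$ gives the growth rate $1 - 4\varepsilon$, so the anti-phase equilibrium is stable only for $\varepsilon > 1/4$, whereas the in-phase state $\Delta\Phi = 0$ has rate $-1 - 4\varepsilon$ and is always stable. The sketch below integrates the equation with a forward-Euler scheme; the potential $V = -\cos(\Delta\Phi) - \varepsilon\cos(2\Delta\Phi)$ is the one whose negative gradient reproduces the right-hand side, and the step size, run length, and function names are illustrative assumptions.

```python
import math


def hkb_rhs(dphi, eps):
    """Right-hand side of the HKB phase equation quoted in the text."""
    return -math.sin(dphi) - 2.0 * eps * math.sin(2.0 * dphi)


def hkb_potential(dphi, eps):
    """Potential V satisfying d(dphi)/dt = -dV/d(dphi)."""
    return -math.cos(dphi) - eps * math.cos(2.0 * dphi)


def relax(dphi0, eps, dt=0.01, steps=20000):
    """Forward-Euler integration from dphi0; returns the final phase difference."""
    dphi = dphi0
    for _ in range(steps):
        dphi += dt * hkb_rhs(dphi, eps)
    return dphi


start = math.pi - 0.2                  # slightly perturbed anti-phase movement
print(relax(start, eps=1.0))           # slow movement: returns to anti-phase (pi)
print(relax(start, eps=0.1))           # fast movement: switches to in-phase (0)
```

Starting from the same slightly perturbed anti-phase state, the trajectory returns to $\pi$ for large $\varepsilon$ and relaxes to $0$ for small $\varepsilon$, mirroring the switch observed in the finger-movement experiment.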

Another practical example of synchrony-desyn- 
chrony transition in neural networks is given by image 
processing. An important basic task of neural networks 
is image segmentation, which is difficult to accom- 
plish with excitatory nodes only. There is evidence that 
biological neural networks use inhibitory connections 
for completing basic pattern separation and integration 
tasks [33.70]. Synchrony between interacting neurons 
may indicate the recognition of an input. A typical neu- 
ral network architecture implementing such a switch 
between synchronous and nonsynchronous states us- 
ing local excitation and global inhibition is shown in 
Fig. 33.13. This system uses amplitude difference to
measure synchronization between neighboring neurons.
Phase synchronization measures have been proposed as
well to accomplish the segmentation and recognition
tasks [33.71]. Phase synchronization provides a very
useful tool for learning and control of the oscillations
in weakly interacting neighborhoods.

Fig. 33.12 Illustration of the potential surface ($V$) of the HKB system as a function of the phase difference in radians $\Delta\Phi$ and inverse frequency $\varepsilon$. The transition from anti-phase to in-phase behavior is seen as the oscillation frequency increases ($\varepsilon$ decreases)

Fig. 33.13 Neural network with local excitation and a global inhibition node (black; after [33.70])


33.2.2 Oscillations in Neural Networks 


Oscillations in Brains 
The interaction between opposing tendencies in phys- 
ical and biological systems can lead to the onset of 
oscillations. Negative feedback between the system’s 
components plays an important role in generating os- 
cillations in electrical systems. Brains as large-scale 
bioelectrical networks consist of components oscillat- 
ing at various frequencies. The competition between 
inhibitory and excitatory neurons is a basic ingredi- 
ent of cortical oscillations. The intricate interaction 
between oscillators produces the amazingly rich oscil- 
lations that we experimentally observe as brain rhythms 
at multiple time scales [33.72, 73]. 

Oscillations occur in the brain at different time 
scales, starting from several milliseconds (high fre- 
quencies) to several seconds (low frequencies). One 
can distinguish between oscillatory components based 
on their frequency contents, including delta (1–4 Hz),
theta (4–7 Hz), alpha (7–12 Hz), beta (12–30 Hz), and
gamma (30–80 Hz) bands. This separation of
brain wave frequencies is somewhat arbitrary; however,
it can be used as a guideline to focus on various
activities. For example, higher cognitive functions are
broadly assumed to be manifested in oscillations in the
higher beta and gamma bands.

Brain oscillations take place in time and space. 
A large part of cognitive activity happens in the cortex, 
which is a convoluted surface of the six-layer cortical 
sheet of gyri and sulci. The spatial activity is organized 
on multiple scales as well, starting from the neuronal
level (μm), to granules (mm), cortical activities (several
cm), and the hemisphere-wide level (20 cm). The temporal
and spatial scales are not independent; rather, they
delicately interact and modulate each other during cognition.
Modern brain monitoring tools provide insight
into these complex space-time processes [33.74].


Characterization of Oscillatory Networks 
Oscillations in neural networks are synchronized activ- 
ities of populations of neurons at certain well-defined 
frequencies. Neural systems are often modeled as the
interaction of components which oscillate at specific,
well-defined frequencies. Oscillatory dynamics can correspond
to microscopic neurons, to mesoscopic
populations of tens of thousands of neurons, or to macroscopic
neural populations including billions of neurons.
Oscillations at the microscopic level have been thoroughly
studied using spiking neuron models, such as
the Hodgkin–Huxley (HH) equation. Here we focus on
populations of neurons, which have some natural os- 
cillation frequencies. It is meaningful to assume that 
the natural frequencies are not identical due to the 
diverse properties of populations in the cortex. Inter- 
estingly, the diversity of oscillations at the microscopic 
and mesoscopic levels can give rise to large-scale syn- 
chronous dynamics at higher levels. Such emergent 
oscillatory dynamics is the primary subject of this 
section. 

Consider $N$ coupled oscillators with natural frequencies
$\omega_j$, $j = 1, \ldots, N$. A measure of the synchronization
in such systems is given by parameter $R$,
which is often called the order parameter. This terminology
was introduced by Haken [33.75] to describe
the emergence of macroscopic order from disorder.
The time-varying order parameter $R(t)$ is defined
as [33.76]

$$R(t) = \left| \frac{1}{N} \sum_{j=1}^{N} e^{i \theta_j(t)} \right| , \tag{33.10}$$

where $\theta_j(t)$ denotes the phase of the $j$-th oscillator.

Order parameter R provides a useful synchroniza- 
tion measure for coupled oscillatory systems. A com- 
mon approach is to consider a globally coupled system, 
in which all the components interact with each other. 
This is the broadest possible level of interaction. The 
local coupling model represents just the other extreme 
limit, i.e., each node interacts with just a few others, 
which are called its direct neighbors. In a one-dimen- 
sional array, a node has two neighbors on its left and 
right, respectively (assuming periodic boundary con- 
ditions). In a two-dimensional lattice, a node has four 
direct neighbors, and so on. The size of the neighborhood
can be expanded, so that the connectivity in the
network becomes denser. Of special interest are
networks that have a mostly regular neighborhood
with some further neighbors added by a selection rule
from the whole network. The addition of remote or nonlocal
connections is called rewiring, and the networks
with rewiring are small-world networks. They have been
extensively studied in network theory [33.76–78]. Figure 33.14
illustrates local (top left) and global coupling
(bottom right), as well as intermediate coupling, with
the bottom left plot giving an example of a network with
random rewiring.
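The local and rewired connectivity structures described above can be sketched with a simple edge-set representation. The rewiring rule used here (each edge is replaced, with a fixed probability, by a link from the same node to a randomly chosen other node) is one common choice that we assume for illustration; the text does not prescribe a specific selection rule, and the function names are hypothetical.

```python
import random


def ring_lattice(n_nodes, k_each_side=1):
    """Undirected ring with periodic boundaries: k neighbours on each side of a node."""
    edges = set()
    for i in range(n_nodes):
        for step in range(1, k_each_side + 1):
            j = (i + step) % n_nodes
            edges.add((min(i, j), max(i, j)))
    return edges


def rewire(edges, n_nodes, probability, rng):
    """Replace each edge, with the given probability, by a link to a random other node."""
    result = set(edges)
    for (i, j) in sorted(edges):
        if rng.random() < probability:
            # Candidate targets: any node not already linked to i.
            candidates = [m for m in range(n_nodes)
                          if m != i and (min(i, m), max(i, m)) not in result]
            if candidates:
                result.remove((i, j))
                m = rng.choice(candidates)
                result.add((min(i, m), max(i, m)))
    return result


local = ring_lattice(20)                              # two neighbours per node
small_world = rewire(local, 20, 0.1, random.Random(0))
```

The number of links is preserved by construction, so the rewired network differs from the regular ring only in where a few of its connections point.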
Fig. 33.14a-d Network architectures with various connectivity structures: (a) local, (b) and (c) intermediate, and (d) global (mean-field) connectivity

The Kuramoto Model

The Kuramoto model [33.79] is a popular approach
to describe oscillatory neural systems. It implements
mean-field (global) coupling. The synchronization in
this model allows an analytical solution, which helps to
interpret the underlying dynamics in clear mathematical
terms [33.76]. Let $\theta_i$ and $\omega_i$ denote the phase and the
inherent frequency of the $i$-th oscillator. The oscillators
are coupled by a nonlinear interaction term depending
on their pair-wise phase differences. In the Kuramoto
model, the following sinusoidal coupling term has been
used to model neural systems

$$\frac{d\theta_i}{dt} = \omega_i + \frac{K}{N} \sum_{j=1}^{N} \sin(\theta_j - \theta_i) , \quad i = 1, \ldots, N . \tag{33.11}$$

Here $K$ denotes the coupling strength and $K = 0$
means no coupling. The system in (33.11) and its
generalizations have been studied extensively since
its first introduction by Kuramoto [33.79]. Kuramoto
used a Lorentzian distribution of the natural frequencies $\omega$, defined
as $L(\omega) = \gamma / \{\pi [\gamma^2 + (\omega - \omega_0)^2]\}$. This leads to the
asymptotic solution ($N \to \infty$ and $t \to \infty$) for the order
parameter $R$ in simple analytic terms

$$R = \sqrt{1 - K_c / K} \ \text{ if } K > K_c , \qquad R = 0 \ \text{ otherwise} . \tag{33.12}$$

Here $K_c$ denotes the critical coupling strength given
by $K_c = 2\gamma$. There is no synchronization between the
oscillators if $K < K_c$, and the synchronization becomes
stronger as $K$ increases at supercritical conditions $K > K_c$,
see Fig. 33.15. Inputs can be used to control synchronization,
i.e., a highly synchronized system can
be (partially) desynchronized by input stimuli [33.80,
81]. Alternatively, input stimuli can induce large-scale
synchrony in a system with a low level of synchrony, as
evidenced by cortical observations [33.82].
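The transition at the critical coupling can be explored with a small simulation. The sketch below is a finite-size, forward-Euler illustration under our own assumptions: 200 oscillators, natural frequencies placed at evenly spaced quantiles of a Lorentzian with $\gamma = 0.5$ and $\omega_0 = 0$ (so that $K_c = 1$), random initial phases, and illustrative function names. It uses the standard mean-field identity that the coupling sum in (33.11) equals $K R \sin(\psi - \theta_i)$, where $R e^{i\psi}$ is the mean phase vector.

```python
import cmath
import math
import random


def order_parameter(phases):
    """Magnitude R of the mean phase vector, as in (33.10)."""
    return abs(sum(cmath.exp(1j * th) for th in phases) / len(phases))


def simulate_kuramoto(coupling, n=200, gamma=0.5, dt=0.05, steps=2000, seed=0):
    """Euler integration of the mean-field model (33.11); returns the time-averaged R."""
    rng = random.Random(seed)
    # Natural frequencies: evenly spaced quantiles of a Lorentzian centred at 0.
    omega = [gamma * math.tan(math.pi * ((i + 0.5) / n - 0.5)) for i in range(n)]
    theta = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    r_values = []
    for step in range(steps):
        mean_field = sum(cmath.exp(1j * th) for th in theta) / n
        r, psi = abs(mean_field), cmath.phase(mean_field)
        # (K/N) * sum_j sin(theta_j - theta_i) equals K * R * sin(psi - theta_i).
        theta = [th + dt * (w + coupling * r * math.sin(psi - th))
                 for th, w in zip(theta, omega)]
        if step >= steps // 2:          # discard the first half as a transient
            r_values.append(r)
    return sum(r_values) / len(r_values)
```

Calling `simulate_kuramoto(0.5)` (below $K_c$) should give a small residual value of order $1/\sqrt{N}$, while `simulate_kuramoto(4.0)` should give a value in the vicinity of the asymptotic prediction $\sqrt{1 - 1/4} \approx 0.87$ of (33.12), with deviations expected from the finite size and the simple integrator.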


Neural Networks as Dynamical Systems 
A dynamical system is defined by its equation of motion,
which describes the location of the system as
a function of time $t$

$$\frac{dX(t, \lambda)}{dt} = F(X) , \quad X \in \mathbb{R}^n . \tag{33.13}$$

Here $X$ is the state vector describing the state of
the system in the $n$-dimensional Euclidean space, $X =
X(x_1, \ldots, x_n) \in \mathbb{R}^n$, and $\lambda$ is the vector of system parameters.
Proper initial conditions must be specified and it
is assumed that $F(X)$ is a sufficiently smooth nonlinear
function. In neural dynamics it is often assumed that
the state space is a smooth manifold, and the goal is to
study the evolution of the trajectory of $X(t)$ in the state
space as time varies along the interval $[t_0, T]$.

Fig. 33.15 Kuramoto model in the mean-field case. Dependence of order parameter $R$ on the coupling strength $K$. Below a critical value $K_c$, the order parameter is 0, indicating the absence of synchrony; synchrony emerges for $K$ above the critical value

The Cohen–Grossberg (CG) equation is a general
formulation of the motion of a neural network as a dynamical
system with distributed time delays in the
presence of inputs. The CG model has been studied
thoroughly in the past decades and it has served as a starting
point for various other approaches. The general form of
the CG model is [33.83]

$$\frac{dx_i(t)}{dt} = -a_i(x_i(t)) \left[ b_i(x_i(t)) - \sum_{j=1}^{N} a_{ij} f_j(x_j(t)) - \sum_{j=1}^{N} b_{ij} f_j(x_j(t - \tau_{ij})) + u_i \right] . \tag{33.14}$$

Here $X(t) = [x_1(t), x_2(t), \ldots, x_N(t)]^T$ is the state
vector describing a neural network with $N$ neurons.
Function $a_i(\cdot)$ describes the amplification, $b_i(\cdot)$ denotes
a properly behaved function to guarantee that the solution
remains bounded, $f_j(x)$ is the activation function, $u_i$
denotes external input, $a_{ij}$ and $b_{ij}$ are components of the
connection weight matrix and the delayed connection
weight matrix, respectively, and $\tau_{ij}$ describes the time
delays between neurons, $i, j = 1, \ldots, N$. The solution
of (33.14) can be determined after specifying suitable
initial conditions.

There are various approaches to guarantee the sta- 
bility of the CG equation as it approaches its equilibria 
under specific constraints. Global convergence assum- 
ing symmetry of the connectivity matrix has been 
shown [33.83]. The symmetric version of a simpli- 
fied CG model has become popular as the Hopfield 
or Hopfield–Tank model [33.84]. Dynamical properties
of the CG equation have been studied extensively,
including asymptotic stability, exponential stability, ro- 
bust stability, and stability of periodic bifurcations and 
chaos. Symmetry requirements for the connectivity ma- 
trix have been relaxed, still guaranteeing asymptotic 
stability [33.85]. CG equations can be employed to 
find the optimum solutions of a nonlinear optimization 
problem when global asymptotic stability guarantees 
the stability of the solution [33.86]. Global asymptotic 
stability of the CG neural network with time delay is 
studied using linear matrix inequalities (LMI). LMI is 
a fruitful approach for global exponential stability by 
constructing Lyapunov functions for broad classes of 
neural networks. 
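As a hedged illustration of convergence under a symmetric connectivity matrix, the sketch below integrates a Hopfield-type special case of (33.14). All simplifications are assumptions made only for this example: unit amplification $a_i = 1$, $b_i(x) = x$, no delayed connections, zero input, $\tanh$ as the activation function, and an arbitrary symmetric $2 \times 2$ weight matrix. The function names are illustrative.

```python
import math


def hopfield_step(x, weights, dt):
    """One Euler step of dx_i/dt = -x_i + sum_j w_ij * tanh(x_j)."""
    activations = [math.tanh(v) for v in x]
    return [
        xi + dt * (-xi + sum(w * a for w, a in zip(row, activations)))
        for xi, row in zip(x, weights)
    ]


def settle(x0, weights, dt=0.01, steps=5000):
    """Integrate from x0 and return the state reached after `steps` Euler steps."""
    x = list(x0)
    for _ in range(steps):
        x = hopfield_step(x, weights, dt)
    return x


symmetric_w = [[0.0, 1.5], [1.5, 0.0]]   # illustrative symmetric coupling
print(settle([0.1, 0.2], symmetric_w))   # settles on a fixed point with x_1 = x_2 > 0
```

With this symmetric coupling the trajectory leaves the neighborhood of the origin and settles on an equilibrium satisfying $x = 1.5 \tanh(x)$, in line with the global convergence result for symmetric connectivity mentioned above.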


Bifurcations in Neural Network Dynamics 
Bifurcation theory studies the behavior of dynamical 
systems in the neighborhood of bifurcation points, i.e.,
at points where the topology of the state space abruptly
changes with continuous variation of a system parameter.
An example of the state space is given by the folded
surface in Fig. 33.16, which illustrates a cusp bifurcation
point. Here $\lambda = [a, b]$ is a two-dimensional parameter
vector, $X \in \mathbb{R}^1$ [33.87]. As parameter $b$ increases,
the initially unfolded manifold undergoes a bifurcation
through a cusp folding with three possible values of
state vector $X$. This is an example of a pitchfork bifurcation,
when a stable equilibrium point bifurcates into
one unstable and two stable equilibria. The projection to
the $a$–$b$ plane shows the cusp bifurcation folding with
multiple equilibria. The presence of multiple equilibria
provides the conditions for the onset of oscillatory
states in neural networks. The transition from fixed
point to limit cycle dynamics can be described by bifurcation
theory.
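The text does not give the equation behind the folded surface, so the following sketch assumes the standard cusp normal form $dx/dt = a + b x - x^3$ purely for illustration. The number of equilibria follows from the discriminant $4b^3 - 27a^2$ of the cubic, and an equilibrium is stable when the derivative $b - 3x^2$ of the right-hand side is negative. The function names are our own.

```python
def count_equilibria(a, b):
    """Number of real equilibria of dx/dt = a + b*x - x**3 (assumed cusp normal form)."""
    discriminant = 4.0 * b ** 3 - 27.0 * a ** 2
    if discriminant > 0:
        return 3                      # inside the cusp region: three equilibria
    if discriminant == 0 and (a != 0 or b != 0):
        return 2                      # on the fold curve: two equilibria coincide
    return 1


def is_stable(x, b):
    """An equilibrium x is stable when d/dx (a + b*x - x**3) = b - 3*x**2 < 0."""
    return b - 3.0 * x ** 2 < 0
```

For $a = 0$ the single equilibrium at $x = 0$ (for $b < 0$) splits into $x = 0$ and $x = \pm\sqrt{b}$ (for $b > 0$); the outer two are stable and the middle one is unstable, which is the pitchfork scenario described in the text.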


Neural Networks with Inhibitory Feedback 
Oscillations in neural networks are typically due to delayed,
negative feedback between neural populations.
Mean-field models are described first, starting with
Wilson–Cowan (WC) oscillators, which are capable of
producing limit cycle oscillations. Next, a class of more
general networks with excitatory–inhibitory feedback
is described, which can generate unstable limit cycle
oscillations.

Fig. 33.16 Folded surface in the state space illustrating cusp bifurcation folding (after [33.87]). By increasing parameter $b$, the stable equilibrium bifurcates to two stable and one unstable equilibria

The Wilson–Cowan model is based on statistical
analysis of neural populations in the mean-field limit,
i.e., assuming that all components of the system fully
interact [33.88, 89]. In the brain it may describe a single
cortical column in one of the sensory cortices, which
in turn interacts with other columns to generate synchronous
or asynchronous oscillations, depending on
the cognitive state. In its simplest manifestation, the
WC model has one excitatory and one inhibitory component,
with interaction weights denoted as $w_{EE}$, $w_{EI}$,
$w_{IE}$, and $w_{II}$. Nonlinear function $f$ stands for the standard
sigmoid with rate constant $a$

$$\frac{dX_E}{dt} = -X_E + f(w_{EE} X_E + w_{EI} X_I + P_E) , \tag{33.15}$$

$$\frac{dX_I}{dt} = -X_I + f(w_{IE} X_E + w_{II} X_I + P_I) , \tag{33.16}$$

$$f(x) = \frac{1}{1 + e^{-a x}} . \tag{33.17}$$

$P_E$ and $P_I$ describe the effect of input stimuli
through the excitatory and inhibitory nodes, respectively.
The inhibitory weights are negative, while the
excitatory ones are positive. The WC system has been
extensively studied with dynamical behaviors including
fixed point and oscillatory regimes. In particular,
for fixed weight values, it has been shown that the
WC system undergoes a pitchfork bifurcation by changing
the $P_E$ or $P_I$ input levels. Figure 33.17 shows the
schematics of the two-node system, as well as the illustration
of the oscillatory states following the bifurcation
with parameters $w_{EE} = 11.5$, $w_{II} = -2$, $w_{EI} = -w_{IE} =
-10$, and input values $P_E = 0$ and $P_I = -4$, with rate
constant $a = 1$.

Fig. 33.17 Schematic diagram of the Wilson–Cowan oscillator with excitatory (E) and inhibitory (I) populations; solid lines show excitatory, dashed show inhibitory connections. The right panels show the trajectory in the phase space of $X_E$–$X_I$ and the time series of the oscillatory signals (after [33.90])

Stochastic versions of the Wilson–Cowan
oscillators have been extensively developed as
well [33.90]. Coupled Wilson–Cowan oscillators have
been used in learning models and have demonstrated
applicability in a number of fields, including visual processing
and pattern classification [33.91–93].
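A two-node Wilson–Cowan oscillator can be integrated with a few lines of code. The sketch below is an illustration under stated assumptions: we read the quoted parameter set as inhibition onto the excitatory node of $-10$ and excitation onto the inhibitory node of $+10$ (consistent with the sign convention given in the text, but the subscripts in the source are ambiguous), and we use a forward-Euler scheme with an arbitrary step size and initial state. The function names are hypothetical.

```python
import math


def sigmoid(x, a=1.0):
    """Standard sigmoid with rate constant a."""
    return 1.0 / (1.0 + math.exp(-a * x))


def simulate_wilson_cowan(steps=20000, dt=0.01, x_e=0.1, x_i=0.1,
                          w_ee=11.5, w_ei=-10.0, w_ie=10.0, w_ii=-2.0,
                          p_e=0.0, p_i=-4.0):
    """Forward-Euler integration of the two-node model; returns the X_E and X_I series."""
    e_series, i_series = [], []
    for _ in range(steps):
        # w_ei: weight onto E from I (inhibitory); w_ie: weight onto I from E (excitatory).
        de = -x_e + sigmoid(w_ee * x_e + w_ei * x_i + p_e)
        di = -x_i + sigmoid(w_ie * x_e + w_ii * x_i + p_i)
        x_e, x_i = x_e + dt * de, x_i + dt * di
        e_series.append(x_e)
        i_series.append(x_i)
    return e_series, i_series
```

Under this reading, a linear stability estimate of the equilibrium near $X_E \approx 0.58$, $X_I \approx 0.64$ gives an unstable focus, so the bounded trajectory is expected to settle on a limit cycle; plotting `e_series` against `i_series` gives a phase portrait of the kind shown schematically in the figure.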

Oscillatory neural networks with interacting
excitatory–inhibitory units have been developed in Freeman
K sets [33.94]. That model uses an asymmetric
sigmoid function $f(x)$ modeled based on neurophysiological
activations and given as follows

$$f(x) = q \left\{ 1 - \exp\left[ -\frac{1}{q} \left( e^{x} - 1 \right) \right] \right\} . \tag{33.18}$$


Here q is a parameter specifying the slope and 
maximal asymptote of the sigmoid curve. The sigmoid 
has unit gain at zero, and has maximum gain at pos- 
itive x values due to its asymmetry, see (33.18). This 
property provides the opportunity for self-sustained os- 
cillations without input at a wide range of parameters. 
Two versions of the basic oscillatory units have been 
studied, either one excitatory and one inhibitory unit, 
or two excitatory and two inhibitory units. This is il- 
lustrated in Fig. 33.18. Stability conditions of the fixed 
point and limit cycle oscillations have been identi- 
fied [33.95, 96]. The system with two E and two I units 
has the advantage that it avoids self-feedback, which is 
uncharacteristic in biological neural populations. Inter- 
estingly, the extended system has an operating regime 
with an unstable equilibrium without stable equilib- 
ria. This condition leads to an inherent instability in 
a dynamical regime when the system oscillates with- 
out input. Oscillations in the unstable region have been 
characterized and conditions for sustained unstable os- 
cillations derived [33.96]. Simulations in the region 
confirmed the existence of limit cycles in the unstable 
regime with highly irregular oscillatory shapes of the 
cycle, see Fig. 33.18, upper plot. Regions with regular 
limit cycle oscillations and fixed point oscillations have 
been identified as well, see Fig. 33.18, middle and bot- 
tom [33.97]. 
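The two properties of the asymmetric sigmoid stated above, unit gain at zero and maximum gain at a positive argument, can be verified directly. Differentiating (33.18) gives the gain $f'(x) = \exp[x - (e^x - 1)/q]$, which equals 1 at $x = 0$ and is maximal at $x = \ln q$. The sketch below evaluates both; the value $q = 5$ is only an illustrative assumption and the function names are our own.

```python
import math


def freeman_sigmoid(x, q=5.0):
    """Asymmetric sigmoid of (33.18): q * (1 - exp(-(exp(x) - 1) / q))."""
    return q * (1.0 - math.exp(-(math.exp(x) - 1.0) / q))


def freeman_gain(x, q=5.0):
    """Analytic derivative of the sigmoid: exp(x - (exp(x) - 1) / q)."""
    return math.exp(x - (math.exp(x) - 1.0) / q)


# Locate the maximum gain on a grid and compare it with ln(q).
grid = [0.001 * k for k in range(-3000, 5000)]
x_max_gain = max(grid, key=freeman_gain)
print(x_max_gain, math.log(5.0))
```

For $q > 1$ the gain peak lies at a positive argument, so a small positive excursion of the input increases the effective gain. This is the property the text links to self-sustained oscillations.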


Spatiotemporal Oscillations in Heterogeneous NNs
Neural networks describe the collective behavior of
populations of neurons. It is of special interest to study
populations with a large number of components having
complex, nonlinear interactions. Homogeneous populations
of neurons allow mathematical modeling in the mean-field
approximation, leading to oscillatory models such
as the Wilson–Cowan oscillators and Freeman KII sets.
Field models with heterogeneous structure and dynamic
variables are of great interest as well, as they are the
prerequisite of associative memory functions of neural
networks.

Fig. 33.18a,b Illustration of excitatory–inhibitory models. (a) Left: simplified model with one excitatory (E) and one inhibitory (I) node. Right: extended model with two E and two I nodes. (b) Simulations with the extended model with two E and two I nodes; $y_1$–$y_4$ show the activations of the nodes; b1: limit cycle oscillations in the unstable regime; b2: oscillations in the stable limit cycle regime; b3: fixed point regime (after [33.97])

A general mathematical formulation views the neuropil,
the interconnected neural tissue of the cortex,
as a dynamical system evolving in the phase space,
see (33.13). Consider a population of spiking neurons,
each of which is modeled by a Hodgkin–Huxley
equation. The state of a neuron at any time instant is
determined by its depolarization potential, microscopic
current, and spike timing. Each neuron is represented
by a point in the state space given by the above coordinates
comprising vector $X(t) \in \mathbb{R}^n$, and the evolution
of a neuron is given by its trajectory in the state space.
Neuropils can include millions and billions of neurons;
thus the phase space of the neurons contains myriads
of trajectories. Using the ensemble density approach of
population modeling, the distribution of neurons in the
state space at a given time $t$ is described by a probability
density function $p(X, t)$. The ensemble density
approach models the evolution of the probability density
in the state space [33.98]. One popular approach
uses the Langevin formalism given next.


Field Theories of Neural Networks 
Consider the stochastic process $X(t)$, which is described
by the Langevin equation [33.99]

$$dX(t) = \mu(X(t)) \, dt + \sigma(X(t)) \, dW(t) . \tag{33.19}$$

Here $\mu$ and $\sigma$ denote the drift and the noise amplitude, respectively,
and $dW(t)$ is the increment of a Wiener process (Brownian motion),
which is normally distributed. The probability
density $p(X, t)$ of the Langevin equation (33.19) satisfies
the following form of the Fokker–Planck equation, after
omitting higher-order terms

$$\frac{\partial p(X, t)}{\partial t} = -\sum_{i=1}^{n} \frac{\partial}{\partial x_i} \left[ \mu_i(X) \, p(X, t) \right] + \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{\partial^2}{\partial x_i \partial x_j} \left[ D_{ij}(X, t) \, p(X, t) \right] . \tag{33.20}$$

The Fokker–Planck equation has two components.
The first one is a flow term containing the drift vector
$\mu_i(X)$, while the other term describes diffusion
with diffusion coefficient matrix $D_{ij}(X, t)$. The Fokker–Planck
equation is a partial differential equation (PDE)
that provides a deterministic description of macroscopic
events resulting from random microscopic events. The
mean-field approximation describes time-dependent,
ensemble average population properties, instead of
keeping track of the behavior of individual neurons.
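The relation between individual random trajectories and the ensemble density can be illustrated with the Euler–Maruyama scheme applied to (33.19). The example below assumes a one-dimensional process with linear drift $\mu(x) = -\theta x$ and constant noise amplitude $\sigma$ (an Ornstein–Uhlenbeck process), chosen by us because the stationary solution of the corresponding Fokker–Planck equation is a Gaussian with variance $\sigma^2 / (2\theta)$. The parameter values and function names are illustrative.

```python
import math
import random


def euler_maruyama(drift, noise, x0, dt, steps, rng):
    """Integrate dX = drift(X) dt + noise(X) dW with the Euler-Maruyama scheme."""
    x = x0
    for _ in range(steps):
        dw = rng.gauss(0.0, math.sqrt(dt))   # Wiener increment, distributed as N(0, dt)
        x += drift(x) * dt + noise(x) * dw
    return x


def sample_endpoints(n_paths=2000, theta=1.0, sigma=1.0, dt=0.01, steps=500, seed=1):
    """Endpoints of an ensemble of paths of dX = -theta*X dt + sigma dW, X(0) = 0."""
    rng = random.Random(seed)
    return [
        euler_maruyama(lambda x: -theta * x, lambda x: sigma, 0.0, dt, steps, rng)
        for _ in range(n_paths)
    ]
```

The sample variance of the endpoints approximates $\sigma^2 / (2\theta) = 0.5$ for the default values, up to sampling and discretization error. Each path is random, but the ensemble statistics follow the deterministic density equation.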

Mean-field models can be extended to describe the 
evolution of neural populations distributed in physi- 
cal space. Considering the cortical sheet as a de facto 
continuum of the highly convoluted neural tissue (the 
neuropil), field theories of brains are developed using 
partial differential equations in space and time. The corresponding
PDEs are wave equations. Consider a simple
one-dimensional model to describe the dynamics of
the current density $\Phi(x, t)$ as a macroscopic variable. In
the simple case of translational invariance of the connectivity
function between two arbitrary points of the
domain with exponential decay, the following form of
the wave equation is obtained [33.100]

$$\frac{\partial^2 \Phi}{\partial t^2} + (\omega_0^2 - v^2 \Delta) \Phi + 2 \omega_0 \frac{\partial \Phi}{\partial t} = \left( \omega_0^2 + \omega_0 \frac{\partial}{\partial t} \right) S[\Phi(x, t) + P(x, t)] . \tag{33.21}$$

Here $\Delta = \partial^2 / \partial x^2$ is the Laplacian in one dimension,
$S(\cdot)$ is a sigmoid transfer function for firing
rates, $P(x, t)$ describes the effect of inputs; $\omega_0 = v / \sigma$,
where $v$ is the propagation velocity along lateral axons,
and $\sigma$ is the spatial relaxation constant of the applied
exponential decay function [33.100]. The model can
be extended to excitatory-inhibitory components. An 
example of simulations with a one-dimensional neu- 
ral field model incorporating excitatory and inhibitory 
neurons is given in Fig. 33.19 [33.101]. The figure 
shows the propagation of two traveling pulses and the 
emergence of transient complex behavior ultimately 
leading to an elevated firing rate across the whole tis- 
sue [33.101]. For recent developments in brain field 
models, see [33.90, 102]. 


Fig. 33.19 Numerical simulations of a one-dimensional neural field model showing the interaction of two traveling pulses (after [33.101])

Coupled Map Lattices for NNs
Spatiotemporal dynamics in complex systems has been
modeled using coupled map lattices (CML) [33.103].
CMLs use continuous state space and discrete time and
space coordinates. In other words, CMLs are defined on
(finite or infinite) lattices using discrete time iterations.
Using periodic boundary conditions, the array can be
folded into a circle in one dimension, or into a torus
for lattices of dimension 2 or higher. CML dynamics is
described as follows

$$x_{n+1}(i) = (1 - \varepsilon) f(x_n(i)) + \frac{\varepsilon}{K} \sum_{k=-K/2}^{K/2} f(x_n(i + k)) , \tag{33.22}$$

where $x_n(i)$ is the value of node $i$ at iteration step $n$, $i =
1, \ldots, N$; $N$ is the size of the lattice. Note that in (33.22)
a periodic boundary condition applies. $f(\cdot)$ is a nonlinear
mapping function used in the iterations and $\varepsilon$ is the
coupling strength, $0 \le \varepsilon \le 1$. $\varepsilon = 0$ means no coupling,
while $\varepsilon = 1$ is maximum coupling. The CML rule defined
in (33.22) has two terms. The first term on the
right-hand side is an iterative update of the $i$-th state,
while the second term describes coupling between the
units. Parameter $K$ has a special role in coupled map
lattices; it defines the size of the neighborhoods. $K = N$
describes mean-field coupling, while smaller $K$ values
belong to smaller neighborhoods. The geometry of the
system is similar to the ones given in Fig. 33.14. The
case of local neighborhood is the upper left diagram in
Fig. 33.14, while mean-field coupling is the lower right
diagram. Similar rules have been defined for higher-dimensional
lattices.

CMLs exhibit very rich dynamic behavior, including
fixed points, limit cycles, and chaos, depending
on the choice of control parameters ε, K, and function
f(.) [33.103, 104]. An example of the cubic sigmoid
function

$$
f(x, a) = a x^3 - a x + x
$$


is shown in Fig. 33.20, together with the bifurcation di- 
agram with respect to parameter a. By increasing the 
value of parameter a, the map exhibits bifurcations from 
fixed point to limit cycle, and ultimately to the chaotic 
regime. 
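
The CML rule and the cubic map can be sketched in a few lines of Python. This is only an illustration: it assumes the update has the form x_{n+1}(i) = (1 - ε) f(x_n(i)) + (ε/K) Σ f(x_n(i + k)) with k running from -K/2 to K/2 on a ring, and the function names `cubic_map`, `cml_step`, and `run_cml` are hypothetical.

```python
def cubic_map(x, a):
    """Cubic map f(x, a) = a*x**3 - a*x + x."""
    return a * x**3 - a * x + x


def cml_step(state, a, eps, K):
    """One synchronous CML update on a ring (periodic boundary).

    Assumed form:
        x_{n+1}(i) = (1 - eps) * f(x_n(i))
                     + (eps / K) * sum_{k=-K/2..K/2} f(x_n(i + k))
    """
    n = len(state)
    fx = [cubic_map(x, a) for x in state]
    half = K // 2
    new_state = []
    for i in range(n):
        # Index arithmetic modulo n implements the periodic boundary.
        neighborhood = sum(fx[(i + k) % n] for k in range(-half, half + 1))
        new_state.append((1.0 - eps) * fx[i] + eps * neighborhood / K)
    return new_state


def run_cml(state, a, eps, K, steps):
    """Iterate the lattice for a number of steps and return the final state."""
    for _ in range(steps):
        state = cml_step(state, a, eps, K)
    return state
```

With eps = 0 the nodes decouple and each one simply iterates the cubic map, whose fixed points are x = 0 and x = ±1; scanning a for a single node is one way to reproduce a bifurcation diagram of the kind described above.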

Complex CML dynamics has been used to design 
dynamic associative memory systems. In CML, each 
memory is represented as a spatially coherent oscil- 
lation and is learnt by a correlational learning rule 
operating in limit cycle or chaotic regimes. In such systems,
both the memory capacity and the basin volume
for each memory are larger than in the Hopfield
model employing the same learning rule [33.105]. CML
chaotic memories reduce the problem of spurious memories,
but they are not immune to it. Spurious memories
prevent the system from exploiting its memory capacity 
to the fullest extent. 




Fig. 33.20a,b Transfer function for CML: (a) shape of the cubic transfer function f(x, a) = ax^3 - ax + x; (b) bifurcation
diagram over parameter a


Stochastic Resonance 
Field models of brain networks develop deterministic
PDEs (Fokker-Planck equation) for macroscopic
properties based on a statistical description of the
underlying stochastic dynamics of microscopic neurons.
In other words, they are deterministic systems
at the macroscopic level. Stochastic resonance
(SR) deals with conditions when a bistable or multi- 
stable system exhibits strong oscillations under weak 
periodic perturbations in the presence of random 
noise [33.106]. In a typical SR situation, the weak 
periodic carrier wave is insufficient to cross the po- 
tential barrier between the equilibria of a multistable 
system. Additive noise enables the system to sur- 
mount the barrier and exhibit oscillations as it transits 
between the equilibria. SR is an example of pro- 
cesses when properly tuned random noise improves 
the performance of a nonlinear system and it is 
highly relevant to neural signal processing [33.107, 
108]. 
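
The SR mechanism can be illustrated with a minimal sketch of an overdamped particle in a double-well potential. None of the specifics below come from the source: the drift x - x^3, the Euler-Maruyama integration, the parameter values, and the name `count_transitions` are assumptions chosen for illustration only.

```python
import math
import random


def count_transitions(amplitude, noise, steps=20000, dt=0.01, omega=0.1, seed=0):
    """Count well-to-well transitions of dx = (x - x^3 + A sin(wt)) dt + sqrt(2 D dt) xi.

    The wells sit at x = +1 and x = -1. A transition is counted only when the
    state passes +/-0.5 on the far side, so jitter around the barrier at
    x = 0 is not counted repeatedly.
    """
    rng = random.Random(seed)
    x, well, transitions = 1.0, 1, 0
    for n in range(steps):
        t = n * dt
        drift = x - x**3 + amplitude * math.sin(omega * t)
        x += drift * dt + math.sqrt(2.0 * noise * dt) * rng.gauss(0.0, 1.0)
        if well == 1 and x < -0.5:
            well, transitions = -1, transitions + 1
        elif well == -1 and x > 0.5:
            well, transitions = 1, transitions + 1
    return transitions
```

Without noise, a forcing amplitude of 0.2 stays below the maximum restoring force of this potential (about 0.385), so the state never leaves its initial well. Adding noise makes barrier crossings possible, and scanning the noise level is the usual way to look for the resonance.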

A prominent example of SR in a neural network
with excitatory and inhibitory units is described
in [33.109]. In the model developed, the activation
rates of excitatory and inhibitory neurons are described
by μ_e and μ_i, respectively. The ratio α = μ_e/μ_i is an
important parameter of the system. The investigated
neural populations exhibit a range of dynamic behaviors,
including convergence to fixed point, damped
oscillations, and persistent oscillations. Figure 33.21
summarizes the main findings in the form of a phase
diagram in the space of parameters α and noise level.
The diagram contains three regions. Region I is at low
noise levels and it corresponds to oscillations decaying
to a fixed point at an exponential rate. Region II
corresponds to high noise, when the neural activity exhibits
damped oscillations as it approaches the steady
state. Region III, however, demonstrates sustained oscillations
for an intermediate level of noise. If α is
above a critical value (see the tip of Region III),


Fig. 33.21 Stochastic resonance in excitatory-inhibitory
neural networks; α describes the relative strength of inhibition.
Region I: fixed point dynamics. Region II: damped
oscillatory regime. Region III: sustained periodic oscillations
illustrating stochastic resonance (after [33.109])

Fig. 33.22a,b Lorenz attractor in the chaotic regime; (a) time series of the variables X, Y, and Z; (b) butterfly-winged
chaotic Lorenz attractor in the phase space spanned by variables X, Y, and Z


the activities in the steady state undergo a first-order 
phase transition at a critical noise level. The inten- 
sive oscillations in Region III at an intermediate noise 
level show that the output of the system (oscilla- 
tions) can be enhanced by an optimally selected noise 
level. 

The observed phase transitions may be triggered by 
neuronal avalanches, when the neural system is close 
to a critical state and the activation of a small number 
of neurons can generate an avalanche process of activa- 
tion [33.110]. Neural avalanches have been described 
using self-organized criticality (SOC), which has been 
identified in neural systems [33.111]. There is much 
empirical evidence of the cortex conforming to the self- 
stabilized, scale-free dynamics with avalanches during 
the existence of some quasi-stable states [33.112, 113]. 
These avalanches maintain a metastable background 
state of activity. 

Phase transitions have been studied in models with 
extended layers of excitatory and inhibitory neuron 
populations, respectively. A specific model uses ran- 
dom cellular neural networks to describe conditions 
with sustained oscillations [33.114]. The role of var- 
ious control parameters has been studied, including 
noise level, inhibition, and rewiring. Rewiring describes 
long axonal connections to produce neural network ar- 
chitectures resembling connectivity patterns with short 
and long-range axons in the neuropil. By properly tun- 
ing the parameters, the system can reside in a fixed 
point regime in isolation, but it will switch to per- 
sistent oscillations under the influence of learnt input 
patterns [33.115]. 


33.2.3 Chaotic Neural Networks 


Emergence of Chaos in Neural Systems 
Neural networks as dynamical systems are described 
by the state vector X(t) which obeys the equation of 
motion (33.13). Dynamical systems can exhibit fixed 
point, periodic, and chaotic behaviors. Fixed points and
periodic oscillations, and transitions from one to the
other through bifurcation dynamics, have been described
in Sect. 33.2.2. The trajectory of a chaotic system does
not converge to a fixed point or limit cycle; rather, it
converges to a chaotic attractor. Chaotic attractors, or
strange attractors, have the property that they define
a fractal set in the state space; moreover, chaotic trajectories
that are close to each other at some point diverge from
each other exponentially fast as time evolves [33.116,
117].

An example of the chaotic Lorenz attractor is shown 
in Fig. 33.22. The Lorenz attractor is defined by a sys- 
tem of three ordinary differential equations (ODEs) 
with nonlinear coupling, originally derived for the de- 
scription of the motion of viscous flows [33.118]. The 
time series belonging to variables X, Y, Z are shown in 
Fig. 33.22a for parameters in the chaotic region, while 
the strange attractor is illustrated by the trajectory in the 
phase space, see Fig. 33.22b. 
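
The sensitivity to initial conditions can be checked numerically. The text does not list the Lorenz equations or parameter values, so the sketch below assumes the classic form with σ = 10, ρ = 28, β = 8/3 and a fourth-order Runge-Kutta integrator; the function names are hypothetical.

```python
def lorenz_rhs(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz ODEs (classic parameter values assumed)."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)


def rk4_step(state, dt):
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    def shifted(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))

    k1 = lorenz_rhs(state)
    k2 = lorenz_rhs(shifted(state, k1, dt / 2.0))
    k3 = lorenz_rhs(shifted(state, k2, dt / 2.0))
    k4 = lorenz_rhs(shifted(state, k3, dt))
    return tuple(
        s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def integrate(state, dt=0.01, steps=4000):
    """Return the trajectory as a list of (X, Y, Z) tuples."""
    trajectory = [state]
    for _ in range(steps):
        state = rk4_step(state, dt)
        trajectory.append(state)
    return trajectory


def max_separation(traj_a, traj_b):
    """Largest Euclidean distance between two trajectories sampled in step."""
    return max(
        sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5
        for a, b in zip(traj_a, traj_b)
    )
```

Two trajectories started 1e-8 apart in X remain bounded but separate to a distance comparable to the size of the attractor within a few tens of time units, which is the exponential divergence described above.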


Chaotic Neuron Model 
In chaotic neural networks the individual components 
exhibit chaotic behavior, and the goal is to study the or- 
der emerging from their interaction. Nerve membranes 
produce propagating action potentials in a highly nonlinear
process which can generate oscillations and bifurcations
to chaos. Chaos has been observed in the giant
axons of squid and it has been used to study chaotic
behavior in neurons. The Hodgkin-Huxley equations
can model nonlinear dynamics in the squid giant axon
with high accuracy [33.58]. The chaotic neuron model
of Aihara et al. is an approximation of the Hodgkin-Huxley
equations and it reproduces chaotic oscillations
observed in the squid giant axon [33.119, 120]. The
model uses the following simple iterative map


$$
x(t+1) = k\,x(t) - \alpha f(x(t)) + a , \tag{33.23}
$$

where x(t) is the state of the chaotic neuron at time t,
k is a decay parameter, α characterizes refractoriness,
a is a combined bias term, and f(x(t)) is a nonlinear
transfer function. In the chaotic neuron model,
the log sigmoid transfer function is used, see (33.17).
Equation (33.23) combined with the sigmoid produces
a piecewise monotone map, which generates chaos.
Chaotic neural networks composed of chaotic neurons
generate spatio-temporal chaos and are able to
retrieve previously learnt patterns as the chaotic trajectory
traverses the state space. Chaotic neural networks
are used in various information processing systems
with abilities of parallel distributed processing [33.121-123].
Note that CMLs also consist of chaotic oscillators
produced by a nonlinear local iterative map, like in
chaotic neural networks. CMLs define a spatial relationship
among their nodes to describe spatio-temporal
fluctuations. A class of cellular neural networks
combines the explicit spatial relationships similar to
CMLs with detailed temporal dynamics using the Cohen-Grossberg
model [33.83] and it has been used successfully
in neural network applications [33.124, 125].
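
A minimal sketch of iterating the chaotic neuron map follows. It assumes the map acts on a single state variable, x(t+1) = k x(t) - α f(x(t)) + a, and that f is a logistic sigmoid with a hypothetical steepness parameter, since (33.17) is not reproduced here; the parameter values are illustrative only.

```python
import math


def sigmoid(u, steepness=0.02):
    """Logistic sigmoid; the steepness value is an illustrative assumption."""
    return 1.0 / (1.0 + math.exp(-u / steepness))


def chaotic_neuron_orbit(x0, k, alpha, a, steps):
    """Iterate x(t+1) = k*x(t) - alpha*f(x(t)) + a and return the orbit."""
    orbit = [x0]
    x = x0
    for _ in range(steps):
        x = k * x - alpha * sigmoid(x) + a
        orbit.append(x)
    return orbit
```

Because f takes values between 0 and 1, the orbit stays bounded whenever |k| < 1. Whether it settles on a periodic cycle or wanders chaotically depends on the choice of k, α, and a.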


Collective Chaos in Neural Networks 
Chaos in neural networks can be an emergent macro- 
scopic property stemming from the interaction of non- 
linear neurons, which are not necessarily chaotic in iso- 
lation. Starting from the microscopic neural level up to 
the macroscopic level of cognition and consciousness, 
chaos plays an important role in neurodynamics [33.82, 
126-129]. There are various routes to chaos in neu- 
ral systems, including period-doubling bifurcations to 
chaos, chaotic intermittency, and collapse of a two-di- 
mensional torus to chaos [33.130, 131]. 

Chaotic itinerancy is a special form of chaos, 
which is between ordered dynamics and fully devel- 
oped chaos. Chaotic itinerancy describes the trajectory 
through high-dimensional state space of neural activ- 
ity [33.132]. In chaotic itinerancy the chaotic system 


is destabilized to some degree but some traces of the 
trajectories remain. This describes an itinerant behavior 
between the states of the system containing destabilized 
attractors or attractor ruins, which can be fixed point, 
limit cycle, torus, or strange attractor with unstable di- 
rections. Dynamical orbits are attracted to a certain 
attractor ruin, but they leave via an unstable mani- 
fold after a (short or long) stay around it and move 
toward another attractor ruin. This successive chaotic 
transition continues unless a strong input is received. 
A schematic diagram is shown in Fig. 33.23, where the 
trajectory of a chaotic itinerant system is shown visit- 
ing attractor ruins. Chaotic itinerancy is associated with 
perceptions and memories, the chaos between the at- 
tractor ruins is related to searches, and the itinerancy 
is associated with sequences in thinking, speaking, and 
writing. 

Frustrated chaos arises in a neural network
with a global attractor structure when local
connectivity patterns responsible for stable oscillatory
behaviors become intertwined, leading to mutually
competing attractors and unpredictable itinerancy between
brief appearances of these attractors [33.133].
Similarly to chaotic itinerancy, frustrated chaos is re- 
lated to destabilization of the dynamics and it generates 
itinerant, wavering oscillations between the orbits of 
the network, the trajectories of which have been stable 
with the original connectivity pattern. Frustrated chaos 
is shown to belong to the family of intermittency type 
of chaos [33.134, 135]. 

Fig. 33.23 Schematic illustration of itinerant chaos with
a trajectory visiting attractor ruins (after [33.132])

To characterize chaotic dynamics, tools of statistical
time series analysis are useful. The studies may involve
time and frequency domains. Time domain analysis
includes attractor reconstruction, i.e., the attractor is
depicted in the state space. Chaotic attractors have fractal
dimensions, which can be evaluated using one of the
available methods [33.136-138]. In the case of low-dimensional
chaotic systems, the reconstruction can be
illustrated using two or three-dimensional plots. An example
of attractor reconstruction is given in Fig. 33.22
for the Lorenz system with three variables. Attractor
reconstruction of a time series can be conducted using
time-delay coordinates [33.139].
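
Time-delay reconstruction can be sketched as follows. The function name `delay_embed` and its parameters (embedding dimension `dim`, delay `tau` in samples) are illustrative choices, not notation from the source.

```python
def delay_embed(series, dim, tau):
    """Build delay vectors (s[i], s[i + tau], ..., s[i + (dim - 1) * tau])."""
    n_vectors = len(series) - (dim - 1) * tau
    if n_vectors <= 0:
        raise ValueError("series too short for this dim and tau")
    return [
        tuple(series[i + j * tau] for j in range(dim))
        for i in range(n_vectors)
    ]
```

For example, `delay_embed([0, 1, 2, 3, 4, 5], dim=3, tau=2)` returns `[(0, 2, 4), (1, 3, 5)]`. Applied to a single measured variable of a chaotic system, the resulting points can be plotted in two or three dimensions to visualize the reconstructed attractor.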

Lyapunov spectrum analysis is a key tool in identifying
and describing chaotic systems. Lyapunov exponents
measure the instability of orbits in different
directions in the state space. They describe the rate of
exponential divergence of trajectories that were once close
to each other. The set of corresponding Lyapunov exponents
constitutes the Lyapunov spectrum. The maximum
Lyapunov exponent Λ* is of crucial importance,
as a positive leading Lyapunov exponent Λ* > 0 is the
hallmark of chaos. X(t) describes the trajectory of the
system in the phase space starting from X(0) at time t = 0.
Denote by X_{Δx_0}(t) the perturbed trajectory starting
from [X(0) + Δx_0]. The leading Lyapunov exponent can
be determined using the following relationship [33.140]

$$
\Lambda^* = \lim_{t \to \infty,\ \Delta x_0 \to 0} t^{-1} \ln\left[ |X_{\Delta x_0}(t) - X(t)| / |\Delta x_0| \right] , \tag{33.24}
$$

where Λ* < 0 corresponds to convergent behavior,
Λ* = 0 indicates periodic orbits, and Λ* > 0 signifies
chaos. For example, the Lorenz attractor has Λ* =
0.906, indicating strong chaos (Fig. 33.24). Equation
(33.24) measures the divergence for infinitesimal
perturbations in the limit of infinite time series. In practical
situations, especially for short time series, it is
often difficult to distinguish weak chaos from random
perturbations. One must be careful with conclusions
about the presence of chaos when Λ* has a value
close to zero. Lyapunov exponents are widely used in
brain monitoring using electroencephalogram (EEG)
analysis, and various methods are available for characterization
of normal and pathological brain conditions
based on Lyapunov spectra [33.141, 142].
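
The divergence-based definition can be turned into a simple numerical estimate for a one-dimensional map. The sketch below is an assumption-laden illustration: it follows a reference orbit and a nearby orbit, averages the logarithmic stretching per step, and rescales the separation back to d0 after every step (a common practical device that is not described in the source). The name `largest_lyapunov` is hypothetical.

```python
import math


def largest_lyapunov(f, x0, steps=20000, d0=1e-8, burn_in=100):
    """Estimate the leading Lyapunov exponent of the map x -> f(x)."""
    x = x0
    for _ in range(burn_in):  # discard the initial transient
        x = f(x)
    y = x + d0
    total, counted = 0.0, 0
    for _ in range(steps):
        x, y = f(x), f(y)
        d = abs(y - x)
        if d > 0.0:
            total += math.log(d / d0)
            counted += 1
        # Rescale the perturbation to size d0, keeping its sign.
        y = x + d0 if y >= x else x - d0
    return total / counted
```

As a check on a case with a known answer, the logistic map x -> 4x(1 - x) has leading exponent ln 2 ≈ 0.693, while the contracting map x -> x/2 gives ln(1/2) < 0, i.e., convergent behavior.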

Fourier analysis conducts data processing in the 
frequency domain, see (33.5) and (33.6). For chaotic 
signals, the shape of the power spectra is of special
interest. Power spectra often show 1/f^α power-law behavior
in log-log coordinates, which is an indication of
a scale-free system and possibly chaos. Power-law scaling
in systems at SOC is suggested by a linear decrease


in log power with increasing log frequency [33.143]. 
Scaling properties of criticality facilitate the coexis- 
tence of spatially coherent cortical activity patterns for 
a duration ranging from a few milliseconds to a few 
seconds. Scale-free behavior characterizes chaotic brain 
activity both in time and frequency domains. For com- 
pleteness, we mention the Hilbert space analysis as an 
alternative to Fourier methods. The analytic signal ap- 
proach based on Hilbert analysis is widely used in brain 
monitoring. 
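
The exponent of a 1/f^α spectrum can be estimated by fitting a straight line in log-log coordinates. The following sketch uses an ordinary least-squares fit; the function name and the choice of base-10 logarithms are illustrative assumptions.

```python
import math


def fit_power_law_exponent(freqs, power):
    """Return alpha such that power ~ 1 / f**alpha (least squares in log-log)."""
    xs = [math.log10(f) for f in freqs]
    ys = [math.log10(p) for p in power]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return -slope  # a decreasing line in log-log has a positive alpha
```

For real spectra the fit should be restricted to the frequency band where the log-log plot is actually linear; a good fit alone does not establish chaos or criticality.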


Emergent Macroscopic Chaos in Neural Networks
Freeman's K model describes spatio-temporal brain
chaos using a hierarchical approach. Low-level K sets
were introduced in the 1970s, named in honor of
Aharon Katchalsky, an early pioneer of neural dynamics
[33.82, 94]. K sets are multiscale models, describing
an increasing complexity of structure and dynamics.
K sets are mesoscopic models and represent an
intermediate level between microscopic neurons and
macroscopic brain structures. K sets are topological
specifications of the hierarchy of connectivity in neural
populations in brains. K sets describe the spatial
patterns of phase and amplitude of the oscillations generated
by neural populations. They model observable
fields of neural activity comprising electroencephalograms
(EEGs), local field potentials (LFPs), and magnetoencephalograms
(MEGs) [33.144]. K sets form
a hierarchy for cell assemblies with components starting
from K0 to KIV [33.145, 146].

K0 sets represent noninteractive collections of neurons
forming cortical microcolumns; a K0 set models
a neuron population of ~10^3-10^4 neurons. K0 models
dendritic integration in average neurons and an asymmetric
sigmoid static nonlinearity for axon transmission.
The K0 set is governed by a point attractor with
zero output and stays at equilibrium except when perturbed.
In the original K-set models, K0s are described
by a state-dependent, linear second-order ordinary differential
equation (ODE) [33.94]

$$
ab \frac{d^2 X(t)}{dt^2} + (a+b) \frac{dX(t)}{dt} + X(t) = U(t) . \tag{33.25}
$$


Here a and b are biologically determined time con- 
stants. X(t) denotes the activation of the node as a func- 
tion of time. U(t) includes an asymmetric sigmoid 
function Q(x), see (33.18), acting on the weighted sum 
of activation from neighboring nodes and any external 
input. 
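
The K0 dynamics can be explored with a short numerical sketch. It assumes the second-order form ab X'' + (a + b) X' + X = U(t) and treats U(t) as a given external drive, leaving out the sigmoid coupling to neighboring nodes. The values of a and b and the explicit Euler integration are illustrative assumptions, not taken from the source.

```python
def simulate_k0(u, a=4.5, b=1.4, dt=0.01, steps=10000):
    """Integrate ab*X'' + (a + b)*X' + X = u(t) from rest with explicit Euler.

    u is a function of time; a and b are time constants (illustrative values).
    Returns the list of X values, one per time step.
    """
    x, v = 0.0, 0.0  # activation and its time derivative
    xs = [x]
    for n in range(steps):
        t = n * dt
        acc = (u(t) - (a + b) * v - x) / (a * b)
        x, v = x + dt * v, v + dt * acc
        xs.append(x)
    return xs
```

With zero input the node stays at the zero fixed point, consistent with the point attractor described above; with a constant input U = 1 the activation relaxes monotonically to X = 1 with rates 1/a and 1/b.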
Fig. 33.24a-c KIII diagram and behaviors; (a) 3 double layer hierarchy of KIII and time series over each layer, exhibiting
intermittent chaotic oscillations, (b) phase space reconstruction using delayed time coordinates


KI sets are made of interacting K0 sets, either excitatory
or inhibitory with positive feedback. The dynamics
of KI is described as convergence to a nonzero fixed
point. If KI has sufficient functional connection density,
then it is able to maintain a nonzero state of background
activity by mutual excitation (or inhibition).
KI typically operates far from thermodynamic equilibrium.
Neural interaction by stable mutual excitation
(or mutual inhibition) is fundamental to understanding
brain dynamics. KII sets consist of interacting excitatory
and inhibitory KI sets with negative feedback.
KII sets are responsible for the emergence of limit cycle
oscillation due to the negative feedback between the
neural populations. Transitions from point attractor to
limit cycle attractor can be achieved through a suitable
level of feedback gain or by input stimuli, see
Fig. 33.18.

KIII sets are made up of multiple interacting KII sets.
Examples include the sensory cortices. KIII sets generate
broadband, chaotic oscillations as background
activity by combined negative and positive feedback
among several KII populations with incommensurate
frequencies. The increase in nonlinear feedback gain
that is driven by input results in the destabilization of
the background activity and leads to the emergence of
a spatial amplitude modulation (AM) pattern in KIII.
KIII sets are responsible for the embodiment of meaning
in AM patterns of neural activity shaped by synaptic
interactions that have been modified through learning in
KIII layers. The KIII model is illustrated in Fig. 33.24
with three layers of excitatory-inhibitory nodes. In
Fig. 33.24a the temporal dynamics is illustrated in each
layer, while Fig. 33.24b shows the phase space reconstruction
of the attractor. This is a chaotic behavior
resembling the dynamics of the Lorenz attractor in
Fig. 33.22. KIV sets are made up of interacting KIII
units to model intentional neurodynamics of the limbic
system. KIV exhibits global phase transitions, which
are the manifestations of hemisphere-wide cooperation
through intermittent large-scale synchronization.
KIV is the domain of Gestalt formation and preafference
through the convergence of external and internal
sensory signals leading to intentional action [33.144,
146].


Properties of Collective Chaotic Neural 
Networks 
KIII is an associative memory, encoding input data
in spatio-temporal AM patterns [33.147, 148]. KIII
chaotic memories have several advantages as compared
to convergent recurrent networks:


1. They produce robust memories based on relatively
few learning examples even in noisy environments.

2. The encoding capacity of a network with a given
number of nodes is exponentially larger than that of its
convergent counterparts.

3. They can recall the stored data very quickly, just as 
humans and animals can recognize a learnt pattern 
within a fraction of a second. 


The recurrent Hopfield neural network can store
an estimated 0.15N input patterns in stable attractors,
where N is the number of neurons [33.84]. Exact analysis
by McEliece et al. [33.149] shows that the memory
capacity of the Hopfield network is N/(4 log N). Various
generalizations provide improvements over the initial
memory gain [33.150, 151]. It is of interest to evaluate
the memory capacity of the KIII memory. The
memory capacity of chaotic networks which encode
input into chaotic attractors is, in principle, exponentially
increased with the number of nodes. However,
the efficient recall of the stored memories is a serious
challenge. The memory capacity of KIII as a chaotic associative
memory device has been evaluated with noisy
input patterns. The results are shown in Fig. 33.25,
where the performance of Hopfield and KIII memories
is compared; the top two plots are for Hopfield
nets, while the lower two figures describe KIII results
[33.152]. The light color shows recognition rate
close to 100%, while the dark color means poor recognition
approaching 0. The right-hand column has higher
noise levels. The Hopfield network shows the well-known
linear gain curve ~ 0.15. The KIII model, on the
other hand, has a drastically better performance. The
boundary separating the correct and incorrect classification
domains is superlinear; it has been fitted with
a fifth-order polynomial.
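
The two Hopfield capacity estimates quoted above are easy to compare numerically. The sketch below assumes that "log" in N/(4 log N) denotes the natural logarithm; the function names are hypothetical.

```python
import math


def hopfield_capacity_estimate(n):
    """Empirical estimate of about 0.15 N storable patterns."""
    return 0.15 * n


def mceliece_capacity(n):
    """Bound N / (4 log N), with log taken as the natural logarithm (assumed)."""
    return n / (4.0 * math.log(n))
```

For N = 100 neurons this gives about 15 patterns from the 0.15N rule and about 5.4 from N/(4 log N); for N = 1000 the values are 150 and about 36. Both grow at most linearly with N, which is the baseline against which the superlinear KIII boundary is compared.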


Cognitive Implications 
of Intermittent Brain Chaos 
Developments in brain monitoring techniques provide 
increasingly detailed insights into spatio-temporal neu- 
rodynamics and neural correlates of large-scale cog- 
nitive processing [33.74, 153-155]. Brains as large- 
scale dynamical systems have a basal state, which is 
a high-dimensional chaotic attractor with a dynamic 
trajectory wandering broadly over the attractor land- 
scape [33.82, 126]. Under the influence of external 
stimuli, cortical dynamics is destabilized and condenses 
intermittently to a lower-dimensional, more organized 
subspace. This is the act of perception when the subject
identifies the stimulus with a meaning in the context
of its previous experience. The system stays intermittently
in the condensed, more coherent state, which
gives rise to a spatio-temporal AM activity pattern corresponding
to the stimulus in the given context. The
AM pattern is meta-stable and it disintegrates as the
system returns to the high-dimensional chaotic basal
state (less synchrony). Brain dynamics is described
as a sequence of phase transitions with intermittent
synchronization-desynchronization effects. The rapid
emergence of synchronization can be initiated by (Hebbian)
neural assemblies that lock into synchronization
across widespread cortical and subcortical areas [33.82,
156, 157].

Fig. 33.25a,b Comparison of the memory capacity of (a) Hopfield and (b) KIII neural networks; the noise level is 40%
(left); 50% (right); the lighter the color the higher the recall accuracy. Observe the linear gain for Hopfield networks and
the superlinear (fifth-order) separation for KIII (after [33.152])

Intermittent oscillations in spatio-temporal neural 
dynamics are modeled by a neuropercolation approach. 
Neuropercolation is a family of probabilistic models 
based on the theory of probabilistic cellular automata 
on lattices and random graphs and it is motivated by 
structural and dynamical properties of neural populations.
Neuropercolation constructs the hierarchy of
interactive populations in networks as developed in
Freeman K models [33.94, 144], but replaces differential
equations with probability distributions from the
observed random networks that evolve in time [33.158].
Neuropercolation considers populations of cortical neu- 
rons which sustain their background state by mutual 
excitation, and their stability is guaranteed by the neural 
refractory periods. Neural populations transmit and re- 


ceive signals from other populations by virtue of small- 
world effects [33.77, 159]. Tools of statistical physics 
and finite-size scaling theory are applied to describe 
critical behavior of the neuropil. Neuropercolation the- 
ory provides a mathematical approach to describe phase 
transitions and critical phenomena in large-scale, in- 
teractive cortical networks. The existence of phase 
transitions is proven in specific probabilistic cellular au- 
tomata models [33.160, 161]. 
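
A probabilistic cellular automaton of this general kind can be sketched as a noisy majority rule on a torus. The specific rule below (five-site neighborhood including the site itself, majority vote, output flipped with probability eps) is an assumption made for illustration and is not taken from the source; the function name is hypothetical.

```python
import random


def majority_step(grid, eps, rng):
    """One synchronous update of a noisy majority rule on a 2-D torus.

    grid is a list of rows of 0/1 states. Each site adopts the majority state
    of itself and its four nearest neighbors, and the result is flipped with
    probability eps (the noise level).
    """
    rows, cols = len(grid), len(grid[0])
    new = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = (
                grid[r][c]
                + grid[(r - 1) % rows][c]
                + grid[(r + 1) % rows][c]
                + grid[r][(c - 1) % cols]
                + grid[r][(c + 1) % cols]
            )
            majority = 1 if total >= 3 else 0
            new[r][c] = 1 - majority if rng.random() < eps else majority
    return new
```

With eps = 0 the rule is deterministic and a homogeneous lattice is a fixed point; increasing eps lets fluctuations destabilize the ordered state, which is the type of noise-driven transition studied with such models.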

Simulations by neuropercolation models demonstrate
the onset of large-scale synchronization-desynchronization
behavior [33.162]. Figure 33.26 illustrates
results of intermittent phase desynchronization for neuropercolation
with excitatory and inhibitory populations.
Three main regimes can be distinguished, separated
by critical noise values ε_1 > ε_0. In Regime I,
ε > ε_1 (Fig. 33.26a), the channels are not synchronous and
the phase values are distributed broadly. In Regime II,
ε_1 > ε > ε_0 (Fig. 33.26b), the phase lags are drastically
reduced, indicating significant synchrony over extended
time periods. Regime III is observed for
ε_0 > ε, when the channels demonstrate highly
synchronized, frozen dynamics, see Fig. 33.26c. Similar
transitions can be induced by the relative strength
of inhibition, as well as by the fraction of rewiring
across the network [33.114, 115, 163]. The probabilistic
model of neural populations reproduces important properties
of the spatio-temporal dynamics of cortices and is
a promising approach for large-scale cognitive models.

Fig. 33.26a-c Phase synchronization-desynchronization with
excitatory-inhibitory connections
in neuropercolation with 256 granule
nodes; the z-axis shows the pairwise
phase between the units. (a)
No synchrony; (b) intermittent synchrony;
(c) highly synchronized,
frozen phase regime (after [33.162])

Sequential processing of fetch, decode, and execution
of instructions through the classical von Neumann
digital computers has resulted in less efficient
machines as their ecosystems have grown to be increasingly
complex [33.164]. Though modern digital
computers are fast and complex enough to emulate the
brain functionality of animals like spiders, mice, and
cats [33.165, 166], the associated energy dissipation in
the system grows exponentially along the hierarchy
of animal intelligence. For example, to perform certain
cortical simulations at the cat scale even at an
83 times slower firing rate, the IBM team had to employ
Blue Gene/P (BG/P), a supercomputer equipped
with 147456 CPUs and 144 TB of main memory. On
the other hand, the human brain contains more than
100 billion neurons and each neuron has more than
20000 synapses [33.167]. Efficient circuit implementation
of synapses, therefore, is very important to build
a brain-like machine. One active branch of this research
area is cellular neural networks (CNNs) [33.168,
169], where lots of multiplication circuits are utilized in
a complementary metal-oxide-semiconductor (CMOS)
chip. However, since shrinking the current transistor
size is very difficult, introducing a more efficient approach
is essential for further development of neural
network implementations.

The memristor was first postulated by Chua as
the fourth basic circuit element in electrical circuits in
1971 [33.170]. It is based on the nonlinear characteristics
of charge and flux. By supplying a voltage or
current to the memristor, its resistance can be altered.
In this way, the memristor remembers information. In
that seminal work, Chua demonstrated that the memristance
M(q) relates the charge q and the flux φ in such
a way that the resistance of the device will change with
the applied electric field and time


$$
M(q) = \frac{d\varphi(q)}{dq} . \tag{33.26}
$$

The parameter M denotes the memristance of a charge-controlled
memristor, measured in ohms. Thus, the
memristance M can be controlled by applying a voltage
or current signal across the memristor. In other words,
the memristor behaves like an ordinary resistor at any
given instant of time, where its resistance depends on
the complete history of the device [33.170].

Although the device was proposed nearly four
decades ago, it was not until 2008 that researchers from
HP Labs showed that the devices they had fabricated
were indeed two-terminal memristors [33.171]. Figure 33.27
shows the I-V characteristics of a generic
memristor, where memristance behavior is observed
for TiO2-based devices. A TiO2-x layer with oxygen
vacancies is placed on a perfect TiO2 layer, and
these layers are sandwiched between platinum electrodes.
In metal oxide materials, the switching from R_off
to R_on and vice versa occurs as a result of ion migration,
due to the enormous electric fields applied across
the nanoscale structures. These memristors have been
fabricated using nanoimprint lithography and were successfully
integrated on a CMOS substrate in [33.172].
Apart from these metal-oxide memristors, memristance
has also been demonstrated using magnetic materials
based on their magnetic domain wall motion and spin-torque
induced magnetization switching in [33.173].

Fig. 33.27 Typical I-V characteristic of memristor (after
[33.171]). The pinched hysteresis loop is due to the
nonlinear relationship between the memristor current
and voltage. The parameters of the memristor are R_on =
100 Ω, R_off = 16 kΩ, R_init = 11 kΩ, D = 10 nm, μ_v =
10^{-10} cm^2 s^{-1} V^{-1}, p = 10, and V_in = sin(2πt). The memristor
exhibits the feature of pinched hysteresis, which
means that a lag occurs between the application and the
removal of a field and its subsequent effect, just like the
feature of neurons in the human brain

Fig. 33.28 Window function for different integer p


Furthermore, several different types of nonlinear memristor
models have been investigated [33.174, 175].
One of them is the window model in which the state
equation is multiplied by the window function F_p(ω),
namely

$$
\frac{d\omega}{dt} = \mu_v \frac{R_{on}}{D}\, i(t)\, F_p(\omega) , \tag{33.27}
$$

where p is an integer parameter and F_p(ω) is defined by

$$
F_p(\omega) = 1 - (2\omega - 1)^{2p} , \tag{33.28}
$$

which is shown in Fig. 33.28.
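
The window model can be explored with a short simulation. The sketch below makes several assumptions that go beyond the text: the state ω is taken as dimensionless in [0, 1], the memristance is interpolated linearly as M = R_on ω + R_off (1 - ω), the drift coefficient is normalized as μ_v R_on / D^2 so that ω remains dimensionless (this normalization may differ from the source), and the default parameter values are illustrative. All function names are hypothetical.

```python
import math


def window(w, p):
    """Window function F_p(w) = 1 - (2w - 1)^(2p); it vanishes at w = 0 and w = 1."""
    return 1.0 - (2.0 * w - 1.0) ** (2 * p)


def simulate_memristor(r_on=100.0, r_off=16e3, r_init=11e3, d=10e-9,
                       mu_v=1e-14, p=10, dt=1e-4, steps=20000):
    """Drive the device with v(t) = sin(2*pi*t) and return (v, i, M) samples.

    Units are SI: ohms, meters, and m^2/(V s) for the mobility mu_v.
    """
    w = (r_off - r_init) / (r_off - r_on)  # state that yields M = r_init
    k = mu_v * r_on / d**2                 # assumed drift normalization
    samples = []
    for n in range(steps):
        v = math.sin(2.0 * math.pi * n * dt)
        m = r_on * w + r_off * (1.0 - w)
        i = v / m
        samples.append((v, i, m))
        w += dt * k * i * window(w, p)
        w = min(1.0, max(0.0, w))          # keep the state inside [0, 1]
    return samples
```

Plotting i against v from the returned samples gives a loop that passes through the origin, since i = v / M vanishes whenever v does; this is the pinched hysteresis described above. Larger p makes the window flatter in the interior and steeper near the boundaries.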


33.3.1 Memristor-Based Synapses 


The design of simple weighting circuits for synaptic
multiplication between arbitrary input signals and weights is
extremely important in artificial neural systems. Some efforts
have been made to build neuron-like analog neural networks
[33.178-180]. However, this research has had limited success so
far because of the difficulty of implementing the synapses
efficiently. Based on the memristor, a novel weighting circuit
was proposed by Kim et al. [33.176, 181, 182], as shown in
Fig. 33.29. The memristors provide a bridge-like switching for
achieving either positive or negative weighting. Though several
memristors are employed to emulate a synapse, the total area of
the memristors is less than that of a single transistor. To
compensate for the spatial nonuniformity and nonideal response
of the memristor bridge synapse, a modified chip-in-the-loop
learning scheme suitable for the proposed neural network
architecture is investigated [33.176]. In the proposed method,
the initial learning is conducted by software, and the behavior
of the software-trained network is then learned by the hardware
network by training each of the single-layered neurons of the
network independently. The forward calculation of single-layered
neuron learning is implemented through circuit hardware and is
followed by a weight updating phase assisted by a host computer.
Unlike conventional chip-in-the-loop learning, the need for the
readout of synaptic weights for calculating weight updates in
each epoch is eliminated by virtue of the memristor bridge
synapse and the proposed learning scheme.

Fig. 33.29 Memristor bridge circuit. The synaptic weight
is programmable by varying the input voltage. The weighting
of the input signal is also performed in this circuit
(after [33.176])

Fig. 33.30 Neuromorphic memristive computer equipped
with STDP (after [33.177])
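
How a four-element bridge can yield signed weights may be illustrated with a plain voltage-divider analysis. This is a sketch under stated assumptions rather than the circuit of [33.176]. The labels m1 to m4, the assumption that m1 and m2 form one divider and m3 and m4 the other, and the differential output taken between the two midpoints are all hypothetical choices made for illustration.

```python
def bridge_weight(m1, m2, m3, m4):
    """Signed weight of an assumed bridge: difference of two divider ratios.

    m1/m2 form the left divider and m3/m4 the right divider (resistances
    in ohms). The result lies in (-1, 1).
    """
    return m2 / (m1 + m2) - m4 / (m3 + m4)


def bridge_output(v_in, m1, m2, m3, m4):
    """Differential output voltage for an input voltage v_in."""
    return bridge_weight(m1, m2, m3, m4) * v_in
```

In this picture a balanced bridge gives a zero weight, and shifting the memristances in opposite directions on the two sides moves the weight toward positive or negative values.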

On the other hand, spike-timing-dependent plasticity (STDP),
which is a powerful learning paradigm for spiking neural systems
because of its massive parallelism, potential scalability, and
inherent defect, fault, and failure tolerance, can be
implemented by using a crossbar memristive array combined with
neurons that asynchronously generate spikes of a given shape
[33.177, 185]. Such spikes need to be sent back through the
neurons to the input terminal as in Fig. 33.30. The shape of the
spikes turns out to be very similar to the neural spikes
observed in realistic biological neurons. The STDP learning
function obtained by combining such neurons with memristors is
exactly that obtained from neurophysiological experiments on
real synapses. Such nanoscale synapses can be combined with CMOS
neurons, which makes it possible to create neuromorphic hardware
several orders of magnitude denser than in conventional CMOS.
This method offers better control over power dissipation; fewer
constraints on the design of memristive materials used for
nanoscale synapses; greater freedom in learning algorithms than
traditional design of synapses, since the synaptic learning
dynamics can be dynamically turned on or off; greater control
over the precise form and timing of the STDP equations; the
ability to implement a variety of other learning laws besides
STDP; and better circuit diversity, since the approach allows
different learning laws to be implemented in different areas of
a single chip using the same memristive material for all
synapses. Furthermore, an analog CMOS neuromorphic design
utilizing STDP and memristor synapses is investigated for use in
building a multipurpose analog neuromorphic chip [33.186]. In
order to obtain a multipurpose chip, a suitable architecture is
established. Based on the IBM 90 nm CMOS9RF technology, neurons
are designed to interface with Verilog-A memristor synapse
models to perform the XOR operation and an edge detection
function.

Fig. 33.31 Memristor-based cellular
neural networks cell (after [33.183])

Fig. 33.32 Simple realization of MNN based on fuzzy concepts (after [33.184])
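
The STDP learning function mentioned above is often summarized by an exponential timing window. The sketch below uses that common phenomenological form. It is not taken from [33.177], and the amplitudes and time constants are hypothetical placeholder values.

```python
import math


def stdp_delta_w(dt_ms, a_plus=0.01, a_minus=0.012,
                 tau_plus=20.0, tau_minus=20.0):
    """Weight change for a spike pair with dt_ms = t_post - t_pre (ms).

    Pre-before-post (dt_ms > 0) potentiates the synapse and
    post-before-pre (dt_ms < 0) depresses it. All parameters are
    illustrative assumptions.
    """
    if dt_ms > 0:
        return a_plus * math.exp(-dt_ms / tau_plus)
    if dt_ms < 0:
        return -a_minus * math.exp(dt_ms / tau_minus)
    return 0.0
```

In a memristive crossbar, a change of this kind would correspond to a conductance change produced by the overlap of the pre- and postsynaptic spike waveforms across the device.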

To make the neurons compatible with such new synapses, some
novel training methods have been proposed. For instance, Manem
et al. proposed a variation-tolerant training method to
efficiently reconfigure memristive synapses in a trainable
threshold gate array (TTGA) system [33.187]. The training
process is inspired by the gradient descent machine learning
algorithm commonly used to train artificial threshold neural
networks known as perceptrons. The proposed training method is
robust to the unpredictability of CMOS and nanocircuits with
decreasing technology sizes, but also introduces its own
randomness into the training.


33.3.2 Memristor-Based Neural Networks 


Employing memristor-based synapses, some results have been
obtained on memristor-based neural networks [33.183, 184, 188].
As the template weights in memristor-based neural networks
(MNNs) are usually known and need to be updated between each
template in a sequence of templates, there should be a way to
rapidly change the weights. Meanwhile, the MNN cells need to be
modified, as the programmable couplings are implemented by
memristors, which require programming circuits that isolate them
from each other. Lehtonen and Laiho proposed a new cell of a
memristor-based cellular neural network that can be used to
program the templates. For this purpose, a global voltage is
input into the cell. This voltage is used to convey the weight
of one connection into the cells [33.183]. The levels of the
virtual ground and the switches are controlled so that the
memristor connected to a particular neighbor is biased above the
programming threshold until it reaches the desired resistance
value.

Merrikh-Bayat et al. presented a new way to explain 
the relationships between logical circuits and artificial 
neural networks, logical circuits and fuzzy logic, and 
artificial neural networks and fuzzy inference systems, 
and proposed a new neuro-fuzzy computing system, 
which can effectively be implemented via the memristor-crossbar
structure [33.184]. A simple realization of MNNs is shown in
Figs. 33.32-33.34. Figure 33.32 shows that it is possible to
interpret the working procedure of a conventional artificial
neural network (ANN) without changing its structure. In this
figure, each row of the structure implements a simple fuzzy rule
or minterm. Figure 33.33 shows how the activation function of
neurons can be implemented when the activation function is
modeled by a t-norm operator. Vector-matrix multiplication is
performed by the circuit in Fig. 33.34. This circuit consists of
a simple memristor crossbar where each of its rows is connected
to the virtually grounded terminal of an operational amplifier
that plays the role of a neuron with identity activation
function. The advantages of the proposed system are twofold:
first, its hardware can be directly trained using the Hebbian
learning rule without the need to perform any optimization;
second, this system has a great ability to deal with a huge
number of input-output training data without facing problems
like overtraining.
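
The multiplication carried out by a crossbar whose rows end in virtually grounded amplifiers can be sketched as follows. This is an idealized illustration, not the circuit of [33.184]. It assumes ideal operational amplifiers in an inverting configuration with a feedback resistor r_f, input voltages applied to the columns, and memristors represented only by their conductances. The function name and parameter values are hypothetical.

```python
def crossbar_outputs(conductances, v_in, r_f=1.0e3):
    """Output voltages of an idealized memristor crossbar.

    conductances[i][j] is the conductance (siemens) of the device joining
    input column j to output row i. Each row collects the current
    sum_j G[i][j] * v_in[j] at a virtual ground, and the assumed inverting
    amplifier turns it into the voltage -r_f * current.
    """
    return [-r_f * sum(g * v for g, v in zip(row, v_in))
            for row in conductances]


g_matrix = [[1e-3, 2e-3],
            [0.0, 1e-3]]
outputs = crossbar_outputs(g_matrix, [1.0, 0.5])
```

Each output is thus a weighted sum of the inputs, with the weights set by the programmed conductances, which is why such a row can play the role of a neuron with identity activation.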

Howard et al. proposed a spiking neuro-evolution- 
ary system which implements memristors as plas- 
tic connections [33.188]. These memristors provide 
a learning architecture that may be beneficial to the
evolutionary design process, which exploits parameter
self-adaptation and variable topologies, allowing the number of
neurons, connection weights, and interneural connectivity
pattern to emerge. This approach
allows the evolution of networks with appropriate 
complexity to emerge whilst exploiting the memris- 
tive properties of the connections to reduce learning 
time. 

To investigate the dynamic behaviors of memristor-based neural
networks, Zeng et al. proposed the memristor-based recurrent
neural networks (MRNNs) [33.189, 190] shown in Fig. 33.35, where
$x_i(\cdot)$ is the state of the $i$-th subsystem, $f_j(\cdot)$
is the amplifier, $M_{ij}$ is the connection memristor between
the amplifier $f_j(\cdot)$ and state $x_i(\cdot)$, $R_i$ and
$C_i$ are the resistor and capacitor, $I_i$ is the external
input, $a_i$, $b_i$ are the outputs, and $i, j = 1, 2, \ldots, n$.
The parameters in this neural network change according to the
state of the system, so this network is a state-dependent
switching system. The dynamic behavior of this neural network
with time-varying delays was investigated based on the Filippov
theory and the Lyapunov method.

Fig. 33.33 Implementation of the activation function of neurons
(after [33.184])

Fig. 33.34 Memristor crossbar-based circuit (after [33.184])

Fig. 33.35 Circuit of a memristor-based recurrent network (after [33.189])


33.3.3 Conclusion


Memristor-based synapses and neural networks have been
investigated by many scientists for their possible applications
in analog and digital information processing and in memory and
logic applications. However, the problem of how to take
advantage of the nonvolatile memory, nanoscale size, low power
dissipation, and other properties of memristors to design a
method to process and store the information, which needs
learning and memory, into the synapses of the memristor-based
neural networks at the dynamical mapping space by a more
rational space-parting method is still an open issue. Further
investigation is needed to close this gap.
33.4 Neurodynamic Optimization


Optimization is omnipresent in nature and society, and is an
important tool for problem-solving in science, engineering, and
commerce. Optimization problems arise in a wide variety of
applications such as the design, planning, control, operation,
and management of engineering systems. In many applications
(e.g., online pattern recognition and on-chip signal processing
in mobile devices), real-time optimization is necessary or
desirable. For such applications, conventional optimization
techniques may not be adequate due to stringent requirements on
computational time. It is computationally challenging when
optimization procedures are to be performed in real time to
optimize the performance of dynamical systems.

The brain is a profound dynamic system and its 
neurons are always active from birth to death. When 
a decision is to be made in the brain, many of its neu- 
rons are highly activated to gather information, search 
memory, compare differences, and make inferences 
and decisions. Recurrent neural networks are brain-like 
nonlinear dynamic system models and can be prop- 
erly designed to imitate biological counterparts and 
serve as goal-seeking parallel computational models 
for solving optimization problems in a variety of settings.
Neurodynamic optimization can be physically realized in
designated hardware such as application-specific integrated
circuits (ASICs), where optimization is carried out in a
parallel and distributed manner and the convergence rate of the
optimization process is independent of the problem
dimensionality. Because of
the inherent nature of parallel and distributed informa- 
tion processing, neurodynamic optimization can handle 
large-scale problems. In addition, neurodynamic opti- 
mization may be used for optimizing dynamic systems 
in multiple time scales with parameter-controlled con- 
vergence rates. These salient features are particularly 
desirable for dynamic optimization in decentralized 
decision-making scenarios [33.191-194]. While
population-based evolutionary approaches to optimization
have emerged as prevailing heuristic and stochastic 
methods in recent years, neurodynamic optimization 
deserves great attention in its own right due to its close 
ties with optimization and dynamical systems theories, 
as well as its biological plausibility and circuit imple- 
mentability with very large scale integration (VLSI) or 
optical technologies. 


33.4.1 Neurodynamic Models 


The past three decades witnessed the birth and growth of
neurodynamic optimization. Although a couple of circuit-based
optimization methods were developed earlier [33.195-197], it was
perhaps Hopfield and Tank who spearheaded neurodynamic
optimization research in the context of neural computation with
their seminal work in the mid 1980s [33.198-200]. Since the
inception, numerous neurodynamic optimization models in various
forms of recurrent neural networks have been developed and
analyzed; see [33.201-256] and the references therein. For
example, Tank and Hopfield extended the continuous-time Hopfield
network for linear programming and showed their experimental
results with a circuit of operational amplifiers and other
discrete components on a breadboard [33.200]. Kennedy and Chua
developed a circuit-based recurrent neural network for nonlinear
programming [33.201]. It is proven that the state of the
neurodynamics is globally convergent and that an equilibrium
corresponds to an approximate optimal solution of the given
optimization problem.

Over the years, neurodynamic optimization re- 
search has made significant progress with models with 
improved features for solving various optimization 
problems. Substantial improvements of neurodynamic 
optimization theory and models have been made in the 
following dimensions: 


i) Solution quality: designed based on smooth penalty
methods with a finite penalty parameter, the earliest
neurodynamic optimization models can converge to
approximate solutions only [33.200, 201]. Later on,
better models designed based on other design principles
can guarantee state or output convergence
to exact optimal solutions of solvable convex and
pseudoconvex optimization problems with or without
any conditions [33.204, 205, 208, 210], etc.

ii) Solvability scope: the solvability scope of neuro- 
dynamic optimization has been expanded from lin- 
ear programming problems [33.200, 202, 208, 211, 
212, 214-219, 223, 242, 244, 251], to quadratic pro- 
gramming problems [33.202—206, 210, 214, 217, 
218, 220, 225, 226, 229, 233, 240-243, 247], to 
smooth convex programming problems with various 
constraints [33.201, 204, 205, 210, 214, 222, 224, 
228, 230, 232, 234, 237, 245, 246, 257], to nons- 
mooth convex optimization problems [33.235, 248, 
250-256], and recently to nonsmooth optimization 
with some nonconvex objective functions or con- 
straints [33.239, 249, 254-256]. 

iii) Convergence property: the convergence property of 
neurodynamic optimization models has been ex- 
tended from near-optimum convergence [33.200, 
201], to conditional exact-optimum global conver- 
gence [33.205, 208, 210], to guaranteed global con- 
vergence [33.204, 205, 214-216, 218, 219, 222, 
226-228, 230, 232, 234, 240, 243, 245, 247, 250, 
253, 256, 257], to faster global exponential con- 
vergence [33.206, 224, 225, 228, 233, 237, 239, 241, 
246, 254], to even more desirable finite-time
convergence [33.235, 248, 249, 251, 252, 255], with
increasing convergence rate.

iv) Model complexity: the neurodynamic optimization 
models for constrained optimization are essentially 
multilayer due to the introduction of instrumental
variables for constraint handling (e.g., Lagrange
multipliers or dual variables). The architectures of 
later neurodynamic optimization models for solv- 
ing linearly constrained optimization problems have 
been reduced from multilayer structures to single- 
layer ones with decreasing model complexity to 
facilitate their implementation [33.243, 244, 251, 
252, 254, 255]. 

Activation functions are a signature component 
of neural network models for quantifying the 
firing state activities of neurons. The activation 
functions in existing neurodynamic optimization 
models include smooth ones (e.g., sigmoid), as 
shown in Fig. 33.36a,b [33.200, 208-210], nons- 
mooth ones (e.g., piecewise-linear) as shown in 
Fig. 33.36c,d [33.203, 206], and even discontinuous 
ones as shown in Fig. 33.36e,f [33.243, 244, 251, 
252, 254, 255]. 
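
The three classes of activation functions can be illustrated with one representative each. The sketch below is only indicative: the specific shapes, the gain parameter, and the bounds are assumptions, and they are not claimed to reproduce the panels of Fig. 33.36.

```python
import math


def sigmoid(u, gain=1.0):
    """Smooth activation: differentiable everywhere."""
    return 1.0 / (1.0 + math.exp(-gain * u))


def saturating_linear(u, lower=0.0, upper=1.0):
    """Nonsmooth activation: continuous and piecewise linear, with kinks
    at the two saturation bounds."""
    return min(max(u, lower), upper)


def hard_limiter(u):
    """Discontinuous activation: jumps from 0 to 1 at u = 0.

    In the analysis of such models the value at u = 0 is usually treated
    as set-valued; returning 0.0 there is a simplification of this sketch.
    """
    return 1.0 if u > 0.0 else 0.0
```

The nonsmooth and discontinuous variants are the ones that lead to differential inclusions rather than ordinary differential equations when they enter a neurodynamic model.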


33.4.2 Design Methods 


Fig. 33.36a-f Three classes of activation functions in
neurodynamic optimization models: smooth in (a) and (b),
nonsmooth in (c) and (d), and discontinuous in (e) and (f)

The crux of neurodynamic optimization model design lies in the
derivation of a convergent neurodynamic equation that prescribes
the states of the neurodynamics. A properly derived neurodynamic
equation can ensure that the states of the neurodynamics reach
an equilibrium that satisfies the constraints and optimizes the
objective function. Although the existing neurodynamic
optimization models are highly diversified with many different
features, the design methods or principles for determining their
neurodynamic equations can be categorized as follows:


i) Penalty methods 

ii) Lagrange methods 
iii) Duality methods 

iv) Optimality methods. 


Penalty Methods 
Consider the general constrained optimization problem

$$\begin{aligned} \text{minimize} \quad & f(x) \\ \text{subject to} \quad & g(x) \le 0 , \\ & h(x) = 0 , \end{aligned}$$

where $x \in \mathbb{R}^n$ is the vector of decision variables,
$f(x)$ is an objective function,
$g(x) = [g_1(x), \ldots, g_m(x)]^T$ is a vector-valued function,
and $h(x) = [h_1(x), \ldots, h_p(x)]^T$ is a vector-valued
function.

A penalty method starts with the formulation of 
a smooth or nonsmooth energy function based on 
a given objective function f(x) and constraints g(x) 
and h(x). It plays an important role in neurodynamic 
optimization. Ideally, the minimum of a formulated en- 
ergy function corresponds to the optimal solution of 
the original optimization problem. For constrained op- 
timization, the minimum of the energy function has to 
satisfy a set of constraints. Most early approaches formulate
an energy function by incorporating objective function and
constraints through functional transformation and numerical
weighting [33.198-201]. Functional transformation is usually
used to convert constraints to a penalty function to penalize
the violation of constraints; e.g., a smooth penalty function is
as follows

$$p(x) = \frac{1}{2}\left\{\sum_{i=1}^{m}\left([g_i(x)]^{+}\right)^2 + \sum_{j=1}^{p}[h_j(x)]^2\right\} ,$$

where $[y]^{+} = \max\{0, y\}$. Numerical weighting is often
used to balance constraint satisfaction and objective
optimization, e.g.,

$$E(x) = f(x) + w\,p(x) ,$$

where $w$ is a positive weight.
In smooth penalty methods, neurodynamic equations are usually
derived as the negative gradient flow of the energy function in
the form of a differential equation

$$\frac{dx(t)}{dt} \propto -\nabla E(x(t)) .$$

If the energy function is bounded below, the stability of the
neurodynamics can be ensured. Nevertheless, the major limitation
is that the neurodynamics designed using a smooth penalty method
with any fixed finite penalty parameter can converge to an
approximate optimal solution only, as a compromise between
constraint satisfaction and objective optimization. One way to
remedy the approximation limitation of smooth penalty design
methods is to introduce a variable penalty parameter. For
example, a time-varying decaying penalty parameter (called
temperature) is used in deterministic annealing networks to
achieve exact optimality with a slow cooling schedule [33.208,
210].
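
The approximation inherent in a fixed finite penalty weight can be seen on a one-variable problem. The sketch below is an illustrative toy, not one of the cited models: it minimizes (x - 2)^2 subject to x - 1 <= 0, whose exact solution is x = 1, and integrates the negative gradient flow of the penalized energy with a forward-Euler scheme. The problem, step size, and function name are all assumptions chosen for illustration.

```python
def smooth_penalty_flow(w, x0=0.0, step=1e-3, steps=20000):
    """Euler integration of dx/dt = -dE/dx for the energy

        E(x) = (x - 2)^2 + w * 0.5 * max(0, x - 1)^2,

    i.e. objective (x - 2)^2 with a smooth penalty on g(x) = x - 1 <= 0.
    """
    x = x0
    for _ in range(steps):
        grad = 2.0 * (x - 2.0) + w * max(0.0, x - 1.0)
        x -= step * grad
    return x


x_w10 = smooth_penalty_flow(10.0)
x_w100 = smooth_penalty_flow(100.0)
```

Setting the gradient to zero gives the equilibrium x = (4 + w) / (2 + w), which always lies slightly above the constraint boundary and approaches 1 only as w grows. This is the compromise between constraint satisfaction and objective optimization described above.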

If the objective function or penalty function is nons- 
mooth, the gradient has to be replaced by a generalized 
gradient and the neurodynamics can be modeled us- 
ing a differential inclusion [33.235, 248, 249, 251, 252, 
255]. Two advantages of nonsmooth penalty methods 
over smooth ones are possible constraint satisfaction 
and objective optimization with some finite penalty pa- 
rameters and finite-time convergence of the resulting 
neurodynamics. Needless to say, nonsmooth neurody- 
namics are much more difficult to analyze to guarantee 
their stability. 
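
The first advantage can be illustrated on the same kind of toy problem: minimize (x - 2)^2 subject to x - 1 <= 0. The sketch below is an assumption-laden illustration, not one of the cited models. It uses the nonsmooth energy E(x) = (x - 2)^2 + w * max(0, x - 1) and follows one element of its generalized gradient with a forward-Euler scheme. For this problem any weight w > 2 makes x = 1 the exact minimizer, and w = 3 is an arbitrary choice.

```python
def nonsmooth_penalty_flow(w=3.0, x0=0.0, step=1e-3, steps=5000):
    """Euler discretization of a subgradient flow for

        E(x) = (x - 2)^2 + w * max(0, x - 1).

    The discrete iterate chatters around the constraint boundary with an
    amplitude proportional to the step size; that chattering is an
    artifact of the discretization, not of the continuous-time model.
    """
    x = x0
    for _ in range(steps):
        # Pick one element of the generalized gradient at x.
        subgrad = 2.0 * (x - 2.0) + (w if x > 1.0 else 0.0)
        x -= step * subgrad
    return x


x_final = nonsmooth_penalty_flow()
```

Unlike a smooth penalty with a comparable weight, the iterate settles at the constraint boundary x = 1 up to the step-size resolution, without letting the penalty weight grow.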


Lagrange Methods 
A Lagrange method for designing a neurodynamic optimization
model begins with the formulation of a Lagrange function
(Lagrangian) instead of an energy function [33.204, 205]. A
typical Lagrangian is defined as

$$L(x, \lambda, \mu) = f(x) + \sum_{i=1}^{m}\lambda_i g_i(x) + \sum_{j=1}^{p}\mu_j h_j(x) ,$$

where $\lambda = (\lambda_1, \ldots, \lambda_m)^T$ and
$\mu = (\mu_1, \ldots, \mu_p)^T$ are Lagrange multipliers for
inequality constraints $g(x)$ and equality constraints $h(x)$,
respectively.

According to the saddle-point theorem, the optimal solution can
be determined by minimizing the Lagrangian with respect to $x$
and maximizing it with respect to $\lambda$ and $\mu$.
Therefore, neurodynamic equations can be derived in an augmented
space

$$\begin{aligned} \epsilon \frac{dx(t)}{dt} &= -\nabla_x L(x(t), \lambda(t), \mu(t)) , \\ \epsilon \frac{d\lambda(t)}{dt} &= \nabla_\lambda L(x(t), \lambda(t), \mu(t)) , \\ \epsilon \frac{d\mu(t)}{dt} &= \nabla_\mu L(x(t), \lambda(t), \mu(t)) , \end{aligned}$$

where $\epsilon$ is a positive time constant. The equilibria
of the Lagrangian neurodynamics satisfy the Lagrange
necessary optimality conditions.
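
A small equality-constrained example makes the saddle-point dynamics concrete. The sketch below is illustrative only: it assumes the toy problem of minimizing 0.5 * (x1^2 + x2^2) subject to x1 + x2 - 1 = 0, for which the optimum is x1 = x2 = 0.5 with multiplier -0.5, and it integrates the descent-ascent equations with a forward-Euler scheme. The function name, step size, and initial state are hypothetical.

```python
def lagrangian_flow(eps=1.0, step=0.01, steps=4000):
    """Euler integration of the Lagrangian neurodynamics for

        minimize 0.5 * (x1^2 + x2^2)  subject to  x1 + x2 - 1 = 0,

    with L(x, mu) = 0.5 * (x1^2 + x2^2) + mu * (x1 + x2 - 1).
    The state x descends on L while the multiplier mu ascends.
    """
    x1, x2, mu = 0.0, 0.0, 0.0
    for _ in range(steps):
        dx1 = -(x1 + mu)          # -dL/dx1
        dx2 = -(x2 + mu)          # -dL/dx2
        dmu = x1 + x2 - 1.0       # +dL/dmu
        x1 += step * dx1 / eps
        x2 += step * dx2 / eps
        mu += step * dmu / eps
    return x1, x2, mu


x1_opt, x2_opt, mu_opt = lagrangian_flow()
```

The multiplier acts as an extra state variable, so the network has one more neuron than there are decision variables. This is the kind of added dimension that later single-layer designs aim to avoid.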


Duality Methods 
For convex optimization, the objective functions of pri- 
mal and dual problems reach the same value at their 
optima. In view of this duality property, the dual- 
ity methods for designing neurodynamic optimization 
models begin with the formulation of an energy func- 
tion consisting of a duality gap between the primal and 
dual problems and a constraint-based penalty function, 


e.g.,

$$E(x, y) = \frac{1}{2}\left(f(x) - f_d(y)\right)^2 + p(x) + p_d(y) ,$$

where $y$ is a vector of dual variables, $f_d(y)$ is the dual
objective function to be maximized, and $p(x)$ and $p_d(y)$
are, respectively, smooth penalty functions to penalize
the violations of constraints of primal (original)
and dual problems. The corresponding
neurodynamic equation can be derived with guaranteed
global stability as the negative gradient flow of the 
energy function similarly as in the aforementioned 
smooth penalty methods [33.216, 218, 222, 226, 258, 
259]. Neurodynamic optimization models designed by 
using duality design methods can guarantee global 
convergence to the exact optimal solutions of con- 
vex optimization problems without any parametric 
condition. 

In addition, using duality methods, dual networks 
and their simplified/improved versions can be designed 
for quadratic programming with reduced model
complexity by mapping their globally convergent optimal dual
state variables to optimal primal solutions via linear 
or piecewise-linear output functions [33.240, 247, 260- 
263]. 
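
A duality-gap energy function can be tried out on a minimal linear program. The sketch below is a toy under stated assumptions, not one of the cited networks. The primal problem is to minimize x subject to x >= 1 and x >= 0, and its dual is to maximize y subject to y <= 1 and y >= 0, so both optima equal 1. The energy combines the squared duality gap with smooth penalties on all four constraints, and its negative gradient flow is integrated with a forward-Euler scheme.

```python
def pos(z):
    """Positive part [z]^+ = max(0, z)."""
    return max(0.0, z)


def duality_flow(x0=3.0, y0=0.0, step=0.01, steps=6000):
    """Euler integration of the negative gradient flow of

        E(x, y) = 0.5 * (x - y)^2
                  + 0.5 * pos(1 - x)^2 + 0.5 * pos(-x)^2    # primal penalty
                  + 0.5 * pos(y - 1)^2 + 0.5 * pos(-y)^2    # dual penalty
    """
    x, y = x0, y0
    for _ in range(steps):
        grad_x = (x - y) - pos(1.0 - x) - pos(-x)
        grad_y = -(x - y) + pos(y - 1.0) - pos(-y)
        x -= step * grad_x
        y -= step * grad_y
    return x, y


x_star, y_star = duality_flow()
```

Because the energy is zero only where the gap vanishes and both problems are feasible, the flow reaches the exact optimum without any penalty weight having to grow, in contrast to the plain smooth penalty approach.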


Optimality Methods 
The neurodynamic equations of some recent models 
are derived based on optimality conditions (e.g., the 
Karush-Kuhn-Tucker condition) and projection methods.
Basically, these methods map the equilibria
of the designed neurodynamic optimization models to
the equivalent equalities given by optimality conditions
and projection equations (i.e., all equilibria essentially
satisfy the optimality conditions) [33.225, 227, 228].
For several types of common geometric constraints 
(such as nonnegative constraints, bound constraints, 
and spherical constraints), some projection operators 
map the neuron state variables onto the convex fea- 
sible regions by using their activation functions and 
avoid the use of excessive dual variables as in the 
dual networks, and thus lower the model complexity. 
For neurodynamic optimization models designed using 
optimality methods, stability analysis is needed ex- 
plicitly to ensure that the resulting neurodynamics are 
stable. 
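
A projection-type model for bound constraints can be sketched in a few lines. The example below is an illustration under assumptions and is not claimed to be any specific cited network. It uses the commonly seen form dx/dt = -x + P(x - alpha * grad f(x)), where P clips each component to the box [0, 1] and thus acts as a piecewise-linear activation function. The objective (x1 - 2)^2 + (x2 + 1)^2, the scaling alpha, and the Euler step are hypothetical choices; the constrained optimum of this toy problem is (1, 0).

```python
def clip(u, lower=0.0, upper=1.0):
    """Projection of a scalar onto the interval [lower, upper]."""
    return min(max(u, lower), upper)


def projection_flow(alpha=0.25, step=0.01, steps=3000):
    """Euler integration of dx/dt = -x + P(x - alpha * grad f(x)) for

        f(x) = (x1 - 2)^2 + (x2 + 1)^2  on the box 0 <= x1, x2 <= 1.
    """
    target = (2.0, -1.0)
    x = [0.5, 0.5]
    for _ in range(steps):
        grad = [2.0 * (xi - ti) for xi, ti in zip(x, target)]
        proj = [clip(xi - alpha * gi) for xi, gi in zip(x, grad)]
        x = [xi + step * (pr - xi) for xi, pr in zip(x, proj)]
    return x


x_box = projection_flow()
```

No multipliers appear in the state, since the bound constraints are enforced entirely by the projection. This illustrates how such designs can keep the number of neurons equal to the number of decision variables.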

Once a neurodynamic equation has been derived 
and its stability is proven, the next step is to deter- 
mine the architecture of the neural network in terms of 
the neurons and connections based on the derived neu- 
rodynamic equation. The last step is usually devoted 
to simulation or emulation to test the performance of 
the neural network numerically or physically. The sim- 
ulation/emulation results may reveal additional prop- 
erties or characteristics for further analysis or model 
redesign. 


33.4.3 Selected Applications 


Over the last few decades, neurodynamic optimization 
has been widely applied in many fields of science, engi- 
neering, and commerce, as highlighted in the following 
selected nine areas. 


Scientific Computing 
Neurodynamic optimization models have been developed
for solving linear equations and inequalities and
computing inverse or pseudoinverse matrices [33.240, 
264-268]. 


Network Routing 
Neurodynamic optimization models have been devel- 
oped or applied for shortest-path routing in networks 
modeled by using weighted directed graphs [33.258, 
269-271]. 


Machine Learning 
Neurodynamic optimization has been applied for sup- 
port vector machine learning to take the advantages of 
its parallel computational power [33.272-274]. 


Data Processing 
The data processing applications of neurodynamic 
optimization include, but are not limited to,
sorting [33.275-277], winners-take-all selection [33.240,
277, 278], data fusion [33.279], and data
reconciliation [33.254].


Signal/Image Processing 
The applications of neurodynamic optimization for sig- 
nal and image processing include, but are not limited 
to, recursive least-squares adaptive filtering, overcom- 
plete signal representations, time delay estimation, and 
image restoration and reconstruction [33.191, 203, 204, 
280-283]. 


Communication Systems 
The telecommunication applications of neurodynamic 
optimization include beamforming [33.284, 285] and
simulations of DS-CDMA mobile communication sys- 
tems [33.229]. 


Control Systems 
Intelligent control applications of neurodynamic op- 
timization include pole assignment for synthesizing 
linear control systems [33.286—289] and model predic- 
tive control for linear/nonlinear systems [33.290—292]. 


Robotic Systems 
The applications of neurodynamic optimization in in- 
telligent robotic systems include real-time motion plan- 
ning and control of kinematically redundant robot ma- 
nipulators with torque minimization or obstacle avoid- 
ance [33.259-263, 267, 293-298] and grasping force 
optimization for multifingered robotic hands [33.299]. 


Financial Engineering 
Recently, neurodynamic optimization was also applied 
for real-time portfolio selection based on an equivalent 
probability measure to optimize the asset distribution in 
financial investments [33.255, 300].


33.4.4 Concluding Remarks 


Neurodynamic optimization provides a parallel distributed
computational model for solving many optimization problems. For
convex and convex-like optimization, neurodynamic optimization
models are available with guaranteed optimality, expanded
applicability, improved convergence properties, and reduced
model complexity. Neurodynamic optimization approaches have been
demonstrated to be effective and efficient for many
applications, especially those with real-time solution
requirements.

The existing results can still be further improved 
to expand their solvability scope, increase their con- 
vergence rate, or reduce their model complexity. With 
the view that neurodynamic approaches to global op- 
timization and discrete optimization are much more 
interesting and challenging, it is necessary to develop
neurodynamic models for nonconvex optimization and 
combinatorial optimization. In addition, neurodynamic 
optimization approaches could be more widely ap- 
plied for many other application areas in conjunction 
with conventional and evolutionary optimization ap- 
proaches. 
34.1 Anatomy and Physiology of the Nervous System


The animal brain is undoubtedly a unique organ that
has evolved from humble beginnings starting with small
groups of specialized cells in organisms long ago. The
vast complexity found within the cortices of the
mammalian brain is the result of selection across countless
generations. The purpose of these early cells is in prin- 
ciple consistent with the function of our entire brain. 
Both serve to assess environmental variables in or- 
der to produce output that is situationally relevant. 
This production of appropriate behavioral responses 
is essential for an organism to successfully obtain re- 
sources in complex and often hostile surroundings. The 
morphology of the brain is species dependent. Its struc- 
ture is functionally correlated to the necessary output 
a specific animal requires to survive in a particular en- 
vironment. For example, the commonly used laboratory 
mouse has a brain structure that is coarsely similar to
that of a human. However, the mouse possesses particularly
enlarged olfactory bulbs situated in the front of the 
skull. This is in great contrast to the human olfactory 
centers that are considerably smaller, but perform the 
same function. This dramatic difference in the size of 
the olfactory bulb relates to differences in environmen- 
tal variables that exist between lab mice and humans. 
In contrast to humans, mice live in environments where 
scent is a highway of information. Odorant molecules 
can provide crucial signals regarding changes to the en- 
vironment that indicate such things as the approach of 
a predator or the presence of food. In contrast, humans 
have evolved a diverse set of methods for gathering 
food and avoiding predators that are highly depen- 
dent upon visual stimulation. As a consequence of this 
developmental variable, our brains have evolved to ef- 
ficiently process visual stimuli with incredible speed 
and acuity [34.1]. This paradigm of form fits function 
exists throughout nature and allows experimentalists 
to take advantage of shared anatomical and physio- 
logical characteristics. By adjusting for differences in 
evolutionary history, we can confidently perform ex- 
periments on neurons from other animals. This data 
can then be extended and translated to gather informa- 
tion about the properties of our own nervous system. 
From this point on, unless otherwise specified, any
reference to the nervous system refers to that of higher
mammal species, including rodents and primates such as
humans.


34.1.1 Introduction to the Anatomy 
of the Nervous System 


Cells of the nervous system arise from ectodermal 
embryonic tissue and generally develop into two dis- 
tinct groups: the cells of the peripheral nervous system 
(PNS) and those of the central nervous system (CNS). 


The brain and the spinal cord together constitute the 
CNS, with the nerves of the body (peripheral nerves) 
and autonomic ganglia forming the PNS. The PNS
refers to neurons and sensory organs located outside
the blood-brain barrier (BBB) created by the meninges,
which is a three-layer, dynamic, protective system 
isolating the CNS from the circulatory system. Addi- 
tionally, the primary immune system does not extend 
into the brain, leaving it particularly vulnerable. Ex- 
ploration of the PNS has formed the foundation of 
neuroscience research because these neurons tend to 
be large and easy to locate and remove for experi- 
mental examination. While many of the foundational 
principles of neuroscience were discovered within the 
PNS, the CNS has been the primary target of most 
recent neuroscience research. This is primarily due 
to the emergence of the frontal cortices as the sub- 
strate for conscious thought and action. The CNS 
contains many sub-regions that can be broadly sep- 
arated into the spinal cord, the brainstem, cerebellar 
cortex, and cerebral cortex. Increasing complexity can 
be observed moving from spinal cord to frontal cortex, 
which demonstrates that the most forward structures 
are the most recently evolved. The recently evolved 
frontal cortex structures are of high interest to neu- 
roscience researchers and are the subject of many 
computational investigations attempting to elicit their 
function [34.2]. 

The human brain is divided into four major re- 
gions (Fig. 34.1): the cortex, which includes the four 
lobes of the brain, the midbrain, the brainstem, and 
the cerebellum. The brainstem, midbrain, and cerebellum,
also known as subcortical regions, were the first
to evolve and play various roles in the regulation of
basic physiological function and relay of information 
to the cortex. The brainstem continues caudally as the 
spinal cord and contains numerous nuclei for the pro- 
cessing of information generated by spinal neurons. 
The midbrain is of particular importance concerning 
the integration and transmission of information from 
the spinal cord and brainstem to the cortex. The tha- 
lamus, part of the midbrain, is often called the gateway 
to the cerebral cortex as it is located centrally and re- 
tains projections to all parts of the cortex, and thus 
plays a pivotal role in the transmission of subcortical 
information to various association areas of the cor- 
tex. The cortex is divided into four lobes including 
frontal, parietal, temporal, and occipital. Each lobe con- 
tains areas of specialized function as well as areas 
of association. Incoming sensory information requires
context, hence information first travels to the areas of
specialized function and subsequently traverses vari- 
ous association areas for integration before reaching 
its target of motor output [34.3]. An example of this 
flow of information within the cortex is depicted in 
Fig. 34.1, which details the progression of information 
from the primary auditory cortex for deciphering sound 
to Broca’s area, which is involved in speech production. 
Figure 34.1 also demonstrates the general anatomy of 
the human brain, including the brainstem, cerebellum, 
and cortex. 

Reviewed simply, the flow of information through- 
out this comprehensive system begins in sensory cells 
and ends in motor output via peripheral nerves. Sensory 
cells, such as mechanoreceptors for tactile sensation, 
transduce and forward signals to the spinal cord for first 
level processing. Sensory fibers run specifically along 
the dorsal surface of the spinal cord and ascend the en- 
tirety of the cord. Some synapse locally on the spinal 
cord and others project fully to the cortical surface of 
the brain. The information from sensory cells is first 
received in the proximal cortex, associations between 
stimuli are made, and this information is again relayed 
to the front of the brain where it is used to form a plan 
of action based on the specific input pattern. The front 
of the brain then begins a processing cascade that flows 
back toward the central sulcus of the brain and output 
motor neurons are triggered. These output signals descend
the brain as a large bundle that goes on to form
the ventral surface of the spinal cord, where they either
end on a local spinal neuron or are routed distally
to a muscle. For a more complete discussion of neu- 
ral anatomy and information processing, please refer to 
Kandel et al. [34.1]. 


Fig. 34.1 A cartoon representation of the human brain
with the major lobes colored according to the key
(frontal lobe, parietal lobe, temporal lobe, occipital
lobe, cerebellum, pons, and medulla oblongata). Labeled
areas: primary auditory cortex, Wernicke's area, primary
visual cortex, and angular gyrus. The blue arrow (the
arcuate fasciculus) indicates the flow of information
from secondary association cortices to Broca's area. It
highlights, as an example, a specific pathway that
heavily contributes to the output of language by
bringing together auditory and visual information in the
parietal lobe. This information, which contains
associations between stimuli, is then forwarded to the
frontal cortex, where it is integrated with the rest of
the body's sensory information and guides the output of
speech sounds


34.1.2 Sensation — Environmental Signal 
Detection 


Transduction of physical environmental stimuli into 
electrical and chemical signals is the common func- 
tion shared by all sensory systems. This transduction 
provides baseline input to the brain from an array of 
sensors placed throughout the body in the skin, eyes, 
ears, mouth, and nose. These highly specialized cells 
respond only to the application of very precise ex- 
ternal stimuli. While each receptor cell is specialized 
for a specific signal, there is much variation within 
each sensory system, allowing the most pertinent in- 
formation to be extracted from the environment. An 
example of this variation exists within the eye, where 
there are two primary detector cell types: rods and 
cones [34.4]. The former are involved in the sensation
of light and dark and the latter in the perception of
colors. Even within the collection of cone cells there are
further specializations that allow for the detection of
different wavelengths of light, most specifically red, blue,
and green. A similar pattern is observed in cells of the
inner ear, where hair cells are housed in osseous cavities 
lined with membrane. This makes these cells special- 
ized for the detection of vibrations in the air occurring 
at different frequencies. Sensory specialization is fur- 
ther mapped onto the brain where distinct regions of 
the cortex correspond to specific environmental stim- 
uli or motor output patterns. Sensation is a vital part 
of cognition as the representation of the world we each 
possess is built upon our own unique sensory expe- 
riences. This literally means that we have shaped the 
surfaces of our cortices based on our experiences as 
individuals.


34.1.3 Associations — The Foundation
of Cognition


The brain is an integrative structure capable of trans- 
forming sensory stimuli while forming associations 
between stimuli based on temporal and local param- 
eters. Sensory integration allows organisms to relate 
pertinent information about environmental variables in 
real time based on successful behavioral patterns of the 
past. The mechanism driving these associative prop- 
erties of the brain has been the subject of countless 
scientific endeavors yielding a basic understanding. Al- 
though progress has been made in understanding how 
the nervous system adapts to the environment, there 
are countless questions that emerge as new discoveries 
are made that require a constant revision of param- 
eters. Kandel [34.5] was among the first to develop 
experiments capable of demonstrating the molecular 
and cellular mechanisms underlying learning within
a biological system. 

Developing associations across a variety of sensory
pathways is a major constituent of learning within animals.
These associations are formed by altering cellular
physiology based on input experience for a certain neuron
or population of connected neurons. This cellular
learning is the foundation for consciousness and there
are many processes at the cellular level that contribute
to learning in different ways. Some association mechanisms
directly alter the number of synapses between
neurons based on a history of communication between
the cells, whereas other processes affect the cell's
DNA to accommodate a certain input pattern [34.6]. Not all
of the mechanisms that influence learning in the brain
have been defined, which leaves a large opportunity
for conjecture as to what constitutes learning and
what does not. Modeling maintains a distinct advantage,
as models of cellular communication and learning
can be prototyped in silico to account for the large number
of modifications occurring over time. The output
of these models can then logically guide our search
for learning mechanisms within the brain by outlining
a possible path where learning mechanisms could
be discovered. While we understand that associations
between neurons and glial cells likely form the basis
of cognition, we have not yet been able to recreate this
phenomenon to completely explain its nature. This is
partly due to the fact that the mammalian brain is so
large and contains so many networks that detection and
characterization of all changes occurring within the system
simultaneously is extremely difficult. To resolve
this issue, many neuroscientists have turned to using
model organisms that provide a reduced set of neurons
with which to experiment and demonstrate fundamental
theories.


34.2 Cells and Signaling Among Cells


The brain is composed of two major cell types: neurons
and glial cells. Both of these cell types are essential for
the brain to function properly. Standard models place
electrically excitable neurons in a signaling role with
glial cells serving an indispensable support role,
although emerging evidence suggests glial cells could be
more heavily involved with signaling than previously
thought. The majority of research in computational
neuroscience focuses on modeling neurons and their role
in generating conscious behavior. This is largely due
to the fact that their maintenance of a potential difference
across the membrane serves as an efficient and
robust communication pathway that can be modeled using
established electrical dynamics. Unlike most other cells
of the body, the neuron is a non-dividing (post-mitotic) cell,
which means it does not continuously undergo cellular
mitosis or meiosis. This simple difference in the life
cycle of these cells affords them an indispensable role
in the animal nervous system: memory. Each cell in
the body has an innate type of memory that begins by
receiving messages from its environment at the membrane.
These signals can sometimes propagate into the
nucleus where alterations in DNA can occur, which will
ultimately affect the function of the cell. Neurons have
this type of DNA memory but also form connections
with their neighboring cells or even cells that are located
in other brain regions [34.7]. Because neurons do
not divide, these connections are not reset and can last
for periods of time that are much longer than the life cycle
of a typical cell. Connections between neurons are
called synapses and form the fundamental communication
element in the nervous system.


34.2.1 Neurons — Electrically Excitable Cells


Neurons are composed of a cell body referred to as
the soma, which contains membrane-bound organelles
that are found in most cells, in addition to one or more
protoplasmic projections: an axon and dendrites. The
soma contains the nucleus and is the central portion of 
the neuron. Among neurons, the most common path- 
way for signal transmission is from the axon terminal 
of one neuron to another’s dendrites, which then re- 
lays the signal to the neuron’s soma and on to the 
axon to be transmitted to another neuron. In particular, 
dendrites receive and conduct electrochemical signals 
from other neurons to the soma and play an integral 
role in determining the extent to which action poten- 
tials are generated. Dendrites are composed of many 
branches called dendritic trees and can create exten- 
sive and unique branched networks between neurons. 
The axon is the anatomical structure through which 
action potentials are transmitted away from the soma 
to other neurons or other types of cells such as mus- 
cles, ganglia, or glands [34.8]. Axons vary in length 
tremendously and bundle together to form large periph- 
eral nerves that course from the spinal cord to the toe. 
They vary in composition depending upon their loca- 
tion in the PNS or the CNS and may be myelinated 
or unmyelinated. Myelin is a fatty, dielectric insulating
layer that speeds signal conduction along the axon
by forming discrete regions of high membrane resistance,
low capacitance, and high conduction velocity. In between
the discrete myelinated portions of axon, there are nodes
of Ranvier that regenerate the signal along the next
segment of axon. Axons normally maintain an equal radius
throughout their course
and terminate at a synapse, where the electrochemical 
signal will be transmitted from the neuron to the target 
cell, which may be another neuron or another type of 
cell. A synapse is formed by the end(s) of one neuron’s 
axon, called the axon terminal and the dendrites, axon, 
or soma of the receiving neuron. The synapse is fun- 
damentally a transducer that converts electrical signals 
from a membrane potential wave to a chemical signal 
that modifies the state of a downstream cell [34.9]. 
Neurons are classified by the branching pattern and 
location of their dendrites and axons, their physiolog- 
ical function and location within the nervous system. 
Structural and functional classification, and the type of 
neurotransmitter released are relevant to modeling and 
will be reviewed. Structurally, neurons are classified 
as unipolar, pseudo-unipolar, bipolar, or multi-polar. 
Unipolar neurons contain one protoplasmic projection 
that divides distally into sensory and transmitting por- 
tions of an axon. Bipolar neurons retain two projections 
from the soma, one from which dendrites extend and 
the other from which the axon extends. The major- 
ity of neurons are multipolar neurons, which normally 
contain one long axon and many dendritic projections. 


Fig. 34.2 (a) Fluorescently labeled hippocampal slice
showing neuronal cell bodies stained in blue with neuronal
nuclei antibody (NeuN). The GFP (green fluorescent
protein) and RFP (red fluorescent protein) labeled neurons
extending their processes are of dentate granule cell
origin and have matured over the course of the experiment,
as this slice was imaged at postnatal day 53. (b) Here
we see a neuron that has had its cell body stained with
a red fluorescent marker and presynaptic glutamate
receptors indicated in green. The sheer volume of presynaptic
targets can be seen here along with two segments that have
been selected and magnified to give a better view of how
synapses cover the dendrite surface


Figure 34.2a exemplifies bipolar, hippocampal neurons
and Fig. 34.2b demonstrates the anatomy of a cortical,
glutamatergic neuron. Functionally, neurons can be
classified according to their electrophysiology and be
described as tonic, phasic, or fast-spiking. However, it
is more effective to describe the different firing patterns
experienced by neurons, as most neurons exhibit
variable firing patterns. Tonic firing involves continuous
responses to stimuli and recurrent generation of
action potentials. Tonic firing patterns are observed in
large excitatory neurons during basal levels of activity 
to provide constant communication between elements 
of the network. Phasic firing patterns consist of bursts 
of action potentials, often in quick succession, that have
dramatic downstream effects. Bursting and phasic firing 
are highly studied phenomena that often signal a shift 
in steady state activity levels in the circuits where they 
are observed. Along with the branching patterns of den- 
drites and axons, neurons may be classified by the type 
of neurotransmitter released at synapses. The two dom- 
inant neurotransmitters in the brain are glutamate and 
gamma-aminobutyric acid (GABA), which generally 
mediate excitatory and inhibitory neurotransmission re- 
spectively [34.10]. 


34.2.2 Glial Cells - Supporting Neural 
Networks 


Glial cells are fundamentally different from neurons in
that they do not form synapses with other cells and
generally do not generate action potentials. Although
they are not directly involved in signaling between
neurons through synaptic means, these cells play
a large role in the maintenance of synapses as well as 
signal integrity. There are four major types of glial cells 
in the brain: oligodendrocytes, astrocytes, ependymal 
cells, and micro-glial cells [34.11]. Oligodendrocytes 
secrete myelin, the insulating dielectric material cov- 
ering axons, which facilitates signal transmission over 
longer distances. Ependymal cells line the ventricles of 
the brain and secrete the cerebrospinal fluid that bathes
the brain and provides a route for expulsion of waste and 
the intake of nutrients. Micro-glial cells act as a type of 
immune system for the brain by digesting dead cells and 
collecting material that should not be present or could be 
damaging to cells of the brain. Astrocytes are abundant,
star-shaped cells that generally surround neurons,
provide nutrition and oxygen, and remove, via the
cerebrospinal fluid, waste that could be toxic to a
neuron if left to accumulate. This astrocyte-driven
waste-removal
system is essential for normal physiological function 
and structurally resembles the lymphatic system found 
in the rest of the body’s tissues. Remarkably, new ev- 
idence suggests astrocytes may actually participate in 
synaptic communication, and that communication may 
be bi-directional [34.12]. Communication between as- 
trocytes and neurons has yet to be fully characterized 
and provides an opportune target for new ventures into
neural network modeling. This new evidence implicating
astrocytes will shift our understanding of the


brain as a communications structure and will open new 
questions that can be addressed using computational 
models. The current models of non-linear neural sys- 
tems are already complex and must be altered to account 
for the evidence present in this new paradigm. 


34.2.3 Transduction Proteins — 
Cellular Signaling Mechanisms 


All cells of the body contain functional proteins em- 
bedded in their lipid bi-layer that are responsible for 
transducing environmental signals across the cellular 
membrane. The incredible variety of proteins present 
in the nervous system serves as mediators of cellu- 
lar communication. Each protein will have its own 
unique structure depending on its function, and there 
are large groups of proteins that share common properties,
such as the G-protein-coupled receptors (GPCRs),
ligand-gated ion channels, passive ion channels, and
a plethora of others. All of these will not be defined 
here as the classification and physiology of membrane 
proteins can be considered the subject of a whole field 
and are reviewed in detail by Grillner in [34.12]. The 
most important protein varieties for our consideration 
are those that control the flow of ions across the lipid 
bi-layer, which can be achieved in a number of ways. 
Some directly pass ions through a small pore in their 
center that is selective to certain ion species based on 
their electron distribution. Others use stored chemical
energy to move some ions out of the cell while bringing
others in, establishing a gradient by pushing ions
against their electrochemical gradient.


34.2.4 Electrochemical 
Potential Difference - 
Signaling Medium 


Neurons carry information to their targets in the form 
of fluctuations of their membrane potential. Changes in 
the value of the membrane potential can trigger a vari- 
ety of cellular signals including the opening of voltage 
gated proteins or the release of chemical neurotransmit- 
ters at a synapse [34.13]. As mentioned earlier, neurons 
generate their membrane potential by selectively trans- 
porting ions across the cellular membrane using energy, 
normally in the form of adenosine triphosphate (ATP). 
With the right combination of proteins neurons are able 
to shift their membrane potential in response to com- 
munication from other cells in a discrete fashion. This 
discrete wave along the membrane is known as an action
potential and is only generated by a neuron once
a certain level of activation has been attained. A sin- 
gle neuron in a network must receive communication 
from other cells, normally at its dendrites, which will 
cause the accumulation of positive ions inside the cell. 
Once the positive ions have accumulated to a certain 
critical level the cell will generate an action potential 
that flows down the axon to the cell's targets. This action
potential is the fundamental signaling unit within
the nervous system and triggers the release of neuro- 
transmitters when it arrives at a synapse. Once an action 
potential has been generated there is a period where the 
cell cannot create another wave, and this time is known 
as the absolute refractory period. 
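
The threshold and refractory behavior described above can be illustrated with a leaky integrate-and-fire sketch. This is a deliberately simplified stand-in and not a model taken from this chapter: the function name `simulate_lif` and every parameter value (time constant, input resistance, threshold, reset value, refractory period) are assumptions chosen only for illustration.

```python
def simulate_lif(input_current_nA, dt_ms=0.1, tau_ms=10.0, r_mohm=10.0,
                 u_rest=-65.0, u_thresh=-50.0, u_reset=-65.0,
                 t_refrac_ms=2.0):
    """Return spike times (ms) of a leaky integrate-and-fire cell.

    All parameter values are illustrative assumptions, not measured data.
    Potentials are in mV; MOhm * nA gives mV.
    """
    u = u_rest
    refrac_left = 0.0
    spikes = []
    for step, i_in in enumerate(input_current_nA):
        t = step * dt_ms
        if refrac_left > 0.0:
            # Absolute refractory period: the cell cannot fire again yet.
            refrac_left -= dt_ms
            u = u_reset
            continue
        # Input accumulates charge while the leak pulls u back toward rest.
        u += (-(u - u_rest) + r_mohm * i_in) / tau_ms * dt_ms
        if u >= u_thresh:
            # Critical level reached: emit an action potential and reset.
            spikes.append(t)
            u = u_reset
            refrac_left = t_refrac_ms
    return spikes


if __name__ == "__main__":
    strong = simulate_lif([2.0] * 1000)   # 100 ms of 2 nA input
    weak = simulate_lif([1.0] * 1000)     # 100 ms of 1 nA input
    print("spikes with strong input:", len(strong))
    print("spikes with weak input:", len(weak))
```

With these assumed values the weaker input settles below threshold and never triggers a spike, while the stronger input produces repeated spikes separated by at least the refractory period.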

Fig. 34.3 The lipid bi-layer is composed of phosphate
groups on the extra- and intracellular faces of the membrane
with a variety of lipids attached that essentially will
self-assemble when dissolved in water at the appropriate
concentration. The extracellular surface has a net positive
charge relative to the cytoplasmic side, which is relatively
negative. The membrane acts as a semi-permeable barrier
that prevents the passage of molecules based on their
attraction to water. Here we see that potassium ions have
accumulated within the cell along with organic anions, while
sodium and chloride are found outside the cell, due to
the action of specific transporters embedded in the membrane.
The asymmetric distribution of charge across the
membrane causes a potential difference to exist in terms
of both the electrical potential of the ions and their chemical
tendency to diffuse down their concentration gradient

The potential energy across the membrane is a complex
value that results from the transport of ions against
their concentration gradient (Fig. 34.3). If these ions
were left to freely diffuse, the membrane potential
would eventually deteriorate as each ion moved toward
its own value of equilibrium potential. This equilibrium
potential, also known as the reversal potential or the
Nernst potential, can be calculated using the following
relationship for any ionic species x

$$E_x = \frac{RT}{zF} \ln \frac{[X]_o}{[X]_i} \qquad (34.1)$$



In the above equation, known as the Nernst equation,
E_x represents the equilibrium potential for a certain ion
species, with [X]_o and [X]_i representing the external and
internal concentrations of the ion, respectively [34.14].
Additionally, R is the universal gas constant, T is the
temperature in kelvin, z is the valence (signed charge
number) of the ion, and F is the Faraday constant; this
prefactor multiplies the natural logarithm of the
concentration ratio. At the reversal potential for an ionic
species there is a net force of zero on the ions in the
system and no net flux of that species crosses the
membrane. From this relationship we can see that the
energy driving the fluctuations in membrane potential
originates from both electrical and chemical sources.
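
As a worked illustration of the Nernst equation, the following sketch evaluates the equilibrium potential for the concentrations listed in Table 34.1. The function name `nernst_potential_mv` and the temperature of 293.15 K (20 °C) are assumptions made for this example; only the concentration ratio enters, so the concentration unit does not matter.

```python
import math

R = 8.314462618   # universal gas constant, J/(mol*K)
F = 96485.33212   # Faraday constant, C/mol


def nernst_potential_mv(z, conc_out, conc_in, temp_kelvin=293.15):
    """Equilibrium potential (mV) of an ion with signed valence z.

    temp_kelvin defaults to an assumed 20 degrees C.
    """
    return 1000.0 * (R * temp_kelvin) / (z * F) * math.log(conc_out / conc_in)


if __name__ == "__main__":
    # (valence, external concentration, internal concentration), Table 34.1
    ions = {
        "K+": (+1, 20.0, 400.0),
        "Na+": (+1, 440.0, 50.0),
        "Cl-": (-1, 560.0, 52.0),
    }
    for name, (z, c_out, c_in) in ions.items():
        print(f"E_{name} = {nernst_potential_mv(z, c_out, c_in):.1f} mV")
```

Under the assumed temperature the results fall within about 1 mV of the equilibrium potentials tabulated in Table 34.1.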

While the Nernst equation is capable of determining 
the reversal potential for a single ionic species, physi- 
ological systems often have many additional ions that 
participate in cellular signaling. The Nernst equation
was expanded to form the Goldman-Hodgkin-Katz
equation, which yields the membrane potential in a resting
system composed of multiple ionic species, as shown
below

$$u = \frac{RT}{F} \ln \frac{P_{K^+}[K^+]_o + P_{Na^+}[Na^+]_o + P_{Cl^-}[Cl^-]_i}{P_{K^+}[K^+]_i + P_{Na^+}[Na^+]_i + P_{Cl^-}[Cl^-]_o} \qquad (34.2)$$


Table 34.1 Here we see the concentration distribution of
different salt species as they are found within a voltage
clamp experiment using a segment of squid giant axon.
These values will be different in each experimental
preparation and should be considered carefully, as small
variations can have large implications for the firing patterns
observed. Take note that here there are no positive anions
present as the center of the squid giant axon is devoid of
organelles, unlike the inside of a mammalian neuron of the
central nervous system, which would likely contain many

Ion             Cytoplasmic          Extracellular        Equilibrium
                concentration (mM)   concentration (mM)   potential (mV)

K+              400                  20                   -75
Na+             50                   440                  +55
Cl-             52                   560                  -60
Organic anions  385                  None                 None

Here u is the resting membrane potential of the cellular
system under consideration and the constants are the 
same as shown above in the Nernst equation [34.15]. 
However, instead of relying solely upon the concentration
of the particular ion inside and outside the cell,
here the membrane's permeability to each ion, which
reflects the number of open ion channels and is
represented by the variable P, is also taken into account.
P is expressed in the unit cm/s. This equation is used to
calculate the membrane potential of a particular physiological
system with multiple internal and external salt
species such as potassium, sodium, and chloride. Notice
that the internal and external chloride concentrations
appear in inverted positions relative to those of the
cations because of chloride's negative valence.
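
A minimal sketch of the Goldman-Hodgkin-Katz calculation is given below, using the concentrations from Table 34.1. The relative permeabilities (P_K : P_Na : P_Cl = 1 : 0.04 : 0.45) are a commonly quoted resting ratio and are an assumption here, as are the function name `ghk_potential_mv` and the temperature of 293.15 K. Only permeability ratios matter, because any common factor cancels between numerator and denominator.

```python
import math

R = 8.314462618   # universal gas constant, J/(mol*K)
F = 96485.33212   # Faraday constant, C/mol


def ghk_potential_mv(perm, conc_out, conc_in, temp_kelvin=293.15):
    """Resting potential (mV) for K+, Na+ and Cl- following (34.2).

    Chloride concentrations swap sides because of its negative valence.
    """
    numerator = (perm["K"] * conc_out["K"] + perm["Na"] * conc_out["Na"]
                 + perm["Cl"] * conc_in["Cl"])
    denominator = (perm["K"] * conc_in["K"] + perm["Na"] * conc_in["Na"]
                   + perm["Cl"] * conc_out["Cl"])
    return 1000.0 * R * temp_kelvin / F * math.log(numerator / denominator)


# Concentrations from Table 34.1
CONC_IN = {"K": 400.0, "Na": 50.0, "Cl": 52.0}
CONC_OUT = {"K": 20.0, "Na": 440.0, "Cl": 560.0}
# Assumed relative permeabilities at rest (illustrative, not from the text)
PERM_REST = {"K": 1.0, "Na": 0.04, "Cl": 0.45}

if __name__ == "__main__":
    u_rest = ghk_potential_mv(PERM_REST, CONC_OUT, CONC_IN)
    print(f"resting potential: {u_rest:.1f} mV")
    # With potassium as the only permeant ion, (34.2) reduces to (34.1).
    k_only = ghk_potential_mv({"K": 1.0, "Na": 0.0, "Cl": 0.0},
                              CONC_OUT, CONC_IN)
    print(f"potassium only: {k_only:.1f} mV")
```

The second call shows the limiting case: when only one ion is permeant, the result equals that ion's Nernst potential.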

The simplest description for the maintenance of 
the membrane potential derives from the asymmetric 
distribution of ions across the cell’s semi-permeable 
membrane. The concentrations displayed above were
the first calculated for any neuron and were measured
from the giant axon of a squid. By carefully measuring
the potential difference between the inside of the
axon and the external solution Hodgkin and Huxley 
were able to determine the amount of current flow- 
ing across the axon’s membrane under varying con- 
ditions [34.16]. The potential difference between the 
inside and outside of the cell is generated by manip- 
ulating the concentration of each ion with respect to 
its charge. Through careful observation of Table 34.1 
above one can note that this resting condition will lead 
to a state where the cell is relatively negative on the 
inner membrane and positive along the outer wall of 
the membrane. It is this relative potential that varies 
along the surface of the neural membrane and it is 
what is responsible for carrying information along the 
length of the cell. The concentrations are considered
bulk values and hardly ever deviate from those
listed unless a period of sustained firing has
occurred, during which the cell can deplete this potential
difference.


34.3 Modeling Biophysically Realistic Neurons 


34.3.1 Electrical Properties of Neurons 


When modeling a system one must first consider its 
physical dimensions and elements in order to con- 
struct a model that is true to reality and mathematically 
sound. By investigating the most fundamental structure 
of a neuron it is simple to see how this system can be 
easily related to that of an electronic capacitor. We have 
a system composed of two electrically conductive media,
the extracellular fluid and the cytoplasm, which
are separated by a dielectric layer, the phospholipid
bi-layer. Therefore, from this description of
a neuron we can generate the following relationship for 
the membrane potential 


$$u = \frac{Q}{C} , \qquad (34.3)$$

where u represents the potential across the membrane 
and it is equal to the quotient of the charge along the 
membrane surface Q and the capacitance of the mem- 
brane itself C [34.17]. From this relationship it is clear 
that the membrane potential relies on the species and 
number of charges distributed across the membrane sur- 
face as well as the lipid constituents of the bi-layer. 
The capacitance of the membrane in a neuron is generally
around 1 µF/cm², but this can fluctuate depending
on the local ionic and lipid composition. Four major
ionic currents are most often found in the cellular mem- 
brane and considered in a biologically realistic model 
(Fig. 34.4). 
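
To give a sense of scale for the relationship u = Q/C, the sketch below computes the total capacitance of a small cell and the charge separated across its membrane. The membrane area of 29 000 µm² matches the single compartment described for Fig. 34.5; the potential of -65 mV and the function name `membrane_charge` are assumptions for illustration.

```python
ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs


def membrane_charge(specific_cap_uF_per_cm2, area_um2, potential_mV):
    """Return (capacitance in F, charge in C) from Q = C * u."""
    area_cm2 = area_um2 * 1e-8                     # 1 um^2 = 1e-8 cm^2
    capacitance_F = specific_cap_uF_per_cm2 * 1e-6 * area_cm2
    charge_C = capacitance_F * potential_mV * 1e-3
    return capacitance_F, charge_C


if __name__ == "__main__":
    cap, charge = membrane_charge(1.0, 29000.0, -65.0)
    print(f"capacitance: {cap * 1e12:.0f} pF")
    print(f"separated charge: {charge:.3e} C")
    print(f"elementary charges: {abs(charge) / ELEMENTARY_CHARGE:.2e}")
```

Under these assumptions the separated charge corresponds to roughly 10^8 elementary charges.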

Fig. 34.4 A schematic equivalent circuit diagram of the
four major ionic currents most often found in the cellular
membrane. The resistors represent the varying conductivity
of membrane channels for each ion and the batteries
represent each ion's respective concentration gradient.
On the right we can see the membrane has an inherent
capacitive current that acts to slow the spread of membrane
currents and is manifested by the physical structure of the cell

Fig. 34.5 (a) The membrane potential measured at the middle
of a spherical single compartment model cell, denoted
by v(.5). The cell has Hodgkin-Huxley type current dynamics.
The y axis shows membrane potential at the center of
the cell, v(.5), and the x axis represents time in ms.
(b) The blue trace shows the total magnitude, in mA/cm², of the
outward potassium current within the Hodgkin-Huxley (HH)
model. The red trace shows the total magnitude of the
inward HH sodium current in the same units of mA/cm².
(c) The red trace is the sodium conductance term
$g_{Na^+} m^3 h$ as shown in (34.5). The blue trace shows the
potassium conductance term $g_{K^+} n^4$, also from (34.5).
(d) The state of each of the gating variables m, n, and h,
where the value 1 represents fully open and 0 represents
fully closed or inactivated channels. This simulation
was conducted within the NEURON simulation environment
with a single compartment of area 29 000 µm², initialized at
-50 mV, with physiological concentrations of calcium,
chloride, potassium, and sodium ions

Similarly to electrical capacitors, neurons are reliant
upon dynamic currents that flow through the membrane
into and out of the cell. One of the most important ions
is potassium (K+), which conducts current based on
the following relationship,
$i_K = (\gamma_K \times u) - (\gamma_K \times E_K) = \gamma_K \times (u - E_K)$,
where $\gamma_K$ is the ionic conductance, u is
the membrane potential, and $E_K$ is the ionic reversal
potential. The final factor on the right-hand side of the
equation, $(u - E_K)$, is known as the electromotive force and can
be calculated independently for each ionic current. This
relationship shows how a neuron with both membrane
potential and a potassium concentration gradient produces
net potassium current. This potassium current
can be generalized across the entire surface using the
relationship $g_K = N_K \times \gamma_K$, where $g_K$ is the
total ionic conductance, $N_K$ is the number of channels
open at rest, and $\gamma_K$ is the conductance of an
individual potassium channel. Determination of
individualized channel conductance is performed in a lab
setting using the patch clamp technique to isolate single
ion channels. Once isolated, these channels can be tested
using pharmacological techniques to determine their single
channel conductivity [34.18]. These experiments are very
sensitive and must be performed for each channel of interest
within the model, as the channels cannot be represented without
accurate biological data.
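
The two potassium relationships above can be checked numerically with a short sketch. The single-channel conductance of 20 pS, the count of 1000 open channels, and the membrane potential of -65 mV are hypothetical values chosen for illustration, and the function names are likewise not from the source; the reversal potential of -75 mV is the value listed in Table 34.1.

```python
def single_channel_current_pA(gamma_pS, u_mV, e_rev_mV):
    """i = gamma * (u - E); pS * mV = 1e-15 A = 1e-3 pA."""
    return gamma_pS * (u_mV - e_rev_mV) * 1e-3


def total_conductance_nS(n_open, gamma_pS):
    """g = N * gamma, converted from pS to nS."""
    return n_open * gamma_pS * 1e-3


if __name__ == "__main__":
    gamma = 20.0   # pS, hypothetical single-channel conductance
    n_open = 1000  # hypothetical number of channels open at rest
    i_k = single_channel_current_pA(gamma, u_mV=-65.0, e_rev_mV=-75.0)
    g_k = total_conductance_nS(n_open, gamma)
    print(f"single-channel current: {i_k:.2f} pA (positive = outward)")
    print(f"total conductance: {g_k:.1f} nS")
    print(f"total current: {n_open * i_k:.0f} pA")
```

Because the assumed membrane potential sits above the potassium reversal potential, the driving force is positive and the current is outward.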

Describing a neuron as a capacitor yields interest- 
ing properties that we can infer about neurons from 
the large established set of knowledge regarding capac- 
itance. The innate capacitive nature of the membrane 
actually affects the passage of current and, therefore, is 
relevant to our discussion here. When current is injected
into our capacitive system, the resulting change in
potential is inherently slowed, based upon the time course
of the current injection and the capacitance of the
system, as represented by the relation
$\Delta u = I \, \Delta t / C$, where I is the injected
current and $\Delta t$ is its duration. Therefore, the
magnitude of the change in potential across the capacitor
is proportional to the duration of the current, presenting
a natural latency in signal transmission that must be
accounted for when modeling.
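
The dependence of the voltage change on the duration of the injected current can be made concrete with a brief sketch. It treats the membrane as a pure capacitor with no leak, which is a simplifying assumption; the function name `delta_u_mV`, the 100 pA current, and the 290 pF capacitance (1 µF/cm² over 29 000 µm²) are illustrative choices.

```python
def delta_u_mV(current_pA, duration_ms, capacitance_pF):
    """Voltage change of an ideal capacitor: delta_u = I * delta_t / C.

    pA * ms / pF = (1e-12 A * 1e-3 s) / 1e-12 F = 1e-3 V, i.e. millivolts.
    """
    return current_pA * duration_ms / capacitance_pF


if __name__ == "__main__":
    for duration in (0.5, 1.0, 2.0, 4.0):
        change = delta_u_mV(100.0, duration, 290.0)
        print(f"{duration:4.1f} ms of 100 pA -> {change:.3f} mV")
```

Doubling the duration of the pulse doubles the voltage change, which is the latency effect described above.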


34.3.2 Early Empirical Models — 
Hodgkin and Huxley 


Investigation into the structure of the brain began with 
improvements to the light microscope in the early
twentieth century. As characterization of cellular types
progressed rapidly, understanding of cellular physiology
lagged quite far behind. This is largely due to the fact
that the technology to manipulate individual cells had yet
to be invented. To go around this problem Hodgkin and
Huxley used an axon harvested from a giant squid as their
experimental system. This allowed them to easily observe
the behavior of the experimental preparation while
functionally examining changes that occurred at the
microscopic level. In general, this model also describes
the segment of neuron as a capacitor with x ionic currents
and applied current as shown here

$$ C \frac{du}{dt} = -\sum_x I_x(t) + I(t) . $$ (34.4)

This model uses three ionic current components including
potassium, sodium, and a non-specific leak current that
are described below

$$ \sum_x I_x = g_{Na} m^3 h \, (u - E_{Na}) + g_K n^4 \, (u - E_K) + g_L \, (u - E_L) . $$ (34.5)

Each of the terms above represents a specific ion with
a gating variable that has been experimentally fitted
to display the characteristics of each specific channel.
A simulation of these parameters in a model cell
expressing the standard Hodgkin-Huxley type channels
is displayed in Fig. 34.5. In this case, the sodium current
is determined using two different gating variables
because it has two distinct phases, activation and
inactivation. During the inactivation phase the conduction
pore of the channel is blocked by a string of intracellular
positive amino acids that literally plug the channel
closed. This is distinct from potassium or leak channels
that do not have an inactivation mechanism built into
the protein. The differential form of each gating variable
for the Hodgkin-Huxley model is shown below
(Tables 34.2 and 34.3) in terms of the experimentally
determined parameters $\alpha$ and $\beta$

$$ \frac{dm}{dt} = \alpha_m(u) \, (1 - m) - \beta_m(u) \, m , $$
$$ \frac{dn}{dt} = \alpha_n(u) \, (1 - n) - \beta_n(u) \, n , $$
$$ \frac{dh}{dt} = \alpha_h(u) \, (1 - h) - \beta_h(u) \, h . $$ (34.6)

These values were obtained during initial laboratory
experimentation by Hodgkin and Huxley using the giant
squid axon in vitro. These results have been adjusted
to set the resting membrane potential at a value of 0 mV
instead of the typical −65 mV as is seen in many modern
interpretations.

This model has been expanded and interpreted for
many systems outside the giant squid axon, and a general
form for determining gating variables is shown here

$$ \frac{d\theta}{dt} = -\frac{1}{\tau_\theta(u)} \left[ \theta - \theta_0(u) \right] . $$ (34.7)

In this differential form, $\theta$ represents a particular gating
variable of interest. When the membrane voltage $u$ is
fixed at a certain value then $\theta$ approaches $\theta_0(u)$ with
a time constant represented by $\tau_\theta(u)$. These values can
be calculated using the transformation equations shown
below

$$ \theta_0(u) = \frac{\alpha_\theta(u)}{\alpha_\theta(u) + \beta_\theta(u)} , $$
$$ \tau_\theta(u) = \frac{1}{\alpha_\theta(u) + \beta_\theta(u)} . $$ (34.8)

Table 34.2 Parameters associated with dynamics of ions
within biological membranes, specifically of the giant
squid. The first column shows the gating variables
associated with each ionic species. The center column is the
equilibrium potential that can be found for each ion within
the system and is shown in millivolts. On the right the
overall conductance of each ionic species within the model of
action potential generation and membrane potential
maintenance is shown

    Ion species and gating variables ($\theta$)   Reversal potential $E_\theta$ (mV)   Conductance $g_\theta$ (mS/cm²)
    Na+ ($\theta = m, h$)                          115                                  120
    K+ ($\theta = n$)                              −12                                  36
    L (no gating)                                  10.6                                 0.3

Table 34.3 Gating variables with specific relationships
that describe the particular function of each of the gating
variables

    Gating variable   $\alpha_\theta(u)$                          $\beta_\theta(u)$
    n                 $(0.1 - 0.01u) / [e^{1 - 0.1u} - 1]$        $0.125 \, e^{-u/80}$
    m                 $(2.5 - 0.1u) / [e^{2.5 - 0.1u} - 1]$       $4 \, e^{-u/18}$
    h                 $0.07 \, e^{-u/20}$                         $1 / [e^{3 - 0.1u} + 1]$
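
As a rough illustration of how (34.4)-(34.6) can be integrated numerically, the following Python sketch applies forward Euler to a single Hodgkin-Huxley compartment with the rest-shifted squid-axon parameters. The membrane capacitance of 1 µF/cm², the step size, the rate expressions (written in their commonly quoted rest-shifted form), and all function names are assumptions made for this example, not values or code taken from the chapter.

```python
import math

# Rest-shifted Hodgkin-Huxley parameters (u in mV relative to rest, t in ms).
C = 1.0                              # membrane capacitance, uF/cm^2 (assumed)
G_NA, G_K, G_L = 120.0, 36.0, 0.3    # maximal conductances, mS/cm^2
E_NA, E_K, E_L = 115.0, -12.0, 10.6  # reversal potentials, mV


def _ratio(x):
    """Return x / (exp(x) - 1), handling the removable singularity at x = 0."""
    return 1.0 if abs(x) < 1e-7 else x / math.expm1(x)


def rates(u):
    """Return {gate: (alpha, beta)} in 1/ms for membrane potential u (mV)."""
    return {
        "n": (0.1 * _ratio(1.0 - 0.1 * u), 0.125 * math.exp(-u / 80.0)),
        "m": (_ratio(2.5 - 0.1 * u), 4.0 * math.exp(-u / 18.0)),
        "h": (0.07 * math.exp(-u / 20.0), 1.0 / (math.exp(3.0 - 0.1 * u) + 1.0)),
    }


def simulate(i_ext, t_stop=50.0, dt=0.01):
    """Forward-Euler integration of (34.4)-(34.6); returns the voltage trace."""
    u = 0.0
    # Start every gate at its steady-state value alpha / (alpha + beta) at rest.
    gates = {k: a / (a + b) for k, (a, b) in rates(u).items()}
    trace = []
    for _ in range(int(round(t_stop / dt))):
        m, h, n = gates["m"], gates["h"], gates["n"]
        i_ion = (G_NA * m**3 * h * (u - E_NA)
                 + G_K * n**4 * (u - E_K)
                 + G_L * (u - E_L))
        for k, (a, b) in rates(u).items():
            gates[k] += dt * (a * (1.0 - gates[k]) - b * gates[k])
        u += dt * (-i_ion + i_ext) / C
        trace.append(u)
    return trace
```

Calling `simulate(0.0)` should leave the membrane close to rest, while a sustained current such as `simulate(10.0)` (in µA/cm², a hypothetical value) should produce action potentials. Forward Euler is used only for brevity; a stiffer or larger model would normally call for a more robust integrator.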
34.3.3 Compartmental Modeling -
Anatomical Reduction


Consider a simple case of a neuron with a spherical cell 
body. The time course of the potential change across the 
membrane can be described using the following 


$$ \Delta u(t) = I_m R_m \left( 1 - e^{-t/\tau} \right) . $$ (34.9)


In the above relationship $I_m$ and $R_m$ are the membrane
current and resistance, respectively. The rightmost term
contains e raised to the negative of time divided by the mem-
brane time constant $\tau$.

Spatially, the current in our spherical cell body will 
decay along the length of the membrane according to 
the relationship below 


$$ \Delta u(x) = \Delta u_0 \, e^{-x/\lambda} . $$ (34.10)

Above we can see that the spatial decay of potential re-
lies upon the potential where the current was initially
injected $\Delta u_0$, $x$ is the distance from the site of the cur-
rent injection, and $\lambda$ is the membrane length constant.
This constant is defined as $\lambda = \sqrt{r_m / r_a}$, where $r_m$ is
the membrane resistance and $r_a$ is the axial resistance
along the length of the compartment. The length con- 
stant is also dependent upon the radius of the segment 
with large axons conducting current more easily than 
smaller axons. In addition, each cell has individualized 
values of membrane and axial resistance that are based 
on the particular distribution of protein and cellular or- 
ganelles in each experimental scenario. 
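
The temporal and spatial relations (34.9) and (34.10) can be evaluated directly. The short Python sketch below does so; the function names and the example numbers in the usage note are hypothetical, and units are left to the caller (for instance, a membrane resistance per unit length in Ω·cm and an axial resistance in Ω/cm give a length constant in cm).

```python
import math


def delta_u_time(t, i_m, r_m, tau):
    """Potential change at time t after a current step, as in (34.9)."""
    return i_m * r_m * (1.0 - math.exp(-t / tau))


def delta_u_space(x, delta_u0, r_m, r_a):
    """Potential change at distance x from the injection site, as in (34.10)."""
    length_constant = math.sqrt(r_m / r_a)
    return delta_u0 * math.exp(-x / length_constant)
```

For example, `delta_u_time(tau, i_m, r_m, tau)` reaches about 63% of the final value `i_m * r_m`, and `delta_u_space` falls to about 37% of `delta_u0` at a distance of one length constant.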


34.3.4 Cable Equations - 
Connecting Compartments 


One of the most convenient ways of modeling a neu- 
ron is by simplifying its neural structure. This can be 
done by approximating the shape of parts of the cell 
surface as cylinders (Fig. 34.6), since this is very close 
to the real shape of a neuronal process. By starting with 
a simple cylinder, we can consider this as the fundamen- 
tal unit for computation where quantities regulating the 
system will be derived and repeated along the length 
of the model neuron [34.19]. The one-dimensional ca- 
ble equation can effectively describe the propagation of 
current along a length of cylinder that does not branch 
as shown below 


$$ \frac{\partial V}{\partial T} + F = \frac{\partial^2 V}{\partial X^2} . $$ (34.11)


In the above, V and F are both independent functions 
of time and space represented by T and X, respectively. 
Since most neurons are branched we can consider 
a neuron to be composed of many one-dimensional 
elements that can be arranged to form the branching 
structure. The boundary conditions of each cylinder are 
used to calculate specific values within each region of 
membrane. To derive this relationship one must first 
consider conservation of charge as shown here, 


$$ \sum i_a - \int_A i_m \, dA = 0 , $$


where the leftmost term is the sum of axial currents 
entering the section and im represents the integrated 
transmembrane currents over the entire segment area. 
Expanding this description to include an electrode
current source $i_s$ we obtain the following relation-
ship,


$$ \sum i_a - \int_A i_m \, dA + \int_A i_s \, dA = 0 . $$


The electrode current can be simplified to a point source 
of current due to the fact that most electrode current 
sources are much smaller than the cell itself. 

To simplify our model we can split the neuron up
into compartments, where compartment $j$ has membrane
area $A_j$. All the properties of the compartment can be
represented as the average at the middle of the
compartment shown here, $i_{m,j} A_j = \sum_k i_{a,kj}$. The current
between compartments can be defined using Ohm's law
where the voltage drop between compartments is divided
by the axial resistance between compartmental centers,
$i_{a,kj} = (v_k - v_j) / r_{jk}$. Therefore, our compartment can then
be described using the following relationship,
$i_{m,j} A_j = \sum_k (v_k - v_j) / r_{jk}$. The total membrane current
can then be found using this expression,


$$ i_{m,j} A_j = c_j \frac{dv_j}{dt} + i_{\mathrm{ion},j}(v_j, t) , $$


where $c_j$ is the compartmental capacitance and $i_{\mathrm{ion},j}(v_j, t)$
is a function that captures the varying values of ion
channel conductance in the membrane.

A set of branched cables can be constructed from 
individual segments to yield a set of differential equa- 
tions that follow the form shown in (34.12) below 


$$ c_j \frac{dv_j}{dt} + i_{\mathrm{ion},j}(v_j, t) = \sum_k \frac{v_k - v_j}{r_{jk}} . $$ (34.12)

Fig. 34.6 Generalized diagram of a single compartment
representing a segment of a neuron in a biophysical model
using cable equations. Below, three compartments are
shown to indicate axial and membrane current components
that are represented by their summed values at the center
of each compartment [34.6]


In order for the equations above to be completely valid 
we must assume that the axial current flow between 
various compartments can be closely approximated by 
calculating the value of the current at the center of each 
compartment. This implies that the current can vary lin- 
early across compartments with the compartment size 
chosen specifically to account for any spatial variations 
that may exist within the experimental system, 
$$ c_j \frac{dv_j}{dt} + i_{\mathrm{ion},j}(v_j, t) = \frac{v_{j-1} - v_j}{r_{j-1,j}} + \frac{v_{j+1} - v_j}{r_{j,j+1}} . $$ (34.13)

The above is a special case of (34.12), discussed
in [34.5], where specific attention is paid to the axial
current in adjacent compartments when a uniformly
distributed current passes through an initial neuronal
segment with constant diameter [34.20]. Compartments
have length $\Delta x$ and diameter $d$. The capacitance can
be written as $C_m \pi d \Delta x$ and the axial resistance of each
compartment is defined as $R_a \Delta x / (\pi (d/2)^2)$, where $C_m$ is
the specific membrane capacitance and $R_a$ is the
specific cytoplasmic resistivity. Manipulation of (34.13)
using the above consideration then yields the following

$$ C_m \frac{dv_j}{dt} + i_j(v_j, t) = \frac{d}{4 R_a} \, \frac{v_{j+1} - 2 v_j + v_{j-1}}{\Delta x^2} . $$ (34.14)

Here the total ionic current specified above has been
replaced with $i_j(v_j, t)$, a term that expands our
consideration of injected ions by using a current density
function. Now, if we consider the case where compartment
size becomes infinitely small, the right-hand side
reduces to the second partial derivative of membrane
potential with respect to the distance from the
compartment of interest $j$. This reduction yields (34.15) below

$$ C_m \frac{\partial v}{\partial t} + i(v, t) = \frac{d}{4 R_a} \, \frac{\partial^2 v}{\partial x^2} . $$ (34.15)

After multiplying both sides by the membrane resistance
and with a simple application of Ohm's law we
can see that $i R_m = v$ and, therefore, we find

$$ C_m R_m \frac{\partial v}{\partial t} + v = \frac{d R_m}{4 R_a} \, \frac{\partial^2 v}{\partial x^2} . $$ (34.16)

This relationship can be scaled using the constants for
time, $\tau_m = R_m C_m$, and space, the length constant
$\lambda = \sqrt{r_m / r_a}$, respectively.
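
To make the compartmental picture concrete, here is a minimal Python sketch of a passive, unbranched cable in the scaled form of (34.16), discretized into compartments as in (34.14) and stepped with forward Euler. The sealed-end boundary conditions, the steady dimensionless drive applied to the first compartment, and every numerical value (length constant 1 mm, time constant 10 ms, compartment length 0.1 mm) are assumptions chosen only for illustration.

```python
import math


def simulate_passive_cable(n=60, dx=0.1, lam=1.0, tau=10.0, drive=1.0,
                           t_stop=100.0, dt=0.02):
    """Integrate tau * dv_j/dt = -v_j + (lam/dx)^2 * (v_{j+1} - 2 v_j + v_{j-1}).

    A constant drive is added to compartment 0 and both ends are sealed
    (no axial current leaves the cable). Returns the final voltage profile.
    """
    coupling = (lam / dx) ** 2
    v = [0.0] * n
    for _ in range(int(round(t_stop / dt))):
        new_v = v[:]
        for j in range(n):
            left = v[j - 1] if j > 0 else v[j]        # sealed left end
            right = v[j + 1] if j < n - 1 else v[j]   # sealed right end
            inj = drive if j == 0 else 0.0
            dv = (-v[j] + coupling * (left - 2.0 * v[j] + right) + inj) / tau
            new_v[j] = v[j] + dt * dv
        v = new_v
    return v
```

After roughly ten time constants the profile is close to steady state and should decay approximately as $e^{-x/\lambda}$ away from the injection site, which ties this discretization back to (34.10). The explicit scheme is only stable for small `dt`; here `dt / tau * (1 + 4 * (lam / dx) ** 2)` is kept below 2.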


34.4 Reducing Computational Complexity for Large Network Simulations 


34.4.1 Reducing Computational 
Complexity — Large Scale Models 


The Hodgkin—Huxley model [34.21] sets the founda- 
tion for mathematically modeling detailed temporal 
dynamics of how action potentials in neurons are ini- 
tiated and propagated. The set of nonlinear ordinary 
differential equations of the form (34.4)-(34.6) were 
developed to describe the electrical characteristics of 
the squid giant axon. Given the many different mem-
brane currents that may be involved in the firing of
an action potential, the Hodgkin—Huxley model repre- 
sents the simplest possible representation of neuronal 
dynamics yet realistically captures the biophysical re- 
lationship between the voltage and time dependence of 
cell membrane currents. Even so, it is a daunting task 
to study a large neural network based on interconnected 
neurons each of which is modeled by Hodgkin—Huxley 
equations. The effort made in [34.22, 23] is a good il-
lustration of the inherent challenges. Even on a single
neuron level, recent studies have shown that it would
be easier to tune parameters in a less biophysically
realistic model under the general scope of integrate-
and-fire, or threshold models, which approximate the
pulse-like electrical activity as a threshold process. In
other words, such models are less sensitive to their
model parameters and thus provide more robust and ac-
curate model fitting results given profiles of injected
current waveforms [34.24-26]. These threshold models
are easy to work with but they are phenomenolog- 
ical approaches to modeling true neural behavior. It 
would not be possible to use these models to study 
membrane voltage profile over a precise time course, 
and it would be impossible to assess environmental 
parameters such as temperature change, chemical en- 
vironment change, pharmacological manipulations of 
the ion channels and their impact on the membrane 
dynamics. 

Given the many challenges of mathematical mod- 
eling of realistic neurons and neuronal networks under 
fine spatial and temporal resolution, great efforts have 
been made on several fronts to advance the study 
of neural network dynamic behaviors. Common to 
all approaches, the role of time in neuronal activities 
is emphasized and, thus, the models are usually de- 
scribed by nonlinear ordinary differential equations. 
In the following, we will examine some of these dy- 
namic models that are built on different premises and 
considerations. 


34.4.2 Firing Rate-Based Dynamic Models 


The neural firing rate-based encoding scheme assumes 
that information about environmental stimulus is con- 
tained in the firing rates of the neurons. Thus, the 
specific spike times are under-represented. Sufficient 
evidence points out that in most sensory systems, the 
firing rate increases, generally non-linearly, with in- 
creasing stimulus intensity, and measurement of firing 
rates has become a standard tool for describing the 
properties of all types of sensory or cortical neurons, 
partly due to the relative ease of measuring rates ex-
perimentally. However, this approach neglects all the
information possibly contained in the exact timing of
the spikes [34.1]. Rate coding may be inefficient, but it
is robust and easy to measure, which is why it remains
a basic tool for studying sensory or cortical neuron
characteristics in association with external stimuli or
behaviors. The class of firing rate-based dynamic models
mainly takes into account two considerations. First,
these models account for a population of neurons in the
model. As such, these models aim at simulating
large-scale neural network behaviors. Second, these models
were motivated by associative memory processes where
the time reflects the memory recall process.

Consider a population of neurons, and let $r_i(t)$ de-
note the mean firing rate of a target neuron $i$, and $r_j(t)$,
$j \in N_i = \{ j \mid j \text{ is presynaptic to } i \}$, the mean firing rates
of all neurons presynaptic to neuron $i$. Let $h_i(t)$ be the
input to target neuron $i$, which is


$$ h_i(t) = \sum_{j \in N_i} w_{ij} \, r_j(t) . $$ (34.17)


Equation (34.17) takes into account all presynaptic neu-
rons’ contributions weighted by synaptic efficacy $w_{ij}$.
A representative class of the firing rate model as studied
extensively in the artificial neural networks community,
which was first popularized by Hopfield [34.27], can
then be described at the fixed-point of an associative
memory process as


$$ r_i(t) = \Theta\!\left( \sum_{j \in N_i} w_{ij} \, r_j(t) \right) . $$ (34.18)

In (34.18), $\Theta(\cdot)$ is considered a gain function. Conse-
quently the firing rate dynamics associated with this
associative memory process can be defined by introduc-
ing a time constant $\tau$ in the associative network as


$$ \tau \frac{dr_i}{dt} = -r_i(t) + \Theta\!\left( \sum_{j \in N_i} w_{ij} \, r_j(t) \right) . $$ (34.19)


The firing rate model (34.19) also has another inter-
esting interpretation where the mean firing rate $r_i(t)$ is
considered to be the spatially averaged neural potential
$F_i(t)$ of neuron $i$ due to contributions from a local pop-
ulation of neurons $j \in N_i = \{ j \mid j \text{ is presynaptic to } i \}$. As
such, (34.19) becomes


$$ \tau \frac{dF_i(t)}{dt} = -F_i(t) + \Theta\!\left( \sum_{j \in N_i} w_{ij} \, F_j(t) \right) . $$ (34.20)


The model described by (34.20) was used for the 
analysis of a large neural network where slow neural 
dynamics were assumed in order to describe spatially 
homogeneous motoneurons [34.28]. 
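
The relaxation of the rate dynamics (34.19) toward the fixed point (34.18) can be sketched in a few lines of Python. The logistic gain function, the weight matrix in the usage note, and the function names are assumptions for illustration; the chapter does not specify a particular gain function.

```python
import math


def gain(x):
    """Logistic gain function (an assumed choice for the gain in (34.18))."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_rates(w, r0, tau=10.0, dt=0.1, steps=2000):
    """Forward-Euler integration of tau * dr_i/dt = -r_i + gain(sum_j w_ij r_j).

    w is a list of rows; a zero entry w[i][j] means neuron j is not
    presynaptic to neuron i. Returns the rates after the last step.
    """
    r = list(r0)
    for _ in range(steps):
        h = [sum(w_ij * r_j for w_ij, r_j in zip(row, r)) for row in w]
        r = [r_i + dt / tau * (-r_i + gain(h_i)) for r_i, h_i in zip(r, h)]
    return r
```

With weak symmetric coupling, for example `simulate_rates([[0.0, 0.5], [0.5, 0.0]], [0.0, 1.0])`, the rates settle to values that satisfy (34.18) to numerical precision. Stronger or asymmetric weights can produce multiple fixed points or oscillations, which this sketch does not analyze.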
34.4.3 Spike Response Model


The spike response model [34.29] uses response ker- 
nels to account for the integral effect of presynaptic 
action potentials. With two linear kernels and under 
a simple renewal assumption, it can be shown that the 
spike response model (SRM) is a generalization of the 
integrate-and-fire neuron. The spike response model de-
scribes the membrane potential $u_i(t)$ of neuron $i$ as


$$ u_i(t) = \eta(t - \hat{t}_i) + \int_0^{\infty} \kappa(t - \hat{t}_i, s) \, I^{\mathrm{ext}}(t - s) \, ds , $$ (34.21)


where $\eta$ represents the typical form of an action po-
tential, which includes both depolarization and repolar-
ization, as well as the process of settling down to the
resting potential; $\hat{t}_i$ is the time at which neuron $i$ fired
an action potential. Also in the equation, the kernel
$\kappa(t - \hat{t}_i, s)$ is a linear impulse response function of the
membrane potential to a unit input current. Imagine it 
as a time course of an additive membrane potential to 


34.5 Conclusions 


In this chapter we have provided an introduction to 
some established modeling approaches to studying bio- 
logical neural systems. Motivations behind these mod- 
els are twofold. First and foremost, biological realism 
is considered to be of the utmost importance. Given 
the complex nature of a biological neuron, reduced- 
order neuronal models that can be or have been vali- 
dated by the Hodgkin—Huxley model or biological data 
have been developed. As discussed, these models only 
scratch the surface of providing an accurate and realis- 
tic account of a real neural system, not even a specific 
brain area or something capable of explaining a behav- 
ioral parameter completely and thoroughly. Nonethe- 
less, this decade has probably seen the most progress in 
terms of computational modeling approaches to under-
standing the brain. The International Neuroinformatics

(spinal) control of movement. However, over the


last three decades, attention has turned increas- 
ingly toward higher functions related to cognition, 
decision making and voluntary behavior. Exper- 
imental studies have shown that specific brain 
structures — the prefrontal cortex, the premo- 
inforcement learning. Because of the complexity
of the issues involved and the difficulty of direct
tor control. The resulting computational models are
also very useful in engineering domains such as 
robotics, intelligent agents, and adaptive control. 
While it is impossible to encompass the totality
of such modeling work, this chapter provides an
overview of significant efforts in the last 20 years.
It also outlines many of the theoretical issues
underlying this work, and discusses significant
experimental results that motivated the computa-
tional models.


35.1 Overview


Mental function is usually divided into three parts:
perception, cognition, and action (the so-called
sense-think-act cycle). Though this view is no longer held
dogmatically, it is useful as a structuring framework for
discussing mental processes. Several decades of theory
and experiment have elucidated an intricate,
multiconnected functional architecture for the brain [35.1, 2],
a simplified version of which is shown in Fig. 35.1.
While all regions and functions shown (and many not
shown) are important, this figure provides a summary
of the main brain regions involved in perception,
cognition, and action. The highlighted blocks in Fig. 35.1 are
discussed in this chapter, which focuses mainly on the
higher level mechanisms for the control of behavior.

The control of action (or behavior) is, in a real 
sense, the primary function of the nervous system. 
While such actions may be voluntary or involuntary, 
most of the interest in modeling has understandably fo- 
cused on voluntary action. This chapter will follow this 
precedent. 

Fig. 35.1 A general schematic of
primary signal flow in the nervous
system. Many modulatory regions
and connections, as well as several
known connections, are not shown.
The shaded areas indicate the com-
ponents covered in this chapter

It is conventional to divide the neural substrates
of behavior into higher and lower levels. The latter
involves the musculoskeletal apparatus of action (mus-
cles, joints, etc.) and the neural networks of the spinal
cord and brainstem. These systems are seen as repre-
senting the actuation component of the action system,
which is controlled by the higher level system com- 
prising cortical and subcortical structures. This division 
between a controller (the brain) and the plant (the 
body and spinal networks), which parallels the mod- 
els used in robotics, has been criticized as arbitrary 
and unhelpful [35.3,4], and there has recently been 
a shift of interest toward more embodied views of cog- 
nition [35.5, 6]. However, the conventional division is 
useful for organizing material covered in this chapter, 
which focuses primarily on the higher level systems, 
i. e., those above the spinal cord and the brainstem. 
The higher level system can be divided further into 
a cognitive control component involving action selec- 
tion, configuration of complex actions, and the learning 
of appropriate behaviors through experience, and a mo- 
tor control component that generates the control signals 
for the lower level system to execute the selected action. 
The latter is usually identified with the motor cortex 
(M1), premotor cortex (PMC), and the supplementary 
motor area (SMA), while the former is seen as involv-
ing the prefrontal cortex (PFC), basal ganglia (BG), the
anterior cingulate cortex (ACC) and other cortical and 
subcortical regions [35.7]. With regard to the genera- 
tion of actions per se, an influential viewpoint for the 
higher level system is summarized by Doya [35.8]. It 
proposes that higher level control of action has three 
major loci: the cortex, the cerebellum, and the BG. Of 
these, the cortex — primarily the M1 — provides a self- 
organized repertoire of possible actions that, when trig- 
gered, generate movement by activating muscles via 
spinal networks, the cerebellum implements fine mo- 
tor control configured through error-based supervised 
learning [35.9], and the BG provide the mechanisms for 
selecting among actions and learning appropriate ones 
through reinforcement learning [35.10-13]. The motor 
cortex and cerebellum can be seen primarily as mo- 
tor control (though see [35.14]), whereas the BG falls 
into the domain of cognitive control and working mem- 
ory (WM). The PFC is usually regarded as the locus 
for higher order choice representations, plans, goals, 
etc. [35.15-18], while the ACC is thought to be in- 
volved in conflict monitoring [35.19-21]. 


Computational Models of Cognitive and Motor Control 


35.2 Motor Control 




Given its experimental accessibility and direct rele- 
vance to robotics, motor control has been a primary 
area of interest for computational modeling [35.22-24].
Mathematical, albeit non-neural, theories of motor
control were developed initially within the framework 
of dynamical systems. One of these directions led to 
models of action as an emergent phenomenon [35.3, 
25-33] arising from interactions among preferred coor- 
dination modes [35.34]. This approach has continued to 
yield insights [35.29] and has been extended to multiac- 
tor situations as well [35.33, 35-37]. Another approach 
within the same framework is the equilibrium point 
hypothesis [35.38,39], which explains motor control 
through the change in the equilibrium points of the mus- 
culoskeletal system in response to neural commands. 
Both these dynamical approaches have paid relatively 
less attention to the neural basis of motor control and 
focused more on the phenomenology of action in its 
context. Nevertheless, insights from these models are 
fundamental to the emerging synthesis of action as an 
embodied cognitive function [35.5, 6]. 

A closely related investigative tradition has been de-
veloped from the early studies of gaits and other rhyth-
mic movements in cats, fish, and other animals [35.40-45],
leading to computational models for central pat-
tern generators (CPGs), which are neural networks
that generate characteristic periodic activity patterns au- 
tonomously or in response to control signals [35.46]. 
It has been found that rhythmic movements can be ex- 
plained well in terms of CPGs — located mainly in the 
spinal cord — acting upon the coordination modes in- 
herent in the musculoskeletal system. The key insight 
to emerge from this work is that a wide range of use- 
ful movements can be generated by modulation of these 
CPGs by rather simple motor control signals from the 
brain, and feedback from sensory receptors can shape 
these movements further [35.43]. This idea was demon-
strated in recent work by Ijspeert et al. [35.47] showing
how the same simple CPG network could produce both 
swimming and walking movements in a robotic sala- 
mander model using a simple scalar control signal. 

While rhythmic movements are obviously impor- 
tant, computational models of motor control are often 
motivated by the desire to build humanoid or biomor- 
phic robots, and thus need to address a broader range of 
actions — especially aperiodic and/or voluntary move- 
ments. Most experimental work on aperiodic movement 
has focused on the paradigm of manual reaching [35.30, 
48-64]. However, seminal work has also been done


with complex reflexes in frogs and cats [35.65-72], iso-
metric tasks [35.73, 74], ball-catching [35.75], drawing
and writing [35.60, 76-81], and postural control [35.71,
72, 82, 83].

A central issue in understanding motor control is the 
degrees of freedom problem [35.84] which arises from 
the immense redundancy of the system — especially in 
the context of multijoint control. For any desired move- 
ment — such as reaching for an object — there are an 
infinite number of control signal combinations from 
the brain to the muscles that will accomplish the task 
(see [35.85] for an excellent discussion). From a con- 
trol viewpoint, this has usually been seen as a problem 
because it precludes the clear specification of an objec- 
tive function for the controller. To the extent that they 
consider the generation of specific control signals for 
each action, most computational models of motor con- 
trol can be seen as direct or indirect ways to address the 
degrees of freedom problem. 


35.2.1 Cortical Representation 
of Movement 


It has been known since the seminal work by Penfield 
and Boldrey [35.86] that the stimulation of specific lo-
cations in the M1 elicits motor responses in particular
locations on the body. This has led to the notion of
a motor homunculus, a map of the body on the M1.
However, the issue of exactly what aspect of move-
ment is encoded in the responses of individual neurons is
far from settled. A crucial breakthrough came with 
the discovery of population coding by Georgopoulos 
et al. [35.49]. It was found that the activity of spe- 
cific neurons in the hand area of the M1 corresponded 
to reaching movements in particular directions. While 
the tuning of individual cells was found to be rather 
broad (and had a sinusoidal profile), the joint activity of 
many such cells with different tuning directions coded 
the direction of movement with great precision, and 
could be decoded through neurally plausible estima- 
tion mechanisms. Since the initial discovery, population 
codes have been found in other regions of the cor- 
tex that are involved in movement [35.49, 53, 54, 60, 
77-80, 87]. Population coding is now regarded as the 
primary basis of directional coding in the brain, and 
is the basis of most brain-machine interfaces (BMI)
and brain-controlled prosthetics [35.88, 89]. Neural net-
work models for population coding have been devel-
oped by several researchers [35.90-93], and popula-
tion coding has come to be seen as a general neural
representational strategy with application far beyond 
motor control [35.94]. Excellent reviews are provided 
in [35.95, 96]. Mathematical and computational models 
for Bayesian inference with population codes are dis- 
cussed in [35.97, 98]. 
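
One simple way to see how broadly tuned cells can jointly encode direction precisely is a population-vector decoder. The Python sketch below assumes idealized cosine tuning, evenly spaced preferred directions, a known baseline rate, and noise-free responses; all names and numbers are hypothetical and serve only as an illustration, not as the specific estimation mechanism used in the cited studies.

```python
import math


def tuning_rates(theta, preferred, baseline=20.0, depth=15.0):
    """Firing rate of each cell for movement direction theta (cosine tuning)."""
    return [baseline + depth * math.cos(theta - p) for p in preferred]


def population_vector(rates, preferred, baseline=20.0):
    """Decode direction as the angle of the rate-weighted sum of preferred directions."""
    x = sum((r - baseline) * math.cos(p) for r, p in zip(rates, preferred))
    y = sum((r - baseline) * math.sin(p) for r, p in zip(rates, preferred))
    return math.atan2(y, x)
```

A typical use would build `preferred = [2 * math.pi * k / 16 for k in range(16)]`, generate `rates = tuning_rates(1.0, preferred)`, and recover the direction with `population_vector(rates, preferred)`. With noisy rates or unevenly distributed preferred directions the estimate becomes approximate, which is where the Bayesian treatments mentioned above become relevant.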

An active research issue in the cortical coding of 
movement is whether it occurs at the level of kine- 
matic variables, such as direction and velocity, or in 
terms of kinetic variables, such as muscle forces and 
joint torques. From a cognitive viewpoint, a kinematic 
representation is obviously more useful, and popula- 
tion codes suggest that such representations are indeed 
present in the motor cortex [35.48, 53,54, 60, 77-80, 
99,100] and PFC [35.15, 101]. However, movement 
must ultimately be constructed from the appropriate 
kinetic variables, i.e., by controlling the forces gener- 
ated by specific muscles and the resulting joint torques. 
Studies have indicated that some neurons in the M1 are 
indeed tuned to muscle forces and joint torques [35.58, 
59, 73, 99, 100, 102, 103]. This apparent multiplicity of 
cortical representations has generated significant debate 
among researchers [35.74]. One way to resolve this 
issue is to consider the kinetic and kinematic represen- 
tations as dual representations related through the con- 
straints of the musculoskeletal system. However, Shah 
et al. [35.104] have used a simple computational model 
to show that neural populations tuned to kinetic or kine- 
matic variables can act jointly in motor control without 
the need for explicit coordinate transformations. 

Graziano et al. [35.105] studied movements elicited 
by the sustained electrode stimulation of specific sites 
in the motor cortex of monkeys. They found that dif- 
ferent sites led to specific complex, multijoint move- 
ments such as bringing the hand to the mouth or lifting 
the hand above the head regardless of the initial posi- 
tion. This raises the intriguing possibility that individual 
cells or groups of cells in the M1 encode goal-directed 
movements that can be triggered as units. The study 
also indicated that this encoding is not open-loop, but 
can compensate — at least to some degree — for varia- 
tion or extraneous perturbations. The M1 and other re- 
lated regions (e.g., the supplementary motor area and 
the PMC) appear to encode spatially organized maps of 
a few canonical complex movements that can be used 
as basis functions to construct other actions [35.105-107].
A neurocomputational model using self-organized
feature maps has been proposed in [35.108] for the rep- 
resentation of such canonical movements. 

In addition to rhythmic and reaching movements, 
there has also been significant work on the neural 


basis of sequential movements, with the finding that 
such neural codes for movement sequences exist in 
the supplementary motor area [35.109-111], cerebel- 
lum [35.112, 113], BG [35.112], and the PFC [35.101]. 
Coding for multiple goals in sequential reaching has 
been observed in the parietal cortex [35.114]. 


35.2.2 Synergy-based Representations 


A rather different approach to studying the construction 
of movement uses the notion of motor primitives, of- 
ten termed synergies [35.63, 115, 116]. Typically, these 
synergies are manifested in coordinated patterns of spa- 
tiotemporal activation over groups of muscles, implying 
a force field over posture space [35.117, 118]. Studies in 
frogs, cats, and humans have shown that a wide range 
of complex movements in an individual subject can be 
explained as the modulated superposition of a few syn- 
ergies [35.63, 65-72, 115,119, 120]. Given a set of n 
muscles, the n-dimensional time-varying vector of ac- 
tivities for the muscles during an action can be written 
as 


$$ \mathbf{m}^q(t) = \sum_{k=1}^{N} c_k^q \, \mathbf{g}_k(t - t_k^q) , $$

where $\mathbf{g}_k(t)$ is a time-varying synergy function that
takes only nonnegative values, $c_k^q$ is the gain of the
$k$th synergy used for action $q$, and $t_k^q$ is the tempo-
ral offset with which the $k$th synergy is triggered for
action q [35.69]. The key point is that a broad range 
of actions can be constructed by choosing different 
gains and offsets over the same set of synergies, which 
represent a set of hard-coded basis functions for the 
construction of movements [35.120, 121]. Even more 
interestingly, it appears that the synergies found em- 
pirically across different subjects of the same species 
are rather consistent [35.67, 72], possibly reflecting the 
inherent constraints of musculoskeletal anatomy. Var- 
ious neural loci have been suggested for synergies, 
including the spinal cord [35.67, 107, 122], the motor 
cortex [35.123], and combinations of regions [35.85, 
124]. 
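
The superposition above can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration: the Gaussian burst used as the synergy waveform, the three-muscle setup, and all numerical values are hypothetical and are not taken from the cited studies.

```python
import math

def synergy_waveform(t, peak_time, width, muscle_weights):
    """Nonnegative synergy g_k(t): a Gaussian burst scaled per muscle (assumed shape)."""
    if t < 0.0:
        return [0.0 for _ in muscle_weights]  # silent before the synergy is triggered
    envelope = math.exp(-((t - peak_time) ** 2) / (2.0 * width ** 2))
    return [w * envelope for w in muscle_weights]

def muscle_activity(t, synergies, gains, offsets):
    """m^q(t) = sum_k c_k^q * g_k(t - t_k^q) for a single action q."""
    n_muscles = len(synergies[0]["muscle_weights"])
    m = [0.0] * n_muscles
    for syn, c, t_k in zip(synergies, gains, offsets):
        g = synergy_waveform(t - t_k, syn["peak_time"], syn["width"],
                             syn["muscle_weights"])
        for i in range(n_muscles):
            m[i] += c * g[i]
    return m

# Two hypothetical synergies over three muscles (illustrative numbers only).
SYNERGIES = [
    {"peak_time": 0.10, "width": 0.05, "muscle_weights": [1.0, 0.5, 0.0]},
    {"peak_time": 0.15, "width": 0.08, "muscle_weights": [0.0, 0.3, 1.0]},
]

# Two different actions built from the same synergies by changing only
# the gains and the temporal offsets.
action_a = muscle_activity(0.2, SYNERGIES, gains=[1.0, 0.2], offsets=[0.0, 0.1])
action_b = muscle_activity(0.2, SYNERGIES, gains=[0.3, 1.0], offsets=[0.1, 0.0])
```

The controller in this sketch specifies only two numbers per synergy (a gain and an offset) rather than a full activation profile for every muscle.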

Though synergies are found consistently in the anal- 
ysis of experimental data, their actual existence in the 
neural substrate remains a topic for debate [35.125, 
126]. However, the idea of constructing complex movements
from motor primitives has found ready application
in robotics [35.127-132], as discussed later in
this chapter. A hierarchical neurocomputational model 
of motor synergies based on attractor networks has re- 
cently been proposed in [35.133, 134]. 


35.2.3 Computational Models 
of Motor Control 


Motor control has been modeled computationally at 
many levels and in many ways, ranging from explicitly 
control-theoretic models through reinforcement-based 
models to models based on emergent dynamical pat- 
terns. This section provides a brief overview of these 
models. 

As discussed above, M1, the premotor cortex (PMC),
and the supplementary motor area (SMA) are seen as
providing self-organized codes for specific actions, in- 
cluding information on direction, velocity, force, low- 
level sequencing, etc., while the PFC provides higher 
level codes needed to construct more complex actions. 
These codes, comprising a repertoire of actions [35.10, 
106], arise through self-organized learning of activity 
patterns in these cortical systems. The BG system is 
seen as the primary locus of selection among the actions 
in the cortical repertoire. The architecture of the sys- 
tem involving the cortex, BG, and the thalamus, and in 
particular the internal architecture of the BG [35.135], 
makes this system ideally suited to selectively disin- 
hibiting specific cortical regions, presumably activating 
codes for specific actions [35.10, 136, 137]. The BG 
system also provides an ideal substrate for learning 
appropriate actions through a dopamine-mediated reinforcement
learning mechanism [35.138-141].

Many of the influential early models of motor control
were based on control-theoretic principles [35.142-144],
using forward and inverse kinematic and dynamic
models to generate control signals [35.55, 57,
145-150]; see [35.146] for an excellent introduction.
These models have led to more sophisticated ones, such 
as MOSAIC (modular selection and identification for 
control) [35.151] and AVITEWRITE (adaptive vector 
integration to endpoint handwriting) [35.81]. The MO- 
SAIC model is a mixture of experts, consisting of many 
parallel modules, each comprising three subsystems. 
These are: a forward model relating motor commands
to predicted position, a responsibility predictor that es- 
timates the applicability of the current module, and an 
inverse model that learns to generate control signals 
for desired movements. The system generates motor 
commands by combining the recommendations of the 
inverse models of all modules weighted by their appli- 
cability. Learning in the model is based on a variant 
of the EM algorithm. The model in [35.57] is a com- 
prehensive neural model with both cortical and spinal 
components, and builds upon the earlier VITE model 
in [35.55]. The AVITEWRITE model [35.81], which is
a further extension of the VITE model, can generate the
complex movement trajectories needed for writing by 
using a combination of pre-specified phenomenological 
motor primitives (synergies). A cerebellar model for the 
control of timing during reaches has been presented by 
Barto et al. [35.152]. 
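
The responsibility-weighted combination of inverse models described for MOSAIC can be sketched as follows. This is a simplified illustration under stated assumptions, not the published model: the plant is a hypothetical one-dimensional linear system, the `Module` class and its parameters are invented for the example, and responsibilities are computed directly from forward-model prediction errors with a softmax instead of being learned with an EM variant.

```python
import math

class Module:
    """One hypothetical module for a 1-D linear plant x_next = a * x + b * u."""

    def __init__(self, a, b):
        self.a, self.b = a, b

    def forward(self, x, u):
        """Forward model: predicted next position for command u."""
        return self.a * x + self.b * u

    def inverse(self, x, x_desired):
        """Inverse model: command that would move x to x_desired under this module."""
        return (x_desired - self.a * x) / self.b

def responsibilities(modules, x_prev, u_prev, x_now, sigma=0.1):
    """Softmax over Gaussian log-likelihoods of each forward prediction (assumed form)."""
    log_lik = [-((x_now - m.forward(x_prev, u_prev)) ** 2) / (2.0 * sigma ** 2)
               for m in modules]
    top = max(log_lik)  # subtract the maximum for numerical stability
    weights = [math.exp(value - top) for value in log_lik]
    total = sum(weights)
    return [w / total for w in weights]

def combined_command(modules, resp, x, x_desired):
    """Motor command: inverse-model outputs weighted by module responsibility."""
    return sum(r * m.inverse(x, x_desired) for r, m in zip(resp, modules))

modules = [Module(a=1.0, b=0.5), Module(a=0.8, b=2.0)]

# An observed transition generated by a plant that matches the second module.
x_prev, u_prev = 1.0, 0.5
x_now = 0.8 * x_prev + 2.0 * u_prev
resp = responsibilities(modules, x_prev, u_prev, x_now)
u = combined_command(modules, resp, x_now, x_desired=2.0)
```

Because the second module predicts the observed transition almost perfectly, it receives nearly all of the responsibility, and the combined command is dominated by its inverse model.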

The use of neural maps in models of motor con- 
trol was pioneered in [35.153,154]. These models 
used self-organized feature maps (SOFMs) [35.155] to 
learn visuomotor coordination. Baraduc et al. [35.156] 
presented a more detailed model that used multiple 
maps to first integrate posture and desired movement 
direction and then to transform this internal repre- 
sentation into a motor command. The maps in this 
and most subsequent models were based on earlier 
work by [35.90-93]. An excellent review of this ap- 
proach is given in [35.94]. A more recent and com- 
prehensive example of the map-based approach is the 
SURE-REACH (sensorimotor, unsupervised, redun- 
dancy-resolving control architecture) model in [35.157] 
which focuses on exploiting the redundancy inher- 
ent in motor control [35.84]. Unlike many of the 
other models, which use neurally implausible error-based
learning, SURE-REACH relies only on unsupervised
and reinforcement learning. Maps are also
the central feature of a general cognitive architec- 
ture called ERA (epigenetic robotics architecture) by 
Morse et al. [35.158]. 

Another successful approach to motor control mod- 
els is based on the use of motor primitives, which are 
used as basis functions in the construction of diverse ac- 
tions. This approach is inspired by the experimental ob- 
servation of motor synergies as described above. How- 
ever, most models based on primitives implement them 
nonneurally, as in the case of AVITEWRITE [35.81]. 
The most systematic model of motor primitives has 
been developed by Schaal et al. [35.129-132]. In this 
model, motor primitives are specified using differen- 
tial equations, and are combined after weighting to 
produce different movements. Recently, Matsubara et 
al. [35.159] have shown how the primitives in this 
model can be learned systematically from demonstra- 
tions. Drew et al. [35.123] proposed a conceptual model 
for the construction of locomotion using motor prim- 
itives (synergies) and identified the characteristics of 
such primitives experimentally. A neural model of 
motor primitives based on hierarchical attractor net- 
works has been proposed recently in [35.133, 134, 160], 
while Neilson and Neilson [35.85, 124] have proposed 
a model based on coordination among adaptive neural 
filters. 
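
As an illustration of a motor primitive specified by differential equations, the following sketch loosely follows the widely used dynamic movement primitive formulation: a damped spring pulls the state toward a goal while a weighted set of basis functions shapes the path. The function name, the parameter values, and the simple Euler integration are assumptions made for this example and should not be read as the cited authors' implementation.

```python
import math

def run_primitive(y0, goal, weights, centers, widths, tau=1.0,
                  alpha=25.0, beta=6.25, alpha_x=3.0, dt=0.001, duration=2.0):
    """Integrate one primitive and return the position trajectory."""
    y, z, x = y0, 0.0, 1.0          # position, scaled velocity, phase variable
    trajectory = [y]
    for _ in range(int(duration / dt)):
        # Gaussian basis functions over the decaying phase variable x.
        psi = [math.exp(-h * (x - c) ** 2) for c, h in zip(centers, widths)]
        norm = sum(psi)
        forcing = 0.0
        if norm > 1e-12:
            forcing = (sum(p * w for p, w in zip(psi, weights)) / norm
                       * x * (goal - y0))
        dz = (alpha * (beta * (goal - y) - z) + forcing) / tau
        dy = z / tau
        dx = -alpha_x * x / tau
        z += dz * dt
        y += dy * dt
        x += dx * dt
        trajectory.append(y)
    return trajectory

CENTERS = [0.8, 0.4, 0.1]           # hypothetical basis-function centers
WIDTHS = [20.0, 20.0, 20.0]         # hypothetical basis-function widths

plain = run_primitive(0.0, 1.0, [0.0, 0.0, 0.0], CENTERS, WIDTHS)
shaped = run_primitive(0.0, 1.0, [50.0, -30.0, 20.0], CENTERS, WIDTHS)
```

Both runs converge to the same goal because the forcing term vanishes as the phase variable decays; only the weights, and therefore the path taken, differ. Combining movements then amounts to choosing weights rather than specifying trajectories point by point.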

Motor control models based on primitives can be
simpler than those based on trajectory tracking be- 
cause the controller typically needs to choose only the 
weights (and possibly delays) for the primitives rather 
than specifying details of the trajectory (or forces). 
Among other things, this promises a potential solu- 
tion to the degrees of freedom problem [35.84] since 
the coordination inherent in the definition of motor 
primitives reduces the effective degrees of freedom 
in the system. Another way to address the degrees 
of freedom problem is to use an optimal control ap- 
proach with a specific objective function. Researchers 
have proposed objective functions such as minimum 
jerk [35.161], minimum torque [35.162], minimum ac- 
celeration [35.163], or minimum energy [35.85], but 
an especially interesting idea is to optimize the dis- 
tribution of variability across the degrees of freedom 
in a task-dependent way [35.144, 164-167]. From this 
perspective, motor control trades off variability in task- 
irrelevant dimensions for greater accuracy in task-rele- 
vant ones. Thus, rather than specifying a trajectory, the 
controller focuses only on correcting consequential er- 
rors. This also explains the experimental observation 
that motor tasks achieve their goals with remarkable 
accuracy while using highly variable trajectories to 
achieve the same goal. Trainin et al. [35.168] have 
shown that the optimal control principle can be used 
to explain the observed neural coding of movements in
the cortex. Biess et al. [35.169] have proposed a detailed
computational model for controlling an arm in
three-dimensional space by separating the spatial and 
temporal components of control. This model is based 
on optimizing energy usage and jerk [35.161], but is 
not implemented at the neural level. 
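
As a worked example of an objective-function approach, the minimum-jerk criterion has a well-known closed-form solution for a point-to-point movement that starts and ends at rest: the position follows a fifth-order polynomial in normalized time. The sketch below evaluates that polynomial and its derivative; the function names and the one-dimensional setting are assumptions made for illustration.

```python
def minimum_jerk_position(t, duration, x_start, x_end):
    """Position at time t: x0 + (xf - x0) * (10 s^3 - 15 s^4 + 6 s^5), s = t / T."""
    s = min(max(t / duration, 0.0), 1.0)
    shape = 10.0 * s ** 3 - 15.0 * s ** 4 + 6.0 * s ** 5
    return x_start + (x_end - x_start) * shape

def minimum_jerk_velocity(t, duration, x_start, x_end):
    """Velocity at time t: time derivative of the position profile."""
    s = min(max(t / duration, 0.0), 1.0)
    shape = 30.0 * s ** 2 - 60.0 * s ** 3 + 30.0 * s ** 4
    return (x_end - x_start) * shape / duration

# A 1-second reach from 0 to 2 (arbitrary units).
midpoint = minimum_jerk_position(0.5, 1.0, 0.0, 2.0)
peak_speed = minimum_jerk_velocity(0.5, 1.0, 0.0, 2.0)
```

The resulting velocity profile is bell-shaped and symmetric, with its peak at the midpoint of the movement equal to 1.875 times the average speed.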

An alternative to these prescriptive and construc- 
tivist approaches to motor control is provided by mod- 
els based on dynamical systems [35.3, 25-27, 29,31- 
33]. The most important way in which these models 
diverge from the others is in their use of emergence 
as the central organizational principle of control. In 
this formulation, control programs, structures, primitives,
etc., are not preconfigured in the brain-body
system, but emerge under the influence of task and 
environmental constraints on the affordances of the 
system [35.33]. Thus, the dynamical systems view of 
motor control is fundamentally ecological [35.170], 
and like most ecological models, is specified in terms 
of low-dimensional state dynamics rather than high- 
dimensional neural processes. Interestingly, a corre- 
spondence can be made between the dynamical and 
optimal control models through the so-called uncon- 
trolled manifold concept [35.31, 33, 39,171]. In both 
models, the dimensions to be controlled and those 
that are left uncontrolled are decided by external con- 
straints rather than internal prescription, as in classical 
models. 


35.3 Cognitive Control and Working Memory 


A lot of behavior — even in primates — is automatic, or 
almost so. This corresponds to actions (or internal be- 
haviors) so thoroughly embedded in the sensorimotor 
substrate that they emerge effortlessly from it. In con- 
trast, some tasks require significant cognitive effort for 
one or more reasons, including:


1. An automatic behavior must be suppressed to allow 
the correct response to emerge, e.g., in the Stroop 
task [35.172]. 

2. Conflicts between incoming information and/or re- 
called behaviors must be resolved [35.19, 20]. 

3. More contextual information — e.g., social context — 
must be taken into account before acting. 

4. Intermediate pieces of information need to be stored 
and recalled during the performance of the task, 
e.g., in sequential problem solving. 


5. The timing of subtasks within the overall task is 
complex, e.g., in delayed-response tasks or other se- 
quential tasks [35.173]. 


Roughly speaking, the first three fall under the
heading of cognitive control, and the latter two under
working memory. However, because the functions are
intimately linked, the terms are often subsumed into
each other.


35.3.1 Action Selection 
and Reinforcement Learning 


Action selection is arguably the central component of 
the cognitive control process. As the name implies, it in- 
volves selectively triggering an action from a repertoire 
of available ones. While action selection is a complex 
process involving many brain regions, a consensus has
emerged that the BG system plays a central role in 
its mechanism [35.10, 12,14]. The architecture of the 
BG system and the organization of its projections to 
and from the cortex [35.135, 174, 175] make it ideally 
suited to function as a state-dependent gating system 
for specific functional networks in the cortex. As shown 
in Fig. 35.2, the hypothesis is that the striatal layer of 
the BG system, receiving input from the cortex, acts 
as a pattern recognizer for the current cognitive state. 
Its activity inhibits specific parts of the globus pal- 
lidus (GPi), leading to disinhibition of specific neural 
assemblies in the cortex — presumably allowing the 
behavior/action encoded by those assemblies to pro- 
ceed [35.10]. The associations between cortical activity 
patterns and behaviors are key to the functioning of the 
BG as an action selection system, and the configura- 
tion and modulation of these associations are thought 
to lie at the core of cognitive control. The neurotrans- 
mitter dopamine (DA) plays a key role here by serving 
as a reward signal [35.138-140] and modulating reinforcement
learning [35.176, 177] in both the BG and the
cortex [35.141, 178-180]. 
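
The gating hypothesis can be made concrete with a deliberately reduced sketch. It assumes one striatal unit and one GPi channel per action, rate-coded activities, and a reward-modulated Hebbian rule standing in for dopamine-mediated learning; it omits the STN and GPe pathways entirely. All function names and numbers are hypothetical.

```python
def striatal_activity(cortical_state, weights):
    """Each striatal unit responds to the match between cortex and its weights."""
    return [max(0.0, sum(w * s for w, s in zip(row, cortical_state)))
            for row in weights]

def gate_actions(cortical_state, weights, tonic_gpi=1.0, threshold=0.5):
    """Return True for every action channel that is disinhibited."""
    striatum = striatal_activity(cortical_state, weights)
    # GPi is active by default; striatal activity inhibits its channel.
    gpi = [max(0.0, tonic_gpi - s) for s in striatum]
    # Low GPi output releases the corresponding cortical action.
    return [g < threshold for g in gpi]

def dopamine_update(weights, cortical_state, chosen, reward, expected, rate=0.1):
    """Reward-modulated Hebbian update of one unit's corticostriatal weights."""
    delta = reward - expected  # stand-in for the dopamine signal (assumption)
    weights[chosen] = [w + rate * delta * s
                       for w, s in zip(weights[chosen], cortical_state)]
    return weights

weights = [[0.9, 0.0], [0.0, 0.3]]   # two actions, two cortical features
context_a, context_b = [1.0, 0.0], [0.0, 1.0]

gates_a = gate_actions(context_a, weights)          # action 0 is released
gates_b_before = gate_actions(context_b, weights)   # nothing is released yet

# Rewarding action 1 in context B strengthens the association.
for _ in range(3):
    weights = dopamine_update(weights, context_b, chosen=1, reward=1.0, expected=0.0)
gates_b_after = gate_actions(context_b, weights)    # action 1 is now released
```

In this sketch, learning does not change the actions themselves, only the cortical contexts in which each one is released.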


35.3.2 Working Memory 


All nontrivial behaviors require task-specific informa- 
tion, including relevant domain knowledge and the 
relative timing of subtasks. These are usually grouped 
under the function of WM. An influential model of 
WM in [35.181] identifies three components in WM: 
(1) a central executive, responsible for attention, de- 
cision making, and timing; (2) a phonological loop, 
responsible for processing incoming auditory infor- 
mation, maintaining it in short-term memory, and re- 
hearsing utterances; and (3) a visuospatial sketchpad, 
responsible for processing and remembering visual in- 
formation, keeping track of what and where informa- 
tion, etc. An episodic buffer to manage relationships 
between the other three components is sometimes in- 
cluded [35.182]. Though already rather abstract, this 
model needs even more generalized interpretation in 
the context of many cognitive tasks that do not directly 
involve visual or auditory data. Working memory func- 
tion is most closely identified with the PFC [35.183- 
185]. 

Almost all studies of WM consider only short- 
term memory, typically on the scale of a few sec- 
onds [35.186]. Indeed, one of the most significant — 
though lately controversial — results in WM research 
is the finding that only a small number of items can
be kept in mind at any one time [35.187, 188]. However,
most cognitive tasks require context-dependent
repertoires of knowledge and behaviors to be enabled 
collectively over longer periods. For example, a player 
must continually think of chess moves and strategies 
over the course of a match lasting several hours. The 
configuration of context-dependent repertoires for ex- 
tended periods has been termed long-term working 
memory [35.189]. 


35.3.3 Computational Models 
of Cognitive Control 
and Working Memory 


Several computational models have been proposed for 
cognitive control, and most of them share common 
features. The issues addressed by the models include 
action selection, reinforcement learning of appropriate 
actions, decision making in choice tasks, task sequenc- 
ing and timing, persistence and capacity in WM, task 
switching, sequence learning, and the configuration of 
context-appropriate workspaces. Most of the models 
discussed below are neural with a range of biological 
plausibility. A few important nonneural models are also 
mentioned. 

A comprehensive model using spiking neurons and 
incorporating many biological features of the BG sys- 
tem has been presented in [35.13, 193]. This model 
focuses only on the BG and explicitly on the dy- 
namics of dopamine modulation. A more abstract but 
broader model of cognitive control is the agents of the 
mind model in [35.14], which incorporates the cere- 
bellum as well as the BG. In this model, the BG 
provide the action selection function while the cerebel- 
lum acts to refine and amplify the choices. A series of 
interrelated models have been developed by O’Reilly,
Frank et al. [35.17, 179, 194-199]. All these models
use the adaptive gating function of the BG in combi- 
nation with the WM function of the prefrontal cortex 
to explain how executive function can arise without 
explicit top-down control — the so-called homuncu- 
lus [35.196, 197]. A comprehensive review of these and 
other models of cognitive control is given in [35.200]. 
Models of goal-directed action mediated by the PFC 
have also been presented in [35.201] and [35.202]. 
Reynolds and O'Reilly [35.203] have proposed
a model for configuring hierarchically organized rep- 
resentations in the PFC via reinforcement learning. 
Computational models of cognitive control and working
memory have also been used to explain mental pathologies
such as schizophrenia [35.204]. 

Fig. 35.2 The action selection and reinforcement learning substrate in the BG. Wide filled arrows indicate excitatory
projections, while wide unfilled arrows represent inhibitory projections. Linear arrows indicate generic excitatory
and inhibitory connectivity between regions. The inverted D-shaped contacts indicate modulatory dopamine connections
that are crucial to reinforcement learning. Abbreviations: SMA = supplementary motor area; SNc = substantia
nigra pars compacta; VTA = ventral tegmental area; OFC = orbitofrontal cortex; GPe = globus pallidus (external
nuclei); GPi = globus pallidus (internal nuclei); STN = subthalamic nucleus; D1 = excitatory dopamine receptors;
D2 = inhibitory dopamine receptors. The primary neurons of GPi are inhibitory and active by default, thus keeping
all motor plans in the motor and premotor cortices in check. The neurons of the striatum are also inhibitory but usually
in an inactive down state (after [35.190]). Particular subgroups of striatal neurons are activated by specific patterns of
cortical activity (after [35.136]), leading first to disinhibition of specific actions via the direct input from striatum to
GPi, and then to re-inhibition via the input through STN. Thus the system gates the triggering of actions appropriate to
current cognitive contexts in the cortex. The dopamine input from SNc projects a reward signal based on limbic system
state, allowing desirable context-action pairs to be reinforced (after [35.191, 192]), though other hypotheses also exist
(after [35.14]). The dopamine input to PFC from the VTA also signals reward and other task-related contingencies.


An important aspect of cognitive control is switching
between tasks at various time-scales [35.205, 206].
Imamizu et al. [35.207] compared two computational
models of task switching, a mixture-of-experts (MoE)
model and MOSAIC, using brain imaging. They concluded
that task switching in the PFC was more consistent
with the MoE model and that in the parietal cortex
and cerebellum with the MOSAIC model.

An influential abstract model of cognitive control
is the interactive activation model in [35.208,
209]. In this model, learned behavioral schemata contend
for activation based on task context and cognitive
state. While this model captures many phenomenological
aspects of behavior, it is not explicitly neural.
Botvinick and Plaut [35.173] present an alternative
neural model that relies on distributed neural representations
and the dynamics of recurrent neural networks
rather than explicit schemata and contention.
Dayan et al. [35.210, 211] have proposed a neural
model for implementing complex rule-based decision
making where decisions are based on sequentially unfolding
contexts. A partially neural model of behavior
based on the CLARION cognitive model has been developed
in [35.212].

Recently, Grossberg and Pearson [35.213] have presented
a comprehensive model of WM called LIST
PARSE. In this model, the term working memory 
is applied narrowly to the storage of temporally or- 
dered items, i.e., lists, rather than more broadly to 
all short-term memory. Experimentally observed ef- 
fects such as recency (better recall of late items in 
the list) and primacy (better recall of early items in 
the list) are explained by this model, which uses the 
concept of competitive queuing for sequences. This 
is based on the observation [35.101, 214] that multi- 
ple elements of a behavioral sequence are represented 
in the PFC as simultaneously active codes with acti- 
vation levels representing the temporal order. Unlike 
the WM models discussed in the previous paragraph, 
the WM in LIST PARSE is embedded within a full 
cognitive control model with action selection, trajec- 
tory generation, etc. Many other neural models for 
chains of actions have also been proposed [35.214-224].
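
The core idea of competitive queuing, in which all items of a planned sequence are active at once and the activation level encodes order, can be sketched in a few lines. This is a generic illustration of the principle, not the LIST PARSE model itself; the item names and activation values are hypothetical.

```python
def competitive_queue(activations):
    """Recall items in order of decreasing activation.

    All items are represented simultaneously; at each step the most active
    item wins the competition, is performed, and is then suppressed.
    """
    active = dict(activations)
    order = []
    while active:
        winner = max(active, key=active.get)
        order.append(winner)
        del active[winner]          # self-inhibition of the item just performed
    return order

# A primacy gradient: earlier items are stored with higher activation.
plan = {"reach": 0.9, "grasp": 0.7, "lift": 0.5, "place": 0.3}
recalled = competitive_queue(plan)
```

Serial order here is not stored as explicit links between successive items; it is read out from the activation gradient, so perturbing the gradient (for example with noise) produces order errors between neighboring items.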

Higher level cognitive control is characterized by 
the need to fuse information from multiple sensory 
modalities and memory to make complex decisions. 
This has led to the idea of a cognitive workspace. In the 
global workspace theory (GWT) developed in [35.225], 
information from various sensory, episodic, semantic, 
and motivational sources comes together in a global 
workspace that forms brief, task-specific integrated rep- 
resentations that are broadcast to all subsystems for 
use in WM. This model has been implemented com- 
putationally in the intelligent distribution agent (IDA) 
model by Franklin et al. [35.226, 227]. A neurally im- 
plemented workspace model has been developed by 
Dehaene et al. [35.172, 228, 229] to explain human subjects’
performance on effortful cognitive tasks (i.e.,
tasks that require suppression of automatic responses), 
and the basis of consciousness. The construction of cog- 
nitive workspaces is closely related to the idea of long- 
term working memory [35.189]. Unlike short-term 
working memory, there are few computational models 
for long-term working memory. Neural models seldom 
cover long periods, and implicitly assume that a chain- 
ing process through recurrent networks (e.g., [35.173]) 
can maintain internal attention. Iyer et al. [35.230, 231]
have proposed an explicitly neurodynamical model of
this function, where a stable but modulatable pattern
of activity called a graded attractor is used to
selectively bias parts of the cortex in a context-dependent
fashion. An earlier model was proposed
in [35.232] to serve a similar function in the hippocam- 
pal system. 


Another class of models focuses primarily on single
decisions within a task, and assumes an underlying
stochastic process [35.186, 233-235]. Typically, these 
models address two-choice short-term decisions made 
over a second or two [35.186]. The decision process be- 
gins with a starting point and accumulates information 
over time resulting in a diffusive (random walk) pro- 
cess. When the diffusion reaches one of two boundaries 
on either side of the starting point, the corresponding 
decision is made. This elegant approach can model such 
concrete issues as decision accuracy, decision time, and 
the distribution of decisions without any reference to 
the underlying neural mechanisms, which is both its 
chief strength and its primary weakness. Several con- 
nectionist models have also been developed based on 
paradigms similar to the diffusion approach [35.236-238].
The neural basis of such models has been discussed
in detail in [35.239]. A population-coding neural
model that makes Bayesian decisions based on cumula- 
tive evidence has been described by Beck et al. [35.98]. 
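
The diffusion account of two-choice decisions can be simulated directly. The sketch below assumes symmetric boundaries around a starting point of zero, a constant drift, and Gaussian noise integrated with a simple Euler scheme; the function names and parameter values are illustrative and are not fitted to any data.

```python
import random

def diffusion_trial(drift, boundary, rng, noise_sd=1.0, dt=0.001, max_time=10.0):
    """Accumulate noisy evidence from 0 until it reaches +boundary or -boundary."""
    x, t = 0.0, 0.0
    while abs(x) < boundary and t < max_time:
        x += drift * dt + noise_sd * (dt ** 0.5) * rng.gauss(0.0, 1.0)
        t += dt
    if x >= boundary:
        return "upper", t
    if x <= -boundary:
        return "lower", t
    return None, t                   # no decision within max_time

def simulate(n_trials, drift, boundary, seed=0):
    """Return the fraction of upper-boundary choices and the mean decision time."""
    rng = random.Random(seed)
    results = [diffusion_trial(drift, boundary, rng) for _ in range(n_trials)]
    decided = [(choice, t) for choice, t in results if choice is not None]
    p_upper = sum(1 for choice, _ in decided if choice == "upper") / len(decided)
    mean_time = sum(t for _, t in decided) / len(decided)
    return p_upper, mean_time

p_biased, time_biased = simulate(200, drift=1.0, boundary=1.0)
p_neutral, time_neutral = simulate(200, drift=0.0, boundary=1.0)
```

With a positive drift most trials end at the upper boundary, while with zero drift the two choices are roughly equally likely. Raising the boundary in this sketch trades speed for accuracy, which is how such models relate decision time to decision accuracy without reference to neural mechanisms.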

Reinforcement learning [35.176] is widely used 
in many engineering applications, but several mod- 
els go beyond purely computational use and include 
details of the underlying brain regions and neurophysi- 
ology [35.141, 240]. Excellent reviews of such models 
are provided in [35.241-243]. Recently, models have
also been proposed to show how dopamine-mediated 
learning could work with spiking neurons [35.244] and 
population codes [35.245]. 
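
A minimal temporal-difference sketch illustrates the kind of learning signal these models build on. It assumes the common reading of the dopamine signal as a reward prediction error, a hypothetical chain of five states with a reward delivered only at the end, and tabular value estimates; none of these choices is taken from a specific cited model.

```python
def td_chain(n_states=5, reward=1.0, alpha=0.1, gamma=0.9, episodes=500):
    """TD(0) value learning on a chain of states that ends in a reward."""
    values = [0.0] * (n_states + 1)      # final entry is the terminal state (value 0)
    error_history = []
    for _ in range(episodes):
        errors = []
        for s in range(n_states):
            r = reward if s == n_states - 1 else 0.0
            # Reward prediction error: outcome plus discounted prediction
            # minus the current prediction.
            delta = r + gamma * values[s + 1] - values[s]
            values[s] += alpha * delta
            errors.append(delta)
        error_history.append(errors)
    return values[:n_states], error_history

values, error_history = td_chain()
first_errors = error_history[0]
last_errors = error_history[-1]
```

In the first episode the only nonzero prediction error occurs when the reward arrives; after learning, the reward is fully predicted, the errors vanish, and the learned values decay geometrically with distance from the reward.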

Computational models that focus on working mem- 
ory per se (i.e., not on the entire problem of cognitive 
control) have mainly considered how the requirement 
of selective temporal persistence can be met by bio- 
logically plausible neural networks [35.246, 247]. Since 
working memories must bridge over temporal dura- 
tions (e.g., in remembering a cue over a delay period), 
there must be some neural mechanism to allow ac- 
tivity patterns to persist selectively in time. A natural 
candidate for this is attractor dynamics in recurrent 
neural networks [35.248, 249], where the recurrence 
allows some activity patterns to be stabilized by re- 
verberation [35.250]. The neurophysiological basis of 
such persistent activity has been studied in [35.251]. 
A central feature in many models of WM is the role 
of dopamine in the PFC [35.252-254]. In particular,
it is believed that dopamine sharpens the response of 
PFC neurons involved in WM [35.255] and allows for 
reliable storage of timing information in the presence 
of distractors [35.246]. The model in [35.246, 252] in- 
cludes several biophysical details such as the effect of 
dopamine on different ion channels and its differential 
modulation of various receptors. More abstract neural
models for WM have been proposed in [35.256]
and [35.257]. 
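
How recurrence lets an activity pattern persist can be illustrated with a small Hopfield-style network. This is a textbook abstraction chosen for brevity, not one of the biophysical models cited above: units are binary, weights are set by a Hebbian outer-product rule, and the two stored patterns are hypothetical.

```python
def train_hopfield(patterns):
    """Hebbian outer-product weights with no self-connections."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += p[i] * p[j] / n
    return weights

def settle(weights, state, sweeps=10):
    """Asynchronous updates with no external input: activity evolves on its own."""
    state = list(state)
    n = len(state)
    for _ in range(sweeps):
        for i in range(n):
            field = sum(weights[i][j] * state[j] for j in range(n))
            state[i] = 1 if field >= 0 else -1
    return state

ITEM_A = [1, 1, 1, 1, -1, -1, -1, -1]
ITEM_B = [1, -1, 1, -1, 1, -1, 1, -1]
weights = train_hopfield([ITEM_A, ITEM_B])

# A degraded cue for item A (first unit flipped) is presented once and removed.
cue = [-1, 1, 1, 1, -1, -1, -1, -1]
remembered = settle(weights, cue)
still_remembered = settle(weights, remembered)
```

The network completes the degraded cue and then holds the pattern indefinitely through reverberation in its recurrent connections, which is the property that makes attractor dynamics a natural candidate for bridging a delay period.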

An especially interesting type of attractor network
uses the so-called bump attractors — spatially local- 
ized patterns of activity stabilized by local network 
connectivity and global competition [35.258]. Such 
a network has been used in a biologically plausible 
model of WM in the PFC in [35.259], which demon- 
strates that the memory is robust against distracting
stimuli. A similar conclusion is drawn in [35.180] based
on another bump attractor model of working memory. It
shows that dopamine in the PFC can provide robustness
against distractors, but robustness against internal noise
is achieved only when dopamine in the BG locks the
state of the striatum. Recently, Mongillo et al. [35.260]
have proposed the novel hypothesis that the persistence
of neural activity in WM may be due to calcium-mediated
facilitation rather than reverberation through
recurrent connectivity.


35.4 Conclusion


This chapter has attempted to provide an overview of
neurocomputational models for cognitive control, WM,
and motor control. Given the vast body of both experimental
and computational research in these areas, the
review is necessarily incomplete, though every attempt
has been made to highlight the major issues and to
provide the reader with a rich array of references covering
the breadth of each area.


The models described in this chapter relate to 
several other mental functions including sensorimo- 
tor integration, memory, semantic cognition, etc., as 
well as to areas of engineering such as robotics and 
agent systems. However, these links are largely ex- 
cluded from the chapter — in part for brevity, but mainly 
because most of them are covered elsewhere in this 
Handbook. 


36. Cognitive Architectures and Agents


Sebastien Hélie, Ron Sun 


A cognitive architecture is the essential structures 
and processes of a domain-generic computational 
cognitive model used for a broad, multiple-level, 
multiple domain analysis of cognition and behav- 
ior. This chapter reviews some of the most popular 
psychologically-oriented cognitive architectures, 
namely adaptive control of thought-rational 
(ACT-R), Soar, and CLARION. For each cognitive ar- 
chitecture, an overview of the model, some key 
equations, and a detailed simulation example 
are presented. The example simulation with ACT-R
is the initial learning of the past tense of
irregular verbs in English (developmental psy- 
chology), the example simulation with Soar is the 
well-known missionaries and cannibals problem 
(problem solving), and the example simulation 
with CLARION is a complex mine field navigation 
task (autonomous learning). This presentation is 
followed by a discussion of how cognitive ar- 
chitectures can be used in multi-agent social 
simulations. A detailed cognitive social simula- 
tion with CLARION is presented to reproduce results 
from organizational decision-making. The chapter 
concludes with a discussion of the impact of neural 
network modeling on cognitive architectures and 
a comparison of the different models. 


36.1 Background


Cognitive theories are often underdetermined by 
data [36.1]. As such, different theories, with very little 
in common, can sometimes be used to account for the 
very same phenomena [36.2]. This problem can be re- 
solved by adding constraints to cognitive theories. The 
most intuitive approach to adding constraints to any 
scientific theory is to collect more data. While experi- 
mental psychologists have come a long way toward this 
goal in over a century of psychology research, the gap
between empirical and theoretical progress is still sig- 
nificant. 

Another tactic that can be adopted toward constrain- 
ing psychological theories is unification [36.1]. Newell 
argued that more data could be used to constrain a theory
if the theory was designed to explain a wider range
of phenomena. In particular, these unified (i.e., integrative)
cognitive theories could be put to the test
against well-known (stable) regularities that have been 
observed in psychology. So far, these integrative theo- 
ries have taken the form of cognitive architectures, and 
some of them have been very successful in explaining 
a wide range of data. 

A cognitive architecture is the essential struc- 
tures and processes of a domain-generic computa- 
tional cognitive model used for a broad, multiple- 
level, multiple-domain analysis of cognition and be- 
havior [36.3]. Specifically, cognitive architectures deal 
with componential processes of cognition in a struc- 
turally and mechanistically well-defined way. Their 
function is to provide an essential framework to fa- 
cilitate more detailed exploration and understanding 
of various components and processes of the mind. In 
this way, a cognitive architecture serves as an initial 
set of assumptions to be used for further development. 
These assumptions may be based on available empirical 
data (e.g., psychological or biological), philosophical 
thoughts and arguments, or computationally inspired 
hypotheses concerning psychological processes. A cog- 
nitive architecture is useful and important precisely 
because it provides a comprehensive initial frame- 
work for further modeling and simulation in many task 
domains. 

While there are all kinds of cognitive architectures 
in existence, in this chapter we are concerned specif- 
ically with psychologically-oriented cognitive archi- 
tectures (as opposed to software engineering-oriented 
cognitive architectures, e.g., LIDA [36.4], or neurally- 
oriented cognitive architectures, e.g., ART [36.5]). 
Psychologically-oriented cognitive architectures are 
particularly important because they shed new light on 
human cognition and, therefore, they are useful tools 
for advancing the understanding of cognition. In un- 
derstanding cognitive phenomena, the use of computa- 
tional simulation on the basis of cognitive architectures 
forces one to think in terms of processes, and in terms 
of details. Instead of using vague, purely conceptual 
theories, cognitive architectures force theoreticians to 
think clearly. They are, therefore, critical tools in the 
study of the mind. Cognitive psychologists who use 
cognitive architectures must specify a cognitive mecha- 
nism in sufficient detail to allow the resulting models 
to be implemented on computers and run as simula- 
tions. This approach requires that important elements 
of the models be spelled out explicitly, thus aiding in 
developing better, conceptually clearer theories. It is 
certainly true that more specialized, narrowly-scoped 
models may also serve this purpose, but they are not
as generic and as comprehensive and thus may not be
as useful to the goal of producing general intelligence. 

It is also worth noting that psychologically-oriented 
cognitive architectures are the antithesis of expert sys- 
tems. Instead of focusing on capturing performance 
in narrow domains, they aim to provide broad
coverage of a wide variety of domains in a way that 
mimics human performance [36.6]. While they may 
not always perform as well as expert systems, busi- 
ness and industrial applications of intelligent systems 
increasingly require broadly-scoped systems that are 
capable of a wide range of intelligent behaviors, not 
just isolated systems of narrow functionalities. For ex- 
ample, one application may require the inclusion of 
capabilities for raw image processing, pattern recog- 
nition, categorization, reasoning, decision-making, and 
natural language communications. It may even require 
planning, control of robotic devices, and interactions 
with other systems and devices. Such requirements ac- 
centuate the importance of research on broadly-scoped 
cognitive architectures that perform a wide range of 
cognitive functionalities across a variety of task do- 
mains (as opposed to more specialized systems). 

In order to achieve general computational intel- 
ligence in a psychologically-realistic way, cognitive 
architectures should include only minimal initial struc- 
tures and independently learn from their own experi- 
ences. Autonomous learning is an important way of 
developing additional structure, bootstrapping all the 
way to a full-fledged cognitive model. In so doing,
it is important to devise only the minimal
initial learning capabilities needed for bootstrapping,
in accordance with whatever phenomenon
is modeled [36.7]. This can be accomplished through
environmental cues, structures, and regularities. The 
avoidance of overly complicated initial structures, and 
thus the inevitable use of autonomous learning, may of- 
ten help to avoid overly representational models that are 
designed specifically for the task to be achieved [36.3]. 
Autonomous learning is thus essential in achieving gen- 
erality in a psychologically-realistic way. 


36.1.1 Outline 


The remainder of this chapter is organized as follows. 
The next three sections review some of the most popu- 
lar cognitive architectures that are used in psychology 
and cognitive science. Specifically, Sect. 36.2 reviews 
ACT-R, Sect. 36.3 reviews Soar, and Sect. 36.4 re- 
views CLARION. Each of these sections includes an 
overview of the architecture, some key equations, and
a detailed simulation example. Following this presentation,
Sect. 36.5 discusses how cognitive architectures
can be used in multi-agent cognitive social simulations
and presents a detailed example with CLARION.
Finally, Sect. 36.6 presents a general discussion and
compares the models reviewed.


36.2 Adaptive Control of Thought-Rational (ACT-R) 


ACT-R is one of the oldest and most successful cog- 
nitive architectures. It has been used to simulate and 
implement many cognitive tasks and applications, such 
as the Tower of Hanoi, game playing, aircraft control,
and human-computer interaction [36.8]. ACT-R
is based on three key ideas [36.9]: 


a) Rational analysis 

b) The distinction between procedural and declarative 
memories 

c) A modular structure linked with communication 
buffers (Fig. 36.1). 


According to the rational analysis of cognition (first 
key idea; [36.10]), the cognitive architecture is opti- 
mally tuned to its environment (within its computa- 
tional limits). Hence, the functioning of the architecture 
can be understood by investigating how optimal behavior
in a particular environment would be implemented.
According to Anderson, such optimal adaptation is
achieved through evolution [36.10].

The second key idea, the distinction between declar- 
ative and procedural memories, is implemented by 
having different modules in ACT-R, each with its own 
representational format and learning rule. Briefly, pro- 
cedural memory is represented by production rules 
(similar to a production system) that can act on the en- 
vironment through the initiation of motor actions. In 
contrast, declarative memory is passive and uses chunks 
to represent world knowledge that can be accessed by 
the procedural memory but does not interact directly 
with the environment through motor actions. 

The last key idea in ACT-R is modularity. As seen in
Fig. 36.1, procedural memory (i.e., the production system)
cannot directly access information from the other
modules: the information has to go through dedicated
buffers. Each buffer can hold a single chunk at any
time. Hence, buffers serve as information processing
bottlenecks in ACT-R. This restricts the amount of information
available to the production system, which in
turn limits the processing that can be done by this module
at any given time. Processing within each module
is encapsulated. Hence, all the modules can operate in
parallel without much interference. The following subsections
describe the different ACT-R modules in more
detail.


Fig. 36.1 General architecture of ACT-R. DLPFC = dorsolateral
prefrontal cortex; VLPFC = ventrolateral prefrontal cortex
(after [36.11], by courtesy of the American Psychological
Association)


36.2.1 The Perceptual-Motor Modules 


The perceptual-motor modules in ACT-R include a de- 
tailed representation of the output of perceptual sys- 
tems, and the input of motor systems [36.8]. The visual 
module is divided into the well-established ventral 
(what) and dorsal (where) visual streams in the primate 
brain. The main function of the dorsal stream is to find 
the location of features in a display (e.g., red-colored 
items, curved-shaped objects) without identifying the 
objects. The output from this module is a location 
chunk, which can be sent back to the central production 
system. The function of the ventral stream is to iden- 
tify the object at a particular location. For instance, the 
central production system could send a request to the 
dorsal stream to find a red object in a display. The dor- 
sal stream would search the display and return a chunk 
representing the location of a red object. If the central 
production system needs to know the identity of that 
object, the location chunk would be sent to the ven- 
tral stream. A chunk containing the object identity (e.g., 
a fire engine) would then be returned to the production 
system. 


36.2.2 The Goal Module 


The goal module serves as the context for keeping track
of cognitive operations and supplements environmental
stimulation [36.8]. For instance, one can do many different
operations with a pen picked up from a desk (e.g.,
write a note, store it in a drawer, etc.). What operation 
is selected depends primarily on the goal that needs to 
be achieved. If the current goal is to clean the desk, 
the appropriate action is to store the pen in a drawer. 
If the current goal is to write a note, putting the pen in 
a drawer is not a useful action. 

In addition to providing a mental context to select 
appropriate production rules, the goal module can be 
used in more complex problem-solving tasks that need 
subgoaling [36.8]. For instance, if the goal is to play 
a game of tennis, one first needs to find an opponent. 
The goal module must create this subgoal, which needs
to be achieved before moving back to the original goal
(i.e., playing a game of tennis). Note that goals are
centralized in a unique module in ACT-R and that production
rules only have access to the goal buffer. The
current goal to be achieved is the one in the buffer, while
later goals stored in the goal module are not accessible
to the production rules. Hence, the play a game of tennis goal
is not accessible to the production rules while the find
an opponent subgoal is being pursued.


36.2.3 The Declarative Module 


The declarative memory module contains knowledge 
about the world in the form of chunks [36.8]. Each 
chunk represents a piece of knowledge or a concept 
(e.g., fireman, bank, etc.). Chunks can be accessed 
effortfully by the central production system, and the 
probability of retrieving a chunk depends on the chunk 
activation

$$P_i = \left[ 1 + e^{-\left( B_i + \sum_j W_j S_{ji} - \tau \right)/\varepsilon} \right]^{-1}, \qquad (36.1)$$

where $P_i$ is the probability of retrieving chunk $i$, $B_i$ is
the base-level activation of chunk $i$, $S_{ji}$ is the association
strength between chunks $j$ and $i$, $W_j$ is the amount of
attention devoted to chunk $j$, $\tau$ is the activation threshold,
and $\varepsilon$ is a noise parameter. It is important to note
that the knowledge chunks in the declarative module are 
passive and do not do anything on their own. The func- 
tion of this module is to store information so that it can 
be retrieved by the central production system (which 
corresponds to procedural memory). Only in the cen- 
tral production system can the knowledge be used for 
further reasoning or to produce actions. 
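
As an illustration of (36.1), the following minimal Python sketch computes the retrieval probability of a single chunk from its base-level activation and its associations with attended chunks. The function name, the argument names, and the use of plain lists are assumptions made for this example; they are not part of the ACT-R software.

```python
import math


def retrieval_probability(base_level, attention, association, threshold, noise):
    """Probability of retrieving chunk i, following the form of (36.1).

    base_level:  B_i, base-level activation of chunk i
    attention:   list of W_j, attention devoted to each source chunk j
    association: list of S_ji, association strength between chunks j and i
    threshold:   tau, the activation threshold
    noise:       epsilon, the noise parameter (must be positive)
    """
    # Total activation: base level plus attention-weighted associations.
    activation = base_level + sum(w * s for w, s in zip(attention, association))
    # Logistic function of the distance between activation and threshold.
    return 1.0 / (1.0 + math.exp(-(activation - threshold) / noise))
```

When the total activation equals the threshold, the sketch returns a probability of 0.5; activations above the threshold push the probability toward 1, and a smaller noise parameter makes that transition sharper.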


36.2.4 The Procedural Memory 


The procedural memory is captured by the central pro- 
duction rule module and fills the role of a central 
executive processor. It contains a set of rules in which 
the conditions can be matched by the chunks in all the 
peripheral buffers and the output is a chunk that can 
be placed in one of the buffers. The production rules 
are chained serially, and each rule application takes 
a fixed amount of psychological time. The serial nature 
of central processing constitutes another information 
processing bottleneck in ACT-R. 

Because only one production rule can be fired at any
given time, its selection is crucial. In ACT-R, each production
rule has a utility value that depends on: a) its
probability of achieving the current goal, b) the value
(importance) of the goal, and c) the cost of using the
production rule [36.8]. Specifically,

$$U_i = P_i G - C_i, \qquad (36.2)$$

where $U_i$ is the utility of production rule $i$, $P_i$ is the
(estimated) probability that selecting rule $i$ will achieve
the goal, $G$ is the value of the current goal, and $C_i$
is the (estimated) cost of rule $i$. The most useful rule
is always selected in every processing cycle, but the 
utility values can be noisy, which can result in the se- 
lection of suboptimal rules [36.12]. Rule utilities are 
learned online by counting the number of times that ap- 
plying a rule has achieved the goal. Anderson [36.10] 
has shown that the selection according to these counts 
is optimal in the Bayesian sense. Also, production rules 
can be made more efficient by using a process called 
production compilation [36.12, 13]. Briefly, if two production
rules are often fired in succession and the result
is positive, a new production rule is created which di- 
rectly links the conditions from the first production rule 
to the action following the application of the second 
production rule. Hence, the processing time is cut in 
half by applying only one rule instead of two. 
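
The utility-based selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rule names, the dictionary layout, and the use of Gaussian noise are hypothetical choices for this example (the text only states that utilities can be noisy), and the sketch does not reproduce the ACT-R implementation.

```python
import random


def utility(p_success, goal_value, cost):
    """U_i = P_i * G - C_i, as in (36.2)."""
    return p_success * goal_value - cost


def select_rule(rules, goal_value, noise_sd=0.0, rng=random):
    """Return the name of the rule with the highest (noisy) utility.

    rules maps a rule name to a (P_i, C_i) pair. The Gaussian noise term
    is an assumption of this sketch, not a detail taken from the chapter.
    """
    def noisy_utility(name):
        p_success, cost = rules[name]
        return utility(p_success, goal_value, cost) + rng.gauss(0.0, noise_sd)

    return max(rules, key=noisy_utility)
```

With `noise_sd=0.0` the rule with the largest utility always wins; a positive `noise_sd` occasionally lets a lower-utility rule be selected, which is how suboptimal rule choices can arise.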


36.2.5 Simulation Example: 
Learning Past Tenses of Verbs 


General computational intelligence requires the ab- 
straction of regularities to form new rules (e.g., rule 
induction). A well-known example in psychology is 
children learning English past tenses of verbs [36.13]. 
This classical result shows that children’s accuracy in 
producing the past tense of irregular verbs follows a U- 
shaped curve [36.14]. Early in learning, children have 
a separate memory representation for the past tense 
of each verb (and no conjugation rule). Hence, the 
past tenses of irregular verbs are used mostly correctly 
(Phase 1). After moderate training, children notice that 
most verbs can be converted to their past tense by 
adding the suffix -ed. This leads to the formulation of 
a default rule (e.g., to find the past tense of a verb, add 
the suffix -ed to the verb stem). This rule is a useful 
heuristic and works for all regular verbs in English. 
However, children tend to overgeneralize and (incor- 
rectly) apply the rule to irregular verbs. This leads 
to errors and the low point of the U-shaped curve 
(Phase 2). Finally, children learn that there are excep- 
tions to the default rule and memorize the past tense of 
irregular verbs. Performance improves again (Phase 3). 


In ACT-R, the early phase of training uses instance-based
retrieval (i.e., retrieving the chunk representing
each verb’s past tense using (36.1)). The focus of the
presentation is on the induction of the default rule, 
which is overgeneralized in Phase 2 and correctly ap- 
plied in Phase 3. This is accomplished by joining two 
production rules. First, consider the following memory 
retrieval rule used in Phase 1 [36.12]: 


1. Retrieve-past-tense: 
IF the goal is to find the past tense of a word w: 
THEN issue a request to declarative memory for the 
past tense of w. 


If a perfect match is retrieved from declarative 
memory, a second rule is used to produce the (probably 
correct) response. However, if Rule 1 fails to retrieve 
a perfect match, the verb past tense is unknown and an 
analogy rule is used instead [36.12]: 


2. Analogy-find-pattern: 
IF the goal is to find the past tense of word w1; 
AND the retrieval buffer contains past tense w2- 
suffix of w2: 
THEN set the answer to w1-(w2-suffix). 


This rule produces a form of generalization using 
an analogy. Because Rule 2 always follows Rule 1, they 
can be combined using production compilation [36.12]. 
Also, w2 is likely to be a regular verb, so w2-suffix 
is likely to be -ed. Hence, combining Rules 1 and 2 
yields [36.12]: 


3. Learned-rule: 
IF the goal is to find the past tense of word w: 
THEN set the answer to w-ed 


which is the default rule that can be used to accurately
find the past tense of regular verbs. The U-shaped
curve representing the performance of children learning
irregular verbs can thus be explained with ACT-R
as follows [36.13]: In Phase 1, Rule 3 does not exist
and Rule 1 is applied to correctly conjugate irregular
verbs. In Phase 2, Rule 3 is learned and has
proven useful with regular verbs (thus increasing $P_i$
in (36.2)). Hence, it is often selected to incorrectly
conjugate irregular verbs. In Phase 3, the irregular
verbs become more familiar as more instances have
been encountered. This increases their base-level activation
in declarative memory ($B_i$ in (36.1)), which
facilitates retrieval and increases the likelihood that
Rule 1 is selected to correctly conjugate the irregular
verbs. More details about this simulation can be found
in [36.13].


36.3 Soar


Soar was the original unified theory of cognition pro- 
posed by Newell [36.1]. Soar has been used success- 
fully in many problem solving tasks such as Eight 
Puzzle, the Tower of Hanoi, Fifteen Puzzle, Think- 
a-Dot, and Rubik’s Cube. In addition, Soar has been 
used for many military applications such as train- 
ing models for human pilots and mission rehearsal 
exercises. According to the Soar theory of intelli- 
gence [36.15], human intelligence is an approxima- 
tion of a knowledge system [36.9]. Hence, the most 
important aspect of intelligence (natural or artifi- 
cial) is the use of all available knowledge [36.16], 
and failures of intelligence are failures of knowl- 
edge [36.17]. 

All intelligent behaviors can be understood in terms 
of problem solving in Soar [36.9]. As such, Soar is 
implemented as a set of problem-space computational 
models (PSCM) that partition the knowledge in goal-relevant
ways [36.16]. Each PSCM implicitly contains
the representation of a problem space defined by a set 
of states and a set of operators that can be visual- 
ized using a decision tree [36.17]. In a decision tree 
representation, the nodes represent the states, and one 
moves around from state to state using operators (the 
branches/connections in the decision tree). The objec- 
tive of a Soar agent is to move from an initial state to 
one of the goal states, and the best operator is always 
selected at every time step [36.16]. If the knowledge 
in the model is insufficient to select a single best op- 
erator at a particular time step, an impasse is reached, 
and a new goal is created to resolve the impasse. This 
new goal defines its own problem space and set of 
operators. 


36.3.1 Architectural Representation 


The general architecture of Soar is shown in Fig. 36.2. 
The main structures are a working memory and a long- 
term memory. Working memory is a blackboard where 
all the relevant information for the current decision cy- 
cle is stored [36.17]. It contains a goal representation, 
perceptual information, and relevant knowledge that 
can be used as conditions to fire rules. The outcome 
of rule firing can also be added to the working mem- 
ory to cause more rules to fire. The long-term memory 
contains associative rules representing the knowledge 
in the system (in the form of IF → THEN rules). The
rules in long-term memory can be grouped/organized to 
form operators. 


36.3.2 The Soar Decision Cycle 


In every time step, Soar goes through a six-step deci- 
sion cycle [36.17]. The first step in Soar is to receive an 
input from the environment. This input is inserted into 
working memory. The second step is called the elabo- 
ration phase. During this phase, all the rules matching 
the content of working memory fire in parallel, and the 
result is put into working memory. This, in turn, can 
create a new round of parallel rule firing. The elabora- 
tion phase ends when the content of working memory is 
stable, and no new knowledge can be added to working
memory by firing rules.

The third step is the proposal of operators that are 
applicable to the content in working memory. If no op- 
erator is applicable to the content of working memory, 
an impasse is reached. Otherwise, the potential opera- 
tors are evaluated and ordered according to a symbolic 
preference metric. The fourth step is the selection of 
a single operator. If the knowledge does not allow for 
the selection of a single operator, an impasse is reached. 
The fifth step is to apply the operator. If the operator
does not result in a change of state, an impasse
is reached. Finally, the sixth step is the output of the
model, which can be an external (e.g., motor) or an in- 
ternal (e.g., more reasoning) action. 
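
The elaboration, proposal, selection, and application steps can be sketched as follows. This is a toy illustration, not the Soar implementation: working memory is assumed to be a set of strings, rules are assumed to be plain functions, numbers stand in for Soar's symbolic preferences, and the names `elaborate` and `decide` are hypothetical.

```python
def elaborate(working_memory, rules):
    """Fire all matching rules in parallel rounds until working memory is stable."""
    wm = set(working_memory)
    while True:
        new_facts = set()
        for rule in rules:
            # Each rule maps the current working memory to a set of facts to add.
            new_facts |= rule(wm)
        if new_facts <= wm:
            return wm  # quiescence: no rule adds anything new
        wm |= new_facts


def decide(wm, operators, preference):
    """Propose, select, and apply an operator, reporting impasses.

    operators:  dict name -> (applicable(wm) -> bool, apply(wm) -> set)
    preference: dict name -> number (a stand-in for symbolic preferences)
    """
    proposed = [name for name, (applicable, _) in operators.items() if applicable(wm)]
    if not proposed:
        return "impasse: no operator", wm
    best = max(preference.get(name, 0) for name in proposed)
    winners = [name for name in proposed if preference.get(name, 0) == best]
    if len(winners) > 1:
        return "impasse: tie", wm
    new_wm = operators[winners[0]][1](wm)
    if new_wm == wm:
        return "impasse: no change", wm
    return winners[0], new_wm
```

In this sketch each impasse is only reported as a label; in Soar, an impasse would instead create a new goal with its own problem space.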


36.3.3 Impasses


When the immediate knowledge is insufficient to reach
a goal, an impasse is reached and a new goal is created
to resolve the impasse. Note that this subgoal produces
its own problem space with its own set of states and
operators. If the subgoal reaches an impasse, another
subgoal is recursively created to resolve the second impasse,
and so on.

Fig. 36.2 The general architecture of Soar. The subdivision
of long-term memory is a new addition in Soar 9

There are four main types of impasses
in Soar [36.17]:


1. No operator is available in the current state.
New goal: Find an operator that is applicable in the
current state.

2. An operator is applicable in the current state, but its
application does not change the current state.
New goal: Modify the operator so that its application
changes the current state. Alternatively, the
operator can be modified so that it is no longer
deemed applicable in the current state.

3. Two or more operators are applicable in the current
state, but none of them is preferred according
to the symbolic metric.
New goal: Further evaluate the options and make
one of the operators preferred to the others.

4. More than one operator is applicable, and there
is knowledge in working memory favoring two or
more operators in the current state.
New goal: Resolve the conflict by removing from
working memory one of the contradictory preferences.


Regardless of which type of impasse is reached, re- 
solving an impasse is an opportunity for learning in 
Soar [36.17]. Each time a new result is produced while 
achieving a subgoal, a new rule associating the current 
state with the new result is added in the long-term mem- 
ory to ensure that the same impasse will not be reached 
again in the future. This new rule is called a chunk to 
distinguish it from rules that were precoded by the mod- 
eler (and learning is called chunking). 


36.3.4 Extensions 


Unlike ACT-R (and CLARION, as described next), 
Soar was originally designed as an artificial intelligence 
model [36.16]. Hence, initially, more attention was paid 
to functionality and performance than to psychological 
realism. However, Soar has been used in psychology 
and Soar 9 has been extended to increase its psycho- 
logical realism and functionality [36.18]. This version 
of the architecture is illustrated in Fig. 36.2. First, the 
long-term memory has been further subdivided in correspondence
with psychological theory [36.19]. The associative
rules are now part of the procedural memory.
In addition to the procedural memory, the long-term
memory now also includes a semantic and an episodic
memory. The semantic memory contains knowledge
structures representing factual knowledge about the
world (e.g., the earth is round), while the episodic
memory contains a snapshot of the working memory
representing an episode (e.g., Fido the dog is now sitting
in front of me).

At the subsymbolic level, Soar 9 includes activa- 
tions in working memory to capture recency/usefulness 
(as in ACT-R). In addition, Soar 9 uses non-symbolic 
(numerical) values to model operator preferences. 
These are akin to utility functions and are used when
symbolic preferences are insufficient to select a single operator
[36.20]. When numerical preferences are used,
an operator is selected using a Boltzmann distribution

$$P(O_i) = \frac{e^{S(O_i)/\tau}}{\sum_j e^{S(O_j)/\tau}}, \qquad (36.3)$$

where $P(O_i)$ is the probability of selecting operator $i$,
$S(O_i)$ is the summed support (preference) for operator
$i$, and $\tau$ is a randomness parameter. Numerical operator
preferences can be learned using reinforcement learn- 
ing. 
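
The Boltzmann selection in (36.3) can be illustrated with a short Python sketch. The function names and the dictionary representation of operator support are assumptions made for this example and are not taken from the Soar code base.

```python
import math
import random


def boltzmann_probabilities(support, temperature):
    """Selection probabilities in the form of (36.3).

    support:     dict operator -> S(O), the summed numerical support
    temperature: the randomness parameter tau (must be positive)
    """
    # Subtracting the maximum keeps exp() numerically stable and does not
    # change the resulting probabilities.
    top = max(support.values())
    weights = {op: math.exp((s - top) / temperature) for op, s in support.items()}
    total = sum(weights.values())
    return {op: w / total for op, w in weights.items()}


def select_operator(support, temperature, rng=random):
    """Draw one operator at random according to its Boltzmann probability."""
    probs = boltzmann_probabilities(support, temperature)
    ops = list(probs)
    return rng.choices(ops, weights=[probs[op] for op in ops], k=1)[0]
```

A large randomness parameter makes all operators nearly equally likely, whereas a small one makes the choice close to always picking the operator with the highest support.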

Finally, recent work has been initiated to add a clus- 
tering algorithm that would allow for the creation of 
new symbolic structures and a visual imagery module 
to facilitate symbolic spatial reasoning (not shown in 
Fig. 36.2). Also, the inclusion of emotions is now being 
considered (via appraisal theory [36.21]). 


36.3.5 Simulation Example: 
Learning in Problem Solving 


Nason and Laird [36.20] proposed a variation of Soar 
that includes a reinforcement learning algorithm (fol- 
lowing the precedents of CLARION and ACT-R) to 
learn numerical preference values for the operators. In 
this implementation (called Soar-RL), the preferences 
are replaced by Q-values [36.22] that are learned using 
environmental feedback. After the Q-value of each rele- 
vant operator has been calculated (i. e., all the operators 
available in working memory), an operator is stochasti- 
cally selected (as in (36.3)). 
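
To make the idea of learning Q-values from environmental feedback concrete, the sketch below shows a generic one-step Q-learning update. It is an assumption-laden illustration: the text does not give the exact update rule used in Soar-RL, and the function name, the learning rate, and the discount factor are hypothetical choices for this example.

```python
def q_update(q, state, operator, reward, next_state, next_operators,
             learning_rate=0.1, discount=0.9):
    """Apply a one-step Q-learning update to the (state, operator) pair.

    q is a dict keyed by (state, operator); missing entries count as 0.0.
    Returns the updated Q-value.
    """
    # Value of the best operator available in the next state (0.0 if none).
    best_next = max((q.get((next_state, op), 0.0) for op in next_operators),
                    default=0.0)
    old = q.get((state, operator), 0.0)
    # Move the old estimate toward the reward plus discounted future value.
    q[(state, operator)] = old + learning_rate * (reward + discount * best_next - old)
    return q[(state, operator)]
```

The updated values could then serve as the numerical support of each operator when an operator is selected stochastically.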

Soar-RL has been used to simulate the missionaries 
and cannibals problem. The goal in this problem is to 
transport three missionaries and three cannibals across 
a river using a boat that can carry at most two persons 
at a time. Several trips are required, but the cannibals
must never outnumber the missionaries on either riverbank.
This problem has been used as a benchmark in
problem-solving research because, if the desirability of
a move is evaluated in terms of the number of people
that have crossed the river (which is a common assumption),
a step backward must be taken midway in solving
the problem (i.e., a move that reduces the number of
people that crossed the river must be selected).

In the Soar-RL simulation, the states were defined
by the number of missionaries and cannibals on each
side of the river and the location of the boat. The operators
were boat trips transporting people, and the
Q-values of the operators were randomly initialized.
Also, to emphasize the role of reinforcement learning
in solving this problem, chunking was disengaged.
Hence, the only form of adaptation was the adjustment
of the operator Q-values. Success states (i.e., all people
crossed the river) were rewarded, failure states (i.e.,
cannibals outnumbering missionaries on a riverbank)
were punished, and all other states received neutral
reinforcement.

Using this simulation methodology, Soar-RL generally
learned to solve the missionaries and cannibals
problem. Most errors resulted from the stochastic decision
process [36.20]. Nason and Laird also showed
that the model performance can be improved fivefold
by adding a symbolic preference preventing an operator
at time $t$ from undoing the result of the application
of an operator at time $t - 1$. More details on this simulation
can be found in [36.20].


36.4 CLARION


CLARION is an integrative cognitive architecture
consisting of a number of distinct subsystems with
a dual representational structure in each subsystem (implicit
versus explicit representations). CLARION is the
newest of the reviewed architectures, but it has already
been successfully applied to several tasks such
as navigation in mazes and mine fields, human reasoning,
creative problem solving, and cognitive social
simulations. CLARION is based on the following basic
principles [36.24]. First, humans can learn with or
without much a priori specific knowledge to begin with,
and humans learn continuously from on-going experience
in the world. Second, there are different types
of knowledge involved in human learning (e.g., procedural
vs. declarative, implicit vs. explicit; [36.24]),
and different types of learning processes are involved
in acquiring different types of knowledge. Third, motivational
processes as well as meta-cognitive processes
are important and should be incorporated in a psychologically
realistic cognitive architecture. According to
CLARION, all three principles are required to achieve
general computational intelligence. An overview of the
architecture is shown in Fig. 36.3.

Fig. 36.3 The CLARION architecture. ACS stands for the action-centered
subsystem, NACS the non-action-centered subsystem, MS
the motivational subsystem, and MCS the meta-cognitive subsystem
(after [36.23])

The CLARION subsystems include the action-cen- 
tered subsystem (ACS), the non-action-centered subsys- 
tem (NACS), the motivational subsystem (MS), and the 
meta-cognitive subsystem (MCS). The role of the ACS 
is to control actions, regardless of whether the actions 
are for external physical movements or for internal men- 
tal operations. The role of the NACS is to maintain gen- 
eral knowledge. The role of the MS is to provide under- 
lying motivations for perception, action, and cognition, 
in terms of providing impetus and feedback (e.g., indi- 
cating whether or not outcomes are satisfactory). The 
role of the MCS is to monitor, direct, and dynamically
modify the operations of the other subsystems.

Each of these interacting subsystems consists of
two levels of representations. In each subsystem, the
top level encodes explicit (e.g., verbalizable) knowledge
and the bottom level encodes implicit (e.g., non-verbalizable)
knowledge. The two levels interact, for
example, by combining their respective action recommendations
and by cooperating in learning through
bottom-up and top-down processes. Essentially, it is
a dual-process theory of mind [36.24]. 


36.4.1 The Action-Centered Subsystem 


The ACS is composed of a top and a bottom level. 
The bottom level of the ACS is modular. A number 
of neural networks co-exist, each of which is adapted 
to a specific modality, task, or group of input stimuli.
These modules can be developed through interaction
with the world (computationally, through various decomposition
methods [36.25]). However, some of them
are formed evolutionarily, reflecting hardwired instincts 
and propensities [36.26]. Because of these networks, 
CLARION is able to handle very complex situations 
that are not amenable to simple rules. 

In the top level of the ACS, explicit symbolic con- 
ceptual knowledge is captured in the form of explicit 
symbolic rules (for details, see [36.24]). There are many 
ways in which explicit knowledge may be learned, in- 
cluding independent hypothesis testing and bottom-up 
learning. The basic process of bottom-up learning is as 
follows: if an action implicitly decided by the bottom 
level is successful, then the model extracts an explicit 
rule that corresponds to the action selected by the bot- 
tom level and adds the rule to the top level. Then, 
in subsequent interactions with the world, the model 
verifies the extracted rule by considering the outcome 
of applying the rule. If the outcome is not successful, 
then the rule should be made more specific; if the out- 
come is successful, the agent may try to generalize the 
rule to make it more universal [36.27]. After explicit 
rules have been learned, a variety of explicit reason- 
ing methods may be used. Learning explicit conceptual 
representations at the top level can also be useful in 
enhancing learning of implicit reactive routines at the 
bottom level [36.7]. The action-decision cycle in the 
ACS can be described by the following steps: 


1. Observe the current state of the environment 
2. Compute the value of each possible action in the 
current state in the bottom level 


3. Compute the value of each possible action in the 
current state in the top level 

4. Choose an appropriate action by stochastically se- 
lecting or combining the values in the top and 
bottom levels 

5. Perform the selected action 

6. Update the top and bottom levels according to the 
received feedback (if any) 

7. Go back to Step 1. 
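
The action-decision cycle above can be sketched in Python as follows. This is a deliberately simplified illustration under several assumptions: action values are stored in plain dictionaries rather than neural networks, the two levels are combined by a fixed weighted sum followed by Boltzmann selection, and explicit rules are simple state-to-action entries. In particular, the sketch removes a rule after a failure instead of specializing it, and it omits generalization. All names are hypothetical and do not come from the CLARION software.

```python
import math
import random


def choose_action(bottom_values, top_values, top_weight=0.5,
                  temperature=0.1, rng=random):
    """Steps 2-4: combine the action values of both levels, then pick stochastically."""
    actions = sorted(set(bottom_values) | set(top_values))
    combined = {
        a: (1 - top_weight) * bottom_values.get(a, 0.0)
           + top_weight * top_values.get(a, 0.0)
        for a in actions
    }
    # Boltzmann weights; subtracting the maximum keeps exp() stable.
    top = max(combined.values())
    weights = [math.exp((combined[a] - top) / temperature) for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


def bottom_up_learning(rules, state, action, success):
    """Step 6 (top level only): extract a rule after success, drop it after failure."""
    if success:
        rules[state] = action
    elif rules.get(state) == action:
        del rules[state]
    return rules
```

A full cycle would observe the state, call `choose_action`, perform the action, and then pass the outcome to `bottom_up_learning` (and to whatever learning rule updates the bottom level) before returning to Step 1.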


36.4.2 The Non-Action-Centered Subsystem 


The NACS may be used for representing general knowl- 
edge about the world (i.e., the semantic memory and 
the episodic memory), and for performing various kinds 
of memory retrievals and inferences. The NACS is also 
composed of two levels (a top and a bottom level) and 
is under the control of the ACS (through its actions). 

At the bottom level, associative memory networks 
encode non-action-centered implicit knowledge. Asso- 
ciations are formed by mapping an input to an output 
(such as mapping 2 + 3 to 5). Backpropagation [36.7, 
28] or Hebbian [36.29] learning algorithms can be used 
to establish such associations between pairs of inputs 
and outputs. At the top level of the NACS, a general 
knowledge store encodes explicit non-action-centered 
knowledge [36.29, 30]. In this network, chunks (passive 
knowledge structures, similar to ACT-R) are specified 
through dimensional values (features). A node is set up 
in the top level to represent a chunk. The chunk node 
connects to its corresponding features represented as in- 
dividual nodes in the bottom level of the NACS [36.29, 
30]. Additionally, links between chunk nodes encode 
explicit associations between pairs of chunks, known 
as associative rules. Explicit associative rules may be 
learned in a variety of ways [36.24]. 

During reasoning, in addition to applying associa- 
tive rules, similarity-based reasoning may be employed 
in the NACS. Specifically, a known (given or inferred) 
chunk may be automatically compared with another 
chunk. If the similarity between them is sufficiently 
high, then the latter chunk is inferred. The similarity 
between chunks $i$ and $j$ is computed by using

$$S_{c_i \sim c_j} = \frac{n_{c_i \cap c_j}}{f\left(n_{c_j}\right)}, \qquad (36.4)$$

where $S_{c_i \sim c_j}$ is the similarity from chunk $i$ to chunk $j$, $n_{c_i \cap c_j}$
counts the number of features shared by chunks $i$ and $j$
(i.e., the feature overlap), $n_{c_j}$ counts the total number
of features in chunk $j$, and $f(x)$ is a slightly superlinear,
monotonically increasing, positive function (by
default, $f(x) = x^{1.1}$). Thus, similarity-based reasoning
in CLARION is naturally accomplished using (1) top- 
down activation by chunk nodes of their corresponding 
bottom-level feature-based representations, (2) calcu- 
lation of feature overlap between any two chunks (as 
in (36.4)), and (3) bottom-up activation of the top- 
level chunk nodes. This kind of similarity calculation 
is naturally accomplished in a multi-level cognitive ar- 
chitecture and represents a form of synergy between the 
explicit and implicit modules. Each round of reasoning 
in the NACS can be described by the following steps: 


1. Propagate the activation of the activated features in 
the bottom level 

2. Concurrently, fire all applicable associative rules in 
the top level 

3. Integrate the outcomes of top and bottom-level pro- 
cessing 

4. Update the activations in the top and bottom levels 
(e.g., similarity-based reasoning) 

5. Go back to Step 1 (if another round of reasoning is 
requested by the ACS). 
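
The similarity measure in (36.4) can be illustrated with a minimal sketch. It assumes that a chunk is represented simply as a set of feature labels and that f(x) = x^1.1 is used as the slightly superlinear function; the function name, the feature names, and the example chunks are hypothetical and are not taken from the CLARION implementation.

```python
def similarity(chunk_i, chunk_j, f=lambda x: x ** 1.1):
    """Similarity from chunk i to chunk j in the spirit of (36.4).

    chunk_i, chunk_j: sets of feature labels (an assumed representation).
    f: slightly superlinear, monotonically increasing, positive function
       (the exponent 1.1 is an assumption of this sketch).
    """
    if not chunk_j:
        raise ValueError("chunk j must have at least one feature")
    overlap = len(chunk_i & chunk_j)      # number of shared features
    return overlap / f(len(chunk_j))      # normalized by the size of chunk j


# Hypothetical chunks: every feature of "bird" is also a feature of "robin".
robin = frozenset({"has_wings", "flies", "lays_eggs", "sings"})
bird = frozenset({"has_wings", "flies", "lays_eggs"})

s_robin_to_bird = similarity(robin, bird)   # 3 / 3**1.1
s_bird_to_robin = similarity(bird, robin)   # 3 / 4**1.1
```

Because the denominator depends only on chunk j, the measure is asymmetric: in this sketch the similarity from robin to bird is higher than the similarity from bird to robin. The superlinear f also keeps the similarity of a chunk to itself slightly below 1.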


36.4.3 The Motivational and Meta-Cognitive Subsystems


The motivational subsystem (MS) is concerned with 
drives and their interactions [36.31], which lead to ac- 
tions. It is concerned with why an agent does what 
it does. Simply saying that an agent chooses actions 
to maximize gains, rewards, reinforcements, or pay- 
offs leaves open the question of what determines these 
things. The relevance of the MS to the ACS lies pri- 
marily in the fact that it provides the context in which 
the goal and the reinforcement of the ACS are set. It 
thereby influences the working of the ACS, and by ex- 
tension, the working of the NACS. 

Dual motivational representations are in place in 
CLARION. The explicit goals (such as finding food) 
of an agent may be generated based on internal 
drives (for example, being hungry; see [36.32] for 
details). Beyond low-level drives (concerning physio- 
logical needs), there are also higher-level drives. Some 
of them are primary, in the sense of being hard wired, 
while others are secondary (derived) drives acquired 
mostly in the process of satisfying primary drives. 

The meta-cognitive subsystem (MCS) is closely 
tied to the MS. The MCS monitors, controls, and regulates
cognitive processes for the sake of improving
cognitive performance [36.33, 34]. Control and regulation
may take the form of setting goals for the ACS,
setting essential parameters of the ACS and the NACS, 
interrupting and changing on-going processes in the 
ACS and the NACS, and so on. Control and regulation 
can also be carried out through setting reinforcement 
functions for the ACS. All of the above can be done on 
the basis of drive activations in the MS. The MCS is 
also made up of two levels: the top level (explicit) and 
the bottom level (implicit). 


36.4.4 Simulation Example: Minefield Navigation


Sun et al. [36.7] empirically tested and simulated
a complex minefield navigation task. In the empirical 
task, the subjects were seated in front of a computer 
monitor that displayed an instrument panel containing 
several gauges that provided current information on the 
status/location of a vehicle. The subjects used a joy- 
stick to control the direction and speed of the vehicle. 
In each trial, a random mine layout was generated, 
and the subjects had limited time to reach a target lo- 
cation without hitting a mine. Control subjects were 
trained for several consecutive days in this task. Sun 
and colleagues also tested three experimental condi- 
tions with the same amount of training but emphasized 
verbalization, over-verbalization, and dual-tasking (re- 
spectively). The human results show that learning was 
slower in the dual-task condition than in the single-task 
condition, and that a moderate amount of verbalization 
speeds up learning. However, the effect of verbalization 
is reversed in the over-verbalization condition; over- 
verbalization interfered with (slowed down) learning. 
In the CLARION simulation, simplified (explicit) 
rules were represented in the form state → action in the
top level of the ACS. In the bottom level of the ACS, 
a backpropagation network was used to (implicitly) 
learn the input-output function using reinforcement 
learning. Reinforcement was received at the end of ev- 
ery trial. The bottom-level information was used to 
create and refine top-level rules (with bottom-up learn- 
ing). The model started out with no specific a priori 
knowledge about the task (the same as a typical sub- 
ject). The bottom level contained randomly initialized 
weights. The top level started empty and contained no 
a priori knowledge about the task (either in the form of 
instructions or rules). The interaction of the two levels 
was not determined a priori either: there was no fixed 
weight in combining outcomes from the two levels. The 
weights were automatically set based on the relative performance
of the two levels on a periodic basis. The effects
of the dual task and the various verbalization condi-
tions were modeled using rule-learning thresholds so
that more or less activity could occur at the top level.
The CLARION simulation results closely matched the
human results [36.7]. In addition, the human and simulated
data were input into a common ANOVA and
no statistically significant difference between human
and simulated data was found in any of the conditions.
Hence, CLARION did a good job of simulating detailed
human data in the minefield navigation task. More de-
tails about this simulation can be found in [36.7].


36.5 Cognitive Architectures as Models of Multi-Agent Interaction 


Most of the work in social simulation assumes rudimen- 
tary cognition on the part of agents. Agent models have 
frequently been custom-tailored to the task at hand, of- 
ten with a restricted set of highly domain-specific rules. 
Although this approach may be adequate for achieving 
the limited objectives of some social simulations, it is 
overall unsatisfactory. For instance, it limits the realism, 
and hence applicability of social simulation, and more 
importantly it also precludes any possibility of resolv-
ing the theoretical question of the micro-macro link.

Cognitive models, especially cognitive architec- 
tures, may provide better grounding for understanding 
multi-agent interaction. This can be achieved by incor- 
porating realistic constraints, capabilities, and tenden- 
cies of individual agents in terms of their psychological 
processes (and maybe even in terms of their physical 
embodiment) and their interactions with their environ- 
ments (which include both physical and social envi- 
ronments). Cognitive architectures make it possible to 
investigate the interaction of cognition/motivation on 
the one hand and social institutions and processes on 
the other, through psychologically realistic agents. The 
results of the simulation may demonstrate significant 
interactions between cognitive-motivational factors and 
social-environmental factors. Thus, when trying to un- 
derstand social processes and phenomena, it may be 
important to take the psychology of individuals into 
consideration given that detailed computational mod- 
els of cognitive agents that incorporate a wide range 
of psychological functionalities have been developed in 
cognitive science. 

For example, Sun and Naveh simulated an organi- 
zational classification decision-making task using the 
CLARION cognitive architecture [36.35]. In a classifi- 
cation decision-making task, agents gather information 
about problems, classify them, and then make further 
decisions based on the classification. In this case, the 
task is to determine whether a blip on a screen is a hos- 
tile aircraft, a flock of geese, or a civilian aircraft. In 
each case, there is a single object in the airspace. The
object has nine different attributes, each of which can
take on one of three possible values (e.g., its speed 
can be low, medium, or high). An organization must 
determine the status of an observed object: whether 
it is friendly, neutral, or hostile. There are a total of
3^9 = 19 683 possible objects, and 100 problems are chosen
randomly from this set.

Critically, no one single agent has access to all the 
information necessary to make a choice. Decisions are 
made by integrating separate decisions made by differ- 
ent agents, each of which is based on a different subset 
of information. In terms of organizational structures, 
there are two major types of interest: teams and hier- 
archies. In teams, decision-makers act autonomously, 
individual decisions are treated as votes, and the organi- 
zation decision is the majority decision. In hierarchies, 
agents are organized in a chain of command, such that 
information is passed from subordinates to superiors, 
and the decision of a superior is based solely on the 
recommendations of his/her subordinates. In this task, 
only a two-level hierarchy with nine subordinates and 
one supervisor is considered. 

In addition, organizations are distinguished by the 
structure of information accessible to each agent. There 
are two types of information access: distributed access, 
in which each agent sees a different subset of three at- 
tributes (no two agents see the same subset of three 
attributes), and blocked access, in which three agents 
see exactly the same subset of attributes. In both cases, 
each attribute is accessible to three agents. 
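
The task structure described above can be made concrete with a short sketch. It enumerates the object space, builds one possible blocked and one possible distributed access layout for nine agents, and implements the team decision as a vote. The particular distributed layout (agent a sees attributes a, a+1, a+2 modulo 9), the function names, and the tie-breaking behavior are assumptions made for illustration; the layouts used in the original experiments are not specified here.

```python
from collections import Counter
from itertools import product

N_ATTRIBUTES = 9
VALUES = (1, 2, 3)  # e.g., low, medium, high


def all_objects():
    """Every combination of nine three-valued attributes."""
    return list(product(VALUES, repeat=N_ATTRIBUTES))


def blocked_access(n_agents=9):
    """Agents 0-2 share attributes 0-2, agents 3-5 share 3-5, and so on."""
    return [tuple(range(3 * (a // 3), 3 * (a // 3) + 3)) for a in range(n_agents)]


def distributed_access(n_agents=9):
    """One possible layout (an assumption): agent a sees a, a+1, a+2 (mod 9)."""
    return [tuple(sorted((a + k) % N_ATTRIBUTES for k in range(3)))
            for a in range(n_agents)]


def team_decision(votes):
    """Most frequent individual decision; equal counts resolve to the
    label encountered first (a choice of this sketch)."""
    return Counter(votes).most_common(1)[0][0]


objects = all_objects()
print(len(objects))                                   # size of the object space
print(team_decision(["hostile", "neutral", "hostile"]))
```

In both layouts every attribute is visible to exactly three agents. The layouts differ only in how many distinct attribute subsets exist: three in the blocked case and nine in the distributed case. With three possible labels and nine voters, the most frequent label need not be an absolute majority, so the sketch uses a plurality vote.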

The human experiments by Carley et al. [36.36] 
were done in a 2 × 2 fashion (organization × information
access). The data showed that humans generally per- 
formed better in team situations, especially when dis- 
tributed information access was in place. Moreover, 
distributed information access was generally better 
than blocked information access. The worst perfor- 
mance occurred when hierarchical organizational struc- 
ture and blocked information access were used in 
conjunction. 
The results of the CLARION simulations closely
matched the patterns of the human data, with teams out-
performing hierarchical structures, and distributed access
being superior to blocked access. As in the human data, 
the effects of organization and information access were 
present, but more importantly the interaction of these 
two factors with length of training was reproduced. 
These interactions reflected the following trends: (1) the 
superiority of team and distributed information access 
at the start of the learning process, and (2) either the dis-
appearance or reversal of these trends towards the end.
These trends persisted robustly across a wide variety of 
settings of cognitive parameters, and did not critically 
depend on any one setting of these parameters. Also, 
as in humans, performance was not grossly skewed to- 
wards one condition or the other. 


36.5.1 Extension


One advantage of using a more cognitive agent in so- 
cial simulations is that we can address the question of 
what happens when cognitive parameters are varied. 
Because CLARION captures a wide range of cognitive 
processes, its parameters are generic (rather than task 
specific). Thus, one has the opportunity of studying so- 
cial and organizational issues in the context of a general 
theory of cognition. Below we present some of the re- 
sults observed (details can be found in [36.35]). 
Varying the parameter controlling the probability of 
selecting implicit versus explicit processing in CLAR- 
ION interacted with the length of training. Explicit rule 
learning was far more useful at the early stages of 
learning, when increased reliance on rules tended to 
boost performance (compared with performance toward 
the end of the learning process). This is because ex- 
plicit rules are crisp guidelines that are based on past 
success, and as such, they provide a useful anchor at 
the uncertain early stages of learning. However, by the 
end of the learning process, they become no more re-
liable than highly trained networks. This corresponds
to findings in human cognition, where there are indica-
tions that rule-based learning is more widely used in
the early stages of learning, but is later increasingly
supplanted by similarity-based processes and skilled
performance [36.37, 38]. Such trends may partially ex-
plain why hierarchies did not perform well initially;
because a hierarchy’s supervisor was burdened with
a higher input dimensionality, it took a longer time to
encode rules (which were, nevertheless, essential at the
early stages of learning).

Another interesting result was the effect of vary-
ing the generalization threshold. The generalization
threshold determines how readily an agent generalizes
a successful rule. It was better to have a higher rule gen-
eralization threshold than a lower one (up to a point).
That is, if one restricts the generalization of rules to
those rules that have proven to be relatively success-
ful, the result is a higher-quality rule set, which leads to
better performance in the long run.

This CLARION simulation showed that some cog-
nitive parameters (e.g., learning rate) had a monolithic,
across-the-board effect under all conditions, while in
other cases, complex interactions of factors were at
work (see [36.35] for full details of the analysis). This
illustrates the importance of limiting one’s social simu-
lation conclusions to the specific cognitive context in
which human data were obtained (in contrast to the
practice of some existing social simulations). By using
CLARION, Sun and Naveh [36.35] were able to ac-
curately capture organizational performance data and,
moreover, to formulate deeper explanations for the re-
sults observed. In cognitive architectures, one can vary
parameters and options that correspond to cognitive
processes and test their effects on collective perfor-
mance. In this way, cognitive architectures may be used
to predict human performance in social/organizational
settings and, furthermore, to help improve collective
performance by prescribing optimal or near-optimal
cognitive abilities for individuals for specific collective
tasks and/or organizational structures.


36.6 General Discussion


This chapter reviewed the most popular psycholog-
ically-oriented cognitive architectures with some ex-
ample applications in human developmental learning,
problem-solving, navigation, and cognitive social sim-
ulations. ACT* (ACT-R’s early version; [36.39]) and
Soar [36.15] were some of the first cognitive architec-
tures available and have been around since the early
1980s, while CLARION was first proposed in the
mid-1990s [36.30]. This chronology is crucial when ex-
ploring their learning capacity. ACT* and Soar were
developed before the connectionist revolution [36.40],
and were, therefore, implemented using knowledge-
rich production systems [36.41]. In contrast, CLARION
was proposed after the connectionist revolution and was 
implemented using neural networks. While some at- 
tempts have been made to implement ACT-R [36.42] 
and Soar [36.43] with neural networks, these archi- 
tectures remain mostly knowledge-rich production sys- 
tems grounded in the artificial intelligence tradition. 
One of the most important impacts of the connection- 
ist revolution has been data-driven learning rules (e.g., 
backpropagation) that allow for autonomous learning.
CLARION was created within this tradition, and every 
component in CLARION has been implemented us- 
ing neural networks. For instance, explicit knowledge 
may be implemented using linear, two-layer neural net- 
works [36.7, 23, 28, 29], while implicit knowledge has 
been implemented using nonlinear multilayer back- 
propagation networks in the ACS [36.7, 29] and recur- 
rent associative memory networks in the NACS [36.23, 
29]. This general philosophy has also been applied to 
modeling the MS and the MCS using linear (explicit) 
and nonlinear (implicit) neural networks [36.44]. As 
such, CLARION requires less pre-coded knowledge 
to achieve its goals, and can be considered more au- 
tonomous. 

While the different cognitive architectures were 
motivated by different problems and took different im- 
plementation approaches, they share some theoretical 
similarities. For instance, Soar is somewhat similar to 
the top levels of CLARION. It contains production rules 
that fire in parallel and cycle until a goal is reached. In
CLARION, top-level rules in the NACS fire in paral- 
lel in cycles (under the control of the ACS). However, 
CLARION includes a distinction between action-cen- 


37. Embodied Intelligence


Angelo Cangelosi, Josh Bongard, Martin H. Fischer, Stefano Nolfi 


Embodied intelligence is the computational ap- 
proach to the design and understanding of 
intelligent behavior in embodied and situated 
agents through the consideration of the strict 
coupling between the agent and its environment 
(situatedness), mediated by the constraints of the 
agent's own body, perceptual and motor system, 
and brain (embodiment). The emergence of the 
field of embodied intelligence is closely linked 
to parallel developments in computational in- 
telligence and robotics, where the focus is on 
morphological computation and sensory-motor 
coordination in evolutionary robotics models, and 
in neuroscience and cognitive sciences where 
the focus is on embodied cognition and devel- 
opmental robotics models of embodied symbol 
learning. This chapter provides a theoretical 
and technical overview of some principles of 
embodied intelligence, namely morphological 
computation, sensory-motor coordination, and 
developmental embodied cognition. It will also 
discuss some tutorial examples on the model- 
ing of body/brain/environment adaptation for the 
evolution of morphological computational agents, 
evolutionary robotics models of navigation and ob-
ject discrimination, and developmental robotics
models of language and numerical cognition in
humanoid robots. 


37.1 Introduction to Embodied Intelligence 


Organisms are not isolated entities which develop their 
sensory-motor and cognitive skills in isolation from 
their social and physical environment, and indepen- 
dently from their motor and sensory systems. On the 
contrary, behavioral and cognitive skills are dynamical 
properties that unfold in time and arise from a large 
number of interactions between the agents’ nervous 
system, body, and environment [37.1-7]. Embodied in-
telligence is the computational approach to the design
and understanding of intelligent behavior in embod-
ied and situated agents through the consideration of
the strict coupling between the agent and its environ-
ment (situatedness), mediated by the constraints of the
agent’s own body, perceptual and motor system, and 
brain (embodiment). 

Historically, the field of embodied intelligence has
its origins in the development and use of bio-inspired
computational intelligence methodologies in computer
science and robotics, and in the overcoming of the limita-
tions of symbolic approaches typical of classical artifi-
cial intelligence methods. As argued in Brooks’ [37.2]
seminal paper Elephants don’t play chess, the study
of apparently simple behaviors, such as locomotion
and motor control, permits an understanding of the
embodied nature of intelligence, without the require-
ment to start from higher order abstract skills such as those
involved in chess playing algorithms. Moreover, the
emergence of the field of embodied intelligence is
closely linked to parallel developments in robotics, with
the focus on morphological computation and sensory-
motor coordination in evolutionary and developmen-
tal robotics models, and in neuroscience and cogni-
tive sciences with the focus on embodied cognition
(EC).

The phenomenon of morphological computation 
concerns the observation that a robot’s (or animal’s) 
body plan may perform computations: A body plan 
that allows the robot (or animal) to passively exploit 
interactions with its environment may perform compu- 
tations that lead to successful behavior; in another body 
plan less well suited to the task at hand, those com- 
putations would have to be performed by the control 
policy [37.8-10]. If both the body plans and control 
policies of robots are evolved, evolutionary search may 
find robots that exhibit more morphological computa- 
tion than an equally successful robot designed by hand 
(see more details in Sect. 37.2). 

The principle of sensory-motor coordination, 
which concerns the relation between the characteris- 
tics of the agents’ control policy and the behaviors 
emerging from agent/environmental interactions, has 
been demonstrated in numerous evolutionary robotics 
models [37.6]. Experiments have shown how adap- 
tive agents can acquire an ability to coordinate their 
sensory and motor activity so as to self-select their 
forthcoming sensory experiences. This sensory-motor
coordination can play several key functions such as en- 
abling the agent to access the information necessary to 
make the appropriate behavioral decision, elaborating 
sensory information, and reducing the complexity of the 
agents’ task to a manageable level. These two themes 
will be exemplified through the illustration of evolu- 
tionary robotics experiments in Sect. 37.3 in which the 
fine-grained characteristics of the agents’ neural control 
system and body are subjected to variations (e.g. gene 
mutation) and in which variations are retained or dis-
carded on the basis of their effects at the level of the
overall behavior exhibited by the agent in interaction
with the environment. 

In cognitive and neural sciences, the term em- 
bodied cognition (EC) [37.11, 12] is used to refer to 
systematic relationships between an organism’s cogni- 
tive processes and its perceptual and response reper- 
toire. Notwithstanding the many interpretations of this 
term [37.13], the broadest consensus of the proponents 
of EC is that our knowledge representations encom- 
pass the bodily activations that were present when 
we initially acquired this knowledge (for differentia-
tions, see [37.14]). This view helps us to understand the
many findings of modality-specific biases induced by 
cognitive computations. Examples of EC in psychology 
and cognitive science can be sensory-motor (e.g., a sys-
tematic increase in comparison time with angular dis-
parity between two views of the same object [37.15]), or 
conceptual (e.g., better recall of events that were experi- 
enced in the currently adopted body posture [37.16]), or 
emotional in nature (e.g., interpersonal warmth induced 
by a warm handheld object [37.17]). Such findings were 
hard to accommodate under the more traditional views 
where knowledge was presumed symbolic, amodal and 
abstract and thus dissociated from sensory input and 
motor output processes. 

Embodied cognition experiments in psychology 
have inspired the design of developmental robotics 
models [37.18] which exploit the ontogenetic inter- 
action between the developing (baby) robot and its 
social and physical environment to acquire both sim- 
ple sensory—motor control strategies and higher order 
capabilities such as language and number learning 
(Sect. 37.4). 

To provide the reader with both a theoretical and 
technical understanding of the principles of morpho-
logical computation, sensory-motor coordination, and
developmental EC, the following three sections will
review the progress in these fields, and analyze in de-
tail some key studies as examples. The presentation of
studies on the modeling of both sensory-motor tasks
(such as locomotion, navigation, and object discrimi- 
nation) and of higher order cognitive capabilities (such 
as linguistic and numerical cognition) demonstrates the 
impact of embodied intelligence in the design of a vari- 
ety of perceptual, motor, and cognitive skills. 


37.2 Morphological Computation for Body-Behavior Coadaptation 


Embodied intelligence dictates that there are certain 
body plans and control policies that, when combined,
will produce some desired behavior. For example,
imagine that the desired task is active categorical per-
ception (ACP) [37.19, 20]. ACP requires a learner to
actively interact with objects in its environment to clas- 
sify those objects. This stands in contrast to passive 
categorization whereby an agent observes objects from 
a distance — perhaps it is fed images of objects or views 
them through a camera — and labels the objects accord- 
ing to their perceived class. In order for an animal or 
robot to perform ACP, it must not only possess a con- 
trol policy that produces as output the correct class for 
the object being manipulated, but also some manipula- 
tor with which to physically affect (and be affected by) 
the object. 

One consequence of embodied intelligence is that 
certain pairings of body and brain produce the desired 
behavior, and others do not. Returning to the example 
of ACP, if a robot’s arm is too short to reach the objects 
then it obviously will not be able to categorize them. 
Imagine now a second robot that possesses an arm of 
the requisite length but can only bring the back of its 
hand into contact with the objects. Even if this robot’s 
control policy distinguishes between round and edged 
objects based on the patterned firing of touch sensors 
embedded in its palm, this robot will also not be able to 
perform ACP. 

A further consequence of embodied intelligence is 
that some body plans may require a complex control 
policy to produce successful behavior, while another 
body plan may require a simpler control policy. This has 
been referred to as the morphology and control tradeoff 
in the literature [37.7]. Continuing the ACP example, 
consider a third robot that can bring its palm and fingers 
into contact with the objects, but only possesses a single 
binary touch sensor in its palm. In order to distinguish 
between round and edged objects, this robot will require 
a control policy that performs some complex signal pro- 
cessing on the time series data produced by this single 
sensor during manipulation. A fourth robot however, 
equipped with multiple tactile sensors embedded in its 
palm and fingers, may be able to categorize objects im- 
mediately after grasping them: Perhaps round objects 
produce characteristic static patterns of tactile signals 
that are markedly different from those patterns pro- 
duced when grasping edged objects. 

The morphology and control tradeoff, however,
raises the question as to what is being traded. It has
been argued that what is being traded is computa-
tion [37.7, 8]. If two robots succeed at a given task,
and each robot is equipped with the simplest control
policy that will allow that robot to succeed, but one
control policy performs fewer computations than the
other control policy, then the body plan of the robot
equipped with the simpler control policy must perform
the missing computations required to succeed at the 
task. 

This phenomenon of a robot’s (or animal’s) body 
plan performing computation has been termed mor-
phological computation [37.8-10]. Paul [37.8] outlined
a theoretical robot that uses its body to compute the 
XOR function. In another study [37.9] it was shown 
how the body of a vacuum cleaning robot could literally 
replace a portion of its artificial neural network con- 
troller, thus subsuming the computation normally per- 
formed by that part of the control policy into the robot’s 
body. Pfeifer and Gomez [37.21] describe a number of 
other robots that exhibit the phenomenon of morpho- 
logical computation. 
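
The idea of a body computing XOR can be sketched in a few lines. The mechanism below is an illustrative assumption and is not a reproduction of the design in [37.8]: one controller computes OR and drives a wheel, a second controller computes AND and lifts that wheel off the ground, and the body's mechanics combine the two so that the robot moves only when exactly one input is active. Neither controller computes XOR on its own.

```python
def or_controller(a: bool, b: bool) -> bool:
    """Controller 1: linearly separable function driving the wheel motor."""
    return a or b


def and_controller(a: bool, b: bool) -> bool:
    """Controller 2: linearly separable function driving the lift motor."""
    return a and b


def robot_moves(a: bool, b: bool) -> bool:
    """Body mechanics (assumed): a spinning wheel propels the robot
    only while it is in contact with the ground."""
    wheel_spinning = or_controller(a, b)
    wheel_lifted = and_controller(a, b)
    return wheel_spinning and not wheel_lifted


for a in (False, True):
    for b in (False, True):
        print(a, b, robot_moves(a, b))
```

The observable behavior (moving or not moving) follows the XOR truth table, although the control policy consists only of OR and AND. In this sketch the remaining step, combining the two motor outputs, is performed by the physical interaction between wheel and ground.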


37.2.1 The Counterintuitive Nature of Morphological Computation


All of the robots outlined by Pfeifer and Gomez [37.21] 
were designed manually; in some cases the control poli-
cies were automatically optimized. If for each task there
is a spectrum of robot body plan/control policy pair-
ings that achieve the task, one might ask where along 
this spectrum the human-designed robots fall. That is, 
what mixtures of morphological computation and con- 
trol computation do human designers tend to favor? 
The bulk of the artificial intelligence literature, since 
the field’s beginnings in the 1950s, seems to indicate 
that humans exhibit a cognitive chauvinism: we tend 
to favor control complexity over morphological com- 
plexity. Classical artificial intelligence dispensed with 
the body altogether: it was not until the 1980s that the 
role of morphology in intelligent behavior was explic- 
itly stated [37.2]. As a more specific example, object 
manipulation was first addressed by creating rigid, ar- 
ticulated robot arms that required complex control poli- 
cies to succeed [37.22]. Later, it was realized that soft 
manipulators could simplify the amount of control re- 
quired for successful manipulation (e.g., [37.23]). Most 
recently, a class of robot manipulators known as jam-
ming grippers was introduced [37.24]. In a jamming
gripper, a robot arm is tipped with a bag of granular 
material such that when air is removed from the bag 
the grains undergo a phase transition into a jammed, 
solid-like state. The control policies for jamming grip- 
pers are much simpler than those required for rigid or 
even soft multifingered dexterous manipulators: at the 
limit, the controller must switch the manipulator be- 
tween just two states (grip or release), regardless of the 
object. 
Despite the fact that the technology for creating
jamming grippers has existed for decades, it took a long 
time for this class of manipulators to be discovered. 
In other branches of robotics, one can discern a sim- 
ilar historical pattern: new classes of robot body plan 
were successively proposed that required less and less 
explicit control. In the field of legged locomotion for 
example, robots with whegs (wheel-leg hybrids) were 
shown to require less explicit control than robots with 
legs to enable travel over rough terrain [37.25]. 

These observations suggest that robots with more 
morphological computation are less intuitive for hu- 
mans to formulate and then design than robots with 
less morphological computation. However, there may 
be a benefit to creating robots that exhibit significant 
amounts of morphological computation. For example, 
hybrid dynamic walkers require very little control and 
are much more energy efficient compared to fully ac- 
tuated legged robots [37.26]. It has been argued that 
tensegrity robots also require relatively little control 
compared to robots composed of serially linked rigid 
components, and this class of robot has several desir- 
able properties such as the ability to absorb and recover 
from external perturbations [37.9]. 

So, if robots that exhibit morphological compu- 
tation are desirable, yet it is difficult for humans to 
navigate in this part of the space of possible robots, can 
an automated search method be used to discover such 
robots? 


37.2.2 Evolution and Morphological 
Computation 


One of the advantages of using evolutionary algorithms 
to design robots, compared to machine learning meth- 
ods, is that both the body plan and the control policy can 
be placed under evolutionary control [37.27]. Typically, 
machine learning methods optimize some of the param- 
eters of a control policy with a fixed topology. However, 
if the body plans and control policies of robots are 
evolved, and there is sufficient variation within the 
population of evolving robots, search may discover 
multiple successful robots that exhibit varying degrees 
of morphological computation. Or, alternatively, if mor- 
phological computation confers a survival advantage 
within certain contexts, a phylogeny of robots may 
evolve that exhibit increasing amounts of morpholog- 
ical computation. 
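
As a minimal sketch of what it means to place both the body plan and the control policy under evolutionary control, the following Python fragment evolves a genome that holds body parameters and controller parameters side by side. The genome layout, the mutation scheme, and the toy fitness function are all illustrative assumptions; in a real experiment the fitness would come from simulating or running the robot, and nothing here reproduces the setup of [37.27].

```python
import random

# Hypothetical genome: body parameters and controller parameters are stored
# together, so variation and selection act on both at the same time.


def random_genome(rng, n_body=3, n_ctrl=4):
    return {
        "body": [rng.uniform(0.1, 1.0) for _ in range(n_body)],
        "ctrl": [rng.uniform(-1.0, 1.0) for _ in range(n_ctrl)],
    }


def mutate(genome, rng, sigma=0.1):
    # Gaussian mutation on both parts; body parameters are kept positive.
    return {
        "body": [max(0.05, g + rng.gauss(0.0, sigma)) for g in genome["body"]],
        "ctrl": [g + rng.gauss(0.0, sigma) for g in genome["ctrl"]],
    }


def toy_fitness(genome):
    # Placeholder for a physics simulation (assumption): it simply rewards
    # body parameters near 0.5 and controller parameters near 0.
    body_err = sum((b - 0.5) ** 2 for b in genome["body"])
    ctrl_err = sum(c ** 2 for c in genome["ctrl"])
    return -(body_err + ctrl_err)


def evolve(generations=50, pop_size=20, seed=0):
    rng = random.Random(seed)
    population = [random_genome(rng) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=toy_fitness, reverse=True)
        history.append(toy_fitness(population[0]))
        parents = population[: pop_size // 2]  # truncation selection
        children = [
            mutate(rng.choice(parents), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    population.sort(key=toy_fitness, reverse=True)
    return population[0], history
```

Because the better half of the population is carried over unchanged, the best fitness recorded in history never decreases from one generation to the next.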

A recent pair of experiments illustrates how mor- 
phological computation may be explored. An evolution- 
ary algorithm was employed to evolve the body plans 


Fig. 37.1a-d A sample of four evolved robots with differing
amounts of morphological complexity. (a) A simple-shaped
robot that evolved to locomote over flat ground.
(b-d) Three sample robots, more morphologically com- 
plex than the robot in (a), that evolved in icy environments 
(after Auerbach and Bongard [37.28]). To view videos of 
these robots see [37.29] 


and control policies of robots that must move in one of 
two environments. The first environment contained nothing
other than a flat, high-friction ground plane
(Fig. 37.1a). The second environment was composed 
of a number of low-friction bars that sit atop the high- 
friction ground plane (Fig. 37.1b-d). These bars can 
be thought of as ice distributed across a flat landscape. 
In order for robots to move across the icy terrain, 
they must evolve appendages that are able to reach 
down between the icy blocks, come into contact with 
the high-friction ground, and push or pull themselves 
forward. 

It was found that robots evolved to travel over the 
ice had more complex shapes than those evolved to 
travel over flat ground (compare the robot in Fig. 37.1a
to those in Fig. 37.1b-d) [37.28]. However, it was
also found that the robots that travel over ice had 
fewer mechanical degrees of freedom (DOFs) than the 
robots evolved to travel over flat ground [37.30]. If 
a robot possesses fewer mechanical DOFs, one can 
conclude that it has a simpler control policy, because 
there are fewer motors to control. It seems that the 
robots evolved to travel over ice do so in the follow- 
ing manner: the complex shapes of their appendages 
cause the appendages to reach down into the crevices 
between the ice without explicit control; the simple con- 
trol policy then simply sweeps the appendages back and 
forth, horizontally, to in effect skate along the tops of
the ice. In contrast, robots evolved to travel over flat 
ground must somehow push back, reach up, and pull 
forward — using several mechanical DOFs — to move 
forward. 

One could conclude from these experiments that the 
robots evolved to travel over ice perform more morphological
computation than those evolved to travel over flat ground:
the former robots have more complex bodies but simpler
control policies than the latter
robots, yet both successfully move in their environ- 
ments. Much more work is required to generalize this 
result to different robots, behaviors, and environments, 
but this initial work suggests that evolutionary robotics 
may be a unique tool for studying the phenomenon of 
morphological computation. 


37.3 Sensory-Motor Coordination in Evolving Robots 


The actions performed by embodied and situated agents 
inevitably modify the agent-environment relation
and/or the environment. The type of stimuli that an
agent will sense at the next time step t+1 crucially
depends, for example, on whether the agent turns left
or right at the current time t. Similarly, the stimuli
that an agent will experience next at time t+1 when
standing next to an object depend on the effort with 
which it will push the object at time t. This implies 
that actions might play direct and indirect adaptive 
roles. Actions playing a direct role are, for example, 
foraging or predator escaping behaviors that directly 
impact on the agent’s own survival chances. Actions
playing indirect roles consist, for example, in wandering
through the environment to spot interesting sensory
information (e.g., the perception of a food area that 
might eventually afford foraging actions) or playing 
a fighting game with a conspecific that might en- 
able the agent to acquire capacities that might later 
be exploited to deal with aggressive individuals. The 
possibility of self-selecting useful sensory stimuli through
action is referred to as sensory-motor coordination.

Together with morphological computation, sensory-motor
coordination constitutes a fundamental property
of embodied and situated agents and one of the most
important characteristics that can be used to differentiate
these systems from alternative forms of intelligence.
In the following sections, we illustrate three
of the key roles that can be played by sensory-motor
coordination:


i) The discovery of parsimonious behavioral strategies 

ii) The access and generation of useful sensory infor- 
mation through action and active perception 

iii) The constraining and channeling of the learning 
process during evolution and development. 


37.3.1 Enabling the Discovery 
of Simple Solutions 


Sensory-motor coordination can be exploited to find
solutions relying on more parsimonious control policies
than alternative solutions not relying, or relying less, on
this principle. An example is a set of experiments in
which a Khepera robot [37.31], endowed with infrared
and speed sensors, was evolved for the ability to remain
close to large cylindrical objects (food) while avoiding
small cylindrical objects (dangers). From a passive
perspective, which does not take into account the
possibility of exploiting sensory-motor coordination,
the ability to discriminate between sensory
stimuli experienced near small and large cylindrical ob- 
jects requires a relatively complex control policy since 
the two classes of stimuli strongly overlap in the robot’s 
perceptual space [37.32]. On the other hand, robots 
evolved for the ability to perform this task tend to con- 
verge on a solution relying on a rather simple control 
policy: the robots begin to turn around objects as soon 
as they approach them and then discriminate the size of 
the object on the basis of the sensed differential speed 
of the left and right wheels during the execution of the 
object-circling behavior [37.33]. In other words, the ex- 
ecution of the object-circling behavior allows the robots 
to experience sensory stimuli on the wheel sensors that 
are well differentiated for small and large objects. This, 
in turn, allows them to solve the object discrimina- 
tion problem with a rather simple but reliable control 
policy. 
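
To see why circling makes the two object sizes easy to tell apart, consider a rough kinematic sketch. It assumes an idealized differential-drive robot that circles an object at a constant gap, so that the outer and inner wheels travel on concentric circles. The function name and the numeric values in the usage note are illustrative assumptions, not measurements from the experiments in [37.33].

```python
def wheel_speed_ratio(object_radius, gap, robot_radius, wheel_base):
    """Outer/inner wheel speed ratio while circling a cylindrical object.

    Assumes an ideal differential-drive robot whose center stays at a
    constant distance (object_radius + gap + robot_radius) from the
    center of the object. All lengths share the same unit.
    """
    turn_radius = object_radius + gap + robot_radius
    inner = turn_radius - wheel_base / 2.0
    outer = turn_radius + wheel_base / 2.0
    if inner <= 0:
        raise ValueError("turn radius too small for this wheel base")
    # Both wheels share one angular velocity around the object, so their
    # linear speeds are proportional to their distances from its center.
    return outer / inner
```

With hypothetical values in centimeters (gap 1.0, robot radius 2.75, wheel base 5.3), a cylinder of radius 1.5 yields a clearly larger ratio than a cylinder of radius 5.5. Under these assumptions, a single threshold on the sensed speed difference is enough to separate the two classes, which is the sense in which the circling behavior simplifies the discrimination problem.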

Another related experiment in which a Khepera 
robot provided solely with infrared sensors was adapted 
for finding and remaining close to a cylindrical object, 
while avoiding walls, demonstrates how sensory-motor
coordination can be exploited to solve tasks that re- 
quire the display of differentiated behavior in different 
environmental circumstances, without discriminating
the contexts requiring different responses [37.32, 34]. 
Indeed, evolved robots manage to avoid walls, find 
a cylindrical object, and remain near it simply by mov- 
ing backward or forward when their frontal infrared 
sensors are activated or not, respectively, and by turning 
left or right when their right and left infrared sensors 
are activated, respectively (provided that the turning
speed and the forward speed are appropriately regulated
on the basis of the sensors’ activation). Indeed,
the execution of this simple control rule, combined with
the effects of the robot’s actions, leads to the exhibition
of a move-forward behavior far from obstacles, an
obstacle avoidance behavior near walls, and an oscil- 
latory behavior near cylindrical objects (in which the 
robot remains near the object by alternating forward and 
backward and/or turn-left and turn-right movements). 
The differentiation of the behavior observed during the 
robot/wall and robot/cylinder interactions can be ex- 
plained by considering that the execution of the same 
action produces different sensory effects in interaction 
with different objects. In particular, the execution of 
a turn-left action at time t, elicited by the fact that
the right infrared sensors are more activated than the
left sensors near an object, leads to the perception of:
(i) a similar sensory stimulus eliciting a similar action
at time t+1, ultimately producing an obstacle avoidance
behavior near a wall, or (ii) a different sensory
stimulus (in which the left infrared sensors can become
more activated than the right infrared sensors) eliciting
a turn-right action at time t+1, ultimately producing an
oscillatory behavior near the cylinder. 
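
The control rule described above can be sketched as a purely reactive mapping from sensor activations to wheel speeds. This is a hand-written illustration under stated assumptions (three aggregated infrared readings in [0, 1], a linear mapping, and a hypothetical gain parameter); the evolved controllers in [37.32, 34] were neural networks whose weights were found by evolution, not this function.

```python
def control_step(front, left, right, gain=1.0):
    """Map infrared activations in [0, 1] to (left_wheel, right_wheel) speeds.

    Hypothetical reactive rule:
      - move forward when the frontal sensors are quiet, backward when active
      - turn left when the right sensors are more active than the left ones,
        and turn right in the opposite case
    """
    forward = gain * (1.0 - 2.0 * front)  # +gain with no obstacle, -gain at full activation
    turn = gain * (right - left)          # positive means turn left
    left_wheel = forward - turn
    right_wheel = forward + turn          # a faster right wheel turns the robot left
    return left_wheel, right_wheel
```

The function has no internal state and no notion of "wall" versus "cylinder". Any difference in the resulting behavior near the two kinds of object has to come from how each action changes the next sensor reading, which is the point made in the paragraph above.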

Examples of clever use of sensory-motor coordination
abound in natural and artificial evolution.
A paradigmatic example of the use of sensory-motor
coordination in natural organisms is the navigation
capabilities of flying insects, which are based on the optic
flow, i.e., the apparent motion of contrasting objects 
in the visual field caused by the relative motion of the 
agent [37.35]. Houseflies, for example, use this solu- 
tion to navigate up to 700 body lengths per second 
in unknown 3D environments while using quite modest
processing resources, i.e., about 0.001% of the number
of neurons present in the human brain [37.36]. Examples
in the evolutionary robotics literature include
wheeled robots performing navigation tasks ([37.32],
see below), artificial fingers and humanoid robotic arms
evolved for the ability to discriminate between objects
varying in shape [37.20, 37], and wheeled robots
able to navigate visually by using a pan-tilt cam- 
era [37.38]. 


37.3.2 Accessing and Generating 
Information Through Action 


A second fundamental role of sensory-motor coordination
consists in accessing and/or generating useful
sensory information through action. Unlike experimental
settings in which stimuli are brought to the
passive agent by the experimenter, in ecological conditions
agents need to access relevant information through
action. For example, infants access the visual information
necessary to recognize the 3D structure of an object
by rotating it in the hand and by keeping it at close distance
so as to minimize visual occlusions [37.39]. The use
of sensory-motor coordination for this purpose is usually
named active perception [37.37, 40, 41].

Interestingly, action can be exploited not only to 
access sensory information but also to generate it. To 
understand this aspect, we should consider that through 
their action agents can elaborate the information they 
access through their sensory system over time and store 
the result of the elaboration in their body state and/or 
in their posture or location. A well-known example of 
this phenomenon is constituted by depth perception as 
a result of convergence, i.e., the simultaneous inward 
movement of both eyes toward each other, to maintain 
a single binocular percept of a selected object. The exe- 
cution of this behavior produces a kinesthetic sensation 
in the eye muscles that reliably correlates with the ob- 
ject’s depth. 
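
The geometric reason why the convergence posture correlates with depth can be made explicit with a small sketch. It assumes a symmetric setup in which both eyes, separated by a fixed baseline, fixate a point straight ahead; the function names and the baseline value used in the usage note are illustrative assumptions.

```python
import math


def vergence_angle(depth, baseline):
    """Angle (radians) between the two lines of sight when both eyes
    fixate a point straight ahead at the given depth."""
    return 2.0 * math.atan((baseline / 2.0) / depth)


def depth_from_vergence(angle, baseline):
    """Invert the relation: recover depth from the vergence angle."""
    return (baseline / 2.0) / math.tan(angle / 2.0)
```

For a hypothetical baseline of 0.065 m, the vergence angle shrinks monotonically as the fixated object moves farther away, so the posture of the eyes itself carries the depth estimate. In this idealized geometry the information is produced by the act of converging rather than computed from the image.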

The careful reader might have recognized that the 
robot’s behavioral discrimination strategies to perceive 
larger and smaller cylindrical objects, described in the 
previous section, exploit the same active perception 
mechanism. For a robot provided with infrared and 
wheel-speed sensors, the perception of object size nec- 
essarily requires a capacity to integrate the information 
provided by several stimuli. The elaboration of this in- 
formation however is not realized internally, within the 
robot’s nervous system, but rather externally through 
the exhibition of the object-circling behavior. It is this 
behavior that generates the corresponding kinesthetic 
sensation on the wheel sensors that is then used by the 
robot to decide to remain or leave, depending on the 
circumstances. 

Examples of clever strategies able to elaborate the 
required information through action and active percep- 
tion abound in evolutionary robotics experiments. By 
carrying out an experiment in which a robot needed to 
reach two foraging areas located in the northeast and 
southwest side of a rectangular environment surrounded 
by walls, Nolfi [37.34] observed that the evolved robots 
developed a clever strategy that allows them to compute
the relative length of the two sides of the environment 
and to navigate toward the two right corners on the 
basis of a simple control policy. The strategy consists 
in leaving the first encountered corner with an angle 
of about 45° with respect to the two sides, moving 
straight, and then eventually following the left side of 
the next encountered wall (see [37.34] for details). Another
clever exploitation of sensory-motor coordination was
observed in an experiment involving two cooperating 
robots that helped each other to navigate toward circu- 
lar target areas [37.42]. Evolved robots discovered and 
displayed a behavioral solution that allowed them to inform
each other about the relative location of the center of
their target navigation area despite their sensory system 
being unable to detect their relative position within the 
area [37.42]. 


37.3.3 Channeling the Course 
of the Learning Process 


A third fundamental role of sensory-motor coordina- 
tion consists in channeling the course of the forthcom- 
ing adaptive process. 

The sensory states experienced during learning cru- 
cially determine the course and the outcome of the 
learning process [37.43]. This implies that the actions 
displayed by an agent, that co-determine the agent’s 
forthcoming sensory states, ultimately affect how the 
agent changes ontogenetically. In other words, the behavior
exhibited by an agent at a certain stage of its
development constrains and channels the course of the
agent’s developmental process. 

Indeed, evolutionary robotics experiments indicate
that the evolution of plastic agents (agents that vary
their characteristics while they interact with the environment
[37.44]) leads to qualitatively different results
from the evolution of nonplastic individuals.
The traits evolved in the case of nonplastic individuals 
are selected directly for enabling the agent to display 
the required capabilities. The traits evolved in the case 
of plastic individuals, instead, are selected primarily for 
enabling the agents to acquire the required capabilities 
through an ontogenetic adaptation process. This implies 
that, in this case, the selected traits do not enable the
agents to master their adaptive task (agents tend to
display rather poor performance at the beginning of their
lifetime) but rather to acquire the required capacities 
through ontogenetic adaptation. 

More generally, the behavioral strategies adopted by 
agents at a certain stage of their developmental process 
can crucially constrain the course of the adaptive pro- 
cess. For example, agents learning to reach and grasp 
objects might temporarily reduce the complexity of the 
task to be mastered by freezing (i.e., locking) selected
DOFs and by then unfreezing them when their capacity 
reaches a level that allows them to master the task in 
its full complexity [37.45, 46]. This type of process can 
enable exploratory learning by encompassing variation 
and selection of either the general strategy displayed by 
the agent or the specific way in which the currently se- 
lected strategy is realized. 


37.4 Developmental Robotics for Higher Order Embodied Cognitive
Capabilities


37.4.1 Embodied Cognition 
and Developmental Robots 


The previous sections have demonstrated the fundamental
role of embodiment and of the agent-environment
coupling in the design of adaptive agents
and robots capable of performing sensory-motor tasks
such as navigation and object discrimination. However,
embodiment also plays an important role in higher
order cognitive capabilities [37.12], such as object
categorization and representation, language learning and
processing, and even the acquisition of abstract concepts
such as numbers. In this section, we will consider
some of the key psychological and neuroscience evidence
for embodied cognition (EC) and its contribution to the
design of linguistic and numerical skills in cognitive robots.

Intelligent behavior has traditionally been mod- 
eled as a result of activation patterns across distributed 
knowledge representations, such as hierarchical net- 
works of interrelated propositional (symbolic) nodes 
that represent objects in the world and their attributes 
as abstract, amodal (nonembodied) entities [37.47]. For 
example, the response "bird" to a flying object with feathers
and wings would result from perceiving its features
and retrieving its name from memory on the basis of 
a matching process. Such traditional views were at- 
tractive for a number of reasons: They followed the 
predominant philosophical tradition of logical conceptual
knowledge organization, according to which all
objects are members of categories and category mem- 
bership can be determined in an all-or-none fashion via 
defining features. Also, such hierarchical knowledge 
networks were consistent with cognitive performance 
in simple tasks such as speeded property verification, 
which were thought to tap into the retrieval of knowledge.
For example, verifying the statement "a bird has
feathers" was thought to be easier than verifying the
statement "a bird is alive" because the feature "feathers"
was presumably stored in memory as defining the category
"bird", while the feature "alive" applies to all animals
and was therefore represented at a superordinate level
of knowledge, hence requiring more time to retrieve after
having just processed "bird" [37.47]. Finally, it was
convenient to computationally model such networks by
likening the human mind to an information processing device
with systematic input, storage, retrieval, and output
mechanisms. Thus, knowledge was considered as an ab- 
stract commodity independent of the physical device 
within which it was implemented. 

More recent work called into question several of 
these assumptions about the workings of the human 
mind. For example, graded category memberships and 
prototypicality effects in categorization tasks pointed to 
disparities between the normative logical knowledge or- 
ganization and the psychological reality of knowledge 
retrieval [37.48]. Computational modeling of cognitive 
processes has revealed alternative, distributed represen- 
tational networks for computing intelligent responses 
in perceptual, conceptual, and motor tasks that avoid 
the neurophysiologically implausible assumption of lo- 
calized storage of specific knowledge [37.49]. Most 
importantly, though, traditional propositional knowl- 
edge networks were limited to explaining the meaning 
of any given concept in terms of an activation pattern 
across other conceptual nodes, thus effectively defin- 
ing the meaning of one symbol in terms of arbitrary 
other symbols. This process never referred to a concrete 
experience or event and essentially made the process 
of connecting internal and external referents arbitrary. 
In other words, traditional knowledge representations 
never make contact with the specific sensory and motor
modalities, a contact that is essential to imbue the
activation pattern in a network with meaning. This limitation is known
as the grounding problem [37.50] and points to a fun- 
damental flaw in traditional attempts to model human 
knowledge representations. 

A second reason for abandoning traditional amodal 
models of knowledge representation is the fact that 
these models cannot account for patterns of sensory
and motor excitation that occur whenever we activate
our knowledge. Already at the time when symbol ma- 
nipulation approaches to intelligent behavior had their 
heyday there was powerful evidence for a mandatory 
link between intelligent thought and sensory-motor
experience: When matching two images of the same ob- 
ject, the time we need to recognize that it is the same 
object is linearly related to the angular disparity be- 
tween the two views [37.15]. This result suggests that 
the mental comparison process simulates the physical 
object rotation we would perform if the two images 
were manipulable in our hands. In recent years, there
has been further behavioral as well as neuroscientific
evidence of an involvement of sensory-motor processes
in intelligent thought, leading to the influential notion of
action simulation as an obligatory component of intelligent
thought (for a review, see [37.51]).

To summarize, the idea that sensory and motor pro- 
cesses are an integral part of our knowledge is driven 
by both theoretical and empirical considerations. On the 
theoretical side, the EC stance addresses the grounding 
problem, a fundamental limitation of classical views of 
knowledge representation. Empirically, it is difficult for
traditional amodal conceptualizations of knowledge to 
address systematic patterns of sensory and motor biases 
that accompany knowledge activation. 

Amongst the latest developments in robotics and
computational intelligence, the field of developmental
robotics has specifically focused on the essential
role of EC in the ontogenetic development of cognitive
capabilities. Developmental robotics (also known
as epigenetic robotics and as the field of autonomous 
mental development) is the interdisciplinary approach 
to the autonomous design of behavioral and cogni- 
tive capabilities in artificial agents (robots) that takes 
direct inspiration from the developmental principles 
and mechanisms observed in natural cognitive systems 
(children) [37.18, 52-54]. In particular, the key princi- 
ple of developmental robotics is that the robot, using 
a set of intrinsic developmental principles regulating the 
real-time interaction between its body, brain, and en- 
vironment, can autonomously acquire an increasingly 
complex set of sensorimotor and mental capabilities. 
Existing models in developmental robotics have covered
the full range of sensory-motor and cognitive
capabilities, from intrinsic motivation and motor control
to social learning, language, and reasoning with
abstract knowledge (see [37.18] for a full overview).

To demonstrate the benefits of combining EC with 
developmental robotics in the modeling of embodied 
intelligence, the two domains of the action bases of 
language and of the relationship between space and numerical
cognition have been chosen. In Sect. 37.4.2, we
will look at seminal examples of the embodied bases 
of language in psycholinguistics, neuroscience, and de- 
velopmental psychology, and the corresponding devel- 
opmental robotics models. Section 37.4.3 will consider 
EC evidence on the link between spatial and numerical 
cognition, and a developmental robotics model of em- 
bodied language learning. 


37.4.2 Embodied Language Learning 


In experimental psychology and psycholinguistics, an 
influential demonstration of action simulation as part of 
language comprehension was first carried out by Glen- 
berg and Kaschak [37.55]. They asked healthy adults to 
move their right index finger from a button in their mid- 
sagittal plane either away from or toward their body 
to indicate whether a visually presented statement was 
meaningful or not. Sentences like Open the drawer led 
to faster initiation of movements toward than away from 
the body, while sentences like Close the drawer led to 
faster initiation of movements away from than toward 
the body. Thus, there was a congruency effect between 
the implied spatial direction of the linguistic descrip- 
tion and the movement direction of the reader’s motor 
response. This motor congruency effect in language 
comprehension has been replicated and extended (for 
review, [37.56]). It suggests that higher level cognitive 
feats (such as language comprehension) are ultimately 
making use of lower level (sensory-motor) capacities
of the agent, as predicted by an embodied account of 
intelligence. 

In parallel, growing cognitive neuroscience evi- 
dence has shown that the cortical areas of the brain 
specialized for motor processing are also involved in 
language processing tasks, thus supporting the EC view
that action and language are strictly integrated [37.57, 
58]. For example, Hauk et al. [37.59] carried out brain 
imaging experiments where participants read words re- 
ferring to face, arm, or leg actions (e.g., lick, pick, kick). 
Results support the embodied view of language, as the 
linguistic task of reading a word differentially activated 
parts of the premotor area that were directly adjacent to, or
overlapped with, regions activated by actual movement
of the tongue, the fingers, or the feet, respectively. 

The embodied nature of language has also been 
shown in developmental psychology studies, as in 
Tomasello’s [37.60] constructivist theory of language 
acquisition and in Smith and Samuelson’s [37.61] study 
on embodiment biases in early word learning. For example,
Smith and Samuelson [37.61] investigated the
role of embodiment factors such as posture and spa- 
tial representations during the learning of first words. 
They demonstrated the importance of the changes in 
postures involved in the interaction with objects lo- 
cated in different parts (left and right) of the child’s 
peripersonal space. Experimental data with 18-month-old
children show that infants can learn new names
even in the absence of the referent objects, when the
new label is said whilst the child looks at the same
left/right location where the object has previously appeared.
This specific study was the inspiration for a developmental
robotics study on the role of posture in
the acquisition of object names with the iCub baby 
robot [37.62]. 

The iCub is an open source robotic platform devel- 
oped as a benchmark experimental tool for cognitive 
and developmental robotics research [37.63]. It has a total
of 53 DOFs, with a high number of DOFs (32) in
the arms and hands to study object manipulation and 
the role of fine motor skills in cognitive development. 
This facilitates the replication of the experimental setup 
of Smith and Samuelson’s study [37.61]. In the iCub 
experiments, a human tutor shows two novel objects re- 
spectively in the left and right location of a table put in 
front of the robot. Initially the robot moves to look at 
each object and learns to categorize it according to its 
visual features, such as shape and color. Subsequently 
the tutor hides both objects, directs the robot’s atten- 
tion toward the right side where the first object was 
shown and says a new word: Modi. In the test phase 
both objects are presented simultaneously in the centre 
of the table, and the robot is asked Find the modi. The 
robot must then look and point at the object that was 
presented in the right location. Four different experi- 
ments were carried out, as in Smith and Samuelson’s 
child study. Two experiments differ with regard to
the frequency of the left/right locations used to show
each object: the Default Condition, in which each object
always appears in the same location, and the Switch
Condition, in which the position of the two objects is varied
to weaken the object/location spatial association.
In the other two experimental conditions, the object
is named whilst in sight, so as to compare the relative
weighting of the embodiment spatial constraints and the 
time constraint. 

The robot’s behavior is controlled by a modular 
neural network consisting of a series of pretrained 
Kohonen self-organizing maps (SOMs), connected 
through Hebbian learning weights that are trained on- 
line during the experiment [37.64]. The first SOM is 
a color map as it is used to categorize objects according
to their color (average RGB (red-green-blue) color
of the foveal area). The second map, the auditory map, 
is used to represent the words heard by the robot, as 
the result of the automatic speech recognition system. 
The third SOM is the body-hub map, and this is the
key component of the robot’s neural system that imple- 
ments the role of embodiment. The body-hub SOM has 
four inputs, each being the angle of a single joint. In the 
experiments detailed here, only 2 DOFs of the head
(up/down and left/right motors) and 2 DOFs of the
eyes (up/down and left/right motors) are used. Embod- 
iment is operationalized here as the posture of eye and 
head position when the robot has to look to the left and 
to the right of the scene. 

During each experiment, the connection weights
linking the color map and the auditory map to the body-hub
map are adjusted in real time using a Hebbian
learning rule. These Hebbian associative connections 
are only modified from the current active body pos- 
ture node. As the maps are linked together in real time, 
strong connections between objects typically encoun- 
tered in particular spatial locations, and hence in similar 
body postures, build up. 
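
The way in which posture can bind a word to an object that is no longer in view can be illustrated with a heavily simplified sketch. It assumes that each map is reduced to the label of its winning node with activation 1, so that the Hebbian update becomes a simple increment on the links of the currently active posture node. The class name, its methods, and the learning rate are hypothetical; the model in [37.64] uses full SOMs and graded activations.

```python
from collections import defaultdict


class BodyHub:
    """Toy posture hub: color and word nodes are linked only via posture nodes."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        # Link weights keyed by (node, posture); all start at zero.
        self.color_links = defaultdict(float)
        self.word_links = defaultdict(float)

    def observe(self, posture, color=None, word=None):
        # Hebbian-style update: strengthen only the links that end in the
        # currently active posture node (pre- and post-activation are both 1).
        if color is not None:
            self.color_links[(color, posture)] += self.learning_rate
        if word is not None:
            self.word_links[(word, posture)] += self.learning_rate

    def color_for_word(self, word):
        # Step 1: find the posture most strongly associated with the word.
        postures = {p: w for (wd, p), w in self.word_links.items() if wd == word}
        if not postures:
            return None
        best_posture = max(postures, key=postures.get)
        # Step 2: find the color most strongly associated with that posture.
        colors = {c: w for (c, p), w in self.color_links.items() if p == best_posture}
        if not colors:
            return None
        return max(colors, key=colors.get)
```

In a usage that mirrors the Default Condition, one would call observe with posture "right" and a color several times, do the same for posture "left" and a second color, and then call observe with posture "right" and the word "modi" while no object is in view. A later call to color_for_word("modi") then returns the color that was repeatedly seen on the right, even though word and color were never active at the same time.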

To replicate the four experimental conditions of 
Smith and Samuelson [37.61], 20 different robots were 
used in each condition, with new random weights for 
the SOM and Hebbian connections. Results from the 
four conditions show a very high match between the 
robot’s data and the child experiment results, closely 
replicating the variations in the four conditions. For 
example, in the Default Condition 83% of the trials 
resulted in the robots selecting the spatially linked 
objects, whilst in the Switch condition, where the 
space/object association was weakened, the robots’ 
choices were practically due to chance at 55%. Smith 
and Samuelson [37.61] reported 71% of children se- 
lected the spatially linked object, versus 45% in the 
Switch condition. 

This model demonstrates that it is possible to build 
an embodied cognitive system that develops linguis- 
tic and sensorimotor capabilities through interaction 
with the world, closely resembling the embodiment 
strategies observed in children’s early word
learning. Other cognitive robotics models have also 
been developed which exploit the principle of embod- 
iment in robots’ language learning, as in models of 
compositionality in action and language [37.65-68], in 
models of the cultural evolution of construction gram- 
mar [37.69, 70], and the modeling of the grounding of 
abstract words [37.71]. 


37.4.3 Number and Space 


Number concepts have long been considered as pro- 
totypical examples of abstract and amodal concepts 
because their acquisition would require generalizing 
across a large range of instances to discover the in- 
variant cardinality meaning of words such as two and 
four [37.72]. Mental arithmetic would therefore appro- 
priately be modeled as abstract symbol manipulation, 
such as incrementing a counter or retrieving factual 
knowledge [37.73]. But evidence for an inescapable 
reference back from abstract number concepts to the 
sensory-motor experiences during concept acquisition
has been present for a long time. Specifically, Moyer 
and Landauer [37.74] showed that the speed of decid- 
ing which of two visually presented digits represents 
the larger number depends on their numerical distance, 
with faster decisions for larger distances. Thus, even in 
the presence of abstract symbols we seem to refer to 
analog representations, as if comparing sensory impres- 
sions of small and large object compilations. 

More recent studies provided further evidence that 
sensory-motor experiences have a strong impact on the
availability of number knowledge. This embodiment 
signature can be documented by measuring the speed 
of classifying single digits as odd or even with lateral- 
ized response buttons. The typical finding is that small 
numbers (1, 2) are classified faster with left responses 
and large numbers (8, 9) are classified faster with right 
responses [37.76]. This spatial-numerical association of
response codes, or SNARC effect, has been replicated
across several tasks and extended to other effectors (for 
review [37.77]), including even attention shifts to the 
left or right side induced by small or large numbers, re- 
spectively [37.78]. 

Importantly, SNARC depends on one’s sensory-motor
experiences, such as directional scanning and
finger counting habits, as well as current task de- 
mands. For example, the initial acquisition of num- 
ber concepts in childhood occurs almost universally 
through finger counting and this learning process leaves 
a residue in the number knowledge of adults. Those 
who start counting on their left hand, thereby associ- 
ating small numbers with left space, have a stronger 
SNARC than those who start counting on their right 
hand [37.79]. Similarly, reading direction modulates 
the strength of SNARC. In the original report by
Dehaene et al. [37.76], it was noticed that adults from
a right-to-left reading culture presented with weaker 
or even reversed SNARC. The notion of a spill-over 
of directional reading habits into the domain of num-
ber knowledge was further supported by developmental
studies showing that it takes around 3 years of schooling 
before the SNARC emerges [37.80]. However, more re- 
cent work has found SNARC even in preschoolers (for 
review [37.81]), thus lending credibility to the role of
embodied practices such as finger counting in the for- 
mation of SNARC. 

In a recent series of experiments with Russian-Hebrew
bilinguals, Shaki et al. [37.82-84] (for review
[37.85]) documented that both one’s habitual reading
direction and the most recent, task-specific scanning di- 
rection determine the strength of one’s SNARC. These 
findings make clear that SNARC is a compound effect 
where embodied and situated (task-specific) factors add 
different weights to the overall SNARC. 

SNARC and other biases extend into more complex 
numerical tasks such as mental arithmetic. For exam- 
ple, the association of larger numbers with right space 
is also present during addition (the operational momen- 
tum or OM effect). Regardless of whether symbolic 
digits or nonsymbolic dot patterns are added together, 
participants tend to over-estimate the sum, and this 
bias also influences spatial behavior [37.86]. More gen- 
erally, intelligent behavior such as mental arithmetic 
seems to reflect component processes (distance effect, 
SNARC effect, OM effect) that are grounded in senso- 
rimotor experiences. 

The strong link between spatial cognition and num- 
ber knowledge permits the modeling of the embodiment 
processes in the acquisition of number in robots. This 
has been the case with the recent developmental model 
developed by Rucinski et al. [37.75,87] to model the 
SNARC effect and the contribution of pointing ges- 
tures in number acquisition. In the first study [37.75], 
a simulation model of the iCub is used. The robot is 
first trained to develop a body schema of the upper
body through motor babbling of its arms. The iCub is
subsequently trained to recognize numbers by
associating quantities of objects with numerical sym-
bols such as 7 and 2. In the SNARC test case, the robot has
to perform a psychology-like experiment and press
a left or right button in number comparison
and parity judgment tasks (Fig. 37.2b).

The robot’s cognitive architecture is based on 
a modular neural network controller with two main 
components, following inspiration from a connectionist 
model of numerical cognition [37.88] and the TRoPI- 
CALS cognitive architecture of Caligiore et al. [37.89, 
90]. The two main components of the neural control 
system are: (i) ventral pathway network, responsible for 
processing of the identity of objects as well as task- 
dependent decision making and language processing; 
and (ii) dorsal pathway network, involved in process- 
ing of spatial information about locations and shapes of 
objects and processing for the robot’s action. 

The ventral pathway is modeled, following Chen 
and Verguts [37.88], with a symbolic input which en- 
codes the alphanumerical number symbols of numbers 
from 1 to 15, a mental number line encoding the num- 
ber meaning (quantity), a decision layer for the number 
comparison and parity judgment tasks, and a response 
layer, with two neurons for left/right hand response se- 
lection. The dorsal pathway is composed of a number 
of SOMs which code for spatial locations of objects 
in the robot’s peripersonal space. One map is associ-
ated with gaze direction, and one map with each of
the robot’s left and right arms. The input to the
gaze map arrives from the 3-dimensional proprioceptive 
vector representing the robot gaze direction (azimuth, 
elevation and vergence). The input to each arm position 
map consists of a 7-dimensional proprioceptive vector 
representing the position of the relevant arm joints. This
dorsal pathway constitutes the core component of the
model where the embodied properties of the model are 
directly implemented as the robot’s own sensorimotor 
maps. 

To model the developmental learning processes in- 
volved in number knowledge acquisition, a series of 
training phases are implemented. For the embodiment 
part, the robot is first trained to perform a process equiv- 
alent to motor babbling, to develop the gaze and arm 
space maps. With motor babbling the robot builds its 
internal visual and motor space representations (SOMs) 
by performing random reaching movements to touch 
a toy in its peripersonal space, whilst following its 
hand’s position. Transformations between the visual 
spatial map for gaze and the maps of reachable left 
and right spaces are implemented as connections be- 
tween the maps, which are learned using the classical 
Hebbian rule. At each trial of motor babbling, the gaze
and the appropriate arm are directed toward the same
point, and the resulting co-activations in the already devel-
oped spatial maps are used to establish links between
them.
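
The Hebbian linking of the gaze and arm maps described above can be illustrated with a minimal sketch. It is not the implementation used in [37.75]: the maps are reduced here to one-dimensional rows of units with Gaussian tuning (an assumption made for brevity, in place of trained SOMs), and the names map_activation and babble_and_learn, as well as all parameter values, are hypothetical.

```python
import numpy as np

def map_activation(x, centers, width=0.05):
    """Gaussian population activity of map units with preferred positions `centers`."""
    return np.exp(-((x - centers) ** 2) / (2 * width ** 2))

def babble_and_learn(n_trials=3000, n_gaze=10, n_arm=10, lr=0.01, seed=0):
    """Motor babbling: gaze and arm reach the same random point on each trial."""
    rng = np.random.default_rng(seed)
    gaze_centers = np.linspace(0.0, 1.0, n_gaze)   # assumed 1-D stand-in for the gaze map
    arm_centers = np.linspace(0.0, 1.0, n_arm)     # assumed 1-D stand-in for one arm map
    weights = np.zeros((n_arm, n_gaze))            # links from gaze units to arm units
    for _ in range(n_trials):
        target = rng.uniform(0.0, 1.0)             # random reaching target
        gaze = map_activation(target, gaze_centers)
        arm = map_activation(target, arm_centers)
        weights += lr * np.outer(arm, gaze)        # classical Hebbian rule: co-activation
    return weights

weights = babble_and_learn()
```

After babbling, each gaze unit is most strongly linked to the arm unit that codes for the same region of space, which is the kind of visuo-motor transformation the paragraph describes.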

The next developmental training establishes the 
links between number words (modeled as activations 
in the ventral input layer) and the number meaning 
(activations in the mental number line hidden layer). 
Subsequently the robot is taught to count. This stage 
models the cultural biases that result in the internal as- 
sociation of small numbers with the left side of space 
and large numbers with the right side. As an example 
of these biases, we considered a tendency of children 
to count objects from left to right, which is related to 
the fact that European culture is characterized by left- 
to-right reading direction [37.91]. In order to model the 
process of learning to count, the robot was exposed to 
an appropriate sequence of number words (fed to the 
ventral input layer of the model network), while at the 
same time the robot’s gaze was directed toward a spe- 
cific location in space (via the input to the gaze visual 
map). These spatial locations were generated in such 
a way that their horizontal coordinates correlated with 
number magnitude (small numbers presented on the 
left, large numbers on the right) with a certain amount 
of Gaussian noise. During this stage, Hebbian learning
established links between number words and stimulus lo-
cations in the visual field.
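
As a rough sketch of this counting stage, the following code presents number words together with noisy, magnitude-correlated gaze locations and accumulates Hebbian links between them. The one-dimensional gaze map, the function name train_counting, and every parameter value (noise level, learning rate, map size) are assumptions chosen for illustration rather than values taken from the model.

```python
import numpy as np

def train_counting(n_numbers=15, n_gaze=21, n_sweeps=200,
                   noise_sd=0.1, lr=0.05, seed=1):
    """Associate number words with gaze locations that grow from left to right."""
    rng = np.random.default_rng(seed)
    gaze_centers = np.linspace(0.0, 1.0, n_gaze)    # 0 = far left, 1 = far right
    links = np.zeros((n_numbers, n_gaze))           # number word -> gaze map unit
    for _ in range(n_sweeps):
        for number in range(1, n_numbers + 1):
            # Horizontal position correlates with magnitude, plus Gaussian noise.
            x = (number - 1) / (n_numbers - 1) + rng.normal(0.0, noise_sd)
            gaze = np.exp(-((x - gaze_centers) ** 2) / (2 * 0.05 ** 2))
            links[number - 1] += lr * gaze          # Hebbian update; word unit activity is 1
    return gaze_centers, links

gaze_centers, links = train_counting()
# Centre of mass of the learned links: the location each number word points to.
preferred_x = links @ gaze_centers / links.sum(axis=1)
```

In this sketch, small number words end up linked to the left of the gaze map and large ones to the right, despite the noise in individual presentations.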

Finally, the model is trained to perform number rea- 
soning tasks, such as number comparison and parity 
judgment, which corresponds to establishing appropri- 
ate links between the mental number line hidden layer 
and neurons in the decision layer. Specifically, one
experiment focuses on the modeling of the SNARC ef-
fect. The robot’s reaction times (i.e., the amount of activity
needed to exceed a response threshold in one of the two
response nodes) in parity judgment and number com-
parison tasks were recorded to calculate the difference
between right hand and left hand RTs for the same num- 
ber. When difference values are plotted against number 
magnitudes the SNARC effect manifests itself in a neg- 
ative slope as in Fig. 37.2. As the connections between 
visual and motor maps form a gradient from left to 
right, the links to the left arm map become weaker, 
while those to the right become stronger. Thus, when 
a small number is presented, internal connections lead 
to stronger automatic activation of the representations 
linked with the left arm than that of the right arm, thus 
causing the SNARC effect. 
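
The analysis described above (difference between right-hand and left-hand RTs, plotted against number magnitude) can be sketched as a simple linear fit. The reaction times below are synthetic numbers invented purely to exercise the code; they are not data from the robot or from any experiment, and the function name snarc_slope is hypothetical.

```python
import numpy as np

def snarc_slope(numbers, rt_left, rt_right):
    """Fit dRT = RT_right - RT_left against number magnitude with a straight line."""
    d_rt = np.asarray(rt_right, dtype=float) - np.asarray(rt_left, dtype=float)
    slope, intercept = np.polyfit(numbers, d_rt, deg=1)
    return slope, intercept, d_rt

# Synthetic illustration only: left responses get slower and right responses
# get faster as the number grows.
numbers = np.arange(1, 10)
rt_left = 500.0 + 4.0 * (numbers - 5)
rt_right = 500.0 - 4.0 * (numbers - 5)

slope, intercept, d_rt = snarc_slope(numbers, rt_left, rt_right)
```

A negative slope indicates a SNARC-like pattern: right-hand responses become relatively faster as numbers get larger.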

This model of space and number knowledge was 
also extended to include a more active interaction with 
the environment during the number learning process. 
This is linked to the fact that gestures such as point-
ing at the object being counted, or the actual touching
of the objects enumerated, have been shown to improve
the overall counting performance in children [37.92]. In
the subsequent model by Rucinski et al. [37.87], a sim- 
pler neural control architecture was used based on the 
Elman recurrent network to allow sequential number 
counting and the representation of gestures as propri- 
oceptive states for the pointing gestures. The robot has 
to learn to produce a sequence of number words (from 
1 to 10) with the length of the sequence equivalent to 
the number of objects present in the scene. Visual in- 
put to the model is a one-dimensional saliency map, 
which can be considered a simple model of a retina. 
As an additional input, a proprioceptive signal ob-
tained from a pointing gesture performed by the iCub
humanoid robot is used to implement the gestural
input to the model in the pointing condition. The out-
put nodes encode the phonetic representation of the 10
numbers. 
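
To make the architecture more concrete, here is a minimal, untrained sketch of an Elman-style recurrent step that combines a one-dimensional saliency map with a proprioceptive gesture vector. The layer sizes, the random weights, the class name ElmanCounter, and the softmax over ten word units (used in place of the phonetic output coding of the original model) are all assumptions for illustration.

```python
import numpy as np

class ElmanCounter:
    """Elman network: the hidden state is copied into a context layer at each step."""

    def __init__(self, n_visual=20, n_gesture=4, n_hidden=16, n_words=10, seed=0):
        rng = np.random.default_rng(seed)
        n_in = n_visual + n_gesture
        self.w_in = rng.normal(0.0, 0.1, (n_hidden, n_in))       # input -> hidden
        self.w_ctx = rng.normal(0.0, 0.1, (n_hidden, n_hidden))  # context -> hidden
        self.w_out = rng.normal(0.0, 0.1, (n_words, n_hidden))   # hidden -> output
        self.context = np.zeros(n_hidden)

    def step(self, visual, gesture):
        x = np.concatenate([visual, gesture])
        hidden = np.tanh(self.w_in @ x + self.w_ctx @ self.context)
        self.context = hidden                   # remembered for the next time step
        logits = self.w_out @ hidden
        exp = np.exp(logits - logits.max())     # numerically stable softmax
        return exp / exp.sum()

net = ElmanCounter()
visual = np.zeros(20)
visual[[3, 9, 15]] = 1.0                        # three salient objects in the scene
gesture = np.zeros(4)                           # no pointing signal in this run
outputs = [net.step(visual, gesture) for _ in range(3)]
```

In the pointing condition the gesture vector would carry the arm's proprioceptive state at each step instead of zeros; training the weights (for example with backpropagation) is omitted here.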

During the experiment, the robot is first trained 
to recite a sequence of number words. Then, in order 
to assess the impact of the proprioceptive informa- 
tion connected with the pointing gesture, the training 
is divided into two conditions: (i) training to count 
the number of objects shown to the visual input in 
the absence of the proprioceptive gesture signal, and 
(ii) counting through pointing, via the activation of the
gesture proprioceptive input. Results show that such 
a simple recurrent architecture benefits from the input 
of the proprioceptive gesturing signal, with improved 
counting accuracy. In particular, the model reproduces
the quantitative effects of gestures on the counted
set size, replicating child psychology data reported 
in [37.92]. 

Overall, such a developmental robotics model
clearly shows that the modeling of embodiment phe-
nomena, such as the use of spatial representation in
number judgments, and of the pointing gestures for
number learning, can allow us to understand the acqui-
sition of abstract concepts in humans as well as artificial
agents and robots. This further demonstrates the benefit
of the embodied intelligence approach to model a range
of behavioral and cognitive phenomena from simple
sensory-motor tasks to higher order linguistic and ab-
stract cognition tasks.


37.5 Conclusion


This chapter has provided an overview of the three key
principles of embodied intelligence, namely morpho-
logical computation, sensory-motor coordination, and
EC, and of the experimental approaches and models
from evolutionary robotics and developmental robotics.
The wide range of behavioral and cognitive capabil-
ities modeled through evolutionary and developmen-
tal experiments (e.g., locomotion in different environ-
ments, navigation and object discrimination, posture in
early word learning and space and number integration)
demonstrates the impact of embodied intelligence in the
design of a variety of perceptual, motor and cognitive
skills, including the potential to model the embodied
basis of abstract knowledge as in numerical cognition.

The current progress of both evolutionary and de-
velopmental models of embodied intelligence, although
showing significant scientific and technological ad-
vances in the design of embodied and situated agents,
still faces a series of open challenges and issues. These
issues are informing ongoing work in the various fields
of embodied intelligence.

One open challenge in morphological computation
concerns how best to automatically design the body
plans of robots so that they can best exploit this phe-
nomenon. In parallel to this, much work remains to
be done to understand what advantages morphological
computation confers on a robot. For one, it is likely
that a robot with a simpler control policy will be more
robust to unanticipated situations: for example, the jam-
ming gripper is able to grasp multiple objects with
the same control strategy, whereas a rigid hand requires differ-
ent control strategies for different objects. Secondly,
a robot that performs more morphological computa-
tion may be more easily transferred from the simulation
in which it was evolved to a physical machine: with
a simpler control policy there is less that can go wrong
when experiencing the different sensor signals and mo-
tor feedback generated by operation in the physical
world.


Evolving robots provides a unique opportunity for 
developing rigorous methods for measuring whether 
and how much morphological computation a robot per- 
forms. For instance, if evolutionary algorithms can be 
designed that produce robots with similar abilities yet 
different levels of control and morphological complex- 
ity, and it is found that in most cases reduced control 
complexity implies greater morphological complexity, 
this would provide evidence for the evolution of mor- 
phological computation. 

The emerging field of soft robotics [37.93] provides 
much opportunity for exploring the various aspects 
of morphological computation because the space of 
all possible soft robot body plans — with large vari- 
ations in shape and continuous admixtures of hard 
and soft materials — is much larger than the space 
of rigid linkages traditionally employed in classical 
robots. 

The design issue, i.e., the question of how systems
able to exploit coordinated action and perception pro-
cesses can be designed, represents an open challenge
for sensory-motor coordination as well. As illustrated
above, adaptive techniques in which the fine-grained
characteristics that determine how agents react to cur-
rent and previous sensory states are varied randomly,
and in which variations are retained or discarded on
the basis of their effects at the level of the overall be-
havior exhibited by the agents interacting with their
environment, constitute an effective method. However,
this method might not scale up well with the number of
parameters to be adapted. The question of how sensory-motor
coordination capabilities can be acquired through
the use of other learning techniques that rely on
shorter-term feedback represents an open issue. An in-
teresting research direction, in that respect, consists in
the hypothesis that the development of sensory-motor
coordination can be induced through the use of
task-independent criteria such as information theoretic mea-
sures [37.94, 95].

Another important research direction concerns the
theoretical elaboration of the different roles that mor-
phological computation and sensory-motor coordina-
tion can play and the clarification of the relation-
ship between processes occurring as a result of the
agent/environmental interactions and processes occur-
ring inside the agents’ nervous systems.
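
The adaptive technique referred to above, in which parameters are varied randomly and variations are retained or discarded according to their effect on overall behavior, can be sketched as a simple one-parent, one-offspring search. This is only an illustration of the principle: toy_fitness is a hypothetical stand-in for evaluating a robot's behavior in its environment, and the names and parameter values are assumptions.

```python
import numpy as np

def evolve(fitness, n_params=5, n_generations=300, sigma=0.1, seed=0):
    """Random variation with selective retention over a parameter vector."""
    rng = np.random.default_rng(seed)
    best = rng.normal(0.0, 1.0, n_params)          # initial control parameters
    best_fit = fitness(best)
    for _ in range(n_generations):
        candidate = best + rng.normal(0.0, sigma, n_params)   # random variation
        cand_fit = fitness(candidate)              # judged only at the behavioral level
        if cand_fit >= best_fit:                   # retain variations that do not hurt
            best, best_fit = candidate, cand_fit
    return best, best_fit

def toy_fitness(params):
    """Hypothetical behavioral score: higher is better, maximum 0 at params == 0.5."""
    return -float(np.sum((params - 0.5) ** 2))

best, best_fit = evolve(toy_fitness)
```

Because each evaluation scores the whole parameter vector at once, the number of evaluations needed tends to grow with the number of parameters, which is the scaling concern raised in the text.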

In developmental robotics models of EC the is- 
sues of open-ended, cumulative learning and of the 
scaling up of the sensory-motor and cognitive reper-
toires still require significant efforts and novel method-
ological and theoretical approaches. Another issue,
which combines both evolutionary and developmen- 
tal approaches, is the interaction of phylogenetic and 
ontogenetic phenomena in the body/environment/brain 
adaptation. 

Human development is characterized by cumula- 
tive, open-ended learning. This refers to the fact that 
learning and development do not start and stop at 
specific stages, but rather this is a life-long learning 
experience. Moreover, the skills acquired in various de- 
velopmental stages are accumulated and integrated to 
support the further acquisition of more complex capa- 
bilities. One consequence of cumulative, open-ended 
learning is cognitive bootstrapping. For example, in lan-
guage development there is the phenomenon of the vocabulary
spurt, in which the knowledge and experience
from the slow learning of the first 50-100 words cause
a redefinition of the word learning strategy, and that of syn-
tactic bootstrapping, where children rely on syntactic
cues and word context in verb learning to determine the
meaning of new verbs [37.96]. Although some com- 
putational intelligence models of the vocabulary spurt 
exist [37.97], robotic experiments on language learning 
have been restricted to smaller lexicons, not reaching 
the critical threshold to allow extensive modeling of 
the bootstrapping of the agent’s lexicon and grammar 
knowledge. These current limitations are also linked 
to the general issue of the scaling up of the robot’s 
motor and cognitive capabilities and of cross-modal 
learning. Most of the current cognitive robotics models 
typically focus on the separate acquisition of only one 
task or modality (perception, or phonetics, or semantics 
etc.), often with limited repertoires rarely reaching 10 
or slightly more learned actions or words. Thus a truly 
online, cross-modal, cumulative, open-ended develop- 
mental robotics model remains a fundamental challenge 
to the field. 


Another key challenge for future research is the 
modeling of the interaction of the different timescales 
of adaptation in embodied intelligence, that is, between
phylogenetic (evolutionary) factors and ontogenetic 
(development, learning, maturation) phenomena. For 
example, maturation refers to changes in the anatomy 
and physiology of both the child’s brain and the body, 
especially during the first years of life. Maturational 
phenomena related to the brain include the decrease of 
brain plasticity during early development, whilst matu- 
ration in the body is more evident due to the significant 
morphological growth changes a child goes through 
from birth to adulthood (see Thelen and Smith’s analy- 
sis of crawling and walking [37.98]). The ontogenetic 
changes due to maturation and learning have impor- 
tant implications for the interaction of development 
with phylogenetic changes due to evolution. Body mor-
phology and brain plasticity variations can in fact be
explained as evolutionary adaptations of the species to
a changing environmental context, as with heterochronic
changes [37.99]. For example, Elman et al. [37.43]
discuss how genetic and heterochronic mechanisms
provide an alternative explanation of the nature/nurture
debate, where genetic phenomena produce architectural
constraints of the organism’s brain and body, which
subsequently control and affect the results of learn-
ing interaction. Following this, Cangelosi [37.100] has
tested the effects of heterochronic changes in the evo- 
lution of neural network architectures for simulated 
robotic agents. 

The interaction between ontogenetic and phylo-
genetic factors has been investigated through evo-
lutionary robotics models. For example, Hinton and
Nowlan [37.101] and Nolfi et al. [37.102] have devel-
oped evolutionary computational models explaining
the effects of learning in evolution. The modeling of 
the evolution of varying body and brain morpholo- 
gies in response to phylogenetic and ontogenetic re- 
quirements is also the goal of the evo-devo field of 
computational intelligence [37.7, 103-105]. These evo-
lutionary/ontogenetic interaction models have, how-
ever, mostly focused on simple sensory-motor tasks
such as navigation and foraging. Future work com- 
bining evolutionary and developmental robotics mod- 
els can better provide theoretical and technological 
understanding of the contribution of different adapta-
tion time scales and mechanisms in embodied intelli-
gence.


38. Neuromorphic Engineering


Giacomo Indiveri 


Neuromorphic engineering is a relatively young 
field that attempts to build physical realizations 
of biologically realistic models of neural sys- 
tems using electronic circuits implemented in very 
large scale integration technology. While originally 
focusing on models of the sensory periphery im- 
plemented using mainly analog circuits, the field 
has grown and expanded to include the modeling 
of neural processing systems that incorporate the 
computational role of the body, that model learn- 
ing and cognitive processes, and that implement 
large distributed spiking neural networks using 
a variety of design techniques and technologies. 
This emerging field is characterized by its multi- 
disciplinary nature and its focus on the physics 
of computation, driving innovations in theoretical 
neuroscience, device physics, electrical engineer- 
ing, and computer science. 


38.1 The Origins 


Models of neural information processing systems that 
link the type of information processing that takes place 
in the brain with theories of computation and com- 
puter science date back to the origins of computer 
science itself [38.1, 2]. The theory of computation based 
on abstract neural networks models was developed al- 
ready in the 1950s [38.3,4], and the development of 
artificial neural networks implemented on digital com- 
puters was very popular throughout the 1980s and the 
early 1990s [38.5—8]. Similarly, the history of imple- 
menting electronic models of neural circuits extends 
back to the construction of perceptrons in the late 
1950s [38.3] and retinas in the early 1970s [38.9]. How- 
ever, the modern wave of research utilizing very large 
scale integration technology and emphasizing the non- 
linear current characteristics of the transistor to study 
and implement neural computation began only in the 
mid-1980s, with the collaboration that sprang up be-
tween scientists such as Max Delbrück, John Hopfield,
Carver Mead, and Richard Feynman [38.10]. Inspired 
by graded synaptic transmission in the retina, Mead 
sought to use the graded (analog) properties of tran- 
sistors, rather than simply operating them as on-off 
(digital) switches, to build circuits that emulate biologi- 
cal neural systems. He developed neuromorphic circuits 
that shared many common physical properties with pro- 
teic channels in neurons, and that consequently required 
far fewer transistors than digital approaches to emulat- 
ing neural systems [38.11]. Neuromorphic engineering 
is the research field that was born out of this activity 
and which carries on that legacy: it takes inspiration 
from biology, physics, mathematics, computer science, 
and engineering to design artificial neural systems for 
carrying out robust and efficient computation using low 
power, massively parallel analog very large scale in-
tegration (VLSI) circuits that operate with the same
physics of computation present in the brain [38.12]. In-
deed, this young research field was born both out of
the Physics of Computation course taught at Caltech 
by Carver Mead, John Hopfield, and Richard Feynman 
and with Mead’s textbook Analog Very Large Scale In- 
tegration and Neural Systems [38.11]. Prominent in the 
early expansion of the field were scientists and engi- 
neers such as Christof Koch, Terry Sejnowski, Rodney 
Douglas, Andreas Andreou, Paul Mueller, Jan van der 
Spiegel, and Eric Vittoz, training a generation of cross- 
disciplinary students. Examples of successes in neuro- 
morphic engineering range from the first biologically 
realistic silicon neuron [38.13], or realistic silicon mod- 
els of the mammalian retina [38.14], to more recent 
silicon cochlea devices potentially useful for cochlear 
implants [38.15], or complex distributed multichip ar- 
chitectures for implementing event-driven autonomous 
behaving systems [38.16]. 

It is now a well-established field [38.17], with 
two flagship workshops (the Telluride Neuromorphic 
Engineering [38.18] and Capo Caccia Cognitive Neu-
romorphic Engineering [38.19] workshops) that are
currently still held every year. Neuromorphic circuits
are now being investigated by many academic and in- 
dustrial research groups worldwide to develop a new 
generation of computing technologies that use the same 
organizing principles of the biological nervous sys- 
tem [38.15,20,21]. Research in this field represents 
frontier research as it opens new technological and 
scientific horizons: in addition to basic science ques- 
tions on the fundamental principles of computation 
used by the cortical circuits, neuromorphic engineering 
addresses issues in computer science and electrical en-
gineering which go well beyond established frontiers
of knowledge. A major effort is now being invested 
for understanding how these neuromorphic computa- 
tional principles can be implemented using massively 
parallel arrays of basic computing elements (or cores), 
and how they can be exploited to create a new gener- 
ation of computing technologies that takes advantage 
of future (nano)technologies and scaled VLSI pro- 
cesses, while coping with the problems of low-power 
dissipation, device unreliability, inhomogeneity, fault 
tolerance, etc. 


38.2 Neural and Neuromorphic Computing 


Neural computing (or neurocomputing) is concerned 
with the implementation of artificial neural networks 
for solving practical problems. Similarly, hardware im- 
plementations of artificial neural networks (neurocom- 
puters) adopt mainly statistics and signal processing 
methods to solve the problem they are designed to 
tackle. These algorithms and systems are not neces- 
sarily tied to detailed models of neural or cortical 
processing. Neuromorphic computing on the other hand 
aims to reproduce the principles of neural computa- 
tion by emulating as faithfully as possible the detailed 
biophysics of the nervous system in hardware. In this 
respect, one major characteristic of these systems is 
their use of spikes for representing and processing sig- 
nals. This is not an end in itself: spiking neural networks 
represent a promising computational paradigm for solv- 
ing complex pattern recognition and sensory processing 
tasks that are difficult to tackle using standard ma- 
chine vision and machine learning techniques [38.22, 
23]. Much research has been dedicated to software 
simulations of spiking neural networks [38.24], and 
a wide range of solutions have been proposed for solv- 
ing real-world and engineering problems [38.25, 26]. 
Similarly, there are projects that focus on software
simulations of large-scale spiking neural networks for
exploring the computational properties of models of 
cortical circuits [38.27,28]. Recently, several research 
projects have been established worldwide to develop 
large-scale hardware implementations of spiking neural 
systems using VLSI technologies, mainly for allow- 
ing neuroscientists to carry out simulations and virtual 
experiments in real time or even faster than real-time 
scales [38.29-31]. Although dealing with hardware im- 
plementations of neural systems, either with custom 
VLSI devices or with dedicated computer architectures, 
these projects represent the conventional neurocomput- 
ing approaches, rather than neuromorphic-computing 
ones. Indeed, these systems are mainly concerned with 
fast and large simulations of spiking neural networks. 
They are optimized for speed and precision, at the cost 
of size and power consumption (which ranges from 
megawatts to kilowatts, depending on which approach 
is followed). An example of an alternative large-scale 
spiking neural network implementation that follows the 
original neuromorphic engineering principles (i.e., that
exploits the characteristics of VLSI technology to di- 
rectly emulate the biophysics and the connectivity of 
cortical circuits) is represented by the Neurogrid sys-
tem [38.32]. This system comprises an array of 16 VLSI
chips, each integrating mixed analog neuromorphic 
neuron and synapse circuits with digital asynchronous 
event routing logic. The chips are assembled on a 16.5 ×
19 cm² printed circuit board, and the whole system can
model over one million neurons connected by billions
of synapses in real time, using only about 3 W of
power [38.32].
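
As a rough software illustration of what a spiking unit computes, the sketch below simulates a leaky integrate-and-fire neuron with forward-Euler integration. This is a common textbook simplification and is not meant to describe the analog circuits of Neurogrid or of any other system cited above; the function name simulate_lif and all parameter values are assumptions.

```python
def simulate_lif(current, dt=1e-3, tau=0.02, v_rest=0.0,
                 v_thresh=1.0, v_reset=0.0, resistance=1.0):
    """Leaky integrate-and-fire neuron; returns the list of spike times in seconds."""
    v = v_rest
    spike_times = []
    for step, i_in in enumerate(current):
        # Membrane potential leaks toward rest and integrates the input current.
        v += dt * (-(v - v_rest) + resistance * i_in) / tau
        if v >= v_thresh:                 # threshold crossing emits a spike (an event)
            spike_times.append(step * dt)
            v = v_reset                   # potential is reset after the spike
    return spike_times

strong = simulate_lif([1.5] * 1000)       # input drives the potential above threshold
weak = simulate_lif([0.5] * 1000)         # potential settles below threshold
```

With the assumed parameters, the stronger input produces a regular spike train while the weaker one produces none, so the information is carried by the timing and rate of discrete events rather than by a continuously sampled value.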

Irrespective of the approach followed, these projects 
have two common goals: on the one hand, they aim to
advance our understanding of neural processing in
the brain by developing models and physically build-
ing them using electronic circuits, and on the other
they aim to exploit this understanding for develop- 
ing a new generation of radically different non-von 
Neumann computing technologies that are inspired by 
neural and cortical circuits. In this interdisciplinary 
journey neuroscience findings will influence theoreti- 
cal developments, and these will determine specifica- 
tions and constraints for developing new neuromor- 
phic circuits and systems that can implement them 
optimally. 


38.3 The Importance of Fundamental Neuroscience 


The neocortex is a remarkable computational de- 
vice [38.33]. It is the neuronal structure in the brain that 
most expresses biology’s ability to implement percep- 
tion and cognition. Anatomical and neurophysiological 
studies have shown that the mammalian cortex with its 
laminar organization and regular microscopic structure 
has a surprisingly uniform architecture [38.34]. Since 
the original work of Gilbert and Wiesel [38.35] on the 
neural circuits of visual cortex it has been argued that 
this basic architecture and its underlying computational
principles can be understood
in terms of the laminar distribution of relatively few 
classes of excitatory and inhibitory neurons [38.34]. 
Based on these slow, unreliable and inhomogeneous 
computing elements, the cortex easily outperforms to-
day’s most powerful computers in a wide variety of
computational tasks such as vision, audition, or mo- 
tor control. Indeed, despite the remarkable progress 
in information and communication technology and the 
vast amount of resources dedicated to information and 
communication technology research and development, 
today’s fastest and largest computers are still not
able to match neural systems, when it comes to carry- 
ing out robust computations in real-world tasks. The 
reasons for this performance gap are not yet fully un- 
derstood, but it is clear that one fundamental difference 
between the two types of computing systems lies in the 
style of computation. Rather than using Boolean logic, 
precise digital representations, and clocked operations, 
nervous systems carry out robust and reliable computa- 
tion using hybrid analog/digital unreliable components; 
they emphasize distributed, event driven, collective, 
and massively parallel mechanisms, and make exten- 
sive use of adaptation, self-organization and learning. 
Specifically, the patchy organization of the neurons in
the cortex suggests a computational machine where
populations of neurons perform collective computa- 
tion in individual clusters, transmit the results of this 
computation to neighboring clusters, and set the local 
context of the cluster by means of feedback connec- 
tions from/to other relevant cortical areas. This overall 
graphical architecture resembles graphical processing 
models that perform Bayesian inference [38.36, 37]. 
However, the theoretical knowledge for designing and 
analyzing these models is limited mainly to graphs 
without loops, while the cortex is characterized by 
massive recurrent (loopy) connectivity schemes. Recent
studies exploring loopy graphical models related to
cortical architectures have started to emerge [38.33, 38],
but issues of convergence and accuracy remain unresolved,
and hardware implementations in cortical architectures
composed of spiking neurons have not yet been addressed.

Understanding the fundamental computational prin- 
ciples used by the cortex, how they are exploited for 
processing, and how to implement them in hardware, 
will allow us to develop radically novel computing 
paradigms and to construct a new generation of information
and communication technologies that combine
the strengths of silicon technology with the perfor- 
mance of brains. Indeed fundamental research in neu- 
roscience has already made substantial progress in un- 
covering these principles, and information and commu- 
nication technologies have advanced to a point where 
it is possible to integrate almost as many transistors in 
a VLSI system as neurons in a brain. From a theoretical
standpoint, it has been demonstrated that
any Turing machine, and hence any conceivable dig- 
ital computation, can be implemented by a noise-free 
network of spiking neurons [38.39]. It has also been
shown that networks of spiking neurons can carry out
a wide variety of complex state-dependent computations,
even in the presence of noise [38.40–44]. However,
apart from isolated results, a general insight into
which computations can be carried out in a robust man- 
ner by networks of unreliable spiking elements is still 
missing. Current proposals in state-of-the-art computa- 
tional and theoretical neuroscience research represent 
mainly approximate functional models and are imple- 
mented as abstract artificial neural networks [38.45, 
46]. It is less clear how these functions are realized 
by the actual networks of neocortex [38.34], how these 
networks are interconnected locally, and how perceptual
and cognitive computations can be supported by
them.

Both additional neurophysiological studies on neu- 
ron types and quantitative descriptions of local and 
inter-areal connectivity patterns are required to deter- 
mine the specifications for developing the neuromor- 
phic VLSI analogs of the cortical circuits studied, and 
additional computational neuroscience and neuromor- 
phic engineering studies are required to understand 
what level of detail to use in implementing spiking 
neural networks, and what formal methodology to use 
for synthesizing and programming these non-von Neu- 
mann computational architectures. 


38.4 Temporal Dynamics in Neuromorphic Architectures 


Neuromorphic spiking neural network architectures 
typically comprise massively parallel arrays of sim- 
ple processing elements with memory and computation 
co-localized (Fig. 38.1). Given their architectural con- 
straints, these neural processing systems cannot process 
signals using the same strategies used by conventional
von Neumann computing architectures, such as
digital signal processors or central processing units, which
time-domain multiplex small numbers of highly complex
processors at high clock rates and operate by
transferring the partial results of the computation to
and from external memory banks. The synapses and neurons
in these architectures have to process input spikes
and produce output responses as the input signals ar- 
rive, in real time, at the rate of the incoming data. It is 
not possible to virtualize time and transfer partial re- 
sults in memory banks outside the architecture core, 
at higher rates. Rather it is necessary to employ re- 
sources that compute with time constants that are well 
matched to those of the signals they are designed to pro- 
cess. Therefore, to interact with the environment and 
process signals with biological timescales efficiently, 
hardware neuromorphic systems need to be able to 
compute using biologically realistic time constants. In 
this way, they are well matched to the signals they 
process, and are inherently synchronized with
real-world events.

This constraint is not easy to satisfy using analog 
VLSI technology. Standard analog circuit design tech- 
niques either lead to bulky and silicon-area expensive 
solutions [38.47] or fail to meet this condition, resorting 
to modeling neural dynamics at accelerated, unrealistic
timescales [38.48–50].


One elegant solution to this problem is to use cur- 
rent-mode design techniques [38.51] and log-domain 
circuits operated in the weak-inversion regime [38.52]. 
When metal oxide semiconductor field effect transis- 
tors are operated in this regime, the main mechanism 
of carrier transport is that of diffusion, as it is for ions 
flowing through proteic channels across neuron mem- 
branes. In general, neuromorphic VLSI circuits operate 
in this domain (also known as the subthreshold regime), 
and this is why they share many common physical prop- 
erties with proteic channels in neurons [38.52]. For 
example, metal oxide semiconductor field effect transistors
have an exponential relationship between gate-to-source
voltage and drain current, and produce currents
that range from femtoamperes to nanoamperes.
In this domain, it is therefore possible to integrate rela- 
tively small capacitors in VLSI circuits, to implement 
temporal filters that are both compact and have bio- 
logically realistic time constants, ranging from tens to 
hundreds of milliseconds. 
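
As a rough numerical illustration of the paragraph above, the following sketch uses the standard first-order weak-inversion model I_ds = I_0 exp(kappa V_gs / U_T) for a saturated transistor with its source grounded. The values of I_0, kappa, U_T, and the 1 pF capacitor are illustrative assumptions, not figures taken from this chapter, and the helper names are hypothetical. The second function evaluates the expression C U_T / (kappa I), which sets the time constant of log-domain filters of this kind, to show why picoampere currents and picofarad capacitors give time constants of tens to hundreds of milliseconds.

```python
import math

U_T = 0.025      # thermal voltage at room temperature, about 25 mV
KAPPA = 0.7      # subthreshold slope factor (assumed typical value)
I_0 = 1e-15      # leakage-scale current in amperes (assumed)

def subthreshold_current(v_gs):
    """Drain current of a saturated transistor in weak inversion (first-order model)."""
    return I_0 * math.exp(KAPPA * v_gs / U_T)

def time_constant(capacitance, bias_current):
    """Time constant C * U_T / (kappa * I) of a log-domain filter stage."""
    return capacitance * U_T / (KAPPA * bias_current)

# Gate voltage increase needed for a tenfold increase in current
volts_per_decade = U_T * math.log(10) / KAPPA

if __name__ == "__main__":
    for v in (0.1, 0.2, 0.3):
        print(f"V_gs = {v:.1f} V -> I_ds = {subthreshold_current(v):.3e} A")
    print(f"{volts_per_decade * 1e3:.1f} mV per decade of current")
    for i in (0.1e-12, 1e-12, 10e-12):
        tau = time_constant(1e-12, i)
        print(f"C = 1 pF, I = {i:.1e} A -> tau = {tau * 1e3:.2f} ms")
```

Under these assumed values, a 1 pF capacitor biased with 1 pA gives a time constant of roughly 36 ms, and reducing the bias to 0.1 pA stretches it to roughly 360 ms.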

A very compact subthreshold log-domain circuit 
that can reproduce biologically plausible temporal dy- 
namics is the differential pair integrator circuit [38.53], 
shown in Fig. 38.2. It can be shown, by log-domain circuit
analysis techniques [38.54, 55], that the response of
this circuit is governed by the following first-order
differential equation


$$\tau \left(1 + \frac{I_{th}}{I_{out}}\right) \frac{d}{dt} I_{out} + I_{out} = \frac{I_{th} I_{in}}{I_{\tau}} - I_{th} ,$$ (38.1)


where the time constant $\tau \triangleq C U_T / (\kappa I_{\tau})$, the term $U_T$
represents the thermal voltage and $\kappa$ the subthreshold slope
factor [38.52].

Fig. 38.1 Neuromorphic spiking neural network architectures: detailed biophysical models of cortical circuits are derived from
neuroscience experiments; neural network models are designed, with realistic spiking neurons and dynamic synapses; these are
mapped into analog circuits, and integrated in large numbers on VLSI chips. Input spikes are integrated by synaptic circuits,
which drive their target postsynaptic neurons, which in turn integrate all synaptic inputs and generate action potentials. Spikes
of multiple neurons are transmitted off chip using asynchronous digital circuits, to eventually control autonomous
behaving systems in real time

Fig. 38.2 Schematic diagram of the neuromorphic integrator
circuit. Input currents are integrated in time to produce output
currents with large time constants, and with a tunable
gain factor
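
Equation (38.1) is straightforward to integrate numerically. The following sketch, an illustration rather than a circuit-accurate model, applies forward Euler to (38.1) for a constant input current and compares the settled output with the fixed point I_th I_in / I_tau - I_th obtained by setting the time derivative to zero. All parameter values (capacitance, bias currents, initial condition, step size) and the function names are assumptions chosen for illustration.

```python
U_T = 0.025    # thermal voltage in volts (about 25 mV at room temperature)
KAPPA = 0.7    # subthreshold slope factor (assumed value)

def simulate_dpi(i_in, i_th=1e-12, i_tau=1e-12, cap=1e-12,
                 i_out0=1e-13, dt=1e-4, t_end=1.0):
    """Integrate (38.1) with forward Euler; return (tau, list of I_out samples)."""
    tau = cap * U_T / (KAPPA * i_tau)
    i_out = i_out0                      # must start positive: (38.1) divides by I_out
    trace = [i_out]
    for _ in range(int(round(t_end / dt))):
        drive = i_th * i_in / i_tau - i_th
        d_i_out = (drive - i_out) / (tau * (1.0 + i_th / i_out))
        i_out += dt * d_i_out
        trace.append(i_out)
    return tau, trace

def steady_state(i_in, i_th=1e-12, i_tau=1e-12):
    """Fixed point of (38.1), obtained by setting dI_out/dt = 0."""
    return i_th * i_in / i_tau - i_th

if __name__ == "__main__":
    tau, trace = simulate_dpi(i_in=50e-12)
    print(f"tau = {tau * 1e3:.1f} ms")
    print(f"final I_out = {trace[-1]:.3e} A, expected {steady_state(50e-12):.3e} A")
```

With these assumed values the time constant is about 36 ms, so a one-second run covers many time constants and the output settles close to the fixed point.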


Although this first-order nonlinear differential
equation cannot be solved analytically, for sufficiently
large input currents ($I_{in} \gg I_{\tau}$) the term $-I_{th}$ on the
right-hand side of (38.1) becomes negligible, and eventually,
when the condition $I_{out} \gg I_{th}$ is met, the equation
can be well approximated by

$$\tau \frac{d}{dt} I_{out} + I_{out} = \frac{I_{th} I_{in}}{I_{\tau}} .$$ (38.2)

Under the reasonable assumption of nonnegligible
input currents, this circuit therefore implements
a compact linear integrator with time constants that
can be set to range from microseconds to hundreds of
milliseconds. It is a circuit that can be used to build
neuromorphic sensory systems that interact with the
environment [38.56], and most importantly, it is a circuit
that faithfully reproduces the dynamics of synaptic
transmission observed in biological synapses [38.57].


38.5 Synapse and Neuron Circuits


Synapses are fundamental elements for computation
and information transfer in both real and artificial neural
systems. They play a crucial role in neural coding
and learning algorithms, as well as in neuromorphic
neural network architectures. While modeling the nonlinear
properties and the dynamics of real synapses
can be extremely onerous for software simulations in
terms of computational power, memory requirements,
and simulation time, neuromorphic synapse circuits can
faithfully reproduce synaptic dynamics using integrators
such as the differential pair integrator shown in
Fig. 38.2. The same differential pair integrator circuit
can be used to model the passive leak and conductance
behavior in silicon neurons. An example of
a silicon neuron circuit that incorporates the differential
pair integrator is shown in Fig. 38.3. This circuit
implements an adaptive exponential integrate-and-fire
neuron model [38.58]. In addition to the conductance-based
behavior, it implements a spike-frequency adaptation
mechanism, a positive feedback mechanism that
models the effect of sodium activation and inactivation
channels, and a reset mechanism with a free parameter
that can be used to set the neuron’s reset potential. The
neuron’s input differential pair integrator integrates the
input current until it approaches the neuron’s threshold
voltage. As the positive feedback circuit gets activated,
it induces an exponential rise in the variable that represents
the model neuron membrane potential, which
in the circuit of Fig. 38.3 is the current $I_{mem}$. This
quickly causes the neuron to produce an action potential
and make a request for transmitting a spike (i.e., the
REQ signal of Fig. 38.3 is activated). Once the digital
request signal is acknowledged, the membrane capacitance
$C_{mem}$ is reset to the neuron’s tunable reset potential
$V_{rst}$. These types of neuron circuits have been shown
to be extremely low power, consuming about 7 pJ per
spike [38.59]. In addition, the circuit is extremely compact
compared to alternative designs [38.58], while still
being able to reproduce realistic dynamics.
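
The behavior described above can be explored in software with the abstract adaptive exponential integrate-and-fire model itself. The sketch below is a plain forward-Euler simulation of that model in its commonly used two-variable form (membrane potential plus an adaptation current). It is not a model of the silicon circuit. The parameter values are illustrative assumptions, and the function name is hypothetical. The exponential term plays the role of the positive feedback, the increment `b` added at each spike produces spike-frequency adaptation, and `v_reset` is the free reset parameter.

```python
import math

def simulate_adex(i_in=1.0e-9, t_end=0.5, dt=5e-5):
    """Forward-Euler simulation of the adaptive exponential integrate-and-fire model.

    Returns the list of spike times in seconds. Parameter values are illustrative.
    """
    c_mem, g_leak = 281e-12, 30e-9                      # capacitance (F), leak conductance (S)
    e_leak, v_thr, delta_t = -70.6e-3, -50.4e-3, 2e-3   # volts
    tau_w, a, b = 0.144, 4e-9, 80.5e-12                 # adaptation: time constant (s), coupling (S), increment (A)
    v_reset, v_peak = -70.6e-3, 0.0                     # reset potential and spike cutoff (V)

    v, w = e_leak, 0.0
    spikes = []
    for step in range(int(round(t_end / dt))):
        # Exponential term: stands in for the positive-feedback spike initiation
        i_exp = g_leak * delta_t * math.exp((v - v_thr) / delta_t)
        dv = (-g_leak * (v - e_leak) + i_exp - w + i_in) / c_mem
        dw = (a * (v - e_leak) - w) / tau_w
        v += dt * dv
        w += dt * dw
        if v >= v_peak:                  # spike: record event, reset, and adapt
            spikes.append((step + 1) * dt)
            v = v_reset
            w += b
    return spikes

if __name__ == "__main__":
    times = simulate_adex()
    isis = [t2 - t1 for t1, t2 in zip(times, times[1:])]
    print(f"{len(times)} spikes; first ISI {isis[0] * 1e3:.1f} ms, "
          f"last ISI {isis[-1] * 1e3:.1f} ms")
```

For a constant suprathreshold input, the inter-spike intervals lengthen over time as the adaptation variable accumulates, which is the software counterpart of the spike-frequency adaptation mechanism mentioned above.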

As synapse and neuron circuits integrate their corresponding
input signals in parallel, the neural network
emulation time does not depend on the number of
elements involved, and the network response always
happens in real time. These circuits can therefore be used
to develop low-power large-scale hardware neural architectures
for signal processing and general purpose
computing [38.58].


Fig. 38.3 Schematic diagram of a conductance-based integrate-and-fire neuron. An input differential pair integrator
low-pass filter ($M_{L1}$ to $M_{L3}$) implements the neuron leak conductance. A noninverting amplifier with current-mode positive
feedback ($M_{A1}$ to $M_{A6}$) produces address events at extremely low-power operation. A reset block ($M_{R1}$ to $M_{R6}$) resets the neuron
to the reset voltage $V_{rst}$ and keeps it reset for a refractory period, set by the $V_{ref}$ bias voltage. An additional differential
pair integrator low-pass filter ($M_{G1}$ to $M_{G6}$) integrates the output events in a negative feedback loop, to implement a
spike-frequency adaptation mechanism


38.5.1 Spike-Based Learning Circuits


As large-scale VLSI networks of spiking neurons are
becoming realizable, the development of robust
spike-based learning methods, algorithms, and circuits
has become crucial. Spike-based learning mechanisms
enable the hardware neural systems they are embedded in
to adapt to the statistics of their input signals, to learn
and classify complex sequences of spatiotemporal patterns,
and eventually to implement general purpose
state-dependent computing paradigms. Biologically
plausible spike-driven synaptic plasticity mechanisms
have been thoroughly investigated in recent years. It has
been shown, for example, how spike-timing dependent
plasticity (STDP) can be used to learn to encode temporal
patterns of spikes [38.42, 60, 61]. In spike-timing
dependent plasticity the relative timing of pre- and
postsynaptic spikes determines how to update the efficacy
of a synapse. Plasticity mechanisms based on the timing
of the spikes map very effectively onto silicon
neuromorphic devices, and so a wide range of spike-timing
dependent plasticity models have been implemented
in VLSI [38.62–67]. It is therefore possible to build
large-scale neural systems that can carry out signal
processing and neural computation, and include adaptation
and learning. These types of systems are, by
their very nature, modular and scalable. It is possible
to develop very large scale systems by designing
basic neural processing cores, and by interconnecting
them [38.68]. However, to interconnect multiple
neural network chips with each other, or to
provide sensory inputs to them, or to interface them to
conventional computers or robotic platforms, it is necessary
to develop efficient spike-based communication
protocols and interfaces.
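
The dependence of the weight update on relative spike timing, described above for spike-timing dependent plasticity, can be sketched with one common pair-based formulation that uses exponential learning windows. The chapter does not commit to a particular rule, so the window shape, amplitudes, time constants, weight bounds, and function names below are all assumptions made for illustration. The clipping step reflects the bounded weights typical of hardware synapses.

```python
import math

def stdp_dw(delta_t, a_plus=0.01, a_minus=0.012, tau_plus=0.02, tau_minus=0.02):
    """Weight change for one spike pair; delta_t = t_post - t_pre in seconds."""
    if delta_t > 0:      # presynaptic spike precedes postsynaptic spike: potentiate
        return a_plus * math.exp(-delta_t / tau_plus)
    if delta_t < 0:      # postsynaptic spike precedes presynaptic spike: depress
        return -a_minus * math.exp(delta_t / tau_minus)
    return 0.0

def apply_stdp(weight, pre_times, post_times, w_min=0.0, w_max=1.0):
    """Accumulate all pre/post pair contributions, then clip to the allowed range."""
    for t_pre in pre_times:
        for t_post in post_times:
            weight += stdp_dw(t_post - t_pre)
    return min(w_max, max(w_min, weight))

if __name__ == "__main__":
    print(apply_stdp(0.5, pre_times=[0.000], post_times=[0.005]))  # pre leads: weight grows
    print(apply_stdp(0.5, pre_times=[0.005], post_times=[0.000]))  # post leads: weight shrinks
```

The all-to-all pairing used here is the simplest choice; nearest-neighbor pairing or other variants would change only the loop in `apply_stdp`.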


38.6 Spike-Based Multichip Neuromorphic Systems 


In addition to using spikes for efficient signal processing
and computation, neuromorphic systems can also use
spiking representations for efficient communication.
The use of asynchronous spike- or event-based
representations in electronic systems can be energy effi- 
cient and fault tolerant, making them ideal for building 
modular systems and creating complex hierarchies of 
computation. In recent years, a new class of neuromorphic
multichip systems has started to emerge [38.69–71].
These systems typically comprise one or more neuro- 
morphic sensors, interfaced to general-purpose neural 
network chips comprising spiking silicon neurons and 
dynamic synapses. The strategy used to transmit sig- 
nals across chip boundaries in these types of systems 
is based on asynchronous address-events: output events 
are represented by the addresses of the neurons that 
spiked, and transmitted in real time on a digital bus
(Fig. 38.4). The communication protocol used by these
systems is commonly referred to as address event
representation (AER) [38.72, 73]. The analog nature of the
AER signals being transmitted
is encoded in the mean frequency of the neurons’ spikes
(spike rates) and in their precise timing. Both types of
representations are still an active topic of research in 
neuroscience, and can be investigated in real time with 
these hardware systems. Once on a digital bus, the ad- 
dress events can be translated, converted or remapped 
to multiple destinations using the conventional logic 
and memory elements. Digital address event representa- 
tion infrastructures allow us to construct large multichip 
networks with arbitrary connectivity, and to seamlessly 
reconfigure the network topology. Although digital, the
asynchronous real-time nature of the AER protocol
poses significant technological challenges that are still
being actively investigated by the electrical engineering
community [38.74]. But by using analog processing in
the neuromorphic cores and asynchronous digital communication
outside them, neuromorphic systems can
exploit the best of both worlds, and implement compact
low-power brain-inspired neural processing systems
that can interact with the environment in real time, and
represent an alternative (complementary) computing
technology to the more common conventional
VLSI computing architectures.

Fig. 38.4 Asynchronous communication scheme between two chips:
when a neuron on the source chip generates an action potential, its
address is placed on a common digital bus. The receiving chip decodes
the address events and routes them to the appropriate synapses
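
The address-event scheme described in this section can be mimicked in software as a stream of (time, address) pairs plus a lookup table that remaps each source address to one or more destinations. The sketch below is only an illustration of that idea. In hardware the events are transmitted in real time and handshaking replaces explicit timestamps, whereas here timestamps are carried explicitly. The data layout, the chip and synapse labels, and the function names are all assumptions.

```python
import heapq
from collections import defaultdict

def encode_aer(spike_trains):
    """Merge per-neuron spike times into one time-ordered stream of (time, address)."""
    streams = [[(t, addr) for t in sorted(times)]
               for addr, times in spike_trains.items()]
    return list(heapq.merge(*streams))

def route(events, mapping_table):
    """Deliver each event to every destination listed for its source address."""
    delivered = defaultdict(list)
    for time, src in events:
        for dest in mapping_table.get(src, ()):
            delivered[dest].append(time)
    return dict(delivered)

if __name__ == "__main__":
    # Two source neurons (addresses 0 and 1) with their spike times in seconds
    events = encode_aer({0: [0.010, 0.030], 1: [0.020]})
    # Hypothetical mapping table: source address -> list of (chip, synapse) targets
    table = {0: [("chipB", 5), ("chipB", 7)], 1: [("chipB", 5)]}
    for dest, times in route(events, table).items():
        print(dest, times)
```

Changing the network topology in this picture amounts to editing the mapping table, which is the software analog of reconfiguring connectivity through the digital infrastructure.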


38.7 State-Dependent Computation in Neuromorphic Systems 


General-purpose cortical-like computing architectures 
can be interfaced to real-time autonomous behaving 
systems to process sensory signals and carry out event- 
driven state-dependent computation in real time. How- 
ever, while the circuit design techniques and technolo- 
gies for implementing these neuromorphic systems are 
becoming well established, formal methodologies for 
programming them, to execute specific procedures and
solve user-defined tasks, do not exist yet. A first step
toward this goal is the definition of methods and pro- 
cedures for implementing state-dependent computation 
in networks of spiking neurons. In general, state-depen- 
dent computation in autonomous behaving systems has 
been a challenging research field since the advent of 
digital computers. Recent theoretical findings and tech- 
nological developments show promising results in this 
domain [38.16, 43, 44, 75, 76]. But the computational
tasks that these systems are currently able to perform re- 
main rather simple, compared to what can be achieved 
by humans, mammals, and many other animal species. 
We know, for instance, that nervous systems can ex- 
hibit context-dependent behavior, can execute programs 
consisting of series of flexible iterations, and can condi- 
tionally branch to alternative behaviors. A general un- 
derstanding of how to configure artificial neural systems 
to achieve this sophistication of processing, including 
also adaptation, autonomous learning, interpretation of 
ambiguous input signals, symbolic manipulation, in- 
ference, and other characteristics that we could regard 
as effective cognition is still missing. But progress is 
being made in this direction by studying the computational
properties of spiking neural networks configured
as attractors or winner-take-all networks [38.33, 44,
77]. When properly configured, these architectures produce
persistent activities, which can be regarded as
computational states. Both software and VLSI event-driven
soft-winner-take-all architectures are being developed
to couple spike-based computational models
with each other, using the asynchronous communication
infrastructure, and use them to investigate their
computational properties as neural finite-state machines
in autonomous behaving robotic platforms [38.44, 78].

The theoretical, modeling, and VLSI design interdisciplinary
activities are carried out with tight interactions,
in an effort to understand:


1. How to use the analog, unreliable, and low-precision
silicon neurons and synapse circuits operated
in the weak-inversion regime [38.52] to carry out
reliable and robust signal processing and pattern
recognition tasks;

2. How to compose networks of such elements and
how to embody them in real-time behaving systems
for implementing sets of prespecified desired
functionalities and behaviors; and

3. How to formalize these theories and techniques to
develop a systematic methodology for configuring
these networks and systems to achieve arbitrary
state-dependent computations, similar to what is
currently done using high-level programming languages
such as Java or C++ for conventional
digital architectures.


38.8 Conclusions


In this chapter, we presented an overview of the neuromorphic
engineering field, focusing on very large scale
integration implementations of spiking neural networks
and on multineuron chips that comprise synapses and
neurons with biophysically realistic dynamics, nonlinear
properties, and spike-based plasticity mechanisms.
We argued that the multineuron chips built using these
silicon neurons and synaptic circuits can be used to
implement an alternative brain-inspired computational
paradigm that is complementary to the conventional
ones based on von Neumann architectures.

Indeed, the field of neuromorphic engineering has
been very successful in developing a new generation of
computing technologies implemented with design principles
based on those of nervous systems, and which
exploit the physics of computation used in biological
neural systems. It is now possible to design and implement
complex large-scale artificial neural systems
with elaborate computational properties, such as spike-based
plasticity and soft winner-take-all behavior, or
even complete artificial sensory-motor systems, able to
robustly process signals in real time using neuromorphic
VLSI technology.

Within this context, neuromorphic VLSI technology 
can be extremely useful for exploring neural processing 
strategies in real time. While there are clear advantages 
of this technology, for example, in terms of power bud- 
get and size requirements, there are also restrictions 
and limitations imposed by the hardware implemen- 
tations that limit their possible range of applications. 
These constraints include for example limited resolu- 
tion in the state variables or bounded parameters (e.g., 
bounded synaptic weights that cannot grow indefinitely 
or become negative). Also, the presence of noise and
inhomogeneities in all circuit components places severe
limitations on the precision and reliability of the
computations performed. However, most, if not all, of the
limitations that neuromorphic hardware implementations
face (e.g., in maintaining stability, in achieving
robust computation using unreliable components, etc.)


Damien Coyle, Ronen Sosnik


Neuroengineering of sensorimotor rhythm-based 
brain-computer interface (BCI) systems is the 
process of using engineering techniques to un- 
derstand, repair, replace, enhance, or otherwise 
exploit the properties of neural systems, engaged 
in the representation, planning, and execution 
of volitional movements, for the restoration and 
augmentation of human function via direct inter- 
actions between the nervous system and devices. 

This chapter reviews information that is fun- 
damental for the complete and comprehensive 
understanding of this complex interdisciplinary 
research field, namely an overview of the motor 
system, an overview of recent findings in neu- 
roimaging and electrophysiology studies of the 
motor cortical anatomy and networks, and the 
engineering approaches used to analyze motor 
cortical signals and translate them into control 
signals that computer programs and devices can 
interpret. 

Specifically, the anatomy and physiology of
the human motor system, focusing on the brain
areas and spinal elements involved in the generation
of volitional movements, are reviewed. The
stage is then set for introducing human proto- 
typical motion attributes, sensorimotor learning, 
and several computational models suggested to 
explain psychophysical motor phenomena based 
on the current knowledge in the field of neuro- 
physiology. 

An introduction to invasive and non-invasive
neural recording techniques, including functional
and structural magnetic resonance imaging
(fMRI and sMRI), electrocorticography (ECoG),
electroencephalography (EEG), intracortical single
unit activity (SU) and multiple unit extracellular
recordings, and magnetoencephalography (MEG)
is integrated with coverage aimed at elucidating
what is known about sensory motor oscillations
and brain anatomy, which are used to generate
control signals for brain actuated devices and alternative
communication in BCI. Emphasis is on
latest findings in these topics and on highlighting
what information is accessible at each of the
different scales and the levels of activity that are
discernible or utilizable for the effective control
of devices using intentional activation of sensorimotor
neurons and/or modulation of sensorimotor
rhythms and oscillations.

The nature, advantages, and drawbacks of var- 
ious approaches and their suggested functions as 
the neural correlates of various spatiotemporal 
motion attributes are reviewed. Sections dealing 
with signal analysis techniques, translation algo- 
rithms, and adaption to the brain's non-stationary 
dynamics present the reader with a wide-ranging 
review of the mathematical and statistical tech- 
niques commonly used to extract and classify the 
bulk of neural information recorded by the various 
recording techniques and the challenges that are 
posed for deploying BCI systems for their intended 
uses, be it alternative communication and control, 
assistive technologies, neurorehabilitation, neu- 
rorestoration or replacement, or recreation and 
entertainment, among other applications. Lastly, 
a discussion is presented on the future of the field, 
highlighting newly emerging research directions 
and their potential ability to enhance our un- 
derstanding of the human brain and specifically 
the human motor system and ultimately how that 
knowledge may lead to more advanced and intel- 
ligent computational systems. 


39.1 Overview - Neuroengineering in General
39.1.1 The Human Motor System
39.2 Human Motor Control
39.2.1 Motion Planning and Execution in Humans
39.2.2 Coordinate Systems Used to Acquire a New Internal Model
39.2.3 Spatial Accuracy and Reproducibility
39.3 Modeling the Motor System - Internal Motor Models
39.3.1 Forward Models, Inverse Models, and Combined Models
39.3.2 Adaptive Control Theory
39.3.3 Optimization Principles
39.3.4 Kinematic Features of Human Hand Movements and the Minimum Jerk Hypothesis
39.3.5 The Minimum Jerk Model, The Target Switching Paradigm, and Writing-like Sequence Movements
39.4 Sensorimotor Learning
39.4.1 Explicit Versus Implicit Motor Learning
39.4.2 Time Phases in Motor Learning
39.4.3 Effector Dependency
39.4.4 Coarticulation
39.4.5 Movement Cuing
39.5 MRI and the Motor System - Structure and Function
39.6 Electrocorticographic Motor Cortical Surface Potentials
39.7 MEG and EEG - Extra Cerebral Magnetic and Electric Fields of the Motor System
39.7.1 Sensorimotor Rhythms and Other Surrounding Oscillations
39.7.2 Movement-Related Potentials
39.7.3 Decoding Hand Movements from EEG
39.8 Extracellular Recording - Decoding Hand Movements from Spikes and Local Field Potential
39.8.1 Neural Coding Schemes
39.8.2 Single Unit Activity Correlates of Hand Motion Attributes
39.8.3 Local Field Potential Correlates of Hand Motion Attributes
39.9 Translating Brainwaves into Control Signals - BCIs
39.9.1 Pre-Processing and Feature Extraction/Selection
39.9.2 Classification
39.9.3 Unsupervised Adaptation in Sensorimotor Rhythms BCIs
39.9.4 BCI Outlook
39.10 Conclusion
References


39.1 Overview — Neuroengineering in General 


Neuroengineering is defined as the interdisciplinary 
field of engineering and computational approaches to 
problems in basic and clinical neurosciences. Thus, ed- 
ucation and research in neuroengineering encompasses 
the fields of engineering, mathematics, and computer 
science on the one hand, and molecular, cellular, and 
systems neurosciences on the other. Prominent goals in 
the field include restoration and augmentation of human 
functions via direct interactions between the nervous 
system and artificial devices. Much current research is 
focused on understanding the coding and processing of 
information in the sensory and motor systems, quanti- 
fying how this processing is altered in the pathological 
state, and how it can be manipulated through interac- 
tions with artificial devices, including brain-computer 
interfaces (BCIs) and neuroprostheses. 

Although there are many topics that can be covered
under the neuroengineering umbrella, this chapter does not
aim to cover them all. The focus is on providing a comprehensive
overview of the state of the art in technologies
and knowledge surrounding the human motor system.
The motor system is extremely complex in terms of 
the functions it performs and the structure underlying 
the control it provides; however, there is an ever- 
increasing body of knowledge on how it works and 
is controlled. This has been facilitated by studies of 
animal models, computational models, electrophysiol- 
ogy, and neuroimaging of humans. Another important 
development that is extending the boundaries of our 
knowledge about the motor system is the development 
of brain-computer interface (BCI) technologies that involve
intentional modulation of sensorimotor activity
through executed as well as imagined movement (motor 
imagery). BCI research not only opens up a framework 
for non-muscular communication between humans and 
computers/machines but offers experimental paradigms 
for understanding the neuroscience of motor control, 
testing hypotheses, and gaining detailed insight into
motor control from the activity of a single neuron,
a small population of neurons, networks of neurons, 
and the spatial and spectral relationship across multiple 
brain regions and networks. This knowledge will un- 
doubtedly lead to better diagnostics for motor-related 
pathologies, better BCIs for assistance and alterna- 
tive non-muscular communication applications for the 
physically impaired, better rehabilitation for those ca- 
pable of regaining lost motor function, and a better 
understanding of the brain as a whole. 

Relevantly, for the context and scope of this handbook,
BCI research will contribute to gaining
better insight into information processing in the brain,
resulting in better, more intelligent computational approaches
to developing truly intelligent systems, that is,
systems that perceive, reason, and act autonomously.

The motor system is often considered to be at the 
heart of human intelligence. From the motor chauvin- 
ist’s point of view the entire purpose of the brain is to 
produce movement [39.1]. This assertion is based on the 
following observations about movement: 


1. Interaction with the world is only achieved through 
movement. 

2. All communication is mediated via the motor sys- 
tem including speech, sign language, gesture, and 
writing. 

3. All sensory and cognitive processes can be consid- 
ered inputs that determine future motor outputs. 


Neuroscientists and researchers focusing on other 
areas and functions of the brain may refute this sugges- 


tion given the fact that many regions related to general 
intelligence are located throughout the brain and that 
a single intelligence center is unlikely. No single neu- 
roanatomical structure determines general intelligence, 
and different types of brain designs can produce equiva- 
lent intellectual performance [39.3]. Nevertheless, there 
is no doubt that the motor system is critical to the 
advancement of human level intelligence and, there- 
fore, in the context of computational intelligence, this 
chapter focuses on reviewing studies and methodolo- 
gies that elucidate some of the aspects that we know 
about sensorimotor systems and how these can be stud- 
ied. Although the aim of the chapter is not to provide 
an exhaustive review of the available extensive litera- 
ture, it does aim to provide insights into key findings 
using some of the state-of-the-art experimental and 
methodological approaches deployed in neuroscience 
and neuroengineering, whilst at the same time review- 
ing methodology that may lead to the development of 
practical BCIs. BCIs have revealed new ways of studying
how the brain learns and adapts, which in turn
have helped to improve BCI designs and the computational
intelligence that adapts the signal processing
to the adaptation regime of the brain. One of the key
findings in BCI research is that it can trigger plas- 
tic changes in different brain areas, suggesting that 
the brain has even greater flexibility than previously 
thought [39.4]. These findings can only serve to improve
our understanding of how the brain, the most
sophisticated and complex organ in the known universe,
functions, undoubtedly leading to better computational
systems.

Fig. 39.1 Major components of the
motor system (after [39.2], courtesy
of Shadmehr)

39.1.1 The Human Motor System


The human motor system produces action. It controls 
goal-directed movement by selecting the targets of ac- 
tion, generating a motor plan and coordinating the 
generation of forces needed to achieve those objectives.
Genes encode a great deal of the information
required by the motor system, especially for actions
involving locomotion, orientation, exploration, ingestion,
defence, aggression, and reproduction, but every
individual must learn and remember a great deal of motor
information during his or her lifetime. Some of that
information rises to conscious awareness, but much of it
does not. Here we will focus on the motor system of
humans, drawing on information from primates and other
mammals, as necessary.

Fig. 39.2 Divisions of the spinal cord (after [39.5], courtesy of
Shadmehr and McDonald)


Major Components of the Motor System 

The central nervous system that vertebrates have 
evolved comprises six major components: the spinal 
cord, medulla, pons, midbrain, diencephalon and telen- 
cephalon, the last five of which compose the brain. In 
a different grouping, the hindbrain (medulla and pons), 
the midbrain, and the forebrain (telencephalon plus di- 
encephalon) constitute the brain. Taken together, the 
midbrain and hindbrain make up the brainstem. All lev- 
els of the central nervous system participate in motor 
control. Consider the simple act of reaching
to pick up a cup of coffee, which illustrates the function of the
various components of the motor system (Fig. 39.1):


• The parietal cortex: Processes visual information
and proprioceptive information to compute the location
of the cup with respect to the hand. Sends this
information to the motor cortex.

• The motor cortex: Using the information regarding
the location of the cup with respect to the hand, it
computes forces that are necessary to move the arm.
This computation results in commands that are sent
to the brainstem and the spinal cord.

• The brainstem motor center: Sends commands to
the spinal cord that will maintain the body’s balance
during the reaching movement.

• The spinal cord: Motor neurons send the commands
received from the motor cortex and the brainstem to
the muscles. During the movement, sensory information
from the limb is acquired and transmitted
back to the cortex. Reflex pathways ensure stability
of the limb.

• The cerebellum: This center is important for coordination
of multi-joint movements, learning of
movements, and maintenance of postural stability.

• The basal ganglia: This center is important for
learning of movements, stability of movements,
initiation of movements, and emotional and motivational
aspects of movements.

• The thalamus: May be thought of as a kind of
switchboard of information. It acts as a relay between
a variety of subcortical areas and the cerebral
cortex, although recent studies suggest that thalamic
function is more selective. The neuronal information
processes necessary for motor control are
proposed as a network involving the thalamus as
a subcortical motor center. The nature of the
interconnected tissues of the cerebellum to the multiple
motor cortices suggests that the thalamus fulfills
a key function in providing the specific channels
from the basal ganglia and cerebellum to the cortical
motor areas.

Fig. 39.3 A spinal segment (after [39.5], courtesy of Shadmehr and McDonald)


The spinal cord comprises four major divisions. 
From rostral to caudal, these are called: cervical, tho- 
racic, lumbar, and sacral (Fig. 39.2). Cervix is the Latin 
word for neck. The cervical spinal segments intervene 
between the pectoral (or shoulder) girdle and the skull. 
Thorax means chest (or breast plate). Lumbar refers to 
the loins. Sacral, the most intriguing name of all, refers
to some sort of sacred bone.

In humans, the cervical spinal cord has 8 segments,
the thoracic spinal cord has 12, and the lumbar
and sacral cords each have 5. The parts of the spinal
cord that receive inputs from and control the muscles 
of the arms (more generally, forelimbs) and legs (more 
generally, hind limbs) show enlargements associated 
with an increasing number and size of neurons and 
fibers: the cervical enlargement for the arms and the 


lumbar enlargement for the legs. Each segment is la- 
beled and numbered according to its order, from rostral 
to caudal, within each general region of spinal cord. 
Thus, the first cervical segment is abbreviated C1 and 
together the eight cervical segments can be designated 
as C1-C8. 

In each spinal segment, one finds a ring of white 
matter (WM) surrounding a central core of gray matter 
(GM) (Fig. 39.3). White matter is so called because the 
high concentration of myelin in the fiber pathways gives 
it a lighter, shiny appearance relative to regions with 
many cell bodies. The spinal gray matter bulges at the 
dorsal and ventral surfaces to form the dorsal horn and 
ventral horn, respectively. 

The cord has two major systems of neurons: de- 
scending and ascending. In the descending group, the 
neurons control both smooth muscles of the inter- 
nal organs and the striated muscles attached to our 
bones. The descending pathway begins in the brain, 
which sends electrical signals to specific segments in 
the cord. Motor neurons in those segments then con- 
vey the impulses towards their destinations outside the 
cord. 

The ascending group comprises the sensory pathways,
which send sensory signals received from the limbs,
muscles, and organs to specific segments of the cord
and to the brain. These signals originate with specialized
transducer cells, such as cells in the skin that
detect pressure. The cell bodies of the neurons are
in a gray, butterfly-shaped region of the cord (gray
matter). The ascending and descending axon fibers
travel in a surrounding area known as the white matter.
It is called white matter because the axons are
wrapped in myelin, a white, electrically insulating
material.


39.2 Human Motor Control


Motor control is a complex process that involves the
brain, muscles, limbs, and often external objects. It
underlies motion, balance, stability, coordination, and
our interaction with others and technology. The general
mission of the human motor control research field is to
understand the physiology of normal human voluntary
movement and the pathophysiology of different movement
disorders. Some of the opening questions include:
How do we select our actions from the many possible
actions? How are these behaviors sequenced for appropriate
order and timing between them? How does perception
integrate with motor control? And how are perceptual-motor
skills acquired? In the following section, the basic
aspects of motor control, namely motor planning and motor
execution, are presented.


39.2.1 Motion Planning
and Execution in Humans


Human goal-directed arm movements are fast, accurate,
and can compensate for various dynamic loads exerted
by the environment. These movements exhibit remarkable
invariant properties, although a motor goal can
be achieved using different combinations of elementary
movements.

Models of goal-directed human arm movements
can be divided into two major groups: feedback and
feed-forward. Feedback schemes for motion planning
assume that the motion is generated through a feedback
control law, whereas feed-forward schemes of trajectory
formation propose that the movement is planned
in advance and then executed. While a comprehensive
model of human arm movements should include
feed-forward as well as feedback control mechanisms, pure
feedback control mechanisms cannot account for the
fast and smooth movements performed by adult humans.
Although none of the existing models are able
to account for all the characteristics of human motion,
there is compelling evidence that mechanisms for
feed-forward motion planning exist within the central
nervous system (CNS). A further supporting argument
for the existence of a pre-planned trajectory is that visually
directed movements are characterized by relatively
long reaction times (RT) of 200-500 ms [39.6], which
are supposed to reflect the time needed to plan an adequate
movement. Partial knowledge of either the
amplitude or the direction of the upcoming movement
can significantly reduce the RT.


39.2.2 Coordinate Systems Used to Acquire 
a New Internal Model 


In reaching for objects around us, neural processing 
transforms visuospatial information about target loca- 
tion into motor commands to specify muscle forces 
and joint motions that are involved in moving the 
hand to the desired location [39.7]. In planar reaching 
movements, extent and direction have different vari- 
able errors, suggesting that the CNS plans the move- 
ment amplitude and direction independently and that 
the hand paths are initially planned in vectorial co- 
ordinates without taking into account joint motions. 
In this framework, the movement vector is specified 
as an extent and direction from the initial hand posi- 
tion. Kinematic accuracy depends on learning a scaling 
factor from errors in extent and reference axes from 
errors in direction, and the learning of new reference 
axes shows limited generalization [39.8]. Altogether, 
these findings suggest that motor planning takes place 
in extrinsic, hand-centered, visually perceived coordinates.
Ultimately, however, vectorial information needs
to be converted into muscle forces for the desired 
movement to be produced. This transformation needs 
to take into account the biomechanical properties of 
the moving arm, notably the interaction torques pro- 
duced at all the joints by the motion of all limb 
segments. For multi-joint arms, there are significant 
inertial dynamic interactions between the moving skele- 
tal segments, and several muscles pull across more 
than one joint. Clearly, these complexities raise com- 
plicated control problems since one needs to overcome 
or solve the inverse dynamics problem. The capacity
to anticipate the dynamic effects is understood
to depend on learning internal models of musculoskeletal
dynamics and other forces acting on the
limb.

The equilibrium trajectory hypothesis for multi- 
joint arm motions [39.9] circumvented the complex 
dynamic problem mentioned above by using the spring- 
like properties of muscles and stating that multi-joint 
arm movements are generated by gradually shifting 
the hand equilibrium positions defined by the neuro- 
muscular activity. The magnitude of the force exerted 
on the arm, at any time, depends on the difference 
between the actual and equilibrium hand positions 
and the stiffness and viscosity about the equilibrium 
position. 
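
The spring-like force law described above can be sketched numerically. The following Python sketch is illustrative only: the function names, the isotropic stiffness and viscosity values, the unit point mass, and the linear shift of the equilibrium position are all assumptions made for the example and are not taken from [39.9].

```python
import numpy as np


def restoring_force(x, v, x_eq, stiffness, viscosity):
    """Force on the hand: spring pull toward the equilibrium position
    minus viscous drag on the hand velocity."""
    return stiffness @ (x_eq - x) - viscosity @ v


def simulate_equilibrium_shift(x_start, x_end, duration=1.0, settle=1.0,
                               dt=0.001, mass=1.0, k=100.0, b=20.0):
    """Shift the equilibrium position linearly from x_start to x_end and
    integrate the motion of a point mass (assumed values, arbitrary units)."""
    x_start = np.asarray(x_start, dtype=float)
    x_end = np.asarray(x_end, dtype=float)
    stiffness = k * np.eye(2)   # assumed isotropic stiffness
    viscosity = b * np.eye(2)   # assumed isotropic viscosity
    x = x_start.copy()
    v = np.zeros(2)
    path = [x.copy()]
    n_steps = int(round((duration + settle) / dt))
    for step in range(n_steps):
        s = min(step * dt / duration, 1.0)        # fraction of the shift completed
        x_eq = x_start + s * (x_end - x_start)    # current equilibrium position
        a = restoring_force(x, v, x_eq, stiffness, viscosity) / mass
        v = v + a * dt                            # semi-implicit Euler step
        x = x + v * dt
        path.append(x.copy())
    return np.array(path)


path = simulate_equilibrium_shift([0.0, 0.0], [0.2, 0.1])
final_position = path[-1]
```

In this sketch no inverse dynamics are computed: the hand simply lags behind the moving equilibrium position and settles on it once the shift ends.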

Neuropsychological studies indicate that, for the primary
motor cortex (M1), the representations that mediate
motor behavior are distributed, often in a graded
manner, across extensive, overlapping cortical regions,
so that different memory systems can underlie different
coordinate systems, which are used at different hierarchical
levels.


39.2.3 Spatial Accuracy and Reproducibility 


Our ability to generate accurate and appropriate mo- 
tor behavior relies on tailoring our motor commands 
to the prevailing movement context. This context em- 
bodies parameters of both our own motor system, such 
as the level of muscle fatigue, and the outside world, 
such as the weight of a bottle to be lifted. As the con- 
sequence of a given motor command depends on the 
current context, the CNS has to estimate this context 
so that the motor commands can be appropriately ad- 
justed to attain accurate control. A current context can 
be estimated by integrating two sources of information: 
sensory feedback and knowledge about how the context 


is likely to have changed from the previous estimate. In 
the absence of sensory feedback about the context, the 
CNS is able to extrapolate the likely evolution of the 
context without requiring awareness that the context is 
changing [39.10]. 
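
One way to make the integration of these two sources of information concrete is a scalar, Kalman-filter-style update. This formalization is an assumption chosen for illustration, not a model claimed by the text; the function name, parameters, and numbers (for instance, an estimated bottle weight) are hypothetical.

```python
def update_context_estimate(estimate, variance, expected_change, change_variance,
                            observation=None, observation_variance=None):
    """Combine knowledge of how the context is likely to have changed with
    sensory feedback, each weighted by its reliability (assumed Gaussian)."""
    # Step 1: extrapolate the previous estimate using the expected change.
    predicted = estimate + expected_change
    predicted_variance = variance + change_variance
    if observation is None:
        # No sensory feedback: rely on the extrapolation alone.
        return predicted, predicted_variance
    # Step 2: correct the extrapolation with the sensory observation.
    gain = predicted_variance / (predicted_variance + observation_variance)
    new_estimate = predicted + gain * (observation - predicted)
    new_variance = (1.0 - gain) * predicted_variance
    return new_estimate, new_variance


# Hypothetical example: estimated weight of a bottle (kg) being filled.
without_feedback = update_context_estimate(1.0, 0.1, 0.5, 0.1)
with_feedback = update_context_estimate(1.0, 0.1, 0.5, 0.1,
                                        observation=2.0, observation_variance=0.2)
```

Without an observation the estimate is simply extrapolated and its uncertainty grows; with an observation the estimate moves toward the sensed value and the uncertainty shrinks.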

Although the CNS tries to maximize our motion 
accuracy, systematic directional errors are still found. 
These errors may result from a number of sources. 
One cause for not being accurate is a visual distor- 
tion, which could be the outcome of fatigue of the 
eyes or inherent optical distortion. A second cause 
could be imperfect control processes due to the noise 
in the neuromuscular system or blood flow pulsations, 
which cause our movements to be jerky. A third cause 
could be that each movement we utilize, consciously 
or unconsciously, may involve different motor plans, 
which result in slightly different trajectories and end- 
point accuracies. 

In a simple aiming movement, the task is to mini- 
mize the final error, as measured by the variance about 
the target. The endpoint variability has an ellipsoid
shape with two main axes perpendicular to each other.
This finding led to the vectorial planning hypothesis,
which states that planning of visually guided reaches is
accomplished by independent specification of extent and
direction [39.8]. It was later suggested that the aim of
the optimal control strategy is to minimize the volume
of the ellipsoid, thereby being as accurate as possible.
Non-smooth movements require increased motor
commands, which generate increased noise; smoothness
thereby leads to increased endpoint accuracy but
is not a goal in its own right. Although the end-point-error
cost function specifies the optimal movement, how one 
approaches this optimum for novel, unrehearsed move- 
ments is an open question. 
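
The link between smoothness and endpoint accuracy can be illustrated with a toy calculation. The sketch below assumes a velocity command corrupted by independent noise whose standard deviation is proportional to the command magnitude; this noise model, the profiles, and all numerical values are assumptions for illustration and do not reproduce any specific published model.

```python
import numpy as np


def endpoint_stats(command, dt, noise_scale=0.1):
    """Displacement and endpoint variance for a velocity command corrupted by
    independent noise with standard deviation noise_scale * |command|."""
    displacement = np.sum(command) * dt
    variance = np.sum((noise_scale * command * dt) ** 2)
    return displacement, variance


n, T, D = 1000, 1.0, 0.3                      # samples, duration (s), distance (m)
dt = T / n
tau = (np.arange(n) + 0.5) / n                # normalized time at sample midpoints

# Bell-shaped speed profile covering distance D.
smooth = (D / T) * 30 * tau**2 * (1 - tau)**2
# Same net distance, with an added oscillation of five full cycles.
jerky = smooth + (D / T) * 0.8 * np.sin(2 * np.pi * 5 * tau)

d_smooth, var_smooth = endpoint_stats(smooth, dt)
d_jerky, var_jerky = endpoint_stats(jerky, dt)
```

Both commands cover the same distance, but the oscillating command uses larger commands along the way and therefore accumulates more endpoint variance under this noise model.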


39.3 Modeling the Motor System - Internal Motor Models 


An internal model is a postulated neural process that 
simulates the response of the motor system in order to 
estimate the outcome of a motor command. The inter- 
nal model theory of motor control argues that the motor 
system is controlled by the constant interactions of the 
plant and the controller. The plant is the body part being 
controlled, while the internal model itself is considered 
part of the controller. Information from the controller, 
such as information from the CNS, feedback informa- 
tion, and the efference copy, is sent to the plant which 
moves accordingly. 


Internal models can be controlled through either 
feed-forward or feedback control. Feed-forward control 
computes its input into a system using only the cur- 
rent state and its model of the system. It does not use 
feedback, so it cannot correct for errors in its control. 
In feedback control, some of the output of the system 
can be fed back into the system’s input, and the sys- 
tem is then able to make adjustments or compensate 
for errors from its desired output. Two primary types 
of internal models have been proposed: forward models
and inverse models. In simulations, models can be
combined to solve more complex movement
tasks.
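
The difference between feed-forward and feedback control described above can be seen in a minimal simulation. The plant (a simple integrator), the constant unmodeled disturbance, the gain, and the function names are assumptions made for this sketch.

```python
def simulate(controller, target=1.0, disturbance=0.2, duration=2.0, dt=0.01):
    """Integrate a simple plant dx/dt = u + disturbance. The disturbance is
    unknown to the controller. Returns the final position error."""
    x = 0.0
    n_steps = int(round(duration / dt))
    for step in range(n_steps):
        u = controller(x, step * dt, target, duration)
        x += (u + disturbance) * dt
    return x - target


def feed_forward(x, t, target, duration):
    # Planned in advance from the initial state (0) and the model, which
    # assumes no disturbance; the sensed position x is ignored.
    return target / duration


def feedback(x, t, target, duration, gain=5.0):
    # Proportional correction based on the currently sensed error.
    return gain * (target - x)


ff_error = simulate(feed_forward)
fb_error = simulate(feedback)
```

With these assumed values the feed-forward controller cannot correct for the disturbance and overshoots the target, whereas the feedback controller compensates for most of it.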

The following section elaborates on the two internal
models, introduces the concept of optimization principles
and its use in modeling human motor behavior,
and presents a well-established motor control model for
2-D volitional hand movement.


39.3.1 Forward Models, Inverse Models, 
and Combined Models 


In their simplest form, forward models take the input 
of a motor command to the plant and output a pre- 
dicted position of the body. The motor command input 
to the forward model can be an efference copy. The out- 
put from that forward model, the predicted position of 
the body, is then compared with the actual position of 
the body. The actual and predicted position of the body 
may differ due to noise introduced into the system by 
either internal (e.g., body sensors are not perfect, sen- 
sory noise) or external (e.g., unpredictable forces from 
outside the body) sources. If the actual and predicted 
body positions differ, the difference can be fed back as 
an input into the entire system again so that an adjusted 
set of motor commands can be formed to create a more 
accurate movement. 
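
The loop described above (efference copy in, predicted position out, comparison with the actual position, discrepancy fed back) can be sketched as follows. The one-dimensional plant, the constant external force, the gains, and the way the prediction error is used to adjust later commands are all assumptions chosen for illustration.

```python
def forward_model(position, command, dt):
    """Predict the next position from the efference copy of the command
    (assumed internal model: position changes by command * dt)."""
    return position + command * dt


def plant(position, command, dt, external_force):
    """The actual body: same dynamics plus a force the model does not know."""
    return position + (command + external_force) * dt


def reach(target, use_prediction_error, steps=500, dt=0.01,
          gain=4.0, external_force=0.3, rate=0.2):
    """Reach toward target; optionally use the prediction error to adjust
    subsequent motor commands. Returns the final position error."""
    position = 0.0
    force_estimate = 0.0
    for _ in range(steps):
        command = gain * (target - position) - force_estimate
        predicted = forward_model(position, command, dt)   # from efference copy
        actual = plant(position, command, dt, external_force)
        prediction_error = actual - predicted
        if use_prediction_error:
            # Feed the discrepancy back to adjust the following commands.
            force_estimate += rate * (prediction_error / dt - force_estimate)
        position = actual
    return position - target


error_with_feedback = reach(1.0, use_prediction_error=True)
error_without_feedback = reach(1.0, use_prediction_error=False)
```

When the prediction error is ignored, the unmodeled force leaves a persistent offset from the target; when it is fed back, the adjusted commands remove that offset in this toy setting.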

Inverse models use the desired and actual position 
of the body as inputs to estimate the necessary motor 
commands that would transform the current position 
into the desired one. For example, in an arm reaching 
task, the desired position (or a trajectory of consecutive 
positions) of the arm is input into the postulated inverse 
model, and the inverse model generates the motor com- 
mands needed to control the arm and bring it into this 
desired configuration. 

Theoretical work has shown that in models of motor 
control, when inverse models are used in combination 
with a forward model, the efference copy of the motor 
command output from the inverse model can be used 
as an input to a forward model for further predictions. 
For example if, in addition to reaching with the arm, the 
hand must be controlled to grab an object, an efference 
copy of the arm motor command can be input into a for- 
ward model to estimate the arm’s predicted trajectory. 
With this information, the controller can then generate 
the appropriate motor command telling the hand to grab 
the object. It has been proposed that if they exist, this 
combination of inverse and forward models would al- 
low the CNS to take a desired action (reach with the 
arm), accurately control the reach, and then accurately 
control the hand to grip an object. 
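
The reach-and-grasp example can be sketched as a toy loop in which an inverse model produces the arm command and a forward model, fed with the efference copy, predicts when the arm will arrive so that the hand command can be issued. The one-dimensional dynamics, the command limit, the tolerance, and the function names are assumptions; the plant is assumed to match the forward model exactly.

```python
def inverse_model(current, desired, dt, max_command=1.0):
    """Command that would carry the arm from current to desired in one step
    under the assumed dynamics, limited to a maximum magnitude."""
    command = (desired - current) / dt
    return max(-max_command, min(max_command, command))


def forward_model(current, command, dt):
    """Predicted arm position after applying the command for one step."""
    return current + command * dt


def reach_and_grasp(object_position, dt=0.01, tolerance=1e-3, max_steps=1000):
    """Return the step at which the hand is told to close and the arm position."""
    arm = 0.0
    for step in range(max_steps):
        arm_command = inverse_model(arm, object_position, dt)
        # Efference copy of the arm command goes into the forward model.
        predicted_arm = forward_model(arm, arm_command, dt)
        close_hand = abs(predicted_arm - object_position) < tolerance
        # Plant update; here the plant is assumed identical to the model.
        arm = forward_model(arm, arm_command, dt)
        if close_hand:
            return step, arm
    return None, arm


grasp_step, final_arm = reach_and_grasp(0.3)
```

The hand command is triggered from the predicted arm position rather than from delayed sensory feedback, which is the role the text attributes to the combined models.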


39.3.2 Adaptive Control Theory 


With the assumption that new models can be acquired 
and pre-existing models can be updated, the efference 
copy is important for the adaptive control of a move- 
ment task. Throughout the duration of a motor task, an 
efference copy is fed into a forward model known as 
a dynamics predictor whose output allows prediction of 
the motor output. When applying adaptive control the- 
ory techniques to motor control, the efference copy is 
used in indirect control schemes as the input to the ref- 
erence model. 


39.3.3 Optimization Principles 


Optimization theory is a valuable integrative and pre- 
dictive tool for studying the interaction between the 
many complex factors, which result in the generation of 
goal-directed motor behavior. It provides a convenient 
way to formulate a model of the underlying neural com- 
putations without requiring specific details on the way 
those computations are carried out. The components of 
optimization problems are: a task goal (defined mathe- 
matically by a performance criterion or a cost function), 
a system to be controlled (a set of system variables that 
are available for modulation), and an algorithm capa- 
ble of finding an analytical or a numerical solution. By 
rephrasing the learning problem within the framework 
of an optimization problem, one is forced to make ex- 
plicit, quantitative hypotheses about the goals of motor 
actions and to articulate how these goals relate to ob- 
servable behavior. 

As indicated in the last section, goal-directed arm 
movements exhibit remarkable invariant properties de- 
spite the fact that a given point in space can be reached 
through an infinite number of spatial, articular, and 
muscle combinations. In order to account for this ob- 
servation it is necessary to postulate the existence of 
a regulator, i.e., a functional constraint, to reduce the 
number of degrees of freedom available to perform the 
task. Most of the regulators proposed during the last
decade refer to a general hypothesis that the nervous
system tries to minimize some cost related to the movement
performance. Nelson [39.11] first formulated this
idea in an operative way by proposing to use mathematical
cost functions to estimate the energy or other costs
incurred during a movement. This approach was further
developed by several investigators who proposed
different criteria such as, for instance, minimum muscular
energy, minimum effort, minimum torque, minimum
work, or minimum variance. A model that is
indisputably one of the most mentioned in the literature and
that has proven to be very powerful in describing multi-joint
movements is the minimum jerk model described
in the next section.


39.3.4 Kinematic Features 
of Human Hand Movements 
and the Minimum Jerk Hypothesis 


Human point-to-point arm movements that are restricted
to the horizontal plane tend to be straight and
smooth, with single-peaked, bell-shaped velocity profiles,
and are invariant with respect to rotation, translation,
and spatial or temporal scaling. Motor adaptation
studies in which unexpected static loads or velocity- 
dependent force fields were applied during horizontal 
reaching movements further supported the hypothesis 
that arm trajectories follow a kinematic plan formu- 
lated in an extrinsic Cartesian task-space. The mor- 
phological invariance of the movement in Cartesian 
space supported the hypothesis that the hand trajec- 
tory in task-space is the primary variable computed 
during movement planning. It is assumed that follow- 
ing the planning process, the CNS performs non-linear 
inverse kinematics computations, which convert time 
sequences of hand position into time sequences of joint
positions.

Fig. 39.4a,b Overlapped predicted (solid lines) and measured (dashed lines) hand paths (a1, b1), speeds (a2, b2), and acceleration
components along the y-axis (a3, b3) and along the x-axis (a4, b4) for two unconstrained point-to-point movements. (a) A movement

The kinematic features of one-joint goal-directed
movements were successfully modeled by the minimum
jerk hypothesis [39.13] and were later extended
to planar hand motion [39.12]. The minimum jerk
model states that the time integral of the squared magnitude
of the first derivative of the Cartesian hand
acceleration (jerk) is minimized,


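$$ C = \int_0^T \left\| \frac{d^3 \mathbf{r}}{dt^3} \right\|^2 dt , \qquad (39.1) $$

where r(t) = (x(t), y(t)) are the Cartesian coordinates
of the hand and T is the movement duration. The solution
of this variational problem, assuming zero velocity
and zero acceleration at the initial and final hand
locations r_i and r_f, is given by

$$ \mathbf{r}(t) = \mathbf{r}_i + \left( 10 \left(\frac{t}{T}\right)^3 - 15 \left(\frac{t}{T}\right)^4 + 6 \left(\frac{t}{T}\right)^5 \right) (\mathbf{r}_f - \mathbf{r}_i) . \qquad (39.2) $$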
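
The closed-form solution (39.2) is straightforward to evaluate. The following sketch samples the minimum jerk trajectory and its speed profile; the function names, the sampling density, and the example start and end points are assumptions chosen for illustration.

```python
import numpy as np


def minimum_jerk(r_i, r_f, T, n_samples=101):
    """Sample the minimum jerk trajectory between r_i and r_f over duration T,
    with zero velocity and acceleration at both ends."""
    r_i = np.asarray(r_i, dtype=float)
    r_f = np.asarray(r_f, dtype=float)
    t = np.linspace(0.0, T, n_samples)
    tau = t / T                                    # normalized time in [0, 1]
    s = 10 * tau**3 - 15 * tau**4 + 6 * tau**5     # scalar progress from 0 to 1
    r = r_i + s[:, None] * (r_f - r_i)             # straight-line hand path
    return t, r


def speed_profile(t, r):
    """Tangential hand speed estimated by finite differences."""
    velocity = np.gradient(r, t, axis=0)
    return np.linalg.norm(velocity, axis=1)


t, r = minimum_jerk([0.0, 0.0], [0.3, 0.4], T=1.0)
speed = speed_profile(t, r)
peak_index = int(np.argmax(speed))
```

The sampled path is a straight line and the speed profile is single-peaked and symmetric, with its maximum at mid-movement equal to 1.875 times the mean speed.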


The experimental setup and the comparison between
experimental data and the minimum-jerk model prediction
for hand paths, tangential velocities, and acceleration
components between different targets are depicted
in Fig. 39.4.

between targets 3 and 6. (b) A movement between targets 1 and 4 (after [39.12], courtesy of Flash)

39.3.5 The Minimum Jerk Model,
The Target Switching Paradigm,
and Writing-like Sequence
Movements


The stereotyped kinematic patterns of planar reaching 
movements are not the expression of a pre-wired or 
inborn motor pattern, but the result of learning during 
ontogenesis. When infants start to reach, their reaching 
is characterized by multiple accelerations and decelera- 
tions of the hand, while experienced infants reach with 
much straighter hand paths and with a single smooth 
acceleration and deceleration of the hand. It is possi- 
ble to decompose a large proportion of infant reaches 
into an underlying sequence of sub-movements that re- 
semble simple movements of adults. It is now believed 
that the CNS uses small, smooth sub-movements, com- 
monly known as motion primitives, which are smoothly 
concatenated in time and space, in order to construct 
more complicated trajectories. Motor primitives can be 
considered neural control modules that can be flexibly 
combined to generate a large repertoire of behaviors. 
A primitive may represent the temporal profile of a par- 
ticular muscle activity (low level, dynamic intrinsic 
primitive) or a geometrical shape in visually perceived 
Cartesian coordinates (high level, kinematic extrinsic 
primitive [39.14, 15]). The overall motor output will 
be the sum of all primitives, weighted by the level of 
activations of each module. A behavior for which the 
motor system has many primitives will be easy to learn, 
whereas a behavior that cannot be approximated by any 
set of primitives would be impossible to learn [39.16]. 
The biological plausibility of the primitives’ mod- 
ules model was demonstrated in studies on spinalized 
frogs and rats that showed that the pre-motor circuits 
within the spinal cord are organized into a set of discrete 
modules [39.17]. Each module, when activated, induces 
a specific force field, and the simultaneous activation 
of multiple modules leads to the vectorial combination
of the corresponding fields. Other evidence for the
existence of primitive sub-movements came from work
on hemiplegic stroke patients, which showed that the
patients’ first movements were clearly segmented and
exhibited a remarkably invariant speed vs. time profile.

The concept of superposition was further elaborated
and modeled for target switching experiments [39.18].
It was found that arm trajectory modification in a double
target displacement paradigm involves the vectorial
summation of two independent plans, each coding for
a maximally smooth point-to-point trajectory. The first
plan is the initial unmodified plan for moving between
the initial hand position and the first target location.
The second plan is a time-shifted trajectory plan that
starts and ends at rest and has the same amplitude and
kinematic form as a simple point-to-point movement
between the first and second target locations.

The minimum jerk model is also a powerful model
for predicting the generated trajectory when subjects
are instructed to generate continuous movements from
one target to another through an intermediate target.
It was also shown that, using the minimum jerk
model, human handwriting properties can be faithfully
reconstructed while specifying the velocities and the
positions at via-points, taken at maximum curvature
locations.

Understanding primitives may only be achieved
by investigating the neural correlates of sensorimotor
learning and control. We already know a lot about
the neural correlates of motor imagery and execution
as highlighted in Sects. 39.6 and 39.7, which may
provide a good starting point to investigate motion
primitives, but we will have to go beyond basic correlates
to understand the time-dependent, non-linear
relationship among various neural correlates of motor
learning and control. This will involve new experimental
paradigms and computational methods. The
following section overviews investigations into sensorimotor
learning.


Motion planning strategies may also change with learning.
If a task is performed for the first time, the only
strategy the CNS might follow is to develop a plan
that allows the execution of the task without taking
into account the computational cost. Repeated performance
might result in a change in the coding of the
movement and produce a more optimal behavior at
a lower computational cost. Thus practice, the route
to perfection, allows the performance of many tasks to
improve with repetition throughout life.

Even in adulthood, simple tasks such as reaching towards
a target or rapidly and accurately tapping a short
sequence of finger movements, which appear when
mastered to be effortlessly performed, often require extensive
training before skilled performance develops.
Performance gains asymptote after a long training
period and are usually retained for years to come.
Many studies have focused on different aspects of
motor learning: time scale in motor learning and development,
task and effector specificity, effects of attention
and intention, and explicit vs. implicit motor learning.
These topics are discussed in the next section in the
context of motor sequence learning.


39.4.1 Explicit Versus Implicit 
Motor Learning 


When considering sequence learning one needs to dis- 
tinguish between explicit and implicit learning. Ex- 
plicit learning is frequently assumed to be similar to 
the processes which operate during conscious problem 
solving, and includes: conscious attempts to construct 
a representation of the task; directed search of mem- 
ory for similar or analogous task relevant information; 
and conscious attempts to derive and test hypotheses 
related to the structure of the task. This type of learn- 
ing has been distinguished from alternative models of 
learning, termed implicit learning. The term implicit 
learning denotes learning phenomena in which more 
or less complex structures are reflected in behavior, al- 
though the learners are unable to verbally describe these 
structures. Numerous studies have examined implicit 
learning of serial-order information using the serial
reaction time (SRT) task. In this task, learning is revealed
as a decrease in reaction times for stimuli that follow
a repeating sequence relative to those presented in
a random order.
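
As a hedged illustration of how such an SRT learning effect
could be quantified, the following minimal Python sketch
compares median reaction times for sequenced and random
trials. The function name, the data layout, and the example
values are assumptions made for illustration only; they are
not taken from any of the studies discussed here.

```python
from statistics import median


def srt_learning_effect(trials):
    """Return median RT(random) - median RT(sequence) in milliseconds.

    `trials` is a list of (condition, reaction_time_ms) pairs, where
    condition is "sequence" or "random". A positive value means faster
    responses to sequenced stimuli, i.e., evidence of sequence learning.
    """
    seq = [rt for cond, rt in trials if cond == "sequence"]
    rnd = [rt for cond, rt in trials if cond == "random"]
    if not seq or not rnd:
        raise ValueError("need at least one trial per condition")
    return median(rnd) - median(seq)


# Hypothetical reaction times (ms), for illustration only.
example_trials = [
    ("sequence", 410.0), ("sequence", 395.0), ("sequence", 402.0),
    ("random", 455.0), ("random", 470.0), ("random", 448.0),
]
effect_ms = srt_learning_effect(example_trials)
```

The median is used here rather than the mean because
reaction-time distributions are typically skewed; this is
a design choice of the sketch, not a requirement of the task.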

There is a vast literature debating what is really 
learned in the SRT task. The description of a given se- 
quence structure is from a theoretical point of view not 
trivial because a given structure typically has several 
different structural components. Implicit learning may 
depend on each of these structural components. In se- 
quence learning tasks these components may pertain to: 
frequency-based, statistical structures (i.e., redundancy),
relational structures, and temporal and spatial patterns.
A literature review shows that all of these components
influence the rate at which a sequence is learned.

Neuropsychological research suggests that implicit 
sequence learning in the SRT task is spared in pa- 
tients with organic amnesia, so implicit SRT learning 
does not appear to depend on the medial temporal 
and diencephalic brain regions that are critical for
explicit memory. Conversely, patients with Huntington’s or
Parkinson’s disease have consistently shown SRT
impairments, so the basal ganglia seem to be critically
involved in SRT learning. Recent studies indicate that 
the anterior striatum affects learning of new sequences 
while the posterior striatum is engaged in recalling 
a well-learned sequence. In the following section the 
discussion is restricted to explicit motor learning. 


39.4.2 Time Phases in Motor Learning 


It is reasonable to assume that a gain in a motor per- 
formance reflects a change in brain processing which 
is triggered by practice. The fact that many skills,
once acquired, are retained over long time intervals
suggests that training can induce long-lasting neural
changes. Previous results from neuroimaging studies in
which performance was modified over time have shown
that different learning stages can be defined by altered
brain activation patterns. As an effect of repetition or
practice, several studies report that specific brain ar- 
eas showed an increase in the magnitude or extent of 
activation. Motor skill learning (e.g., sequential finger 
opposition tasks) requires prolonged training times and 
has two distinct phases, analogous to those subserving 
perceptual skill learning: an initial, fast improvement 
phase (fast learning) in which the extent of activation 
in the M1 area decreases (habituation-like response) 
and a slowly evolving, post training incremental per- 
formance gain (slow learning), in which the activation 
in M1 increases compared to control conditions [39.19]. 


39.4.3 Effector Dependency 


Another fascinating enigma in the realm of motor learn- 
ing is whether the representation of procedural memory 
in the brain changes throughout training and whether 
different neural correlates underlie the different
learning stages. A study conducted on monkeys, in which
a sequence of ten button presses was learned by trial and
error, showed that two performance measures, key press
errors and reaction time (RT), improved over different
time courses [39.20]. The key press errors reached an
asymptote within a shorter period of training than the
RTs, which continued to decrease over a longer period.
This finding suggested that the acquisition of sequence
knowledge (as measured by key press errors) may take
place quickly, whereas long-term motor sequence learning
(as measured by RT) may take longer to be established;
thus, different aspects of the task are learned on
different time scales. Further studies on monkeys and
humans demonstrated that although effector-dependent and
effector-independent learning occur simultaneously, the
effector-dependent representation might take longer to
establish than the effector-independent representation.


39.4.4 Coarticulation 


After a motor sequence is extensively trained, most
subjects show implicit or explicit anticipation, which
results in coarticulation, that is, the spatial and
temporal overlap of adjacent articulatory activities. It is
well known that as we learn to speak, our speech be- 
comes smooth and fluent. Coarticulation in speech pro- 
duction is a phenomenon in which the articulator move- 
ments for a given speech sound vary systematically 
with the surrounding sounds and their associated move- 
ments. Several models have tried to predict the move- 
ments of fleshy tissue points on the tongue, lips, and 
jaw during speech production. Coarticulation has also 
been studied in the hand motor sequence. It was shown 
that pianists could anticipate a couple of notes before 
playing, which resulted in hand and finger kinematic 
divergence (assuming an anticipatory position) prior to 
the depression of the last common note. Such a diver- 
gence implies an anticipatory modification of sequen- 
tial movements of the hand, akin to the phenomenon 
of coarticulation in speech. Moreover, studies on
fluent finger spelling have shown that, rather than simply
an interaction whereby a preceding movement affects the
one following, the anticipated movement in a sequence
can systematically affect the one preceding it.


39.4.5 Movement Cuing 


Another important aspect of motor sequence
performance is the type of movement cuing, external or
internal. As internally cued movements are initiated at
the subject’s will, they have, by definition, predictable
timing. Externally triggered movements are performed in
response to go signals; hence, they have unpredictable
timing (unless the timing of the go signal is not random
and follows some temporal pattern that can be learned
implicitly or explicitly). Studies on movement cuing in
animals and patients with movement disorders have shown
that the
basal ganglia are presumably internal-cue generators and 
that they are preferentially connected with the supple- 
mentary motor area (SMA), an area that is concerned 
more with internal than with external motor initiation. 
In normal subjects the type of movement cuing influ- 
ences movement execution and performance. It has also 
been shown that teaching Parkinson’s disease (PD)
patients, who are impaired with respect to tasks involving
the spontaneous generation of appropriate strategies, to
initiate movements concurrently with an external cue
improved their motor performance.

The preceding sections have provided a brief 
overview of the extensive literature available on un- 
derstanding the motor system from an experimental 
psychophysics and model-based perspective. A focus 
on general high level modeling is critical to under- 
standing motor control; however, the problem is being 
tackled from other perspectives, namely understanding
the details of the neurophysiology and electrophysiology
of brain regions and neural pathways involved in controlling
motor function. In the context of developing brain- 
computer interfaces there have been significant efforts 
focused on understanding small network populations 
and structural, functional, and electrophysiological cor- 
relates of motor functions using epidural and subdural 
recordings, as well as non-invasively recorded elec- 
troencephalography (EEG), magnetoencephalography 
(MEG), and magnetic resonance imaging-based (MRI) 
technologies. Understanding the differences between
imagined movement and motor execution, as well as
the effects of movement feedback and no feedback, has
shed light on motor functioning. The following sections
provide a snapshot of some recent findings.


39.5 MRI and the Motor System - Structure and Function 


A new key phase of research is beginning to investigate 
how functional networks relate to structural networks, 
with emphasis on how distributed brain areas commu- 
nicate with each other [39.21]. Structural methods have 
been powerful in indicating when and where changes 
occur in both gray and white matter with learning and 
recovery [39.22] and disease [39.23]. Here we review 
some of the findings in sensorimotor systems with an
emphasis on elucidating regions engaged in motor
execution and motor imagery (imagined movement), and
motor sequence learning.

Even with identical practice, no two individuals are 
able to reach the same level of performance on a motor 
skill — nor do they follow the same trajectory of im- 
provement as they learn [39.24]. These differences are 
related to brain structure and function, but individual
differences in structure have rarely been explored.
Studies have shown individual differences in white matter
(WM) supporting visuospatial attention, motor cortical
connectivity through the corpus callosum, and
connectivity between the motor regions of the cerebellum and
motor cortex. Steele et al. [39.24] studied the structural
characteristics of the brain regions that are functionally 
engaged in motor sequence performance along with the 
fiber pathways involved. Using diffusion tensor
imaging (DTI), probabilistic tractography, and voxel-based
morphometry they aimed to determine the structural
correlates of skilled motor performance. DTI is used to
assess white matter integrity and perform probabilistic
tractography.

Fractional anisotropy (FA) is affected by WM 
properties, including axon myelination, diameter, and 
packing density. Differences in these properties may 
lead to individual differences in performance through 
pre-existing differences or training-induced changes in 
axon conduction velocity and synaptic synchronization, 
or density of innervation [39.24, 25]. Greater fiber in- 
tegrity along the superior longitudinal fasciculus (SLF) 
would be consistent with the idea that greater myelina- 
tion observed in relation to performance may underlie 
enhancements in synchronized activity between task- 
relevant regions. 
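
The text does not give a formula for FA. As a minimal sketch,
assuming the standard definition of FA in terms of the three
eigenvalues of the diffusion tensor, the measure can be
computed as follows. The function name and the example
eigenvalues are illustrative assumptions and do not come
from the studies cited above.

```python
import math


def fractional_anisotropy(l1, l2, l3):
    """Fractional anisotropy from the three diffusion tensor eigenvalues.

    FA = sqrt(1/2) * sqrt((l1-l2)^2 + (l2-l3)^2 + (l3-l1)^2)
                   / sqrt(l1^2 + l2^2 + l3^2)
    FA is 0 for isotropic diffusion and 1 for diffusion along one axis.
    """
    numerator = (l1 - l2) ** 2 + (l2 - l3) ** 2 + (l3 - l1) ** 2
    denominator = l1 ** 2 + l2 ** 2 + l3 ** 2
    if denominator == 0:
        return 0.0  # no diffusion at all: treat as isotropic
    return math.sqrt(0.5 * numerator / denominator)


# Hypothetical eigenvalues (mm^2/s) for an elongated tensor, illustration only.
fa_example = fractional_anisotropy(1.7e-3, 0.3e-3, 0.2e-3)
fa_isotropic = fractional_anisotropy(1.0, 1.0, 1.0)   # equal eigenvalues
fa_linear = fractional_anisotropy(1.0, 0.0, 0.0)      # single diffusion axis
```

Because FA is a ratio of eigenvalue differences to eigenvalue
magnitude, it is insensitive to the overall scale of diffusion,
which is one reason it is read as an index of fiber organization
rather than of diffusivity itself.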

Voxel-based morphometry is used to assess gray 
matter (GM) volume. Individual differences in GM vol- 
ume may be influenced by multiple factors such as 
neuronal and glial cell density, synaptic density, vascu- 
lar architecture, and cortical thickness [39.26]. 

The majority of structural studies of individual 
differences find that better performance is associated 
with higher FA or greater GM volume. Individual 
differences in structural measures reflect differences in 
the microstructural organization of tissue related to task 
performance. A greater FA, an index of fiber integrity, 
may represent a greater ability for neurons in connected 
regions to communicate. Steele et al. [39.24] found 
enhanced synchronization performance on a temporal 
motor sequence task related to greater fiber integrity 
of the SLF, where the rate of improvement on
synchronization was positively correlated with GM volume
in cerebellar lobules HVI and V, regions that showed
training-related decreases in activity in the same
sample. The synchronization performance on the task
was negatively correlated with FA in WM underlying 
the bilateral sensorimotor cortex, in particular within 
the bilateral corticospinal tract (CST), such that partic- 
ipants with greater final synchronization performance 
on the tasks had lower FA in these clusters. 


The results provide clear evidence of the impor- 
tance of structure in learning skilled tasks and that 
a larger corticospinal tract does not necessarily mean 
better performance. Enhanced fiber integrity in the SLF 
may result in reduced FA in regions where it crosses 
the CST and, therefore, there is a trade-off between 
the two in the region of the CST-SLF fiber crossing, 
which enables better performance for some motor im- 
agery and BCI participants — and is consistent with the 
idea of enhanced communication/synchronization be- 
tween regions that are functionally important for this 
task. The causes of inter-individual variability in brain 
structure are not fully understood, but are likely to 
include pre-existing genetic contributions and contribu- 
tions from learning and the environment [39.24]. Ullén 
et al. [39.27] attempted to address this by investigating 
whether millisecond variability in a simple, automatic 
timing task, isochronous finger tapping, correlates with 
intellectual performance and, using voxel-based mor- 
phometry, whether these two tasks share neuroanatom- 
ical substrates. Volumes of overlapping right prefrontal 
WM regions were found to be correlated with both sta- 
bility of tapping and intelligence. These results suggest 
a bottom-up explanation where extensive pre-frontal 
connectivity underlies individual differences in both 
variables as opposed to top-down mechanisms such as 
attentional control and cognitive strategies. 

Sensorimotor rhythm modulation is the most pop- 
ular BCI control strategy, yet little is known about the 
structural and functional differences that separate motor 
areas related to motor output from higher-order motor 
control areas or about the functional neural correlates 
of high-order control areas during voluntary motor con- 
trol. EEG and fMRI studies have shown the extent 
of motor regions that are active along with the tem- 
poral sequence of activations across different motor 
areas during a motor task and across different sub- 
jects [39.28, 29]. Ball et al. [39.28] have shown that all 
subjects in an EEG/fMRI study involving finger flex- 
ion had highly activated primary motor cortex areas 
along with activation of the frontal medial wall mo- 
tor areas. They also showed that some subjects had 
anterior type activations as opposed to posterior acti- 
vation for others, with some showing activity starting 
in the anterior cingulate motor area (CMA) and then 
shifting to the intermediate supplementary motor
areas. The time sequence of these activations showed
that approximately 120 ms before movement onset there
was a drop in source strength in conjunction with an
immediate increase of source strength in the M1 area.
Those who showed more posterior activations

Fig. 39.5a-c Brain activation during the motor imagery task (a), motor observation task (b), and motor execution task (c), showing mean
activation of all participants (A), high aptitude users and low aptitude users individually (B), and the contrast between high
aptitude users and low aptitude users (C). The figure illustrates the maximum contrast between low aptitude and high aptitude
BCI users (after [39.29], courtesy of Halder)


were restricted to the posterior SMA. Some subjects 
showed activation of the inferior parietal lobe (IPL) 
during early movement onset. In all subjects showing 
activation of higher-order motor areas (anterior CMA, 
intermediate SMA, IPL), these areas became active be- 
fore the executive motor areas (M1 and posterior SMA). 
A number of these areas are related to attentional pro- 
cessing, others to triggering and others to executing. 
Understanding the sequence of these events for each 
individual in the context of rehabilitation and more ad- 
vanced brain and neural computer interfacing will be 
important. 

The neural mechanisms of brain-computer
interface control were investigated by [39.29] in an fMRI
study. It was shown that up to 30 different motor 
sites are significantly activated during motor execu- 
tion, motor observation, and motor imagery and that 
the number of activated voxels during motor observa- 
tion was significantly correlated with accuracy in an 
EEG sensorimotor rhythm-based (SMR) BCI task (see 
Sect. 39.7.1 for further details on SMR). Significantly 
higher activations of the supplementary motor areas for 
motor imagery and motor observation tasks were
observed for high aptitude BCI users (see Fig. 39.5 for
an illustration [39.29]). The results demonstrate that
acquisition of the sensorimotor program reflected in 
SMR-BCI control (Sect. 39.7.1) is tightly related to the 
recall of such sensorimotor programs during observa- 
tion of movements and unrelated to the actual execution 
of these movement sequences. 

Using such knowledge about sensorimotor control 
will be critical in understanding and developing suc- 
cessful learning and control models for robotic devices 
and BCIs, and fully closing the sensorimotor learn- 
ing loop to enable finer manipulation abilities using 
BCIs and for retraining or enabling better relearn- 
ing of motor actions after cortical damage. Under- 
standing the neuroanatomy involved in motor execu- 
tion/imagery/observation may also provide a means of 
enhancing our knowledge of motion primitives and 
their neural correlates as discussed in Sect. 39.4. MRI
and fMRI, however, only provide part of the picture, at
the level of large networks of neurons and on relatively
large time scales. Invasive electrophysiology, by contrast,
can target specific neuronal networks at millisecond
time resolution. The following section highlights some
of the most recent findings from motor cortical surface 
potentials investigations.

The electroencephalogram (EEG) is derived from the
action potentials of millions of neurons firing
electrical pulses simultaneously. The human brain has more
than 100 000 000 000 (10¹¹) neurons and thousands of
spikes (electrical pulses) are emitted each millisecond.
EEG reflects the aggregate activity of millions of in- 
dividual neurons recorded with electrodes positioned 
in a standardized pattern on the scalp. Brainwaves 
are categorized into a number of different frequency 
bands including delta (1–4 Hz), theta (5–8 Hz),
alpha (8–12 Hz), mu (8–12 Hz), beta (13–30 Hz), and
gamma (> 30 Hz). Each of these brain rhythms can
be associated with various brain processes and
behavioral states; however, knowledge of exactly where
brainwaves are generated in the brain, and if/how they 
communicate information, is very limited. By studying 
brain rhythms and oscillations we attempt to answer 
these questions and have realized that brain rhythms 
underpin almost every aspect of information process- 
ing in the brain, including memory, attention, and even 
our intelligence. We also observe that abnormal brain 
oscillations may underlie the problems experienced in 
diseases such as epilepsy or Alzheimer’s disease and 
we know that certain changes in brain rhythms and 
oscillations are good indicators of brain pathology as- 
sociated with these diseases. If we know more about the 
function of brainwaves we may be able to develop bet- 
ter diagnosis and treatments of these diseases. It may 
also lead to better computational tools and better bio- 
inspired processing tools to develop artificial cognitive 
systems. 
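
To make the band definitions above concrete, the following
minimal sketch estimates the power in each band from a short
signal using a naive discrete Fourier transform. It uses the
band limits listed in the text, except that gamma is capped at
an assumed 45 Hz so that it has an upper edge, and mu is
omitted because it shares the alpha range. The function names,
the sampling rate, and the synthetic 10 Hz test signal are
illustrative assumptions; a practical analysis would normally
use an FFT-based spectral estimator on real recordings.

```python
import cmath
import math

# Band limits in Hz as listed in the text; the 45 Hz gamma cap is an assumption.
BANDS_HZ = {
    "delta": (1, 4),
    "theta": (5, 8),
    "alpha": (8, 12),
    "beta": (13, 30),
    "gamma": (30, 45),
}


def power_spectrum(x, fs):
    """One-sided periodogram via a naive DFT (adequate for short signals)."""
    n = len(x)
    mean = sum(x) / n
    x = [v - mean for v in x]  # remove the DC offset
    freqs, power = [], []
    for k in range(n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * i / n)
                    for i, v in enumerate(x))
        freqs.append(k * fs / n)
        power.append(abs(coeff) ** 2 / n)
    return freqs, power


def band_powers(x, fs, bands=BANDS_HZ):
    """Sum periodogram power inside each band (edges inclusive)."""
    freqs, power = power_spectrum(x, fs)
    return {name: sum(p for f, p in zip(freqs, power) if lo <= f <= hi)
            for name, (lo, hi) in bands.items()}


fs = 128  # assumed sampling rate in Hz
signal = [math.sin(2 * math.pi * 10 * i / fs) for i in range(2 * fs)]  # 10 Hz
powers = band_powers(signal, fs)
dominant = max(powers, key=powers.get)
```

For the synthetic 10 Hz sinusoid the alpha band carries
essentially all of the power. Note that adjacent bands share
their edge frequencies in this sketch, so a component exactly
at 8 Hz or 30 Hz would be counted in two bands.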

Brain rhythmic activity can be recorded non- 
invasively from the scalp as EEG or intracranially from 
the surface of the cortex as cortical EEG or the electro- 
corticogram (EEG is described in Sect. 39.7). 

Electrocorticography (ECoG), involving the clinical
placement of electrode arrays on the brain surface
(usually above the dura), enables the recording of
large-scale field potentials that, as in EEG, are primarily
derived from the aggregate synaptic potential of
large neuronal populations, whereby synaptic current
produces a change in the local electric field. ECoG
can characterize local cortical potentials with high
spatiotemporal precision (0.5 cm² in ECoG compared to
1 cm² in EEG) and high amplitudes (10–200 µV in
ECoG compared to 10–100 µV in EEG). Furthermore,
the ECoG spectral content can reach 300 Hz (compared
to 60 Hz in EEG) due to the closer vicinity of the
electrodes to the electric source (the non-homogeneous,
anisotropic brain volume and tissues act as a low-pass
filter). Independent individual finger movement
dynamics can be resolved at the 20 ms time scale, which
has been shown not to be possible with EEG (but has 
recently been demonstrated using MEG [39.30] as de- 
scribed in Sect. 39.7). Here we review some of the latest 
findings of ECoG studies involving human sensorimo- 
tor systems. 

The power spectral density (PSD) of the cortical 
potential can reveal properties within neuronal
populations. Peaks in the PSD indicate activity that is
synchronized across a neuronal population; for example,
movement decreases the lateral frontoparietal alpha
(8–12 Hz) and beta (12–30 Hz) rhythm amplitudes
with limited spatial specificity, whereas high gamma
changes, which are spatially more focused, are also
observable during motor control. Miller et al. [39.31],
however, observed through a range of studies inves- 
tigating local gamma band-specific cortical process- 
ing, a lack of distinct peaks in the cortical potential 
PSD beyond 60 Hz and hypothesized the existence of 
broadband changes across all frequencies that were
obscured at low frequencies by covariate fluctuations in θ
(4–7 Hz), α (8–12 Hz), and β (13–30 Hz) band
oscillations. They demonstrated that there is a phenomenon
that obeys a broadband, power-law form extending 
across the entire frequency range. Even with local brain 
activity in the gamma band there is an increase in power 
across all frequencies, and the power law shape is con- 
served. This suggests that there are phenomena with 
no special timescale where the neuronal population be- 
neath does not oscillate synchronously but may simply 
reflect a change in the population mean firing rate. 
Miller et al. [39.31] postulated that the power-law
scaling during high γ activity is a reflection of changes in
asynchronous activity and not necessarily synchronous,
rhythmic, action potential activity changes, as is often
hypothesized.
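
The idea that a broadband increase in power leaves the
power-law shape unchanged can be illustrated with a simple
log-log regression. The sketch below fits P = A / f^χ to
a synthetic spectrum and then to the same spectrum scaled by
a constant factor. The function name, the frequency range, the
exponent of 3, and the scaling factor of 2 are arbitrary
assumptions chosen for illustration; they are not values
reported in [39.31].

```python
import math


def fit_power_law(freqs, power):
    """Least-squares fit of log10(P) = log10(A) - chi * log10(f).

    Returns (A, chi) for the model P = A / f**chi.
    """
    xs = [math.log10(f) for f in freqs]
    ys = [math.log10(p) for p in power]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return 10 ** intercept, -slope


# Synthetic spectrum with an arbitrarily chosen exponent (illustration only).
freqs = [float(f) for f in range(80, 301, 10)]
rest_power = [5.0 * f ** -3.0 for f in freqs]
active_power = [2.0 * p for p in rest_power]  # broadband increase at all f

amplitude_rest, chi_rest = fit_power_law(freqs, rest_power)
amplitude_active, chi_active = fit_power_law(freqs, active_power)
```

In this sketch the fitted exponent is the same in both
conditions and only the amplitude term changes, which is the
signature of a broadband shift as opposed to a change in
a band-limited rhythm.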

These findings suggest a fundamentally different 
approach to the way we consider the cortical poten- 
tial spectrum: power-law scaling reflects asynchronous, 
averaged input to the local neural population, whereas 
changes in characteristic brain rhythms reflect synchro- 
nized populations that coherently oscillate across large 
cortical regions. Miller et al. [39.31] also augment the 
findings by demonstrating power-law scaling in sim- 
ulated cortical potentials using small-scale, simplified 
integrate and fire neuronal models, an example of which 
is shown in Fig. 39.6.

Fig. 39.6a-d An illustration of how the power-law phenomenon in the cortical potential might be generated based on a simulation
study (see [39.31] for details). Panel (d) shows the PSD of this signal and has a knee at approximately 70 Hz, with a power law of P ~ 1/f⁴, which
would normally be observed in ECoG PSD. The change in the spectra with increasing mean spike rate of synaptic input strongly 
resembles the change observed experimentally over motor cortex during activity, as demonstrated in [39.31] (after [39.31] with 
permission from Miller) 


Knowledge of this power-law scaling in the brain 
surface electric potential was subsequently exploited in 
a number of further studies investigating differences 
in motor cortical processing during imagined and ex- 
ecuted movements [39.32] and the role of rhythms and 
oscillations in sensorimotor activations [39.33]. 

As outlined, motor imagery to produce volitional 
neural signals to control external devices and for re- 
habilitation is one of the most popular approaches 
employed in brain-computer interfaces. As highlighted 
in the previous section neuroimaging using hemody- 
namic markers (positron emission tomography (PET) 
and fMRI) and extra cerebral magnetic and electric 
field studies (MEG and EEG) have shown that mo- 
tor imagery activates many of the same neocortical 
areas as those involved in planning and execution of 
movements. Miller et al. [39.32] studied the execution- 
imagery similarities with electrocorticographic cortical 
surface potentials in eight human subjects during overt 
action and kinaesthetic imagery of the same movement 
to determine what and where are the neuronal sub- 
strates that underlie motor imagery-based learning and 
the congruence of cortical electrophysiologic change 
associated with motor movement and motor imagery. 
The results show that the spatial distribution of acti- 
vation significantly overlaps between hand and tongue 
movement in the lower frequency bands (LFB) but not 
in the higher frequency bands (HFB), whereas during 


Fig. 39.7a-e An illustration of modes of neural activity 
with cortical beta rhythm states. (a) Modulation of broad- 
band amplitude by underlying rhythm can be thought of as 
population-averaged spike-field interaction. (b) Released 
cortex demonstrates a small amount of broadband power 
coupling to underlying rhythm phase, and the underlying 
spiking from pyramidal neurons is high in rate and only 
weakly coupled to the underlying rhythm phase. (c) Sup- 
pressed cortex demonstrates less broadband power but with 
higher modulation by the underlying rhythm, while under- 
lying single unit spiking is low in rate but tightly coupled 
to the rhythm phase. (d) A simplified heuristic for how 
rhythms might influence cortical computation: During ac- 
tive computation, pyramidal neurons (PN) engage in asyn- 
chronous activity, where mutual excitation has a sophisti- 
cated spatio-temporal pattern. Averaged across the popula- 
tion, the ECoG signal shows broadband increase, with neg- 
ligible beta. (e) During resting state, cortical neurons, via 
synchronized interneuron (IN) input, are entrained with the 
beta rhythm, which also involves extracortical circuits
symbolized by the input from a synchronizing neuron in the
thalamus (TN). The modulation of local activity with rhythms
is revealed in the ECoG by significant broadband
modulation with the phase of low frequency rhythms (after [39.33],
courtesy of Miller; see [39.33] for further details)


kinaesthetic imagination of the same movement task the
magnitude of spectral changes was smaller (26% less [...]
spatially broad decrease in power in the LFB and the
focal increase in HFB power were similar for movement
and imagery. [...] real-time feedback of the magnitude of
cortical activation of a particular electrode, in the form
of cursor movement on screen, the spatial distribution of
HFB activity was quantitatively conserved in each case,
but the
magnitude of the imagery associated spectral change in- 
creased significantly and, in many cases, exceeded that 
observed during actual movement. The spatially broad 
desynchronization in LFB is consistent with EEG- 
based imagery, which uses α/β desynchronization as
a means of cursor control in BCIs [39.19]; however,
the results demonstrate that this phenomenon reflects
an aspect of cortical processing that is
fundamentally non-specific. LFB desynchronization may reflect
altered feedback between cortical and subcortical struc- 
ture with a timescale of interaction that corresponds 
to the peak frequency in the PSD as opposed to local, 
somatotopically distinct, population-scale computation. 
Miller et al. [39.32] speculate that the significant LFB 
power difference during movement and imagery might 
be a correlate of a partial release of cortex by subcor- 
tical structures (partial decoherence of a synchronized 
corticothalamic circuit) as opposed to a complete re- 
lease during actual movement or after motor imagery 
feedback. 

The HFB change is reflective of a broadband PSD 
increase that is obscured at lower frequencies by the 
motor-associated α/β rhythms but which has been
specifically correlated with local population firing rate 
and is observed in a number of spatially overlapping 
areas, including primary motor cortical areas for both 
movement and imagery. These findings have been used 
for much speculation about the neural substrates and 
electrophysiology of movement control. The results
clearly demonstrate the congruence in large-scale
activation between motor imagery, overt movement, and
imagery-based feedback; the overlapping activation in
distributed circuits during movement and imagery; the
clear role of the primary motor cortex during motor
imagery; and the role of feedback in the augmentation of
widespread neuronal activity during motor imagery. The
study also provided electrocorticographic evidence of the
role of primary motor areas during motor imagery, which
complements the EEG and neuroimaging studies showing
primary motor activation during imagery and movement,
such as those outlined in the previous section. The
dramatic augmentation given by feedback, particularly in
the primary motor cortex, is significant in the context of
BCI training because it demonstrates a restructuring of
neuronal dynamics across a whole population in the motor
cortex on very short time scales (< 10 min) [39.32].
This augmentation and restructuring can, indeed, result
in improved motor imagery performance over time but
leads to the necessity to co-adapt the BCI signal
processing to cope with associated non-stationary drifts in
the resulting oscillations of cortical potentials (a topic
to which we return in Sect. 39.9).

Human motor behaviors such as reaching, reading, 
and speaking are executed and controlled by somato- 
motor regions of the cerebral cortex, which are located 
immediately anterior and posterior to the central sul- 
cus [39.33]. Electrical oscillations in the lower beta 
band (12—20Hz) have been shown to have an in- 
verse relationship to motor production and imagery, 
decreasing during movement initiation and production 
and rebounding (synchronization) following movement 
cessation and during imagery continuation in the peri- 
central somatomotor and somatosensory cortex. 

Investigations have been conducted to determine
whether beta rhythms play an active role in the
computations taking place in somatomotor cortex or whether
they are an epiphenomenon of cortical state changes
influenced by other cortical or subcortical processes [39.33].
There is strong correlation between the firing time of 
individual neurons in the primary motor cortex and 
the phase of the beta rhythms in the local field poten- 
tial [39.34]. Miller et al. [39.33] have acquired ECoG 
evidence of the role of beta rhythms in the organi- 
zation of the somatomotor function by analyzing the 
broadband spectral power on fast time scales (tens 
of milliseconds) during rest (visual fixation) and fin- 
ger flexion. The results show that cortical activity 
has a robust entrainment on the phase of the beta 
rhythms, which is predominant in peri-central motor ar- 
eas whereby broadband spectral changes vary with the 
phase of underlying motor beta rhythm. This relation- 
ship between beta rhythms and local neuronal activity 
is a property of the idling brain (present during resting 
and selectively diminished during movement). Specifi- 
cally, Miller et al. [39.33] propose that the predominant 
pattern for the beta range shows a tendency for brain 
activity, as measured by broadband power, to increase 
just prior to the surface negative phase and decrease just 
prior to the surface positive phase of the beta rhythm, 
which they refer to as rhythmic entrainment. The
predominant phase couplings for the θ/α/β ranges are found
to be different and have different spatial localizations.
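
One simple way to quantify how strongly broadband power
follows the phase of an underlying rhythm is an
amplitude-weighted mean vector. The sketch below is not the
method of [39.33]; it assumes that the rhythm phase and the
broadband power have already been extracted as two time series
of equal length (for example, by band-pass filtering and
a Hilbert transform, which are not shown). The function name,
the synthetic data, and the preferred phase of π are
illustrative assumptions.

```python
import cmath
import math


def phase_coupling(phases, amplitudes):
    """Amplitude-weighted mean vector of rhythm phase.

    Returns (strength, preferred_phase): strength lies in [0, 1] and is 0
    when broadband amplitude is unrelated to phase; preferred_phase (rad)
    is the rhythm phase at which amplitude tends to be largest.
    """
    vector = sum(a * cmath.exp(1j * p) for p, a in zip(phases, amplitudes))
    total = sum(amplitudes)
    return abs(vector) / total, cmath.phase(vector)


# Synthetic example: phases sampled uniformly over one full cycle.
n = 1000
phases = [2 * math.pi * k / n - math.pi for k in range(n)]
# Broadband power that peaks at an arbitrarily chosen phase of pi.
coupled_power = [1.0 + 0.5 * math.cos(p - math.pi) for p in phases]
flat_power = [1.0] * n  # no dependence on phase

strength_coupled, preferred = phase_coupling(phases, coupled_power)
strength_flat, _ = phase_coupling(phases, flat_power)
```

The measure assumes that phases are sampled roughly uniformly
around the cycle; with strongly non-uniform phase sampling it
is biased and a binned comparison of power across phase would
be a safer choice.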

Miller et al. [39.33] proposed a suppression through 
synchronization hypothesis, whereby diffuse cortical 
inputs originating from subcortical areas might func- 
tionally suppress large regions of the cortex, the ad- 
vantage of which is to enable selective engagement 
of task-relevant and task-irrelevant brain circuits and 
for dynamical reallocation of metabolic resources. This 
shifting entrainment suggests that the β-rhythm is not
simply a background process that is suppressed during
movement, but rather that the beta rhythm plays an ac- 
tive and important role in motor processing. In recent 
years, there has been a growing focus on coupling between 
neuronal firing and rhythmic brain activity, and 
this study provides substantial evidence and methodol- 
ogy to support the important role of brain rhythms in 
neuronal functioning. 


39.7 MEG and EEG - Extra Cerebral Magnetic and Electric Fields of the Motor System 


The previous section highlighted a number of the most 
recent examples of ECoG-based studies that are shed- 
ding more light on the way in which the motor system 
processes information and is activated during imagery 
and movement. As electrocorticography is a highly in- 
vasive procedure involving surgery, a key question that 
has been addressed is whether the spectral findings 
and spatial specificity of ECoG will ever be possible 
using non-invasive extracerebrally acquired EEG, or 
whether ECoG findings can be used to develop better 
EEG-based processing methodologies to extract ECoG 
information. At the International BCI meeting in 2010 
a workshop addressed the critical questions around 
the state-of-the-art BCI signal processing, in particu- 
lar, should future BCI research emphasize a shift from 
scalp-recorded EEG to ECoG, and how are the signals 
from the two modalities related? [39.35]. 

There is still much debate around the future of EEG 
for BCI due to its limited spatial resolution and vari- 
ous noise-related issues, whereas ECoG shows much 
promise in addressing both of these issues. However, 
ECoG requires surgical implantation and the long-term 
effectiveness remains to be verified in humans. A step 
toward answering this question is to better understand 
the relationship between EEG and ECoG. In a work- 
shop summary the question was addressed by com- 
paring and contrasting the contribution of population 
synchronized (rhythmic) and asynchronous changes in 
the EEG and ECoG potential measurements [39.35]. 
The beta rhythm is robust in extracerebral EEG recordings 
and spatially synchronous across the pre- and postcentral 
gyri, so this coherent rhythm is augmented with 
respect to background spatial averaging. The different 
states of the surface rhythms may represent switching 
between the stable modes observed in on-going surface 
oscillations. 

In contrast, the broadband spectral change that ac- 
companies movement is asynchronous at the local level 
and unrelated across cortical regions, so it is distorted 
or diminished by spatial averaging. Krusienski et al. [39.35] 
compared the contribution of population synchronized 
(rhythmic) and asynchronous (broadband, 1/f) changes in 
the EEG and ECoG potential measurements using a number of simplifications 
and approximations. These approximations suggest that 
synchronized cortical oscillations may be differently re- 
flected at the EEG scale than the ECoG scale. Krusien- 
ski et al. [39.35] show that to have the same contribution 
to EEG that a single cortical column has on ECoG, the 
spatial extent of cortical activity would have to span 
nearly the full width of a gyrus, and nearly a centime- 
ter longitudinally. Based upon ECoG measurements of 
the 1/f change in the visual cortex, the findings con- 
firm the possibility of detecting 1/f change in EEG 
during visual input directly over the occipital pole as an 
event-related potential. In the pre-central motor cortex, 
the movement of several digits in concert can produce 
a widespread change, which is dramatic enough to be 
measured in the EEG; however, based on these find- 
ings, the detection of single finger digit movement in 
EEG is not possible. This finding has been supported in 
other recent studies, a number of which involved mag- 
netoencephalography (MEG). As with EEG, the mag- 
netoencephalogram (MEG) is recorded non-invasively; 
however MEG is a record of magnetic fields, measured 
outside the head, produced by electrical activity within 
the brain, whereas EEG is a measure of the electrical 
potentials. Synchronized neuronal currents, produced 
primarily by the intracellular electrical currents within 
the dendrites of pyramidal cells in the cortex, induce 
weak magnetic fields. Neuronal networks of 50000 or 
more active neurons are needed to produce a detectable 
signal using MEG. MEG has a number of advantages 
over EEG, most notably its spatial specificity as the 
magnetic flux detected at the surface of the head with 
MEG penetrates the skull and tissues without signifi- 
cant distortion, unlike the secondary volume currents 
detected outside the head with EEG [39.36]. MEG, 
however, is also less practical: it requires significant 
shielding from environmental electromagnetic interference 
and is not a wearable or mobile technology like 
EEG and, therefore, cannot be used for bedside recordings 
in a clinical setting or mobile BCI applications. 
MEG has been used in a range of clinical applica- 
tions (cf. [39.36] for a review) and for research. Below 
we describe a number of studies with a focus on motor 
cortical investigations in the context of developing 
brain-computer interfaces. 

Quandt et al. [39.30] have investigated single trial 
brain activity in MEG and EEG recordings elicited 
by finger movement on one hand. The muscle mass 
involved in finger movement is smaller than in limb 
or hand movement, and neuronal discharges of mo- 
tor cortex neurons are correspondingly smaller in fin- 
ger movement than in arm or wrist movements. This 
makes detection of finger movement more difficult 
from non-invasive recordings. Using MEG Kauhanen 
et al. [39.37, 38] showed that left and right-hand index 
finger movement can be discriminated and that single 
trial brain activity recorded non-invasively can be used 
to decode finger movement; however, there are sig- 
nificant obstacles in non-invasive recordings in terms 
of the substantial overlapping activations in M1 when 
decoding individual finger movements on the same 
hand. Miller et al. [39.39] and Wang et al. [39.40] have 
shown that real-time representation of individual fin- 
ger movements is possible using ECoG; however, fin- 
ger movement discrimination from extracerebral neural 
recordings has only recently been shown to be possi- 
ble. Quandt et al. [39.30] found using simultaneously 
recorded EEG and MEG that finger discrimination on 
the same hand is possible with MEG but EEG is not 
sufficient for robust classification. The lower spatial res- 
olution of scalp signal EEG is due to the spatial blurring 
at the interface of tissues with different conductance. 
The issue cannot be overcome by increasing the den- 
sity of EEG electrodes. It is speculated that the strong 
curvature of the cortical sheet in the finger knob (an 
omega-shaped knob of the central sulcus) contributes 
to the high decoding accuracy of MEG, whereby ori- 
entation change in the active tissue may change spatial 
patterns of magnetic flux measured in sensor space, but 
potentials caused by the same processes are not de- 
tectable at the scalp. Using different approaches four 
fingers on the same hand could be decoded with circa 
57% accuracy using MEG and across all cases MEG 
performs better than EEG (p < 0.005), whilst EEG of- 
ten only produced accuracies slightly above the upper 
confidence interval for guessing. 
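
The "upper confidence interval for guessing" can be made concrete with a binomial model of chance performance. The following is a minimal sketch under stated assumptions: the function name, the 95% level, and the trial count of 200 are hypothetical choices for illustration and are not taken from [39.30]; only the four-class setting (chance level 25%) follows the text.

```python
from math import comb

def chance_upper_bound(n_trials, n_classes, alpha=0.05):
    """Smallest accuracy that pure guessing reaches with probability <= alpha.

    Guessing is modelled as a binomial process with success probability
    1 / n_classes on each of n_trials independent trials (an assumption).
    """
    p = 1.0 / n_classes
    tail = 1.0  # P(X >= 0)
    for k in range(n_trials + 1):
        if tail <= alpha:
            return k / n_trials
        # Remove P(X == k) so that tail becomes P(X >= k + 1).
        tail -= comb(n_trials, k) * p**k * (1.0 - p) ** (n_trials - k)
    return 1.0

# Hypothetical example: four fingers (chance = 25%) and 200 test trials.
bound = chance_upper_bound(n_trials=200, n_classes=4)
print(f"guessing stays below {bound:.1%} in at least 95% of experiments")
```

An accuracy only slightly above this bound is weak evidence of decoding, whereas a value such as the circa 57% reported for MEG lies far above it for any realistic number of trials. The bound approaches the chance level as the number of trials grows.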

Analysis of the oscillations from MEG corresponds 
to ECoG studies, where the power of the lower oscillations 
(< 60 Hz) decreases around the movement, 
whereas power in the high gamma band increases, 
and the effects in the high gamma band are 
more spatially focused than in the lower frequency 
bands [39.30]. Interestingly, the discrimination accuracy 
from the band power of the most informative 
frequency band between 6 and 11 Hz was clearly in- 
ferior to the accuracy derived from the time-series data, 
indicating that slow movement-related neural activation 
modulations are most informative about which finger 
of a hand moves, and the inferior accuracy given by 
the band power is likely to be due to the lack of phase 
information contained in band-power features [39.30] 
(time embedding, temporal sequence information and 
exploiting phase information in discriminating motor 
signals are revisited in Sect. 39.9). 

The above are just a few examples of what has been 
shown not to be possible to characterize sufficiently 
in EEG. The following section provides an overview 
of the known sensorimotor phenomena detectable from 
EEG and some of the most recent advances in decoding 
hand/arm movements non-invasively. 


39.7.1 Sensorimotor Rhythms and Other Surrounding Oscillations 


There are a number of rhythms and potentials that have 
been strongly linked with motor control, many of which 
have been exploited in EEG-based non-invasive BCI 
devices. As outlined previously, the sensorimotor area 
(SMA) generates a variety of rhythms that have spe- 
cific functional and topographic properties. To reiterate, 
distinct rhythms are generated by hand movements 
over the postcentral somatosensory cortex. The μ 
(8–12 Hz) and β (13–30 Hz) bands are altered during 
sensorimotor processing [39.41–43]. Attenuation of 
the spectral power in these bands indicates an event-related 
desynchronization (ERD), whilst an increase in 
power indicates event-related synchronization (ERS). 
ERD of the μ and β bands is commonly associated 
with activated sensorimotor areas and ERS in the μ 
band is associated with idle or resting sensorimotor 
areas. ERD/ERS has been studied widely for many cog- 
nitive studies and provides very distinctive lateralized 
EEG pattern differences, which form the basis of left 
hand vs. right hand or foot MI-based BCIs [39.44, 45]. 
However, as outlined above, later studies have shown 
the actual rhythmic activity generated by the sensorimotor 
system can be much more detailed. The α or μ 
component of the SMR also has a phase-coupled second 
peak in the beta band. Both the alpha and beta 
peaks can become independent at the offset of a movement, 
after which the beta band rebounds faster and 
with higher amplitude than the alpha band. Desynchronization 
of the beta band during a motor task can occur 
in different frequency bands than the subsequent resyn- 
chronization (rebound) after the motor task [39.41]. As 
previously outlined, many studies have shown that neu- 
ral networks similar to those of executed movement are 
activated during imagery and observation of movement 
and thus similar sensorimotor rhythmic activity can be 
observed during motor imagery and execution. 
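
ERD and ERS, as defined above, are commonly quantified as the percentage change of band power in an activity interval relative to a reference (rest) interval. The sketch below illustrates this convention on synthetic data; the function names, the 250 Hz sampling rate, and the amplitudes are assumptions for illustration, and the input segments are assumed to be already band-pass filtered to the band of interest.

```python
import math

def band_power(samples):
    """Mean squared amplitude of an already band-pass filtered segment."""
    return sum(x * x for x in samples) / len(samples)

def erd_ers_percent(reference, activity):
    """Relative band-power change in percent.

    Negative values indicate ERD (power attenuation),
    positive values indicate ERS (power increase).
    """
    r = band_power(reference)
    a = band_power(activity)
    return 100.0 * (a - r) / r

fs = 250                                   # assumed sampling rate in Hz
t = [n / fs for n in range(fs)]            # one second of data
rest = [2.0 * math.sin(2 * math.pi * 10 * x) for x in t]  # 10 Hz rhythm at rest
move = [1.0 * math.sin(2 * math.pi * 10 * x) for x in t]  # attenuated during movement

change = erd_ers_percent(rest, move)
print(f"band-power change: {change:.1f}%")  # negative, i.e. an ERD
```

Halving the amplitude of the rhythm reduces its power to one quarter, so the synthetic example yields a strongly negative value, which corresponds to a desynchronization.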

Gamma oscillations of the electromagnetic field of 
the brain are known to be involved in a variety of cogni- 
tive processes and are believed to be fundamental for in- 
formation processing within the brain. Gamma oscilla- 
tions have been shown to be correlated with other brain 
rhythms at different frequencies and a recent study 
has shown the causal influences of gamma oscillation 
on sensorimotor rhythms (SMR) in healthy subjects 
using magnetoencephalography [39.46]. It has been 
shown that the modulation of sensorimotor rhythms is 
positively correlated with the power of frontal and oc- 
cipital gamma oscillations, negatively correlated with 
the power of centro-parietal gamma oscillations and 
that simple causal structure can be attributed to a causal 
relationship or influence of gamma oscillations on the 
SMR. The behavioral correlate of the topographic alter- 
ations of gamma power, a shift of gamma power from 
centro-parietal to frontal and occipital regions, remains 
elusive, although increased gamma power over frontal 
areas has been associated with selective attention in 
auditory paradigms. Grosse-Wentrup et al. [39.46] postulated 
that neurofeedback of gamma activity may be 
used to enhance BCI performance to help low aptitude 
BCI users, i.e., those who appear incapable of BCI control 
using SMR. 


39.7.2 Movement-Related Potentials 


Signals observed during and before the onset of movement 
signify motor planning and preparation. For example, 
the Bereitschaftspotential or BP (German for 
readiness potential), also called the pre-motor potential 
or readiness potential (RP), is a measure of activity in 
the motor cortex of the brain leading up to voluntary 
muscle movement [39.47, 48]. The BP is a manifestation 
of cortical contribution to the pre-motor planning 
of volitional movement. Krauledat et al. [39.49, 50] report 
on experiments carried out using the lateralized 
readiness potential (LRP) (i.e., Bereitschaftspotential) 
for brain-computer interfaces. Before accomplishing 
motor tasks, a negative readiness potential that reflects 
the preparation can be observed. They showed 
that it is possible to distinguish the pre-movement potentials 
from finger tapping experiments even before the 
onset of the movement, thus 
potentially improving accuracy and reducing latency 
in the BCI system. The BP is ten to a hundred times 
smaller than the α-rhythm of the EEG and can only be 
identified by averaging across trials. It has two components: 
an early component referred to as BP1 (sometimes 
NS1) lasting from about −1.2 to −0.5 s relative to movement 
onset (negative slope (NS) of early BP) and a late 
component (BP2 or NS2) from −0.5 s to shortly before 
0 s (steeper negative slope of late BP) [39.48, 51, 52]. 
A pre-movement positivity can be observed along with 
a motor-potential which starts about 50 to 60 ms before 
the onset of movement and has its maximum over the 
contralateral precentral hand area. 
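
The need for averaging across trials can be illustrated with a toy simulation. This is a sketch under stated assumptions, not a model of real EEG: the BP is represented by a linear negative ramp, the background by a 10 Hz rhythm ten times larger with a random phase on every trial, and all names and parameter values are hypothetical.

```python
import math
import random

FS = 100          # sampling rate in Hz (assumed)
N_SAMPLES = 150   # epoch from -1.5 s up to movement onset

def toy_bp(i, amplitude=1.0):
    """Toy readiness potential: linear negative ramp reaching -amplitude."""
    return -amplitude * i / (N_SAMPLES - 1)

def simulate_trial(rng, alpha_amplitude=10.0):
    """One epoch: toy BP plus a 10 Hz rhythm with random phase."""
    phase = rng.uniform(0.0, 2.0 * math.pi)
    return [toy_bp(i) + alpha_amplitude * math.sin(2.0 * math.pi * 10.0 * i / FS + phase)
            for i in range(N_SAMPLES)]

def average_trials(trials):
    """Sample-by-sample mean across trials."""
    return [sum(samples) / len(trials) for samples in zip(*trials)]

def rms_error(waveform):
    """Root-mean-square deviation of a waveform from the toy BP template."""
    return math.sqrt(sum((x - toy_bp(i)) ** 2 for i, x in enumerate(waveform)) / N_SAMPLES)

rng = random.Random(0)
trials = [simulate_trial(rng) for _ in range(400)]
single_error = rms_error(trials[0])
averaged_error = rms_error(average_trials(trials))
print(f"single trial: {single_error:.2f}, average of 400 trials: {averaged_error:.2f}")
```

Because the background rhythm is not phase-locked to movement onset, it cancels in the average while the time-locked ramp is preserved; for uncorrelated background activity the residual shrinks roughly with the square root of the number of trials.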


39.7.3 Decoding Hand Movements from EEG 


Movement-related cortical potentials (MRCP) have 
been used as control signals for BCIs [39.53]. MRCP 
and SMR have distinct changes during execution or 
imagination of voluntary movements. MRCP is con- 
sidered a slow cortical potential where the surface 
negativity which develops 2 s before the movement onset 
is the Bereitschaftspotential referred to above. Gu 
et al. [39.53] studied MRCP and SMRs in the con- 
text of discriminating the movement type and speed 
from the same limb based on the hypothesis that if 
the imagined movements are related to the same limb, 
the control could be more natural than associating 
commands to movements of different limbs for BCIs. 
They focused on fast and slow wrist rotation and extension 
and found that average MRCPs rebounded 
more strongly when fast-speed movements were imag- 
ined compared with slow-speed movements; however, 
the rebound rate of MRCP was not substantially differ- 
ent between movement types. The peak negativity was 
more pronounced in the frontal (Fz) and central region 
(C1) than in the occipital region (Pz). The rebound rate 
of MRCP was greater in the central region (C1) when 
compared to the occipital region (Pz). MRCP and SMR 
are independent of each other as they originate from 
different brain sources and they occupy different frequency 
bands [39.52–54]. This renders them useful for 
multi-dimensional control in BCIs. 

In accordance with the analysis of averaged 
MRCPs, the single-trial classification rate between two 
movements performed at the same speed was lower 
than when combining movements at different speeds. 
Gu et al. [39.53] suggest that selecting different speeds 
rather than different movements when these are executed 
at the same joint may be best for BCI applications. 
However, the task pair that was optimal in terms 
of classification accuracy is subject-dependent, and thus 
a subject-specific evaluation of the task pair should be 
conducted. The study by Gu et al. [39.53] is important 
as it is one of a limited number of studies that focus on 
discriminating different movements of the same limb as 
opposed to moving different limbs from EEG, which is 
much more common practice in BCI designs. However, 
Lakany and Conway [39.55] investigated the difference 
between imagined and executed wrist movements in 20 
different directions using machine learning and found 
that the accuracy of discriminating wrist movement 
imagination is much less than for actual movement; 
however, they later found [39.56] that time-frequency (t-f) EEG 
features modulated by force direction in arm isometric 
exertions to four different directions in the horizontal 
plane can give better directional discrimination information 
and that t-f features from the planning and 
execution phase may be most appropriate. 

Although a limited number of works demonstrating 
EEG-based 2-D and 3-D continuous control of a cur- 
sor through biofeedback have been reported [39.57, 
58] along with a few studies of classification of the 
direction/speed of 2-D hand/wrist movements outlined 
above, there are very few studies that have demon- 
strated continuous decoding of hand kinematics from 
EEG. Classification of different motor imagery tasks 
on single trial basis is more commonly reported. The 
signal-to-noise ratio, the bandwidth, and the informa- 
tion content of electroencephalography are generally 
thought to be insufficient to extract detailed infor- 
mation about natural, multi-joint movements of the 
upper limb. However, evidence from a study by Brad- 
berry et al. [39.59] investigating whether the kinematics 
of natural hand movements are decodable from EEG 
challenges this assumption. They continuously extracted 
hand velocity from signals collected during a three-dimensional 
(3-D) center-out reaching task and found 
that a linear EEG decoding model could reconstruct 3-D 
hand-velocity profiles reasonably well and that sensor 
CP3, which lies roughly above the primary sensorimotor 
cortex contralateral to the reaching hand, made 
the greatest contribution. Using a time-lagged approach, 
they found that EEG data from 60 ms in the past supplied 
the most information, with 16.0% of the total 
contribution, suggesting that a linear decoding method such 
as the one used [39.59] relies on a sub-second history 
of neural data to reconstruct hand kinematics. Using 
a source localization technique they found that the pri- 
mary sensorimotor cortex (pre-central gyrus and post- 
central gyrus) was indeed a major contributor along 
with the inferior parietal lobule (IPL), all of which 
have been found to be activated during motor execu- 
tion and imagery in other investigations [39.12, 13, 16]. 
Bradberry et al. [39.59] also found that the movement 
variability is negatively correlated with decoding accuracy, 
suggesting two reasons: 1) increased movement 
variability could degrade decoding accuracy due to less 
similar pairs of EEG-kinematic exemplars, i.e., less 
movement variability results in reduced intra-class vari- 
ability for training, and 2) subjects differ in their ability 
to perform the task without practice (motor learning 
is important for improving predictions of movement). 
Hence, the strengths of a priori neural representations of 
the required movements vary until learned or practiced, 
and these differences could directly relate to the accu- 
racy with which the representations can be extracted. 
This study provides important evidence that decodable 
information about detailed, complex hand movements 
can be derived from the brain non-invasively using 
EEG; however, it remains to be determined whether 
these findings are consistent when using the same 
methodology in an imagined 3-D center-out task. 
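
A generic time-lagged linear decoder of the kind discussed above can be written as follows. The notation is introduced here for illustration only and is an assumption, not necessarily the exact formulation of [39.59]: v_d is the reconstructed hand velocity along dimension d (x, y, or z), S_n is the EEG signal at sensor n, Δ is the sampling interval, L is the number of time lags, and the weights a_d and b_{d,n,k} are fitted to training data, for example by least squares.

```latex
% Generic time-lagged linear decoding model (illustrative notation)
\hat{v}_d(t) = a_d + \sum_{n=1}^{N} \sum_{k=0}^{L} b_{d,n,k}\, S_n(t - k\Delta),
\qquad d \in \{x, y, z\}

% One possible (assumed) measure of the percentage contribution of lag k
c_k = 100 \times
  \frac{\sum_{d} \sum_{n=1}^{N} \lvert b_{d,n,k} \rvert}
       {\sum_{j=0}^{L} \sum_{d} \sum_{n=1}^{N} \lvert b_{d,n,j} \rvert}
```

Under such a measure, a statement that the 60 ms lag supplies the largest share of the total contribution means that c_k peaks at the lag k for which kΔ = 60 ms, and the sum over L bounds the length of neural history the decoder can use.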

Although we know a lot about brain structure asso- 
ciated with sensorimotor activity, as well as the rhythms 
and potentials surrounding this activity, we have not 
yet systematically linked the neural correlates of these 
to specific motion primitives or motor control models. 
Modeling the findings in relation to motor cortical structure, 
function, and dynamics using biologically plausible neural 
models, along with linkage to the underlying motor 
psychophysics and advanced signal processing in 
BCI, may help advance our knowledge of motion primitives, 
sensorimotor learning, and control. 


39.8 Extracellular Recording - Decoding Hand Movements from Spikes and Local Field Potential 


Although fMRI, MEG, and EEG offer low risk, non-surgical 
recording procedures, they have inherent limitations 
which many expect can be overcome with invasive 
approaches such as ECoG (described in Sect. 39.6) and 
by implanting electrodes to record the electrical activity 
of single neurons extracellularly (single unit record- 
ings). Here we focus on some recent studies aimed 
at testing this scale for use in sensorimotor-related 
BCIs. 

Extracellular recording has many advantages, including 
high signal amplitude (up to 500 µV), low 
susceptibility to external noise and artefact (eye move- 
ments, cardiac activity, muscle activity) leading to high 
signal-to-noise ratio, high spatial resolution (50 µm), 
high temporal resolution (~ 1 ms), and high spectral 
content (up to 2 kHz) due to the close vicinity to 
the electric source. As a consequence, there is a high 
correlation between the neural signals recorded and 
the generated/imagined hand movements, resulting in 
a short learning duration when employed in a mo- 
tor BCI system. The disadvantages of the invasive 
recording technique include a complex and expensive 
medical procedure, susceptibility to infections (possi- 
bly leading to meningitis, epilepsy), pain, prolonged 
hospitalization, direct damage to the neural tissue (e.g., 
a flat 15 µm × 60 µm electrode penetrating 2 mm deep 
hits, on average, 5 neurons and 40000 synapses), indirect 
damage to the neural tissue (small blood vessels 
are hit by the electrode, causing ischemia for distant 
neurons and synapses and the development of an inflammatory 
response), and the development of scar tissue, 
which electrically isolates the electrodes from their 
surroundings and renders the system non-responsive after 
being implanted for extended durations. Furthermore, 
the electrode material itself, however biocompatible, 
sooner or later causes an inflammatory reaction and the 
development of scar tissue. 

Theoretically, however, extracellular recordings of- 
fer more accurate information that may enable us to 
devise realistic BCI systems that allow for additional 
degrees of freedom and natural control of prosthetic 
devices, such as hand and arm prostheses. To this 
end, substantial efforts have been put into devising 
novel biocompatible electrodes (e.g., platinum, irid- 
ium oxide, carbonic polymers) that will delay immune 
system stimulation, devising multi-functional micro- 
electrodes that allow for recording/stimulating while 
injecting anti-inflammatory agents to suppress inflam- 
matory response, devising hybrid microelectrodes that 
allow for the inclusion of pre-amplifier and multi- 
plexer on the electrode chip to allow wireless trans- 
mission of the data, thus avoiding the necessity for 
a scalp drill hole used for taking out the flat cable 
carrying the neural data, which is prone to causing 
infections. 


Extracellular recording, being the most invasive 
recording technique (compared to non-invasive EEG 
and MEG recording and partially invasive ECoG 
recording), allows recording of both the high-frequency 
neural output activity, i.e., spikes, and the low-frequency 
neural input activity, denoted as the local 
field potential (LFP), which is the voltage caused 
by electrical current flowing from all nearby dendritic 
synaptic activity across the resistance of the local extracellular 
space. In the following section, neural 
coding schemes in general, and the cortical correlates 
of kinematic and dynamic motion attributes in particular, 
are presented along with their suggested use for current 
and future BCI systems. 


39.8.1 Neural Coding Schemes 


A sequence, or train, of spikes may contain information 
based on different coding schemes. In motor neurons, 
for example, the strength at which an innervated muscle 
is flexed depends solely on the firing rate, the average 
number of spikes per unit time (a rate code). At the 
other end, a complex temporal code is based on the pre- 
cise timing of single spikes. They may be locked to an 
external stimulus such as in the auditory system or be 
generated intrinsically by the neural circuitry. Whether 
neurons use rate coding or temporal coding is a topic 
of intense debate within the neuroscience community, 
even though there is no clear definition of what these 
terms mean. Neural coding schemes include rate coding, 
spike-count rate, time-dependent firing rate, temporal coding, 
and population coding. 


Rate coding 
Rate coding is a traditional coding scheme, assuming 
that most, if not all, information about the stimulus is 
contained in the firing rate of the neuron. The concept 
of firing rates has been successfully applied during the 
last 80 years. It dates back to the pioneering work of 
Adrian and Zotterman who showed that the firing rate 
of stretch receptor neurons in the muscles is related to 
the force applied to the muscle [39.60]. In the following 
decades, measurement of firing rates became a standard 
tool for describing the properties of all types of sensory 
or cortical neurons, partly due to the relative ease of 
measuring rates experimentally. 

Because the sequence of action potentials gener- 
ated by a given stimulus varies from trial to trial, 
neuronal responses are typically treated statistically or 
probabilistically. They may be characterized by firing 
rates, rather than as specific spike sequences. In most 
sensory systems, the firing rate increases, generally 
non-linearly, with increasing stimulus intensity. Any in- 
formation possibly encoded in the temporal structure of 
the spike train is ignored. Consequently, rate coding is 
inefficient but highly robust with respect to interspike 
interval (ISI) noise. In recent years, a growing body 
of experimental evidence has suggested that 
a straightforward firing rate concept based on temporal 
averaging may be too simplistic to describe brain 
activity [39.61]. In rate coding, learning is based on 
activity-dependent synaptic weight modifications. 


Spike-count rate 
Spike-count rate, also referred to as temporal average, 
is obtained by counting the number of spikes that ap- 
pear during a trial and dividing by the duration of the 
trial. The length T of the time window is set by the ex- 
perimenter and depends on the type of neuron recorded 
from and the stimulus. In practice, to obtain sensible 
averages, several spikes should occur within the time 
window. Typical values are T = 100 ms or T = 500 ms, 
but the duration may also be longer or shorter. 

The spike-count rate can be determined from a sin- 
gle trial, but at the expense of losing all temporal 
resolution about variations in neural response during 
the course of the trial. Temporal averaging can work 
well in cases where the stimulus is constant or slowly 
varying and does not require a fast reaction of the or- 
ganism — and this is the situation usually encountered 
in experimental protocols. Real-world input, however, 
is hardly stationary, but often changing on a fast time 
scale. For example, even when viewing a static im- 
age, humans perform saccades, rapid changes of the 
direction of gaze. The image projected onto the retinal 
photoreceptors changes, therefore, every few hundred 
milliseconds. Despite its shortcomings, the concept of 
a spike-count rate code is widely used not only in exper- 
iments, but also in models of neural networks. It has led 
to the idea that a neuron transforms information about 
a single input variable (the stimulus strength) into a sin- 
gle continuous output variable (the firing rate). 
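
The spike-count rate described above is simply the number of spikes in the window divided by the window length T. The following minimal sketch uses hypothetical spike times (in seconds) and a window of T = 0.5 s to show both the computation and the loss of temporal resolution; the function and variable names are illustrative assumptions.

```python
def spike_count_rate(spike_times, t_start, t_stop):
    """Spike-count rate in spikes per second over the window [t_start, t_stop)."""
    count = sum(1 for s in spike_times if t_start <= s < t_stop)
    return count / (t_stop - t_start)

# Two hypothetical spike trains (times in seconds) within a T = 0.5 s window.
burst_then_silence = [0.01, 0.02, 0.03, 0.04, 0.05]
evenly_spread = [0.05, 0.15, 0.25, 0.35, 0.45]

rate_a = spike_count_rate(burst_then_silence, 0.0, 0.5)
rate_b = spike_count_rate(evenly_spread, 0.0, 0.5)
print(rate_a, rate_b)  # identical rates despite very different timing
```

Both trains contain five spikes in the window and therefore yield the same rate, although one is an early burst and the other is regular. This is the sense in which temporal averaging discards variations of the response within the trial.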


Time-dependent firing rate 
Time-dependent firing rate is defined as the average 
number of spikes (averaged over trials) appearing during 
a short interval between times t and t + Δt, divided 
by the duration of the interval. It works for stationary as 
well as for time-dependent stimuli. To experimentally 
measure the time-dependent firing rate, the experimenter 
records from a neuron while stimulating with 
some input sequence. The same stimulation sequence is 
repeated several times and the neuronal response is reported 
in a peri-stimulus-time histogram (PSTH). The 
time t is measured with respect to the start of the stimulation 
sequence. The interval Δt must be large enough (typically 
in the range of 1 or a few milliseconds) so there is 
a sufficient number of spikes within the interval to obtain 
a reliable estimate of the average. The number of 
occurrences of spikes n_K(t; t + Δt) summed over all repetitions 
of the experiment divided by the number K of 
repetitions is a measure of the typical activity of the 
neuron between time t and t + Δt. A further division 
by the interval length Δt yields the time-dependent firing 
rate r(t) of the neuron, which is equivalent to the 
spike density of the PSTH. 

For sufficiently small Δt, r(t) Δt is the average number 
of spikes occurring between times t and t + Δt 
over multiple trials. If Δt is small, there will never 
be more than one spike within the interval between t 
and t + Δt on any given trial. This means that r(t) Δt 
is also the fraction of trials on which a spike occurred 
between those times. Equivalently, r(t) Δt is the probability 
that a spike occurs during this time interval. As 
an experimental procedure, the time-dependent firing 
rate measure is a useful method to evaluate neuronal 
activity, in particular in the case of time-dependent 
stimuli. The obvious problem with this approach is that 
it cannot be the coding scheme used by neurons in the 
brain. Neurons cannot wait for the stimulus to be presented 
repeatedly in exactly the same manner 
before generating a response. Nevertheless, the experimental 
time-dependent firing rate measure makes 
sense, if there are large populations of independent neu- 
rons that receive the same stimulus. Instead of recording 
from a population of N neurons in a single run, it is ex- 
perimentally easier to record from a single neuron and 
average over N repeated runs. Thus, the time-dependent 
firing rate coding relies on the implicit assumption that 
there are always populations of neurons. 
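
The PSTH-based estimate described in this section, the spike count per bin summed over K repetitions and divided by K and by the bin width, can be sketched as follows. The spike times, the bin width of 5 ms, and the function name are hypothetical; times are given as integer milliseconds to keep the binning exact.

```python
def psth_rate(trials_ms, t_stop_ms, dt_ms):
    """Time-dependent firing rate r(t) in spikes per second.

    trials_ms: list of K spike-time lists (integer ms after stimulus onset)
    t_stop_ms: end of the analysis window in ms
    dt_ms:     bin width in ms
    """
    n_bins = t_stop_ms // dt_ms
    counts = [0] * n_bins
    for spikes in trials_ms:
        for s in spikes:
            b = s // dt_ms
            if 0 <= b < n_bins:
                counts[b] += 1          # spikes in bin b summed over all trials
    k = len(trials_ms)
    dt_s = dt_ms / 1000.0
    return [c / (k * dt_s) for c in counts]

# Three hypothetical repetitions of the same stimulus, 5 ms bins, 15 ms window.
trials = [[1, 6, 13], [7], [8, 14]]
rates = psth_rate(trials, t_stop_ms=15, dt_ms=5)
print(rates)
```

In this example no trial has more than one spike per bin, so multiplying each rate by the bin width (0.005 s) gives the fraction of trials with a spike in that bin: one third, all, and two thirds of the trials, respectively.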


Temporal coding 
When precise spike timing or high-frequency firing-rate 
fluctuations are found to carry information, the neural 
code is often identified as a temporal code. A number 
of studies have found that the temporal resolution of the 
neural code is on a millisecond time scale, indicating 
that precise spike timing is a significant element in neu- 
ral coding [39.62]. Temporal codes employ those fea- 
tures of the spiking activity that cannot be described by 
the firing rate. For example, the time to first spike after 
the stimulus onset, characteristics based on the second 
and higher statistical moments of the ISI probability distribution, 
spike randomness, or precisely timed groups 
of spikes (temporal patterns) are candidates for temporal 
codes. As there is no absolute time reference in the ner- 
vous system, the information is carried either in terms of 
the relative timing of spikes in a population of neurons 
or with respect to an ongoing brain oscillation. 
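
Two of the candidate temporal-code features named above, the time to first spike and a second-moment statistic of the ISI distribution, can be computed as in the following sketch. The coefficient of variation is used here as one common example of such a statistic; the spike times (in ms) and function names are hypothetical, and spike times are assumed to be sorted.

```python
from statistics import mean, pstdev

def first_spike_latency(spike_times, stimulus_onset):
    """Time from stimulus onset to the first spike at or after onset."""
    latencies = [s - stimulus_onset for s in spike_times if s >= stimulus_onset]
    return min(latencies) if latencies else None

def isi_cv(spike_times):
    """Coefficient of variation (std / mean) of the interspike intervals."""
    isis = [b - a for a, b in zip(spike_times, spike_times[1:])]
    return pstdev(isis) / mean(isis)

# Two hypothetical trains with five spikes each in a 100 ms window (times in ms).
regular = [10, 30, 50, 70, 90]
irregular = [2, 4, 40, 44, 90]

print(first_spike_latency(regular, 0), isi_cv(regular))
print(first_spike_latency(irregular, 0), isi_cv(irregular))
```

Both trains have the same spike count, and hence the same spike-count rate, but they differ in first-spike latency and in ISI variability. These are exactly the aspects of the response that a firing rate cannot describe.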

The temporal structure of a spike train or firing rate 
evoked by a stimulus is determined both by the dy- 
namics of the stimulus and by the nature of the neural 
encoding process. Stimuli that change rapidly tend to 
generate precisely timed spikes and rapidly changing 
firing rates no matter what neural coding strategy is 
being used. Temporal coding refers to temporal preci- 
sion in the response that does not arise solely from the 
dynamics of the stimulus, but that nevertheless relates 
to properties of the stimulus. The interplay between 
stimulus and encoding dynamics makes the identifica- 
tion of a temporal code difficult. The issue of temporal 
coding is distinct and independent from the issue of 
independent-spike coding. If each spike is independent 
of all the other spikes in the train, the temporal charac- 
ter of the neural code is determined by the behavior of 
the time-dependent firing rate r(t). If r(t) varies slowly 
with time, the code is typically called a rate code, and 
if it varies rapidly, the code is called temporal. 


Population coding 
Population coding is a method to represent stimuli by 
using the joint activities of a number of neurons. In 
population coding, each neuron has a distribution of 
responses over some set of inputs, and the responses 
of many neurons may be combined to determine some 
value about the inputs. 

Currently, BCI and BMI (brain machine interface)
systems rely mostly on population coding. One of the
most famous population codes, the motor population
vector, is described along with its use in current and
future BCI and BMI systems in Sect. 39.8.2.


39.8.2 Single Unit Activity Correlates 
of Hand Motion Attributes 


In 1982, Georgopoulos et al. [39.63] found that the ac- 
tivity of single cells in the motor cortex of monkeys, 
who were making arm movements in eight directions 
(at 45° intervals) in a two-dimensional apparatus, var- 
ied in an orderly fashion with the direction of the move- 
ment. Discharge was most intense with movements in 
a preferred direction and was reduced gradually when 
movements were made in directions farther and farther
away from the preferred movement. This resulted in
a bell-shaped directional tuning curve. These relations 
were observed for cell discharge during the reaction 
time, the movement time, and the period that preceded
the earliest changes in the electromyographic activity
(approximately 80 ms before movement onset).
Electromyography (EMG) is a technique for evaluating and
recording the electrical activity produced by skeletal
muscles. In about 75% of the 241 directionally tuned
cells, the frequency of discharge D was a sinusoidal 
function of the direction of movement 


$$D = b_0 + b_1 \sin\theta + b_2 \cos\theta , \tag{39.3}$$

or, in terms of the preferred direction $\theta_0$,

$$D = b_0 + c_1 \cos(\theta - \theta_0) , \tag{39.4}$$

where $\theta$ is the direction of movement and $b_0$, $b_1$,
$b_2$, and $c_1$ are regression coefficients. Preferred
directions differed for different cells so that the
tuning curves partially overlapped. The orderly varia- 
tion of cell discharge with the direction of movement 
and the fact that cells related to only one of the eight 
directions of movement tested were rarely observed, 
indicated that movements in a particular direction are 
not subserved by motor cortical cells uniquely re- 
lated to that movement. It was suggested, instead, that 
a movement trajectory in a desired direction might be 
generated by the cooperation of cells with overlapping 
tuning curves. The orderly variation in the frequency of 
discharge of a motor cortical cell with the direction of 
movement is shown in Fig. 39.8. 
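
The link between (39.3) and (39.4) follows from expanding the cosine: $b_1 = c_1 \sin\theta_0$ and $b_2 = c_1 \cos\theta_0$, so $c_1 = \sqrt{b_1^2 + b_2^2}$ and $\theta_0 = \operatorname{atan2}(b_1, b_2)$. The sketch below fits (39.3) and converts the result to the form of (39.4). It assumes evenly spaced directions covering the full circle, for which the least-squares solution has a simple closed form. The function names are invented, and the synthetic rates are generated from the regression values quoted in the caption of Fig. 39.8.

```python
import math


def fit_cosine_tuning(directions_deg, rates):
    """Least-squares fit of D = b0 + b1*sin(theta) + b2*cos(theta).

    The closed form is valid only for evenly spaced directions covering
    the full circle (e.g., eight directions at 45 degree intervals).
    """
    n = len(rates)
    thetas = [math.radians(d) for d in directions_deg]
    b0 = sum(rates) / n
    b1 = 2.0 / n * sum(r * math.sin(t) for r, t in zip(rates, thetas))
    b2 = 2.0 / n * sum(r * math.cos(t) for r, t in zip(rates, thetas))
    return b0, b1, b2


def to_preferred_direction(b1, b2):
    """Convert (b1, b2) of (39.3) into (c1, theta0 in degrees) of (39.4)."""
    c1 = math.hypot(b1, b2)
    theta0 = math.degrees(math.atan2(b1, b2)) % 360.0
    return c1, theta0


# Noise-free rates for eight directions, built from the Fig. 39.8 coefficients.
directions = [k * 45.0 for k in range(8)]
rates = [32.37 + 7.281 * math.sin(math.radians(d))
         - 21.343 * math.cos(math.radians(d)) for d in directions]
b0, b1, b2 = fit_cosine_tuning(directions, rates)
c1, theta0 = to_preferred_direction(b1, b2)
```

With real, noisy discharge rates the same fit applies; for unevenly spaced directions a general linear regression would be needed instead of the closed form.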

Later on, Amirikian et al. systematically examined 
the variation in the shape of the directional tuning pro- 
files among a population of cells recorded from the arm 
area of the motor cortex of monkeys using movements 
in 20 directions, every 18° [39.64]. This allowed the 
investigation of tuning functions with extra parameters 
to capture additional features of the tuning curve (i. e., 
tuning breadth, symmetry, and modality) and determine 
an optimal tuning function. It was concluded that mo- 
tor cortical cells are more sharply tuned than previously 
thought. 

Paninski et al. [39.65] used a pursuit-tracking task
(PTT), in which a monkey had to continuously track
a randomly moving visual stimulus (thus providing
a broad sample of velocity and position space), together
with invasive recordings from the M1 region. They showed
that there is heterogeneity of position and velocity
coding in that region, with markedly different temporal
dynamics for each: velocity-tuned neurons were
approximately sinusoidally tuned for direction, with
linear speed scaling; other cells showed sinusoidal
tuning for position, with linear scaling by distance.
Velocity encoding led behavior by about 100 ms for most
cells, whereas position tuning was more broadly
distributed, with leads and lags suggestive of both
feed-forward and feedback coding. Linear regression
methods confirmed that random, 2-D hand trajectories can
be reconstructed from the firing of small ensembles of
randomly selected neurons (3-19 cells) within the M1 arm
area. These findings demonstrate that M1 carries
information about evolving hand trajectory during
visually guided pursuit tracking, including information
about arm position both during and after its
specification.

Fig. 39.8 Orderly variation in the frequency of
discharge of a motor cortical cell with the direction of
movement. Upper half: rasters are oriented to the
movement onset M and show impulse activity during five
repetitions of movements made in each of the eight
directions indicated by the center diagram. Notice the
orderly variation in the cell’s activity during the RT
(reaction time), MOT (movement time), and TET (total
experiment time; TET = RT + MOT). Lower half: directional
tuning curve of the same cell. The discharge frequency is
for TET. The data points are mean ± SEM. The regression
equation for the fitted sinusoidal curve is
$D = 32.37 + 7.281 \sin\theta - 21.343 \cos\theta$, where
$D$ is the frequency of discharge and $\theta$ is the
direction of movement, or, equivalently,
$D = 32.37 + 22.5 \cos(\theta - \theta_0)$, where
$\theta_0$ is the preferred direction
($\theta_0 = 161°$) (after [39.63], courtesy of
A.P. Georgopoulos)

Georgopoulos et al. formulated a population vector
hypothesis to explain how populations of motor cortex
neurons encode movement direction [39.66]. In the
population vector model, individual neurons vote for
their preferred directions using their firing rate. The
final vote is calculated by vectorial summation of
individual preferred directions weighted by neuronal
rates. This model proved to be successful in describing
motor-cortex encoding of 2-D and 3-D reach directions,
and was also capable of predicting new effects, e.g.,
accurately describing mental rotations made by the
monkeys that were trained to translate locations of
visual stimuli into spatially shifted locations of reach
targets [39.67, 68].
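
The voting scheme can be illustrated with a short sketch. It assumes a hypothetical population of 16 cosine-tuned cells with evenly spaced preferred directions, and it weights each cell by its rate minus a baseline. The baseline subtraction, the tuning parameters, and all names are assumptions made for the illustration, not details taken from the studies cited above.

```python
import math


def population_vector(preferred_deg, rates, baselines):
    """Sum each cell's preferred-direction unit vector, weighted by its
    baseline-subtracted rate, and return the resultant direction in degrees."""
    x = sum((r - b) * math.cos(math.radians(p))
            for p, r, b in zip(preferred_deg, rates, baselines))
    y = sum((r - b) * math.sin(math.radians(p))
            for p, r, b in zip(preferred_deg, rates, baselines))
    return math.degrees(math.atan2(y, x)) % 360.0


# Hypothetical population with made-up tuning parameters.
preferred = [i * 22.5 for i in range(16)]
b0, c1 = 30.0, 20.0
movement = 70.0
rates = [b0 + c1 * math.cos(math.radians(movement - p)) for p in preferred]
decoded = population_vector(preferred, rates, [b0] * len(preferred))
```

For noise-free cosine tuning and uniformly distributed preferred directions the resultant points exactly along the movement direction; with noisy rates or a non-uniform distribution the estimate becomes approximate.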

The population vector study actually divided the
field of motor physiologists between Evarts’ upper motor
neuron group, which followed the hypothesis that
motor cortex neurons contributed to control of single
muscles [39.69], and the Georgopoulos group studying
the representation of movement directions in the cortex.
From the theoretical point of view, population coding
is one of a few mathematically well-formulated problems
in neuroscience. It grasps the essential features of
neural coding and yet is simple enough for theoretical
analysis. Experimental studies have revealed that this
coding paradigm is widely used in the sensory and motor
areas of the brain. For example, in the middle temporal
(MT) visual area, neurons are tuned to the direction of
movement. In response to an object moving in a particular
direction, many neurons in MT fire, with a noise-corrupted
and bell-shaped activity pattern across the population.
The direction of the moving object is retrieved from the
population activity, which makes the estimate robust to
the fluctuations in any single neuron’s signal.

Population coding has a number of advantages, 
including reduction of uncertainty due to neuronal 
variability and the ability to represent a number of 
different stimulus attributes simultaneously. Population 
coding is also much faster than rate coding and can 
reflect changes in the stimulus conditions nearly in- 
stantaneously. Individual neurons in such a population 
typically have different but overlapping selectivities, 
so that many neurons, but not necessarily all, respond 
to a given stimulus. The Georgopoulos vector coding 
is an example of simple averaging. A more sophis- 
ticated mathematical technique for performing such 
a reconstruction is the method of maximum likelihood 
based on a multi-variate distribution of the neuronal 
responses. These models can assume independence, 
second-order correlations [39.70], or even more de- 
tailed dependencies such as higher-order maximum 
entropy models [39.71].
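
As a rough sketch of the maximum likelihood alternative, the example below assumes the simplest of the models mentioned: independent neurons with Poisson spike counts and cosine tuning curves. The tuning parameters, the 10 s counting window, and the grid search are all assumptions chosen to keep the example short; correlated or maximum entropy models would need a different likelihood.

```python
import math


def tuning(theta_deg, preferred_deg, b0=30.0, c1=20.0):
    """Cosine tuning curve in spikes/s; b0 and c1 are made-up values."""
    return b0 + c1 * math.cos(math.radians(theta_deg - preferred_deg))


def ml_decode(counts, preferred_deg, window, step=1.0):
    """Grid search for the direction maximizing the independent-Poisson
    log-likelihood of the observed spike counts."""
    best_theta, best_ll = None, -math.inf
    theta = 0.0
    while theta < 360.0:
        ll = 0.0
        for n, p in zip(counts, preferred_deg):
            lam = tuning(theta, p) * window   # expected count in the window
            ll += n * math.log(lam) - lam     # log(n!) omitted: constant in theta
        if ll > best_ll:
            best_theta, best_ll = theta, ll
        theta += step
    return best_theta


# Hypothetical population and idealized (rounded expected) counts.
preferred = [i * 22.5 for i in range(16)]
counts = [round(tuning(70.0, p) * 10.0) for p in preferred]
estimate = ml_decode(counts, preferred, window=10.0)
```

Unlike simple vector averaging, this estimator uses the assumed noise model explicitly, which is why it can be extended to the correlated cases cited above.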

The finding that arm movement is well represented 
in populations of neurons recorded from the motor cor- 
tex has resulted in a rapid advancement in extracellular 
recording-based BCI in non-human primates and in 
a limited number of human studies. Several groups have 
been able to capture complex brain motor cortex signals 
by recording from neural ensembles (groups of neu- 
rons) and using these to control external devices. First, 
cortical activity patterns have been used in BCIs to
show how cursors on computer displays can be moved
in two- and three-dimensional space. It was later
realized that the ability to move a cursor can be useful in
its own right and that this technology could be applied 
to restore arm and hand function for amputees and the 
physically impaired. 

Miguel Nicolelis has been a prominent proponent 
of using multiple electrodes spread over a greater area 
of the brain to obtain neuronal signals to drive a BCI. 
Such neural ensembles are said to reduce the variability 
in output produced by single electrodes, which could 
make it difficult to operate a BCI. After conducting 
initial studies in rats during the 1990s, Nicolelis et al. 
succeeded in building a BCI that reproduced owl mon- 
key movements while the monkey operated a joystick 
or reached for food [39.72]. The BCI operated in real 
time and could also control a separate robot remotely 
over internet protocol. However, the monkeys could not
see the arm moving and did not receive any feedback;
this arrangement is a so-called open-loop BCI.

Other laboratories which have developed BCIs and 
algorithms that decode neuron signals include those run 
by John Donoghue, Andrew Schwartz, and Richard An- 
dersen. These researchers have been able to produce 
working BCIs, even using recorded signals from far
fewer neurons than Nicolelis used (15-30 neurons versus
50-200 neurons). Donoghue et al. reported training
rhesus monkeys to use a BCI to track visual targets on 
a computer screen (closed-loop BCI) with or without 
the assistance of a joystick [39.73]. 

Later experiments by Nicolelis using rhesus mon- 
keys succeeded in closing the feedback loop and re- 
produced monkey reaching and grasping movements 
in a robot arm. With their deeply cleft and furrowed 
brains, rhesus monkeys are considered to be better mod- 
els for human neurophysiology than owl monkeys. The 
monkeys were trained to reach and grasp objects on 
a computer screen by manipulating a joystick, while 
corresponding movements by a robot arm were hid- 
den [39.74, 75]. The monkeys were later shown the 
robot directly and learned to control it by viewing its 
movements. The BCI used velocity predictions to con- 
trol reaching movements and simultaneously predicted 
hand gripping force. 

The use of cortical signals to control a multi- 
jointed prosthetic device for direct real-time interaction 
with the physical environment (embodiment) was first 
demonstrated by Schwartz et al. [39.76]. Schwartz et al. 
implanted 96 intracortical microelectrodes in the proxi- 
mal arm region of the primary motor cortex of monkeys 
(Macaca mulatta) and used their motor cortical activ- 
ity to control a mechanized arm replica and control 
a gripper on the end of the arm. The monkey could feed 
itself pieces of fruit and marshmallows using a robotic 
arm controlled by the animal’s own brain signals. Ow- 
ing to the physical interaction between the monkey, the 
robotic arm, and the objects in the workspace, this new 
task presented a higher level of difficulty than previous 
virtual (cursor control) experiments. 

In 2012 Schwartz et al. [39.68] showed that a 52- 
year-old individual with tetraplegia who was implanted 
with two 96-channel intracortical microelectrodes in the 
motor cortex could rapidly achieve neurological control 
of an anthropomorphic prosthetic limb with seven de- 
grees of freedom (three-dimensional translation, three- 
dimensional orientation, one-dimensional grasping). 
The participant was able to move the prosthetic limb 
freely in the three-dimensional workspace on the second
day of training. After 13 weeks, robust seven-dimensional
movements were performed routinely. The
participant was also able to use the prosthetic limb to 
do skillful and coordinated reach and grasp movements 
that resulted in clinically significant gains in tests of up- 
per limb function. No adverse events were reported. 

In addition to predicting kinematic and kinetic pa- 
rameters of limb movements, BCIs that predict elec- 
tromyographic or electrical activity of the muscles of 
primates are being developed [39.77]. Such BCIs may 
be used to restore mobility in paralyzed limbs by 
electrically stimulating muscles. Miguel Nicolelis and 
colleagues demonstrated that the activity of large neu- 
ral ensembles can predict arm position. This work made 
possible the creation of BCIs that read arm movement 
intentions and translate them into movements of artifi- 
cial actuators. Carmena et al. [39.74] programmed the 
neural coding in a BCI that allowed a monkey to con- 
trol reaching and grasping movements by a robotic arm. 
Lebedev et al. [39.75] argued that brain networks re- 
organize to create a new representation of the robotic 
appendage in addition to the representation of the ani- 
mal’s own limbs. 


The biggest impediment to BCI technology at 
present is the lack of a sensor modality that provides 
safe, accurate, and robust access to brain signals. It is 
conceivable or even likely, however, that such a sen- 
sor will be developed within the next 20 years. The 
use of such a sensor should greatly expand the range 
of communication functions that can be provided using
a BCI.

To conclude, this demonstration of multi-degree-of- 
freedom embodied prosthetic control paves the way to- 
wards the development of dexterous prosthetic devices 
that could ultimately achieve arm and hand function at 
a near-natural level. 


39.8.3 Local Field Potential Correlates 
of Hand Motion Attributes 


Local field potentials (LFPs) can be recorded with
extracellular recordings, and a number of studies have
shown their application; however, as ECoG and EEG
(covered in Sects. 39.5 and 39.6) are indirect measures
of LFPs, we do not cover LFPs again here, for brevity.


39.9 Translating Brainwaves into Control Signals — BCIs 


Heretofore the chapter has focused on the characteris- 
tics of the neural correlates of motor control and how 
these might be deployed in SMR-based EEG, ECoG,
Brain-computer interfaces, however, require a number
of stages of signal processing and components to be 
effective and robust. Figure 39.9 shows common com- 
ponents of a complete BCI system. Although not all 
components shown are deployed together in every sys- 
tem there is increasing evidence that combining the best 
approaches deployed for each component and process 
in a multi-stage framework as well as ensemble meth- 
ods or multi-classifier approaches can lead to significant 
performance gains when discriminating sensorimotor 
rhythms and translating brain oscillations into stable 
and accurate control signals. Performance here can be 
considered from various perspectives, including sys- 
tem accuracy in producing the correct response, the 
speed at which a response is detected (or the number 
of correct detections possible in a given period), the 
adaptability to each individual and the inherent non- 
stationary dynamics of the mutual interaction between 
the brain and the translating algorithm, the length of 
training required to reach an acceptable performance, 
the number of sensors required to derive a useful con- 
trol signal, and the amount of engagement needed by 
the participant, to name but a few. The following sec- 
tions highlight some of the methods which have been 
tried and tested in sensorimotor rhythm BCIs; however,
the coverage is by no means exhaustive. Also, the
main emphasis is on EEG-based BCI designs, as EEG-based
BCI has been the driving force behind much of
the novel signal processing research conducted in the
field over the last 20 years. Some of the more invasive
approaches are considered less usable in the short
term and high risk for experimentation and deployment in
humans, and they have attracted less funding and offer
less data availability.

EEG is the least informative signal, spectrally and
spatially, about the underlying brain processes. It is
subject to deterioration and spatial diffusion by the
physical properties of the cerebrospinal fluid, skull,
and skin, and is highly susceptible to contamination
from other sources such as muscle and eye movements. It
therefore poses the most challenges for engineers,
mathematicians, and computer scientists. Researchers
in these, among many other disciplines, are eager to
solve a problem which has long dogged the field, namely,
creating an EEG-based BCI which is accurate and robust
across time for individual subjects and can be deployed
across multiple subjects easily, offering a
communication channel which at least matches, or
surpasses, other basic, tried and tested computer
peripheral input devices and/or basic assistive
communication technologies. Signal processing, as shown
in Fig. 39.10, is only one piece of the puzzle; a range
of other components is equally important, including
electrode technologies and hardware, which are critical
to data quality, usability, and acceptability of the
system. Additionally, the technologies and devices under
the control of the BCI are another aspect, which is not
dealt with here but requires investigation to determine
how applications can be adapted to cope with the, as
yet, inevitable inconsistencies in the communication and
control signals derived from the BCI. Here our intention
is not to deal with these elements of brain-computer
interfaces but only to provide the reader with an
indicative overview of key signal processing and
discrimination topics under consideration in the area,
including topics that have perhaps not received the
attention they deserve but show promise. Interested
readers are referred to [39.78-87] for comprehensive
surveys of BCI control strategies and signal-processing
strategies.


39.9.1 Pre-Processing 
and Feature Extraction/Selection 


Oscillatory and rhythmic activity in various frequency
bands is a predominant feature in sensorimotor
rhythm-based BCIs, as outlined in Sect. 39.7.1. Whilst
amplitude of power in subject-specific sub-bands has
proven to be a reliable feature to enable discrimination
of the lateralized brain activity associated with
gross arm movement imagination from EEG, there is
a general consensus that much more information about
spatial and temporal relationships needs to be extracted
by correlating the synchronicity, amplitude, phase, and
coherence of oscillatory activity across distributed
brain regions. To that end, spectral filtering is often
accompanied by spatial pattern estimation techniques and
channel selection techniques, along with other
preprocessing techniques to detect signal sources and
remove noise. These include principal component analysis
(PCA) and independent component analysis (ICA), among
others, whilst the most commonly used is the common
spatial patterns (CSP) approach [39.88-91].

Many of these methods involve linear transformations
in which a set of possibly correlated observations is
transformed into a set of uncorrelated variables; they
can be used for feature dimensionality reduction,
artifact removal, and channel selection.
CSP is by far the most commonly deployed of all these
filters in sensorimotor rhythm-based BCIs.

CSP maximizes the ratio of class-conditional variances
of EEG sources [39.88, 89]. To utilize CSP, $\Sigma_1$
and $\Sigma_2$ are the pooled estimates of the covariance
matrices for two classes, as follows

$$\Sigma_c = \frac{1}{I_c} \sum_{i=1}^{I_c} X_i X_i^{\mathsf{T}} , \quad c \in \{1, 2\} , \tag{39.5}$$

where $I_c$ is the number of trials for class $c$ and
$X_i$ is the $M \times N$ matrix containing the i-th
windowed segment of trial $i$; $N$ is the window length
and $M$ is the number of EEG channels. The two covariance
matrices, $\Sigma_1$ and $\Sigma_2$, are simultaneously
diagonalized such that the eigenvalues sum to 1. This is
achieved by calculating the generalized eigenvectors $W$

$$\Sigma_1 W = (\Sigma_1 + \Sigma_2) W D , \tag{39.6}$$

where the diagonal matrix $D$ contains the eigenvalues
of $\Sigma_1$ and the column vectors of $W$ are the
filters for the CSP projections. With this projection
matrix the decomposition mapping of the windowed trials
$X$ is given as

$$E = W X . \tag{39.7}$$
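
A minimal NumPy sketch of the computation in (39.5)-(39.7) follows. It solves the generalized eigenvalue problem by whitening the composite covariance, which is one common way of doing so but not necessarily the procedure used in the cited works. The two-channel synthetic trials, the function names, and the log-variance feature at the end are assumptions made for illustration. The filters are stored as columns of `W`, so the projection is written `W.T @ X` in the code.

```python
import numpy as np


def pooled_covariance(trials):
    """Average of X_i X_i^T over the trials of one class; each X_i is M x N."""
    return sum(X @ X.T for X in trials) / len(trials)


def csp_filters(trials_1, trials_2):
    """Solve Sigma_1 W = (Sigma_1 + Sigma_2) W D for W and the diagonal of D."""
    s1 = pooled_covariance(trials_1)
    s2 = pooled_covariance(trials_2)
    # Whiten the composite covariance: P.T @ (s1 + s2) @ P = I.
    evals, evecs = np.linalg.eigh(s1 + s2)
    P = evecs / np.sqrt(evals)
    # Diagonalize the whitened class-1 covariance.
    d, U = np.linalg.eigh(P.T @ s1 @ P)
    order = np.argsort(d)[::-1]          # largest class-1 share first
    return P @ U[:, order], d[order]


# Synthetic two-channel data: class 1 is strong on channel 0, class 2 on channel 1.
rng = np.random.default_rng(0)
scale_1 = np.array([[3.0], [1.0]])
scale_2 = np.array([[1.0], [3.0]])
class_1 = [scale_1 * rng.standard_normal((2, 200)) for _ in range(20)]
class_2 = [scale_2 * rng.standard_normal((2, 200)) for _ in range(20)]
W, d = csp_filters(class_1, class_2)

# Log-variance of the surrogate channels is a commonly used CSP feature.
features = np.log(np.var(W.T @ class_1[0], axis=1))
```

Each eigenvalue in `d` is the share of the composite variance that the corresponding filter assigns to class 1, so the share for class 2 is `1 - d`; filters at both ends of the spectrum are therefore the most discriminative.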


To generalize CSP to three or more classes (the
multi-class paradigm), spatial filters can be produced for
each class vs. the remaining classes (one vs. rest
approach). If q is the number of filters used, then there
are q × C surrogate channels from which to extract
features, where C is the number of classes. To illustrate
how CSP enhances separability among four classes, the
hypothetical relative variance levels of the data in each
of the four classes are shown in
Fig. 39.10.

CSP has been modified and improved substan- 
tially using numerous techniques and deployed and 
tested in BCIs [39.88-92]. CSP is commonly ap- 
plied with spectral filters. One of the more successful 
approaches to spectral filtering combined with CSP 
is the filter bank CSP approach [39.93, 94]. Another 
promising technique for the analysis of multi-modal, 
multi-channel, multi-tasks, multi-subject, and multi- 
dimensional data is multi-way (array) tensor factor- 
ization/decomposition [39.95]. The technique has been 
shown to have the ability to discriminate between dif- 
ferent conditions, such as right hand motor imagery, 
left hand motor imagery, or both hands motor imagery, 
based on the spatiotemporal features of the different 
EEG tensor factorization components observed. 

Due to the short sequences of events during motor 
control it is likely that assessment of activity at a fine 
granularity such as the optimal embedding parameters 
for prediction as well as the predictability of EEG over 
short and long time spans and across channels will also 
provide clues about the temporal sequences of motor 
planning and activations and the motion primitives in- 
volved in different hand movement trajectories. Work 
has shown that subject, channel, and class-specific op- 
timal time embedding parameter selection using partial 
mutual information improves the performance of a pre- 
dictive framework for EEG classification in SMR-based 
BCIs [39.92, 96-102]. Many other time series modeling,
embedding, and prediction approaches based on
traditional and computational intelligence techniques,
such as fuzzy and recurrent neural networks (FNN and
RNN), have been promoted for EEG preprocessing and
feature extraction to maximize signal separability
[39.19, 66-69, 103].
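
To make the notion of time embedding concrete, the sketch below builds delay vectors from a single-channel series and pairs each with the next sample, which is the form of input a one-step-ahead predictor would use. The embedding dimension `dim` and lag `tau` are the parameters that the cited work selects per subject, channel, and class; the selection step (e.g., by partial mutual information) is not implemented here, and the function names are hypothetical.

```python
def delay_embed(x, dim, tau):
    """Return delay vectors [x[t], x[t - tau], ..., x[t - (dim - 1) * tau]]."""
    start = (dim - 1) * tau
    return [[x[t - k * tau] for k in range(dim)] for t in range(start, len(x))]


def one_step_pairs(x, dim, tau):
    """Pair each delay vector with the sample that follows it."""
    start = (dim - 1) * tau
    vectors = delay_embed(x, dim, tau)
    # The last vector has no successor in x, so it is dropped.
    return [(v, x[t + 1]) for t, v in enumerate(vectors[:-1], start=start)]


# Toy series standing in for one EEG channel.
series = [float(v) for v in range(10)]
vectors = delay_embed(series, dim=3, tau=2)
pairs = one_step_pairs(series, dim=3, tau=2)
```

Features for classification can then be derived from the prediction errors of predictors trained on such pairs for each class.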

The above preprocessing or filtering frameworks 
have been used extensively, yet rarely independently, 
but in conjunction with a stream of other signal pro- 
cessing methodologies to extract reliable information 
from neural data. It is well known that the amplitude 
and the phase of neural oscillations are spatially and 
temporally modulated during sensorimotor processing 
(see Sects. 39.6 and 39.7 for further details). Spectral
information and band power extraction have been
commonly used as features (see [39.82, 83] for reviews);
however, phase and cross-frequency coupling less so,
even though a number of non-invasive and intracortical
studies have emphasized the importance of phase
information [39.33, 35, 104, 105]. Furthermore,
amplitude-phase cross-frequency coupling has been
suggested to play an important role in neural coding
[39.106]. While neural representations of movement
kinematics and movement imagination by amplitude
information in sensorimotor cortex have been extensively
reported using different oscillatory signals (LFP, ECoG,
MEG, EEG) [39.33, 35, 39, 107-111] and used extensively
in non-invasive motor imagery-based BCI designs, phase
information has not been given as much attention as it
possibly deserves [39.35]. As reported in [39.35], there
have been some recent developments describing
synchronized activity between M1 and hand speed
[39.81, 82], corticomuscular coupling [39.112], and the
LFP beta oscillations phase locked to target cue onset in
an instructed-delay reaching task [39.113], in addition
to the studies covered in Sects. 39.6 and 39.7, among
others. The role of phase coding in the sensorimotor
cortex should be further explored to fully exploit the
complementary information encoded by amplitude and
phase [39.35].
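
One simple way to quantify phase relationships between two signals is the phase locking value (PLV), sketched below with NumPy. The PLV is used here only as an example of a phase-based feature; it is not claimed to be the measure used in the cited studies. The instantaneous phase is taken from an FFT-based analytic signal, the test signals are synthetic sinusoids, and in practice the data would first be band-pass filtered to a narrow band.

```python
import numpy as np


def analytic_signal(x):
    """FFT-based analytic signal (discrete Hilbert transform construction)."""
    n = len(x)
    spectrum = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * h)


def phase_locking_value(x, y):
    """|mean(exp(i * (phase_x - phase_y)))|: 1 for a constant lag, near 0 for none."""
    dphi = np.angle(analytic_signal(x)) - np.angle(analytic_signal(y))
    return float(np.abs(np.mean(np.exp(1j * dphi))))


t = np.arange(1000) / 1000.0                    # 1 s at a hypothetical 1 kHz rate
a = np.cos(2 * np.pi * 20 * t)                  # 20 Hz oscillation
b = np.cos(2 * np.pi * 20 * t - np.pi / 3)      # same frequency, fixed phase lag
c = np.cos(2 * np.pi * 23 * t)                  # different frequency
plv_locked = phase_locking_value(a, b)
plv_unlocked = phase_locking_value(a, c)
```

Because the PLV ignores amplitude, it carries information that is complementary to band power features.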

Parameter optimization can be made more efficient
through global searches of the parameter space
using evolutionary computation-based approaches such
as particle swarm optimization (PSO) and genetic
algorithms (GAs). The importance of features can be
assessed and ranked for different tasks using
information-theoretic feature selection techniques such
as partial mutual information-based (PMI) input variable
selection [39.98, 114]. Parameter optimization and
feature selection approaches such as these enable
coverage of a large parameter space when additional
features are identified to enhance performance.
Heuristic-based approaches can be used to determine the
relative increase in classification accuracy associated
with each variable, along with other more advanced
methods for feature selection such as Fisher’s criterion
and partial mutual information to estimate the level
of redundancy among features. Verifying the feature
landscape using global heuristic searches is important
initially; automated intelligent approaches then enable
efficient system optimization during application at
a later time and easy application to a large sample of
participant data, i.e., they remove the necessity
to conduct global parameter searches.
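
As an illustration of feature ranking, the sketch below scores each feature with a two-class Fisher criterion, taken here as the squared difference of the class means divided by the sum of the class variances. The toy feature vectors and function names are invented. Unlike partial mutual information, this per-feature score says nothing about redundancy among features.

```python
import statistics


def fisher_score(values_a, values_b):
    """(difference of class means)^2 / (sum of class variances) for one feature."""
    num = (statistics.mean(values_a) - statistics.mean(values_b)) ** 2
    den = statistics.pvariance(values_a) + statistics.pvariance(values_b)
    return num / den


def rank_features(class_a, class_b):
    """Return feature indices sorted by Fisher score (best first) and the scores.

    class_a, class_b: lists of feature vectors, one vector per trial.
    """
    n_features = len(class_a[0])
    scores = [fisher_score([v[j] for v in class_a], [v[j] for v in class_b])
              for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores


# Toy data: feature 0 separates the classes, feature 1 does not.
class_a = [[1.0, 5.0], [1.2, 3.0], [0.8, 4.0]]
class_b = [[2.0, 4.5], [2.2, 3.5], [1.8, 4.0]]
order, scores = rank_features(class_a, class_b)
```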


39.9.2 Classification 


Various classifier techniques can be applied to the
sampled data to determine classification/prediction
performance, including standard linear methods such as linear
discriminant analysis (LDA), support vector machines 
(SVM), and probabilistic-based approaches [39.115], 
as well as non-linear approaches such as backprop- 
agation neural networks (NN) and self-organizing 
fuzzy neural networks (SOFNN) [39.116]. Other adap- 
tive methods and approaches to classifier combination 
have been investigated [39.87, 88, 117, 118] along with 
Type-2 fuzzy logic approaches to deal with uncer- 
tainty [39.119, 120]. Recent evidence has shown that 
probabilistic classifier vector machines (PCVM) have 
significant potential to outperform other tried and tested 
classifiers [39.121, 122]. These are just a few of the
available approaches (see [39.82] for a more detailed re- 
view). Here we focus on one of the latest trends in BCI 
translation algorithms, i.e., automated adaptation to 
non-stationary changes in the EEG dynamics over time. 
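
For reference, a two-class LDA of the kind listed above can be sketched in a few lines of NumPy. The sketch assumes equal class priors and a shared covariance matrix, and it uses synthetic two-dimensional features in place of real band power or CSP features; all names and values are illustrative.

```python
import numpy as np


def train_lda(X1, X2):
    """Two-class LDA with equal priors; returns weights w and bias b.

    Rows of X1 and X2 are trials, columns are features.
    A trial x is assigned to class 1 when w @ x + b > 0.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Pooled within-class covariance.
    pooled = (np.cov(X1, rowvar=False) * (len(X1) - 1)
              + np.cov(X2, rowvar=False) * (len(X2) - 1)) / (len(X1) + len(X2) - 2)
    w = np.linalg.solve(pooled, m1 - m2)
    b = -0.5 * w @ (m1 + m2)
    return w, b


def predict(w, b, X):
    """Label each row of X as class 1 or class 2."""
    return np.where(X @ w + b > 0, 1, 2)


# Synthetic features for two classes.
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[2.0, 0.0], scale=1.0, size=(100, 2))
X2 = rng.normal(loc=[0.0, 2.0], scale=1.0, size=(100, 2))
w, b = train_lda(X1, X2)
accuracy = (np.mean(predict(w, b, X1) == 1) + np.mean(predict(w, b, X2) == 2)) / 2
```

The accuracy here is computed on the training data only to keep the example short; a real evaluation would use held-out trials or cross-validation.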


39.9.3 Unsupervised Adaptation 
in Sensorimotor Rhythms BCIs 


EEG signals deployed in BCI are inherently
non-stationary and change substantially over time,
both within a single session and between sessions,
which poses significant challenges for maintaining BCI
system robustness. There are various sources of
non-stationarity: short-term changes related to
modifications of the strategy that users apply to motor
imagery to enhance performance, drifts in attention,
attention to different stimuli or processing of other
thoughts or stimuli/feedback, slow cortical potential
drifts, and less specific long-term changes related to
fatigue and small day-to-day differences in the placement
of electrodes, among others. However, one which is
considered a potential source of change over time is user
adaptation through motor learning to improve BCI
performance over time, sometimes referred to as the
effects of feedback training [39.123, 124] and
sensorimotor learning.

The effects of feedback on the user’s ability to pro- 
duce consistent EEG, as he/she begins to become more 
confident and learns to develop more specific com- 
munication and control signals, can have a negative 
effect on the BCI’s feature extraction procedure and 
classifier. During sensorimotor learning the temporal 
and spatial activity of the brain continually adapts and 
the features which were initially suited to maximizing 
the discrimination accuracy may not remain stable as 
time evolves, thus degradation in communication oc- 
curs. For this reason, the BCI must have the ability to 
adapt and interact with the adaptations that the brain 
makes in response to the feedback. According to Wolpaw
et al. [39.125] the BCI operation depends on the
interaction of two adaptive controllers, the user’s brain, 
which produces the signals measured by the BCI, and 
the BCI itself, which translates these signals into spe- 
cific commands [39.125]. 

With feedback, even though classification accuracy 
is expected to improve with an increasing number of 
experiments, the performance has been shown to de- 
crease with time if the classifier is not updated [39.43]. 
This has been referred to as the man—machine learning 
dilemma (MMLD), meaning that the two systems in- 
volved (man and machine) are strongly interdependent, 
but cannot be controlled or adapted in parallel [39.43]. 
Many studies show that feedback results in changing
EEG patterns, and thus adaptation of the pattern
recognition methods is required.
It is, therefore, paramount to adapt a BCI periodically
or continuously if possible. Autonomous adaptive system
design is required but remains a challenge. The recognition
and productive engagement of adaptation will be impor- 
tant for successful BCI operation. According to Wolpaw 
et al. [39.125] there are three levels of adaptation which 
are not always accounted for but have great importance 
for future adoption of BCI systems: 


1. When a new user first accesses the BCI, the algo- 
rithm adapts to the user’s signal features. 

• No two people are the same physiologically
or psychologically, therefore brain topography 
differs among individuals, and the electrophysi- 
ological signals that are produced from different 
individuals are unique to each individual, even 
though they may be measured from the same lo- 
cation on the scalp whilst performing the same 
mental tasks at the same time. For each new 
user the BCI has to adapt specifically to the 
characteristics of each particular person’s EEG. 
This adaptation may be to find subject-specific 
frequency bands which contain frequency com- 
ponents that enable maximal discrimination ac- 
curacy between two mental tasks or train a static 
classifier on a set of extracted features. 

2. The second level of adaptation requires that the 
BCI system components be periodically adjusted or 
adapted online to reduce the impact of spontaneous 
variations in the EEG. 

• Any BCI system which only possesses the first
level of adaptation will continue to be effective
only if the user’s performance is very stable.
Most electrophysiological signals display short-
and long-term variations due to the complexity
of the physiological functioning of the underlying
processes in the brain, among other sources
of change as outlined above. The BCI system
should have the ability to accommodate
these variations by adapting to the signal feature
values which maximally express the user’s
intended communication.

3. The third level of adaptation accommodates and engages
the adaptive capacities of the brain.

• The BCI depends on the interaction of two 
adaptive controllers, the BCI and the user’s 
brain [39.125]: 


When an electrophysiological signal feature 
that is normally merely a reflection of brain 
function becomes the end product of that func- 
tion, that is, when it becomes an output that 
carries the user intent to the outside world, it 
engages the adaptive capacities of the brain. 


This means that, as the user develops the skill 
of controlling their EEG, the brain has learned 
a new function and, hopefully, the brain’s newly 
learned function will modify the EEG so as to 
improve BCI operation. The third level of adap- 
tation should accommodate and encourage the 
user to develop and maintain the highest pos- 
sible level of correlation between the intended 
communication and the extracted signal features 
that the BCI employs to decipher the intended 
communication. Due to the nature of this adap- 
tation (the continuous interaction between the 
user and the BCI) it can only be assessed on- 
line and its design is among the most difficult 
problems confronting BCI research. 


McFarland et al. [39.126] further categorize adaptation
into system adaptation, user adaptation, and system
and user co-adaptation, asking the question: is it necessary
to continuously adapt the parameters of the BCI
translation algorithm? Their findings show that for
sensorimotor rhythm BCIs it is, whereas perhaps it is not
for other stimulus-based BCIs.

A review of adaptation methods is provided by
Hasan [39.127], focusing on the questions of what, how, and
when to adapt and how to evaluate adaptation success.
A range of studies has been aimed at addressing
the adaptation requirements [39.123, 124, 128–137].
Krusienski et al. [39.35] define the various types of possible
adaptation as follows:


• Covariate shift adaptation/minimization: Covariate
shift refers to the case where the training
features and test features follow different distributions,
while the conditional distribution of the
output values (of the classifier) given the features
is unchanged [39.138]. The shift in feature distribution
from session to session can be significant
and can result in substantive biasing effects. Without
some form of adaptation of the features and/or
classifier, a classifier trained on a past session
would perform poorly in a more recent session.
Satti et al. [39.139] proposed a method for covariate
shift minimization (CSM), where features are
adapted so that the feature distribution is always
consistent with the distribution of the features that
were used to train the classifier in the first session.
This can be achieved in an unsupervised manner
by estimating the shift in distribution using a least-squares
polynomial fit for each feature, removing the
estimated shift, and adding the common mean of
the training feature distribution, so that the feature
space distribution remains constant over time as
described in [39.139]. Mohammadi et al. [39.140]
applied CSM in a self-paced BCI, updating features to
account for short-term (within-trial) drifts in signal
dynamics. In [39.138] an importance-weighted
cross-validation for accommodating covariate shift
under a number of assumptions is described, but it
is not adaptively updated online in an unsupervised
manner, whereas other offline approaches have
been investigated to enable feature extraction methods
to accommodate non-stationarity and covariate
shifts [39.90, 91].

• Feature adaptation/regression: Involves adapting
the parameters of the feature extraction methods
to account for subject learning. For example, modifying
the subject-specific frequency bands can be easily
achieved in a supervised manner, but this is not necessarily
easily achieved online and unsupervised. An
approach to adaptively weight features based on mu
and beta rhythm amplitudes and their interactions
using regression [39.4] resulted in significant performance
improvements and may be adapted for
unsupervised feature adaptation.

Covariate shift adaptation/minimization can be considered
an anti-biasing method because it prevents
classifier bias, whereas feature adaptation/regression
is likely to result in the need to adapt
the classifier to suit the new feature distributions.
Both methods help to improve the performance over
time, but it is uncertain if feature adaptation followed
by covariate shift minimization (to shift features towards
the earlier distribution) would limit the need for
classifier adaptation and/or provide stable performance,
or negate the benefits of feature regression.
An interesting discussion on the interplay between
feature regression and adapting bias and gain terms
in the classifier is presented in [39.4].

• Classifier adaptation: Unsupervised classifier adaptation
has received more attention than feature
adaptation, with a number of methods having been
proposed [39.124, 128]. Classifier adaptation is required
when significant learning (or relearning)-induced
plasticity in the brain significantly alters
the brain dynamics, resulting in a shift in feature
distribution as well as significant changes in the
conditional distribution between features and classifier
output, as opposed to cases where only covariate
shift has occurred. In such cases, classifier adaptation
can be referred to as neither anti-biasing nor
de-biasing.

• Post-processing adaptation: De-biasing the classifier
output, in its simplest form, can be performed
in an unsupervised manner by removing
the mean calculated from a window of recent classifier
outputs from the instantaneous value of the
classifier output [39.141]. This is also referred to as normalization
in [39.142], where the data from recent trials are
used to predict the mean and standard deviation of
the next trial and the data of the next trial are then
normalized by these estimates to produce a control
signal which is assumed to be stationary. De-biasing
is suitable when covariate shift has not been accounted
for and can improve the online feedback
response but may only provide a slight performance
improvement.

• EEG data space adaptation (EEG-DSA): Acts on
the raw data space and is a new approach that linearly
transforms the EEG data from the target space
(evaluation/testing session), such that the distribution
difference to the source space (training session)
is minimized [39.143]. The Kullback–Leibler (KL)
divergence criterion is the main method deployed
in this approach and it can be applied in a supervised
or unsupervised manner, either periodically or
continuously. Other adaptations (feature space or
classifier) can be applied in tandem but accurate
minimization of feature space adaptation should
negate the need for further anti-biasing and/or de-biasing
adaptations.
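
The covariate shift minimization idea described in the list above can be sketched roughly as follows. This is only an illustrative simplification, not the implementation of [39.139]: it assumes a degree-1 polynomial (a straight line) fitted by least squares to a single feature over a block of trials, and the names `fit_line`, `minimize_covariate_shift`, and `training_mean` are hypothetical.

```python
from statistics import fmean


def fit_line(values):
    """Least-squares fit of y = a + b*t for t = 0, 1, ..., len(values)-1.

    A straight line is the simplest polynomial; the order is an assumption here.
    """
    ts = range(len(values))
    t_mean = fmean(ts)
    y_mean = fmean(values)
    denom = sum((t - t_mean) ** 2 for t in ts)
    if denom == 0:  # a single sample: no slope can be estimated
        return y_mean, 0.0
    b = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, values)) / denom
    a = y_mean - b * t_mean
    return a, b


def minimize_covariate_shift(feature_values, training_mean):
    """Remove the fitted drift from one feature and re-centre it on the
    mean that the feature had in the training session."""
    a, b = fit_line(feature_values)
    return [x - (a + b * t) + training_mean
            for t, x in enumerate(feature_values)]


# A feature that drifts linearly away from its training mean of 5.0.
drifting = [5.0 + 0.1 * t for t in range(20)]
corrected = minimize_covariate_shift(drifting, training_mean=5.0)
```

No class labels are used, which is what makes this kind of correction applicable in an unsupervised manner; in an online setting the fit would be made on a window of past trials only.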


Classifier adaptation (anti-biasing) negates the need
for post-processing (de-biasing) if the classifier is updated
continuously, which is a challenging task to
undertake in an unsupervised manner (with no class 
labels) and may result in maladaptation, whereas de- 
biasing can be conducted easily, unsupervised, regard- 
less of the classifier used. Because post-processing- 
based de-biasing only results in removal of bias (shifts it 
to mean zero) in the feedback signal and not necessarily 
a change in the dynamics of the feedback signal, feature 
adaptation or classifier adaptation is necessary during 
subject learning and adaptation as the conditional dis- 
tribution between features and classifier output evolves 
as outlined above. 
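
A minimal sketch of post-processing de-biasing, under stated assumptions, is given here. It subtracts the mean of a sliding window of recent classifier outputs from the current output and optionally divides by the window's standard deviation (the normalization variant). The class name `OutputDebiaser`, the default window length, and the use of the population standard deviation are illustrative choices rather than details taken from [39.141] or [39.142].

```python
from collections import deque
from statistics import fmean, pstdev


class OutputDebiaser:
    """Unsupervised de-biasing of a stream of scalar classifier outputs."""

    def __init__(self, window=20):
        # Only the most recent `window` outputs are remembered.
        self.recent = deque(maxlen=window)

    def step(self, output, normalize=False):
        """Return the de-biased (optionally normalized) output.

        Statistics come from earlier outputs only, so the current
        sample never influences its own correction.
        """
        if self.recent:
            mean = fmean(self.recent)
            std = pstdev(self.recent)
        else:
            mean, std = 0.0, 0.0
        self.recent.append(output)
        centered = output - mean
        if normalize and std > 0.0:
            return centered / std
        return centered


debiaser = OutputDebiaser(window=5)
# A classifier stuck at a constant offset of 2.0 is pulled back to zero
# as soon as some history is available.
debiased = [debiaser.step(2.0) for _ in range(4)]
```

Because only the output stream is touched, the same wrapper can sit behind any classifier, which matches the point made above that de-biasing is independent of the classifier used.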

All of the above methods are heavily dependent 
upon the context in which the BCI is used. For exam- 
ple, for a BCI applied in alternative communication the 
objective is to maximize the probability of interpret- 
ing the user’s intent correctly; therefore the adaptation 
is performed with that objective, whereas, if the BCI 
is aimed at inducing neuroplastic changes in specific 
cortical areas, e.g., a BCI which is aimed at support- 
ing stroke survivors perform motor imagery as means 
of enhancing the speed or level of rehabilitation post 
stroke, the objective is to not only provide accurate 
feedback but to encourage the user to activate regions 
of cortex which do not necessarily provide optimal 
control signals [39.35]. The latter may require elec- 
trode/channel adaptation strategies but not necessarily 
in a fast online unsupervised manner.

Fig. 39.11 Comparison of communication rates between humans
and the external world: (a) speech received auditorily; (b) speech
received visually using lip reading and supplemented by cues;
(c) Morse code received auditorily; (d) Morse code received
through vibrotactile stimulation (figure adapted from [39.144] with
permission and other sources [39.145, 146])

Abrupt changes to classifier performance may also lead to negative
learning where the user cannot cope with the rate at
which the feedback dynamics change; in such cases
consistent feedback, even though less accurate, may
be appropriate. As outlined in [39.126], there is still 
debate around whether mutual adaptation of a sys- 
tem and user is a necessary feature of successful BCI 
operation or if fast adaptation of parameters during 
training is not necessary. A recent study in animal
models suggests that there is no negative correlation
between decoding performance and the time between
model generation and model testing, which indicates
that the neural representations that encode kinematic
parameters of reaching movements are stable across
the months of study [39.147, 148]. This further suggests
that little adaptation is needed for ECoG decoding in
animal models, but this may not necessarily translate
to humans and non-invasive BCIs involving motor imagery.
Much more research is needed on what types of
adaptation methods to apply and at what rate adaptation
is necessary. Another important factor to consider is
a person’s level of ability to control a BCI: those
persons close to chance levels may actually benefit from
an incorrect belief about their performance level [39.149].
This would imply adapting the classifier output based 
on knowledge of the targets in a supervised manner, 
such that the user thinks they are performing better — 
a method which may help in the initial training phases 
to improve BCI performance [39.149]. Most of the 
techniques outlined above have been tested offline and, 
therefore, there is need to assess how the techniques 
improve performance as the user and BCI are mutu- 
ally adapted. Table 39.1 provides a summary of the 
categories of adaptation and their interrelationships and 
requirements. 


39.9.4 BCI Outlook 


Translating brain signals into control signals is a complex
task. The communication bandwidth given by BCI
still lags behind most other methods of communication
between humans and the external world: the
maximum BCI communication rate is ~0.41 bit/s
(~25 bit/min) [39.144] (see Fig. 39.11, which
illustrates the gap in communication
bandwidth between BCI and other communication
methods, as well as the relatively low communication
bandwidth across all human–human and human–computer
interaction methods).
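
To make the notion of communication rate concrete, the sketch below computes bits per selection with the definition commonly used in the BCI literature (attributed to Wolpaw and colleagues), which assumes equiprobable targets and errors spread evenly over the wrong targets. Whether the figure quoted from [39.144] was derived with this particular formula is not stated here, so the example should be read as an illustration only. The function names are hypothetical.

```python
from math import log2


def bits_per_selection(n_classes, accuracy):
    """Information carried by one selection among n_classes targets:
    log2(N) + P*log2(P) + (1 - P)*log2((1 - P)/(N - 1))."""
    if n_classes < 2:
        raise ValueError("need at least two classes")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    p = accuracy
    bits = log2(n_classes)
    if p > 0.0:
        bits += p * log2(p)
    if p < 1.0:
        bits += (1.0 - p) * log2((1.0 - p) / (n_classes - 1))
    return bits


def bits_per_minute(n_classes, accuracy, selections_per_minute):
    """Scale the per-selection information by the selection rate."""
    return bits_per_selection(n_classes, accuracy) * selections_per_minute


# Unit conversion used in the text: 0.41 bit/s is roughly 25 bit/min.
rate_per_minute = 0.41 * 60
```

A perfectly accurate two-class BCI carries exactly one bit per selection, and a four-class BCI operating at chance (25 % accuracy) carries none, which shows why both accuracy and selection speed determine the achievable bandwidth.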

Table 39.1 BCI components that can be adapted and the way in which they can be adapted. Interrelationships between
components are indicated, i.e., when one is adapted, which other components or stages of the signal processing pipeline it
might be necessary to adapt (whether a calibration session is needed for offline setup or whether a certain number of trials
is needed before adaptation begins is not specified in the criteria but is another consideration)


Adaptation type Anti- De- Subject Feature 

Fig. 39.12 Distributions of target-acquisition times (i.e.,
time from target appearance to target hit) on a 2-D
center-out cursor-movement task for joystick control,
EEG-based BCI control, and cortical neuron-based BCI
control. The EEG-based and neuron-based BCIs perform
similarly and both are slower and much less
consistent than the joystick. For both BCIs, in a substantial
number of trials the target is not reached
even in the 7 s allowed. Such inconsistent performance
is typical of movement control by present-day
BCIs, regardless of what brain signals they use. (Joystick
data and neuron-based BCI data from Hochberg
et al. [39.150]; EEG-based BCI data from Wolpaw and
McFarland [39.57]; figure after [39.58], courtesy of McFarland
et al.)

Nevertheless, with the many developments and studies
highlighted throughout this chapter (a selected few
among many) there has been progress, yet there is 
still debate around whether invasive recordings are
more appropriate for BCI, with findings showing that
performance to date is not necessarily better, or communication
faster, with invasive or extracellular recordings
compared to EEG (see Fig. 39.12 for an illustration [39.58]).
As shown, performance is far less consistent
than a joystick for 2-D center-out tasks using
both methods; however, the performance is remarkably
similar even though the extracellular recordings
are high resolution and EEG is low resolution. Training
rates/durations with invasive BCI are probably less
onerous on the BCI user compared to EEG-based approaches,
which often require longer durations; however,
only a select few are willing to undergo surgery
for BCI implants due to the high risk associated with
the surgery required, at least with the currently available
technology. This is likely to change in the future and
information transfer between humans and machines is
likely to increase to overcome the communication bottleneck
in human–human and human–computer interaction
by directly interfacing brain and machine [39.144].
There is one limitation that dogs many movement or
motor-related BCI studies: to a large
extent, control relies only on a signal from a single cortical
area [39.58]. Exploiting multiple cortical areas may offer
much more and this may be achieved more easily
and successfully by exploiting information acquired at
different scales using both invasive and non-invasive
technologies (many of the studies reported in this chapter
have shown advantages that are unique at the various
scales of recording). Carmena [39.151] recommends
that non-invasive BCIs should not be pitted against
invasive ones, as both have pros and cons and the field has
gone beyond pitching resolution as an argument to use
one type or the other. In the future, BCI systems may
very well become a hybrid of different kinds of neural
signals, able to benefit from local, high-resolution information
(for generating motor commands) and more
global information (arousal, level of attention, and other
cognitive states) [39.151].

In summary, BCI technology is developing through
a better understanding of the motor system and sensorimotor
control, better recording technologies, better
signal processing, more extensive trials with users,
long-term studies, and more multi-disciplinary interactions,
among many other factors. According to a report
by Berger et al. [39.152], the magnitude of BCI
research throughout the world will grow substantially,
if not dramatically, in future years with multiple driving
forces:


• Continued advances in underlying science and technology

• Increasing demand for solutions to repair the nervous
system

• Increase in the aging population worldwide; a need
for solutions to age-related, neurodegenerative disorders,
and for assistive BCI technologies

• Commercial demand for non-medical BCIs.


BCI has the potential to meet many of these chal- 
lenges in healthcare and is already growing in popular- 
ity for non-medical applications. BCI is considered by 
many as a revolutionary technology. 

An analysis of the history of technology shows
that technological change is exponential, and according
to the law of accelerating returns, as the technology
performance increases, more and more user groups
begin to adopt the technology and prices begin to
fall [39.153]. In terms of BCI there has been significant
progress over recent years, and these trends
are being observed with increasing technology diffusion [39.144].
In terms of research there has been an
exponential growth in the number of peer-reviewed publications
since 2000 [39.84].

Many studies over the past two decades have
demonstrated that non-muscular communication, based
on brain–computer interfaces (BCIs), is possible and,
despite the nascent nature of BCIs, there is already
a range of products, including alternative communication
and control for the disabled, stroke rehabilitation,
electrophysiologically interactive computer
systems, neurofeedback therapy, and BCI-controlled
robotics/wheelchairs. A range of case studies have also
shown that head trauma victims diagnosed as being
in a persistent vegetative state (PVS) or a minimally
conscious state and patients suffering locked-in syndrome
as a result of motor neuron disease or brainstem
stroke can specifically benefit from current BCI systems,
although, as BCIs improve and surpass existing
assistive technologies, they will be beneficial to those
with less severe disabilities. In addition, the possibility
of enriching computer game play through BCI also
has immense potential, and computer games as well
as other forms of interactive multi-media are currently
an engaging interface technique for therapeutic neurofeedback
and improving BCI performance and training
paradigms. Brain–computer games interaction provides
motivation and challenge during training, which is used
as a stepping stone towards applications that offer enablement
and assistance. Based on these projections and
the ever-increasing knowledge of the brain the future
looks bright for BCIs.


39.10 Conclusion


The scientific approaches described throughout this
chapter often overlook the underpinning processes and
rely on correlations between a minimal number of factors
only. As a result, current sensorimotor rhythm
BCIs are of limited functionality and allow basic motor
functions (a two degrees-of-freedom (DOF) limited
control of a wheelchair/mouse cursor/robotic arm)
and limited communication abilities (word dictation).
It is assumed that BCI systems could greatly benefit
from the inclusion of multi-modal data and
multi-dimensional signal processing techniques, which would 
allow the introduction of additional data sources and 
data from multiple brain scales, and enable detection 
of more subtle features embedded in the signal. Fur- 
thermore, using knowledge about sensorimotor control 
will be critical in understanding and developing suc- 
cessful learning and control models for robotic devices 
and BCI, fully closing the sensorimotor learning loop 
to enable finer manipulation abilities using BCIs and 
for retraining or enabling better relearning of motor ac- 
tions after cortical damage. As demonstrated through- 
out the chapter, many remarkable studies have been 
conducted with truly inspirational engineering and sci- 
entific methodologies resulting in many very useful and 
interesting findings. 

There are many potential advantages of understand- 
ing motor circuitry, not to mention the many clinical and 
quality of life benefits that a greater understanding of 
the motor systems may provide. Such knowledge may 
offer better insights into treating motor pathologies that 
occur as a result of injury or diseases such as spinal 
cord injury, stroke, Parkinson’s disease, Guillain–Barré 
syndrome, motor diseases, and Alzheimer’s disease, to 
mention just a few. Understanding sensorimotor sys- 
tems can provide significant gains in developing more 
intelligent systems that can provide multiple benefits for 
humanity in general. However, there are still lacunae in 
our biological account of how the motor system works. 

Animals have superb innate abilities to choose and 
execute simple and extended courses of action and the 
ability to adapt their actions to a changing environment. 
We are still a long way from understanding how that
is achieved and from exploiting this to tackle the issues
outlined above comprehensively. There are a number of
key questions that need to be addressed [39.154]:


• What are the roles of the cortex, the basal ganglia,
and the cerebellum, the three major neural control
structures involved in movement planning and generation?

• How do these structures in the brain interact to deliver
seamless adaptive control?

• How do we specify how hierarchical control structures
can be learned?

• What is the relationship between reflexes, habits,
and goal-directed actions?

• Is there anything to be gained for robotic control
by thinking about how interactions are organized in
sensorimotor regions?

• Is it essential to replicate this lateralized structure in
sensorimotor areas to produce better motor control
in an artificial cognitive system?

• How can we create more accurate models of how
the motor cortex works? Can such models be implemented
to provide human-like motor control in
an artificial system?

• How can we decode motor activity to undertake
tasks that require accurate and robust three-dimensional
control under multiple different scenarios?


Wolpert et al. [39.16] elaborate on some of these 
questions, in particular, one which has not been ad- 
dressed in this chapter, namely modeling sensorimotor 
systems. Although substantial progress has been made 
in computational sensorimotor control, the field has 
been less successful in linking computational mod- 
els to neurobiological models of control. Sensorimotor 
control has traditionally been considered from a con- 
trol theory perspective, without relation to neurobi- 
ology [39.155]. Although neglected in this chapter, 
computational motor cortical circuit modeling will be 
a critical aspect of research into understanding sen- 
sorimotor control and learning, and is likely to fill 
parts of the lacunae in our understanding that are 
not accessible with current imaging, electrophysiology, 
and experimental methodology. Likewise, understanding
the computations undertaken in many of the sensorimotor
areas will depend heavily on computational
modeling. Doya [39.156] noted that the classical notion
that the cerebellum and the basal ganglia are dedicated
solely to motor control is now under dispute
given increasing evidence of their involvement in non-motor
functions. However, there is enough anatomical,
physiological and theoretical evidence to support the
hypotheses that the cerebellum is a specialized structure
that may support supervised learning, the basal
ganglia may perform a reinforcement learning role, and
the cerebral cortex may perform unsupervised learning.
Alternative theories that enable us to comprehend
the way the cortex, cerebellum, and the basal ganglia
participate in motor, sensory or cognitive tasks are required [39.156].

Additionally, as has been illustrated throughout this 
work, investigating brain oscillations is key to under- 
standing brain coordination. Understanding the coor- 
dination of multiple parts of an extremely complex 
system such as the brain is a significant challenge. 
Models of cortical coordination dynamics can show 
how brain areas may cooperate (integration) and at 
the same time retain their functional specificity (seg- 
regation). Such models can exhibit properties that the 
brain is known to exhibit, including self-organization, 
multi-functionality, meta-stability, and switching. Cor- 
tical coordination can be assessed by investigating the 
collective phase relationships among brain oscillations 
and rhythms in neurophysiological data. Imaging and 
electrophysiology can be used to tackle the challenge 
of understanding how different brain areas interact and 
cooperate. 
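
One widely used way to quantify such phase relationships between two oscillatory signals is the phase-locking value (PLV). The sketch below is an illustration added here and is not a method taken from this chapter. It assumes the instantaneous phases (in radians) have already been extracted from band-limited signals, for example with a Hilbert transform, and the function name is hypothetical.

```python
import cmath
import math


def phase_locking_value(phases_a, phases_b):
    """Length of the mean unit phasor of the phase differences.

    Returns a value in [0, 1]: 1 means a constant phase lag between the
    two signals, values near 0 mean no consistent phase relationship.
    """
    if len(phases_a) != len(phases_b) or not phases_a:
        raise ValueError("need two equally long, non-empty phase sequences")
    total = sum(cmath.exp(1j * (a - b)) for a, b in zip(phases_a, phases_b))
    return abs(total) / len(phases_a)


# Two 10 Hz oscillations sampled at 250 Hz with a constant lag of pi/4.
times = [k / 250.0 for k in range(250)]
phase_1 = [2 * math.pi * 10.0 * t for t in times]
phase_2 = [p - math.pi / 4 for p in phase_1]
locked = phase_locking_value(phase_1, phase_2)
```

The measure depends only on phase differences, not on amplitudes, so two areas can score highly even when the power of their rhythms differs greatly.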

Ultimately, better knowledge of the motor system
through neuroengineering sensorimotor–computer interfaces
may lead to better methods of understanding
brain dysfunction and pathology, better brain–computer
interfaces, biologically plausible neural circuit models,
and inevitably more intelligent systems and machines 
that can perceive, reason, and act autonomously. It 
is too early to know the overarching control mech- 
40. Evolving Connectionist Systems: 
From Neuro-Fuzzy-, to Spiking- and Neuro-Genetic 


Nikola Kasabov 


This chapter follows the development of a class of 
neural networks (NN) called evolving connectionist 
systems (ECOS). The term evolving is used here in its 
meaning of unfolding, developing, changing, re- 
vealing (according to the Oxford dictionary) rather 
than evolutionary. The latter represents processes 
related to populations and generations of them. An 
ECOS is a neural network-based model that evolves 
its structure and functionality through incremen- 
tal, adaptive learning and self-organization during 
its lifetime. In principle, it could be a simple NN or 


40.1 Principles of Evolving Connectionist Systems (ECOS) 


Everything in Nature evolves, develops, unfolds, reveals,
and changes in time. The brain is probably
the ultimate evolving system, which develops during
a lifetime, based on genetic information (nature) and
learning from the environment (nurture). Inspired by
information principles of the developing brain, ECOS
are adaptive, incremental learning and knowledge representation
systems that evolve their structure and functionality
from incoming data through interaction with
the environment, where at the core of a system is
a connectionist architecture that consists of neurons (information
processing units) and connections between
them [40.1]. An ECOS is a system based on neural networks
and also on other techniques of computational
intelligence (CI), which operates continuously in
time and adapts its structure and functionality through
continuous interaction with the environment and with
other systems. The adaptation is defined through:


1. A set of evolving rules. 

2. A set of parameters (genes) that are subject to 
change during the system operation. 

3. An incoming continuous flow of information, pos- 
sibly with unknown distribution. 

4. Goal (rationale) criteria (also subject to modifica- 
tion) that are applied to optimize the performance 
of the system over time. 


ECOS learning algorithms are inspired by brain-like 
information processing principles, e.g.: 


1. They evolve in an open space, where the dimensions 
of the space can change. 

2. They learn via incremental learning, possibly in an 
on-line mode. 


3. They learn continuously in a lifelong learning 
mode. 

4. They learn both as individual systems and as an evo- 
lutionary population of such systems. 

5. They use constructive learning and have evolving 
structures. 

6. They learn and partition the problem space locally, 
thus allowing for a fast adaptation and tracing the 
evolving processes over time. 

7. They evolve different types of knowledge represen- 
tation from data, mostly a combination of memory- 
based and symbolic knowledge. 


Many methods, algorithms, and computational intelligence
systems have been developed since the conception
of ECOS, along with many applications across disciplines.
This chapter will review only the fundamental 
aspects of some of these methods and will highlight 
some principal applications. 


40.2 Hybrid Systems and Evolving Neuro-Fuzzy Systems 


40.2.1 Hybrid Systems 


A hybrid computational intelligent system integrates 
several principles of computational intelligence to en- 
hance different aspects of the performance of the sys- 
tem. Here we will discuss only hybrid connectionist 
systems that integrate artificial neural networks (NN) 
with other techniques utilizing the adaptive learning 
features of the NN. 

Fig. 40.1 A hybrid NN-fuzzy rule-based expert system for financial
decision support (after [40.2])

Early hybrid connectionist systems combined NN
with rule-based systems such as production rules [40.3]
or predicate logic [40.4]. They utilized NN modules 
for a lower level of information processing and rule- 
based systems for reasoning and explanation at a higher 
level. 

The above principle is applied when fuzzy rules 
are used for higher-level information processing and 
for approximate reasoning [40.5-7]. These are expert 
systems that combine the learning ability of NN with 
the explanation power of linguistically plausible fuzzy 
rules [40.8-11]. A block diagram of an exemplar sys- 
tem is shown in Fig. 40.1, where at a lower level 
a neural network (NN) module predicts the level of 
a stock index and at a higher level a fuzzy reason- 
ing module combines the predicted values with some 
macro-economic variables representing the political 
and the economic situations using the following types 
of fuzzy rules [40.2] 


IF <the future stock value predicted by the NN module is high> AND
<the economic situation is good> AND
<the political situation is stable>
THEN <buy stock> . (40.1) 
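
As a rough illustration of how a rule of the form (40.1) could be evaluated, the sketch below combines the degrees to which each condition holds using the minimum operator as the fuzzy AND. The choice of the minimum, the function name `rule_strength`, and the numerical membership degrees are all assumptions made for illustration and are not taken from [40.2].

```python
def rule_strength(*membership_degrees):
    """Firing strength of a conjunctive fuzzy rule.

    Each argument is the degree, in [0, 1], to which one condition of
    the IF part is satisfied; the fuzzy AND is taken as the minimum.
    """
    if not membership_degrees:
        raise ValueError("a rule needs at least one condition")
    for degree in membership_degrees:
        if not 0.0 <= degree <= 1.0:
            raise ValueError("membership degrees must lie in [0, 1]")
    return min(membership_degrees)


# Hypothetical degrees for the three conditions of rule (40.1).
predicted_value_is_high = 0.8     # from the NN module's prediction
economic_situation_is_good = 0.6
political_situation_is_stable = 0.9

buy_strength = rule_strength(predicted_value_is_high,
                             economic_situation_is_good,
                             political_situation_is_stable)
```

With the minimum as AND, the weakest condition limits how strongly the rule recommends buying, so here the recommendation holds to degree 0.6.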


Along with the integration of NN and fuzzy rules
for better decision support, the system from Fig. 40.1
includes an NN module for extracting recent rules from
data that can be used by experts to analyze the dynamics
of the stock and to possibly update the trading
fuzzy rules in the fuzzy rule-based module. This NN
module uses a fuzzy neural network (FNN) for the rule
extraction.

Fig. 40.2 A simple, feedforward EFuNN structure. The
rule nodes evolve from data to capture cluster centers in the
input space, while the output nodes evolve local models to
learn and approximate the data in each of these clusters

Fuzzy neural networks (FNN) integrate NN and
fuzzy rules into a single neuronal model, tightly coupling
learning and fuzzy reasoning rules in a connectionist
structure. One of the first FNN models was
initiated by Yamakawa and other Japanese scientists
and promoted at a series of IIZUKA conferences in
Japan [40.12, 13]. Many models of FNNs were developed
based on these principles [40.2, 14, 15].


40.2.2 Evolving Neuro-Fuzzy Systems 
The evolving neuro-fuzzy systems further extended the
principles of hybrid neuro-fuzzy systems and the FNN,
where instead of training a fixed connectionist structure,
the structure and its functionality evolve from incoming
data, often in an on-line, one-pass learning mode.
This is the case with evolving connectionist systems
(ECOS) [40.1, 16-19].

ECOS are modular connectionist-based systems 
that evolve their structure and functionality in a contin- 
uous, self-organized, on-line, adaptive, and interactive 
way from incoming information [40.17]. They can pro- 
cess both data and knowledge in a supervised and/or 
unsupervised way. ECOS learn local models from data 
through clustering of the data and associating a local 
output function for each cluster represented in a connec- 
tionist structure. They can learn incrementally single 
data items or chunks of data and also incrementally 
change their input features [40.18]. 

Elements of ECOS have been proposed as part of
the early, classical NN models, such as Kohonen’s
self-organizing maps (SOM) [40.20], radial basis function
(RBF) networks [40.21], Fuzzy ARTMAP [40.22] by Carpenter
et al., Fritzke’s growing neural gas [40.23], and Platt’s
resource allocation networks (RAN) [40.24].

Some principles of ECOS are: 


• Neurons are created (evolved) and allocated as centers
  of (fuzzy) data clusters. Fuzzy clustering, as
  a means to create local knowledge-based systems,
  was stimulated by the pioneering work of Bezdek,
  Yager and Filev [40.27-30].

• Local models are evolved and updated in these clusters.


Fig. 40.3 An EFuNN structure with
feedback connections (after [40.16])

Fig. 40.4a,b Learning in DENFIS uses the evolving clustering
method, illustrated on a simple example of 2 inputs and 1 output
with 11 data clusters evolved. The recall of the DENFIS for two
new input vectors x1 and x2 is illustrated with the use of the 3 closest
clusters to each new input vector (after [40.25]). (a) Fuzzy rule
group 1 for a DENFIS. (b) Fuzzy rule group 2 for a DENFIS

Fig. 40.5 An example of the DENFIS model (after [40.26]) for
medical renal function evaluation (GFR-ECOS: evolving medical
decision support system)

Here we will briefly illustrate the concepts of
ECOS on two implementations: evolving fuzzy neural
networks (EFuNN) [40.16] and dynamic neuro-fuzzy
inference systems (DENFIS) [40.25]. Examples
of EFuNN are shown in Figs. 40.2 and 40.3 and of
DENFIS in Figs. 40.4 and 40.5. In ECOS, clusters of
data are created (evolved) based on similarity between
data samples (input vectors) either in the input space
(this is the case in some of the ECOS models, e.g.,
DENFIS), or in both the input and output space (this
is the case, e.g., in the EFuNN models). Samples that
have a distance to an existing node (cluster center, rule
node, neuron) less than a certain threshold are allocated
to the same cluster. Samples that do not fit into existing
clusters form (generate, evolve) new clusters. Cluster
centers are continuously adjusted according to new data
samples, and new clusters are created incrementally. ECOS learn
from data and automatically create or update a local
(fuzzy) model/function in each cluster, e.g.,


IF < data is in a (fuzzy) cluster Ci > 


THEN < the model is Fi>, (40.2) 


where Fi can be a fuzzy value, a linear or logistic re- 
gression function (Fig. 40.5), or an NN model [40.25]. 

ECOS utilize evolving clustering methods. There 
is no fixed number of clusters specified a priori, 
but clusters are created and updated incrementally. 
Other ECOS that use this principle are: evolving
self-organized maps (ESOM) [40.17], evolving classification
function [40.18,26], and evolving spiking neural networks
(Sect. 40.3).
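The incremental clustering principle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the clustering method of any particular ECOS model: the function name `evolve_clusters`, the Euclidean distance, the fixed threshold, and the running-mean update of the cluster center are all choices made for this example.

```python
import math

def evolve_clusters(samples, threshold):
    """One-pass clustering sketch: a sample joins the nearest cluster if it
    is closer than `threshold`, otherwise it starts a new cluster."""
    centers = []  # cluster centers, one list of floats per cluster
    counts = []   # number of samples absorbed by each cluster
    for x in samples:
        # find the nearest existing center, if any
        nearest, d_near = None, float("inf")
        for j, c in enumerate(centers):
            d = math.dist(x, c)
            if d < d_near:
                nearest, d_near = j, d
        if nearest is not None and d_near < threshold:
            # assumed update rule: move the center to the running mean
            counts[nearest] += 1
            n = counts[nearest]
            centers[nearest] = [c + (xi - c) / n
                                for c, xi in zip(centers[nearest], x)]
        else:
            # the sample does not fit any cluster: evolve a new one
            centers.append(list(x))
            counts.append(1)
    return centers, counts

# hypothetical 2-D data with two obvious groups
data = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.1), (0.0, 0.1)]
centers, counts = evolve_clusters(data, threshold=0.5)
```

Note that the number of clusters is never specified; it follows from the data and the threshold alone, and the result depends on the order in which samples arrive.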

As an example, the following are the major steps for 
the training and recall of a DENFIS model: 


Training:
1. Create or update a cluster from incoming data.
2. Create or update a Takagi-Sugeno fuzzy rule for
   each cluster:
   IF x is in cluster $C_j$ THEN $y_j = f_j(x)$,
   where $y_j = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q$.

The function coefficients are incrementally updated
with every new input vector or after a chunk of data.

Recall (fuzzy inference for a new input vector):
1. For a new input vector $x = [x_1, x_2, \dots, x_q]$, DENFIS
   chooses m fuzzy rules from the whole fuzzy
   rule set to form a current inference system.

2. The inference result is

   $$y = \frac{\sum_{i=1}^{m} w_i \, f_i(x_1, x_2, \dots, x_q)}{\sum_{i=1}^{m} w_i} , \qquad (40.3)$$

   where i is the index of one of the m clusters closest to
   the new input vector x, $w_i = 1 - d_i$ is a weight derived
   from the distance $d_i$ between this vector and the center
   of cluster i, and $f_i(x)$ is the output calculated for x
   according to the local model $f_i$ of cluster i.
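The recall step of (40.3) can be illustrated with a short Python sketch. It is a minimal example under assumptions that are not taken from the text: the name `denfis_recall` is hypothetical, the inputs are assumed to be scaled to the unit hypercube, and the distance is normalized by the square root of the input dimension so that the weight 1 - d stays in [0, 1].

```python
import math

def denfis_recall(x, rules, m=3):
    """Weighted output of the m rules whose cluster centers are closest to x.

    rules: list of (center, beta) pairs, where beta = [b0, b1, ..., bq]
    holds the coefficients of the local linear model of that cluster.
    """
    q = len(x)
    # choose the m rules with the nearest cluster centers
    nearest = sorted(rules, key=lambda r: math.dist(x, r[0]))[:m]
    num = den = 0.0
    for center, beta in nearest:
        # assumed normalization: inputs lie in [0, 1]^q, so d is in [0, 1]
        d = math.dist(x, center) / math.sqrt(q)
        w = 1.0 - d
        # local Takagi-Sugeno model: b0 + b1*x1 + ... + bq*xq
        f = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
        num += w * f
        den += w
    return num / den  # undefined if all chosen weights are zero

# two hypothetical rules with constant local models (outputs 1 and 3)
rules = [((0.0, 0.0), [1.0, 0.0, 0.0]),
         ((1.0, 1.0), [3.0, 0.0, 0.0])]
y_mid = denfis_recall((0.5, 0.5), rules, m=2)
```

For an input halfway between the two centers both weights are equal, so the result is the plain average of the two local outputs; as the input approaches one center, that rule dominates.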
40.2.3 From Local to Transductive (Individualized) Learning and Modeling


A special direction of ECOS is transductive reason- 
ing and personalized modeling. Instead of building 
a set of local models fi (e.g., prototypes) to cover the 
whole problem space and then using these models to 
classify/predict any new input vector, in transductive 
modeling for every new input vector x a new model 
fx is created based on selected nearest neighbor vec- 
tors from the available data. Such ECOS models are 
neuro-fuzzy inference systems (NFI) [40.31] and the 
transductive weighted neuro-fuzzy inference system 
(TWNFI) [40.32]. In TWNFI for every new input vector 
the neighborhood of the closest data vectors is opti- 
mized using both the distance between the new vector 
and the neighboring ones and the weighted importance 
of the input variables, so that the error of the model is 
minimized in the neighborhood area [40.33]. TWNFI is 
a further development of the weighted-weighted k-nearest
neighbor method (WWKNN) proposed in [40.34]. The
output for a new input vector is calculated based on the 
outputs of the k-nearest neighbors, where the weighting 
is based on both distance and a priori calculated impor- 
tance for each variable using a ranking method such as 
signal-to-noise ratio or the t-test. 
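A small Python sketch may help to make the double weighting concrete. It is an illustration only: the function name `wwknn_predict`, the importance-weighted Euclidean distance, and the particular distance-to-weight mapping are assumptions of this example, and the variable importances are passed in directly instead of being derived from a signal-to-noise ratio or a t-test.

```python
import math

def wwknn_predict(x, data, targets, importance, k=3):
    """Distance- and importance-weighted k-nearest-neighbor prediction."""
    def dist(a, b):
        # each variable contributes in proportion to its importance
        return math.sqrt(sum(c * (ai - bi) ** 2
                             for c, ai, bi in zip(importance, a, b)))

    # k nearest neighbors as (distance, target) pairs, nearest first
    neigh = sorted(((dist(x, xi), yi) for xi, yi in zip(data, targets)),
                   key=lambda t: t[0])[:k]
    d_min, d_max = neigh[0][0], neigh[-1][0]
    if d_max == 0.0:
        weights = [1.0] * len(neigh)
    else:
        # assumed mapping: the nearest neighbor gets weight 1,
        # farther neighbors get proportionally less
        weights = [(d_max - (d - d_min)) / d_max for d, _ in neigh]
    return sum(w * y for w, (_, y) in zip(weights, neigh)) / sum(weights)

# hypothetical data in which only the first variable is informative
data = [(0.0, 0.0), (0.0, 10.0), (1.0, 0.0), (5.0, 5.0)]
targets = [0.0, 0.0, 1.0, 1.0]
y_weighted = wwknn_predict((0.9, 10.0), data, targets, (1.0, 0.0), k=1)
y_uniform = wwknn_predict((0.9, 10.0), data, targets, (1.0, 1.0), k=1)
```

With the second variable switched off by its zero importance, the nearest neighbor changes, and so does the prediction. This is the effect the variable weighting is meant to achieve.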

Other ECOS have been developed as improvements
of EFuNN, DENFIS, or other early ECOS models by
Ozawa et al. and Watts [40.35-37], including ensembles
of ECOS [40.38]. A similar approach to ECOS
was used by Angelov in the development of the evolving
Takagi-Sugeno (ETS) models [40.39].


40.2.4 Applications 


ECOS have been applied to problems across domain areas.
In these applications, local incremental learning
and transductive learning have been reported to outperform
global learning models in terms of accuracy and
new knowledge obtained. A review of
ECOS applications can be found in [40.26]. The applications
include:


• Medical decision support systems (Fig. 40.5)
• Bioinformatics, e.g., [40.40]
• Neuroinformatics and brain study, e.g., [40.41]
• Evolvable robots, e.g., [40.42]
• Financial and economic decision support systems,
  e.g., [40.43]
• Environmental and ecological modeling, e.g.,
  [40.44]
• Signal processing, speech, image, and multimodal
  systems, e.g., [40.45]
• Cybersecurity, e.g., [40.46]
• Multiple time series prediction, e.g., [40.47].


While classical ECOS use a simple McCulloch 
and Pitts model of a neuron and the Hebbian learning 
rule [40.48], evolving spiking neural network (eSNN) 
architectures use a spiking neuron model, applying the 
same or similar ECOS principles. 


40.3 Evolving Spiking Neural Networks (eSNN) 


40.3.1 Spiking Neuron Models 


A single biological neuron with its associated synapses
is a complex information processing machine that involves
short-term information processing, long-term information
storage, and evolutionary information stored
as genes in the nucleus of the neuron. A spiking neuron
model assumes input information represented as trains
of spikes over time. When sufficient input spikes are
accumulated in the membrane of the neuron, the neuron’s
post-synaptic potential exceeds a threshold and the neuron
emits a spike at its axon (Fig. 40.6a,b). State-of-the-art
models of spiking neurons include
early models by Hodgkin and Huxley [40.49] and Hopfield
[40.50], and more recent models by Maass, Gerstner,
Kistler, Izhikevich, Thorpe and Van Rullen [40.51-54].
Such models are spike response models (SRMs),
the leaky integrate-and-fire model (LIFM) (Fig. 40.6),
Izhikevich models, adaptive LIFM, and probabilistic
IFM [40.55].
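The accumulate-and-fire behavior described above can be illustrated with a minimal leaky integrate-and-fire simulation. The sketch below is not taken from the cited models: the function name `simulate_lif`, the forward-Euler integration, and all parameter values are assumptions chosen for readability.

```python
def simulate_lif(input_current, dt=1.0, tau=10.0,
                 v_rest=0.0, v_thresh=1.0, v_reset=0.0):
    """Return the time steps at which a leaky integrate-and-fire neuron spikes."""
    v = v_rest
    spikes = []
    for t, i_in in enumerate(input_current):
        # Euler step of dv/dt = (-(v - v_rest) + i_in) / tau
        v += dt * (-(v - v_rest) + i_in) / tau
        if v >= v_thresh:
            spikes.append(t)  # threshold crossed: emit a spike
            v = v_reset       # and reset the membrane potential
    return spikes

# constant drive above and below threshold (hypothetical values)
strong = simulate_lif([2.0] * 100)
weak = simulate_lif([0.5] * 100)
```

With the weak input the potential settles below the threshold and the neuron stays silent; with the strong input it fires regularly, and the firing rate grows with the input level.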


40.3.2 Evolving Spiking Neural Networks 
(eSNN) 


Based on the ECOS principles, an evolving spik- 
ing neural network architecture (eSNN) was proposed 
in [40.26], which was initially designed as a visual pat- 
tern recognition system. The first eSNNs were based on 
Thorpe’s neural model [40.54], in which the importance 
of early spikes (after the onset of a certain stimu- 
lus) is boosted, called rank-order coding and learning. 
Synaptic plasticity is employed by a fast supervised 
one-pass learning algorithm. An exemplar eSNN for
multimodal auditory-visual information processing on
the case study problem of speaker authentication is
shown in Fig. 40.7.

Fig. 40.7 An exemplar eSNN for multimodal auditory-visual information processing in the case study problem of speaker authentication (after [40.56])

Different eSNN models use different architectures.
Figure 40.8 shows a reservoir-based eSNN for
spatio-temporal pattern recognition, where the reservoir
[40.57] uses the spike-timing-dependent plasticity
(STDP) learning rule [40.58], and the output classifier
that classifies spatio-temporal activities of the reservoir
uses the rank-order learning rule [40.54].
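The idea of boosting early spikes in rank-order learning can be sketched as follows. This is an illustrative assumption, not a formula given in this chapter: each synapse receives a weight `mod ** rank`, where `rank` is the order of its first spike, and the same modulation is applied when a pattern is presented. The names `rank_order_weights` and `psp` and the value of `mod` are hypothetical.

```python
def spike_order(spike_times):
    """Indices of the inputs sorted by the time of their first spike."""
    return sorted(range(len(spike_times)), key=lambda j: spike_times[j])

def rank_order_weights(spike_times, mod=0.5):
    """One-pass learning sketch: earlier inputs receive larger weights."""
    weights = [0.0] * len(spike_times)
    for rank, j in enumerate(spike_order(spike_times)):
        weights[j] = mod ** rank
    return weights

def psp(weights, spike_times, mod=0.5):
    """Post-synaptic potential when a pattern is presented: the contribution
    of each input is again modulated by the rank of its spike."""
    return sum(weights[j] * mod ** rank
               for rank, j in enumerate(spike_order(spike_times)))

# train on one hypothetical pattern, then present it and a different order
trained = [5.0, 1.0, 3.0]
w = rank_order_weights(trained)
same = psp(w, trained)
other = psp(w, [1.0, 5.0, 3.0])
```

The response is largest when the inputs fire in the order seen during training, which is what allows a threshold on the potential to act as a pattern detector.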


40.3.3 Extracting Fuzzy Rules from eSNN 


Extracting fuzzy rules from an eSNN would make 
eSNN not only efficient learning models, but also 
knowledge-based models. A method was proposed in 
[40.59] and illustrated in Fig. 40.9a,b. Based on the con- 
nection weights w between the receptive field layer L1 
and the class output neuron layer L2 fuzzy rules are ex- 
tracted. 


40.3.4 eSNN Applications 


Different eSNN models and systems have been devel- 
oped for different applications, such as: 


• eSNN for spatio- and spectro-temporal pattern
  recognition: http://ncs.ethz.ch/projects/evospike
• Dynamic eSNN (deSNN) for moving object recognition [40.60]
• Spike pattern association neuron (SPAN) for generation
  of precise time spike sequences as a response
  to recognized input spiking patterns [40.61]
• Environmental and ecological modeling [40.44]
• EEG data modeling [40.62]
• Neuromorphic SNN hardware [40.63, 64]
• Neurogenetic models (Sect. 40.4).

Fig. 40.9 (a) A simple structure of an eSNN for 2-class
classification based on one input variable using six
receptive fields to convert the input values into spike trains.
(b) The connection weights of the connections to class Ci
and Cj output neurons, respectively, are interpreted as
fuzzy rules: IF (input variable v is SMALL) THEN class
Ci; IF (v is LARGE) THEN class Cj


A review of eSNN methods, systems and their applications
can be found in [40.65].

40.4 Computational Neuro-Genetic Modeling (CNGM)


40.4.1 Principles 


A neuro-genetic model of a neuron was proposed 
in [40.41, 66]. It utilizes information about how some 
proteins and genes affect the spiking activities of a neu- 
ron such as fast excitation, fast inhibition, slow exci- 
tation, and slow inhibition. An important part of the 
model is a dynamic gene/protein regulatory network 
(GRN) model of the dynamic interactions between 
genes/proteins over time that affect the spiking activity
of the neuron (Fig. 40.10).

A CNGM is a dynamical model that has two dy- 
namical sub-models: 


• GRN, which models dynamical interaction between
  genes/proteins at a time scale T1
• eSNN, which models dynamical interaction between
  spiking neurons at a time scale T2.


The two sub-models interact over time. 


40.4.2 The NeuroCube Framework 


A further development of the eSNN and the CNGM 
was achieved with the introduction of the NeuroCube 
framework [40.67]. The main idea is to support the cre- 
ation of multi-modular integrated systems, where dif- 
ferent modules, consisting of different neuronal types 
and genetic parameters correspond in a way to dif- 
ferent parts of the brain and different functions (e.g., 
vision, sensory information processing, sound recog- 
nition, motor-control) and the whole system works in 
an integrated mode for brain signal pattern recognition. 
A concrete model built with the use of the NeuroCube
would have a specific structure and a set of algorithms
depending on the problem and the application conditions,
e.g., classification of EEG, recognition of functional
magnetic resonance imaging (fMRI) data, brain
computer interfaces, emotional cognitive robotics, and
modeling Alzheimer’s disease.

Fig. 40.10 A schematic diagram of a computational neuro-genetic
modeling (CNGM) framework consisting of a gene/protein regulatory
network (GRN) as part of an eSNN (after [40.41])

A block diagram of the NeuroCube framework is 
shown in Fig. 40.11. It consists of the following mod- 
ules: 


• An input information encoding module
• A NeuroCube module
• An output module
• A gene regulatory network (GRN) module.


The main principles of the NeuroCube framework are:


1. NeuroCube is a framework to model brain data (and 
not a brain model or a brain map). 

2. NeuroCube is a selective, approximate map of the
brain regions relevant to the brain data, along with
relevant genetic information, into a 3-D spiking
neuronal structure.

3. An initial NeuroCube structure can include known
connections between different areas of the brain.

4. There are two types of data used both for training
a particular NeuroCube and for recalling it on new data:
(a) data measuring the activity of the brain when
certain stimuli are presented, e.g., EEG, fMRI; (b)
direct stimuli data, e.g., sound, spoken language,
video data, tactile data, odor data, etc.

5. A NeuroCube architecture consists of a NeuroCube
module, GRNs at the lowest level, and
a higher-level evaluation (classification) module.

6. Different types of neurons and learning rules can be 
used in different areas of the architecture. 

7. Memory of the system is represented as a combi- 
nation of: (a) short-term memory, represented as 
changes of the neuronal membranes and temporary 
changes of synaptic efficacy; (b) long-term memory, 
represented as a stable establishment of synaptic ef- 
ficacy; (c) genetic memory, represented as a change 
in the genetic code and the gene/protein expression 
level as a result of the above short-term and long- 
term memory changes and evolutionary processes. 

8. Parameters in the NeuroCube are defined by 
genes/proteins that form dynamic GRN models. 

9. NeuroCube can potentially capture in its internal 
representation both spatial and temporal character- 
istics from multimodal brain data. 

10. The structure and the functionality of a NeuroCube
architecture evolve in time from incoming data.

Fig. 40.11 The NeuroCube framework (after [40.67])


40.4.3 Quantum-Inspired Optimization of eSNN and CNGM


A CNGM has a large number of parameters that
need to be optimized for efficient performance.
Quantum-inspired optimization methods are suitable
for this purpose, as they can deal with a large number
of variables and are reported to converge much faster
than other optimization algorithms [40.68].
Quantum-inspired eSNN (QeSNN) use the principle of
superposition of states to represent and optimize features
(input variables) and parameters of the eSNN,
including genes in a GRN [40.44]. They are optimized
through a quantum-inspired genetic algorithm [40.44]
or a quantum-inspired particle swarm optimization
algorithm [40.69]. Features are represented as qubits in
a superposition of 1 (selected), with a probability α, and
0 (not selected), with a probability β. When the model
has to be calculated, the quantum bits collapse into 1 or 0.
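The probabilistic feature representation can be illustrated with a short sketch. It follows the description above in treating α as the probability that a feature is selected (so β = 1 - α); the function names and the simple update rule that shifts α toward the best mask found so far are assumptions of this example, not the algorithms of [40.44] or [40.69].

```python
import random

def collapse_features(alphas, rng=None):
    """Collapse each feature 'qubit' to 1 (selected) with probability alpha,
    otherwise to 0 (not selected)."""
    rng = rng or random.Random()
    return [1 if rng.random() < a else 0 for a in alphas]

def nudge(alphas, best_mask, step=0.1):
    """Assumed update rule: move each alpha toward the best mask seen so far,
    keeping it inside [0, 1]."""
    return [min(1.0, max(0.0, a + (step if b else -step)))
            for a, b in zip(alphas, best_mask)]

# three hypothetical features, initially undecided
alphas = [0.5, 0.5, 0.5]
mask = collapse_features(alphas, random.Random(0))  # one candidate feature subset
alphas = nudge(alphas, best_mask=[1, 0, 1])         # pretend [1, 0, 1] scored best
```

In an optimization loop, each collapsed mask would be used to build and evaluate a model, and the probabilities would be updated from the best-performing masks so that the search gradually concentrates on useful features.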


40.4.4 Applications of CNGM 


Various applications of CNGM have been developed 
such as: 


• Modeling brain diseases [40.41, 70]
• EEG and fMRI spatio-temporal pattern recognition [40.67].


40.5 Conclusions and Further Directions 


This chapter presented a brief overview of the main
principles of a class of neural networks called evolving
connectionist systems (ECOS) along with their
applications for computational intelligence. ECOS
facilitate fast and accurate learning from data and new
knowledge discovery across application areas. They
integrate principles from neural networks, fuzzy systems,
evolutionary computation, and quantum computing.
The future directions and applications of ECOS
are foreseen as a further integration of principles
from information science, bioinformatics, and
neuroinformatics [40.71].

41.1 Motivation


Based on the definition provided by the IEEE Compu- 
tational Intelligence Society, computational intelligence 
(CI) covers biologically and linguistically motivated 
computational paradigms. Its scope broadly overlaps 
with that of soft computing (SC), a similar concept also 
conceived in the 1990s. The original definition of soft 
computing was [41.1]: 


An association of computing methodologies that 
includes as its principal members fuzzy logic 
(FL), neuro-computing (NC), evolutionary comput- 
ing (EC) and probabilistic computing (PC). 


Thus, in its original scope, CI excluded probabilis- 
tic reasoning systems, while including other nature- 
inspired methodologies, such as swarm computing, ant 
colony optimization, etc. More recently, however, CI 
has extended its scope to include statistically inspired 
machine-learning techniques. Throughout this review,
we will adopt this less restrictive definition of CI
techniques, which increases its overlap with SC even
more [41.2,3]. Readers interested in the historical
origins of the CI concept should consult [41.4-6].

In addressing real-world problems, we usually deal 
with physical systems that are difficult to model and 
possess large solution spaces. In these situations, we 
leverage two types of resources: domain knowledge 
of the process or product and field data that charac- 
terize the system’s behavior. The relevant engineering 
knowledge tends to be a combination of first princi- 
ples and empirical knowledge. Usually, it is captured in 
physics-based models, which tend to be more precise 
than data-driven models, but more difficult to con- 
struct and maintain. The available data are typically 
a collection of input-output measurements, represent- 
ing instances of the system’s behavior. Usually, data 
tend to be incomplete and noisy. Therefore, we often 
augment knowledge-driven models by integrating them 
with approximate solutions derived from CI methodolo- 
gies, which are robust to this type of imperfect data. CI 
is a flexible framework that offers a broad spectrum of 
design choices to perform such integration. 

Domain knowledge can be integrated within CI 
models in a variety of ways. Arguably, the simplest in- 
tegration is the use of physics-based models (derived 
from domain knowledge) to predict expected values of 
variables of interest. By contrasting the expected values
with the actual measured values, we compute the
residuals for the same variables and use CI-based models
to explain the differences. Domain knowledge can
also be used to design CI-based models: it can influence
the selection of the features (functions of raw data) that
are the inputs to the CI models; it can suggest certain
topologies for graphical models (e.g., NN architectures)
to approximate known functional dependences; it can
be represented by linguistic fuzzy terms and relationships
to provide coarse approximations; it can be used
to define data structures of individuals in the population
of an evolutionary algorithm (EA); and it can be used
explicitly in metaheuristics (MHs) that leverage such
knowledge to focus their search in a more efficient way.
For a more detailed discussion of the use of domain 
knowledge in EAs, see [41.7]. 
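The residual-based integration mentioned above can be illustrated with a toy example. Everything in this sketch is hypothetical: `physics_model` stands in for a first-principles model, the measurements are invented, and an ordinary least-squares line is used in place of a CI-based model purely to keep the example short.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def physics_model(x):
    """Hypothetical physics-based prediction of the variable of interest."""
    return 2.0 * x

# invented field data that deviate systematically from the physics model
xs = [0.0, 1.0, 2.0, 3.0]
measured = [0.5, 2.6, 4.7, 6.8]

# residuals: what the physics-based model fails to explain
residuals = [y - physics_model(x) for x, y in zip(xs, measured)]

# data-driven model of the residuals (a linear stand-in for a CI model)
slope, intercept = fit_line(xs, residuals)

def hybrid_model(x):
    """Physics-based prediction corrected by the learned residual model."""
    return physics_model(x) + slope * x + intercept
```

The data-driven part only has to capture the gap between the physics-based prediction and the measurements, which is usually a simpler function than the full input-output relationship.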

Computational intelligence started in the 1990s with
three pillars: neural networks (NNs), to create functional
approximations from input-output training sets;
fuzzy systems, to represent imprecise knowledge and
perform approximate deductions with it; and evolutionary
systems, to create efficient global search methods
based on optimization through adaptation. Over the
last decade, the individual developments of these pil- 
lars have become intertwined, leading to successful 
hybridizations. 


41.1.1 Building Computational Intelligence 
Object- and Meta-Models 


Recently, as described in [41.21], this hybridization has 
been structured as a three-layer approach, in which each 
layer has a specific purpose: 


• Layer 1: Offline MHs. They are used in batch mode,
  during the model creation phase, to design, tune,
  and optimize run-time model architectures for deployment.
  Then they are used to adapt and
  maintain them over time. Examples of offline MHs
  are global search methods, such as EAs, scatter
  search, tabu search, swarm optimization, etc.

• Layer 2: Online MHs. They are part of the run-time
  model architecture, and they are designed
  by offline MHs. The online MHs are used to
  integrate/interpolate among multiple (local) object-models,
  manage their complexity, and improve their
  overall performance. Examples of online MHs are
  fuzzy supervisory systems, fusion modules, etc.

• Layer 3: Object-level Models. They are also part
  of the run-time architecture, and they are designed
  by offline MHs to solve object-level problems.
  For simpler cases, we use single object-level
  models that provide an individual SC functionality
  (functional approximation, optimization,
  or reasoning with imperfect data). For complex
  cases, we use multiple object-level models
  in parallel configuration (ensemble) or sequential
  configuration (cascade, loop), to integrate functional
  approximation with optimization and reasoning
  with imperfect data (imprecise and uncertain).


The underlying idea is to reduce or eliminate man- 
ual intervention in any of these layers, while leveraging 
CI capabilities at every level. We can manage complex- 
ity by finding the best model architecture to support 
problem decomposition, create high-performance lo- 
cal models with limited competence regions, allow 
for smooth interpolations among them, and promote 
robustness to imperfect data by aggregating diverse
models. Let us examine some case studies that further
illustrate this concept.


Examples of Offline MH, Online MH, and Object Models

In Table 41.1, we observe a variety of CI applications in
which we followed the separation between object- and
meta-level described earlier. In most of these applications,
the object-level models were based on different
technologies, such as machine learning (support vector
machines, random forest), statistics (multivariate adaptive
regression splines (MARS), Hotelling’s T²), neural
networks (feedforward, self-organizing maps), fuzzy
systems, EAs, and case-based reasoning. The online metaheuristics
were mostly based on fuzzy aggregation (interpolation)
of complementary local models or fusion of competing
models. The offline MHs were mostly implemented
by evolutionary search in the model design space.
Descriptions of these applications can be found in the
references listed in the last column of Table 41.1. The
five case studies covered in this review are indicated in
the first column of Table 41.1.


Table 41.1 Examples of CI applications at meta-level and object-level

Case study | Problem instance | Problem type | Model design (offline MHs) | Model controller (online MHs) | Object-level models | References
- | Anomaly detection (system) | Classification | Model T-norm tuning | Fuzzy aggregation | Multiple models: SVM, NN, case-based, MARS | [41.8]
- | Anomaly detection (system) | Classification | Manual design | Fusion | Multiple models: Kolmogorov complexity, SOM, random forest, Hotelling T², AANN | [41.9]
#1 | Anomaly detection (model) | Classification and regression | EA-based tuning of fuzzy supervisory termset | Fuzzy supervisory | Multiple models: ensemble of AANNs | [41.10]
#2 | Best units selection | Ranking | EA-based tuning of similarity function | None | Single model: fuzzy instance-based models (lazy learning) | [41.11, 12]
#3 | Insurance underwriting: risk management | Classification | EA | Fusion | Multiple models: NN, fuzzy, MARS | [41.13, 14]
#4 | Mortgage collateral valuation | Regression | Manual design | Fusion | Multiple models: ANFIS, fuzzy CBR, RBF | [41.15]
#5 | Portfolio rebalancing | Multiobjective optimization | Seq. LP | None | Single model: MOEA (SPEA) | [41.16]
- | Load, HR, NOx forecast | Regression | Multiple CART trees | Fusion | Multiple models: ensemble of NNs | [41.17]
- | Aircraft engine fault recovery | Control/fault accommodation | EA tuning of linear control gains | Crisp supervisory | Multiple models (loop): SVM + linear control | [41.18]
- | Power plant optimization | Optimization | Manual design | Fusion | Multiple models (loop): MOEA + NNs | [41.19]
- | Flexible manufacturing optimization | Optimization | Manual design | Fuzzy supervisory | Single model: genetic algorithms | [41.20]


41.1.2 Model Lifecycle 


In real-world applications, before using a model in 
a production environment we must address the model’s 
complete life cycle, from its design and implementa- 
tion to its validation, tuning, production testing, use, 
monitoring, and maintenance. By maintenance, we refer
to all the steps required to keep the model vital (e.g.,
not obsolete) and to adapt it to changes in the environment
in which it is deployed. Many reasons justify
this focus on model maintenance. Over the model’s life
cycle, maintenance costs are by far the most expensive
ones (just as software maintenance costs are the most
expensive ones in the life of a software system).
Furthermore, when dealing with mission-critical software,
we need to guarantee continuous operation or at least
fast recovery from system failures or model obsolescence
to avoid lost revenues and other business costs.
The use of MHs in the design stage allows us to cre- 
ate a process for automating the model building phase 
and subsequent model updates. This is a critical step to 
quickly deploy and maintain CI models in the field, and 
it will be further described in the case studies. Addi- 
tional information on this topic can be found in [41.22]. 


41.2 Machine Learning (ML) Functions 


Machine learning techniques can be roughly subdivided
into supervised, semisupervised, reinforcement,
and unsupervised learning. The distinction among these
categories depends on whether ground truth (i.e., the
correct output for each input vector) is completely
available, partially available, or unavailable during the
training phase.

Unsupervised learning techniques are used when 
no ground truth is available. Their goal is to iden- 
tify structures in the input space that could be used 
to decompose the problem and facilitate local model 
building. Typical examples of unsupervised learning are 
cluster analysis, self-organizing maps (SOMs) [41.23], 
and dimension reduction techniques, such as principal 
components analysis (PCA), independent components 
analysis (ICA), multidimensional scaling (MDS), etc. 

Reinforcement learning (RL) does not rely on 
ground truth. It assumes that an agent operates in an 
environment and after performing one or more actions 
it receives a reward that is a consequence of its actions, 
rather than an explicit expression of ground truth. Sut- 
ton and Barto [41.24] were among the first proponents 
of this technique, which is quite promising to model 
adversarial situations, but it has not generated many 
industrial or commercial applications. A succinct de- 
scription of RL can be found in [41.25]. 

Semisupervised and supervised learning techniques
are used when partial or complete ground truth is available,
such as labels for classification problems and
real values for regression problems. There are many
traditional linear models for classification and regression.
For instance, we have linear discriminant analysis
(LDA) and logistic regression (LR) for classification,
and least-squares techniques, combined with feature
subset selection or shrinkage methods (e.g., Ridge, least
absolute shrinkage and selection operator (LASSO)),
for regression. CI techniques usually generate nonlinear
solutions to these problems. We can group most
of the commonly used nonlinear techniques as follows:


• Directed graphical models, such as neural
  networks, neural fuzzy systems, Bayesian belief
  networks, Bayesian neural networks, etc.

• Tree based: Classification and regression
  trees (CART) [41.26], ID3/C4.5 [41.27], etc.

• Grammar based: Genetic programming, evolutionary
  programming, etc.

• Similarity and metric learning: Lazy learning
  (or instance-based learning) [41.28, 29], case-based
  reasoning, (fuzzy) k-means, etc.

• Undirected graphical models: Markov graphs,
  restricted Boltzmann machines [41.30], etc.


The reader can find a comprehensive treatment of 
these techniques in [41.31]. Some of these models are
used as part of an ensemble, rather than individually.
Such is the case of random forest [41.32], which is 
based on a collection of CART trees [41.26]. Similarly, 
a fuzzy extension of random forest using fuzzy deci- 
sion trees [41.33] can be found in [41.34]. This trend 
toward the use of ensembles is covered in Sect. 41.5. 
We will now focus on a subset of the CI/ML applications.
Specifically, we will analyze two case studies in
industrial applications (Sect. 41.3) and three in financial
applications (Sect. 41.4).

41.3 CI/ML Applications in Industrial Domains:
Prognostics and Health Management (PHM)


To provide a coherent theme for the ML industrial
applications, we will focus on prognostics and
health management (PHM). The main goal of PHM
for assets such as locomotives, medical scanners, and
aircraft engines is to maintain these assets’ operational
performance over time, improving their utilization
while minimizing their maintenance cost. This
tradeoff is critical for the proper execution of contractual
service agreements (CSAs) offered by original
equipment manufacturers (OEMs) to their valued
customers.

Fig. 41.1 PHM functional architecture

PHM is a multidisciplinary field, as it includes aspects
of electrical engineering (reliability, design, service),
computer and decision sciences (artificial intelligence,
CI, ML, statistics, operations research (OR)), mechanical
engineering (geometric models for fault propagation),
material sciences, etc. Within this paper, we will
focus on the role that CI plays in PHM functionalities.
PHM can be divided into two main components:


• Health assessment: the evaluation and interpretation
of the asset’s current and future health state.

• Health management: the control, operation, and logistic
plans to be implemented in response to such
assessment.


The PHM functional architecture is illustrated in
Fig. 41.1, adapted from [41.35].
The first two tasks:


(1) Remote monitoring, and
(2) Input data preprocessing,

are platform dependent, as they need domain knowledge
to identify and select the most informative inputs,
perform data curation (de-noising, imputation, and
normalization), aggregate them, and prepare them to be
suitable inputs for the models.
The remaining decisional tasks could be considered
platform independent (to the extent that their functions
could be accomplished by data-driven models
alone). These tasks are:

(3) Anomaly detection and identification 

(4) Anomaly resolution 

(5) Diagnostics 

(6) Prognostics 

(7) Fault accommodation 

(8) Logistics decisions. 


Health assessment, which is based on descriptive 
and predictive analytics, is contained in the left block 
of Fig. 41.1 (annotated with P). Health management 
(HM), which is based on prescriptive analytics, is con- 
tained in the right block of Fig. 41.1 (annotated with 
HM). In the remainder of this section, we will cover
two case studies related to anomaly identification and 
prognostics. 


41.3.1 Health Assessment 
and Anomaly Detection: 
An Unsupervised Learning Problem 


Anomaly Detection (AD) 

Using platform-deployed sensors, we collect data re- 
motely. We preprocess it, via segmentation, filtering, 
imputation, validation, and we summarize it by ex- 
tracting feature subsets that provide a more succinct, 
robust representation of its information content. These 
features, which could contain a combination of cat- 
egorical and numerical values, are analyzed by an 
anomaly detection model to assess the degree of abnor- 
mal behavior of each asset in the fleet. If the degree 
of abnormality exceeds a given threshold, the model 
will identify the asset, determine the time when the 
anomaly was first noticed and suggest possible causes 
of the anomaly (usually a coarse identification at the 
system/subsystem level). Anomaly detection usually
leverages unsupervised learning techniques, such as 
clustering. Its goal is to extract the underlying structural 
information from the data, define normal structures and 
regions, and identify departures from such regions. 
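
The thresholding step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method used in the case studies: the "normal region" is summarized by the centroid of known-normal feature vectors, the degree of abnormality is taken to be the Euclidean distance from that centroid, and the names `centroid`, `abnormality`, `detect`, and the threshold value are all hypothetical.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def abnormality(x, center):
    """Assumed abnormality score: Euclidean distance from the normal center."""
    return math.dist(x, center)

def detect(fleet, center, threshold):
    """Return the ids of assets whose abnormality exceeds the threshold."""
    return [asset for asset, x in fleet.items()
            if abnormality(x, center) > threshold]

# Hypothetical numerical features extracted from known-normal operation.
normal = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]]
center = centroid(normal)

# Hypothetical current feature vectors for two assets in the fleet.
fleet = {"unit_a": [1.1, 2.1], "unit_b": [4.0, 6.0]}
flagged = detect(fleet, center, threshold=1.0)
```

A clustering algorithm would typically produce several such normal regions; the single centroid here only keeps the sketch short.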


Anomaly Identification (AI)
After detecting an abnormal change, e.g., a departure
from a normal region of the state space, we need to
identify its cause. There are many factors that could
cause such a change:


(a) A system fault, which could eventually lead to a fail- 
ure. 

(b) A sensor fault, which is creating incorrect measure- 
ments. 

(c) An inadequate anomaly detection model, which is 
falsely reporting anomalies due to poor design, in- 
adequate model update, execution outside its region 
of competence, etc. 

(d) A sudden, unexpected operational transient, which
is stressing the system by creating an abrupt load
change. This transient could originate from an operator
error, in which the operator requests such a sudden
change; from an incorrect reference vector (in the case
of operation automation), which also requests such an
abrupt change; or from a poorly designed controller,
which is either over- or under-compensating for
a perceived state change.


The first factor (system fault) represents a correct
anomaly classification and should trigger the rest of
the workflow (diagnostics, prognostics, fault accommodation,
and maintenance optimization), while the
other three factors generate false alarms (false positives).
In the next case study, we will focus on how to
improve the accuracy of an anomaly detection model
(third factor in the list) and decrease the probability of
causing false positives. This increase in model fidelity
will also create a sharper distinction between system
faults and sensor faults (first and second factors in the
list).


Anomaly Detection for Aircraft Engines 
Problem Definition. As noted in Sect. 41.1, one
of the best ways to leverage domain knowledge is
to create expected values using highly tuned physics-based
simulators, compare them with actual values, and
analyze the differences (residuals) using data-driven
models.


Physics-Based Simulator. In this case study, we fo- 
cused on the detection of anomalies in a simulated 
aircraft engine. A component level model (CLM), 
a thermodynamic model that has been widely used to 
simulate the performance of an aircraft engine, provided
the physics-based model. Flight conditions, such
as altitude, Mach number, ambient temperature, and engine
fan speed, and a large variety of model parameters,
such as module efficiency and flow capacity, are inputs
to the CLM. The outputs of the CLM are the values for
pressures, core speed, and temperatures at various locations
of the engine, which simulate sensor measurements.
Realistic values of sensor noise can be added after the 
CLM calculation. In this study, we used a steady state 
CLM model for a commercial, high-bypass, twin spool, 
turbofan engine. 


Actual Values. We used engine data collected under 
cruise conditions to monitor engine health changes. 


Data-Driven Model. We realized that a single, global 
model — regardless of the technology used to implement 
it — would be inadequate for large operating spaces of 
the simulated engine. Global models are designed to 
achieve a compromise among completeness (for cover- 
age), high fidelity (for accuracy), and transparency (for 
maintainability). As a result, we usually end up with 
models that, in order to maintain small biases, exhibit
large variability. This variability might be too large to 
distinguish between model error and anomalous system 
behavior and can be a significant factor in the genera- 
tion of false alarms. 


CI-Based Approach. To solve the model fidelity problem,
we decomposed the engine’s operating space into
several, partially overlapping regions and developed
a set of local models, each trained on one region. This
schema required a supervisory model (or meta model)
to determine the competence region of each local model
and select the appropriate one. In control problems,
this supervisory module typically selects one controller
(out of a collection of low-level controllers) to close
the loop with the dynamic system. In many fuzzy
controller applications [41.36, 37], a fuzzy supervisory
module determines the applicability degree of each
low-level controller and interpolates their outputs. Usually,
this is done with a weighted, convex sum of the controllers’
outputs. The weights used in the convex sum
are the applicability degrees of the low-level controllers
in the part of the state space that contains the input. The
transition from mode selection to mode melding [41.38]
generates a smoother response surface by avoiding
discontinuities.
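
The weighted, convex sum mentioned above can be illustrated with a short Python sketch. It assumes scalar outputs and applicability degrees that have already been computed by some supervisory module; the function name `melded_output` and the numbers are hypothetical and are not taken from [41.36-38].

```python
def melded_output(outputs, applicability):
    """Interpolate local outputs with a weighted, convex sum.

    outputs       -- one output per low-level controller (or local model)
    applicability -- nonnegative applicability degree of each one at the input
    """
    total = sum(applicability)
    if total <= 0:
        raise ValueError("no local model is applicable to this input")
    # Normalizing makes the weights nonnegative and sum to one (convex sum).
    weights = [d / total for d in applicability]
    return sum(w * y for w, y in zip(weights, outputs))

# Two hypothetical local models; the first is three times as applicable.
blended = melded_output([10.0, 20.0], [0.75, 0.25])

# Mode selection is the special case in which one degree is 1 and the rest 0.
selected = melded_output([10.0, 20.0], [1.0, 0.0])
```

Because the weights vary continuously with the input, the blended output changes smoothly as the operating point moves from one region to the next, which is the point of moving from selection to interpolation.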

We applied the same concept to the problem of improving
the fidelity of data-driven models for anomaly
detection. First, we decided to use auto-associative NNs
(AANNs) to implement the local models. Then, we developed
a fuzzy supervisory controller, defining the
applicability of each AANN as a fuzzy region in the
engine’s three-dimensional operating space, defined by
altitude, Mach number, and ambient temperature. Finally,
we used an evolutionary algorithm to tune the
term set of the fuzzy supervisory controller and find the
best fuzzy boundaries to interpolate between AANNs with
overlapping applicability.


Local Models. Auto-associative neural networks
(AANNs) are feedforward neural networks with a structure
that satisfies the requirements for performing restricted
auto-association. The inputs to the AANN go through
a dimensionality reduction, as their information is combined
and compressed in intermediate layers. For example,
in Fig. 41.2 the nine nodes in the input layer are
reduced to five and then three, in the second layer (encoding)
and third layer (bottleneck), respectively. Then,
the nodes in the third layer are used to recreate the original
inputs, by going through a dimensionality expansion
(fourth layer, decoding, and fifth layer, outputs). In the
ideal case, the AANN outputs should be identical to the
inputs. Their differences (residuals) and their gradient
information are used to train the AANN to minimize
such differences.
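
To make the layer structure concrete, the following Python sketch runs a forward pass through an untrained 9-5-3-5-9 network and computes the residuals between inputs and outputs. It is only an illustration: the random initial weights, the choice of tanh units in the encoding and decoding layers with linear bottleneck and output layers, and the names `init_weights`, `forward`, and `residuals` are assumptions, and the gradient-based training that would minimize the residuals is omitted.

```python
import math
import random

LAYERS = [9, 5, 3, 5, 9]  # input, encoding, bottleneck, decoding, output

def init_weights(sizes, seed=0):
    """One weight matrix per layer transition; last entry of each row is a bias."""
    rng = random.Random(seed)
    return [[[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
             for _ in range(n_out)]
            for n_in, n_out in zip(sizes, sizes[1:])]

def forward(x, weights):
    """Return the activations of every layer, starting with the input."""
    activations = [list(x)]
    for k, layer in enumerate(weights):
        z = [row[-1] + sum(w * a for w, a in zip(row, activations[-1]))
             for row in layer]
        # Assumed activations: tanh for encoding (k=0) and decoding (k=2),
        # linear for bottleneck (k=1) and output (k=3).
        activations.append([math.tanh(v) for v in z] if k in (0, 2) else z)
    return activations

def residuals(x, weights):
    """Difference between each input and its reconstruction."""
    y = forward(x, weights)[-1]
    return [xi - yi for xi, yi in zip(x, y)]

weights = init_weights(LAYERS)
sample = [0.1 * i for i in range(9)]          # hypothetical sensor vector
layer_sizes = [len(a) for a in forward(sample, weights)]
res = residuals(sample, weights)
```

After training on no-fault data, large residuals on a new sample would indicate a departure from the learned normal behavior.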

This network computes the largest nonlinear principal
components (via nonlinear principal component analysis,
NLPCA), which correspond to the nodes in the intermediate
layer, to identify and remove correlations among
variables. Besides the generation of residuals, this type
of network can also be used in dimensionality reduction,
visualization, and exploratory data analysis. As
noted in [41.39]:


While (principal component analysis) PCA identifies
only linear correlations between variables,
NLPCA uncovers both linear and nonlinear correlations,
without restriction on the character of the
nonlinearities present in the data.

Fig. 41.2 Architecture of a 9-5-3-5-9 auto-associative neural network

NLPCA operates by training a feedforward neural network
to perform the identity mapping, where the network
inputs are reproduced at the output layer. The
network contains an internal bottleneck layer (containing
fewer nodes than input or output layers), which
forces the network to develop a compact representation
of the input data, and two additional hidden layers.
Additional information about AANNs can be found
in [41.39-41].

The complete CI approach is illustrated in Fig. 41.3,
adapted from [41.10]. The left part of the figure shows
the run-time anomaly detection (AD) model. The center
part of Fig. 41.3 shows an instance of the term
set used by the fuzzy supervisory system (the scale
of the operational state variables was normalized as
a percentage of the range of values to preserve proprietary
information). In the right part of Fig. 41.3, we
can see the evolutionary algorithm (EA) in a wrapper
configuration, used to tune the membership functions
(term sets). Each individual in the EA population is
a set of parameters that represents an implementable
term set configuration. For altitude, we varied two parameters.
For each of ambient temperature and Mach
number, we varied four parameters, with a total of ten
search parameters. Each individual was a set of ten
parameters that created a corresponding set of membership
functions that controlled residuals behavior
of the fuzzy supervisory model. The fitness of each
individual was computed based on the aggregate of
the nine sensor residuals, with a goal toward maximizing
fitness or minimizing overall residuals. The
EA used was based on the genetic algorithm optimization
toolbox (GAOT) toolkit. The population size
was set at 500, and the generation count was set at
1000. The EA execution was very efficient, taking only
about 2 h of execution time on a standard desktop
machine.

Results. As a result of this experiment, we were
able to drastically reduce the residuals generated under
steady state, no-fault assumption, and we improved
the fidelity of the local model ensemble by more than
a factor of four with respect to a reference global
data-driven model. This fidelity allowed us to create
a sharper baseline used to identify true engine
anomalies, distinguishing them from sensor anomalies.
For a more detailed description of these results
see [41.10].

Fig. 41.3 Evolutionary algorithms tune the term sets of the fuzzy supervisory system to interpolate the outputs of an ensemble
of local auto-associative NNs


41.3.2 Health Assessment — Diagnostics: 
A Semisupervised 
and Supervised Learning Problem 


The information generated by the anomaly identifica- 
tion model allows a diagnostic module to focus on 
a given unit’s subsystem, analyzing key variables as- 
sociated with the subsystem, and trying to match their 
patterns with a library of signatures associated with 
faults or incipient failure modes. The result is a ranked 
list of possible faults. Therefore diagnostics is a classi- 
fication problem, mapping a feature space into a labeled 
fault space. 

Usually, data-driven diagnostics leverages supervised
learning techniques to extract potential signatures
from the historical data and use them to recognize different
failure modes automatically. A large variety of
statistical and AI-based techniques can be used for automatic
fault diagnostics, including neural networks,
decision trees, random forests, Bayesian belief networks,
case-based reasoning, hidden Markov models, support
vector machines, fuzzy logic, etc. These data-driven
diagnostics methods are able to learn the fault signatures
or patterns from the training data and associate them
with different failure modes when new data arrives.

Data-driven approaches have many benefits. First,
they can be designed to be independent of domain
knowledge related to a particular system. We could use
this approach with data recorded for almost any component/system,
as long as the recorded data is relevant to
the health condition of the component of interest. This
reduces the effort involved with eliciting and incorporating
domain-specific knowledge. A second benefit is
the use of fusion techniques to take advantage of diverse
information from multiple data sources/models to boost
diagnostics performance. The third benefit is the robustness
to noise exhibited by most data-driven methods,
such as fuzzy logic and neural networks. However, all
supervised data-driven techniques require the availability
of labeled historical data, so this data collection step must
precede the application of these methods.

Domain knowledge, when available, can still be 
leveraged to initialize the structures of the data-driven 
models (feature selection, network topology, etc.) and 
provide better initial conditions for optimization and 
tuning techniques applied to the data-driven diagnostics 
models. 

Supervised learning is a very mature topic in ML.
As a result, there are many diagnostics applications of
CI techniques to medical [41.42-45], industrial [41.46,
47], automotive [41.48], and other domains. Given its
widespread use, we will not provide additional case
studies for diagnostics.


41.3.3 Health Assessment – Prognostics:
A Regression Problem


Prognostics is the prediction of remaining useful life 
(RUL), when the anomaly detection and diagnostics 
modules can identify and isolate an incipient fail- 
ure through its preceding faults. This incipient failure 
changes the graph of RUL versus time from a linear, 
normal-wear trajectory to an exponentially decaying 
one. The fault time and incipient failure mode deter- 
mine the inflection point in such curve and the dete- 
rioration steepness, respectively. These estimates are 
usually in units of time or utilization cycles, and have 
an associated uncertainty, e.g., a probability density 
curve around the actual estimate. Typically, this uncer- 
tainty (e.g., RUL confidence interval) increases as the 
prediction horizon is extended. Operators can choose 
a confidence level that allows them to incorporate a risk 
level into their decision making. They can change oper- 
ational characteristics, such as load, which may prolong 
the life of components at risk. They can also account 
for upcoming maintenance and set in motion a logistics
process to support a smooth transition from faulted
equipment to fully functioning equipment.

Predicting RUL is not trivial, because RUL depends
on the current deterioration state and on future usage, such
as unit load and speed, among others. Prognostics is
closely linked with diagnostics. In the absence of any
evidence of damage or faulted condition, prognostics
reverts to statistical estimation of fleet-wide life. It is
common to employ prognostics in the presence of an
indication of abnormal wear, faults, or other abnormal
situations. Therefore, it is critical to provide accurate and
quick diagnostics to allow prognostics to operate. At the
heart of prognostics is the ability to properly model the
accumulation and propagation of damage. A common
approach to prognostics is to employ a model of damage
propagation contingent on future use. Such models
are often based on detailed materials knowledge
and make use of finite element modeling. This requires
an in-depth understanding of the local conditions to which
the particular component is exposed.

For example, for spall propagation in bearings, we
need to know the local load, speed, and temperature
conditions at the site of the damage, e.g., at the outer
race (or ball or cage). In addition, we need to know
the geometry and local material properties at the suspected
damage site. This information is used to derive
the stresses that components are expected to experience,
typically using a finite element approach. The potential
benefit of this process is the promise of accurate prediction
of when the bearing will fail. For a different fault
mode, the process needs to be repeated. Because of the
cost and effort involved, this method is reserved for a set
of components that, if left undetected and without remaining
life information, might experience catastrophic
failure that transcends the entire system and causes
system failure. However, there is a large set of components
that will not benefit from this approach, either
because a physics-based damage model is not achievable
or because it is too costly to develop. Therefore, it is desirable
to increase coverage of prognostics for a range of fault
modes. To this end, the techniques would ideally utilize
existing models and sensor data.

ML provides us with an alternative approach, which 
is based on analyzing time series data where the equip- 
ment behavior has been monitored via sensor mea- 
surements during the normal operation until equipment 
failure. When a reasonable set of these observations 
exists, ML algorithms can be employed to recognize 
these trends and predict remaining life (albeit often
under the assumption of near-constant future
load conditions). Usually, specific faults have preferred
directions in the health related feature space. By extrap- 
olating the propagation in this parameter space and by 
mapping the extrapolation into the time domain, we can 
derive RUL information. 
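
The extrapolation idea can be sketched as follows, under strong simplifying assumptions: a single scalar health index that degrades roughly linearly with time, a known failure level, and near-constant future load. The least-squares line fit and the names `fit_line` and `estimate_rul` are illustrative choices, not the technique of a specific case study.

```python
def fit_line(ts, hs):
    """Ordinary least-squares fit h = slope * t + intercept."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_h = sum(hs) / n
    slope = (sum((t - mean_t) * (h - mean_h) for t, h in zip(ts, hs))
             / sum((t - mean_t) ** 2 for t in ts))
    return slope, mean_h - slope * mean_t

def estimate_rul(ts, hs, failure_level):
    """Extrapolate the fitted trend to the failure level and map it to time.

    Returns None when no degradation trend is visible (slope >= 0).
    """
    slope, intercept = fit_line(ts, hs)
    if slope >= 0:
        return None
    t_fail = (failure_level - intercept) / slope
    return max(0.0, t_fail - ts[-1])

# Hypothetical health index sampled at four utilization cycles.
cycles = [0.0, 1.0, 2.0, 3.0]
health = [1.0, 0.9, 0.8, 0.7]
rul = estimate_rul(cycles, health, failure_level=0.2)  # about 5 cycles
```

A point estimate like this carries no confidence interval; the following paragraphs explain why the width of that interval is the real difficulty.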

A prerequisite to leverage RUL estimation is to 
have a narrow confidence interval, so that this informa- 
tion is actionable and can be used in the asset health 
management part of PHM as a time horizon to opti- 
mize the logistics/maintenance scheduling plan. In most 
cases, however, we do not have run-to-failure data in 
the time series. Usually, when a failure is identified it is 
corrected promptly, causing the time series to be statis- 
tically censored on the right. The lack of run-to-failure 
data further compounds the technical difficulty of pre- 
dicting RUL with a small variance. 

We consider two options to address this problem.
The first option is to use an ensemble of diverse
predictive models (see Sect. 41.5.3 for a definition of
diversity) such that the fusion of the ensemble will reduce
the variance and make the output more actionable;
Sect. 41.5 is devoted to this topic. The second option
is to relax the problem formulation by coarsening the
granularity of the models’ output. This granularity is
determined by the actions that we will perform with such
information. For example, we could formulate prognostics
as:


(1) A partial ordering over RUL. This formulation 
could be used to estimate the risk of claims in 
term life policies. Insurance underwriters estimate 
the applicants’ expected mortality at a coarse level 
by classifying each applicant into a given rate-class 
from a set of sorted classes that define decreasing 
RUL. Applicants inside each class are indistinguish- 
able in terms of risk and are charged the same 
premium (gender and age being equal). This will be 
further described in the case study of Sect. 41.4.1. 
In a PHM context, this formulation could be used 
to price the contractual service agreement renewals 
for different units within a fleet, in a fashion similar 
to the risk-based pricing of insurance underwriting. 

(2) An ordinal ordering over RUL (ranking). This for- 
mulation could be used to select the most reliable 
units of a fleet for mission-critical assignments. 
This will be further described in the case study of 
Sect. 41.3.3, where we illustrate how a train dis- 
patcher could select the best locomotives to create 
a hot train, e.g., a freight train with a guaranteed 
arrival time. 

(3) A cardinal ordering over RUL (rating). This formu- 
lation could be used to understand the relative level 
of readiness of units in a fleet, to prioritize the need 
for instruments calibration, power management as- 
sessment/verification, etc. 

(4) A binary classification of whether a given event 
(causing the end of RUL) will happen within a given 
time window. This formulation could be used to 
generate a time-dependent risk assessment to opti- 
mize fleet scheduling and unit allocation. 

(5) A regression on RUL, including the confidence in- 
terval of the prediction. This formulation provides 
the finest granularity. If we were able to reduce 
the confidence intervals of these predictions, we 
could refine and optimize the condition-based main- 
tenance of the assets in the fleet. 
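
As a rough illustration of how the coarser formulations relate to a point estimate of RUL, the sketch below derives a ranking, a set of sorted classes, and a within-window classification from the same hypothetical estimates. The unit names, class boundaries, and window length are invented for the example.

```python
def ranking(rul):
    """Ordinal formulation: units sorted from longest to shortest RUL."""
    return sorted(rul, key=rul.get, reverse=True)

def rate_class(value, lower_bounds):
    """Sorted-class formulation: index of the first class whose lower bound
    is met; classes are listed by decreasing RUL."""
    for k, bound in enumerate(lower_bounds):
        if value >= bound:
            return k
    return len(lower_bounds)

def fails_within(rul, window):
    """Binary formulation: will the end of RUL fall inside the window?"""
    return {unit: value <= window for unit, value in rul.items()}

# Hypothetical RUL estimates, in utilization cycles.
rul = {"a": 120.0, "b": 40.0, "c": 75.0}

order = ranking(rul)
classes = {u: rate_class(v, lower_bounds=[100.0, 50.0]) for u, v in rul.items()}
at_risk = fails_within(rul, window=50.0)
```

Each coarser output tolerates a wider error in the underlying estimate: the ranking stays correct as long as the order is preserved, and the classes stay correct as long as no estimate crosses a boundary.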


The following case study will illustrate the second 
problem reformulation (ranking). In this case, genetic 
algorithms are used to evolve fuzzy instance-based 
models that will generate a ranking of the most reliable 
locomotives within a fleet. 


Case Study 2: RUL-Driven Ranking
of Locomotives in a Fleet
Problem Definition. The problem of selecting the 
best units from a fleet of equipment occurs in many 
military and commercial applications. Given a specific 
mission profile, a commander may have to decide which 
five armored vehicles to deploy in order to minimize the
chance of a breakdown. In the commercial world, rail 
operators need to make decisions on which locomotives 
to use in a train traveling from coast to coast with time 
sensitive shipments. 

The behavior of these complex electromechanical 
assets varies considerably across different phases of 
their life cycle. Assets that are identical at the time 
of manufacture will evolve into somewhat individual 
systems with unique characteristics based on their us- 
age and maintenance history. Utilizing these assets 
efficiently requires a) being able to create a model char- 
acterizing their expected performance, and b) keeping 
this model updated as the behavior of the underlying 
asset changes. 

In this problem formulation, RUL prediction for 
each individual unit is computed by aggregating its own 
track record with that of a number of peer units — units 
with similarities along three key dimensions: system de- 
sign, patterns of utilization, and maintenance history. 
The notion of a peer is close to that of a neighbor 
in CBR, except the states of the peers are constantly 
changing. Odometer-type variables like mileage and 
age increase, and discrete events like major mainte- 
nance or upgrades occur. Thus, it is reasonable to 
assume that after every significant mission, the peers 
of a target unit may change based upon changes in both 
the unit itself, and the fleet at large. Our results suggest 
that estimating unit performance from peers is a prac- 
tical, robust, and promising approach. We conducted 
two experiments — one for retrospective estimation and 
one for prospective estimation. In the first experiment, 
we explored how well the median RUL of any unit 
could be estimated from the medians of its peers. In 
the second experiment, for a given instant in time, we 
predicted the time to the next failure for each unit us- 
ing the history of the peers. In these experiments, the 
retrospective (or prospective) RUL estimates were used 
to induce a ranking over the units. The selection of 
the best N units was based on this ranking. The preci- 
sion of the selection was the percentage of the correctly 
selected units among the N units (based on ground 
truth). 


CI-Based Approach. Our approach was based on
a fuzzy instance-based model (FIM), which can be found
in [41.11]. We addressed the definition of similarity
among peers by evolving the design of a similarity function
in conjunction with the design of the attribute space
in which the similarity was evaluated. Specifically, we
used the following four steps:


(1) Retrieval of similar instances from the database 
(DB). 

(2) Evaluation of similarity measures between the 
probe and the retrieved instances. 

(3) Creation of local models based on the most similar 
instances. 

(4) Aggregation of local models outputs (weighted by 
their similarity measures). 


(1) Retrieval. We looked for all units in the fleet DB
whose behavior was similar to the probe. These instances
are the potential peers of the probe. The peers
and probe can be seen as points in an n-dimensional feature
space. For instance, let us assume that a probe $Q$ is
characterized by an n-dimensional vector of features $X_Q$,
and by $O(Q) = [D_{1,Q}, D_{2,Q}, \ldots, D_{k(Q),Q}]$, the history of its
operational availability durations

$$
Q = [X_Q; O(Q)] = [x_{1,Q}, \ldots, x_{n,Q}; D_{1,Q}, \ldots, D_{k(Q),Q}] . \tag{41.1}
$$

Any other unit $u_j$ in the fleet has a similar characterization

$$
u_j = [X_j; O(u_j)] = [x_{1,j}, x_{2,j}, \ldots, x_{n,j}; D_{1,j}, D_{2,j}, \ldots, D_{k(j),j}] . \tag{41.2}
$$

For each dimension $i$ we defined a truncated generalized
Bell function, $\mathrm{TGBF}_i(x_i; a_i, b_i, c_i)$, centered at the
value of the probe $c_i$, which represents the degree of
similarity along that dimension. Specifically

$$
\mathrm{TGBF}_i(x_i; a_i, b_i, c_i) =
\begin{cases}
\left[ 1 + \left| \dfrac{x_i - c_i}{a_i} \right|^{2 b_i} \right]^{-1} & \text{if } \left[ 1 + \left| \dfrac{x_i - c_i}{a_i} \right|^{2 b_i} \right]^{-1} > \varepsilon \\
0 & \text{otherwise}
\end{cases}
\tag{41.3}
$$

where $\varepsilon$ is the truncation parameter, e.g., $\varepsilon = 10^{-7}$.
Since the parameters $c_i$ in each $\mathrm{TGBF}_i$ are determined
by the values of the probe, each $\mathrm{TGBF}_i$
has only two free parameters, $a_i$ and $b_i$, to control its
spread and curvature. In a coarse retrieval step, we extracted
an instance in the DB if all of its features are
within the support of the TGBFs. Then we formalized
the retrieval step. $P(Q)$, the set of potential peers
of $Q$, is composed of all units within a range from
the value of $Q$: $P(Q) = \{u_j, j = 1, \ldots, m \mid u_j \in N(X_Q)\}$,
where $N(X_Q)$ is the neighborhood of $Q$ in the state space
$X$, defined by the constraint $\|x_{i,Q} - x_{i,j}\| < R_i$ for all
potential attributes $i$ for which the corresponding weight is
nonzero. $R_i$ is half of the support of the $\mathrm{TGBF}_i$, centered
on the probe’s coordinate $x_{i,Q}$.
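
The retrieval step can be sketched in Python as follows. The sketch assumes that every attribute has a nonzero weight, passes the truncation parameter explicitly (the value used in the example is arbitrary), and derives the half-width $R_i$ of the support by solving the truncation condition for the distance from the probe; the names `tgbf`, `support_half_width`, and `retrieve` are hypothetical.

```python
def tgbf(x, a, b, c, eps):
    """Truncated generalized bell function centered at c (cf. (41.3))."""
    value = 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))
    return value if value > eps else 0.0

def support_half_width(a, b, eps):
    """Distance R from the center at which the bell function drops to eps.

    Obtained by solving 1 / (1 + (R / a)^(2b)) = eps for R.
    """
    return abs(a) * (1.0 / eps - 1.0) ** (1.0 / (2 * b))

def retrieve(probe, units, params, eps):
    """Names of units whose every feature lies within R_i of the probe."""
    radii = [support_half_width(a, b, eps) for a, b in params]
    return [name for name, x in units.items()
            if all(abs(q - xi) < r for q, xi, r in zip(probe, x, radii))]

# Hypothetical two-attribute example.
eps = 1e-3                              # illustrative truncation value
params = [(2.0, 1.0), (1.0, 1.0)]       # (a_i, b_i) for each attribute
probe = [5.0, 10.0]
units = {"near": [6.0, 11.0], "far": [200.0, 10.0]}
peers = retrieve(probe, units, params, eps)
```

Larger $a_i$ widens the bell and so admits more distant units as potential peers, while larger $b_i$ steepens its shoulders.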


(2) Similarity Evaluation. Each $\mathrm{TGBF}_i$ is a membership
function representing the partial degree of satisfaction
of constraint $A_i(x_i)$. Thus, it represents the
closeness of the instance to the probe value for
that particular attribute. For a given peer $P_j$, we evaluated
the function $S_{i,j} = \mathrm{TGBF}_i(x_{i,j}; a_i, b_i, x_{i,Q})$ along
each potential attribute $i$. The values $(a_i, b_i)$ are design
choices manually initialized, and later refined by the
EAs. Since we wanted the most similar instances to be
the closest to the probe along all $n$ attributes, we used
a similarity measure defined as the intersection of the
constraint-satisfaction values. Furthermore, to represent
the different relevance that each criterion should have
in the evaluation of similarity, we attached a weight $w_i$
to each attribute $A_i$. Therefore, we extended the notion
of a similarity measure between $P_j$ and the probe $Q$ as
a weighted minimum operator

$$
S_j = \min_{i=1}^{n} \{ \max[(1 - w_i), S_{i,j}] \}
= \min_{i=1}^{n} \{ \max[(1 - w_i), \mathrm{TGBF}_i(x_{i,j}; a_i, b_i, x_{i,Q})] \} , \tag{41.4}
$$

where $w_i \in [0, 1]$. The set of values for the weights $\{w_i\}$
and parameters $\{(a_i, b_i)\}$ are critical design choices that
impact the proper selection of peers.
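
A short Python sketch of the weighted minimum operator in (41.4) shows how the weights act. It takes the per-attribute membership values $S_{i,j}$ as already computed; the function name `similarity` and the numbers are illustrative only.

```python
def similarity(memberships, weights):
    """Weighted minimum of per-attribute memberships, as in (41.4).

    memberships -- S_ij values in [0, 1], one per attribute
    weights     -- w_i values in [0, 1], one per attribute
    """
    return min(max(1.0 - w, s) for w, s in zip(weights, memberships))

s = [0.9, 0.2]  # hypothetical memberships of one peer along two attributes

both_relevant = similarity(s, [1.0, 1.0])    # plain minimum
second_ignored = similarity(s, [1.0, 0.0])   # second attribute deselected
second_halved = similarity(s, [1.0, 0.5])    # second attribute discounted
```

With $w_i = 0$ the term $\max[(1 - w_i), S_{i,j}]$ equals 1 and cannot lower the minimum, so the attribute is effectively removed; with $w_i = 1$ the attribute enters the intersection at full strength.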


(3) Local Models. The idea of creating a local model
on demand can be traced back to memory-based approaches
[41.28, 29] and lazy learning [41.49]. Within
this case study, we focused on the creation of local predictive
models used to forecast each unit’s remaining
life. First, we used each local model to generate an estimated
value of the predicted variable. Then, we used
an aggregation mechanism based on the similarities of
the peers to determine the final output.

The generation of local models can vary in complexity,
depending on the task difficulty. In the first
experiment, we used the Median operator as the local
model, hence we did not need to define any parameter

$$
y_j = \mathrm{Median}[D_{1,j}, D_{2,j}, \ldots, D_{k(j),j}] . \tag{41.5}
$$

In the second experiment we used an exponential average,
requiring the definition of a forgetting factor $\alpha$

$$
y_j = D_{k(j)+1,j} = \bar{D}_{k(j),j} = \alpha \times D_{k(j),j} + (1 - \alpha) \times \bar{D}_{k(j)-1,j}
\quad [\text{where } \bar{D}_{1,j} = D_{1,j}] . \tag{41.6}
$$


(4) Aggregation. We needed to combine the individual
outputs $y_j$ of the peers $P_j(Q)$ to generate the
estimated output $y_Q$ for the probe $Q$: the median (for
experiment I) or the prediction of the next availability
duration, $D_{\mathrm{next},Q}$ (for experiment II).
To this end, we computed the weighted average of the
peers’ individual outputs using their normalized similarity
to the probe as a weight, namely

$$
y_Q = \mathrm{Median}_Q = \frac{\sum_{j=1}^{m} S_j \times y_j}{\sum_{j=1}^{m} S_j},
\quad \text{where } y_j = \mathrm{Median}[D_{1,j}, D_{2,j}, \ldots, D_{k(j),j}] \text{ for Exp. I},
$$

$$
y_Q = D_{\mathrm{next},Q} = \frac{\sum_{j=1}^{m} S_j \times y_j}{\sum_{j=1}^{m} S_j},
\quad \text{where } y_j = D_{k(j)+1,j} \text{ for Exp. II}. \tag{41.7}
$$

The entire process is summarized in Fig. 41.4, adapted
from [41.11].
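
Steps (3) and (4) can be combined in a brief Python sketch. It assumes each peer is represented only by its list of past availability durations and that the similarities $S_j$ have already been computed; the function names and the sample data are hypothetical.

```python
from statistics import median

def median_model(durations):
    """Local model of experiment I, as in (41.5)."""
    return median(durations)

def exponential_average(durations, alpha):
    """Local model of experiment II, as in (41.6).

    The running average starts at the first duration and is updated with
    forgetting factor alpha; its final value predicts the next duration.
    """
    average = durations[0]
    for d in durations[1:]:
        average = alpha * d + (1.0 - alpha) * average
    return average

def aggregate(outputs, similarities):
    """Similarity-weighted average of the peers' outputs, as in (41.7)."""
    total = sum(similarities)
    if total <= 0:
        raise ValueError("no peer has nonzero similarity to the probe")
    return sum(s * y for s, y in zip(similarities, outputs)) / total

# Two hypothetical peers with their duration histories and similarities.
histories = [[10.0, 20.0, 30.0], [40.0, 40.0, 40.0]]
similarities = [0.25, 0.75]

y_exp1 = aggregate([median_model(h) for h in histories], similarities)
y_exp2 = aggregate([exponential_average(h, 0.5) for h in histories],
                   similarities)
```

A forgetting factor close to 1 makes the prediction follow the most recent duration, whereas a value close to 0 keeps it near the earliest durations in the history.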


Structural and Parametric Tuning
Given the critical design roles of the weights $\{w_i\}$, the
parameters $\{(a_i, b_i)\}$, and the forgetting factor $\alpha$, it was
necessary to create a methodology to generate their best
values according to our metric, i.e., classification precision.
After testing several manually created peer-based
models, we decided to use evolutionary search to develop
and maintain the fuzzy instance-based classifier,
following a wrapper methodology detailed in [41.13].
In this application, however, we extended the evolutionary
search to include structural search, via attribute selection
and weighting [41.50], besides the parametric tuning.

The EAs were composed of a population of individuals,
each containing a vector of elements
that represented distinct tunable parameters within the
FIM configuration. Examples of tunable parameters included
the range of each parameter used to retrieve
neighbor instances and the relative parameter weights
used for similarity calculation. The EAs used two types
of mutation operators (Gaussian and uniform), and no
crossover. The population (with 100 individuals) was
evolved over 200 generations.
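
A mutation-only evolutionary search of this general kind can be sketched as below. The sketch is not the authors' implementation: the genes are assumed to be normalized to [0, 1], the selection scheme (parents and children compete, the best survive) is an assumption since the text does not specify one, and the toy fitness function merely stands in for the wrapper-based fitness.

```python
import random

def gaussian_mutation(genes, rng, sigma=0.1):
    """Perturb one randomly chosen gene with Gaussian noise, clipped to [0, 1]."""
    child = list(genes)
    i = rng.randrange(len(child))
    child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0.0, sigma)))
    return child

def uniform_mutation(genes, rng):
    """Replace one randomly chosen gene with a uniform draw from [0, 1)."""
    child = list(genes)
    i = rng.randrange(len(child))
    child[i] = rng.random()
    return child

def evolve(fitness, n_genes, pop_size=100, generations=200, seed=0):
    """Mutation-only search; returns the fittest individual found."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(n_genes)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        children = []
        for parent in population:
            mutate = rng.choice((gaussian_mutation, uniform_mutation))
            children.append(mutate(parent, rng))
        # Assumed selection: keep the best pop_size of parents plus children.
        population = sorted(population + children,
                            key=fitness, reverse=True)[:pop_size]
    return population[0]

def toy_fitness(genes):
    """Stand-in fitness with its maximum at 0.5 for every gene."""
    return -sum((g - 0.5) ** 2 for g in genes)

best = evolve(toy_fitness, n_genes=3, pop_size=20, generations=30, seed=1)
```

Because parents are retained when they outperform their children, the best fitness in the population never decreases from one generation to the next.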
Fig. 41.4 Description of fuzzy instance-based models (FIM) aggregated by convex sum


Each chromosome defined an instance of the attribute
space used by the associated classifier by specifying
a vector of weights $[w_1, w_2, \ldots, w_n]$. If $w_i \in \{0, 1\}$,
we perform attribute selection, i.e., we select
a crisp subset from the universe of potential attributes.
If $w_i \in [0, 1]$, we perform attribute weighting, i.e., we
define a fuzzy subset from the universe of potential attributes

$$[w_1, w_2, \ldots, w_n]\,[(a_1, b_1), (a_2, b_2), \ldots, (a_n, b_n)]\,[\alpha] \qquad (41.8)$$


where

• $w_i \in [0, 1]$ for attribute weighting and $w_i \in \{0, 1\}$ for attribute selection
• $n$ = cardinality of the universe of features $U$, $|U| = n$
• $d = \sum_i w_i$ = (fuzzy) cardinality of selected features
• $(a_i, b_i)$ = parameters for GBF$_i$
• $\alpha$ = parameter for exponential average.


The first part of the chromosome, containing the
weights vector $[w_1, w_2, \ldots, w_n]$, defines the attribute
space (the FIM structure) and the relevance of each
attribute in evaluating similarity. The second part
of the chromosome, containing the vector of pairs
$[(a_1, b_1), \ldots, (a_i, b_i), \ldots, (a_n, b_n)]$, defines the parameters
for retrieval and similarity evaluation. The last part
of the chromosome, containing the parameter $\alpha$, defines
the forgetting factor for the local models. The
fitness function is computed using a wrapper approach
[41.50]. For each chromosome, represented
by (41.8), we instantiated its corresponding FIM. Following
a leave-one-out approach, we used the FIM to
predict the expected life of the probe unit following
the four steps described in the previous subsection.
We repeated this process for all units in
the fleet and ranked them in decreasing order of
their predicted duration $D_{\text{next},Q}$. We then selected
the top 20%. The fitness function of the chromosome
was the precision of the classification, TP/(TP + FP),
where TP is the count of true positives and FP is
the count of false positives. This is illustrated in
Fig. 41.5.
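
The precision-based fitness described above can be sketched as follows. This is an illustrative sketch only: the function name and the toy numbers are hypothetical, and it assumes that a selected unit counts as a true positive when it also belongs to the top fraction of the fleet according to the actually observed durations.

```python
def top_fraction_precision(predicted, actual, fraction=0.2):
    """Precision TP / (TP + FP) of selecting the top `fraction` of units.

    predicted[i] and actual[i] are the predicted and observed durations
    of unit i. Assumption: a selected unit is a true positive if it is
    also in the top `fraction` of units ranked by observed duration.
    """
    n = len(predicted)
    if n == 0 or n != len(actual):
        raise ValueError("predicted and actual must be non-empty and aligned")
    k = max(1, round(n * fraction))
    # Rank units in decreasing order of predicted / observed duration.
    selected = sorted(range(n), key=lambda i: predicted[i], reverse=True)[:k]
    truly_best = set(sorted(range(n), key=lambda i: actual[i], reverse=True)[:k])
    tp = sum(1 for i in selected if i in truly_best)
    fp = len(selected) - tp
    return tp / (tp + fp)


# Hypothetical fleet of 10 units: the top 20% corresponds to 2 units.
predicted = [9, 8, 1, 2, 3, 4, 5, 6, 7, 0]
actual = [10, 1, 2, 3, 4, 5, 6, 7, 9, 0]
print(top_fraction_precision(predicted, actual))
```

In a wrapper setting, this value would be computed once per chromosome, after the corresponding model has produced its leave-one-out predictions for every unit.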


Results. We used 18 months’ worth of data and performed
the experiments at three different times, after
6, 12, and 18 months, respectively. We wanted to test
the adaptability of the learning techniques to environmental,
operational, or maintenance changes. We also
wanted to determine if their performance would improve
over time with incremental data acquisition. For
each start-up time, we used EAs to generate an optimized
weighted subset of attributes to define the peers
of each unit.

Fig. 41.5 FRC optimization using EA


Experiment 1: Retrospective Selection. The goal was
to select the current best 20% units of the fleet based on
their peers’ past performance. In this case, a random selection,
which could be used as a baseline, would yield
20%. However, the size of the fleet at each start-up time
was different, ranging from 262 (after 6 months) to 634
(after 12 months) to 845 (after 18 months). We decided
to keep the number of selected units constant (i.e.,
52 units) over the three start-up times. Thus, the baseline
random selection for each start-up time was [20%, 8%, 6%],
i.e., 52/262 = 20%; 52/634 = 8%; 52/845 = 6%.


Experiment 2: Prospective Selection. We wanted to 
select the future best 20% units for the next-pulse dura- 
tion. In this case, a random selection would yield 20%. 

The peers designed by the EAs provided the best 
accuracy overall: 


• Experiment 1 (Retrospective selection):
  Precision = 63.5%, which was more than 10× better
  than random selection, and 1.7× better than
  existing heuristics.

• Experiment 2 (Prospective selection):
  Precision = 55.0%, which was more than 2.5× better
  than random selection, and 1.5× better than
  existing heuristics.


Figures 41.6 and 41.7 illustrate the results of these 
two experiments. 

As mentioned in Sect. 41.1.2, successfully deployed 
intelligent systems must remain valid and accurate over 
time, while compensating for drifts and accounting for 
contextual changes that might otherwise render their 
design stale or obsolete. In this case study, we repeated
the last set of experiments using dynamic and static
models. The dynamic models were fresh models, redeveloped
at each time slice by using the methodology
described. The static models were developed at time
slice 1 and applied, unchanged, at time slices 2 and 3. In
this experiment, the original models showed significant
deterioration over time: 43% → 29% → 25%. In contrast,
the dynamic models exhibited robust, improved
precision: 43% → 43% → 55%. This is illustrated in
Fig. 41.8.

This comparison shows the benefit of automated 
model updating. By using offline metaheuristics
such as EAs, we can automate model development 
and model re-tuning. This allows us to maintain model 
performance over time, through frequent updates, and 
avoid the obsolescence-driven model deterioration, 
which in this example occurred 1 year after the first 
deployment. A more detailed description of this case 
study can be found in [41.11, 12]. 


41.3.4 Health Management -
Fault Accommodation 
and Optimization 


All the functions described in Sects. 41.3.1-41.3.3
could be described as descriptive and predictive
analytics, as they provide assessments and projections
of the system’s health state. These assessments
lead to prescriptive analytics, as they determine the
on-board control actions and the off-board logistics,
repair, and planning actions. On-board control actions
are usually focused on maintaining performance
or safety margins, and are performed in real time.
Off-board maintenance/repair actions cover more
offline decisions. They require a decision support
system (DSS) performing multiobjective optimizations,
exploring Pareto frontiers of corrective actions,
and combining them with preference aggregations to
generate the best decision tradeoffs. The underlying
techniques are intelligent control for fault accommodation
[41.18] and multiobjective optimization
techniques [41.51-54], aimed at minimizing the impact
that maintenance and repair events could have on
the profitable operation of the assets. For the sake of
brevity, we will not provide a case study of optimization
in the PHM domain, but we will present one in the
financial domain (Sect. 41.4.3).


41.4 CI/ML Applications in Financial Domains: Risk Management


Prognostics and health management of industrial assets 
bears a strong analogy with risk management of finan- 
cial and commercial assets. We have shown how unsu- 
pervised learning can be used to identify abnormal be- 
haviors, i.e., deviations from normal states/structures. 
In PHM, units in a fleet that stray away from normal
performance baselines are usually anomalies leading
to incipient failure modes. In financial domains,
nonconforming user behaviors could be precursors to
fraudulent transactions and could be identified using
similar techniques. Similarly, supervised learning could
be used to classify the root cause of an anomaly (diagnostics)
or to classify the risk class of an applicant for
a financial/insurance product (risk classification).
Regressions could be used to forecast the remaining useful
life of an asset under future load assumptions, or to 
forecast the residual value of assets after their lease 
period (or to create an instant valuation for an asset, 
such as in mortgage collateral valuation). Multiobjec- 
tive optimization techniques could be used to balance 
production values with life erosion cost (or combustion 
efficiency with emissions) or to balance an investment 
portfolio using multiple metrics of returns and risk. We 
will illustrate this analogy with the following three case 
studies, in which we will describe the use of CI tech- 
niques in risk classification for insurance underwriting, 
residential property valuation, and portfolio rebalancing 
optimization. 


41.4.1 Automation 
of Insurance Underwriting: 
A Classification Problem 


Problem Definition 
In many transaction-oriented processes, human deci- 
sion makers evaluate new applications for a given 
service (mortgages, loans, credits, insurances, etc.) and 
assess their associated risk and price. The automa- 
tion of these business processes is likely to increase 
throughput and reliability while reducing risk. The
success of these ventures depends on the availability
of generalized decision-making systems that are
not just able to reliably replicate the human decision- 
making process, but can do so in an explainable, 
transparent fashion. Insurance underwriting is one such 
high-volume application domain where intelligent au- 
tomation can be highly beneficial, and reliability and 
transparency of decision-making are critical. Tradition- 
ally, highly trained individuals perform insurance un- 
derwriting. A given insurance application is compared 
against several standards put forward by the insurance 
company and classified into one of the risk categories 
(rate classes) available for the type of insurance re- 
quested. The risk categories then affect the premium 
paid by the applicant — the higher the risk category, the 
higher the premium. The accept/reject decision is also 
part of this risk classification, since risks above a cer- 
tain tolerance level set by the company will simply be 
rejected. 

There can be a large amount of variability in the 
underwriting process when performed by human un- 
derwriters. Typically the underwriting standards cannot 
cover all possible cases, and sometimes they might be 
ambiguous. The subjective judgment of the underwriter 
will almost always play a role in the process. Variation
in factors such as underwriter training and experience,
and a multitude of other effects, can cause different
underwriters to issue inconsistent decisions. Sometimes
these decisions fall in a gray area not explicitly covered
by the standards. In these cases, the underwriter uses
his/her own experience to determine whether the standards
should be adjusted. Different underwriters could
apply different assumptions regarding the applicability
of the adjustments, as they might use stricter or more
liberal interpretations of the standards.


CI-Based Approach 
To address these problems, we developed a system to 
automate the application placement process for cases 
of low or medium complexity. For more complex cases, 
the system provided the underwriter with an assist 
based on partial analysis and conclusions. 

We used a fuzzy-rule-based classifier (FRC) to
capture the underwriting standards derived from the
actuarial guidance. Then we tuned the FRC with an
evolutionary algorithm to determine the best FRC
parameters to maximize precision and recall, while
minimizing the cost of misclassification. The remainder of
this section will summarize this solution.


Fuzzy Rule-Based Classifier (FRC). The fuzzy rule-based
classifier (FRC), which is briefly described
in [41.13, 14], uses rule sets to encode underwriting
standards. Each rule set represents a set of fuzzy
constraints defining the boundaries between rate classes.
These constraints were first determined from the
underwriting guidelines. They were then refined using
knowledge engineering sessions with expert underwriters
to identify factors such as blood pressure levels and
cholesterol levels, which are critical in defining the
applicant’s risk and corresponding premium. The goal of
the classifier is to assign an applicant to the most competitive
rate class, provided that the applicant’s vital
data meet all of the constraints of that particular rate
class to a minimum degree of satisfaction. The constraints
for each rate class r are represented by n fuzzy
sets: $A_i^r(x_i)$, $i = 1, \ldots, n$. Each constraint $A_i^r(x_i)$ can
be interpreted as the degree of preference induced by
value $x_i$ for satisfying constraint $A_i^r(x_i)$. After evaluating
all constraints, we compute two measures for each
rate class r. The first one is the degree of intersection of
all the constraints and measures the weakest constraint
satisfaction


$$I(r) = \bigcap_{i=1}^{n} A_i^r(x_i) = \min_{i=1}^{n} A_i^r(x_i) . \qquad (41.9)$$

This expression implies that each criterion has equal
weight. If we want to attach a weight $w_i$ to each criterion
$A_i$, we could use the weighted minimum operator:

$$I'(r) = \bigcap_{i=1}^{n} w_i A_i^r(x_i) = \min_{i=1}^{n} \left( \max\left( (1 - w_i), A_i^r(x_i) \right) \right) ,$$

where $w_i \in [0, 1]$. The second one is a cumulative measure
of missing points (the complement of the average
satisfaction of all constraints), and measures the overall
tolerance allowed to each applicant, i.e.,

$$MP(r) = \sum_{i=1}^{n} \left( 1 - A_i^r(x_i) \right) = n \left( 1 - \frac{1}{n} \sum_{i=1}^{n} A_i^r(x_i) \right) = n \left( 1 - \bar{A}^r \right) . \qquad (41.10)$$


The final classification is obtained by comparing the
two measures, I(r) and MP(r), against two lower bounds
defined by thresholds $\tau_1$ and $\tau_2$. The parametric definition
of each fuzzy constraint $A_i^r(x_i)$ and the values of $\tau_1$
and $\tau_2$ are design parameters that were initialized with
knowledge engineering sessions.

Figure 41.9, adapted from [41.13], illustrates an
example of three constraints (trapezoidal membership
functions) associated with rate class Z, the input data
corresponding to an application, and the evaluation of
the first measure, indicating the weakest degree of
satisfaction of all constraints.

Fig. 41.9 Example of three fuzzy constraints for rate
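
To make the two measures concrete, the sketch below evaluates three trapezoidal constraints for one rate class and computes the intersection (41.9), its weighted variant, and the missing points (41.10). It is a minimal sketch under stated assumptions: the attribute values, trapezoid parameters, weights, and threshold values are hypothetical, and the acceptance test (intersection at least `tau1`, missing points at most `tau2`) is one plausible reading of the threshold comparison, not a rule confirmed by the text.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 1 on [b, c], 0 outside (a, d), linear between."""
    if b <= x <= c:
        return 1.0
    if x < b:
        return 0.0 if x <= a else (x - a) / (b - a)
    return 0.0 if x >= d else (d - x) / (d - c)


def evaluate_rate_class(values, constraints, weights=None):
    """Return (I(r), weighted I(r), MP(r)) for one rate class."""
    degrees = [trapezoid(x, *p) for x, p in zip(values, constraints)]
    intersection = min(degrees)  # weakest constraint satisfaction, (41.9)
    if weights is None:
        weighted = intersection
    else:
        # Weighted minimum: a low weight w_i lifts the floor of criterion i.
        weighted = min(max(1.0 - w, a) for w, a in zip(weights, degrees))
    missing_points = sum(1.0 - a for a in degrees)  # (41.10)
    return intersection, weighted, missing_points


def meets_rate_class(intersection, missing_points, tau1, tau2):
    """Assumed acceptance rule: I(r) >= tau1 and MP(r) <= tau2."""
    return intersection >= tau1 and missing_points <= tau2


# Hypothetical applicant: cholesterol, systolic blood pressure, body mass index.
values = [210, 140, 24]
constraints = [(0, 0, 200, 240), (0, 0, 130, 150), (0, 0, 25, 30)]
i_r, i_w, mp_r = evaluate_rate_class(values, constraints, weights=[1.0, 0.25, 1.0])
print(i_r, i_w, mp_r, meets_rate_class(i_r, mp_r, tau1=0.5, tau2=1.0))
```

With these numbers the blood pressure constraint is the weakest one, so it alone determines I(r), while MP(r) accumulates the shortfall over all three constraints.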


Optimization of Design Parameters of the FRC Classifier.
The FRC design parameters were tuned, monitored,
and maintained to assure the classifier’s optimal
performance. To this end, we used EAs, composed
of a population of chromosomes. Each chromosome
contained a vector of elements that represent distinct
tunable parameters to configure the FRC classifier, i.e.,
the parametric definition of the fuzzy constraints $A_i^r(x_i)$
and thresholds $\tau_1$ and $\tau_2$.

A chromosome, the genotypic representation of
a model, defines the complete parametric configuration
of the classifier. Thus, an instance of such a classifier
can be created for each chromosome, as shown in
Fig. 41.10. Each chromosome $c_i$ of population P(t)
(left-hand side of Fig. 41.10) goes through a decoding
process that creates the classifier on the
right. Each classifier is then tested on all the cases in the
case base, assigning a rate class to each case. We can
determine the quality of the configuration encoded by
the chromosome, i.e., the fitness of the chromosome,
by analyzing the results of the test. Our EA uses two
types of mutations (uniform and Gaussian) to produce
new individuals in the population by randomly varying
parameters of a single chromosome. The more fit
chromosomes in generation t will be more likely to be
selected for this and pass their genetic material to the
next generation t + 1. Analogously, the less fit solutions
will be culled from the population. At the conclusion of
the EA’s execution, the best chromosome of the last
generation determines the classifier’s configuration. Note
the similarity between Figs. 41.10 and 41.5, which
underscores the similar role that EAs play as the offline MHs
to design the best fuzzy classifier (in this case study), or
the best FIM (in the case of the second case study).
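
A mutation-only evolutionary loop of the kind described above can be sketched as follows. This is a simplified illustration, not the EA used in the case study: the operator details (one gene mutated per child, Gaussian step size as a fraction of the gene range), the truncation-style selection that keeps the fitter half of the population, and the toy fitness function are all assumptions made for the example.

```python
import random


def gaussian_mutation(chromosome, bounds, rng, sigma_fraction=0.1):
    """Perturb one randomly chosen gene with Gaussian noise, clipped to its range."""
    i = rng.randrange(len(chromosome))
    low, high = bounds[i]
    child = list(chromosome)
    step = rng.gauss(0.0, sigma_fraction * (high - low))
    child[i] = min(high, max(low, child[i] + step))
    return child


def uniform_mutation(chromosome, bounds, rng):
    """Replace one randomly chosen gene with a uniform draw from its range."""
    i = rng.randrange(len(chromosome))
    low, high = bounds[i]
    child = list(chromosome)
    child[i] = rng.uniform(low, high)
    return child


def evolve(fitness, bounds, pop_size=20, generations=30, seed=0):
    """Mutation-only EA (no crossover); returns the best final chromosome."""
    rng = random.Random(seed)
    population = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: max(1, pop_size // 2)]  # cull the less fit half
        children = []
        while len(survivors) + len(children) < pop_size:
            parent = rng.choice(survivors)
            operator = gaussian_mutation if rng.random() < 0.5 else uniform_mutation
            children.append(operator(parent, bounds, rng))
        population = survivors + children
    return max(population, key=fitness)


def toy_fitness(chromosome):
    """Hypothetical fitness with its maximum at (0.3, 0.7)."""
    return -((chromosome[0] - 0.3) ** 2 + (chromosome[1] - 0.7) ** 2)


best = evolve(toy_fitness, bounds=[(0.0, 1.0), (0.0, 1.0)])
print(best, toy_fitness(best))
```

In the wrapper setting of this section, `toy_fitness` would be replaced by a function that decodes the chromosome into a classifier, runs it on the case base, and scores the resulting decisions.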


Standard Reference Dataset (SRD). To test and tune
the classifiers, we needed to establish a benchmark.
Therefore, we generated a standard reference dataset
(SRD) of approximately 3000 cases taken from a stratified
random sample of the historical case population.
Each of these cases received a rate class decision when
it was originally underwritten. To reduce variability in
these decisions, a team of experienced underwriters performed
a blind review of selected cases to determine
the standard reference decisions. These cases were then
used to create and optimize the FRC model.

Fig. 41.10 FRC optimization using EA


Fitness Function. In classification problems such as
this one, we can use two matrices to construct the fitness
function that we want to optimize. The first matrix
is a T × T confusion matrix M that contains frequencies
of correct and incorrect classifications for all possible
combinations of the standard reference decisions
(SRDs) and the classifier decisions. The SRDs represent
ground truth rate class decisions as reached by consensus
among senior expert underwriters for a set of insurance
applications. The frequencies of correct classifications
can be found on the main diagonal of matrix
M. The first (T − 1) columns represent the rate classes
available to the classifier. Column T represents the
classifier’s choice of not assigning any rate class, sending
the case to a human underwriter. The same ordering is
used to sort the rows for the SRD. The second matrix is
a T × T penalty matrix P that contains the value loss due
to misclassification. The entries in the penalty matrix P
are zero or negative values. They were computed from
actuarial data showing the net present value (NPV) for
each entry (j, k). The penalty value P(j, k) was the difference
between the NPV of the entry (j, k) and the highest
NPV, corresponding to the correct entry (j, j), located
on the main diagonal. The fitness function f combined
the values of M, resulting from a test run of the classifier
configured with chromosome $c_i$, with the penalty
matrix P to produce a single value

$$f(c_i) = \sum_{j=1}^{T} \sum_{k=1}^{T} M(j,k) * P(j,k) . \qquad (41.11)$$


Function f represents the expected value loss for that 
chromosome computed over the SRD and is the fitness 
function used to drive the evolutionary search. 
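
The computation in (41.11) is an element-wise product of the two matrices followed by a sum. The sketch below shows it for a toy problem with T = 3 (two rate classes plus the "refer to a human underwriter" outcome). The matrix entries are invented for illustration and are not taken from the case study.

```python
def fitness(confusion, penalty):
    """Sum over all (j, k) of M(j, k) * P(j, k), as in (41.11).

    Rows index the standard reference decision, columns the classifier
    decision. Penalties are zero on the diagonal and negative elsewhere,
    so the result is at most zero and a larger value is better.
    """
    size = len(confusion)
    if len(penalty) != size or any(len(r) != size for r in confusion + penalty):
        raise ValueError("confusion and penalty must be square and equally sized")
    return sum(confusion[j][k] * penalty[j][k]
               for j in range(size) for k in range(size))


# Hypothetical 3 x 3 example; the last row/column is "refer to underwriter".
M = [[50, 5, 2],
     [4, 30, 3],
     [1, 2, 3]]
P = [[0, -10, -1],
     [-20, 0, -1],
     [-5, -5, 0]]
print(fitness(M, P))
```

Because correct decisions carry a zero penalty, only the off-diagonal counts contribute, each weighted by how costly that particular kind of error is.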


Results 
Testing and Validation of FRC. We defined Coverage
as the percentage of cases assigned a rate class by the
classifier (i.e., not referred to a human underwriter), as
a fraction of the total number of input cases; Relative
accuracy as the percentage of correct decisions on those
cases that were not referred to the human underwriter;
and Global accuracy as the percentage of correct decisions,
including making correct rate class decisions and making
a correct decision to refer cases to human underwriters,
as a fraction of total input cases. Then we performed
a comparison against the SRD. The results, reported
in [41.13], show a remarkable improvement in all measures.
Specifically, we obtained the following results:

Table 41.2 Typical performance of the un-tuned and tuned rule-based decision system (FRC)

Metrics             Initial parameters based     Best knowledge-engineered   Optimized
                    on written guidelines (%)    parameters (%)              parameters (%)
Coverage            94.01                        90.38                       91.71
Relative accuracy   75.92                        92.99                       95.52
Global accuracy     74.75                        90.07                       93.63

Using the initial parameters (first column of Table 41.2),
we can observe a large coverage (~94%) associated with
a low relative accuracy (~76%) and a lower global accuracy
(~75%). These performance values are the result of applying
a strict interpretation of the underwriter (UW) guidelines,
without allowing for any tolerance. Had we implemented
such crisp rules with a traditional rule-based system,
we would have obtained these results. This strictness
would prevent the insurer from being price competitive,
and would not represent the typical modus operandi
of human underwriters. However, by allowing each
underwriter to use his/her own interpretation of such
guidelines, we could introduce large variability among
underwriters. One of our main goals was to provide a uniform
interpretation, while still allowing for some tolerance.

This goal is addressed in the second column of Table 41.2,
which shows the results of performing knowledge
engineering and encoding the desired tradeoff
between risk and price competitiveness as fuzzy constraints
with preference semantics. This intermediate
stage shows a different tradeoff, since both global and
relative accuracy have improved. Coverage slightly decreases
(~90%) for a considerable gain in relative
accuracy (~93%). Although we obtained this initial parameter
set by interviewing the experts, we had no guarantee
that such parameters were optimal. Therefore, we
used EAs to tune them. We allowed the parameters to
move within a predefined range centered on their initial
values and, using the SRD and the fitness function described
above, we obtained an optimized parameter set,
whose results are described in the third column of Table 41.2.

The results of the optimization show that the point
corresponding to the final parameter set dominates the
second set point (in a Pareto sense), since both coverage
and relative accuracy were improved. Finally, we
can observe that the final metric, global accuracy (last
row in Table 41.2), improves monotonically as we move
from using the strict interpretation of the guidelines
(~75%), through the knowledge-engineered parameters
(~90%), to the optimized parameters (~94%).

Table 41.3 Average FRC performance over five tuning case sets compared to five disjoint test sets

Metrics             Average performance      Average performance
                    on training sets         on disjoint test sets
Coverage            91.81                    91.80
Relative accuracy   94.52                    93.60
Global accuracy     92.74                    91.60

While the reported performance of the optimized 
parameters, shown in Table 41.2, is typical of the 
performance achieved through the optimization, a five- 
fold cross-validation on the optimization was also per- 
formed to identify stable parameters in the design space 
and stable metrics in the performance space. This is 
shown in Table 41.3. 
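
The three metrics reported in Tables 41.2 and 41.3 can be derived from a confusion matrix laid out as in the fitness function discussion (rows for the standard reference decision, columns for the classifier decision, last index for referral to a human underwriter). The sketch below is illustrative only: the function name and the matrix entries are hypothetical.

```python
def frc_metrics(confusion):
    """Coverage, relative accuracy, and global accuracy, in percent.

    confusion[j][k] counts cases whose reference decision is j and whose
    classifier decision is k; the last index means "refer to underwriter".
    """
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Cases for which the classifier committed to a rate class.
    decided = sum(confusion[j][k] for j in range(size) for k in range(size - 1))
    if total == 0 or decided == 0:
        raise ValueError("metrics are undefined without decided cases")
    correct_decided = sum(confusion[j][j] for j in range(size - 1))
    correct_all = sum(confusion[j][j] for j in range(size))  # incl. correct referrals
    return {
        "coverage": 100.0 * decided / total,
        "relative_accuracy": 100.0 * correct_decided / decided,
        "global_accuracy": 100.0 * correct_all / total,
    }


# Hypothetical 3 x 3 confusion matrix over 100 cases.
M = [[50, 5, 2],
     [4, 30, 3],
     [1, 2, 3]]
metrics = frc_metrics(M)
print(metrics)
```

Relative accuracy is computed over decided cases only, which is why it can rise while coverage falls: a classifier that refers more of the hard cases makes fewer errors on the cases it keeps.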

With this kind of automation, variability of the risk
category decision was greatly reduced. This also eliminated
a source of risk exposure for the company, allowing
it to operate more competitively and profitably.
The intelligent automation process, capable of determining
its applicability for each new case, increased
underwriting capacity by enabling underwriters to handle
a larger volume of applications. Additional information on this
approach can be found in [41.13, 14].


41.4.2 Mortgage Collateral Valuation: 
A Regression Problem 


Problem Definition 

Residential property valuation is the process of deter- 
mining a dollar estimate of the property value for given 
market conditions. Within this case study, we will re- 
strict ourselves to a single-family residence designed or 
intended for owner occupancy. The value of a property 
changes with market conditions, so any estimate of its 
value must be periodically updated to reflect those mar- 
ket changes. Any valuation must also be supported by 
current evidence of market conditions, e.g., recent real 
estate transactions. 
The current manual process for estimating the value of
properties usually requires an on-site visit by a human 
appraiser. This process is slow and expensive for batch 
applications such as those used by banks for updat- 
ing their loan and insurance portfolios, verifying risk 
profiles of servicing rights, or evaluating default risks 
for securitized packages of mortgages. The appraisal 
process for these batch applications is currently es- 
timated, to a lesser degree of accuracy, by sampling 
techniques. Secondary buyers and mortgage insurers 
may also require verification of property value on in- 
dividual transactions. Some of the applications also 
require that the output be qualified by a reliability mea- 
sure and some justification, so that questionable data 
and unusual circumstances can be flagged for the hu- 
man who uses the output of the system. Thus, the 
automation of residential property valuation was motivated by
a broad spectrum of application areas. 

The most common and credible method used by 
appraisers is the sales comparables approach. This 
method consists of finding comparable cases, i. e., re- 
cent sales that are comparable to the subject property 
(using sales records); contrasting the subject property 
with the comparables; adjusting the comparables’ sales 
price to reflect their differences from the subject prop- 
erty (using heuristics and personal experience); and 
reconciling the comparables’ adjusted sales prices to derive
an estimate for the subject property (using any
reasonable averaging method). This process assumes
that the item’s market value can be derived from the prices
demanded by similar items in the same market.


CI-Based Approach:
LOCVAL, AIGEN, AICOMP, Fusion
To automate the valuation process, we developed a program
that combined the results of three independent
estimators. The first one, locational value (LOCVAL),
was a coarse estimator based on the locational value
of the property. The second one, generative AI model
(AIGEN), was a generative estimator based on neuro-fuzzy
networks that only used five features from our
training set. The third one, comparable-based AI model
(AICOMP), was a fuzzy case-based reasoner that followed
the comparable-based approach of the appraisers.
Finally, we fused the outputs of the estimators into a single
estimate and reliability value.


Locational value model (LOCVAL). The first model
was based solely on two features of the property:
its location, expressed by a valid, geocoded address,
and its living area, as shown in Fig. 41.11 (adapted
from [41.15]).

A dollar per square foot measure was constructed for
each point in the county, by suitably averaging the observed,
filtered historical market values in the vicinity
of that point. This locational value estimator (LOCVAL)
produced two output values: Locational_Value (a $/ft²
estimate) and Deviation_from_prevailing_value. The
local averaging was done by an exponentially decreasing
radial basis function with a space constant of
0.15-0.2 miles. It could be described as the weighted
sum of radial basis functions (all of the same width),
each situated at the site of a sale within the past year
and having amplitude equal to the sales price.
Deviation_from_prevailing_value was the standard deviation for
houses within the area covered and was derived using
a similar approach. The output of LOCVAL was a coarse
estimate of the property value, which was used as an input
for the generative approach (AIGEN).
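
The local averaging behind LOCVAL can be illustrated with a small kernel-weighted estimator. This is a sketch under several assumptions that the text does not confirm: the kernel is taken to be exp(-d / space_constant), locations are planar coordinates in miles, the weights are normalized so that the result is a weighted mean of nearby $/ft² values, and the deviation is the correspondingly weighted standard deviation. The function name and the sales data are hypothetical.

```python
import math


def locval(subject_xy, sales, space_constant=0.15):
    """Kernel-weighted $/ft^2 estimate and deviation around a subject location.

    sales: iterable of (x_miles, y_miles, price_per_sqft) for recent sales.
    Returns (locational_value, deviation_from_prevailing_value).
    """
    weights, values = [], []
    for x, y, price_per_sqft in sales:
        distance = math.hypot(x - subject_xy[0], y - subject_xy[1])
        # Exponentially decreasing influence with distance (assumed kernel form).
        weights.append(math.exp(-distance / space_constant))
        values.append(price_per_sqft)
    total = sum(weights)
    if total == 0:
        raise ValueError("no sale is close enough to contribute")
    mean = sum(w * v for w, v in zip(weights, values)) / total
    variance = sum(w * (v - mean) ** 2 for w, v in zip(weights, values)) / total
    return mean, math.sqrt(variance)


# Two hypothetical sales at equal distance from the subject property.
recent_sales = [(0.1, 0.0, 100.0), (-0.1, 0.0, 200.0)]
locational_value, deviation = locval((0.0, 0.0), recent_sales)

# Assumed use: a coarse value estimate as $/ft^2 times living area.
living_area_sqft = 1800
print(locational_value, deviation, locational_value * living_area_sqft)
```

With a space constant of 0.15 miles, a sale one mile away receives a weight of about exp(-6.7), roughly a thousandth of the weight of a sale next door, so the estimate is dominated by the immediate neighborhood.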


AIGEN: Fuzzy-Neural Network. The generative AI 
model (AIGEN) relied on a fuzzy-neural net that, after 
a training phase, provided an estimate of the subject’s 
value. The specific model was an extension of AN- 
FIS [41.55], which implemented a fuzzy system as 
a five-layer neural network so that the structure of the 
net could be interpreted in terms of high-level rules. 
The extension developed allowed the output to be lin- 
ear functions of variables that did not necessarily occur 
in the input. In this fashion, we achieved more fidelity 
with the local models (linear functions) without incurring
the computational complexity caused by a large
number of inputs. AIGEN inputs were five property 
features (total_rooms, num_bedrooms, num_baths, liv- 
ing_area, and lot_size) and the output of LOCVAL 
(locational_value). 


AICOMP: Fuzzy Case-Based Reasoner (CBR). 
AICOMP is a fuzzy CBR system that used fuzzy pred- 
icates and fuzzy-logic-based similarity measures to 
estimate the value of residential property. This process 
consisted of selecting relevant cases (which would be 
nearby house sales), adapting them, and aggregating 
those adapted cases into a single estimate of the 
property value. AICOMP followed a process similar 
to the sales comparison used by certified appraisers to 
estimate a residential property’s value. This approach, 
which is further described in [41.56], consisted of: 


(1) Retrieving recent sales from a case base. Upon
    entering the subject property attributes, AICOMP
    retrieves potentially similar comparables from the
    case base. This initial selection uses six attributes:
    address, date of sale, living area, lot area, number
    of bathrooms, and number of bedrooms.

(2) Comparing the subject property with the retrieved
    cases. The comparables are rated and ranked on
    a similarity scale to identify the most similar ones
    to the subject property. This rating is obtained
    from a weighted aggregation of the decision maker’s
    preferences, expressed as fuzzy membership distributions
    and relations.

(3) Adjusting the sales price of the retrieved cases.
    Each property’s sales price is adjusted to reflect
    its differences from the subject property. These
    adjustments are performed by a rule set that uses
    additional property attributes, such as construction
    quality, conditions, pools, fireplaces, etc.

(4) Aggregating the adjusted sales prices of the retrieved
    cases. The best four to eight comparables
    are selected. The adjusted sales price and similarity
    of the selected properties are combined to produce
    an estimate of the subject value with an associated
    reliability value.
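
Step (4) above can be sketched as a similarity-weighted average over the best comparables. The text does not specify how the adjusted prices and similarities are combined, so the weighted mean used here is an assumption, as are the function name, the rule of returning no estimate when fewer than four usable comparables exist, and the example data.

```python
def aggregate_comparables(comparables, min_count=4, max_count=8):
    """Combine adjusted sales prices of the most similar comparables.

    comparables: list of (similarity in [0, 1], adjusted_price).
    Returns a similarity-weighted mean of the best `max_count` comparables,
    or None when fewer than `min_count` have non-zero similarity.
    """
    ranked = sorted(comparables, key=lambda c: c[0], reverse=True)
    selected = [c for c in ranked[:max_count] if c[0] > 0]
    if len(selected) < min_count:
        return None  # too few usable comparables for a credible estimate
    total_similarity = sum(s for s, _ in selected)
    return sum(s * price for s, price in selected) / total_similarity


# Hypothetical comparables after the price adjustment of step (3).
candidates = [
    (1.0, 200_000),
    (0.5, 220_000),
    (0.5, 180_000),
    (1.0, 210_000),
    (0.0, 999_999),  # dissimilar property, ignored
]
print(aggregate_comparables(candidates))
```

Weighting by similarity lets the closest matches dominate the estimate while still using the remaining comparables to dampen the effect of any single unusual sale.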


Fusion. Each model produced a property value and
an associated reliability value. The latter was a function
of the typicality of the subject property based on
its physical characteristics (such as lot size, living area,
and total rooms). These typical values were represented
by possibilistic distributions (fuzzy sets). We computed
the degree to which each property satisfied each criterion.
The overall property value reliability was obtained
by considering the conjunction of these constraint
satisfactions (i.e., the minimum of the individual reliability
values).
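
The reliability computation described above amounts to taking the minimum of several fuzzy typicality degrees. The sketch below assumes trapezoidal possibility distributions; the shape, the parameter values, and the feature names are illustrative assumptions rather than the values used in the case study.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal distribution: 1 on [b, c], 0 outside (a, d), linear between."""
    if b <= x <= c:
        return 1.0
    if x < b:
        return 0.0 if x <= a else (x - a) / (b - a)
    return 0.0 if x >= d else (d - x) / (d - c)


def reliability(features, typical_ranges):
    """Conjunction (minimum) of the typicality degrees of all features."""
    return min(trapezoid(features[name], *params)
               for name, params in typical_ranges.items())


# Hypothetical typicality definitions for three physical characteristics.
typical_ranges = {
    "lot_size": (2000, 5000, 10000, 20000),   # square feet
    "living_area": (800, 1200, 2500, 4000),   # square feet
    "total_rooms": (3, 5, 8, 12),
}
subject = {"lot_size": 15000, "living_area": 2000, "total_rooms": 9}
print(reliability(subject, typical_ranges))
```

Using the minimum means that a single atypical characteristic, here the unusually large lot, is enough to lower the reliability of the whole estimate.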

The computation times, required inputs, errors, and 
reliability values for these three methods are shown in 
Fig. 41.12. The locational value (LOCVAL) model took
the least time and information, but produced the largest
error. The CBR approach (AICOMP) took the most
time and the largest number of inputs, but produced the
lowest error.

The fusion of the three estimators exhibited several 
advantages: 


• The fusion process provided an indication of the reliability
  in the final estimate:
  - If reliability was high, the fused estimate was
    more accurate than any of the individual ones
  - If reliability was limited, the system generated
    an explanation in human terms
• The fused estimates were more robust.


These characteristics allowed the user to determine 
the suitability of the estimate within the given busi- 
ness application context. Knowledge-based rules were 
used for constructing this fusion at a supervisory level, 
and the few parameters were determined manually, by 
inspection and experimentation. A more detailed de- 
scription of this process can be found in [41.15]. This 
case study was the oldest one to use model ensembles
and fusion. However, the use of metaheuristics
to guide the design phase was, at that time, not yet as well
understood as in the more recent case studies.
Results
The reliability values generated by the fusion were divided
into three classes, labeled good, fair, and poor.
From a test sample of 7293 properties, 63% were classified
as good, with a median absolute error of 5.4% (an
error that was satisfactory for the intended application).
Of the remaining subjects, 24% were classified as fair,
and 13% as poor. The fair set had a median error of
7.7%, and the poor set had a median error of 11.8%.

The reliability computation and the fusion increased 
the robustness and usefulness of the system, which 
achieved good accuracy and was scalable for thou- 
sands of automated transactions. This approach made it 
a transparent, interpretable, fast, and inexpensive choice 
for bulk estimates of residential property value for a va- 
riety of financial applications. 


41.4.3 Portfolio Rebalancing:
An Optimization Problem 


Problem Definition 
The goal of portfolio optimization is to manage risk 
through diversification and obtain an optimal risk- 
return tradeoff. In this case study, we address portfolio 
optimization within the context of an asset-liability 
management (ALM) application. The goal was to find 
the optimal allocation of available financial resources 
to a diversified portfolio of 1500+ long and short-term 
financial assets, in accordance with risk, liability, and 
regulatory constraints. 

Fig. 41.12 Data comparison of multiple approaches

To characterize the investor’s risk objectives and
capture the potential risk-return tradeoffs, we used
various measures to quantify different aspects of portfolio
risk. For ALM applications, a typical measure of risk is 
surplus variance. We computed portfolio variance us- 
ing an analytical method based on a multifactor risk 
framework. In this framework, the value of a secu- 
rity can be characterized as a function of multiple 
underlying risk factors. The change in the value of 
a security can be approximated by the changes in the 
risk factor values and risk sensitivities to these risk 
factors. The portfolio variance equation can be de- 
rived analytically from the underlying value change 
function. 

In ALM applications, the portfolios have assets and 
liabilities that are affected by the changes in com- 
mon risk factors. Since a majority of the assets are 
fixed-income securities, the dominant risk factors are 
interest rates. In ALM applications, in addition to 
maximizing return or minimizing risk, portfolio man- 
agers are constrained to match the characteristics of 
asset portfolios with those of the corresponding lia- 
bilities to preserve portfolio surplus due to interest 
rate changes. Therefore, the ALM portfolio optimiza- 
tion problem formulation has additional linear con- 
straints that match the asset-liability characteristics 
when compared with the traditional Markowitz model. 
We use the following ALM portfolio optimization
formulation:

Maximize     Portfolio expected return
Minimize     Surplus variance
Minimize     Portfolio value at risk                  (41.12)
Subject to:
    Duration mismatch < target,
    Convexity mismatch < target, and
    Linear portfolio investment constraints.


To measure the three objectives in (41.12), namely 
portfolio expected return, surplus variance, and port- 
folio value at risk, we used book yield, portfo- 
lio variance, and simplified value at risk (SVaR), 
respectively. These metrics are defined as fol- 
lows: 


• Portfolio book yield represents its accounting yield
to maturity and is defined as

$$\text{BookYield}_P = \frac{\sum_i \text{BookValue}_i \times \text{BookYield}_i}{\sum_i \text{BookValue}_i} \tag{41.13}$$

• Portfolio variance is a measure of its variability
and is defined as the second central moment of its value
change ΔV

$$\sigma^2 = E\left[(\Delta V)^2\right] - \left(E[\Delta V]\right)^2 \tag{41.14}$$

• Portfolio simplified value at risk is a complex measure
of the portfolio’s catastrophic risk and is described
in detail in [41.57].


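These metrics define the 3D optimization space.
Now, let us analyze its constraints. The change in the
value ΔV of a security can be approximated by a second-order
Taylor series expansion given by

$$\Delta V \approx \sum_{i=1}^{m} \frac{\partial V}{\partial F_i}\,\Delta F_i + \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \frac{\partial^2 V}{\partial F_i\,\partial F_j}\,\Delta F_i\,\Delta F_j . \tag{41.15}$$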

The first- and second-order partial derivatives in (41.15)
are the risk sensitivities, i.e., the change in the security
value with respect to the change in the risk
factors F_i. These two terms are typically called delta
and gamma, respectively [41.58]. For fixed-income securities,
these measures are duration and convexity.
The duration and convexity mismatches, which constrain
our optimization space in (41.12), are the absolute
values of the differences between the effective durations
and convexities of the assets and liabilities in the
portfolio, respectively. Though they are nonlinear (be- 
cause of the absolute value function), the constraints 
can easily be made linear by replacing each of them 
with two new constraints that each ensure that the 
actual value of the mismatch is less than the target 
mismatch and greater than the negative of the target 
mismatch, respectively. The other portfolio investment 
constraints include asset-sourcing constraints that im- 
pose a maximum limit on each asset class or secu- 
rity, overall portfolio credit quality, and other linear 
constraints. 
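
The replacement of an absolute-value mismatch constraint by two linear constraints can be sketched as follows. This is only an illustration: it assumes that the asset-side duration is a linear function of the allocation vector (which the text implies but does not spell out), it uses non-strict inequalities, and the helper names `linearize_abs_constraint` and `is_feasible`, as well as all numbers, are hypothetical.

```python
from typing import List, Sequence, Tuple

# A linear constraint in the form  coeffs . x <= rhs
LinearConstraint = Tuple[List[float], float]


def linearize_abs_constraint(coeffs: Sequence[float], liability_value: float,
                             target: float) -> List[LinearConstraint]:
    """Replace |coeffs . x - liability_value| <= target by two linear rows.

    Assumption: the asset-side quantity (e.g., duration) is linear in x.
    """
    upper = (list(coeffs), target + liability_value)          #  (c.x - L) <= target
    lower = ([-c for c in coeffs], target - liability_value)  # -(c.x - L) <= target
    return [upper, lower]


def is_feasible(x: Sequence[float], constraints: Sequence[LinearConstraint],
                tol: float = 1e-9) -> bool:
    """Check that x satisfies every row coeffs . x <= rhs."""
    return all(sum(c * xi for c, xi in zip(coeffs, x)) <= rhs + tol
               for coeffs, rhs in constraints)


# Hypothetical example: three assets with durations of 2, 5, and 10 years,
# a liability duration of 6 years, and a target mismatch of 0.5 years.
asset_durations = [2.0, 5.0, 10.0]
rows = linearize_abs_constraint(asset_durations, liability_value=6.0, target=0.5)

close_match = [0.2, 0.4, 0.4]  # asset duration 0.4 + 2.0 + 4.0 = 6.4
poor_match = [0.8, 0.1, 0.1]   # asset duration 1.6 + 0.5 + 1.0 = 3.1
```

In this sketch `close_match` satisfies both rows, while `poor_match` violates the lower row, which is the same verdict the original absolute-value constraint would give.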


CI-Based Approach
Given the explicit need for customization and hybridization
in methods for portfolio optimization, we
could not find an existing multiobjective optimization
algorithm that could be applied without extensive modifications.
Specifically, the requirement to optimize while
satisfying a large number of linear constraints excluded
the ready application of prior evolutionary multiobjective
optimization approaches. This aspect was a principal
motivation to develop a novel hybrid technique.

Figure 41.13 illustrates the process used to drive 
the search for the efficient frontier. The process con- 
sisted of three steps, corresponding to the three boxes 
in Fig. 41.13. The first step (box 1 in Fig. 41.13) was 
the generation of the Pareto front. It consisted of: 


(a) Initializing the population of candidate portfolios
using randomized linear programming (RLP)

(b) Generating an interim Pareto front with a Pareto
sorting evolutionary algorithm (PSEA)

(c) Completing gaps in the Pareto front with a target
objectives genetic algorithm (TOGA)

(d) Storing the results in a repository.

After many runs, we filtered the repository with an efficient
dominance filter and generated the first efficient frontier.


The second step (box 2 in Fig. 41.13) was the in- 
teractive densification of the Pareto front. This richly 
sampled Pareto front was analyzed for possible gaps, 
and augmented with a last run of TOGA, leading to the 
generation of the second efficient frontier. Each point 
in this front represented a nondominated solution, i.e., a
viable portfolio. The third step (box 3 in Fig. 41.13) was
the portfolio selection. We needed to incorporate the decision
maker’s preferences in the return-risk tradeoff.
Our goal was to reduce the large number of viable solutions
into a much smaller subset that could then be
further analyzed for a final portfolio selection.

We will briefly describe the major components of 
this process. 


Randomized Linear Programming (RLP). The key 
challenge in solving the portfolio optimization problem 
was presented by the large number of linear allocation 
constraints. The feasible space defined by these con- 
straints is a high dimensional real-valued space (1500+ 
dimensions), and is a highly compact convex poly- 
tope, making for an enormously challenging constraint 
satisfaction problem. We leveraged our knowledge of
the geometrical nature of the feasible space by designing
a randomized linear programming algorithm
that robustly sampled the boundary vertices of the 
convex feasible space. These extremity samples were 
seeded in the initial population of the PSEA and were 
exclusively used by the evolutionary multiobjective al- 
gorithm to generate interior points (via interpolative 
convex crossover) that were always geometrically fea- 
sible. This was similar in principle to the preprocess 
phase, proposed by Kubalik and Lazansky [41.59]. 
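
The reason why convex crossover keeps offspring feasible can be sketched as follows: a convex combination of two points that satisfy a set of linear constraints also satisfies them. The tiny three-asset polytope, its vertices, and the function names below are hypothetical, and the vertices are written by hand rather than produced by an LP solver, so this does not reproduce the RLP step itself.

```python
import random
from typing import List, Sequence, Tuple

# A linear constraint in the form  coeffs . x <= rhs
LinearConstraint = Tuple[Sequence[float], float]


def convex_crossover(parent_a: Sequence[float], parent_b: Sequence[float],
                     rng: random.Random) -> List[float]:
    """Return lam * a + (1 - lam) * b for a random lam in [0, 1)."""
    lam = rng.random()
    return [lam * a + (1.0 - lam) * b for a, b in zip(parent_a, parent_b)]


def satisfies(x: Sequence[float], constraints: Sequence[LinearConstraint],
              tol: float = 1e-9) -> bool:
    """Check that x satisfies every row coeffs . x <= rhs."""
    return all(sum(c * xi for c, xi in zip(coeffs, x)) <= rhs + tol
               for coeffs, rhs in constraints)


# Hypothetical feasible set: weights are nonnegative, sum to 1,
# and the first asset is capped at 60% of the portfolio.
constraints: List[LinearConstraint] = [
    ([1.0, 1.0, 1.0], 1.0),
    ([-1.0, -1.0, -1.0], -1.0),
    ([-1.0, 0.0, 0.0], 0.0),
    ([0.0, -1.0, 0.0], 0.0),
    ([0.0, 0.0, -1.0], 0.0),
    ([1.0, 0.0, 0.0], 0.6),
]

# Hand-written vertices of that polytope (stand-ins for RLP samples).
vertices = [[0.6, 0.4, 0.0], [0.6, 0.0, 0.4], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

rng = random.Random(7)
offspring = [convex_crossover(*rng.sample(vertices, 2), rng) for _ in range(100)]
```

Every element of `offspring` passes `satisfies`, so no repair step is needed after crossover in this setting.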


Pareto Sorting Evolutionary Algorithm (PSEA). We
developed a Pareto sorting evolutionary algorithm
(PSEA) that was able to robustly identify the Pareto
front of optimal portfolios defined over a space of returns
and risks. The algorithm used a secondary storage
and maintained the diversity of the population by using
a convex crossover operator, incorporating new random
solutions in each generation of the search, and using
a noncrowding filter. Given the reliance of the PSEA on
the continuous identification of nondominated points,
we developed a fast dominance filter to implement this
function very efficiently.
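
The notion of dominance on which the PSEA relies can be illustrated with a minimal filter. This is a plain quadratic-time sketch, not the fast dominance filter mentioned above, whose details are not given here. It assumes every objective is expressed as a quantity to minimize, so return is stored with a negative sign; the candidate values are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """True if p is no worse than q everywhere and strictly better somewhere
    (all objectives are minimized)."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))


def nondominated(points: Sequence[Point]) -> List[Point]:
    """Keep the points that no other point dominates (O(n^2) comparisons)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical portfolios as (-return, variance, SVaR).
candidates = [
    (-5.0, 2.0, 1.0),
    (-4.0, 2.5, 1.5),  # dominated by the first candidate
    (-6.0, 3.0, 1.2),
    (-5.5, 3.5, 1.3),  # dominated by the third candidate
]
front = nondominated(candidates)
```

Here `front` keeps the first and third candidates, which trade return against risk and therefore cannot be ranked without the decision maker's preferences.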


Target Objectives Genetic Algorithm (TOGA). We 
further enhanced the quality of the Pareto front by 
using a target objectives genetic algorithm (TOGA), 
a non-Pareto nonaggregating function approach to mul- 
tiobjective optimization. Unlike the PSEA, which was 
driven by the concept of dominance, the TOGA found 
solutions that were as close as possible to a predefined 
target for one or more criteria [41.60]. We used this to
fill potential gaps in the Pareto front. 


Decision Maker Preferences. We incorporated the
decision maker’s preferences in the return-risk tradeoff
to perform our selection. The goal was to reduce thousands
of nondominated solutions into a much smaller
subset (of approximately 10 points), which could be further
analyzed for a final portfolio selection. After obtaining a 3D
Pareto front, we augmented this space with three additional
metrics, to reflect additional constraints for use
in the tradeoff process. This augmented 6D space was
used for the down-selection problem. To incorporate
progressive ordinal preferences, we used a graphical
tool to visualize 2D projections of the Pareto front. After
applying a set of constraints to further refine the
best region, we used an ordinal preference, defined by
the order in which we visited and executed limited, local
tradeoffs in each of the available 2D projections of
the Pareto front. In this approach, the decision maker
could understand the available space of options and
the costs/benefits of the available tradeoffs. The use
of progressive preference elicitation provided a natural
mechanism to identify a small number of good solutions.


Results. The optimization process was successfully
tested on large portfolios of fixed-income base securities,
each portfolio involving over fifteen hundred
financial assets and investment decisions of several billion
dollars. For a more complete description of this
application, refer to [41.16].


41.5 Model Ensembles and Fusion


Over the last decade, we have witnessed an emerging
trend favoring the use of model ensembles over individual
models. The elements of these ensembles are
object-level models, the fusion mechanism is an online
MH, and their overall design is guided by offline MHs,
as discussed in Sect. 41.1.


41.5.1 Motivations for Model Ensembles


This trend is driven by the improved performance
obtained by ensembles. By fusing the outputs of an
ensemble of diverse predictive models, we boost the
overall prediction accuracy while reducing the variance.
Fumera and Roli [41.61] confirmed theoretically
the claims of Dietterich [41.62]. They proved that averaging
classifier outputs guarantees a better test
set performance than that of the worst classifier of the
ensemble. Moreover, under specific hypotheses, such as
linear combiners of individual classifiers with unbiased
and uncorrelated errors, the fusion of multiple classifiers
can improve on the performance of the best individual
classifier. Under ideal circumstances (e.g., with an infinite
number of classifiers) the fusion can provide the
optimal Bayes classifier [41.63]. All this is possible if
individual classifiers make different errors (diversity),
as we will discuss in Sect. 41.5.3.

There is also a computational motivation for using
model ensembles. Many learning algorithms are
based on local search and suffer from the problem
of local minima, which is usually resolved by multiple
independent initializations. In other cases, generating
the optimal training might be computationally
hard even with enough training data. The fusion
of multiple classifiers trained from different starting
points or training sets can better approximate the
optimal classifier at a fraction of the computational
cost.
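
The effect of averaging models with unbiased and uncorrelated errors can be checked with a small simulation. It is a sketch under assumptions that are not in the text: errors are drawn as independent zero-mean Gaussians with unit variance, the fusion is a plain average, and the function name `averaged_error_variance` is hypothetical. Under these assumptions the variance of the fused error should shrink roughly as 1/M for M models.

```python
import random
import statistics


def averaged_error_variance(num_models: int, num_trials: int, sigma: float,
                            rng: random.Random) -> float:
    """Empirical variance of the error of a plain average of num_models models,
    each with an independent N(0, sigma^2) error (an assumed error model)."""
    fused_errors = []
    for _ in range(num_trials):
        errors = [rng.gauss(0.0, sigma) for _ in range(num_models)]
        fused_errors.append(sum(errors) / num_models)
    return statistics.pvariance(fused_errors)


rng = random.Random(0)
single_model_variance = averaged_error_variance(1, 20000, 1.0, rng)
ensemble_variance = averaged_error_variance(10, 20000, 1.0, rng)
```

With correlated errors the reduction would be smaller, which is one way to see why diversity among the base models matters.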


41.5.2 Construction of Model Ensembles 


The ensemble construction requires the creation of base 
models, an ensemble topology, and a fusion mecha- 
nism. Let us briefly review these concepts. 


Base Models. Base models are the elements to be 
fused — they are the object-level models discussed in 
Sect. 41.1.1. They need to be diverse, i.e., they need to
have low error correlations. They could differ in their 
parameters, and/or in their structure, and/or in the ML 
techniques used to create them. The process for inject- 
ing diversity in their design is described in Sect. 41.5.3. 


Topology. The ensemble can be constructed by fol- 
lowing a parallel or serial topology (or in some cases, 
a hybrid one). The most common topology is the par- 
allel one, in which multiple models are fed the same 
inputs and their outputs are merged by the fusion mech- 
anism. In the serial topology, the models are applied 
sequentially (as in the case when we first use a primary 
model, and in case of it failing to accept a pattern, a sec- 
ondary model is used to attempt a classification). 


Fusion Mechanism. We divide the fusion mecha- 
nisms based on two criteria: (a) the type of aggregation 
that they perform; (b) the dependency on their inputs. 
The former is concerned with the regions of compe- 
tence of the base models to be aggregated and divides 
the fusion mechanisms into: selection, interpolation, 
and integration. The latter is concerned with the dependency
of the meta-model (fusion) on the inputs to the
ensemble and divides the fusion mechanisms into static
and dynamic ones. 

Based on the first criterion, we have the following 
types of fusion mechanisms: 


• Selection: used to fuse disjoint, complementary
models. In this case, the base models were trained 
on disjoint regions of the feature space and, for each 
pattern, just one model is responsible for the final 
decision. Selection determines a binary relevance 
weight of each complementary model (where all but 
one of the weights are zero). This mechanism is typ- 
ically used in hierarchical control systems, in which 
a supervisory controller (meta controller) selects the 
most appropriate low-level controller for any given 
state. Another example of this mechanism is the use 
of decision trees [41.26], in which the leaf node 
reached by the input/state determines the selected 
model. 

• Interpolation: used to fuse overlapping complementary
models. In this case, the base models were
trained on different but overlapping regions of the 
feature space and, for each state, a subset of models 
is responsible for the final decision. Interpolation 
determines a fuzzy relevance weight of each com- 
plementary model (where the weights are in the 
[0, 1] interval and they are usually normalized to 
add up to 1). By interpolating rather than switch- 
ing between models, we introduce smoothness in 
the response surface induced by the ensemble. This 
interpolation mechanism is typical of hierarchical 
fuzzy systems, usually found in fuzzy control appli- 
cations [41.36, 55]. 

• Integration: used to fuse competitive models.
In this case, all base models were trained on 
the same feature space and, for each input, all 
models contribute to the final decision accord- 
ing to their relevance weight. Integration deter- 
mines the relevance weight of each competitive 
model. 
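
The three aggregation types differ only in how the relevance weights are formed, which the following sketch tries to make concrete. The relevance scores, model outputs, and function names are hypothetical; in particular, uniform weights are used for integration only as the simplest possible choice.

```python
from typing import List, Sequence


def selection_weights(relevance: Sequence[float]) -> List[float]:
    """Binary weights: 1 for the single most relevant model, 0 for the others."""
    best = max(range(len(relevance)), key=lambda i: relevance[i])
    return [1.0 if i == best else 0.0 for i in range(len(relevance))]


def interpolation_weights(relevance: Sequence[float]) -> List[float]:
    """Fuzzy weights in [0, 1], normalized to add up to 1."""
    total = sum(relevance)
    if total <= 0.0:
        raise ValueError("at least one model must be relevant")
    return [r / total for r in relevance]


def integration_weights(num_models: int) -> List[float]:
    """All models contribute; uniform weights are assumed here."""
    return [1.0 / num_models] * num_models


def fuse(outputs: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of the model outputs."""
    return sum(w * y for w, y in zip(weights, outputs))


# Hypothetical query: model 0 is out of its region, models 1 and 2 overlap.
relevance = [0.0, 0.75, 0.25]
outputs = [10.0, 20.0, 40.0]

selected = fuse(outputs, selection_weights(relevance))
interpolated = fuse(outputs, interpolation_weights(relevance))
integrated = fuse(outputs, integration_weights(len(outputs)))
```

Selection returns the output of model 1 alone, interpolation blends models 1 and 2, and integration averages all three outputs.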


Based on the second criterion (input dependency of 
the meta model), we have the following fusion mecha- 
nisms: 


• Static fusion. The relevance weights are determined
in a batch mode by a static fusion meta-model (on- 
line MHs). The mechanism is applied uniformly 
to all inputs. This is the typical case of alge- 
braic expressions used to compute the relevance 
weights [41.64]. 


• Dynamic fusion. The relevance weights are determined
at run time, by a dynamic fusion meta-model
(online MHs). The weights vary according to the inputs.
This is the typical case of dynamic systems
used to compute the relevance weights [41.36-38].


As noted by Roli et al. [41.65], the design of a successful
fusion system consists of three parts: design of
the individual object-level models, selection of a set of
diverse models, and design of the fusion mechanism.
The operative word is diverse, where model diversity
is defined by low correlation among the object-level
model errors. In other words, these models should
be as accurate as possible while avoiding coincident
errors. This concept is described by Kuncheva and
Whitaker [41.66], where the authors propose four pairwise
and six nonpairwise diversity measures to determine
how much the models differ. A more complete treatment
of this topic can be found in [41.67].
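
Two simple pairwise indicators of diversity can be sketched as follows: the correlation between the error vectors of two models, and the disagreement measure, i.e., the fraction of points on which exactly one of two classifiers is correct. The function names and the toy data are hypothetical, and the sketch does not cover the full set of measures discussed in [41.66].

```python
import math
from typing import Sequence


def error_correlation(errors_a: Sequence[float], errors_b: Sequence[float]) -> float:
    """Pearson correlation between the error vectors of two models."""
    n = len(errors_a)
    mean_a = sum(errors_a) / n
    mean_b = sum(errors_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(errors_a, errors_b))
    var_a = sum((a - mean_a) ** 2 for a in errors_a)
    var_b = sum((b - mean_b) ** 2 for b in errors_b)
    return cov / math.sqrt(var_a * var_b)


def disagreement(correct_a: Sequence[int], correct_b: Sequence[int]) -> float:
    """Fraction of points on which exactly one of the two classifiers is correct."""
    differing = sum(1 for a, b in zip(correct_a, correct_b) if a != b)
    return differing / len(correct_a)


# Hypothetical regression errors: model B repeats A's errors, model C mirrors them.
errors_a = [1.0, -1.0, 2.0, -2.0]
errors_b = [1.0, -1.0, 2.0, -2.0]
errors_c = [-1.0, 1.0, -2.0, 2.0]

# Hypothetical classification outcomes (1 = correct, 0 = wrong).
correct_a = [1, 1, 0, 0]
correct_b = [1, 0, 1, 0]
```

Models A and B would add nothing to each other in an ensemble, whereas A and C make errors that cancel when averaged.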


41.5.3 Creating Diversity 
in the Model Ensembles 


Let us consider a model as a mapping from an n-dimensional
feature space F to a k-dimensional output
space Y. The model training dataset could be represented
as a flat file, in which each row is a point in the
cross-product F × Y and each column represents a coordinate
dimension for such points (either in the feature
space F or in the output space Y).

Among the many approaches for injecting diversity 
in the creation of an ensemble of models, we find bag- 
ging, boosting, random subspace, randomization, and 
random forest. Some of these approaches subsample the 
rows of the training set (points or examples), some other 
ones subsample the columns (features), and a few do 
both. Let us review some of these approaches in chrono- 
logical order. 

Bootstrap [41.68-70] or bagging [41.71] is arguably
the oldest technique for creating an ensemble
of models. In this approach, diversity is obtained by
building each model with a different set of examples,
which are obtained from the original training dataset
by resampling the rows with replacement (using a uniform
probability distribution over the rows). Bagging
combines the decisions of the classifiers using uniform-weighted
voting. For each new training dataset, we must
maintain the same number of rows as in the original
training dataset, by sampling it that many times. Sampling
with replacement leads to ~63.2% unique rows.
Sampling with replacement creates a series of independent
Bernoulli trials, so the number of times a row
is sampled in k trials out of N rows follows the binomial
distribution B(k, 1/N). For large values of N, this distribution
can be approximated by a Poisson distribution with mean k/N.
Therefore, the proportion of rows not sampled will be
approximately $e^{-k/N}$. In bootstrap, the number of samples
is equal to the number of rows, i.e., k = N, so the
Poisson approximation has a mean of 1 and the proportion
of sampled data is $(1 - e^{-k/N}) = (1 - e^{-1}) \approx 63.2\%$.
As a result, we can reduce storage by not duplicating
the same rows and instead attaching
a count at the end of each sampled record to indicate
the number of times it was selected. An interesting variation
of this concept is the Bag of Little Bootstraps
(BLB) [41.72], which modifies the bootstrap approach
to be usable with much larger data sets (where 63.2%
of the original data would still be prohibitively large).
The BLB approach performs a more drastic
subsampling while maintaining the unbiased estimation
and convergence rate of the original bootstrap method.
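
The 63.2% figure and the count-based storage idea can both be checked with a short simulation. This is a minimal sketch using only the standard library; the function name `bootstrap_counts` and the choice of 100,000 rows are arbitrary.

```python
import math
import random
from collections import Counter


def bootstrap_counts(num_rows: int, rng: random.Random) -> Counter:
    """Draw num_rows row indices uniformly with replacement and return a
    mapping {row index: number of times drawn}. Rows never drawn are absent,
    so the mapping stores each sampled row once, together with its count."""
    return Counter(rng.randrange(num_rows) for _ in range(num_rows))


rng = random.Random(42)
num_rows = 100_000
counts = bootstrap_counts(num_rows, rng)

unique_fraction = len(counts) / num_rows  # share of distinct rows in the sample
expected_fraction = 1.0 - math.exp(-1.0)  # Poisson approximation, about 0.632
```

The counter plays the role of the count attached to each sampled record: its values add up to the original number of rows, while only the distinct rows are stored.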

An alternative to bagging is boosting, which is
rooted in the probably approximately correct (PAC)
learning model [41.73-75]. Instead of training all classifiers
in parallel (as in the case of bagging), we construct
the ensemble in a serial fashion, by adding one
model at a time. The model added to the ensemble at
step j is trained on a dataset sampled selectively from
the original dataset. The sampling distribution starts
from a uniform distribution (as in bagging) and progresses
toward increasing likelihood of misclassified
examples in the new dataset. Thus, the distribution is
modified at each step, increasing the likelihood of the
examples misclassified by the classifier at step (j - 1)
being in the training dataset at step j. Like bagging,
boosting combines the decisions of the classifiers using
uniform-weighted voting.

AdaBoost (or adaptive boosting) [41.74] extends
boosting from binary to multiclass classification problems
and regression models. It adapts the probability
distribution over the rows in the training set to increase
the difficulty of the training points by including more
instances misclassified or wrongly predicted by previous
models. AdaBoost combines the decisions of the
classifiers using weighted voting. For regressions, it
aggregates all the normalized confidences for the output.
For multiclass classification, it selects the class with
the highest votes, calculated from the normalized classification
errors of each class.
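
How the distribution over the rows shifts toward misclassified examples can be sketched with a single reweighting step. The update below is the commonly cited binary AdaBoost rule from the boosting literature, used here as an assumption; it is not the multiclass or regression variant described above, and the function name and toy data are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def adaboost_reweight(weights: Sequence[float],
                      misclassified: Sequence[bool]) -> Tuple[float, List[float]]:
    """One binary AdaBoost step: return the classifier's voting weight (alpha)
    and the renormalized distribution over the training rows."""
    eps = sum(w for w, miss in zip(weights, misclassified) if miss)
    if not 0.0 < eps < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    # Misclassified rows are scaled up, correctly classified rows are scaled down.
    scaled = [w * math.exp(alpha if miss else -alpha)
              for w, miss in zip(weights, misclassified)]
    total = sum(scaled)
    return alpha, [w / total for w in scaled]


# Hypothetical round: five rows with uniform weights, the third one misclassified.
weights = [0.2] * 5
misclassified = [False, False, True, False, False]
alpha, new_weights = adaboost_reweight(weights, misclassified)
```

After this step the misclassified row carries half of the total weight, so it is far more likely to appear in the next training sample, and `alpha` is the weight the classifier would receive in the final vote.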

A different approach to inject diversity is to limit
the number of columns (features), rather than the number
of rows (points). Ho’s random subspaces technique [41.76]
selects random subsets of the available
features to be used in training the individual classifiers
in the ensemble.

Dietterich [41.77] introduced an approach called 
randomization. In this approach, at each node of each 
tree of the ensemble, the 20 best attributes to split the 
node are determined and one of them is randomly se- 
lected for use at that node. 

Breiman [41.32] presented random forest ensem- 
bles, where bagging is used in combination with ran- 
dom feature subspace selection. At each node of each 
tree of the forest, a subset of m attributes (out of n 
available ones) is randomly selected, and the best split 
available based on the m attributes is selected for that 
node. Clearly, if m were too small, the tree performance 
would be severely affected, while if m were too close to 
the value of n, each tree performance would be higher, 
but diversity would suffer. In the case of random forest, 
a tradeoff between individual performance and overall 
diversity is achieved by using a value of m around
$\lfloor \sqrt{n} \rfloor$ for classification problems, and around
$\lfloor n/3 \rfloor$ for regression problems.
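
The per-node feature sampling can be sketched in a few lines. The function names are hypothetical, the lower bound of one feature is an added safeguard for very small n, and the sketch covers only the choice of candidate attributes, not the search for the best split among them.

```python
import math
import random
from typing import List


def num_split_candidates(n_features: int, task: str) -> int:
    """Number m of attributes examined at each node: floor(sqrt(n)) for
    classification and floor(n / 3) for regression, with at least one."""
    if task == "classification":
        m = math.isqrt(n_features)
    elif task == "regression":
        m = n_features // 3
    else:
        raise ValueError("task must be 'classification' or 'regression'")
    return max(1, m)


def sample_split_features(n_features: int, task: str, rng: random.Random) -> List[int]:
    """Randomly pick the m attribute indices to consider at one tree node."""
    m = num_split_candidates(n_features, task)
    return sorted(rng.sample(range(n_features), m))


rng = random.Random(3)
node_features = sample_split_features(100, "classification", rng)
```

A fresh subset would be drawn at every node of every tree, which is what decorrelates the trees beyond what bagging alone provides.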

Other approaches to increase diversity rely on the 
use of a high-level model to combine object-level mod- 
els derived from different machine-learning techniques, 
e.g., stacked generalization [41.78]. Alternatively, we 
can inject structural diversity in the design of the ob- 
ject models by using different topologies/architectures 
in graphical models (e.g., neural networks) or different 
function sets/grammars in genetic programming algo- 
rithms to construct models [41.79]. 

The above approaches allow us to extract different 
types of information from the data, which should lead to 
lower error correlations among the models. With these 
approaches we can generate a space of diverse models, 
which can then be searched by offline MHs to tune and 
optimize the model ensemble, according to tradeoffs of 
performance and diversity. 


41.5.4 Lazy Meta-Learning: 
A Model-Agnostic Fusion Mechanism 


A second trend in the development of analytics models 
is the inevitable commoditization of object-level models.
Multiple sources for model creation are now available,
ranging from crowdsourcing analytics by competition
(e.g., [41.80]) or by collaboration (e.g., [41.81]) to
cloud-based model automation tools, such as evolving 
model populations using genetic programming [41.79]. 
This situation creates different requirements for the fusion
mechanism, which now should be agnostic with
respect to the genesis of the object-level models in the
ensemble. 

We should note that all the previous approaches to 
fusion described in Sect. 41.5.3 use static fusion mecha- 
nisms, as they focus primarily on the creation of diverse 
base models (or object-level models). If we want to be
agnostic with respect to these models, we need to have
a smarter fusion mechanism, i.e., a meta model that
can reason about the performance and applicability of
the available object-level models.

This issue is partially addressed by Lazy Meta-Learning,
proposed in [41.82]. In this approach, for
each query we instantiate a customized fusion mechanism.
Such a mechanism is a meta model, i.e., a model
that operates on the object-level models whose predictions
we want to fuse. Specifically, for a given query we
dynamically (i.e., based on the query) create a model
ensemble, followed by a customized fusion. The dynamic
model ensemble consists of:


(1) Finding the most relevant object-level models 
from a DB of models, by matching their meta- 
information with the query. 

(2) Identifying the relevant models with higher perfor- 
mance. 

(3) Selecting a subset of models with highly uncorre- 
lated errors to create the ensemble. 


The customized fusion uses the meta-information of 
the models in the ensemble for dynamic bias compensa- 
tion and relevance weighting. The output is a weighted 
interpolation or extrapolation of the outputs of the 
model ensemble. 

More specifically the Lazy Meta-Learning process 
is divided into three stages: 


• Model creation, an offline stage in which we create
the initial building blocks for the assembly (or we
collect or acquire them from other sources) and
we compile their meta-information

• Dynamic model assembly, an online stage in which,
for a given query, we select the best subset of models

• Dynamic model fusion, an online stage in which we
evaluate the selected models and dynamically fuse
them to solve the query.


Model Creation: The Building Blocks 
We assume the availability of an initial training set that
samples an underlying mapping from a feature space X
to an output y. In the case of supervised learning, we
also know the ground-truth value t for each record in
the training set. We create a database DB of m diverse,
local or global models developed by any source. If we
have control over the model creation, we can increase
model diversity by any of the techniques described in
Sect. 41.5.3. Every time we add a model to the DB,
we need to capture its associated meta-information, i.e.,
information about the model itself, its training set, and
its local/global performance. Such meta-information is
used to create indices in the DB that will make its search
more efficient. For each model M_i, we use a compiled
summary of its performance, represented by a CART
tree T_i of depth d_i, trained on the model error vector
obtained during the validation of the model. To avoid
overfitting, each tree is pruned to allow at least 25 points
in each leaf node.


Dynamic Model Ensemble: 
Query-Driven Model Selection and Ensemble 
This stage is divided into three steps: 


• Model Filtering, in which we retrieve from the
DB the applicable models for the given query. For
a query q, the process starts with a set of constraints
to define its related feasibility set. In this case the
constraints are:

(a) Model soundness and competency in its region
of applicability (i.e., there must be sufficient
points in the training set to develop a reliable
model).

(b) Model vitality (i.e., the model is up-to-date, not
obsolete).

(c) Model applicability to the query (i.e., the query
is in the model’s competence region).

The intersection of the sets of models satisfying these
constraints gives us a set of retrieved models for the query q.
Let us denote the cardinality of this set as r.

• Model Preselection, in which we reduce the number
of models, based on their local performance characteristics,
such as bias, variability, and distance from
the query. For a query q, having retrieved its r feasible
models from the previous step, we classify
the query using the CART tree T_i associated
with each model M_i, and reach leaf node L_i(q). Each leaf
node will be defined by its path to the root of the tree
and will contain d_i constraints over (at most) d_i features.
Leaf L_i(q) provides performance estimates of
model M_i in the region of the query. These estimates
are used to retrieve the set of Pareto-best models to
be used in the next step. Let us denote the cardinality
of this set as p.

Fig. 41.14 Dynamic model ensemble on demand (filtering, selection)


• Model Final Selection, in which we define the final
model subset. We need to use an ensemble whose 
elements have the most uncorrelated errors. We use 
the Entropy Measure E, proposed by Kuncheva and 
Whitaker, as the way to find the k most diverse 
models to form the ensemble. To avoid the intrin- 
sic combinatorial complexity, we approximate our 
search via a greedy algorithm (further described 
in [41.82]). 
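
A greedy search for a diverse subset can be sketched as follows. The entropy measure is written in the form usually attributed to Kuncheva and Whitaker, E = (1/N) Σ_j min(l_j, L − l_j) / (L − ⌈L/2⌉), where l_j is the number of the L models that are correct on point j. The greedy forward selection shown here is a hypothetical stand-in, since the actual procedure is described in [41.82] rather than in this section, and the correct/incorrect matrix is toy data (for regression models it would require an assumed error threshold).

```python
import math
from itertools import combinations
from typing import List, Sequence


def entropy_measure(oracle: Sequence[Sequence[int]]) -> float:
    """Entropy diversity measure for L >= 2 models.
    oracle[i][j] is 1 if model i is correct on point j, else 0."""
    num_models = len(oracle)
    if num_models < 2:
        raise ValueError("the measure needs at least two models")
    num_points = len(oracle[0])
    denom = num_models - math.ceil(num_models / 2)
    total = 0
    for j in range(num_points):
        correct = sum(row[j] for row in oracle)
        total += min(correct, num_models - correct)
    return total / (num_points * denom)


def greedy_diverse_subset(oracle: Sequence[Sequence[int]], k: int) -> List[int]:
    """Hypothetical greedy search: start from the most diverse pair, then
    repeatedly add the model that maximizes the entropy of the subset."""
    if not 2 <= k <= len(oracle):
        raise ValueError("k must be between 2 and the number of models")
    indices = range(len(oracle))
    selected = list(max(combinations(indices, 2),
                        key=lambda pair: entropy_measure([oracle[i] for i in pair])))
    while len(selected) < k:
        remaining = [i for i in indices if i not in selected]
        best = max(remaining,
                   key=lambda i: entropy_measure([oracle[m] for m in selected + [i]]))
        selected.append(best)
    return sorted(selected)


# Toy data: models 0 and 1 are identical, model 2 is their complement.
oracle = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [1, 0, 1, 0, 1, 0],
]
chosen = greedy_diverse_subset(oracle, 2)
```

The greedy search evaluates a number of subsets that grows polynomially with the number of preselected models, instead of enumerating every subset of size k.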


The process is described in Fig. 41.14. 


Dynamic Model Fusion: Generating the Answer 
Finally, we evaluate the selected k models, compensate 
for their biases, and aggregate their outputs using a rel- 
evance weight that is a function of their proximity to 
the query to generate the solution to the query. This is 
illustrated in Fig. 41.15. 
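
The fusion step can be sketched as a bias-compensated, proximity-weighted average. Several details below are assumptions rather than the method of [41.82]: the local bias of each model is taken as its mean signed error (prediction minus truth) in the leaf reached by the query, relevance is computed with a Gaussian kernel of the distance to the query, and the function name, bandwidth, and numbers are hypothetical.

```python
import math
from typing import Sequence


def dynamic_fusion(predictions: Sequence[float], local_biases: Sequence[float],
                   distances: Sequence[float], bandwidth: float = 1.0) -> float:
    """Subtract each model's local bias, then average the corrected outputs
    with weights that decay with the model's distance to the query."""
    # Assumed relevance function: Gaussian kernel of the distance.
    weights = [math.exp(-((d / bandwidth) ** 2)) for d in distances]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all models are too far from the query")
    corrected = [p - b for p, b in zip(predictions, local_biases)]
    return sum(w * c for w, c in zip(weights, corrected)) / total


# Hypothetical query answered by two selected models.
predictions = [10.0, 12.0]
local_biases = [1.0, -1.0]  # model 0 overestimates locally, model 1 underestimates

equally_close = dynamic_fusion(predictions, local_biases, distances=[0.5, 0.5])
one_far_away = dynamic_fusion(predictions, local_biases, distances=[0.0, 100.0])
```

When both models are equally close to the query the result is the mean of the corrected outputs, and when one model is far away the answer collapses to the corrected output of the nearby model.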

Fig. 41.15 Dynamic model fusion on demand

This approach was successfully tested against a regression
problem for a coal-fired power plant optimization.
The optimization problem, described in [41.83],
required adjusting 20+ set-point values for the power
... emissions (NOx) and heat rate (inverse of efficiency). The
optimization was predicated on having a reliable fitness
function, i.e., a high-fidelity mapping between the
20+ input vector and the three outputs (load, NOx, heat
rate).

In [41.82], we focused on generating this mapping
via dynamic fusion. We used a database of approximately
75 models: 30 neural networks (NNs) trained
using bootstrapping and about 45 symbolic regression
(SR) models evolved on the MIT Cloud using the
same training set of 5000+ records. We applied the dynamic
fusion approach, described in this section, and
evaluated it on a disjoint validation set made of 2200+
records. The results of the mean of the absolute error
(MAE) computed over this validation set are summarized
in Table 41.4.

Table 41.4 Static and dynamic fusion for: (a) 30 NNs; (b) 45 SR models; (c) 75 combined models

                     NOx                              Heat rate                      Load
                     (a) NN   (b) SR   (c) NN+SR      (a) NN   (b) SR   (c) NN+SR    (a) NN   (b) SR   (c) NN+SR
Baseline (average)   0.02279  0.03267  0.02877        91.79    109.15   101.14       1.0598   1.5275   1.3149
Dynamic fusion       0.01627  0.01651  0.01541        70.95    73.90    70.90        0.8474   0.8243   0.8209
Percentage gain      28.6%    49.5%    46.4%          22.7%    32.3%    29.9%        20.0%    46.0%    37.6%

The first conclusion is that dynamic fusion consistently
outperformed static fusion, as shown by the
percentage gain (last row in Table 41.4), which was
computed as the difference between baseline and dynamic
fusion, as a percentage of the baseline. In the
cases of NOx and load, the baseline (average) for the
45 SR models was ~50% worse than that of the 30
NNs. In creating the SR, we sacrificed individual performance
to boost diversity, as described in Sect. 41.5.3.
On the other hand, the NNs were trained for performance,
while diversity was only partially addressed by
bootstrap (but the NNs were trained with the same feature
set and the same topology). After dynamic fusion
the performance of the SR was roughly comparable
with the NNs (within 2.7%), and the overall performance
of the combined models was 3-5% better than
that of the NNs alone. This experiment verified the importance
of diversity during model creation. A more
complete treatment of Lazy Meta-Learning, including
results of these experiments, can be found in [41.82].
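
The percentage gains reported in Table 41.4 can be recomputed from the baseline and dynamic fusion rows with the definition given above (difference as a percentage of the baseline). The numbers below are copied from the table; the Load baseline for the SR models is read as 1.5275, which is the value consistent with the reported 46.0% gain. The dictionary layout and function name are only illustrative.

```python
# MAE values from Table 41.4, ordered as (a) NN, (b) SR, (c) NN+SR.
baseline = {
    "NOx": [0.02279, 0.03267, 0.02877],
    "Heat rate": [91.79, 109.15, 101.14],
    "Load": [1.0598, 1.5275, 1.3149],
}
dynamic_fusion = {
    "NOx": [0.01627, 0.01651, 0.01541],
    "Heat rate": [70.95, 73.90, 70.90],
    "Load": [0.8474, 0.8243, 0.8209],
}


def percentage_gain(baseline_mae: float, fused_mae: float) -> float:
    """Difference between baseline and fused MAE, as a percentage of the baseline."""
    return 100.0 * (baseline_mae - fused_mae) / baseline_mae


gains = {
    output: [round(percentage_gain(b, f), 1)
             for b, f in zip(baseline[output], dynamic_fusion[output])]
    for output in baseline
}
```

Rounded to one decimal place, the recomputed gains match the last row of the table for all three outputs.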


41.6 Summary and Future Research Challenges 


We illustrated the use of CI techniques in ML applications.
We explained how to leverage CI to build
meta models (for offline design and online control/aggregation)
and object-level models (for solving
the problem at hand). We described the most
typical ML functions: unsupervised learning (clustering),
supervised learning (classification and regression),
and optimization. To structure the case studies
described in this review, we presented two similar
paradigms: PHM for industrial assets, and risk management
for financial and commercial assets. We analyzed
five case studies to show the use of CI models
in:


(1) Unsupervised learning for anomaly detection 
(based on neural networks, fuzzy systems, and 
EAs). 


(2) Supervised learning (classification) for assessing 
and pricing risk in insurance products (based on 
fuzzy systems and EAs). 

(3) Supervised learning (regression) for valuating mort- 
gage collaterals (based on radial basis functions, 
fuzzy systems, neural fuzzy systems, and fusion). 

(4) Supervised learning (regression-induced ranking) 
for selecting the best units in a fleet (based on fuzzy 
systems and EAs). 

(5) Multiobjective optimization for rebalancing a port- 
folio of investments (based on multiobjective EAs). 


In the last section we covered model ensembles and fusion, and
emphasized the need for injecting diversity during the model creation
stage. We proposed a model-agnostic fusion mechanism that could be used
to fuse commoditized models (such as the ones obtained by
crowdsourcing). We will conclude this review with a prospective view of
research challenges for ML.


41.6.1 Future Research Challenges

All applications described in the five case studies were developed
before the advent of cloud computing and big data. Since then, we have
encountered situations in which we need to analyze very large data sets,
enabled by the Internet of things (IoT) [41.84], machine-to-machine
(M2M) connectivity, and social media. In this new environment, we need
to scale up the CI/ML capabilities and address the underlying three V's
in big data: volume, velocity, and variability. Large data volumes pose
new challenges to data storage, organization, and query; data feed
velocity requires novel streaming capabilities; data variability
requires the collection and analysis of structured, unstructured, and
semistructured (e.g., locational) data and the ability to learn across
multiple modalities. The use of cloud computing will also result in the
commoditization of analytics. In this context, ML applications will
also include the delivery of Analytics-as-a-Service (AaaS) [41.85].

We will conclude this section with a view of five research challenges
entailed by this new environment, as illustrated in Fig. 41.16:

(1) Data-driven Model Automation and Scalability

a. Computation at the edge (velocity). When the cost of moving large
data sets becomes significant, we need to perform analysis while the
data are still in memory, via in-situ analytics and in-transit data
transformation [41.86]. A variety of solutions have been proposed,
ranging from SW frameworks [41.87] to active Flash [41.88].

b. Technology stack for Big Data (volume). We need to address
scalability issues along many dimensions, such as data size, number of
computational nodes over which to distribute the algorithms, number of
models to be trained/deployed, etc. Among the research groups
addressing this issue, we found the UC Berkeley AMP Lab [41.89] to be
among the leaders in this area. The AMP researchers have developed the
Berkeley data analytics stack (BDAS) [41.90], a technology stack
composed of Shark (to run structured query language (SQL) and complex
analytics on large clusters) [41.91, 92], Spark (to reuse working sets
of data across multiple parallel operations, typical of ML algorithms)
[41.93], and Mesos (to share commodity clusters between multiple
diverse cluster computing frameworks, such as Hadoop and message
passing interface (MPI)) [41.94].

c. Parallelization of ML Algorithms (volume). We need to design ML
algorithms so that their computation can be distributed. Some
algorithms are easy to parallelize, like population-based EAs (using an
island [41.95, 96] or a diffusion grid model to distribute the
subpopulations to many computational nodes), or Random Forest (growing
subsets of trees on different computational nodes) [41.97]. Other
algorithms will need to be redesigned for parallelization.

d. Multimodal learning (variability). As the size of the information
grows, its content will become more heterogeneous. For instance, by
navigating through web pages, we encounter information represented as
text, images, audio, tables, video, applets, etc. There are preliminary
efforts for representing and learning across multiple modalities using
graphical approaches [41.98] and kernels [41.99, 100]. However, a
comprehensive approach to this issue remains an open problem.

Fig. 41.16 Research challenges for CI/ML analytics


(2) ML/Human Interactions

a. Upgrade the human role. We need to remove the human modeler from the
most time-consuming, iterative tasks, by automating data scrubbing
(outlier removal, de-noising, imputation, or elimination of missing
data) and data preparation (multiple sources integration, feature
selection, feature generation). As noted in [41.86],
while:


All high performance computing (HPC) com- 
ponents — power, memory, storage, bandwidth, 
concurrence, etc. — will improve performance by 
a factor of 3 to 4444 by 2018... human cogni- 
tive capability will certainly remain constant. 


We are obviously the bottleneck in any kind of 
automation process and we need to upgrade our 
role, by interacting with the process at higher 
levels of model design. For example, in active 
learning we maximize the information value of 
each additional question to be answered by a hu- 
man expert [41.101, 102]. In interactive multi- 
criteria decision making we can use progressive 
preference articulation. This allows the expert 
to guide the automated search in the design 
space, by interactively simplifying the problem, 
e.g., by transforming an objective into a con- 
straint once the values of most solutions fall 
within certain ranges for that objective [41.103]. 

b. Non-expert ML users. For routine modeling tasks, we need to enable
nonexpert users to define analytics in a declarative rather than
procedural fashion. In [41.104] we can see an example of this concept,
based on the analogy between the MLbase language for ML and traditional
SQL languages for DBs.

c. Integration of crowdsourcing with analytics engines. Crowdsourcing
is an emerging trend that is increasing human capacity in a manner
similar to the way cloud computing is increasing computational
capacity. In this analogy, we could think of Amazon Mechanical Turk
[41.105] as the dual of Amazon Elastic Compute Cloud [41.106].
Originating from the concept of Wisdom of Crowds [41.107],
crowdsourcing has shown tremendous growth [41.108]. According to Malone
et al. [41.109], the crowd's contribution to the solution of a
problem/task can be done via collection, competition, or collaboration.

i. Collection is used when the task can be 
decomposed into independent micro-tasks 
that are then executed by a large crowd to 
generate, edit, or augment information. An 
example of this case is the labeling of videos 
or images to create a training set for super- 
vised learning [41.110]. The annotation of 
galaxy morphologies or lunar craters done 
at Zooniverse [41.111] is another example 
of this case. 

ii. Competition is used when a single individ- 
ual in the crowd can provide the complete 
solution to the task. An example of this 
case is the creation of data-driven models 
via contests hosted by sites such as Kaggle [41.80]
or CrowdANALYTICS [41.81].

iii. Collaboration is used when single individu- 
als in the crowd cannot provide a complete 
solution and the task cannot be decomposed 
into independent subtasks. This situation re- 
quires individuals to collaborate toward the 
generation of the solution. Examples of this 
case are usually large projects, such as the 
development of Linux or Wikipedia. 

Additional crowdsourcing systems are surveyed in [41.112]. We are
interested in the intersection of crowdsourcing with machine learning
and big data, which is also the focus of the UCB AMP Lab [41.89]. Among
the research trends in this area, we find the impact of crowdsourcing
on DB queries, such as changing the closed-world assumption of DB
queries [41.113], monitoring query progress when crowdsourced inputs
are expected [41.114], and using the crowd to interpret queries
[41.115]. Furthermore, we can also apply ML techniques to improve the
quality of the crowd-generated outputs, reducing variability in
annotation tasks [41.116], and performing bias removal from the outputs
[41.117]. In the future, we expect many more opportunities for CI/ML to
leverage crowdsourcing and enhance the quality of its outputs.

d. New generations of Data Scientists. There is a severe skill gap,
which we will need to overcome if we want to accelerate the
applicability of ML to a broader set of problems. We need to train a
new generation of data scientists, with skills in data flow
(collection, storage, access, and mobility), data curation
(preservation, publication, security, description, and cleaning), and
basic analytics skills (applied statistics, ML/CI). Several
universities are creating customized curricula to address this need.
The National Consortium for Data Science is an illustrative example of
this emerging trend [41.118].

e. Extreme scale visual analytics. We can cope with increases in data
volume/velocity and analysis complexity because we have benefited from
similar increases in computing capacity. Unfortunately, as aptly noted
in [41.86], "... there is no Moore's law for human cognitive
abilities." So, we face many challenges, described in the same
reference, when we want to present/visualize the results of the complex
analytics to the user (as in the case of multilevel hierarchies,
time-dependent data, etc.). There are situations in which the user can
select data with certain characteristics to be used in the analysis and
steer data summarization and triage performed by the ML algorithms
[41.86]. This will also overlap with the previous category of upgrading
the human role.


(3) Decision Making and Uncertainty

a. Quality/Computation tradeoff. When faced with massively large data
sets, we need to distribute data and models over multiple nodes. Often
we also need to subsample the data sets while training the models. This
might introduce biases and increase variances in the models' results.
The use of the Bag of Little Bootstraps [41.72] allows us to extend
bootstrap (Sect. 41.5.3) to large data sets. However, not all queries
or functions will work with bootstrap. BlinkDB is a useful tool to
address this problem, as it allows us to understand if additional
resources will actually improve the quality of the answer. BlinkDB is
a [41.119]:


... massively parallel, sampling-based approximate query engine for
running ad-hoc, interactive SQL queries on large volumes of data. It
allows to trade-off query accuracy for response time, enabling
interactive queries over massive data by running queries on data
samples and presenting results annotated with meaningful error
bars... .


b. Anytime ML Algorithms. Anytime algorithms are especially needed for
online ML applications in which models need to produce results within
given real-time constraints [41.120]. In [41.121] the authors propose a
method for determining when to terminate the learning phase of an
algorithm (for a sequence of iid tasks) to optimize the expected
average reward per unit time. In simpler situations, we want to be able
to interrupt the algorithm and use its most recent cached answer. This
idea is related to the quality/computation tradeoff of point (3.a), as
we expect the quality of the answer to increase if we allow more
computational resources. For example, in EAs, under convergence
assumptions, we expect further generations to have a better fitness
than the current one. At any time we can stop the EA and fetch the
answer for the current population.

c. Evidence and Uncertainty Representation. This is one of the
extreme-scale visual analytics challenges covered in [41.86]. In the
two previous points, we noted that as the data size increases we would
need to perform data subsampling to meet real-time constraints. This
subsampling will introduce even greater uncertainty in the process. We
need better ways to quantify and visualize such uncertainty and provide
the end-user with intuitive views of the information and its underlying
risk.
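
The anytime behavior described in point (3.b) can be sketched as a search loop that always holds a usable best-so-far answer and simply returns it when a time budget expires. This is a minimal sketch under stated assumptions, not the method of [41.120] or [41.121]: a single-individual bit-flip search stands in for a full EA, and the names `anytime_search` and `time_budget_s` are hypothetical.

```python
import random
import time


def anytime_search(fitness, n_bits, time_budget_s, rng):
    """Improve a bit string until the deadline, then return the cached best.

    Assumption: fitness is maximized and candidates are bit lists.
    """
    best = [rng.randint(0, 1) for _ in range(n_bits)]
    best_fit = fitness(best)
    deadline = time.monotonic() + time_budget_s
    while time.monotonic() < deadline:
        # Flip each bit independently with probability 1/n_bits.
        candidate = [b ^ 1 if rng.random() < 1.0 / n_bits else b for b in best]
        cand_fit = fitness(candidate)
        if cand_fit >= best_fit:
            # The cached answer never gets worse, so stopping is always safe.
            best, best_fit = candidate, cand_fit
    return best, best_fit


# Usage: count-the-ones fitness, interrupted after 50 ms.
best, best_fit = anytime_search(sum, n_bits=32, time_budget_s=0.05,
                                rng=random.Random(0))
print(best_fit)
```

A larger budget can only keep or improve the returned fitness, which is the quality/computation tradeoff of point (3.a) in miniature.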


(4) Model Ensemble/Fusion

a. Integration of structured and unstructured data. The simplest case
is the integration of time-dependent text (e.g., news, reports, logs)
with time series data, e.g., text as a sensor. This topic is related to
the multimodal learning discussion in (1.d). Alternatively, instead of
learning across multiple data formats, we can use an ensemble of
modality-specific learners and fuse their outputs. An example of this
approach can be found in [41.122].

b. Integration of physics-based and data-driven models. There are at
least two ways to perform such integration: using the models in
parallel or serially. The simplest integration is based on a parallel
architecture, in which both models are used simultaneously, yet
separately. This case covers the use of data-driven models applied to
the residuals between expected values (generated by physics-based
models) and actual values (measured by sensors). This was illustrated
in the first case study (Sect. 41.3.1). Another example of parallel
integration is the use of an ensemble of physics-based and data-driven
models, followed by an agnostic fusion mechanism (Sect. 41.5.4). A
different type is the serial integration, in which one model is used to
initialize the other one. An example of such integration is the use of
data-driven models to generate estimates of parameters and initial
conditions of physics-based models. For instance, we could use
data-driven RUL predictive models to estimate the current degree of
degradation of key components in an electro-mechanical system. Then, we
could run a physics-based model of the system, using these estimates as
initial conditions, to determine the impact of future load scenarios on
the RUL predictions. Another example of serial integration is the use
of physics-based models to generate (offline) a large data set, usually
following a Design of Experiment methodology. This data set becomes the
training set for a data-driven model that will functionally approximate
the physics-based model. Typically, a second data-driven model,
frequently retrained with real-time data feeds, will be used to correct
the outputs of the static approximation [41.123].

c. Model-agnostic fusion. In Sect. 41.5.4 we covered the concept of
model-agnostic fusion, to be deployed when predictive models are
created by a variety of sources (such as crowdsourcing via competition
or cloud-based genetic programming). We showed that this type of fusion
is a meta model that leverages each predictive model's meta-data,
defining its region of applicability and relative level of performance
[41.82]. Additional research is needed to prevent over-fitting of the
meta models and to extend this concept to classification problems.


d. Model diversity by design. By leveraging the almost infinite
computational capacity of the cloud, we should be able to construct
model ensembles that are diverse by design. There are many techniques
used to inject diversity in the model design, as described in
Sect. 41.5.3. One of the most promising techniques consists in evolving
a large population of symbolic regression models by distributing
genetic programming algorithms using an island approach [41.95, 96].
Random feature subsets can be assigned to each island, which also
differs from the other islands through the use of distinct grammars,
fitness functions, and function sets. Additional information on this
approach can be found in [41.79, 124, 125].


(5) Special Topics in Machine Learning

a. Deep Learning. Originally proposed by Fukushima [41.126], deep
learning (DL) gained acceptance when Hinton showed that DL training was
decomposable [41.127, 128]. He showed that each of the layers in the
neural network could be pretrained one at a time, as an unsupervised
Restricted Boltzmann Machine, and then fine-tuned using supervised
backpropagation. This discovery allows us to use large (mostly
unlabeled) data sets available from Big Data applications to train DL
networks.

b. Learning with Graphs. There are many applications of Recommender
Systems in social networks and targeted advertising. Typically these
systems select information using collaborative filtering (CF), a
technique based on collaboration among multiple agents, viewpoints, and
data sources. Researchers have proposed various solutions to overcome
some of the intrinsic challenges caused by data sparsity and network
scalability. Among the most notable approaches we have:

(1) Pregel [41.129]: a synchronous message 
passing abstraction in which all vertex- 
programs run simultaneously in a sequence 
of super steps. 

(2) GraphLab [41.130]: an asynchronous dis- 
tributed, shared-memory abstraction, de- 
signed to leverage attributes typical of ML 
algorithms, such as sparse data with local 
dependences, iterative algorithms, and po- 
tentially asynchronous execution. 

(3) PowerGraph [41.131] and its Spark imple- 
mentation GraphX [41.132]: an abstraction 
combining the best features of Pregel and 
GraphLab, better suited for natural graphs 
applications with large neighborhoods and 
high degree vertices. 


The above examples are just a sample of specialized ML algorithms and
architectures for niche opportunities, exploiting the characteristics
of their respective problems.

In conclusion, we need to shape CI research to address these new
challenges. To remain a vital discipline during the next decade, we
need to leverage the almost infinite capacity provided by cloud
computing and

While much has been discovered about the properties of simple
evolutionary algorithms (EAs), based on evolving a single individual,
there is very little work touching on the properties of
population-based systems. We highlight some of the work that does exist
in this chapter.

Genetic algorithms (GAs) are a particular class of evolutionary
computation methods often used for optimization problems. They were
originally introduced by Holland [42.1] at around the same time when
other evolutionary methods were being developed, and popularized by
Goldberg's much-cited book [42.2]. They are characterized by the
maintenance of a population of search points, rather than a single
point, and the evolution of the system involves comparisons and
interaction between the points in the population. They are usually used
for combinatorial optimization problems, that is, where the search
space is a finite set (typically with some structure). The most common
examples use fixed-length binary strings to represent possible
solutions, though this is by no means always the case: the search space
representation should be chosen to suit the particular problem class of
interest. Members of the search space are then evaluated via a fitness
function, which determines how well they solve the particular problem
instance. This is, of course, by analogy with natural selection, in
which the fittest survive and evolve to produce better and better
solutions. Notice that the efficiency of a genetic algorithm is
therefore measured in terms of the number of evaluations of the fitness
function required to solve the problem (rather than a more direct
measure of the computational complexity). For any well-defined problem
class, the maximum number of function evaluations required by an
algorithm to solve a problem instance is called the black box
complexity of the algorithm for that problem class. The black box
complexity of a problem class is defined to be the maximum number of
function evaluations required by the best possible black box algorithm,
that is, the smallest such worst-case number achievable by any black
box algorithm. This is a research topic in its own right [42.3].

As an example, consider the subset sum problem, in which a set of n
integers is given, along with a target integer T, and we have to find a
subset whose sum is as close to T as possible. We can represent subsets
as binary strings of length n, in which a 1 indicates that an element
is in the subset and a 0 that it is excluded. The binary string forms
the analog of the DNA (deoxyribonucleic acid) of the corresponding
individual solution. Its fitness would then be given by the difference
between the corresponding sum and T, which we are trying to minimize.

For a different example, consider the traveling 
salesman problem, in which there are a number of 
cities to be visited. We have to plan a route to visit 
each city once and return home, while minimizing the 
distance traveled. A potential solution here could be ex- 
pressed as a permutation of the list of cities (acting as 
the DNA), with the fitness given by the corresponding 
distance. 
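
The two encodings above can be made concrete with a short sketch. The function names `subset_sum_fitness` and `tour_length`, the sample data, and the use of Euclidean distance between planar coordinates are all assumptions made for illustration. For subset sum, the quantity to be minimized is taken to be the gap between the subset's sum and the target T.

```python
import math


def subset_sum_fitness(bits, values, target):
    """Gap between the selected subset's sum and the target (to be minimized)."""
    chosen_sum = sum(v for b, v in zip(bits, values) if b == 1)
    return abs(chosen_sum - target)


def tour_length(order, coords):
    """Length of the closed tour visiting cities in the given permutation."""
    total = 0.0
    for i in range(len(order)):
        here = coords[order[i]]
        there = coords[order[(i + 1) % len(order)]]  # wrap around to return home
        total += math.dist(here, there)
    return total


# Subset sum: the string 101001 selects 3, 4 and 2, which sum to the target 9.
values = [3, 34, 4, 12, 5, 2]
gap = subset_sum_fitness([1, 0, 1, 0, 0, 1], values, target=9)

# Traveling salesman: four cities on the corners of a 4-by-3 rectangle.
coords = [(0, 0), (0, 3), (4, 3), (4, 0)]
perimeter_tour = tour_length([0, 1, 2, 3], coords)   # follows the perimeter
crossing_tour = tour_length([0, 2, 1, 3], coords)    # uses both diagonals

print(gap, perimeter_tour, crossing_tour)
```

In both cases the genetic algorithm only ever sees the string (or permutation) and the number returned by the fitness function.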

A genetic algorithm maintains a population of such solutions and their
corresponding fitnesses. By focusing on the better members of the
population and introducing small variations (or mutations), we hope
that the population will evolve good, or even optimal, solutions in a
reasonable time.

A popular general introduction to the field can be found in Mitchell's
book [42.4]. Much of what is known about the theory of genetic
algorithms was developed by Vose and colleagues [42.5], which centers
around a description of the changing population as a dynamical system.
A gentle overview of this theory can be found in [42.6]. More recently,
there has been a stronger emphasis on understanding the algorithmic
aspects, with a particular focus on run-time analysis for optimization
problems [42.7-9].


42.1 Algorithmic Framework


The genetic algorithm works by updating the population in discrete
iterations, called generations. We begin with an initial, randomly
generated, population. This acts as a set of parents from which a
number of offspring are produced, from which the next generation is
created. There are two basic schemes for doing this: the generational
method and the steady-state method.

The generational approach is to repeatedly produce offspring from the
parent population, until there are enough to fill up a whole new
population. One generation, in this case, corresponds to the creation
of all of these offspring. The steady-state approach, by contrast,
produces a single offspring from the current parents, and then inserts
it into the population, replacing some individual. Here, a generation
consists of creating one new individual solution.

In either case, the population size stays fixed at its initial size,
which is a parameter of the algorithm. There exists some theoretical
work investigating a good choice of population size in different
situations, but there are few general principles [42.10]. The correct
size will depend on the problem to be optimized, and the particular
details of the rest of the algorithm. It should be noted, however, that
in a number of cases it can be shown that smaller population sizes are
preferable and, indeed, in some cases a population of size one is
sufficient.

The overall structure of the generational genetic algorithm is as
follows:


1. Initialize population of size μ randomly with points
from the search space.

2. Repeat until stopping criterion is satisfied:
a) Repeat μ times:
i. Choose a point from the population.
ii. Modify the point with mutation and crossover.
iii. Place resulting offspring in the new population.

3. Stop.


The critical points are therefore the selection 
method, used to choose points from the current popu- 
lation, and the mutation and crossover methods used to 
modify the chosen points. The idea is that selection will 
favor better solutions (in the sense that they provide bet- 
ter solutions to the optimization problem at hand), that 
mutation will introduce slight variations in the current 
chosen solutions, and that crossover will combine to- 
gether parts of different good solutions to, hopefully, 
form a better combination. We will look at different 
schemes for selection, mutation, and crossover in detail 
later, in Sects. 42.2, 42.4, and 42.6. 
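
The generational scheme listed above can be sketched in a few lines. This is a minimal sketch, assuming bit-string individuals and a fitness to be maximized. The specific operators (binary tournament selection, one-point crossover, and bitwise mutation with rate 1/n) are illustrative choices made here, not ones prescribed by the text, and the function names are hypothetical.

```python
import random


def tournament(population, fitness, rng):
    """Binary tournament: sample two distinct members and keep the fitter one."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b


def generational_ga(fitness, n_bits, mu, generations, rng):
    """Generational GA: mu offspring replace the whole population each generation."""
    # Step 1: random initial population of size mu.
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(mu)]
    # Step 2: here the stopping criterion is a fixed number of generations.
    for _ in range(generations):
        new_population = []
        for _ in range(mu):
            # i. Choose points (two parents, because crossover needs a pair).
            p1 = tournament(population, fitness, rng)
            p2 = tournament(population, fitness, rng)
            # ii. Modify with one-point crossover, then bitwise mutation.
            cut = rng.randrange(1, n_bits)
            child = p1[:cut] + p2[cut:]
            child = [b ^ 1 if rng.random() < 1.0 / n_bits else b for b in child]
            # iii. Place the offspring in the new population.
            new_population.append(child)
        population = new_population
    return population


# Usage: maximize the number of ones in a 20-bit string.
final = generational_ga(sum, n_bits=20, mu=10, generations=30,
                        rng=random.Random(1))
print(max(sum(ind) for ind in final))
```

Because no elitism is added, nothing in this loop guarantees that the best individual of one generation survives into the next.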

The overall structure of the steady-state genetic al- 
gorithm is as follows: 


1. Initialize population of size μ randomly with points
from the search space.

2. Repeat until stopping criterion is satisfied: 
a) Choose a point from the population. 
b) Modify the point with mutation and crossover. 
c) Choose an existing member of the population. 
d) Replace that member with the new offspring. 

3. Stop. 


It can be seen that, in addition to selection, mutation and crossover,
the steady-state genetic algorithm also requires us to specify a means
for choosing an individual to be replaced. Suitable replacement
strategies will be discussed later in Sect. 42.3, but for now we make
a few general observations.

It is clear that in the generational genetic algo- 
rithm, progress is driven by the selection method. In 
this step of the algorithm, we choose to keep those so- 
lutions which we prefer, by dint of the degree to which 
they optimize the problem we are trying to solve. For 
the steady-state genetic algorithm, progress can also
be maintained by the replacement strategy, if this is
designed so as to effect the replacement of poorly
performing individuals. In fact, it is possible to put the
whole burden of evolution on the replacement step, for 
example, by always replacing the worst member of the 
population, and allowing the selection step to choose 
any individual uniformly at random. Conversely, one 
may use a stronger selection method and replace indi- 
viduals randomly (or, of course, some combination of 
the two approaches). The steady-state algorithm there- 
fore allows the user finer tuning of the strength of 
selective pressure. 
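
A steady-state loop that puts the whole burden of evolution on the replacement step can be sketched as follows. It is a sketch under assumptions: parents are chosen uniformly at random, the child joins the pool, and the worst of the resulting candidates is dropped, which is one reading of "always replacing the worst member". The operators and the name `steady_state_ga` are illustrative only.

```python
import random


def steady_state_ga(population, fitness, steps, rng):
    """Steady-state GA: uniform parent choice, worst-of-pool replacement."""
    n_bits = len(population[0])
    for _ in range(steps):
        # Selection applies no pressure: both parents are uniform picks.
        parent = rng.choice(population)
        mate = rng.choice(population)
        cut = rng.randrange(1, n_bits)
        child = parent[:cut] + mate[cut:]
        child = [b ^ 1 if rng.random() < 1.0 / n_bits else b for b in child]
        # Replacement applies all the pressure: add the child, drop the worst.
        population.append(child)
        population.remove(min(population, key=fitness))
    return population


# Usage: six random 16-bit strings, fitness = number of ones.
rng = random.Random(3)
start = [[rng.randint(0, 1) for _ in range(16)] for _ in range(6)]
best_before = max(sum(ind) for ind in start)
final = steady_state_ga([ind[:] for ind in start], sum, steps=300, rng=rng)
best_after = max(sum(ind) for ind in final)
print(best_before, best_after)
```

Swapping the roles, with a tournament for parent choice and a uniformly random victim for replacement, gives the converse arrangement mentioned in the text.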

In addition, the steady-state approach allows the 
user to guarantee that good individuals are never lost, 
by choosing a replacement strategy that protects such 
individuals. For example, replacing the worst individ- 
ual each time ensures that copies of the best individual 
always remain. Any EA that has the property of pre- 
serving the best individual is called elitist. This would 
seem to be a desirable property as otherwise progress 
toward the optimum can be lost due to mutation and 
selection (Sect. 42.5). The generational framework, as 
it stands, offers no such guarantee. Indeed, depending 
on the method chosen, the best individual may not even
be selected, let alone preserved! It is quite a common 
strategy, therefore, to add elitism to the generational 
framework, for example, by making sure at least one 
copy of the best individual is copied across to the next 
generation each time. 

The generational approach, without elitism explicitly added, is
referred to as a comma strategy. In particular, it is a (μ, μ) EA. This
means that the population has size μ, and a further μ offspring are
created, which become the next generation population. More generally,
we could have (μ, λ) algorithms, where λ ≥ μ, in which λ offspring are
created and the best μ are taken to be the next generation. If λ is
sufficiently large with respect to μ, then the probability of not
selecting the best individual gets rather small. If, in addition, there
is a reasonable chance that mutation and crossover do not make any
changes, we get an approximation of elitism.

The steady-state algorithm, in which one replaces the worst individual,
is referred to as a plus strategy. In particular, it is a (μ + 1) EA.
This means that one offspring is created from a population of size μ,
and then the best μ are kept from the pool of parents plus offspring.
More generally, we could have (μ + λ) algorithms, in which λ offspring
are created and the best μ from the collection of parents and offspring
are kept to be the next generation. Plus strategies are, of course,
elitist.
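
The difference between comma and plus strategies comes down to which pool the survivors are drawn from. The sketch below shows only that survivor step, assuming a fitness to be maximized; the function names and the toy data are hypothetical.

```python
def comma_selection(parents, offspring, fitness, mu):
    """(mu, lambda): keep the best mu of the offspring only; parents are discarded."""
    return sorted(offspring, key=fitness, reverse=True)[:mu]


def plus_selection(parents, offspring, fitness, mu):
    """(mu + lambda): keep the best mu from parents and offspring together."""
    return sorted(parents + offspring, key=fitness, reverse=True)[:mu]


# Toy data: mu = 2 parents, lambda = 3 offspring, fitness = number of ones.
parents = [[1, 1, 1, 1], [1, 1, 0, 0]]
offspring = [[1, 0, 0, 0], [0, 0, 0, 0], [1, 1, 1, 0]]

after_comma = comma_selection(parents, offspring, sum, mu=2)
after_plus = plus_selection(parents, offspring, sum, mu=2)

print(after_comma)  # the all-ones parent is lost
print(after_plus)   # the all-ones parent survives
```

In this toy case the comma strategy loses the best parent because none of the offspring matches it, while the plus strategy retains it.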

Another advantage of the steady-state genetic algo- 
rithm is that progress can be immediately exploited. As 
soon as an improving individual is found, it is inserted 
into the population, and may be selected for further evo- 
lution. The generational algorithm, in contrast, has to 
produce a full set of offspring before any good discov- 
eries can be built on. This overhead can be minimized 
by taking the population to be as small as possible, al- 
though one then risks losing good individuals in the 
selection process, unless some form of elitism is explic- 
itly implemented. 

A further extension often made to either scheme is 
to implement a mechanism for maintaining a level of 
diversity in the population. Clearly, a potential advan- 
tage of having a population is that it can cover a broad 
area of the search space and allow for more effec- 
tive searching than if a single individual were used. 
Any such advantage would be lost, however, if the 
members of the population end up identical or very 
similar. In particular, the effectiveness of crossover is 
reduced, or even eliminated, if the population mem- 
bers are too similar. Some of the methods employed 
for maintaining diversity will be discussed in detail in 
Sect. 42.7. 

Before we move on to details, it is worth asking 
whether or not maintaining a population is an effec- 
tive approach to optimization. Indeed, such a question 
should be asked seriously whenever one attempts to use 
EAs for such problems. For certain classes of prob- 
lem, it may well be the case that a simple local search 
strategy (that is, a (1 + 1) evolutionary algorithm (EA)) 
may be as effective or better. For example, for some 
problems (such as OneMax, which simply totals the 
bit values in a string), the (1 + 1)-EA is provably op- 
timal amongst evolutionary algorithms that only use 
standard bitwise mutation [42.11]. There is a small, 
but limited, amount of theoretical work on this issue 
to guide us [42.10]. In the first place, if the search 
space comprises islands of good solutions with small 
gaps containing poor solutions, then a population might 
provide an effective way to jump across the gaps. This 
is because the poorer offspring are not so readily rejected
(especially with a generational approach), and
may persist long enough for a lucky mutation to move
into a neighboring good region. One might therefore 
also expect a population-based approach to be effective 
generally on highly multimodal problems but, unfor- 
tunately, there is very little analysis on this situation. 
If crossover is thought to be effective for your problem
(Sect. 42.6), then a population is required, as we need to
choose pairs of parents, although again it might be the case
that very small populations are sufficient. It certainly seems
to be the case that for any such advantage, it is necessary to
maintain a reasonable level of diversity, or the point of the
population is lost [42.12].

One situation in particular where a genetic algorithm may be
helpful is if the potential solutions to a problem are
represented by the population as a whole. That is, one tries to
find an optimal set of things, and we can use the population to
represent that set. This is the case, for example, in
multiobjective optimization, where one tries to determine the
Pareto set of non-dominated solutions [42.13]. For single
objective problems, it may be possible to represent solutions
as a set of objects, each of which can be evaluated according
to its contribution to the overall solution. There has been
some recent progress on problems of this type [42.14] and we
will consider this case in Sect. 42.9.


42.2 Selection Methods


The selection method is the primary means a genetic algorithm
has of directing the search process toward better solutions. It
is usually defined in terms of a fitness function which assigns
a positive score to each point in the search space, with the
optimal solution having the maximum (or minimum) fitness. Often
the fitness function is, in fact, just the objective function
of the problem to be optimized; however, there are times when
it is modified. This typically happens when the problem
involves some constraints, and so account has to be taken of
the extent to which the constraints are satisfied. There are a
number of different ways to approach this situation:


1. Simply discard any illegal solution and try again.
That is, if mutating (say) a solution produces an illegal
solution, discard it and try mutating the original
again, until a legal solution is obtained.

2. Repair the solution. This involves creating a special
purpose heuristic which, given any illegal solution,
modifies it until it becomes legal.

3. More generally, one can construct modified operators
which are guaranteed to produce legal solutions.
The above two methods are specific ways one
might achieve this.

4. Adapt the fitness function by adding penalty terms.


It is this fourth approach which concerns us here, as it allows
illegal solutions to be tolerated, but puts the onus on the
selection method to drive the population away from illegal and
toward legal solutions. The idea is to create a fitness
function which is a weighted sum of the original objective
function and a measure of the extent to which constraints have
been broken. That is, if $h: X \to \mathbb{R}$ is the objective
function, we might have a fitness function

$$f(x) = w_0 h(x) - \sum_{j=1}^{k} w_j c_j(x),$$

where $c_j(x)$ is a measure of how far constraint $j$ has been
broken. A difficulty with this approach is in specifying the
weights, since one would like to allow illegal solutions to be
tolerated (at least in the early stages of the search), but one
does not want good but illegal solutions to wipe out any legal
ones. It is common, then, to try to fix the weights at least so
that legal solutions are preferred to illegal ones (however
good). This, then, suggests a fifth approach, which works
particularly well with tournament selection (see below), in
which it is only necessary to say which of two solutions is to
be preferred. We stipulate that legal solutions are to be
preferred to illegal ones, that two legal solutions should be
compared with the objective function, and that two illegal
solutions should be compared by the extent of constraint
violation.
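
Both ideas, the weighted penalty and the pairwise preference rule, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `penalized_fitness` and `prefer` are hypothetical, the problem is taken to be a maximization, and `violations(x)` is assumed to return a list of non-negative numbers that are zero exactly when the corresponding constraint holds.

```python
def penalized_fitness(x, objective, violations, w0=1.0, weights=None):
    """Weighted sum: w0 * h(x) minus the weighted constraint violations."""
    c = violations(x)  # assumed: c[j] >= 0, and c[j] == 0 when constraint j holds
    w = weights if weights is not None else [1.0] * len(c)
    return w0 * objective(x) - sum(wj * cj for wj, cj in zip(w, c))


def prefer(x, y, objective, violations):
    """Pairwise rule for tournaments: return the preferred of x and y."""
    vx, vy = sum(violations(x)), sum(violations(y))
    if vx == 0 and vy == 0:      # both legal: compare by objective
        return x if objective(x) >= objective(y) else y
    if vx == 0 or vy == 0:       # exactly one legal: the legal one wins
        return x if vx == 0 else y
    return x if vx <= vy else y  # both illegal: smaller violation wins
```

The pairwise rule needs no weights at all, which is why it sits naturally with selection schemes that only ever compare two solutions.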

The degree to which poor, or illegal, solutions are 
tolerated by a genetic algorithm is determined by the 
strength of the selection method chosen. A weak se- 
lection method will allow poor (that is, low fitness) 
solutions to be selected with high probability com- 
pared to a strong scheme, which will typically select 
better solutions. It is usual to insist that a selection 
scheme should have the property that a better solution 
should have a higher probability of being selected than 
a weaker one. A number of selection methods have been
proposed, ranging from very weak to very strong, and
we will consider them in this order. 


42.2.1 Random Selection 


The weakest selection method is simply to pick a mem- 
ber of the population uniformly at random. Of course, 
this has no selection strength at all and will not, by 
itself, guide the search process. It must therefore be 
combined with some other mechanism to achieve this. 
Typically, this would be used in a steady-state genetic 
algorithm, in which the replacement strategy imposes 
the selection pressure (Sect. 42.3). Another possibility 
is in a parallel genetic algorithm where offspring may 
replace parents if they are better, but the selection of 
partners for crossover is random (Sect. 42.8). 


42.2.2 Proportional Selection 


The fitness proportional selection method comes from 
taking the analogy of the role of fitness in natural evolu- 
tion seriously. In biological evolution, fitness is literally 
a measure of how many offspring an individual expects 
to have. Within the fixed-sized population of a ge- 
netic algorithm, then, we model this by saying that the 
probability of an individual being selected should be 
proportional to its fitness within the population. That 
is, the probability of selecting item $x$ is given by

$$\frac{f(x)}{\sum_{y} f(y)},$$

where the sum ranges over all members of the population. This
selection method is often implemented by the roulette wheel
algorithm as follows:


1. Let T be the total fitness of the population.
2. Let R be a random number in the range 0 < R ≤ T.
3. Let c = 0. Let i = 0.
4. While c < R do
   a) Let c = c + f(i).
   b) Let i = i + 1.
5. Return i − 1.


Here f(i) is the fitness of the item with index i in the
population, with indices starting at 0, so that step 5 returns
the index of the last item whose fitness was added to c.
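
The roulette wheel procedure can be sketched in Python as follows. This is a minimal sketch rather than a reference implementation: the name `roulette_select` and the `rng` parameter are illustrative, indices start at 0, fitness values are assumed non-negative with a positive total, and the final `return` only guards against floating-point round-off.

```python
import random


def roulette_select(fitness, rng=random):
    """Return index i with probability fitness[i] / sum(fitness)."""
    total = sum(fitness)
    r = rng.random() * total   # uniform in [0, total)
    c = 0.0
    for i, f in enumerate(fitness):
        c += f                 # running (cumulative) fitness
        if r < c:              # r falls inside item i's slice of the wheel
            return i
    return len(fitness) - 1    # only reached through round-off
```

Each call costs time linear in the population size, which is the source of the quadratic cost when a whole generation is selected this way.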


42.2.3 Stochastic Universal Sampling


In a generational genetic algorithm, one needs to select $\mu$
individuals from the population in order to complete one
generation. Using proportional selection to do this therefore
requires $O(\mu^2)$ time, which can be a significant burden on
the running time. An alternative selection algorithm, which
still ensures that the expected number of times an individual
is selected is proportional to its fitness, is the stochastic
universal sampling algorithm [42.15]. If $T$ is again the total
fitness of the population, then let

$$E[i] = \frac{\mu f(i)}{T},$$

which is the expected number of copies of item $i$. The
selection algorithm guarantees that either
$\lfloor E[i] \rfloor$ or $\lceil E[i] \rceil$ copies of $i$
are selected, for each item $i$ in the population:


1. Let r be a random number in the range 0 < r < 1.
2. Let c = 0.
3. For i = 0 to $\mu - 1$ do
   a) Let c = c + E[i].
   b) While r < c do: r = r + 1; Select(i).


By the time the algorithm terminates, $\mu$ items will have
been selected, in $O(\mu)$ time (for a good introduction to
asymptotic notation, see [42.16]).
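
A direct Python transcription of the stochastic universal sampling steps might look like this. It is a sketch under assumptions: the function name `sus_select` is illustrative, fitness values are non-negative with a positive total, and the expected copy count of item i is taken to be the population size times its share of the total fitness.

```python
import random


def sus_select(fitness, rng=random):
    """Select len(fitness) indices using a single random offset."""
    mu = len(fitness)
    total = sum(fitness)
    r = rng.random()             # one random number for the whole generation
    c = 0.0
    chosen = []
    for i, f in enumerate(fitness):
        c += mu * f / total      # add E[i], the expected copies of item i
        while r < c:             # every pointer that lands in item i's interval
            chosen.append(i)
            r += 1.0             # pointers are spaced exactly one unit apart
    return chosen
```

Because the pointers are equally spaced, an item with expected count 1.5 is picked once or twice, never zero or three times.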


42.2.4 Scaling Methods 


In the early stages of a run of a genetic algorithm, there 
is usually considerable diversity in the population, and 
the fitness of the best individuals may be considerably 
greater than the others. When using fitness proportional 
selection, this can lead to strong selection of the better 
individuals. Later on when the algorithm is nearing the 
optimum, the population is less diverse, and fitnesses 
may be more or less constant. In this situation, propor- 
tional selection is very weak, and does not discriminate. 

One idea to combat this problem is to scale the 
fitness function somehow, so as to adjust the selec- 
tion strength during the run. Two proposals along these 
lines are sigma scaling and Boltzmann selection. Sigma scaling
(invented by Stephanie Forrest and described in [42.2] and
[42.4]) explicitly takes the diversity of the population into
account via the standard deviation $\sigma$ of the fitness in
the population. Given the fitness function
$f: X \to \mathbb{R}$, the new scaled fitness is

$$h(x) = 1 + \frac{f(x) - \bar{f}}{2\sigma},$$

where $\bar{f}$ is the average fitness of the population. A
negative value of $h(x)$ might be clamped at zero or some small
value. The idea behind sigma scaling is that when there is a
lot of diversity, $\sigma$ is large, and so the best
individuals will not dominate the selection process so much.
However, when diversity is low, so is $\sigma$, and the scaled
fitness function can still discriminate effectively.

The second proposal, Boltzmann selection, makes 
use of the idea that the diversity (and therefore strength 
of selection) is lost over time [42.17]. We therefore seek 
to scale the fitness using the time (or generation num- 
ber) as a parameter. This is usually done in the same 
way as simulated annealing, by controlling a tempera- 
ture parameter T, which is initially large, but decreases 
over time. We have a scaled fitness of 


$$h(x) = \exp\left(\frac{f(x)}{T}\right).$$

Of course, a difficulty with this approach is to select the
appropriate cooling schedule by which $T$ should decrease over
time.
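
The two scaling rules are easy to express in code. The sketch below makes several assumptions that are not in the text: the function names are illustrative, the population standard deviation is used for sigma, a population with no diversity is given equal scaled fitness, and the Boltzmann version subtracts the maximum fitness before exponentiating (this multiplies every value by the same constant, so selection proportions are unchanged, but it avoids overflow).

```python
import math
import statistics


def sigma_scaled(fitness, floor=0.0):
    """Scaled fitness 1 + (f - mean) / (2 * sigma), clamped at `floor`."""
    mean = statistics.fmean(fitness)
    sigma = statistics.pstdev(fitness)
    if sigma == 0.0:
        return [1.0] * len(fitness)   # assumption: no diversity, treat all alike
    return [max(floor, 1.0 + (f - mean) / (2.0 * sigma)) for f in fitness]


def boltzmann_scaled(fitness, temperature):
    """Scaled fitness proportional to exp(f / T)."""
    m = max(fitness)                  # shift for numerical safety only
    return [math.exp((f - m) / temperature) for f in fitness]
```

Either result can be passed to a proportional selection routine in place of the raw fitness values.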


42.2.5 Rank Selection 


We have seen that one of the major drawbacks of the 
proportional selection method, and stochastic universal 
sampling, is that the probability of choosing individu- 
als is very sensitive to the relative scale of the fitness 
function. For example, an item with fitness 2 is twice as 
likely to be chosen as an item of fitness 1. But an item 
of fitness 101 is almost as likely to be chosen as one 
with fitness 100. It is therefore suggested that selection 
should depend only on the relative strength of the indi- 
vidual within the population. One way to achieve this 
is to choose an individual depending on its rank. That 
is, we sort the population using the fitness function, but 
then ascribe a rank to each member, with the best individual
getting score $\mu$ down to 1 for the worst [42.18]. The
simplest thing to do then is to choose individuals proportional
to rank, using the roulette wheel or stochastic universal
sampling algorithms. However, this is then sensitive to the
population size. So a common alternative is to linearly scale
the rank to achieve a score between two numbers $a$ and $b$. We
thus get a function

$$h(i) = \frac{(b-a)\,r(i) + \mu a - b}{\mu - 1},$$

where $r(i)$ is the rank of item $i$ in the population. We then
seek to select items in proportion to their $h$-value. This can
be done with the roulette wheel or stochastic universal
sampling method. Notice that since the sum of ranks is known,
so is the sum of $h$-values, and so the probability of
selecting item $i$ is

$$\frac{2\left((b-a)\,r(i) + \mu a - b\right)}{\mu(\mu-1)(b+a)}.$$
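
As a check on the algebra, the linearly scaled rank probabilities can be computed directly. In this sketch the function name is illustrative, ranks run from 1 (worst) to the population size (best), ties are broken arbitrarily by sort order, and the population is assumed to have at least two members.

```python
def linear_rank_probabilities(fitness, a=0.0, b=1.0):
    """Selection probability of each item under linearly scaled ranking."""
    mu = len(fitness)
    order = sorted(range(mu), key=lambda i: fitness[i])   # worst first
    rank = [0] * mu
    for r, i in enumerate(order, start=1):
        rank[i] = r                                       # 1 = worst, mu = best
    # Scaled score: a for the worst item, b for the best.
    h = [((b - a) * rank[i] + mu * a - b) / (mu - 1) for i in range(mu)]
    total = mu * (a + b) / 2.0                            # closed-form sum of h
    return [hi / total for hi in h]
```

With a = 0 and b = 1 the worst item is never selected, and for three items the probabilities come out as 0, 1/3 and 2/3 in rank order.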


42.2.6 Tournament Selection 


A much simpler way to achieve a similar end as rank selection
is to use tournament selection. We fix a parameter $k < \mu$.
To select an item from the population we simply do the
following:


1. Choose k items from the population, uniformly at 
random. 
2. Return the best item from those chosen; 


where best refers, of course, to assessment by the fitness 
function. Perhaps the most common version is binary 
tournament selection, where k = 2. In this case, it is not 
strictly necessary to have a fitness function assign a nu- 
merical value to points in the search space. All that is 
required is a means to compare two points and return 
the preferred one. 

It is straightforward to show that, for binary tournaments in
which the two items are chosen with replacement, the
probability of choosing item $i$ from the population is

$$\frac{2r(i) - 1}{\mu^2},$$

where $r(i)$ is the rank from 1 (the worst) to $\mu$ (the
best). If one chooses the two items to be compared without
replacement, then the probability of choosing item $i$ becomes

$$\frac{2r(i) - 2}{\mu(\mu - 1)},$$

which is equivalent to rank selection, linearly scaled with
$a = 0$ and $b = 1$.

At the other extreme, if one were to pick the tournament size
to be very large (close to $\mu$) then it becomes more and more
likely that only the best individuals will be selected.
Increasing the tournament size in this way is a good method for
controlling the selection strength.
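
Tournament selection and the two binary-tournament probabilities can be sketched as follows. The function names and the `replace` flag are illustrative choices, and `fitness` is assumed to be a list aligned with `population`.

```python
import random


def tournament_select(population, fitness, k=2, replace=True, rng=random):
    """Pick k entrants uniformly at random and return the fittest of them."""
    mu = len(population)
    if replace:
        entrants = [rng.randrange(mu) for _ in range(k)]
    else:
        entrants = rng.sample(range(mu), k)   # k distinct entrants
    return population[max(entrants, key=lambda i: fitness[i])]


def binary_tournament_probability(rank, mu, replace=True):
    """Probability that the item of given rank (1 = worst) wins when k = 2."""
    if replace:
        return (2 * rank - 1) / mu**2
    return (2 * rank - 2) / (mu * (mu - 1))
```

For a population of four, the best item should win a binary tournament with replacement in 7/16 of the trials, which a simple simulation confirms to within sampling error.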


42.2.7 Truncation Selection 


The strongest selection method of all would be to only select
the best individual. Slightly more forgiving is truncation
selection, where only individuals within a given fraction of
top performers are selected. In a generational genetic
algorithm, these must be repeatedly selected at random. In a
steady-state algorithm one simply picks one of them at random.
Truncation selection, therefore, introduces a new parameter,
which is the fraction of the population available for
selection. This may vary from, say, a half (rather weak) to a
tenth (rather strong). Usually, the value must be chosen
experimentally, as there is little theoretical analysis of this
form of selection for combinatorial problems.


42.3 Replacement Methods


The steady-state genetic algorithm requires a method by which a
new offspring solution can be placed into the population,
replacing one of the existing members. As with selection, there
are different approaches, which have different strengths in
terms of the extent to which they drive the population to
retain better solutions. Indeed, most of the methods for
replacement are based on those already described for selection
(Sect. 42.2).


42.3.1 Random Replacement


A simple method commonly found in steady-state genetic
algorithms is for the new solution to replace an existing
member chosen uniformly at random. If this is done, then the
replacement phase does not push the search process in any
particular direction and the onus for evolving toward better
solutions is on the selection method chosen. Because of this,
it is commonly supposed that a steady-state genetic algorithm
with random replacement is more or less equivalent to a
generational genetic algorithm using the same selection
method. This is not quite true; however, it can be shown
theoretically that the long-term behavior of both algorithms
will be the same [42.19]. What will not necessarily be the same
is the short-term, transient behavior and, in particular, the
speed with which the algorithms arrive at their long-term
equilibrium may well be different.


42.3.2 Inverse Selection


Several replacement methods are based directly on selection
methods, but changed so that the poorer performing solutions
are more likely to be replaced than the better ones. For
example, one can construct an inverse fitness proportional
replacement method, where the probability of being replaced is
determined by the fitness. In order to ensure that lower
fitness means a higher chance of replacement, the reciprocal of
the fitness might be used to determine the probability of
replacement. Alternatively, the fitness can be subtracted from
that of the global optimum (if known). This would have the
advantage that the optimum, if found, would never be replaced.
In this case, the probability that item $i$ in the population
is selected for replacement will be

$$\frac{f^* - f(i)}{\mu f^* - \sum_{j} f(j)},$$

where $f^*$ is the optimum fitness value.

Replacement determined by fitness has the same 
drawbacks as selection done in this way. For example, 
toward the end of the search all the population members 
will have similar fitness values. Fitness proportional re- 
placement will then be almost as likely to replace the 
best individual as the worst. Consequently, an alterna- 
tive method using scaling or rank might be preferred. 
The simplest of these ideas is to use a tournament, but 
this time pick the worst of the sample: 


1. Choose k items from the population, uniformly at 
random without replacement. 
2. Return the worst item from those chosen. 


This has the advantage that the best item in the pop- 
ulation cannot be replaced. 


42.3.3 Replace Worst 


Perhaps the most common choice of replacement strat- 
egy is to simply replace the worst member of the 
population with the new offspring. This is a relatively 
strong approach, as it preserves all the better members 
of the population. Indeed, this strategy can well be com- 
bined with the random selection method, and using only 
replacement as the means of driving the evolution to- 
ward better individuals. 

An even stronger variant would be to replace the 
worst member of the population only if the new off- 
spring is better or equal in value. This has the property 
that the minimum fitness of the population can never 
decrease, and so we are guaranteed progress through- 
out the evolution. In some cases, this may lead to much 
faster evolution. However, it may also happen that for 
a long time, no new individuals are added to the population.
This will be increasingly the case toward the
end of a run, when it is increasingly hard to create bet- 
ter individuals. Replacing the worst member, regardless 
of how good or bad the new offspring, at least keeps 
adding new search points and creates new possibilities 
for further exploration, while retaining copies of the 
better solutions in the population. 
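
The replacement strategies discussed so far differ only in how the slot to overwrite is chosen. The following sketch shows random replacement, the inverse (worst-of-sample) tournament, and the two replace-worst variants. All names are illustrative, and the population and its fitness values are assumed to be stored in two parallel lists that are updated in place.

```python
import random


def replace_random(pop, fit, child, child_fit, rng=random):
    """Overwrite a member chosen uniformly at random."""
    i = rng.randrange(len(pop))
    pop[i], fit[i] = child, child_fit


def replace_inverse_tournament(pop, fit, child, child_fit, k=2, rng=random):
    """Sample k distinct members and overwrite the worst of the sample."""
    entrants = rng.sample(range(len(pop)), k)
    i = min(entrants, key=lambda j: fit[j])
    pop[i], fit[i] = child, child_fit


def replace_worst(pop, fit, child, child_fit, only_if_not_worse=False):
    """Overwrite the worst member, optionally only if the child is no worse."""
    i = min(range(len(pop)), key=lambda j: fit[j])
    if only_if_not_worse and child_fit < fit[i]:
        return                      # stronger variant: reject a worse child
    pop[i], fit[i] = child, child_fit
```

With k of at least 2 and a unique best member, the inverse tournament can never pick that member for replacement, since some other entrant in the sample is always worse.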


42.3.4 Replace Parents 


A different idea for choosing which element to replace 
is for the offspring to replace the parent (or parents). 
This would have an advantage in maintaining some 
level of diversity (Sect. 42.7) since the offspring is 
likely to be similar to the parent. The simplest way to 
do this, in the case when mutation but not crossover is
used, is for the offspring to replace the parent if it is
better. If the selection process is purely random, then this
amounts to running a number of local search algorithms 
in parallel, as there is no interaction between the indi- 
viduals in the population. 

A variation on this is when there is crossover, which 
requires the selection of a second parent (Sect. 42.6). 
In this case, it makes sense for the offspring to replace 
the worst of the two parents, guaranteeing that the best 
individual is never replaced. This is the idea behind the 
so-called microbial genetic algorithm [42.20], which is 
a steady-state genetic algorithm, with random selection, 
standard crossover, and mutation (Sects. 42.4 and 42.6), 
with the offspring replacing the worst parent: 


1. Generate random population.
2. Repeat until stopping criterion satisfied:
   a) Select two individuals from the population uniformly at
   random.
   b) Perform crossover and mutation.
   c) Let the new offspring replace the worst of the two
   parents.
3. Stop.


A further variation possible if crossover is used is to create
two offspring and for them to replace both parents under some
suitable conditions. Possibilities include:


• If at least one offspring is better than both parents.

• If both offspring are no worse than the worst parent.

• If one offspring is better than both parents and the other is
better than one of them.


This is the idea behind the gene invariant genetic algorithm
[42.21]. It is designed for use on search spaces given by fixed
length binary strings. We arrange for the initial population to
be constructed such that for every random string generated, we
also include its bitwise complement. This ensures an equal
number of ones and zeros at each bit position in the
population. When crossover takes place between two parents, if
we keep both possible offspring, then we maintain the number of
ones and zeros. This arrangement naturally maintains a lot of
diversity in the population, without even the necessity to
include mutation. Early empirical studies suggested that this
approach would be very good at avoiding certain kinds of traps
in the search space. There seems to have been very little work
following up this suggestion, however (although see [42.22] for
the analysis of a (1 + 1) version of the algorithm).


42.4 Mutation Methods


Whereas selection and replacement focus the genetic algorithm
on a subset of its population, the mutation and crossover
operators enable it to sample new points in the search space.
The idea behind mutation is that, having selected a good member
of the population, we should try to create a variant with the
hope that it is even better. To do this, it is often possible
to make use of some natural or well-established local search
operators for the problem class concerned. For example, for the
traveling salesman problem, the well-known 2-opt operator works
by reversing a random segment of the selected tour [42.23, 24].
This immediately provides us with a way of mutating solutions
for this problem class. Similarly, when solving the Knapsack
problem, it is often helpful to exchange items, and this too
gives us a good idea how to mutate.

Generally speaking, mutations are defined by 
choosing a representation for the points of the search 
space, and then defining a set of operators to act on that 
representation. For example, many combinatorial opti- 
mization problems concern choosing an optimal subset 
of some set. We can represent the search space, the col- 
lection of all subsets, using binary strings with length 
equal to the size of the set. Each position corresponds to 
a set element, and we use 0 and 1 to distinguish whether 
or not an element is included in a particular subset. We
then define a collection of operators with which to act
upon the representation. For example, the action of re- 
moving or adding an item to the subset is very natural, 
and is given by simply flipping the bit at the appropri- 
ate position. If there are n bits in the representation, then 
this would give us n corresponding operators. In order 
to mutate a bitstring, one chooses an operator at random 
and applies it. 

In order to exchange an item in a subset with one 
that is not, we must simultaneously flip a 1 to a 0 and 
a 0 to a 1. If the subset has k elements, then there are 
$k(n-k)$ ways to do this, giving us another possible set
of operators. 

Perhaps the most common choice of mutation for binary strings
is to randomly flip each bit independently with a fixed
probability $u$ called the mutation rate. This corresponds to
flipping any given subset of bits of size $k$ with probability
$u^k(1-u)^{n-k}$. Very often, the mutation rate is set to
$u = 1/n$. This is popular because, while it favors single bit
mutations, there is a significant probability of flipping two
or more bits, enabling exchanges to take place. Notice,
however, that there is a probability of
$(1 - 1/n)^n \approx 1/e$ that nothing happens at all. While
this clearly slows down evolution by an almost constant factor,
it is not necessarily a bad thing to have a significant
probability of doing nothing: it can sometimes prevent
evolution rushing off down the wrong path (Sect. 42.5).
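
Standard bitwise mutation and the probability that it changes nothing can be sketched as below. The function names are illustrative, bit strings are represented as Python lists of 0 and 1, and the default rate of one over the string length is applied when no rate is given.

```python
import random


def bitwise_mutation(bits, u=None, rng=random):
    """Flip each bit independently with probability u (default 1/n)."""
    n = len(bits)
    if u is None:
        u = 1.0 / n
    return [b ^ 1 if rng.random() < u else b for b in bits]


def prob_no_change(n, u=None):
    """Probability that no bit at all is flipped."""
    u = 1.0 / n if u is None else u
    return (1.0 - u) ** n
```

For a 100-bit string at the default rate, `prob_no_change(100)` is about 0.366, already close to the limiting value 1/e.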

While the mutation rate u = 1/n is the most com- 
monly recommended, the best value to choose de- 
pends on the details of the problem class, and the 
rest of the genetic algorithm being used [42.25]. For example,
the simple (1 + 1) EA maintains a single point of the search
space which it repeatedly mutates, replacing it only if a
better offspring occurs. For this algorithm, the choice of
$u = 1/n$ is provably optimal for the class of linear functions
[42.26]. However, it can be shown that for the so-called
leading ones problem, in which the fitness of a string is
simply the position of the first zero, the optimal mutation
rate for the (1 + 1) algorithm is in fact close to $1.59/n$
[42.27].

For a more general approach to defining a mutation 
operator, one could assign a probability to each subset 
of bits, and flip such a subset with the given probability. 
This corresponds to defining a probability distribution $\pi$
over the set of binary strings. To mutate a string $x$, we
choose another string $y$ with probability $\pi(y)$ and return
the result $x \oplus y$, where the $\oplus$ symbol represents
bitwise exclusive-or [42.28]. This general method has the
property that it is invariant with respect to the labels 0 and
1 to represent whether or not an item is in a given sub- 
set. One might also wish mutation to be invariant with 
respect to the ordering of the n elements of the underly- 
ing set, so that the performance of the genetic algorithm 
is not sensitive to how this order is chosen. To do this 
requires a much more restricted mutation operator. One 
must specify a probability distribution over the numbers 
$0, 1, \ldots, n$ and then, having chosen a number according
to this distribution, flip a random subset of bits of this 
size. Such a mutation is said to be unbiased with re- 
spect to bit labels (that is, the choice of 0 or 1) and bit 
ordering [42.29, 30]. 
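
Both the general mask-based operator and the unbiased operator can be sketched briefly. The names are illustrative; `sample_mask` is a hypothetical caller-supplied function that draws the mask string from whatever distribution over binary strings is wanted, and `size_weights` is assumed to hold one non-negative weight for each possible number of flipped bits from 0 to n.

```python
import random


def xor_mutation(x, sample_mask):
    """Mutate x by bitwise exclusive-or with a randomly drawn mask."""
    y = sample_mask()
    return [xi ^ yi for xi, yi in zip(x, y)]


def unbiased_mutation(x, size_weights, rng=random):
    """Choose how many bits to flip, then flip a uniform subset of that size."""
    n = len(x)
    k = rng.choices(range(n + 1), weights=size_weights)[0]
    flip = set(rng.sample(range(n), k))   # every subset of size k equally likely
    return [b ^ 1 if i in flip else b for i, b in enumerate(x)]
```

Because the second operator depends only on the number of bits flipped, relabeling 0 and 1 or permuting the bit positions leaves its behavior unchanged.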

Mutating each bit with a fixed rate is an example of an
unbiased mutation operator, with the probability of flipping
$k$ bits being

$$\binom{n}{k} u^k (1-u)^{n-k}.$$

The following is an efficient algorithm to choose $k$
according to this binomial distribution [42.31]:


1. Let x = y = 0.
2. Let c = log(1 − u).
3. Let r be a random number from 0 to 1.
4. Let y = y + 1 + ⌊log(r)/c⌋.
5. If y ≤ n then let x = x + 1 and go to 3. Else return x.


We can use this algorithm to perform mutation by 
first selecting the number of bits to be flipped and 
then choosing which particular subset of that size will 
be mutated. The above algorithm has expected run- 
ning time of O(un). If u is relatively small, one can 
then sample the bit positions from {1,...,n} repeat- 
ing if the same index is selected twice (use a hash 
table to detect the unlikely event of a repeated sam- 
ple). For the choice u=1/n the random selection 
algorithm runs in constant time, and, for large n, the 
probability of having to do a repeat sample tends to 
zero. Consequently, performing mutation in this way 
is extremely efficient. An alternative method is to 
randomly choose the position of the next bit to be 
flipped [42.32]. 
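
The two-stage mutation described here, first drawing the number of flips and then the positions, can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the rate is assumed to lie strictly between 0 and 1, positions are numbered from 0 in the second stage, and a flip is counted whenever the running position y is at most n, which is the reading under which the count follows the binomial distribution with parameters n and u.

```python
import math
import random


def sample_flip_count(n, u, rng=random):
    """Number of flipped bits, binomial(n, u), via geometric jumps."""
    x = y = 0
    c = math.log(1.0 - u)                   # requires 0 < u < 1
    while True:
        r = 1.0 - rng.random()              # in (0, 1], so log(r) is defined
        y += 1 + math.floor(math.log(r) / c)  # jump to the next flipped position
        if y > n:
            return x
        x += 1


def mutate(bits, u, rng=random):
    """Flip a uniformly chosen subset whose size is binomially distributed."""
    n = len(bits)
    k = sample_flip_count(n, u, rng)
    positions = set()
    while len(positions) < k:               # resample on a repeated index
        positions.add(rng.randrange(n))
    out = list(bits)
    for i in positions:
        out[i] ^= 1
    return out
```

The set plays the role of the hash table mentioned in the text: a repeated index simply fails to enlarge it, and the loop draws again.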


42.5 Selection–Mutation Balance


We are now in a position to put together a simple genetic
algorithm involving selection, replacement, and mutation.
Selection and replacement will focus the


search on good solutions, whereas mutation will ex- 
plore the search space by generating new individuals. 
We will find that there is a balance between these two 
forces, but the exact nature of the balance depends on 
the details of the algorithm, as well as the problem 
to be solved. To simplify things a little, we will con- 
sider the well-known toy problem, OneMax, in which 
the fitness of a binary string is the sum of its bits. For 
this problem, there are a number of theoretical results 
concerning the balance between selection and mutation, for
example [42.33, 34]. For a thorough analysis of the
selection–mutation balance on a LeadingOnes type problem, see
[42.35]. Here, we will illustrate the effects with some
empirical data.

The first algorithm we will look at is the steady-state
genetic algorithm, using binary tournament selection
and random replacement. As described above, the long-term
behavior will be the same as for a generational genetic
algorithm with the same selection and mutation methods.
Specifically, let us consider a population of size 10, fix the
string length to 100 bits, and consider the effect of varying
the mutation rate. In Fig. 42.1, we show four typical runs at
mutation rates 0.03, 0.02, 0.01 and 0.001, respectively. We
plot the fitness of the best member of the population at each
generation.

Recall that this algorithm is not elitist. This means 
that it is possible to lose the best individual, by re- 
placing it with a mutant of the selected individual. 
With a population of size 10, the best individual is chosen
for replacement with probability 0.1.
Whether it is replaced by something better or worse de- 
pends on the mutation rate. A higher mutation rate will 
tend to be more destructive. We see from the plotted tra- 
jectories, that for higher mutation rates, the algorithm 
converges to a steady state more quickly (around generation
200 in the case of mutation rate of 0.03) but to a poor quality
solution (average fitness around 74). As the mutation rate is
reduced, the algorithm takes longer to find the steady state,
but it is of higher quality. For the smallest mutation rate
shown, 0.001, the run illustrated takes 3500 generations to
converge, but this includes the optimal solution.

Fig. 42.1a-d Trajectories of best of population for steady-state GA with random replacement on OneMax problem (100 bit). (a) Mutation rate = 0.03; (b) mutation rate = 0.02; (c) mutation rate = 0.01; (d) mutation rate = 0.001

Fig. 42.2 Average time taken to optimize OneMax (100 bit) for steady-state GA with worst replacement, with varying mutation rates

Since the OneMax problem is simply a matter of 
hill-climbing, we can improve matters by using an eli- 
tist algorithm. So we now consider the steady-state 
genetic algorithm, again with binary tournament selec- 
tion, but this time replacing the worst individual in the 
population. In this setup, we cannot lose the current best 
solution in the population. We quickly find experimen- 
tally that for reasonable mutation rates, we can always 
find the optimum solution in reasonable time. However, 
again there is a balance to be struck between selec- 
tion strength and mutation. If mutation is too high, it 
is again destructive, which slows down progress. If it is 
too low, we wait for a long time for progress to be made. 
There is now an optimal mutation rate to be sought. Fig- 
ure 42.2 shows the average time to find the optimum 
for the same four different mutation rates (0.03, 0.02, 
0.01, 0.001). The average is taken over 20 runs. We can 
clearly see that the best tradeoff is obtained with the 
mutation rate of 0.01 in this case, with an average of 
around 1400 generations required to find the optimum. 
This rate equals one divided by the string length, and is 
a common choice for EAs. 
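
The two steady-state algorithms compared above can be sketched in one function. This is not the code behind the figures: the function names, the seeding, and the stopping rule are assumptions made for illustration, and a step here means the creation of one offspring, which may not match the unit used on the figure axes.

```python
import random


def onemax(bits):
    """OneMax fitness: the number of ones in the string."""
    return sum(bits)


def steady_state_ga(n=100, mu=10, u=0.01, replacement="worst",
                    max_steps=20000, seed=0):
    """Binary tournament selection, bitwise mutation, one offspring per step."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(mu)]
    fit = [onemax(x) for x in pop]
    for step in range(1, max_steps + 1):
        i, j = rng.randrange(mu), rng.randrange(mu)     # binary tournament
        parent = pop[i] if fit[i] >= fit[j] else pop[j]
        child = [b ^ 1 if rng.random() < u else b for b in parent]
        if replacement == "worst":                      # elitist variant
            k = min(range(mu), key=lambda t: fit[t])
        else:                                           # random replacement
            k = rng.randrange(mu)
        pop[k], fit[k] = child, onemax(child)
        if max(fit) == n:                               # optimum found
            return step, n
    return max_steps, max(fit)
```

Calling it with `replacement="random"` and then `replacement="worst"` over a range of mutation rates gives a rough way to explore the trade-off described in the text; the exact numbers will depend on the seed and on how time is counted.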

It is not strictly necessary to implement an elitist strategy
to obtain reasonable optimization performance. Consider instead
a generational genetic algorithm in which our selection is the
strongest possible: we always pick the best in the population.
This is not technically elitist, as we apply mutation to the
selected individual, which means there is a chance it is lost.
However, if there is a reasonable population size, and the
mutation rate is not too large, then there is a good chance
that a copy of the best individual will be placed in the next
generation. This in effect simulates elitism [42.36]. Yet again
there is a balance between selection and mutation, this time
depending on the population size. If the mutation rate is high,
then we will need a large population to have a good chance of
preserving the best individual. Smaller mutation rates will
allow smaller populations, but will slow down the evolution.

Fig. 42.3 Average time taken (generations) to optimize OneMax (100 bit) for generational GA with best selection, population size 10, and various mutation rates

Fig. 42.4 Average time taken (generations) to optimize OneMax (100 bit) for generational GA with best selection, mutation rate 0.01, and various population sizes

Consider first a population of size 10, and a range of 
mutation rates: 0.005, 0.01, 0.015 and 0.02. Figure 42.3 
shows the average time to find the optimum for our 
generational genetic algorithm. We see that it is very 
efficient for sufficiently low mutation rates (only 125 
generations when the mutation rate is 0.01). However, 
there is a transition to much longer run times when the 
mutation rate gets higher (2500 generations when the 
mutation rate is 0.02). 

Conversely, we can consider fixing the mutation rate 
at 0.01 and varying the population size. Figure 42.4 
shows the optimisation time in generations for popu- 
lations of size 5, 10, 15 and 20. Again we see evidence 
of a transition from long times (when the population 
is too small) to very efficient times with larger popula- 
tions (as low as 80 generations when the population size 
is 20). However, we have to bear in mind that a gen- 
erational genetic algorithm has more fitness function 
evaluations per generation than a steady-state genetic 
algorithm. Thus we cannot keep increasing the popu- 
lation size indefinitely without cost. Figure 42.5 shows 
the number of fitness function evaluations required to 
find the optimum. We see that, of the examples shown, 
a population of size 10 is best (requiring an average of 
1374 evaluations). 
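
The generational algorithm with best selection used in these experiments can be sketched roughly as follows. This is a minimal illustration rather than the code behind the figures: the function names, the seed handling and the generation budget are assumptions made for the example.

```python
import random


def onemax(bits):
    """Fitness: the number of ones in the bit string."""
    return sum(bits)


def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [b ^ 1 if rng.random() < rate else b for b in bits]


def best_selection_ga(n=100, pop_size=10, rate=0.01, max_gens=100_000, seed=0):
    """Generational GA in which every offspring is a mutated copy of the best.

    Returns the number of generations used, or None if the (assumed)
    generation budget runs out first.
    """
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for gen in range(max_gens):
        best = max(pop, key=onemax)
        if onemax(best) == n:
            return gen
        # Nothing is copied unchanged on purpose, so this is not elitist.
        # An offspring escapes mutation with probability (1 - rate) ** n,
        # which is why a large enough population tends to keep the best.
        pop = [mutate(best, rate, rng) for _ in range(pop_size)]
    return None
```

Each generation costs `pop_size` fitness evaluations, so multiplying the returned generation count by the population size gives the evaluation count that the comparison with the steady-state algorithm relies on.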

The exact tradeoff between mutation rate and popu- 
lation size can be calculated theoretically. If the number 
of bits is n, and the mutation rate is 1/n, it can be 


42.6 Crossover Methods 


Crossover (or recombination) is a method for combining 
together parts of two different solutions to make a third. 
The hope is that good parts of each parent solution will 
combine to make an even better offspring. Of course, we 
might also be recombining the bad parts of each parent 
and come up with a worse solution — but then selection 
and replacement methods will filter these out. 

Several methods exist for performing crossover, de- 
pending on the representation used. For binary strings, 
there are three common choices. One-point crossover 
chooses a bit position at random and combines all the 
bit values below this position from one parent, with all 
the remaining bit values from the other. Thus, given the 
parents 


01001101, 
11100111 


if we choose the fifth position for our cut point, we ob- 
tain the offspring solution 


01000111. 


Fig. 42.5 Average time taken (fitness function evaluations) to
optimise OneMax (100 bit) for generational GA with best selection,
mutation rate 0.01, and various population sizes


shown that there is a transition between exponential and 
polynomial run time for the OneMax problem when the 
population size is approximately $5 \log_{10} n$. Indeed it can 
be shown that this is a lower bound on the required pop- 
ulation size to efficiently optimize any fitness function 
with unique global optimum [42.37]. 


Similarly, for two-point crossover, we choose two bit po- 
sitions at random. The bit values between these two cut 
points come from one parent, and the remaining values 
come from the other. Thus, with the same two parents, 
choosing cut positions 2 and 6 produces the offspring 


01100101. 


Both one- and two-point crossovers have the property 
of being biased with respect to the ordering of bits. If 
the problem representation has been chosen so that this 
order matters, then such a crossover may confer an ad- 
vantage as they tend to preserve values that are next 
to each other. For example, if the problem relates to 
finding an optimal subset, and the elements of the set 
have been preordered according to some heuristic (e.g. 
a greedy algorithm), then one- or two-point crossover 
may be appropriate. If, however, the order of the bits 
is arbitrary, then one should choose a method which is 
unbiased with respect to ordering. The common choice 
is called uniform crossover, and involves choosing bit 
values from either parent at random. One way of im- 
plementing this would be to generate a random bit 
string and let the values in this string (or mask) determine
which parent each bit value should come from. 
For example, using again the two parents above, if we 
generate the random mask 01010101, we get the off- 
spring 


01001101. 
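
The three worked examples above can be reproduced with a single mask-based routine. The sketch below is illustrative only: the function names are hypothetical, and the conventions (a mask bit of 0 takes the value from the first parent, cut positions are counted from 1) are assumptions chosen so that the outputs match the offspring shown in the text.

```python
import random


def crossover_by_mask(p1, p2, mask):
    """Take the bit from p1 where the mask is '0' and from p2 where it is '1'."""
    return "".join(a if m == "0" else b for a, b, m in zip(p1, p2, mask))


def one_point_mask(n, cut):
    """Positions before `cut` (1-based) come from p1, the rest from p2."""
    return "0" * (cut - 1) + "1" * (n - cut + 1)


def two_point_mask(n, lo, hi):
    """Positions lo+1 .. hi (1-based) come from p2, the rest from p1."""
    return "0" * lo + "1" * (hi - lo) + "0" * (n - hi)


def uniform_mask(n, rng):
    """Every position picks a parent independently with probability 1/2."""
    return "".join(rng.choice("01") for _ in range(n))


P1, P2 = "01001101", "11100111"
one_point_child = crossover_by_mask(P1, P2, one_point_mask(8, 5))
two_point_child = crossover_by_mask(P1, P2, two_point_mask(8, 2, 6))
uniform_child = crossover_by_mask(P1, P2, "01010101")
```

With these conventions `one_point_child` is 01000111, `two_point_child` is 01100101 and `uniform_child` is 01001101. Writing all three operators as mask generators keeps the recombination step itself identical, so only the distribution over masks differs.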


This leads to a more general view of crossover: We 
specify a probability distribution z over the set of bi- 
nary strings and, to perform crossover, we select a string 
according to this distribution and use it as a mask [42.5]. 
Uniform crossover corresponds to a uniform distribution.
One-point crossover corresponds to selecting only masks of the
form 0...01...1. It can be shown that crossover by masks in general
is always unbiased with respect to changes in labels of bit values
(that is, exchanging ones for zeros). If we also require crossover
to be unbiased with respect to bit ordering, then it is necessary
and sufficient that masks containing the same number of ones are
selected with equal probability. 

Any crossover by masks also has the nice property 
that if both parents agree on a bit position, then the off- 
spring will also share the same value at that position. 
Such crossovers are called respectful, and emphasize 
the idea that one is trying to preserve structure found in 
the parents [42.38, 39]. It can be shown that such prop- 
erties can be understood geometrically [42.40]. 

Our understanding of when crossover can be helpful
is rather limited and there are many open questions. 
For example, continuing to look at the OneMax prob- 
lem, we can examine experimentally whether or not 
crossover helps. Let us keep to the steady-state al- 
gorithm with tournament selection and replacement 
of the worst individual. We modify the algorithm 
by allowing, with a given probability, crossover to 
take place between the selected individual and a ran- 
domly chosen one. Our algorithm therefore is as fol- 
lows: 


1. Initialize population of size $\mu$ randomly with points 
from the search space. 
2. Repeat until stopping criterion is satisfied: 

a) Choose a member of the population using bi- 
nary tournament selection. 

b) With probability p, crossover selected individ- 
ual with one chosen randomly from the popula- 
tion. 

c) Modify the result with mutation using rate 1/n. 

d) Replace worst member of population with the 
new offspring. 

3. Stop. 
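
The steps above can be sketched in a few lines. This is a hedged illustration, not the implementation used for the experiments: it assumes uniform crossover for step (b), uses OneMax as the fitness function, stops as soon as an offspring reaches fitness n, and counts only offspring evaluations. All names are hypothetical.

```python
import random


def onemax(bits):
    """Fitness: the number of ones in the bit string."""
    return sum(bits)


def steady_state_ga(n=100, mu=10, p_cross=1.0, max_evals=200_000, seed=0):
    """Steady-state GA with optional uniform crossover on OneMax.

    Returns the number of offspring generated before the optimum is found,
    or None if the (assumed) budget is exhausted.
    """
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(mu)]
    for evals in range(1, max_evals + 1):
        # (a) Binary tournament: the fitter of two random members.
        a, b = rng.choice(pop), rng.choice(pop)
        child = list(a if onemax(a) >= onemax(b) else b)
        # (b) With probability p_cross, uniform crossover with a random member.
        if rng.random() < p_cross:
            mate = rng.choice(pop)
            child = [x if rng.random() < 0.5 else y for x, y in zip(child, mate)]
        # (c) Bitwise mutation at rate 1/n.
        child = [bit ^ 1 if rng.random() < 1.0 / n else bit for bit in child]
        # (d) The offspring replaces the worst member of the population.
        worst = min(range(mu), key=lambda i: onemax(pop[i]))
        pop[worst] = child
        if onemax(child) == n:
            return evals
    return None
```

Setting `p_cross=0.0` recovers the mutation-only algorithm, which makes it straightforward to compare the two settings on the same seeds.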
Fig. 42.6 Average time taken to optimize OneMax (100 
bits) for generational GA with tournament selection, mu- 
tation rate 0.01, population size 10, and varying crossover 
probabilities 

Fig. 42.7 Average time taken to optimize OneMax for var- 
ious string lengths for generational GA with tournament 
selection, mutation rate 1/n, population size 10. Solid line 
is with no crossover. Dotted line is with uniform crossover 
with probability 1 


We first consider an experiment on OneMax with 
n= 100. We use a population of size 10, and vary 
the crossover probability between 0 and 1. The results
are shown in Fig. 42.6, which shows the average 
time to find the optimum (averages over 20 runs). It 
can be seen that there is some improvement as the 
crossover rate increases, although the results are rather 
noisy (error bars represent one standard deviation). Ex- 
amining this case further, we compare the steady-state 
genetic algorithm with no crossover (p = 0) with uni- 
form crossover (p = 1) for different string lengths. The 
results are shown in Fig. 42.7. Here it is clear that 
there is a significant improvement, which appears to be 
increasing as the string length grows. To date, there 
is no theoretical analysis of why this should be the case.

The first example of a problem class for which crossover can
provably help is the following [42.12, 41]. We take the OneMax
function, and then create a trap just before the optimum

\[
\mathrm{jump}(x) =
\begin{cases}
0 & \text{if } \theta < \|x\| < n, \\
\|x\| & \text{otherwise,}
\end{cases}
\]

where $\|x\|$ is the number of ones in $x$ (see Fig. 42.8 for an
illustration). The idea is that the population first climbs
the hill to the threshold $\theta$ just before the trap. It is then 
rather unlikely that a single mutation event will create 
a string that crosses the gap and finds the global op- 
timum. However, crossing over two strings with just 
a few zeros in each may have a better chance of jumping 
the gap, especially if the zeros occur in different places 
in the two parents. To achieve this, some level of diver- 
sity must be maintained in the population — a subject 
discussed further in Sect. 42.7. 
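
To make the argument concrete, the sketch below defines the jump function and compares two idealised probabilities. Both helper functions are hypothetical and rest on stated assumptions: the mutation estimate assumes a bitwise mutation rate of 1/n applied to a string with exactly $\theta$ ones, and the crossover estimate assumes uniform crossover between two parents with $\theta$ ones each whose zeros lie in disjoint positions, ignoring any subsequent mutation.

```python
def jump(x, theta):
    """OneMax with a trap: fitness is 0 strictly between theta and n ones."""
    ones, n = sum(x), len(x)
    return 0 if theta < ones < n else ones


def mutation_jump_probability(n, theta):
    """Chance that one mutation flips every zero and leaves every one alone."""
    zeros = n - theta
    return (1.0 / n) ** zeros * (1.0 - 1.0 / n) ** theta


def crossover_jump_probability(n, theta):
    """Chance that uniform crossover picks the one at every differing position.

    With disjoint zeros the parents differ in 2 * (n - theta) positions,
    and each of them yields a one with probability 1/2.
    """
    return 0.5 ** (2 * (n - theta))
```

For n = 100 and a threshold of 98 the crossover estimate is 1/16, whereas the mutation estimate is of the order of a few times 10^-5, which illustrates why diversity in the positions of the zeros matters so much.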

For problem classes in which solutions are not nat- 
urally represented by binary strings, one has to think 
carefully about the best way to design a crossover 
operator. For example, take the case of the traveling 
salesman problem, in which a solution is given by a per- 
mutation of the cities, indicating the order in which they 
are to be visited. Over the years, a number of different 


42.7 Population Diversity 


For most forms of crossover, if one crosses an indi- 
vidual with itself, the resulting offspring will again be 
the same. Crossovers with this property are said to be 
pure [42.39]. This is certainly a property of one-point, 
two-point and uniform crossovers for binary strings. 
There is therefore no point in performing crossover if 
the population largely comprises copies of the same 
item. In fact, the whole idea of the population is rather 
wasted if this is the case. Rather, the hope is to gain ad- 
vantage by having different members of the population 
search different parts of the search space. It seems im- 
portant, therefore, to maintain a level of diversity in the 
population. 

The importance of this can be seen in solving the 
jump(x) problem described earlier in Sect. 42.6. Once 
the population has arrived at the local optimum, the 
individuals will typically contain $\theta - 1$ ones, and the 
rest zeros. If the zeros tend to fall in the same bit po- 


Fig. 42.8 The jump(x) function for string length n = 100, 
with threshold $\theta = 98$ 


crossover methods have been proposed. For the most 
part, these were designed so as to be respectful of the 
positions of the cities along the route. That is, if a partic- 
ular city was visited first by both parents, then it would 
also be visited first by the offspring. However, a much 
more effective approach is to try to preserve the edges 
between adjacent cities in a route. That is, if city A is 
followed immediately by city B by both parents, irre- 
spective of where this takes place in the route, then we 
should try to ensure that this also happens in the case 
of offspring. This kind of crossover is called an edge 
recombination operator [42.42]. 


sitions for each member of the population, then it will 
be impossible for crossover to jump across the gap. For 
example, crossing 


1111111000 
and 
1111110100 


cannot produce the optimum. If, however, we can ensure
diversity in the population, then two members 
at the local optimum might be 


1111111000 
and 
0110110111, 


which have a reasonable chance of jumping the gap. 
Several different mechanisms have been proposed 
to ensure some diversity is maintained in a population. 
The simplest way is to enforce it directly by not allow- 
ing duplicate individuals in the population [42.43]. So 
if an offspring is produced which is the same as some- 
thing already in the population, we just discard it. 

A second method is to adjust the replacement 
method in a steady-state genetic algorithm, by making 
sure the new offspring replaces something similar to it- 
self. For example, one could have a replacement rule 
that makes the offspring replace the population mem- 
ber most similar to it (as measured by the Hamming 
distance). This method, called crowding, will, of course, 
destroy the elitism property [42.44]. This can be sal- 
vaged by only doing the replacement if the offspring is 
at least as good as what it replaces. 

A third approach to diversity is to explicitly modify 
the fitness function in such a way that individuals re- 
ceive a penalty for being too similar to other population 
members [42.45]. This idea is called fitness sharing. We 
think of the fitness function as specifying a resource 
available to individuals. Similar individuals are compet- 
ing for the resource, which has to be shared out between 
them. 

A fourth approach is to limit the choice of the part- 
ner for crossover. So far, our algorithms have chosen 
the crossover partner uniformly at random. One could 
instead try to explicitly choose the most different
individual in the population [42.46]. 
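
The four mechanisms just described can each be written as a small helper. The sketch below is an illustration under assumptions: individuals are lists of bits, the function names are hypothetical, and the sharing rule is the simplified variant that subtracts the number of copies of an individual from its fitness when it has duplicates, not the original sharing scheme.

```python
def hamming(a, b):
    """Number of positions in which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def is_new(pop, child):
    """No duplicates: accept the child only if nothing identical exists."""
    return all(child != member for member in pop)


def crowding_replace(pop, child, fitness):
    """Crowding with the elitist safeguard.

    The child competes with the member closest to it in Hamming distance and
    replaces it only if the child is at least as fit.
    """
    i = min(range(len(pop)), key=lambda k: hamming(pop[k], child))
    if fitness(child) >= fitness(pop[i]):
        pop[i] = child
        return True
    return False


def shared_fitness(pop, individual, fitness):
    """Simplified sharing: penalise individuals that have multiple copies."""
    copies = pop.count(individual)
    return fitness(individual) - copies if copies > 1 else fitness(individual)


def most_different_partner(pop, individual):
    """Partner choice: the member furthest away in Hamming distance."""
    return max(pop, key=lambda member: hamming(member, individual))
```

Each helper slots into one step of the steady-state algorithm: `is_new` and `crowding_replace` change replacement, `shared_fitness` changes selection, and `most_different_partner` changes the choice of crossover partner.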

We test these various methods using the jump(x) 
problem of the previous section. Recall that for the 
population to jump across the gap requires a crossover 
between two diverse individuals. We work with a string 
length n = 100 and a threshold $\theta = 98$. As a baseline, 
consider the steady-state genetic algorithm, with binary 


42.8 Parallel Genetic Algorithms 


There have been a number of studies of different ways 
to parallelize genetic algorithms. There are two basic 
methods. The first is the island model, in which we 
have several populations evolving in parallel, which oc- 
casionally exchange members [42.47]. To specify such 
an algorithm, one needs to decide on the topology of 
the network of populations (that is, which populations 
can exchange members) and the frequency with which 
migrations can take place. We also need a method to 
decide on which individuals should be passed, and how 
they should be incorporated into the new population. 


Table 42.1 Success of various diversity methods in solving the jump(x)
problem. Each is tested for 20 trials, to a maximum of 10000 fitness
function evaluations. Mean and standard deviation refer just to
successful runs

Method          Successful runs (%)   Mean evaluations   Standard deviation
None                      5                 8860                 -
No duplicates            80                 3808                2728
Crowding                100                 1320                 389
Sharing                  65                 3452                2652
Partner choice           10                  688                 113


tournament selection, replacement of the worst, uni- 
form crossover (probability 1) and mutation rate 0.01. 
In 20 trials on the jump(x) function, only once did 
this algorithm succeed in finding the optimum within 
10000 generations. On that one successful run, it re- 
quired 8860 evaluations to complete. We compare this 
result with the same algorithm, modified in each of the 
four diversity-preserving methods above. In the case 
of fitness sharing, we simply penalize any individual 
which has multiple copies in the population by subtracting
the number of copies from the fitness (there are different
ways of doing this, and the original method is far more
complicated). The results 
of the experiments are summarized in Table 42.1. The 
best result is given by the crowding mechanism, but no- 
tice that it is essential to preserve elitism here, otherwise 
the algorithm never solves the problem within 10000 
generations. 

A different approach altogether is to structure the 
population in some way, so as to prevent certain individuals
from interacting. We take up this idea, in a more general 
context, in the following section. 


That is, we need a form of selection and replacement 
for the migration stage. 

As an example, consider having several steady-state 
genetic algorithms operating in parallel. After a certain 
number of generations, we choose a member of each 
population to migrate — for example, the best one in 
each population. We copy this individual to the neigh- 
boring populations, according to the chosen topology. 
To keep things simple, we choose the complete topol- 
ogy, in which every pair of populations is connected. 
Thus, each population receives a copy of the best from 
all the other populations. These now have to be incorporated
into the home population somehow. An easy 
method is to take the best of all the incoming indi- 
viduals, and use it to replace the worst in the current 
population. Such an algorithm will look like this: 


1. Create m populations, each of size $\mu$. 

2. Update each population for c generations. 

3. For each population, replace the worst individual by 
the best of the remaining populations. 

4. Go to 2. 


To take an extreme case, if the population size is 
$\mu = 1$ and we migrate every generation (c = 1), then 
this is rather similar to a (1, m) EA. 
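
A sequential simulation of this island scheme might look as follows. It is a sketch under assumptions, not a parallel implementation: each island runs a mutation-only steady-state update on OneMax, one steady-state update counts as one generation, the topology is complete, and the names and default parameters are hypothetical.

```python
import random


def onemax(bits):
    """Fitness: the number of ones in the bit string."""
    return sum(bits)


def steady_state_step(pop, n, rng):
    """Binary tournament, mutation at rate 1/n, replacement of the worst."""
    a, b = rng.choice(pop), rng.choice(pop)
    parent = a if onemax(a) >= onemax(b) else b
    child = [bit ^ 1 if rng.random() < 1.0 / n else bit for bit in parent]
    worst = min(range(len(pop)), key=lambda i: onemax(pop[i]))
    pop[worst] = child


def migrate(islands):
    """Each island's worst member is replaced by the best of the other islands."""
    bests = [max(pop, key=onemax) for pop in islands]
    for i, pop in enumerate(islands):
        incoming = [b for j, b in enumerate(bests) if j != i]
        if incoming:
            worst = min(range(len(pop)), key=lambda k: onemax(pop[k]))
            pop[worst] = list(max(incoming, key=onemax))


def island_ga(n=50, m=4, mu=10, c=20, max_rounds=5000, seed=0):
    """Return the number of migration rounds needed, or None on failure."""
    rng = random.Random(seed)
    islands = [[[rng.randint(0, 1) for _ in range(n)] for _ in range(mu)]
               for _ in range(m)]
    for rounds in range(1, max_rounds + 1):
        for pop in islands:
            for _ in range(c):
                steady_state_step(pop, n, rng)
        if any(onemax(ind) == n for pop in islands for ind in pop):
            return rounds
        migrate(islands)
    return None
```

In a genuinely parallel setting the inner loop over islands would run on separate processors, and only `migrate` would require communication.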

There are two possible advantages of the island 
model. Firstly, it is straightforward to distribute it on 
a genuinely parallel processing architecture, leading to 
performance gains. Secondly, there may be some prob- 
lem classes for which the use of different populations 
can help. The idea is that different populations may 
explore different parts of the search space, or develop 
different partial solutions. The parameter c is chosen 
large enough to allow some progress to be made. The 
migration stage then allows efforts in different direc- 
tions to be shared, and workable partial solutions to be 
combined. Some recent theoretical progress has been 
made in analyzing this situation for certain problem 
classes [42.48]. 

A particular case of such a model is found in co- 
evolutionary algorithms. These come in two flavors: 
competitive and co-operative. In a competitive algo- 
rithm, we typically have two parallel populations. One 
represents solutions to a problem, and the other repre- 
sents problem instances. The idea is that, as the former 
population is finding better solutions, so the latter is 
finding harder instances to test these solutions. An early 
example evolved sorting networks for sorting lists of 
integers [42.49]. While one population contained dif- 
ferent networks, the other contained different lists to be 
sorted. The fitness of a network was judged by its ability 
to sort the problem instances. The fitness of a problem 
instance was its ability to cause trouble for the net- 
works. As the instances get harder, the sorting networks 
become more sophisticated. 

In a co-operative co-evolutionary algorithm, the 
different populations work together to solve a single 
problem [42.50]. This is done by dividing the problem 
into pieces, and letting each population evolve a solu- 
tion for each piece. The fitness of a piece is judged by 
combining it with pieces from other populations and 
evaluating the success of the whole. Theoretical anal- 


ysis shows that certain types of problems can benefit 
from this approach by allowing greater levels of explo- 
ration than in a single population [42.51]. 

The second parallel model for genetic algorithms 
is the fine grained model, in which there is a sin- 
gle population, but with the members of the popu- 
lation distributed spatially, typically on a rectangular 
grid [42.52]. At each time step, each individual is 
crossed over with a neighbor. The resulting offspring 
replaces the original if it is better, according to the fit- 
ness function. The algorithm looks like this: 


1. Create an initial random population of size m. 
2. In parallel, for each individual x: 
   a) Choose a random neighbor y of x. 
   b) Cross over x and y to form z. 
   c) Mutate z to form the offspring. 
   d) Replace x with the offspring if it is better. 
3. Go to 2. 


Notice that such an algorithm is generational, but 
also elitist. There has been very little analysis of this 
kind of architecture, although there is some empirical 
evidence that it can be effective, especially for problems 
with multiple objectives, in which different tradeoffs 
can emerge in different parts of the population [42.53]. 

To illustrate the fine-grained parallel genetic algorithm, consider
a ring topology, in which the kth member of the population has as
neighbors the (k-1)th and (k+1)th members (wrapping round at the
ends). We try it on OneMax, with a population of size 10, using
uniform crossover and, as usual, a mutation rate of 1/n. The results,
for a variety of string lengths, are shown in Fig. 42.9. We can see
that the parallel algorithm is competitive with the steady-state
genetic algorithm. Note that fitness function evaluations are plotted,
and not generations.

Fig. 42.9 The average time (in fitness function evaluations) for the
ring-topology parallel genetic algorithm to find the optimum for OneMax
for a variety of string lengths. Population size 10, mutation rate =
1/n, uniform crossover
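
The ring-topology algorithm can be simulated sequentially as sketched below. Two details are assumptions of this example rather than statements from the text: all individuals are updated synchronously from the previous time step, and "better" is read as strictly better. The function names are hypothetical.

```python
import random


def onemax(bits):
    """Fitness: the number of ones in the bit string."""
    return sum(bits)


def ring_ga(n=100, m=10, max_steps=100_000, seed=0):
    """Fine-grained GA on a ring; returns fitness evaluations used, or None."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(m)]
    for step in range(1, max_steps + 1):
        new_pop = []
        for k, x in enumerate(pop):
            # Neighbours are members k-1 and k+1, wrapping round at the ends.
            y = pop[(k + rng.choice((-1, 1))) % m]
            z = [a if rng.random() < 0.5 else b for a, b in zip(x, y)]
            child = [bit ^ 1 if rng.random() < 1.0 / n else bit for bit in z]
            # Keeping x unless the child is better makes the scheme elitist.
            new_pop.append(child if onemax(child) > onemax(x) else x)
        pop = new_pop
        if any(onemax(ind) == n for ind in pop):
            return step * m  # m offspring are evaluated per time step
    return None
```

Because each individual only ever mates with its two neighbours, good building blocks spread around the ring gradually, which is the mechanism expected to preserve diversity.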

The distributed nature of the population should 
help to maintain a level of diversity. Hence, we would 
expect a parallel genetic algorithm (with crossover) 
to perform reasonably well on the jump(x) function 


42.9 Populations as Solutions 


We finish this chapter by considering a genetic algo- 
rithm for which the population as a whole represents the 
solution to the problem, rather than it being a collection 
of individual solutions. It is therefore an example of co- 
operative co-evolution taking place within a single pop- 
ulation. Moreover, it is one of the very few examples 
involving a population-based genetic algorithm using 
crossover, for which a serious theoretical analysis ex- 
ists. It is one of the highlights of the theory of genetic 
algorithms to date [42.14, 54, 55]. 

The problem we are addressing is the classical All- 
Pairs Shortest Path problem. We are given a graph with 
vertex set V (containing n vertices) and edge set E with 
positive weighted edges. The goal is to determine the 
shortest path between every pair of vertices in the graph, 
where length is given by summing the weights along the 
path. 

To clarify what is meant exactly by a path, first 
consider a sequence of vertices, $v_1, \ldots, v_m$, such that 
$(v_k, v_{k+1}) \in E$ for all $k = 1, \ldots, m-1$. Such a sequence 
is called a walk. A path is then a walk with no repeated 
vertices. Since for any walk between two vertices there 
is a shorter path (by omitting any loops), we can equiva- 
lently consider the problem of finding the shortest walks 
between any two vertices. 

The population will represent a solution to the 
problem for a given graph, by having each individual 
representing a walk between two vertices. The prob- 
lem is solved when the population contains exactly the 
shortest paths for all of the $n(n-1)$ pairs of vertices. 

The algorithm will be a steady-state genetic algorithm,
with random selection and a replacement method 
which enforces diversity. For any pair of vertices we will 
allow at most one walk between them to exist in the 
population. The outline of the algorithm is as follows: 


1. Initialize the population to be E. 
2. Select a population member uniformly at random. 


of Sect. 42.6. Over 20 trials of our ring-based fine- 
grained algorithm, on the jump(x) function with n = 
100 and $\theta = 98$, we find that it solves the problem 
on all trials, requiring an average of 2924 function 
evaluations (standard deviation is 1483). It is there- 
fore competitive with the simple diversity enforcement 
method on this problem (compare with results in Ta- 
ble 42.1). 


3. With probability p do crossover, else do mutation. 

4. If a walk with the same start and end is in popula- 
tion, replace it with offspring, if offspring length is 
no worse. 

5. If a walk with same start and end is not in popula- 
tion, add offspring walk to population. 

6. Go to 2. 


We can see from line 5 that another unusual feature 
of the algorithm is that the population size can grow. 
Indeed, it starts with just the edge set from the graph 
and has to grow to get the paths between all pairs of 
vertices. 

In line 3, we see there is a choice between muta- 
tion and crossover, governed by a parameter p. Given 
that our individuals are walks in a graph (rather than bi- 
nary strings) it is clear that we need to specify some 
special purpose operators. We define mutation to be 
a random lengthening or shrinking of a walk as follows. 


Suppose that the selected walk is $v_1, v_2, \ldots, v_{m-1}, v_m$.
We randomly select a vertex from the neighbors of $v_1$ and $v_m$.
If this is neither $v_2$ nor $v_{m-1}$ then we append it to the walk.
If it is one of $v_2$ or $v_{m-1}$ then we truncate the path at that
end. This process is repeated a number of times, given by choosing an
integer s according to a Poisson distribution with parameter
$\lambda = 1$. We then perform s + 1 mutations to generate the
offspring walk. 

As an example, consider the graph illustrated in 
Fig. 42.10 (notice that the edge weights are not shown). 
Suppose we have selected the walk (3,4, 5, 6) from the 
population to mutate. We choose our random Poisson 
variable s — let us say it is one. So we have two muta- 
tions to apply. We gather the set of vertices connected to 
the two end points. That is, {1,4, 8} and {5,7, 10, 11}, 
and we pick one of these at random. Let us suppose 
that 8 is selected. We therefore extend our walk to be- 
come (8,3,4,5,6). For the second mutation, we again 
Fig. 42.10 An example graph for the All Pairs Shortest Path prob- 
lem (edge weights not shown) 


collect together the vertices attached to the end points: 
{3,9} and {5,7, 10,11}. Choosing one at random we 
select, say, 5. Since this is a vertex in the walk prior to 
an end point, we truncate to produce the final offspring 
(8,3, 4,5). If there is already a walk from 8 to 5 in the 
population, we replace it with the new one, only if the 
new one is in fact shorter (that is, the sum of the weights 
is less). If there is no such walk, then we add the new 
one to the population. 
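
The mutation operator and the worked example can be sketched as follows. Several things here are assumptions: `ADJ` is only a hypothetical fragment of the graph in Fig. 42.10 containing the neighbourhoods quoted in the example, the Poisson sampler is a standard multiplication method written out because the Python standard library does not provide one, and a walk consisting of a single edge is left unchanged rather than truncated.

```python
import math
import random

# Hypothetical fragment of Fig. 42.10: only the neighbourhoods quoted in the
# worked example are listed; the rest of the graph is not reproduced here.
ADJ = {3: [1, 4, 8], 6: [5, 7, 10, 11], 8: [3, 9]}


def mutate_once(walk, adj, pick):
    """Lengthen or shrink the walk at one end; `pick` chooses from a list."""
    options = ([("start", v) for v in adj[walk[0]]]
               + [("end", v) for v in adj[walk[-1]]])
    side, v = pick(options)
    if side == "start":
        if v == walk[1]:  # stepping back along the walk: truncate
            return walk[1:] if len(walk) > 2 else walk
        return [v] + walk
    if v == walk[-2]:
        return walk[:-1] if len(walk) > 2 else walk
    return walk + [v]


def poisson(lam, rng):
    """Sample a Poisson variate by multiplying uniforms until below exp(-lam)."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def mutate(walk, adj, rng):
    """Apply s + 1 elementary mutations, with s drawn from Poisson(1)."""
    for _ in range(poisson(1.0, rng) + 1):
        walk = mutate_once(walk, adj, rng.choice)
    return walk
```

Passing a deterministic `pick` that selects vertex 8 at the start and then vertex 5 at the end turns the walk (3, 4, 5, 6) into (8, 3, 4, 5, 6) and then into (8, 3, 4, 5), as in the example.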

To perform crossover, we have to be careful to en- 
sure that we end up with a valid walk. Suppose that 
the individual we have selected is a walk from u to v. 
We consider all the members of the population that start 
from v, but exclude the one that goes back to u. We then 
choose a member of this set, uniformly at random, and 
concatenate it to our original walk. This guarantees that 
the offspring is again a valid walk. 

For example, consider again the graph in Fig. 42.10 
and the selected walk (3, 4, 5, 6). We need to first collect 
together all the walks in the population that start from 
vertex 6, excluding the one (if it exists) that goes from 


42.10 Conclusions 


We have seen that the defining feature of a genetic al- 
gorithm is its maintenance of a population, which is 
used to search for an optimal (or at least sufficiently 
good) solution to the problem class at hand. For prob- 
lems where solutions are naturally represented as binary 
strings, there is an obvious analogy with an evolving 
population of individuals in nature, with the strings pro- 
viding the DNA. Analogs of mutation and crossover 
(recombination) are then readily definable and can be 
used as search operators. The theory that describes the 


6 to 3. Imagine that we find the following: 


(6, 10, 5,2) 
(6,5,2,7) 
(6,11,10) 


and we pick one at random — say, the second one. 
Concatenating this to the original walk produces the 
offspring (3,4,5, 6,5, 2,7). Notice that this is a walk, 
rather than a path, since vertex 5 is repeated. If this is 
better than any existing walk from vertex 3 to vertex 7, 
then it replaces it. If there is no such walk in the popu- 
lation, then the new one gets added. 
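
The crossover and replacement rules can be sketched with the population stored as a dictionary keyed by (start, end), which enforces the rule of at most one walk per pair of vertices. The names are hypothetical, edge weights are assumed to be given as a dictionary over ordered vertex pairs, and the replacement test follows step 4 of the outline (the offspring replaces an existing walk if it is no worse).

```python
def walk_length(walk, weight):
    """Sum of the edge weights along the walk."""
    return sum(weight[(a, b)] for a, b in zip(walk, walk[1:]))


def partners(walk, population):
    """Walks that start where `walk` ends, excluding the one back to its start."""
    u, v = walk[0], walk[-1]
    return [w for (start, end), w in population.items()
            if start == v and end != u]


def crossover(walk, population, pick):
    """Concatenate the walk with a partner chosen by `pick`; None if no partner."""
    options = partners(walk, population)
    if not options:
        return None
    return walk + pick(options)[1:]  # drop the shared junction vertex


def offer(population, child, weight):
    """Add the child, or let it replace the incumbent if it is no worse."""
    key = (child[0], child[-1])
    incumbent = population.get(key)
    if incumbent is None or walk_length(child, weight) <= walk_length(incumbent, weight):
        population[key] = child
        return True
    return False
```

With the three candidate walks of the example in the population, choosing the second one yields the offspring (3, 4, 5, 6, 5, 2, 7). In the full algorithm `pick` would be a uniform random choice.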

A considerable amount of theoretical analysis has 
been done for this genetic algorithm on the class of All 
Pairs Shortest Path problems. If we run the algorithm 
with no crossover (that is, we set p = 0) then it can be 
shown that it requires $\Theta(n^4)$ generations for the population
to converge to the optimal set of paths. Adding 
crossover by choosing 0 < p < 1 improves the performance
to $O(n^3 \log n)$. Note that the classical approach 
to solving this problem (the Floyd-Warshall algorithm) 
requires $O(n^3)$ time, which is faster and includes all 
computations that have to be performed (i.e., this is not 
just a count of the black box function evaluations). Of 
course, the classical algorithm has full details of the 
problem instance on which it is working, whereas the 
genetic algorithm is operating blind. Despite this great 
disadvantage, the genetic algorithm only pays a cost of 
a factor of log n over the classical approach (in addition 
to any implementation overhead). 
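
For comparison, the classical algorithm mentioned above fits in a few lines. This is the standard Floyd-Warshall dynamic programme written as an illustrative sketch; the input format (a dictionary of directed edge weights over vertices 0 to n-1) is an assumption of the example.

```python
import math


def floyd_warshall(n, edges):
    """All-pairs shortest path lengths for vertices 0 .. n-1.

    `edges` maps (u, v) to a positive weight. Unreachable pairs keep the
    distance math.inf. The three nested loops give the cubic running time.
    """
    dist = [[0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for (u, v), w in edges.items():
        dist[u][v] = min(dist[u][v], w)
    for k in range(n):  # allow vertex k as an intermediate stop
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist
```

Unlike the genetic algorithm, this routine reads the edge weights directly, which is the sense in which it has full details of the problem instance.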

This example is one of the only cases where we 
have a proof that crossover helps for a naturally defined 
problem class. It is an important open problem to find 
others. 


trajectory of a population under such operators, in gen- 
eral terms, is well developed [42.5]. 

What is much less clear is the question of when all 
this is worth doing. That is, if our primary interest is in 
solving problems efficiently, in what circumstances is 
a genetic algorithm a good choice? To begin to answer 
this question requires an in-depth theoretical analysis of 
algorithms and problem classes. Work in this area 
has only just begun: most known results relate to the 
so-called (1+1) EA (that is, a population of size 1), 
and very little work exists on the role of crossover. The 
All Pairs Shortest Path example represents the current 
state of the art in this respect. 

Having said that, it is clear (at least empirically 
and anecdotally) that genetic algorithms can be very ef- 
fective for complex problems, where problem instance 
James McDermott, Una-May O'Reilly 


Genetic programming (GP) is the subset of evolutionary computation
in which the aim is to create executable programs. It is an exciting
field with many applications, some immediate and practical, others
long-term and visionary. In this chapter, we provide a brief history
of the ideas of genetic programming. We give a taxonomy of approaches
and place genetic programming in a broader taxonomy of artificial
intelligence. We outline some current research topics and point to
successful use cases. We conclude with some practical GP-related
resources including software packages and venues for GP publications. 
43.1 Evolutionary Search for Executable Programs 


There have been many attempts to artificially emulate 
human intelligence, from symbolic artificial intelli- 
gence (AI) [43.1] to connectionism [43.2, 3], to subcog- 
nitive approaches like behavioral AI [43.4] and statisti- 
cal machine learning (ML) [43.5], and domain-specific 
achievements like web search [43.6] and self-driving 
cars [43.7]. Darwinian evolution [43.8] has a type of 
distributed intelligence distinct from all of these. It has 
created lifeforms and ecosystems of amazing diversity, 
complexity, beauty, facility, and efficiency. It has even 
created forms of intelligence very different from itself, 
including our own. 

The principles of evolution — fitness biased selec- 
tion and inheritance with variation — serve as inspiration 
for the field of evolutionary computation (EC) [43.9], 
an adaptive learning and search approach which is 
general-purpose, applicable even with black-box per- 
formance feedback, and highly parallel. EC is a trial- 
and-error method: individual solutions are evaluated 
for fitness, good ones are selected as parents, and 
new ones are created by inheritance with variation 
(Fig. 43.1). 

GP is the subset of EC in which the aim is to create 
executable programs. The search space is a set of pro- 
grams, such as the space of all possible Lisp programs 
within a subset of built-in functions and functions com- 
posed by a programmer or the space of numerical C 
functions. The program representation is an encoding of 
such a search space, for example an abstract syntax tree 
or a list of instructions. A program’s fitness is evaluated 
by executing it to see what it does. New programs are 
created by inheritance and variation of material from
parent programs, with constraints to ensure syntactic
correctness.

Fig. 43.1 The fundamental loop of EC (diagram labels: empty population, random initialization, population, fitness evaluation and selection, parents, crossover and mutation, children, replacement)

We define a program as a data structure capable 
of being executed directly by a computer, or of being 
compiled to a directly executable form by a compiler, 
or of interpretation, leading to execution of low-level 
code, by an interpreter. A key feature of some pro- 
gramming languages, such as Lisp, is homoiconicity: 
program code can be viewed as data. This is essential in 
GP, since when the algorithm operates on existing pro- 
grams to make new ones, it is regarding them as data; 
but when they are being executed in order to determine 
what they do, they are being regarded as the program 
code. This double meaning echoes that of DNA (de- 
oxyribonucleic acid), which is both data and code in 
the same sense. 

GP exists in many different forms which differ
(among other ways) in their executable representation.
As in programming by hand, GP usually considers and
composes programs of varying length. Programs are
also generally hierarchical in some sense, with nesting
of statements or control. These representation properties
(variable length and hierarchical structure) raise
a very different set of technical challenges for GP
compared to typical EC.

GP is very promising, because programs are so general.
A program can define and operate on any data
structure, including numbers, strings, lists, dictionaries,
sets, permutations, trees, and graphs [43.10-12]. Via
Turing completeness, a program can emulate any model
of computation, including Turing machines, cellular
automata, neural networks, grammars, and finite-state
machines [43.13-18].

A program can be a data regression model [43.19]
or a probability distribution. It can express the growth
process of a plant [43.20], the gait of a horse [43.21],
or the attack strategy of a group of lions [43.22]; it
can model behavior in the Prisoner’s Dilemma [43.23]
or play chess [43.24], Pacman [43.25], or a car-racing
game [43.26]. A program can generate designs for
physical objects, like a space-going antenna [43.27], or
plans for the organization of objects, like the layout of
a manufacturing facility [43.28]. A program can implement
a rule-based expert system for medicine [43.29],
a scheduling strategy for a factory [43.30], or an
exam timetable for a university [43.31]. A program
can recognize speech [43.32], filter a digital signal
[43.33], or process the raw output of a brain-computer
interface [43.34]. It can generate a piece
of abstract art [43.35], a 3-D (three-dimensional)
architectural model [43.36], or a piece of piano music
[43.37].

A program can interface with natural or man-made
sensors and actuators in the real world, so it can both
act and react [43.38]. It can interact with a user or with
remote sites over the network [43.39]. It can also introspect
and copy or modify itself [43.40]. A program can
be nondeterministic [43.41]. If true AI is possible, then
a program can be intelligent [43.42].


43.2 History


GP has a surprisingly long history, dating back to very
shortly after von Neumann’s 1945 description of the
stored-program architecture [43.43] and the 1946 creation
of ENIAC [43.44], sometimes regarded as the
first general-purpose computer. In 1948, Turing stated
the aim of machine intelligence and recognized that
evolution might have something to teach us in this regard
[43.45]:


Further research into intelligence of machinery will
probably be very greatly concerned with searches.
[...] There is the genetical or evolutionary search
by which a combination of genes is looked for, the
criterion being survival value. The remarkable success
of this search confirms to some extent the idea
that intellectual activity consists mainly of various
kinds of search.
However, Turing also went a step further. In 1950, he
more explicitly stated the aim of automatic program- 
ming (AP) and a mapping between biological evolution 
and program search [43.46]: 


We have [...] divided our problem [automatic programming]
into two parts. The child-program [Turing machine]
and the education process. These two
remain very closely connected. We cannot expect to 
find a good child-machine at the first attempt. One 
must experiment with teaching one such machine 
and see how well it learns. One can then try another 
and see if it is better or worse. There is an obvious 
connection between this process and evolution, by 
the identifications: 


• Structure of the child machine = Hereditary material
• Changes = Mutations
• Natural selection = Judgment of the experimenter.


This is an unmistakeable, if abstract, description of GP 
(though a computational fitness function is not envis- 
aged). 

Several other authors expanded on the aims and vision
of AP and machine intelligence. In 1959 Samuel
wrote that the aim was to be able to "Tell the computer
what to do, not how to do it" [43.47]. An important early
attempt at implementation of AP was the 1958 learning 
machine of Friedberg [43.48]. 

In 1963, McCarthy summarized [43.1] several rep- 
resentations with which machine intelligence might 
be attempted: neural networks, Turing machines, and 
calculator programs. With the latter, McCarthy was re- 
ferring to Friedberg’s work. McCarthy was prescient 
in identifying important issues such as representations, 
operator behavior, density of good programs in the 
search space, sufficiency of the search space, appro- 
priate fitness evaluation, and self-organized modularity. 
Many of these remain open issues in GP [43.49]. 

Fogel et al.’s 1960s evolutionary programming may 
be the first successful implementation of GP [43.50]. 
It used a finite-state machine representation for pro- 
grams, with specialized operators to ensure syntactic 
correctness of offspring. A detailed history is available 
in Fogel’s 2006 book [43.51]. 

In the 1980s, inspired by the success of genetic al- 
gorithms (GAs) and learning classifier systems (LCSs), 
several authors experimented with hierarchically struc- 
tured and program-like representations. Smith [43.52] 
proposed a representation of a variable-length list of 
rules which could be used for program-like behavior 
such as maze navigation and poker. Cramer [43.53] was 
the first to use a tree-structured representation and ap- 
propriate operators. With a simple proof of concept, it 
successfully evolved a multiplication function in a sim- 
ple custom language. Schmidhuber [43.54] describes 
a GP system with the possibility of Turing complete- 
ness, though the focus is on meta-learning aspects. 
Fujiki and Dickinson [43.55] generated Lisp code for 
the prisoner’s dilemma. Bickel and Bickel [43.56] used
a GA to create variable-length lists of rules, each of 
which had a tree structure. An artificial life approach 
using machine-code genomes was used by Ray [43.57]. 
All of these would likely be regarded as on-topic in 
a modern GP conference. 

However, the founding of the modern field of GP, 
and the invention of what is now called standard GP, 
are credited to Koza [43.19]. In addition to the abstract 
syntax tree notation (Sect. 43.3.3), the key innovations 
were subtree crossover (Sect. 43.3.3) and the description
and set-up of many test problems. In this and
later research [43.10, 58, 59] symbolic regression of
synthetic data and real-world time series, Boolean problems,
and simple robot control problems such as the
lawnmower problem and the artificial ant with Santa
Fe trail were introduced as benchmarks and solved
successfully for the first time, demonstrating that GP
was a potentially powerful and general-purpose method
capable of solving machine learning-style problems,
albeit conventional academic versions of them. Mutation
was minimized in order to make it clear that
GP was different from random search. GP took on
its modern form in the years following Koza’s 1992
book: many researchers took up work in the field, new
types of GP were developed (Sect. 43.3), successful
applications appeared (Sect. 43.4), key research topics
were identified (Sect. 43.5), further books were
written, and conferences and journals were established
(Sect. 43.6).

Another important milestone in the history of GP
was the 2004 establishment of the Humies, the awards
for human-competitive results produced by EC methods.
The entries are judged for matching or exceeding
human-produced solutions to the same or similar problems,
and for criteria such as patentability and publishability.
The impressive list of human-competitive
results [43.60] again helps to demonstrate to researchers
and clients outside the field of GP that it is powerful and
general purpose.

Fig. 43.2a-c The StdGP representation is an abstract syntax tree. The expression that will be evaluated in the second
tree from left is, in inorder notation, (x * y) - (x + 2). In preorder, or the notation of Lisp-style S-expressions, it is
(- (* x y) (+ x 2)). GP presumes that the variables x and y will be already bound to some value in the execution environment
when the expression is evaluated. It also presumes that the operations * and -, etc. are also defined. Note that all
interior tree nodes are effectively operators in some computational language. In standard GP parlance, these operators
are called functions and the leaf tree nodes which accept no arguments and typically represent variables bound to data
values from the problem domain are referred to as terminals


43.3 Taxonomy of AI and GP


In this section, we present a taxonomy which firstly
places GP in the context of the broader fields of EC,
ML, and artificial intelligence (AI). It then classifies GP
techniques according to their representations and their
population models (Fig. 43.3).


43.3.1 Placing GP in an AI Context


GP is a type of EC, which is a type of ML,
which is itself a subset of the broader field of AI
(Fig. 43.3). Carbonell et al. [43.61] classify ML techniques
according to the underlying learning strategy,
which may be rote learning, learning from instruction,
learning by analogy, learning from examples, and
learning from observation and discovery. In this classification,
EC and GP fit in the learning from examples
category, in that an (individual, fitness) pair is an example
drawn from the search space together with its
evaluation.

It is also useful to see GP as a subset of another 
field, AP. The term automatic programming seems to 
have had different meanings at different times, from 
automated card punching, to compilation, to template- 
driven source generation, then generation techniques 
such as the Unified Modeling Language (UML), to the
ambitious aim of creating software directly from a natural-
language English specification [43.62]. We interpret AP 
to mean creating software by specifying what to do 
rather than how to do it [43.47]. GP clearly fits into this 
category. Other nonevolutionary techniques also do so, 
for example inductive programming (IP). The main dif- 
ference between GP and IP is that typically IP works 
only with programs which are known to be correct, 
achieving this using inductive methods over the specifications
[43.63]. In contrast, GP is concerned mostly
with programs which are syntactically correct, but
behaviorally suboptimal.

Fig. 43.3 A taxonomy of AI, EC, and GP


43.3.2 Taxonomy of GP 


It is traditional to divide EC into four main subfields: 
evolution strategies (ES) [43.64, 65], evolutionary pro- 
gramming (EP) [43.50], GAs [43.66], and GP. In this 
view, ES is chiefly characterized by real-valued opti- 
mization and self-adaptation of algorithm parameters; 
EP by a finite-state machine representation (later gen- 
eralized) and the absence of crossover; GA by the 
bitstring representation; and GP by the abstract syn- 
tax tree representation. While historically useful, this 
classification is not exhaustive: in particular it does 
not provide a home for the many alternative GP rep- 
resentations which now exist. It also separates EP and 
GP, though they are both concerned with evolving pro- 
grams. We prefer to use the term GP in a general sense 
to refer to all types of EC which evolve programs. We 
use the term standard GP (StdGP) to mean Koza-style 
GP with a tree representation. With this view, StdGP 
and EP are types of GP, as are several others discussed 
below. In the following, we classify GP algorithms ac- 
cording to their representation and according to their 
population model. 


43.3.3 Representations 


Throughout EC, it is useful to contrast direct and indi- 
rect representations. Standard GP is direct, in that the 
genome (the object created and modified by the genetic 
operators) serves directly as an executable program. 
Some other GP representations are indirect, meaning 
that the genome must be decoded or translated in 
some way to give an executable program. An example 
is grammatical evolution (GE, see below), where the 
genome is an integer array which is used to generate 
a program. Indirect representations have the advantage 
that they may allow an easier definition of the genetic 
operators, since they may allow the genome to exist 
in a rather simpler space than that of executable pro- 
grams. Indirect representations also imitate somewhat 
more closely the mechanism found in nature, a mapping 
from DNA (deoxyribonucleic acid) to RNA (ribonu- 
cleic acid) to mRNA (messenger RNA) to codons to 
proteins and finally to cells. The choice between direct 
and indirect representations also affects the structure 
of the fitness landscape (Sect. 43.5.2). In the following,
we present a nonexhaustive selection of the main
representations used in GP, in each case describing
initialization and the two key operators: mutation and
crossover.


Standard GP 
In Standard GP (StdGP), the representation is an ab- 
stract syntax tree, or can be seen as a Lisp-style 
S-expression. All nodes are functions and all arguments 
are the same type. A function accepts zero or more 
arguments and returns a single value. Trees can be ini- 
tialized by recursive random growth starting from a null 
node. StdGP uses parameterized initialization methods 
that diversify the size and structure of initial trees. Fig- 
ure 43.2a shows a tree in the process of initialization. 

Trees can be crossed over by cutting and swap- 
ping the subtrees rooted at randomly chosen nodes, as 
shown in Fig. 43.2b. They can be mutated by cutting 
and regrowing from the subtrees of randomly cho- 
sen nodes, as shown in Fig. 43.2c. Another mutation 
operator, HVL-Prime, is shown later in Fig. 43.11. 
Note that crossover or mutation creates an offspring 
of potentially different size and structure, but the off- 
spring remains syntactically valid for evaluation. With 
these variations, a tree could theoretically grow to 
infinite size or height. To circumvent this, as a prac- 
ticality, a hard parameterized threshold for size or 
height or some other measure is used. Offspring that violate
the threshold are typically rejected. Bias may also
be applied in the randomized selection of crossed- 
over subtree roots. A common variation of StdGP 
is strongly typed GP (STGP) [43.67, 68], which supports
functions accepting arguments and returning values
of specific types by means of specialized mutation
and crossover operations that respect these
types.
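
The StdGP representation and operators described above can be sketched as follows. This is a minimal illustration, not Koza's implementation: trees are nested Python tuples, all functions are assumed to be binary, and the names FUNCTIONS, TERMINALS, grow, crossover, and mutate, the terminal probability 0.3, and the depth limit of 6 are assumptions of this sketch. Returning the parent unchanged when the limit is violated is one possible way to reject an offspring.

```python
import operator
import random

FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
TERMINALS = ["x", "y", 2]

def evaluate(tree, env):
    """Execute a tree: tuples are function nodes, anything else a terminal."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return FUNCTIONS[op](evaluate(left, env), evaluate(right, env))
    return env[tree] if isinstance(tree, str) else tree

def grow(rng, max_depth):
    """Recursive random growth, stopping at a hard depth limit."""
    if max_depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    return (rng.choice(list(FUNCTIONS)),
            grow(rng, max_depth - 1), grow(rng, max_depth - 1))

def paths(tree, prefix=()):
    """Yield the path (tuple of child indices) to every node in the tree."""
    yield prefix
    if isinstance(tree, tuple):
        for i in (1, 2):
            yield from paths(tree[i], prefix + (i,))

def get(tree, path):
    """Return the subtree rooted at the given path."""
    for i in path:
        tree = tree[i]
    return tree

def put(tree, path, subtree):
    """Return a copy of tree with the node at path replaced by subtree."""
    if not path:
        return subtree
    node = list(tree)
    node[path[0]] = put(tree[path[0]], path[1:], subtree)
    return tuple(node)

def depth(tree):
    if isinstance(tree, tuple):
        return 1 + max(depth(tree[1]), depth(tree[2]))
    return 0

def crossover(rng, mum, dad, max_depth=6):
    """Subtree crossover: graft a random subtree of dad into mum."""
    donor = get(dad, rng.choice(list(paths(dad))))
    child = put(mum, rng.choice(list(paths(mum))), donor)
    return child if depth(child) <= max_depth else mum   # reject if too deep

def mutate(rng, tree, max_depth=6):
    """Subtree mutation: regrow from a randomly chosen node."""
    child = put(tree, rng.choice(list(paths(tree))), grow(rng, 2))
    return child if depth(child) <= max_depth else tree

example = ("-", ("*", "x", "y"), ("+", "x", 2))   # (- (* x y) (+ x 2))
print(evaluate(example, {"x": 3, "y": 4}))         # (3 * 4) - (3 + 2) = 7
```

Because every node returns a value of the same type, any subtree can replace any other, which is why the offspring of crossover and mutation always remain valid programs.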


Executable Graph Representations 
A natural generalization of the executable tree rep- 
resentation of StdGP is the executable graph. Neural 
networks can be seen as executable graphs in which 
each node calculates a weighted sum of its inputs and 
outputs the result after a fixed shaping function such 
as tanh(). Parallel and distributed GP (PDGP) [43.69] 
is more closely akin to StdGP in that nodes calculate 
different functions, depending on their labels, and do 
not perform a weighted sum. It also allows the topol- 
ogy of the graph to vary, unlike the typical neural 
network. Cartesian GP (CGP) [43.70] uses an integer- 
array genome and a mapping process to produce the 
graph. Each block of three integer genes codes for 
a single node in the graph, specifying the indices of 
its inputs and the function to be executed by the node
(Fig. 43.4). 
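
The CGP decoding step can be sketched as below. The genome used here is a hypothetical one chosen for this illustration, not the genome of Fig. 43.4, and the function table and the name decode_and_run are assumptions of the sketch. Each block holds two input indices followed by a function index.

```python
import operator

# Assumed function table: gene value -> function.
FUNCS = [operator.add, operator.sub, operator.mul, operator.truediv]

def decode_and_run(genome, inputs):
    """Evaluate every node of a CGP genome and return all node outputs."""
    values = list(inputs)                  # nodes 0..k-1 hold the inputs
    for i in range(0, len(genome), 3):
        in_a, in_b, func = genome[i:i + 3]
        # A node reads only outputs computed earlier, so the graph is acyclic.
        values.append(FUNCS[func](values[in_a], values[in_b]))
    return values

# Hypothetical genome: node 3 = x * y, node 4 = z + y, node 5 = node3 - node4.
genome = [0, 1, 2,   2, 1, 0,   3, 4, 1]
outputs = decode_and_run(genome, inputs=(2.0, 3.0, 4.0))   # x, y, z
print(outputs[3:])   # [6.0, 7.0, -1.0]
```

Since the genome is a flat integer array, GA-style point mutation and crossover can be applied to it directly, provided that input indices continue to refer to earlier nodes.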

Neuro-evolution of augmenting topologies 
(NEAT) [43.71] again allows the topology to vary, 
and allows nodes to be labelled by the functions they 
perform, but in this case each node does perform 
a weighted sum of its inputs. Each of these represen- 
tations uses different operators. For example, CGP 
uses simple array-oriented (GA-style) initialization, 
crossover, and mutation operators (subject to some 
customizations). 


Finite-State Machine Representations 

Some GP representations use graphs in a different way: 
the model of computation is the finite-state machine 
rather than the executable functional graph (Fig. 43.5). 
The original incarnation of evolutionary programming 
(EP) [43.72] is an example. In a typical implementa- 
tion [43.72], five types of mutation are used: adding 
and deleting states, changing the initial state, changing 
the output symbol attached to edges, and changing the 
edges themselves. In this implementation, crossover is 
not used. 
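
A finite-state machine genome and some of the mutations listed above can be sketched as follows. The dictionary layout, the parity example, and the function names are assumptions made for this illustration; adding and deleting states is omitted to keep the sketch short.

```python
import random

def run_fsm(fsm, inputs):
    """Run a machine with outputs on its edges; return the output symbols."""
    state, outputs = fsm["initial"], []
    for symbol in inputs:
        state, out = fsm["edges"][(state, symbol)]
        outputs.append(out)
    return outputs

def mutate_fsm(rng, fsm, output_symbols=(0, 1)):
    """Apply one mutation: change the initial state, an output, or an edge."""
    states = sorted({s for s, _ in fsm["edges"]})
    child = {"initial": fsm["initial"], "edges": dict(fsm["edges"])}
    kind = rng.choice(["initial", "output", "edge"])
    if kind == "initial":
        child["initial"] = rng.choice(states)
    else:
        key = rng.choice(sorted(child["edges"]))
        target, out = child["edges"][key]
        if kind == "output":
            child["edges"][key] = (target, rng.choice(output_symbols))
        else:
            child["edges"][key] = (rng.choice(states), out)
    return child

# Example: the output after each input bit is the parity of the bits so far.
parity = {
    "initial": "even",
    "edges": {
        ("even", 0): ("even", 0), ("even", 1): ("odd", 1),
        ("odd", 0): ("odd", 1),   ("odd", 1): ("even", 0),
    },
}
print(run_fsm(parity, [1, 1, 0, 1]))   # [1, 0, 0, 1]
child = mutate_fsm(random.Random(0), parity)
```

Every mutation keeps the set of (state, input) keys intact, so the child is always a complete, executable machine.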


Fig. 43.4 Cartesian GP. An integer-array genome is divided
into blocks: in each block the last integer specifies
a function (top-left; the genome [012 210 102 231 040 353]
is read as [01* 21+ 10* 23- 04+ 35/]). Then one node is
created for each input variable (x, y, z) and for each genome
block. Nodes are arranged in a grid and outputs are indexed
sequentially (bottom-left). The first elements in each block specify
the indices of the incoming links. The final graph is created
by connecting each node input to the node output
with the same integer label (right). Dataflow in the graph
is bottom to top. Multiple outputs can be read from the
topmost layer of nodes. In this example node 6 outputs
xy - z + y, node 7 outputs x + z + y, and node 8 outputs
xy/xy


Grammatical GP 
In grammatical GP [43.73], the context-free grammar 
(CFG) is the defining component of the representation. 
In the most common approach, search takes place in 
the space defined by a fixed nondeterministic CFG. The 
aim is to find a good program in that space. Often the 
CFG defines a useful subset of a programming language 
such as Lisp, C, or Python. Programs derived from the 
CFG can then be compiled or interpreted using either 
standard or special-purpose software. There are sev- 
eral advantages to using a CFG. It allows convenient 
definition of multiple data-types which are automati- 
cally respected by the crossover and mutation operators. 
It can introduce domain knowledge into the problem 
representation. For example, if it is known that good 
programs will consist of a conditional statement in- 
side a loop, it is easy to express this knowledge using 
a grammar. The grammar can restrict the ways in which 
program expressions are combined, for example mak- 
ing the system aware of physical units in dimensionally 
aware GP [43.74,75]. A grammatical GP system can 
conveniently be applied to new domains, or can incor- 
porate new domain knowledge, through updates to the 
grammar rather than large-scale reprogramming. 

In one early system [43.76], the derivation tree is 
used as the genome: initial individuals’ genomes are 
randomly generated according to the rules of the gram- 
mar. Mutation works by randomly generating a new 
subtree starting from a randomly chosen internal node 
in the derivation tree. Crossover is constrained to ex- 
change subtrees whose roots are identical. In this way, 
new individuals are guaranteed to be valid derivation 
trees. The executable program is then created from the 
genome by reading the leaves left to right. A later system,
grammatical evolution (GE) [43.77], instead uses
an integer-array genome. Initialization, mutation and
crossover are defined as simple GA-style array operations.
The genome is mapped to an output program by
using the successive integers of the genome to choose
among the applicable production choices at each step
of the derivation process. Figure 43.6 shows a simple
grammar, integer genome, derivation process, and
derivation tree. At each step of the derivation process,
the left-most nonterminal in the derivation is rewritten.
The next integer gene is used to determine, using the
mod rule (the gene value modulo the number of applicable
productions), which of the possible productions is chosen.
The output program is the final step of the derivation
process.

Fig. 43.5 EP representation: finite-state machine. In this
example, a mutation changes a state transition
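
The GE genome-to-program mapping can be sketched as follows. The grammar, the genome, and the name ge_map are hypothetical and are not those of Fig. 43.6; the sketch also simply reports failure when the genome runs out of genes, which is a simplifying assumption.

```python
# Hypothetical grammar: each nonterminal maps to its list of productions.
GRAMMAR = {
    "<expr>": [["<expr>", "<op>", "<expr>"], ["<var>"]],
    "<op>": [["+"], ["*"]],
    "<var>": [["x"], ["y"]],
}

def ge_map(genome, grammar, start="<expr>"):
    """Map an integer genome to a program by leftmost derivation."""
    derivation = [start]
    used = 0
    while any(symbol in grammar for symbol in derivation):
        if used >= len(genome):
            return None                       # genome exhausted: mapping fails
        # Rewrite the left-most nonterminal.
        idx = next(i for i, s in enumerate(derivation) if s in grammar)
        choices = grammar[derivation[idx]]
        # Mod rule: the gene selects among the applicable productions.
        production = choices[genome[used] % len(choices)]
        derivation[idx:idx + 1] = production
        used += 1
    return " ".join(derivation)

print(ge_map([4, 7, 10, 3, 9, 5], GRAMMAR))   # x * y
```

In this example the genes select, in order, <expr> -> <expr> <op> <expr>, <expr> -> <var>, <var> -> x, <op> -> *, <expr> -> <var>, and <var> -> y.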

Although successful and widely used, GE has also 
been criticized for the disruptive effects of its operators 
with respect to preserving the modular functionality 
of parents. Another system, tree adjoining grammar- 
guided genetic programming (TAG3P) has also been 
used successfully [43.78]. Instead of a string-rewriting 
CFG, TAG3P uses the tree-rewriting tree adjoining 
grammars. The representation has the advantage, rel- 
ative to GE, that individuals are valid programs at 
every step of the derivation process. TAGs also have 
some context-sensitive properties [43.78]. However, it 
is a more complex representation. 

Another common alternative approach, surveyed 
by Shan et al. [43.79], uses probabilistic models over 
grammar-defined spaces, rather than direct evolutionary 
search. 


Fig. 43.6 GE representation. The grammar (top-left)
consists of several rules. The genome (center-left) is
a variable-length list of integers. At each step of the derivation
process (bottom-left), the left-most nonterminal is
rewritten as specified by a gene. The resulting derivation
tree is shown on the right: reading just the leaves gives the
derived program


Linear GP
In Linear GP (LGP), the program is a list of instructions
to be interpreted sequentially. In order to achieve complex
functionality, a set of registers acting as state or
memory are used. Instructions can read from or write to
the registers. Several registers, which may be read-only,
are initialized with the values of the input variables. One
register is designated as the output: its value at the end
of the program is taken as the result of the program.
Since a register can be read multiple times after writing,
an LGP program can be seen as having a graph
structure. A typical implementation is that of [43.80].
It uses instructions of three registers each, which typically
calculate a new value as an arithmetic function of
some registers and/or constants, and assign it to a register
(Fig. 43.7).

It also allows conditional statements and looping. It 
explicitly recognizes the possibility of nonfunctioning 
code, or introns. Since there are no syntactic constraints 
on how multiple instructions may be composed to- 
gether, initialization can be as simple as the random 
generation of a list of valid instructions. Mutation can 
change a single instruction to a newly generated instruc- 
tion, or change just a single element of an instruction. 
Crossover can be performed over the two parents’ list 
structures, respecting instruction boundaries. 
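
A register-machine interpreter of this kind can be sketched as below. The instruction format (destination, operator, two operands), the register names, and the three-instruction program are assumptions of this sketch; the program is one possible way to compute 4(x0 + x1)^2 and is not claimed to be the listing of Fig. 43.7.

```python
import operator
import random

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def run_lgp(program, x0, x1):
    """Interpret (dest, op, a, b) instructions sequentially."""
    regs = {"x0": x0, "x1": x1, "r0": 0.0, "r1": 0.0}
    for dest, op, a, b in program:
        if dest.startswith("x"):
            continue                      # input registers are read-only
        # Operands are register names or numeric constants.
        va = regs[a] if isinstance(a, str) else a
        vb = regs[b] if isinstance(b, str) else b
        regs[dest] = OPS[op](va, vb)
    return regs["r0"]                     # r0 is the designated output

def crossover(rng, mum, dad):
    """Join a head of mum to a tail of dad, cutting between instructions."""
    cut_mum = rng.randrange(len(mum) + 1)
    cut_dad = rng.randrange(len(dad) + 1)
    return mum[:cut_mum] + dad[cut_dad:]

program = [
    ("r0", "+", "x0", "x1"),   # r0 = x0 + x1
    ("r1", "*", "r0", "r0"),   # r1 = r0 * r0
    ("r0", "*", "r1", 4),      # r0 = r1 * 4
]
print(run_lgp(program, 1.0, 2.0))   # 4 * (1 + 2)**2 = 36.0
```

Any list of well-formed instructions is executable, so the crossover above needs no repair step. Instructions whose results never reach r0 are the introns mentioned in the text.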


Fig. 43.7 Linear GP representation. This implementation
has four registers in total: x0 and x1 (read-only) and
r0 and r1 (read-write) (top). The representation is a list
of register-oriented instructions (bottom). In this example
program of three instructions, r0 is the output register, and
the formula 4(x0 + x1)^2 is calculated


Stack-Based GP
A variant of linear GP avoids the need for registers
by adding a stack. The program is again a list of instructions,
each now represented by a single label. In
a simple arithmetic implementation, the label may be
one of the input variables (x_i), a numerical constant, or
a function (*, +, etc.). If it is a variable or constant,
the instruction is executed by pushing the value onto
the stack. If a function, it is executed by popping the
required number of operands from the stack, executing
the function on them, and pushing the result back
on. The result of the program is the value at the top of
the stack after all instructions have been executed. With
the stipulation that stack-popping instructions become
no-ops when the stack is empty, one can again implement
initialization, mutation, and crossover as simple
list-based operations [43.81]. One can also constrain the
operations to work on what are effectively subtrees, so
that stack-based GP becomes effectively equivalent to
a reverse Polish notation implementation of standard
GP [43.82]. A more sophisticated type of stack-based
GP is PushGP [43.83], in which multiple stacks are
used. Each stack is used for values of a different type,
such as integer, boolean, and float. When a function
requires multiple operands of different types, they are
taken as required from the appropriate stacks. With the
addition of an exec stack which stores the program
code itself, and the code stack which stores items of
code, both of which may be both read and written,
PushGP gains the ability to evolve programs with self-modification,
modularity, control structures, and even
self-reproduction.
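
The simple single-stack scheme can be sketched as follows. This illustrates only the arithmetic variant and not PushGP; the name run_stack and the function set are assumptions of the sketch, and a function is treated as a no-op whenever fewer than two operands are available, a slight generalization of the empty-stack rule stated above.

```python
import operator

FUNCS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def run_stack(program, env):
    """Execute a list of labels on a single value stack."""
    stack = []
    for label in program:
        if label in FUNCS:
            if len(stack) < 2:
                continue                   # too few operands: no-op
            b, a = stack.pop(), stack.pop()
            stack.append(FUNCS[label](a, b))
        elif isinstance(label, str):
            stack.append(env[label])       # input variable
        else:
            stack.append(label)            # numerical constant
    return stack[-1] if stack else None

env = {"x": 3, "y": 4}
# Reverse Polish form of (x * y) - (x + 2).
print(run_stack(["x", "y", "*", "x", 2, "+", "-"], env))   # 7
# An arbitrary list of labels still executes thanks to the no-op rule.
print(run_stack(["+", "x", "*", 2, "*"], env))             # 6
```

Because every list of labels is executable, list-based initialization, mutation, and crossover need no syntactic checks.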


Low-Level Programming 

Finally, several authors have evolved programs di- 
rectly in real-world low-level programming languages. 
Schulte et al. [43.84] automatically repaired programs 
written in Java byte code and in x86 assembly. Orlov 
and Sipper [43.85] evolved programs such as trail nav- 
igation and image classification de novo in Java byte 
code. This work made use of a specialized crossover 
operator which performed automated checks for com- 
patibility of the parent programs’ stack and control flow 
state. Nordin [43.86] proposed a machine-code repre- 
sentation for GP. Programs consist of lists of low-level 
register-oriented instructions which execute directly, 
rather than in a virtual machine or interpreter. The re- 
sult is a massive speed-up in execution. 


43.3.4 Population Models 


It is also useful to classify GP methods according 
to their population models. In general the population 
model and the representation can vary independently, 
and in fact all of the following population models can be
applied with any EC representation including bitstrings
and real-valued vectors, as well as with GP represen- 
tations. 

The simplest possible model, hill-climbing, uses 
just one individual at a time [43.87]. At each iteration, 
offspring are created until one of them is fitter
than the current individual, which it then replaces. If at 
any iteration it becomes impossible to find an improve- 


ment, the algorithm has climbed the hill, i.e. reached 
a local optimum, and stops. It is common to use a ran- 
dom restart in this case. The hill-climbing model can 
be used in combination with any representation. Note 
that it does not use crossover. Variants include ES-style 
(μ, λ) or (μ + λ) schemes, in which multiple parents
each give rise to multiple offspring by mutation. 
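
Hill-climbing with random restarts can be sketched as below. The function is independent of the representation: it is handed a fitness function, a random-solution generator, and a mutation operator. The names, the bitstring toy problem, and the limits max_tries and restarts are assumptions of this sketch; since it is rarely possible to prove that no improving offspring exists, the sketch treats max_tries consecutive failures as having reached a local optimum.

```python
import random

def hill_climb(fitness, random_solution, mutate, rng,
               max_tries=50, restarts=5):
    """Hill-climbing (fitness is maximized) with random restarts."""
    best = None
    for _ in range(restarts):
        current = random_solution(rng)
        improved = True
        while improved:
            improved = False
            for _ in range(max_tries):
                child = mutate(rng, current)
                if fitness(child) > fitness(current):
                    current, improved = child, True   # child replaces parent
                    break
            # If no child improved, the loop ends: assumed local optimum.
        if best is None or fitness(current) > fitness(best):
            best = current
    return best

def random_bits(rng):
    return [rng.randint(0, 1) for _ in range(16)]

def flip_one(rng, bits):
    i = rng.randrange(len(bits))
    return bits[:i] + [1 - bits[i]] + bits[i + 1:]

# Toy problem: maximize the number of ones in a 16-bit string.
best = hill_climb(sum, random_bits, flip_one, random.Random(0))
print(sum(best), "ones out of", len(best))
```

Replacing random_bits and flip_one with a GP initializer and a GP mutation operator turns the same loop into a hill-climbing GP system.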

The most common model is an evolving popula- 
tion. Here a large number of individuals (from tens to 
many thousands) exist in parallel, with new genera- 
tions being created by crossover and mutation among 
selected individuals. Variants include the steady-state 
and the generational models. They differ only in that 
the steady-state model generates one or a few new indi- 
viduals at a time, adds them to the existing population 
and removes some old or weak individuals; whereas the 
generational model generates an entirely new popula- 
tion all at once and discards the old one. 

The island model is a further addition, in which 
multiple populations all evolve in parallel, with infre- 
quent migration between them [43.88]. 

In coevolutionary models, the fitness of an individ- 
ual cannot be calculated in an endogenous way. Instead 
it depends on the individual’s relationship to other in- 
dividuals in the population. A typical example is in 
game-playing applications such as checkers, where the 
best way to evaluate an individual is to allow it to play 
against other individuals. Coevolution can also use fit- 
ness defined in terms of an individual’s relationship to 
individuals in a population of a different type. A good 
example is the work of [43.89], which uses a type of 
predator-prey relationship between populations of pro- 
grams and populations of test cases. The test cases 
(predators) evolve to find bugs in the programs; the pro- 
grams (prey) evolve to fix the bugs being tested for by 
the test suites. 

Another group of highly biologically inspired pop- 
ulation models are those of swarm intelligence. Here 
the primary method of learning is not the creation of 
new individuals by inheritance. Instead, each individ- 
ual generally lives for the length of the run, but moves 
about in the search space with reference to other indi- 
viduals and their current fitness values. For example, in 
particle swarm optimization (PSO) individuals tend to 
move toward the global best and toward the best point in 
their own history, but tend to avoid moving too close to 
other individuals. Although PSO and related methods 
such as differential evolution (DE) are best applied in 
real-valued optimization, their population models and 
operators can be abstracted and applied in GP methods 
also [43.90, 91]. 
Finally, we come to estimation of distribution algorithms
(EDAs). Here the idea is to create a population,
select a subsample of the best individuals, model that
subsample using a distribution, and then create a new
population by sampling the distribution. This approach
is particularly common in grammar-based GP [43.73],
though it is also used with other representations [43.92-94].
The modeling-sampling process could be regarded
as a type of whole-population crossover. Alternatively
one can view EDAs as being quite far from the biological
inspiration of most EC, and in a sense they bridge
the gap between EC and statistical ML.


43.4 Uses of GP


Our introduction (Sect. 43.1) has touched on a wide array
of domains in which GP has been applied. In this
section, we give more detail on just a few of these.


43.4.1 Symbolic Regression


Symbolic regression is one of the most common tasks
for which GP is used [43.19, 95, 96]. It is used as a component
in techniques like data modeling, clustering, and
classification, for example in the modeling application
outlined in Sect. 43.4.2. It is named after techniques
such as linear or quadratic regression, and can be seen
as a generalization of them. Unlike those techniques
it does not require a priori specification of the model.
The goal is to find a function in symbolic form which
models a data set. A typical symbolic regression is implemented
as follows.

It begins with a dataset which is to be regressed,
in the form of a numerical matrix (Fig. 43.8, left).
Each row i is a data-point consisting of some input (explanatory)
variables x_i and an output (response) variable
y_i to be modeled. The goal is to produce a function
f(x) which models the relationship between x and y as
closely as possible. Figure 43.8 (right) plots the existing
data and one possible function f.

Typically StdGP is used, with a numerical language
which includes arithmetic operators, functions like sinusoids
and exponentials, numerical constants, and the
input variables of the dataset. The internal nodes of each
StdGP abstract syntax tree will be operators and functions,
and the leaf nodes will be constants and variables.

To calculate the fitness of each model, the explanatory
variables of the model are bound to their values
at each of the training points x_i in turn. The model is
executed, and the output f(x_i) is the model’s predicted
response. This value ŷ_i is then compared to the response
of the training point y_i. The error can be visualized as
the dotted lines in Fig. 43.8 (right). Fitness is usually
defined as the root-mean-square error of the model’s
outputs versus the training data. In this formulation,
therefore, fitness is to be minimized:

\[
\mathrm{fitness}(f) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( f(x_i) - y_i \right)^2}
\]

Over the course of evolution, the population moves to- 
ward better and better models f of the training data. 
After the run, a testing data set is used to confirm that 
the model is capable of generalization to unseen data. 
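
The fitness calculation described in this section can be sketched as follows. The nested-tuple tree encoding, the `FUNCTIONS` table, and the variable naming scheme (`"x0"`, `"x1"`, ...) are assumptions made for this illustration, not the representation of any particular GP system.

```python
import math
import operator

# Hypothetical function set: internal nodes of the tree.
FUNCTIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "sin": math.sin,
}


def evaluate(tree, x):
    """Evaluate a tree on one data point x (a tuple of input variables)."""
    if isinstance(tree, tuple):          # internal node: (operator, child, ...)
        op, *args = tree
        return FUNCTIONS[op](*(evaluate(a, x) for a in args))
    if isinstance(tree, str):            # leaf: variable such as "x0"
        return x[int(tree[1:])]
    return float(tree)                   # leaf: numerical constant


def rmse_fitness(tree, X, y):
    """Root-mean-square error of the model over the training data (minimized)."""
    n = len(X)
    squared = sum((evaluate(tree, xi) - yi) ** 2 for xi, yi in zip(X, y))
    return math.sqrt(squared / n)


# Toy training data generated by y = 2 * x0 + 1.
X = [(0.0,), (1.0,), (2.0,), (3.0,)]
y = [1.0, 3.0, 5.0, 7.0]

exact = ("+", ("*", 2.0, "x0"), 1.0)     # matches the data exactly
rough = ("*", 2.0, "x0")                 # misses the constant offset

exact_error = rmse_fitness(exact, X, y)
rough_error = rmse_fitness(rough, X, y)
```

Here the `rough` model is off by exactly 1 at every training point, so its error is larger than that of `exact`, and selection would favor `exact`.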


43.4.2 Machine Learning 


Like other ML methods, GP is successful in quantita- 
tive domains where data is available for learning and 
both approximate solutions and incremental improve- 
ments are valued. In modeling or supervised learning, 
GP is preferable to other ML methods in circumstances 
where the form of the solution model is unknown a pri- 
ori because it is capable of searching among possible 
forms for the model. Symbolic regression can be used 
as an approach to classification, regression modeling, 
and clustering. It can also be used to automatically 
extract influential features, since it is able to pare 
down the feature set it is given at initialization.
GP-derived classifiers have been integrated into ensemble
learning approaches and GP has been used in
reinforcement learning (RL) contexts. Figure 43.9 shows
GP as a means of ML which allows it to address problems
such as planning, forecasting, pattern recognition, and
modeling.

Fig. 43.8 Symbolic regression: a matrix of data (left) is
to be modeled by a function. It is plotted as dots in the
figure on the right. A candidate function f (solid line) can
be plotted, and its errors (dotted lines) can be visualized

For the sensory evaluation problem described 
in [43.97], the authors use GP as the anchor of a ML 
framework (Fig. 43.10). A panel of assessors provides 
liking scores for many different flavors. Each flavor 
consists of a mixture of ingredients in different pro- 
portions. The goals are to discover the dependency of 
a liking score on the concentration levels of flavors’ 
ingredients, identifying ingredients that drive liking, 
segmenting the panel into groups with similar liking 
preferences and optimizing flavors to maximize liking 
per group. The framework uses symbolic regression and 
ensemble methods to generate multiple diverse expla- 
nations of liking scores, with confidence information. It 
uses statistical techniques to extrapolate from the genet- 
ically evolved model ensembles to unobserved regions 
of the flavor space. It also segments the assessors into 
groups which either have the same propensity to like 
flavors, or whose liking is driven by the same ingredi- 
ents. 

Fig. 43.9 GP as a component in ML. Symbolic regression can be
used as an approach to many ML tasks, and integrated with other
ML techniques

Sensory evaluation data is very sparse and there
is large variation among the responses of different
assessors. A Pareto-GP algorithm (which uses
multiobjective techniques to maximize model accuracy and
minimize model complexity [43.98]) was therefore
used to evolve an ensemble of models for each assessor
and to use this ensemble as a source of robust variable
importance estimation. The frequency of variable
occurrences in the models of the ensemble was inter- 
preted as information about the ingredients that drive 
the liking of an assessor. Model ensembles with the 
same dominance of variable occurrences, and which 
demonstrate similar effects when the important vari- 
ables are varied, were grouped together to identify 
assessors who are driven by the same ingredient set and 
in the same direction. Varying the input values of the 
important variables, while using the model ensembles 
of these panel segments, provided a means of conduct- 
ing focused sensitivity analysis. Subsequently, the same 
model ensembles when clustered constitute the black 
box which is used by an evolutionary algorithm in its 
optimization of flavors that are well liked by assessors 
who are driven by the same ingredient. 
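
The idea of reading variable importance from the frequency of variable occurrences in an ensemble can be sketched as below. This is only an illustration under assumptions: the nested-tuple model encoding, the function names, and the ingredient names are hypothetical, and the cited framework may count occurrences differently (for example, weighting by model quality).

```python
from collections import Counter


def variables_in(tree):
    """Return the set of variable names used anywhere in a model tree."""
    if isinstance(tree, tuple):              # (operator, child, ...)
        found = set()
        for child in tree[1:]:
            found |= variables_in(child)
        return found
    return {tree} if isinstance(tree, str) else set()


def variable_importance(ensemble):
    """Fraction of ensemble models in which each variable appears."""
    counts = Counter()
    for model in ensemble:
        counts.update(variables_in(model))
    return {var: counts[var] / len(ensemble) for var in counts}


# Hypothetical ensemble evolved for a single assessor.
ensemble = [
    ("+", "sugar", ("*", 0.5, "vanilla")),
    ("*", "sugar", "sugar"),
    ("-", ("*", 2.0, "sugar"), "citrus"),
    ("+", "vanilla", 1.0),
]

importance = variable_importance(ensemble)
```

In this toy ensemble `sugar` appears in three of the four models, so it would be read as the ingredient most likely to drive this assessor's liking.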


43.4.3 Software Engineering 


At least three areas of software engineering have 
been tackled with remarkable success by GP: bug- 
fixing [43.99], parallelization [43.100, 101], and op- 
timization [43.102-104]. These three areas are very 
different in their aims, scope, and methods; however, 
they all need to deal with two key problems in this do- 
main: the very large and unconstrained search space, 
and the problem of program correctness. They therefore 
do not aim to evolve new functionality from scratch, but 
instead use existing code as material to be transformed 
in some way; and they either guarantee correctness of 
the evolved programs as a result of their representa- 
tions, or take advantage of existing test suites in order 
to provide strong evidence of correctness. 

Le Goues et al. [43.99] show that automatically fix- 
ing software bugs is a problem within the reach of GP. 
They describe a system called GenProg. It operates 
on C source code taken from open-source projects. It 
works by forming an abstract syntax tree from the orig- 
inal source code. The initial population is seeded with 
variations of the original. Mutations and crossover are 
constrained to copy or delete complete lines of code, 
rather than editing subexpressions, and they are con- 
strained to alter only lines which are exercised by the 
failing test cases. This helps to reduce the search space 
size. The original test suites are used to give confi- 
dence that the program variations have not lost their 
original functionality. Fixes for several real-world bugs 
are produced, quickly and with high certainty of suc- 
cess, including bugs in HTTP servers, Unix utilities, 
and a media player. The fixes can be automatically
processed to produce minimal patches. Best of all, the
fixes are demonstrated to be rather robust: some even
generalize to fixing related bugs which were not
explicitly encoded in the test suite.

Fig. 43.10 GP symbolic regression is unique and useful as an ML technique because it obviates the need to define the
structure of a model prior to training. Here, it is used to form a personalized ensemble model for each assessor in a flavor
evaluation panel

Ryan [43.100] describes a system, Paragen, which 
automatically rewrites serial Fortran programs to par- 
allel versions. In Paragen I, the programs are directly 
varied by the genetic operators, and automated tests 
are used to reward the preservation of the program’s 
original semantics. The work of Williams [43.101] was 
in some ways similar to Paragen I. In Paragen II, 
correctness of the new programs is instead guaran- 
teed, using a different approach. The programs to 
be evolved are sequences of transformations defined 
over the original serial code. Each transformation is 
known to preserve semantics. Some transformations 
however directly transform serial operations to paral- 
lel, while other transformations merely enable the first 
type. 

A third goal of software engineering is optimization 
of existing code. White et al. [43.104] tackle this task 
using a multiobjective optimization method. Again, an 
existing program is used as a starting point, and the 
aim is to evolve a semantically equivalent one with im- 
proved characteristics, such as reduced memory usage, 
execution time, or power consumption. The system is
capable of finding nonobvious optimizations, i.e., ones
which cannot be found by optimizing compilers. A
population of test cases is coevolved with the population of
programs. Stephenson et al. [43.102, 103] in the Meta 
Optimization project improve program execution speed 
by using GP to refine priority functions within the 
compiler. The compiler generates better code which ex- 
ecutes faster across the input range of one program and 
across the program range of a benchmark set. 

A survey of the broader field of search-based soft- 
ware engineering is given by Harman [43.105]. 


43.4.4 Design 


GP has been successfully used in several areas of de- 
sign. This includes both engineering design, where the 
aim is to design some hardware or software system 
to carry out a well-defined task, and aesthetic design, 
where the aim is to produce art objects with subjective 
qualities. 


Engineering Design 
One of the first examples of GP design was the synthe- 
sis of analog electrical circuits by Koza et al. [43.106]. 
This work addressed the problem of automatically
creating circuits to perform tasks such as a filter or an
amplifier. Eight types of circuit were automatically 
created, each having certain requirements, such as out- 
putting an amplified copy of the input, and low dis- 
tortion. These functions were used to define fitness. 
A complex GP representation was used, with both 
STGP (Sect. 43.3.3) and ADFs (Sect. 43.5.3). Exe- 
cution of the evolved program began with a trivial 
embryonic circuit. GP program nodes, when executed, 
performed actions such as altering the circuit topol- 
ogy or creating a new component. These nodes were 
parameterized with numerical parameters, also under 
GP control, which could be created by more typical 
arithmetic GP subtrees. The evolved circuits solved 
significant problems to a human-competitive standard 
though they were not fabricated. 

Another significant success story was the space- 
going antenna evolved by Hornby et al. [43.27] for the 
NASA (National Aeronautics and Space Administra- 
tion) Space Technology 5 spacecraft. The task was to 
design an antenna with certain beamwidth and band- 
width requirements, which could be tested in simulation 
(thus providing a natural fitness function). GP was used 
to reduce reliance on human labor and limitations on 
complexity, and to explore areas of the search space 
which would be rejected as not worthy of exploration 
by human designers. Both a GA and a GP representa- 
tion were used, producing quite similar results. The GP 
representation was in some ways similar to a 3-D turtle 
graphics system. Commands included forward which 
moved the turtle forward, creating a wire component, 
and rotate-x which changed orientation. Branching of 
the antenna arms was allowed with special markers 
similar to those used in turtle graphics programs. The 
program composed of these primitives, when run, cre- 
ated a wire structure, which was rotated and copied four 
times to produce a symmetric result for simulation and 
evaluation. 
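
A turtle-style representation of this kind can be sketched as follows. This is a simplified illustration, not the representation of [43.27]: the command encoding, the choice of rotating about the global x axis, the starting heading, and the use of quarter turns about the z axis for the four copies are all assumptions, and branching is not modeled.

```python
import math


def build_arm(program):
    """Run a list of (command, value) pairs and return the wire segments created."""
    pos = (0.0, 0.0, 0.0)
    heading = (0.0, 0.0, 1.0)            # assumed initial direction: straight up
    wires = []
    for command, value in program:
        if command == "forward":         # move and lay down a wire of this length
            new_pos = tuple(p + value * h for p, h in zip(pos, heading))
            wires.append((pos, new_pos))
            pos = new_pos
        elif command == "rotate-x":      # change orientation by `value` degrees
            a = math.radians(value)
            hx, hy, hz = heading
            heading = (hx,
                       hy * math.cos(a) - hz * math.sin(a),
                       hy * math.sin(a) + hz * math.cos(a))
        else:
            raise ValueError("unknown command: " + command)
    return wires


def rotate_z_quarter(point, turns):
    """Rotate a point by `turns` quarter turns about the z axis."""
    x, y, z = point
    for _ in range(turns % 4):
        x, y = -y, x
    return (x, y, z)


def build_antenna(program):
    """Copy the arm four times, rotated about z, to give a symmetric structure."""
    arm = build_arm(program)
    return [(rotate_z_quarter(a, t), rotate_z_quarter(b, t))
            for t in range(4) for a, b in arm]


program = [("forward", 1.0), ("rotate-x", 90.0), ("forward", 2.0)]
antenna = build_antenna(program)
```

The resulting list of wire segments is what a simulator would receive in order to compute a fitness value.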


Aesthetic Design 

There have also been successes in the fields of graphical 
art, 3-D aesthetic design, and music. Given the aesthetic 
nature of these fields, GP fitness is often replaced by 
an interactive approach where the user performs direct 
selection on the population. This approach dates back 
to Dawkins’ seminal Biomorphs [43.107] and has been 
used in other forms of EC also [43.108]. Early suc- 
cesses were those of Todd and Latham [43.109], who 
created pseudo-organic forms, and Sims [43.35] who 
created abstract art. An overview of evolutionary art is 
provided by Lewis [43.110]. 


A key aim throughout aesthetic design is to avoid 
the many random-seeming designs which tend to be 
created by typical representations. For example, a naive 
representation for music might encode each quarter- 
note as an integer in a genome whose length is the 
length of the eventual piece. Such a representation will 
be capable of representing some good pieces of music, 
but it will have several significant problems. The vast 
majority of pieces will be very poor and random sound- 
ing. Small mutations will tend to gradually degrade 
pieces, rather than causing large-scale and semantically 
sensible transformations [43.111]. 

As a result, many authors have tried to use rep- 
resentations which take advantage of forms of reuse. 
Although reuse is also an aim in nonaesthetic GP 
(Sect. 43.5.3), the hypothesis that good solutions will 
tend to involve reuse, even on new, unknown problems, 
is more easily motivated in the context of aesthetic de- 
sign. 

In one strand of research, the time or space to be 
occupied by the work is predefined, and divided into 
a grid of 1, 2, or 3 dimensions. A GP function of 1, 2 or 
3 arguments is then evolved, and applied to each point 
in the grid with the coordinates of the point passed as 
arguments to the function. The result is that the func- 
tion is reused many times, and all parts of the work 
are felt to be coherent. The earliest example of such 
work was that of Sims [43.35], who created fascinat- 
ing graphical art (a 2-D grid) and some animations 
(a 3-D grid of two spatial dimensions and 1 time di- 
mension). The paradigm was later brought to a high 
degree of artistry by Hart [43.112]. The same gener- 
ative idea, now with a 1-D grid representing time, was 
used by Hoover et al. [43.113], Shao et al. [43.114] and 
McDermott and O’Reilly [43.115] to produce music as 
a function of time, and with a 3-D grid by Clune and 
Lipson [43.116] to produce 3-D sculptures. 
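
The grid-based paradigm can be sketched for the 2-D case as follows. The hand-written `candidate` function stands in for an evolved GP expression, and the coordinate range and the tanh squashing to gray levels are assumptions made for this illustration only.

```python
import math


def render(expr, width, height):
    """Apply one function to every grid point; return rows of gray levels 0..255."""
    image = []
    for row in range(height):
        y = row / (height - 1) * 2.0 - 1.0       # map row index to [-1, 1]
        line = []
        for col in range(width):
            x = col / (width - 1) * 2.0 - 1.0    # map column index to [-1, 1]
            value = expr(x, y)
            # Squash an unbounded output into the displayable range.
            line.append(int(round(127.5 * (1.0 + math.tanh(value)))))
        image.append(line)
    return image


def candidate(x, y):
    """Stand-in for an evolved expression of two arguments."""
    return math.sin(6.0 * x) * math.cos(6.0 * y) + x * y


image = render(candidate, 32, 16)
```

Because the same expression is reused at every point, a small change to it alters the whole image in a consistent way, which is the source of the coherence noted above.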

Other successful work has used different ap- 
proaches to reuse. L-systems are grammars in which 
symbols are recursively expanded in parallel: after
several expansions (a growth process), the string will be
highly patterned, with multiple copies of some
substrings. Interpreting this string as a program can then
yield highly patterned graphics [43.117], artificial crea- 
tures [43.118], and music [43.119]. Grammars have 
also been used in 3-D and architectural design, both 
in a modified L-system form [43.36] and in the stan- 
dard GE form [43.120]. The Ossia system of Dahlst- 
edt [43.37] uses GP trees with recursive pointers to 
impose reuse and a natural, gestural quality on short 
pieces of art music. 
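
The parallel expansion performed by an L-system can be sketched in a few lines. The rule set shown is Lindenmayer's well-known algae system, used here purely as an illustration; it is not taken from the cited design work, and the function name `expand` is an assumption.

```python
def expand(axiom, rules, steps):
    """Rewrite every symbol of the string in parallel, `steps` times."""
    s = axiom
    for _ in range(steps):
        # Symbols without a rule are copied unchanged.
        s = "".join(rules.get(ch, ch) for ch in s)
    return s


rules = {"A": "AB", "B": "A"}

generations = [expand("A", rules, k) for k in range(6)]
```

After a few steps the string already contains repeated substrings such as `ABA`, which is the kind of reuse that an interpreter can turn into patterned graphics or music.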
Many research topics of interest to GP practitioners
are also of broader interest. For example, the self- 
adaptation of algorithm parameters is a topic of interest 
throughout EC. We have chosen to focus on four re- 
search topics of specific interest in GP: bloat, GP 
theory, modularity, and open-ended evolution. 


43.5.1 Bloat 


Most GP-type problems naturally require variable- 
length representations. It might be expected that se- 
lection pressure would effectively guide the popula- 
tion toward program sizes appropriate to the problem, 
and indeed this is sometimes the case. However, it 
has been observed that for many different represen- 
tations [43.121] and problems, programs grow over 
time without apparent fitness improvements. This phe- 
nomenon is called bloat. Since the time complexity for 
the evaluation of a GP program is generally propor- 
tional to its size, this greatly slows the GP run down. 
There are also other drawbacks. The eventual solution
may be so large and complex that it is unreadable,
negating a key advantage of symbolic methods like GP.
Overly large programs tend to generalize less well than 
parsimonious ones. Bloat may negatively impact the 
rate of fitness improvement. Since bloat is a significant 
obstacle to successful GP, it is an important topic of re- 
search, with differing viewpoints both on the causes of 
bloat and the best solutions. 

The competing theories of the causes of bloat are 
summarized by Luke and Panait [43.122] and Silva 
et al. [43.123]. A fundamental idea is that adding ma- 
terial rather than removing material from a GP tree 
is more likely to lead to a fitness improvement. The 
hitchhiking theory is that noneffective code is carried 
along by virtue of being attached to useful code. De- 
fense against crossover suggests that large amounts of 
noneffective code give a selection advantage later in GP 
runs when crossover is likely to be highly destructive of
good, fragile programs. Removal bias is the idea that it 
is harder for GP operators to remove exactly the right 
(i. e., noneffective) code than it is to add more. The fit- 
ness causes bloat theory suggests that fitness-neutral 
changes tend to increase program size just because 
there are many more programs with the same func- 
tionality at larger sizes than at smaller [43.124]. The 
modification point depth theory suggests that children 
formed by tree crossover at deep crossover points are 
likely to have fitness similar to their parents and thus
more likely to survive than the more radically different
children formed at shallow crossover points. Because 
larger trees have more very deep potential crossover 
points, there is a selection pressure toward growth. Fi- 
nally, the crossover bias theory [43.125] suggests that 
after many crossovers, a population will tend toward 
a limiting distribution of tree sizes [43.126] such that 
small trees are more common than large ones — note 
that this is the opposite of the effect that might be 
expected as the basis of a theory of bloat. However, 
when selection is considered, the majority of the small 
programs cannot compete with the larger ones, and 
so the distribution is now skewed in favor of larger
programs. 

Many different solutions to the problem of bloat 
have been proposed, many with some success. One sim- 
ple method is depth limiting, imposing a fixed limit on 
the tree depth that can be produced by the variation op- 
erators [43.19]. 

Another simple but effective method is Tarpeian 
bloat control [43.127]. Individuals which are larger than 
average receive, with a certain probability, a constant 
punitively bad fitness. The advantage is that these in- 
dividuals are not evaluated, and so a huge amount of 
time can be saved and devoted to running more genera- 
tions (as in [43.122]). The Tarpeian method does allow 
the population to grow beyond its initial size, since the 
punishment is only applied to a proportion of individu- 
als — typically around 1 in 3. This value can also be set 
adaptively [43.127]. 
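
The Tarpeian method can be sketched as below. This is a minimal illustration assuming that fitness is minimized, that programs are plain lists whose length is their size, and that the penalty probability is fixed; the function and parameter names are hypothetical.

```python
import random


def tarpeian_evaluate(population, size, fitness, p_kill=1 / 3,
                      bad_fitness=float("inf"), rng=random):
    """Return (fitness values, number of real evaluations performed)."""
    mean_size = sum(size(ind) for ind in population) / len(population)
    results = []
    evaluations = 0
    for ind in population:
        if size(ind) > mean_size and rng.random() < p_kill:
            # Larger than average and unlucky: punished without being evaluated.
            results.append(bad_fitness)
        else:
            results.append(fitness(ind))
            evaluations += 1
    return results, evaluations


# Toy population: programs are lists, fitness is a stand-in error measure.
population = [[1], [1, 2], [1, 2, 3, 4, 5, 6]]
toy_fitness = lambda ind: float(sum(ind))

always, n_always = tarpeian_evaluate(population, len, toy_fitness,
                                     p_kill=1.0, rng=random.Random(0))
never, n_never = tarpeian_evaluate(population, len, toy_fitness,
                                   p_kill=0.0, rng=random.Random(0))
```

The saving comes from the skipped calls to `fitness`: with `p_kill=1.0` the one above-average program in the toy population is never executed.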

The parsimony pressure method evaluates all indi- 
viduals, but imposes a fitness penalty on overly large 
individuals. This assumes that fitness is commensurable 
with size: the magnitude of the punishment establishes 
a de facto exchange rate between the two. Luke and 
Panait [43.122] found that parsimony pressure was ef- 
fective across problems and across a wide range of 
exchange rates. 
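
A linear form of parsimony pressure can be sketched as follows, assuming fitness is an error to be minimized. The function name and the value of `exchange_rate` are illustrative assumptions; the penalty need not be linear in practice.

```python
def parsimony_fitness(raw_error, size, exchange_rate=0.01):
    """Penalized fitness (minimized): each node costs `exchange_rate` units of error."""
    return raw_error + exchange_rate * size


# A slightly more accurate but much larger program loses to a compact one.
compact = parsimony_fitness(raw_error=0.50, size=10)
bloated = parsimony_fitness(raw_error=0.45, size=40)
```

With this exchange rate, 30 extra nodes cost 0.30 units of error, which outweighs the 0.05 improvement in raw error.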

The choice of an exchange rate can be avoided using 
multiobjective methods, such as Pareto-GP [43.128], 
where one of the objectives is fitness and the other 
program length or complexity. The correct definition 
for complexity in this context is itself an interesting 
research topic [43.96, 129]. Alternatively, the pressure 
against bloat can be moved from the fitness evalua- 
tion phase to the the selection phase of the algorithm, 
using the double tournament method [43.122]. Here 
individuals must compete in one fitness-based tour- 
nament and one size-based one. Another approach 
is to incorporate tree size directly into fitness
evaluation using a minimum description length
principle [43.130].
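
A simplified sketch of the double tournament idea follows. It assumes that fitness is minimized, that two fitness-based tournaments supply the contestants for a final size-based tournament, and that the smaller contestant wins the final with probability `p_small`. The parameter values and function names are assumptions, and the published method has further variants (for example, running the size-based tournament first).

```python
import random


def fitness_tournament(population, fitness, k, rng):
    """Pick k individuals at random and return the fittest (lowest error)."""
    return min(rng.sample(population, k), key=fitness)


def double_tournament(population, fitness, size, k=2, p_small=0.7, rng=random):
    """Select one parent: two fitness tournaments feed one size-based final."""
    a = fitness_tournament(population, fitness, k, rng)
    b = fitness_tournament(population, fitness, k, rng)
    smaller, larger = (a, b) if size(a) <= size(b) else (b, a)
    return smaller if rng.random() < p_small else larger


# Toy individuals encoded as (error, size) pairs.
population = [(0.9, 3), (0.2, 12), (0.5, 5), (0.3, 4)]
error_of = lambda ind: ind[0]
size_of = lambda ind: ind[1]

rng = random.Random(1)
parent = double_tournament(population, error_of, size_of, rng=rng)

# With k equal to the population size, both finalists are the fittest individual.
champion = double_tournament(population, error_of, size_of,
                             k=len(population), rng=rng)
```

No exchange rate between fitness and size is needed here: the two criteria are never added together, only compared separately.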

Another technique is called operator length equal- 
ization. A histogram of program sizes is maintained 
throughout the run and is used to set the popula- 
tion’s capacity for programs of different sizes. A newly 
created program which would cause the population’s 
capacity to be exceeded is rejected, unless exception- 
ally fit. A mutation-based variation of the method 
instead mutates the overly large individuals using 
directed mutation to become smaller or larger as 
needed. 

Some authors have argued that the choice of GP rep- 
resentation can avoid the issue of bloat [43.131]. Some 
aim to avoid the problem of bloat by speeding up fit- 
ness evaluation [43.82, 132] or avoiding wasted effort 
in evaluation [43.133, 134]. Sometimes GP techniques 
are introduced with other motivations but have the side- 
effect of reducing bloat [43.135]. 

In summary, researchers including Luke and 
Panait [43.122], Poli et al. [43.127], Miller [43.131],
and Silva et al. [43.123] have effectively declared vic- 
tory in the fight against bloat. However, their techniques 
have not yet become standard for new GP research and 
benchmark experiments. 


43.5.2 GP Theory 


Theoretical research in GP seeks to answer a variety of 
questions, for example: What are the drivers of popula- 
tion fitness convergence? How does the behavior of an 
operator influence the progress of the algorithm? How 
does the combination of different algorithmic mecha- 
nisms steer GP toward fitter solutions? What mecha- 
nisms cause bloat to arise? What problems are difficult 
for GP? How diverse is a GP population? Theoretical 
methodologies are based in mathematics and exploit 
formalisms, theorems, and proofs for rigor. While GP 
may appear simple, beyond its stochastic nature which 
it shares with all other evolutionary algorithms, its 
variety of representations each impose specific require- 
ments for theoretical treatment. All GP representations 
share two common traits which greatly contribute to 
the difficulty it poses for theoretical analysis. First, the 
representations have no fixed size, implying a complex 
search space. Second, GP representations do not im- 
ply that parents will be equal in size and shape. While 
crossover accommodates this lack of synchronization, 
it generally allows the exchange of content from
anywhere in one parent to anywhere in the other parent’s
tree. This implies a combinatorial number of possible
outcomes, and that like is not necessarily exchanged
with like. This functionality contributes to
complicated algorithmic behavior which is challenging 
to analyze. 

Here, we select several influential methods of theo- 
retical analysis and very briefly describe them and their 
results: schema-based analysis, Markov chain model- 
ing, runtime complexity, and problem difficulty. We 
also introduce the No Free Lunch Theorem and describe 
its implications for GP. 


Schema-Based Analysis 

In schema-based analysis, the search space is con- 
ceptually partitioned into hyperplanes (also known as 
schemas) which represent sets of partial solutions. 
There are numerous ways to do this and, as a con- 
sequence, multiple schema definitions have been pro- 
posed [43.136-139]. The fitness of a schema is esti- 
mated as the average fitness of all programs in the 
sample of its hyperplane, given a population. The pro- 
cesses of fitness-based selection and crossover are for- 
malized in a recurrence equation which describes the 
expected number of programs sampling a schema from 
the current population to the next. Exact formulations 
have been derived for most types of crossover [43.140, 
141]. These alternatively depend on making explicit the 
effects and the mechanisms of schema creation. This 
leads to insight; however, tracking schema equations 
in actual GP population dynamics is infeasible. Also, 
while schema theorems predict changes from one gen- 
eration to the next, they cannot predict further into the 
future to predict the long-term dynamics that GP prac- 
titioners care about. 


Markov Chain Analysis 
Markov chain models are one means of describing such 
long-term GP dynamics. They take advantage of the 
Markovian property observed in a GP algorithm: the 
composition of one generation’s population relies only 
upon that of the previous generation. Markov chains 
describe the probabilistic movement of a particular pop- 
ulation (state) to others using a probabilistic transition 
matrix. In evolutionary algorithms, the transition matrix 
must express the effects of any selection and varia- 
tion operators. The transition matrix, when multiplied 
by itself k times, indicates which new populations can 
be reached in k generations. This, in principle, allows 
a calculation of the probability that a population with 
a solution can be reached. To date a Markov chain for 
a simplified GP crossover operator has been derived, 
see [43.142]. Another interesting Markov chain-based
result has revealed that the distribution of functionality
of non-Turing complete programs approaches a limit 
as length increases. Markov chain analysis has also 
been the means of describing what happens with GP 
semantics rather than syntax. The influence of sub- 
tree crossover is studied in a semantic building block 
analysis by [43.143]. Markov chains, unfortunately, 
combinatorially explode with even simple extensions of 
algorithm dynamics or, in GP’s case, its theoretically in- 
finite search space. Thus, while they can support further 
analysis, ultimately this complexity is unwieldy to work 
with. 
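
The role of the transition matrix can be illustrated with a toy chain. The three states and their transition probabilities are invented for illustration; a real GP chain has one state per possible population, which is exactly the source of the combinatorial explosion noted above.

```python
def mat_mul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][m] * B[m][j] for m in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def mat_pow(P, k):
    """Return P multiplied by itself k times (the k-step transition matrix)."""
    n = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, P)
    return result


# Hypothetical chain over three population states.
# State 2 stands for "population contains a solution" and is absorbing.
P = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.00, 1.00],
]

P2 = mat_pow(P, 2)
P10 = mat_pow(P, 10)

# Probability of having reached the solution state from state 0.
reach_in_2 = P2[0][2]
reach_in_10 = P10[0][2]
```

Entry (i, j) of the k-th power gives the probability of being in state j after k generations when starting from state i, so the last column tracks the probability that a solution has been found.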


Runtime Complexity 
Due to stochasticity, it is arguably impossible in most 
cases to make formal guarantees about the number of 
fitness evaluations needed for a GP algorithm to find 
an optimal solution. However, initial steps in the run- 
time complexity analysis of genetic programming have 
been made in [43.144]. The authors study the runtime 
of hill climbing GP algorithms which use a mutation 
operator called HVL-Prime (Figs. 43.11 and 43.12). 
Several of these simplified GP algorithms were ana- 
lyzed on two separable model problems, Order and 
Majority introduced in [43.145]. Order and Majority 
each have an independent, additive fitness structure. 
They each admit multiple solutions based on their ob- 
jective function, so they exhibit a key property of 
all real GP problems. They each capture a different 
relevant facet of typical GP problems. Order repre- 
sents problems, such as classification problems, where 
the operators include conditional functions such as 
an IF-THEN-ELSE. These functions give rise to con- 
ditional execution paths which have implications for 
evolvability and the effectiveness of crossover. Ma- 
jority is a GP equivalent of the GA OneMax prob- 
lem [43.146]. It reflects a general (and thus weak) 
property required of GP solutions: a solution must 
have correct functionality (by evolving an aggrega- 
tion of subsolutions) and no incorrect functionality. 
The analyses highlighted, in particular, the impact of 
accepting or rejecting neutral moves and the importance
of a local mutation operator. A similar finding [43.147]
regarding mutation arose from the analysis of the Max
problem [43.148] and hillclimbing. For a search process
bounded by a maximally sized tree of n nodes, the time
complexity of the simple GP mutation-based hillclimbing
algorithms using HVL-Prime for the entire range of MAX
variants is $O(n \log^2 n)$ when one mutation operation
precedes each fitness evaluation. When multiple
mutations are successively applied before each fitness
evaluation, the time complexity is $O(n^2)$. This
complexity can be reduced to $O(n \log n)$ if the
mutations are biased to replace a random leaf with
distance d from the root with probability $2^{-d}$.

Fig. 43.11a-c HVL-prime mutation: substitution and
deletion. (a) Original parse tree, (b) result of
substitution, (c) result of deletion

Runtime analyses have also considered parsimony 
pressure and multiobjective GP algorithms for general- 
izations of Order and Majority [43.149]. 

GP algorithms have also been studied in the PAC 
learning framework [43.150]. 


Problem Difficulty 
Problem difficulty is the study of the differences be- 
tween algorithms and problems which lead to differ- 
ences in performance. Stated simply, the goal is to 
understand why some problems are easy and some are 
hard, and why some algorithms perform well on certain 
problems and others do not. Problem difficulty work in 
the field of GP has much in common with similar work 
in the broader field of EC. Problem difficulty is nat- 
urally related to the size of the search space; smaller 
spaces are easier to search, as are spaces in which
the solution is over-represented [43.151]. Difficulty is
also related to the fitness landscape [43.152], which in
turn depends on both the problem and the algorithm
and representation chosen to solve it. Landscapes with
few local optima (visualized in the fitness landscape as
peaks which are not as high as that of the global
optimum) are easier to search. Locality, that is the
property that small changes to a program lead to small
changes in fitness, implies a smooth, easily searchable
landscape [43.151, 153].

Fig. 43.12a,b HVL-prime mutation: insertion. (a) Original
parse tree, (b) result of insertion

However, more precise statements concerning prob- 
lem difficulty are usually desired. One important line of 
research was carried out by Vanneschi et al.
[43.154-156]. This involved calculating various measures of the
correlation of the fitness landscape, that is the rela- 
tionship between distance in the landscape and fitness 
difference. The measures include the fitness distance 
correlation and the negative slope coefficient. These 
measures require the definition of a distance measure 
on the search space, which in the case of standard GP 
means a distance between pairs of trees. Various tree 
distance measures have been proposed and used for this 
purpose [43.157-160]. However, the reliable prediction
of performance based purely on landscape analysis re- 
mains a distant goal in GP as it does in the broader field 
of EC. 
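
Fitness distance correlation can be sketched as a Pearson correlation over a sample of points. The sketch assumes that a distance from each sampled program to a known global optimum has already been computed with some tree distance measure, which is not shown; the sample values are invented for illustration.

```python
import math


def fitness_distance_correlation(fitnesses, distances):
    """Pearson correlation between fitness and distance to the global optimum."""
    n = len(fitnesses)
    mean_f = sum(fitnesses) / n
    mean_d = sum(distances) / n
    cov = sum((f - mean_f) * (d - mean_d)
              for f, d in zip(fitnesses, distances)) / n
    std_f = math.sqrt(sum((f - mean_f) ** 2 for f in fitnesses) / n)
    std_d = math.sqrt(sum((d - mean_d) ** 2 for d in distances) / n)
    # Undefined (division by zero) if either sample is constant.
    return cov / (std_f * std_d)


# Hypothetical sample with error-style fitness (lower is better).
distances = [0.0, 1.0, 2.0, 3.0]
guiding = [0.0, 2.0, 4.0, 6.0]       # error grows with distance from the optimum
misleading = [6.0, 4.0, 2.0, 0.0]    # error shrinks as one moves away

fdc_guiding = fitness_distance_correlation(guiding, distances)
fdc_misleading = fitness_distance_correlation(misleading, distances)
```

When fitness is minimized, a correlation near +1 suggests a landscape that guides the search toward the optimum, while a value near -1 suggests a deceptive one.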


No Free Lunch 

In a nutshell, the No Free Lunch Theorem [43.161] 
proves that, averaged over all problem instances, no 
algorithm outperforms another. Follow-up NFL anal- 
ysis [43.162, 163] yields a similar result for problems 
where the set of fitness functions is closed under
permutation. One question is whether the NFL theorem
applies to GP algorithms: for some problem class, is 
it worth developing a better GP algorithm, or will this 
effort offer no extra value when all instances of the 
problem are considered? Research has revealed two 
conditions under which the NFL breaks down for GP 
because the set of fitness functions is not closed un- 
der permutation. First, GP has a many-to-one syntax 
tree to program output mapping because many differ- 
ent programs have the same functionality while pro- 
gram output functionality is not uniformly distributed 
across syntax trees. Second, a geometric argument
[43.164] has shown that many realistic situations exist
where a set of GP problems is provably not closed un- 
der permutation. The implication of a contradiction to 
the No Free Lunch theorem is that it is worthwhile in- 
vesting effort in improving a GP algorithm for a class 
of problems. 
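
The averaging statement of the No Free Lunch theorem can be checked by brute force on a very small search space. The sketch below is an illustration under stated assumptions, not part of the cited analyses: the three-point domain, the two fitness values, and the three search strategies are all hypothetical choices. It enumerates every function on the domain and compares the average best value found by each strategy.

```python
from itertools import product

DOMAIN = (0, 1, 2)   # hypothetical search space with three points
VALUES = (0, 1)      # possible fitness values, to be minimized


def fixed_order(order):
    """Non-adaptive search: visit the points in a fixed order."""
    def next_point(history):
        return order[len(history)]
    return next_point


def adaptive(history):
    """Adaptive search: start at 0, then pick the next point depending on f(0)."""
    if not history:
        return 0
    visited = [x for x, _ in history]
    if len(history) == 1:
        return 1 if history[0][1] == 0 else 2
    return next(x for x in DOMAIN if x not in visited)


def best_after(next_point, f, evaluations):
    """Smallest f-value found after the given number of distinct evaluations."""
    history = []
    for _ in range(evaluations):
        x = next_point(history)
        history.append((x, f[x]))
    return min(y for _, y in history)


def average_performance(next_point, evaluations):
    """Average of best_after over all functions from DOMAIN to VALUES."""
    functions = [dict(zip(DOMAIN, ys)) for ys in product(VALUES, repeat=len(DOMAIN))]
    return sum(best_after(next_point, f, evaluations) for f in functions) / len(functions)


algorithms = {
    "ascending": fixed_order((0, 1, 2)),
    "descending": fixed_order((2, 1, 0)),
    "adaptive": adaptive,
}
results = {name: [average_performance(alg, k) for k in (1, 2, 3)]
           for name, alg in algorithms.items()}
```

All three strategies obtain the same averages because the set of all functions on the domain is closed under permutation. Restricting the enumeration to a subset that is not closed under permutation is what allows one strategy to outperform another.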


43.5.3 Modularity 


Modularity in GP is the ability of a representation 
to evolve good building blocks and then encapsulate 
and reuse them. This can be expected to make complex
programs far easier to find, since good building
blocks needed in multiple places in the program need
not be laboriously re-evolved each time. One of the
best-known approaches to modularity is automatically
defined functions (ADFs), where the building blocks are
implemented as functions which are defined in one 
part of the evolving program and then invoked from 
another part [43.58]. This work was followed by au- 
tomatically defined macros which are more powerful 
than ADFs and allow control of program flow [43.165]; 
automatically defined iteration, recursion, and memory
stores [43.10]; modularity in other representations
[43.166]; and demonstrations of the power of
reuse [43.167].


43.5.4 Open-Ended Evolution and GP 


Biological evolution is a long-running exploration of 
the enormously varied and indefinitely sized DNA 
search space. There is no hint that a limit on new ar- 
eas of the space to be explored will ever be reached. 
In contrast, EC algorithms often operate in search 
spaces which are finite and highly simplified in com- 
parison to biology. Although GP itself can be used 
for a wide variety of tasks (Sect. 43.1), each specific 
instance of the GP algorithm is capable of solving 
only a very narrow problem. In contrast, some re- 
searchers see biological evolution as pointing the way 
to a more ambitious vision of the possibilities for 
GP [43.168]. In this vision, an evolutionary run would 
continue for an indefinite length of time, always ex- 
ploring new areas of an indefinitely sized search space; 
always responding to changes in the environment; and 
always reshaping the search space itself. This vision 
is particularly well suited to GP, as opposed to GAs 
and similar algorithms, because GP already works in 
search spaces which are infinite in theory, if not in 
practice. 

To make this type of GP possible, it is necessary to 
prevent convergence of the population on a narrow area 
of the search space. Diversity preservation [43.169], pe- 
riodic injection of new random material [43.170], and 
island-structured population models [43.88] can help in 
this regard. 

Open-ended evolution would also be facilitated by 
complexity and nonstationarity in the algorithm’s
evolutionary ecosystem. If fitness criteria are dynamic or
coevolutionary [43.171—173], there may be no natural 
43.6.1 Conferences and Journals


Several conferences provide venues for the publication 
of new GP research results. The ACM Genetic and 
Evolutionary Computation Conference (GECCO) alter- 
nates annually between North America and the rest of 
the world and includes a GP track. EuroGP is held an- 
nually in Europe as the main event of Evo*, and focuses 
only on GP. The IEEE Congress on Evolutionary Com- 
putation is a larger event with broad coverage of EC 
in general. Genetic Programming Theory and Practice 
is held annually in Ann Arbor, MI, USA and provides 
a focused forum for GP discussion. Parallel Problem 
Solving from Nature is one of the older, general EC con- 
ferences, held biennially in Europe. It alternates with 
the Evolution Artificielle conference. Finally, Founda- 
tions of Genetic Algorithms is a smaller, theory-focused 
conference. 

The journal most specialized to the field is prob- 
ably Genetic Programming and Evolvable Machines 
(published by Springer). The September 2010, 10-year 
anniversary issue included several review articles on 
GP. Evolutionary Computation (MIT Press) and the 
IEEE Transactions on Evolutionary Computation also 
publish important GP material. Other on-topic journals 
with a broader focus include Applied Soft Computing 
and Natural Computing. 


43.6.2 Software 


A great variety of GP software is available. We will 
mention only a few packages — further options can be 
found online. 

One of the well-known Java systems is 
ECJ [43.174,175]. It is a general-purpose system 
with support for many representations, problems, and 
methods, both within GP and in the wider field of EC. 
It has a helpful mailing list. Watchmaker [43.176] is 
another general-purpose system with excellent out- 
of-the-box examples. GEVA [43.177, 178] is another 
Java-based package, this time with support only for 
GE. 

end-point to evolution, and so continued exploration under
different criteria can lead to unlimited new results.

For users of C++ there are also several options.
Some popular packages include Evolutionary
Objects [43.179], μGP [43.180–182], and OpenBeagle
[43.183, 184]. Matlab users may be interested in
GPLab [43.185], which implements standard GP, while 
DEAP [43.186] provides implementations of several al- 
gorithms in Python. PushGP [43.187] is available in 
many languages. 

Two more systems are worth mentioning for their 
deliberate focus on simplicity and understandability. 
TinyGP [43.188] and PonyGE [43.189] implement stan- 
dard GP and GE respectively, each in a single, readable 
source file. 

Moving on from open source, Michael Schmidt and 
Hod Lipson’s Eureqa [43.190] is a free-to-use tool with 
a focus on symbolic regression of numerical data and 
the built-in ability to use cloud resources. 

Finally, the authors are aware of two commercially 
available GP tools, each fast and industrial-strength. 
They have more automation and "it just works" functionality,
relative to most free and open-source tools. Free
trials are available. DataModeler (Evolved Analytics 
LLC) [43.191] is a notebook in Mathematica. It em- 
ploys the ParetoGP method [43.128] which gives the 
ability to trade program fitness off against complex- 
ity, and to form ensembles of programs. It also exploits 
complex population archiving and archive-based selec- 
tion. It offers means of dealing with ill-conditioned data 
and extracting information on variable importance from 
evolved models. Discipulus (Register Machine Learn- 
ing Technologies, Inc.) [43.192] evolves machine code 
based on the ideas of Nordin et al. [43.193]. It runs 
on Windows only. The machine code representation 
allows very fast fitness evaluation and low memory us- 
age, hence large populations. In addition to typical GP 
features, it can: use an ES to optimise numerical con- 
stants; automatically construct ensembles; preprocess 
data; extract variable importance after runs; automat- 
ically simplify results; and save them to high-level 
languages. 


43.6.3 Resources and Further Reading


Another useful resource for GP research is the GP
Bibliography [43.194]. In addition to its huge, regularly
updated collection of BibTeX-formatted citations,
it has lists of researchers’ homepages [43.195] and
co-authorship graphs. The GP mailing list [43.196] is one
well-known forum for discussion.

Many of the traditional GP benchmark problems 
have been criticized for being unrealistic in various 
ways. The lack of standardization of benchmark prob- 
lems also allows the possibility of cherry-picking of 
benchmarks. Effort is underway to bring some stan- 
dardization to the choice of GP benchmarks [43.197, 
198]. 

Those wishing to read further have many good 
options. The Field Guide to GP is a good introduc- 
tion, walking the reader through simple examples, 
Nikolaus Hansen, Dirk V. Arnold, Anne Auger


Evolution strategies (ES) are evolutionary algo- 
rithms that date back to the 1960s and that are 
most commonly applied to black-box optimization 
problems in continuous search spaces. Inspired 
by biological evolution, their original formula- 
tion is based on the application of mutation, 
recombination and selection in populations of 
candidate solutions. From the algorithmic view- 
point, ES are optimization methods that sample 
new candidate solutions stochastically, most com- 
monly from a multivariate normal probability 
distribution. Their two most prominent design 
principles are unbiasedness and adaptive con- 
trol of parameters of the sample distribution. In 
this overview, the important concepts of success-based
step-size control, self-adaptation, and derandomization
are covered, as well as more recent
developments such as covariance matrix adapta- 
tion and natural ES. The latter give new insights 
into the fundamental mathematical rationale be- 
hind ES. A broad discussion of theoretical results 
includes progress rate results on various func- 
tion classes and convergence proofs for evolution 
strategies. 
44.1 Overview


Evolution strategies [44.1—4], sometimes also referred 
to as evolutionary strategies, and evolutionary pro- 
gramming [44.5] are search paradigms inspired by the 
principles of biological evolution. They belong to the 
family of evolutionary algorithms that address opti- 
mization problems by implementing a repeated process 
of (small) stochastic variations followed by selection. 
In each generation (or iteration), new offspring (or 
candidate solutions) are generated from their parents 
(candidate solutions already visited), their fitness is 
evaluated, and the better offspring are selected to become
the parents for the next generation.

ES most commonly address the problem of continuous
black-box optimization. The search space is the
continuous domain, $\mathbb{R}^n$, and solutions in search space
are $n$-dimensional vectors, denoted as $x$. We consider
an objective or fitness function $f : \mathbb{R}^n \to \mathbb{R}$, $x \mapsto f(x)$, to
be minimized. We make no specific assumptions on $f$,
other than that $f$ can be evaluated for each $x$, and refer
to this search problem as black-box optimization.
The objective is, loosely speaking, to generate solutions
($x$-vectors) with small $f$-values while using a small
number of $f$-evaluations. Formally, we would like to converge
to an essential global optimum of $f$, in the sense that
the best $f(x)$ value gets arbitrarily close to the essential
infimum of $f$ (i.e., the smallest $f$-value for which
all larger, i.e., worse $f$-values have sublevel sets with
positive volume).

In this context, we present an overview of methods 
that sample new offspring, or candidate solutions, from 
normal distributions. Naturally, such an overview is bi- 
ased by the authors’ viewpoints, and our emphasis will 
be on important design principles and on contemporary 
ES that we consider as most relevant in practice or fu- 
ture research. More comprehensive historical overviews 
can be found elsewhere [44.6, 7]. 

In the next section, the main principles are intro- 
duced and two algorithm templates for an evolution 
strategy are presented. Section 44.3 presents six ES that 
mark important conceptual and algorithmic develop- 
ments. Section 44.4 summarizes important theoretical 
results. 


44.1.1 Symbols and Abbreviations 


Throughout this chapter, vectors like $z \in \mathbb{R}^n$ are column
vectors, their transpose is denoted as $z^T$, and transformations
like $\exp(z)$, $z^2$, or $|z|$ are applied component-wise.
Further symbols are:

• $|z| = (|z_1|, |z_2|, \dots)^T$ absolute value taken component-wise

• $\|z\| = \sqrt{\sum_i z_i^2}$ Euclidean length of a vector

• $\sim$ equality in distribution

• $\propto$ in the limit proportional to

• $\circ$ binary operator giving the component-wise product of two vectors or matrices (Hadamard product), such that for $a, b \in \mathbb{R}^n$ we have $a \circ b \in \mathbb{R}^n$ and $(a \circ b)_i = a_i b_i$.

• $1_{\cdot}$ the indicator function, $1_\alpha = 0$ if $\alpha$ is false or 0 or empty, and $1_\alpha = 1$ otherwise.

• $\lambda \in \mathbb{N}$ number of offspring, offspring population size

• $\mu \in \mathbb{N}$ number of parents, parental population size

• $\mu_w = \left( \sum_{k=1}^{\mu} |w_k| \right)^2 / \sum_{k=1}^{\mu} w_k^2$, the variance effective selection mass or effective number of parents, where always $\mu_w \le \mu$ and $\mu_w = \mu$ if all recombination weights $w_k$ are equal in absolute value

• $(1+1)$ elitist selection scheme with one parent and one offspring, see Sect. 44.2.5

• $(\mu \overset{+}{,} \lambda)$, e.g., $(1+1)$ or $(1, \lambda)$, selection schemes, see Sect. 44.2.5

• $(\mu/\rho, \lambda)$ selection scheme with recombination (if $\rho > 1$), see Sect. 44.2.5

• $\rho \in \mathbb{N}$ number of parents for recombination

• $\sigma > 0$ a step-size and/or standard deviation

• $\sigma \in \mathbb{R}^n_+$ a vector of step-sizes and/or standard deviations

• $\varphi \in \mathbb{R}$ a progress measure, see Definition 44.2 and Sect. 44.4.2

• $c_{\mu/\mu,\lambda}$ the progress coefficient for the $(\mu/\mu, \lambda)$-ES [44.8] equals the expected value of the average of the largest $\mu$ order statistics of $\lambda$ independent standard normally distributed random numbers and is of the order of $\sqrt{2 \log(\lambda/\mu)}$.

• $C \in \mathbb{R}^{n \times n}$ a (symmetric and positive definite) covariance matrix

• $C^{1/2} \in \mathbb{R}^{n \times n}$ a matrix that satisfies $C^{1/2} {C^{1/2}}^T = C$ and is symmetric if not stated otherwise. If $C^{1/2}$ is symmetric, the eigendecomposition $C^{1/2} = B \Lambda B^T$ with $B B^T = I$ and the diagonal matrix $\Lambda$ exists and we find $C = C^{1/2} C^{1/2} = B \Lambda^2 B^T$ as eigendecomposition of $C$.

• $e_i$ the $i$-th canonical basis vector

• $f : \mathbb{R}^n \to \mathbb{R}$ fitness or objective function to be minimized

• $I \in \mathbb{R}^{n \times n}$ the identity matrix (identity transformation)

• i.i.d. independent and identically distributed

• $\mathcal{N}(x, C)$ a multivariate normal distribution with expectation and modal value $x$ and covariance matrix $C$, see Sect. 44.2.8.

• $n \in \mathbb{N}$ search space dimension

• $P$ a multiset of individuals, a population

• $s, s_\sigma, s_c \in \mathbb{R}^n$ a search path or evolution path

• $s, s_k$ endogenous strategy parameters (also known as control parameters) of a single parent or the $k$-th offspring; they typically parametrize the mutation, for example with a step-size $\sigma$ or a covariance matrix $C$

• $t \in \mathbb{N}$ time or iteration index

• $w_k \in \mathbb{R}$ recombination weights

• $x, x^{(t)}, x_k \in \mathbb{R}^n$ solution or object parameter vector of a single parent (at iteration $t$) or of the $k$-th offspring; an element of the search space $\mathbb{R}^n$ that serves as argument to the fitness function $f : \mathbb{R}^n \to \mathbb{R}$.

• $\mathrm{diag} : \mathbb{R}^n \to \mathbb{R}^{n \times n}$ the diagonal matrix from a vector

• $\exp^\alpha : \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}$, $A \mapsto \sum_{i=0}^{\infty} (\alpha A)^i / i!$ is the matrix exponential for $n > 1$, otherwise the exponential function. If $A$ is symmetric and $B \Lambda B^T = A$ is the eigendecomposition of $A$ with $B B^T = I$ and $\Lambda$ diagonal, we have $\exp(A) =$


44.2 Main Principles


ES derive inspiration from principles of biological evolution.
We assume a population, $P$, of so-called individuals.
Each individual consists of a solution or object
parameter vector $x \in \mathbb{R}^n$ (the visible traits) and further
endogenous parameters, $s$ (the hidden traits), and an
associated fitness value, $f(x)$. In some cases, the population
contains only one individual. Individuals are also
denoted as parents or offspring, depending on the context.
In a generational procedure:


1. One or several parents are picked from the popula- 
tion (mating selection) and new offspring are gen- 
erated by duplication and recombination of these 
parents. 

2. The new offspring undergo mutation and become 
new members of the population. 

3. Environmental selection reduces the population to 
its original size. 


Within this procedure, ES employ the following 
main principles that are specified and applied in the op- 
erators and algorithms further below. 


44.2.1 Environmental Selection 


Environmental selection is applied as so-called truncation
selection. Based on the individuals’ fitnesses,
$f(x)$, only the $\mu$ best individuals from the population
survive. In contrast to roulette wheel selection
in genetic algorithms [44.9], only fitness ranks are 
used. In evolution strategies, environmental selection 
is deterministic. In evolutionary programming, like 
in many other evolutionary algorithms, environmental 
selection has a stochastic component. Environmen- 
tal selection can also remove overaged individuals 
first. 
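
Truncation selection as described above can be sketched in a few lines of Python. The tuple layout `(x, s, f_value)` for an individual and the function name are assumptions made for this illustration. The second call shows that only fitness ranks matter: a strictly monotonic transformation of the f-values leaves the set of survivors unchanged.

```python
def truncation_selection(population, mu):
    """Keep the mu individuals with the smallest f-values (minimization).

    Each individual is assumed to be a tuple (x, s, f_value).
    Selection is deterministic and uses only the ranking of the f-values.
    """
    if mu > len(population):
        raise ValueError("mu must not exceed the population size")
    return sorted(population, key=lambda ind: ind[2])[:mu]


# Hypothetical population of four individuals without strategy parameters.
population = [([0.0], None, 9.0), ([1.0], None, 1.0),
              ([2.0], None, 4.0), ([3.0], None, 0.5)]
survivors = truncation_selection(population, mu=2)

# A strictly increasing transformation of the fitness (here f -> f**3)
# changes the values but not the ranking, hence not the survivors.
transformed = [(x, s, f ** 3) for x, s, f in population]
survivors_transformed = truncation_selection(transformed, mu=2)
```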


$B \exp(\Lambda) B^T = B \left( \sum_{i=0}^{\infty} \Lambda^i / i! \right) B^T = I + B \Lambda B^T + B \Lambda^2 B^T / 2 + \cdots$. Furthermore, we have $\exp^\alpha(A) = \exp(A)^\alpha = \exp(\alpha A)$ and $\exp^\alpha(x) = (e^\alpha)^x = e^{\alpha x}$.


44.2.2 Mating Selection and Recombination


Mating selection picks individuals from the population
to become new parents. Recombination generates a single
new offspring from these parents. Specifically, we
differentiate two common scenarios for mating selection
and recombination:


• Fitness-independent mating selection and recombination
do not depend on the fitness values of
the individuals and can be either deterministic
or stochastic. Environmental selection is then essential
to drive the evolution toward better solutions.

• Fitness-based mating selection and recombination,
where the recombination operator utilizes the fitness
ranking of the parents (in a deterministic way).
Environmental selection can potentially be omitted in
this case.


44.2.3 Mutation and Parameter Control 


Mutation introduces small, random, and unbiased 
changes to an individual. These changes typically affect 
all variables. The average size of these changes depends 
on endogenous parameters that change over time. These 
parameters are also called control parameters, or endogenous
strategy parameters, and define the notion
of small, for example, via the step-size $\sigma$. In contrast,
exogenous strategy parameters are fixed once and for
all, for example, parent number $\mu$. Parameter control
is not always directly inspired by biological evolution, 
but is an indispensable and central feature of evolution 
strategies. 


44.2.4 Unbiasedness 


Unbiasedness is a generic design principle of evolu- 
tion strategies. Variation resulting from mutation or 
recombination is designed to introduce new, unbiased 
information. Selection, on the other hand, biases this
information toward solutions with better fitness. Under
neutral selection (i.e., fitness-independent mating
and environmental selection), all variation operators are
desired to be unbiased. Maximum exploration and unbiasedness
are in accord. ES are unbiased in the following
respects:


• The type of mutation distribution, the Gaussian or
normal distribution, is chosen in order to have rotational
symmetry and maximum entropy (maximum exploration)
under the given variances. Decreasing
the entropy would introduce prior information and
therefore a bias.

• Object parameters and endogenous strategy parameters
are unbiased under recombination and unbiased
under mutation. Typically, mutation has expectation zero.

• Invariance properties avoid a bias toward a specific
representation of the fitness function, e.g., representation
in a specific coordinate system or using
specific fitness values (invariance to strictly monotonic
transformations of the fitness values can be
achieved). Parameter control in evolution strategies
strives for invariance properties [44.10].


44.2.5 $(\mu/\rho \overset{+}{,} \lambda)$ Notation for Selection and Recombination


An evolution strategy is an iterative (generational)
procedure. In each generation new individuals (offspring)
are created from existing individuals (parents).
A mnemonic notation is commonly used to describe
some aspects of this iteration. The $(\mu/\rho \overset{+}{,} \lambda)$-ES,
where $\mu$, $\rho$ and $\lambda$ are positive integers, also frequently
denoted as $(\mu \overset{+}{,} \lambda)$-ES (where $\rho$ remains unspecified),
describes the following:

• The parent population contains $\mu$ individuals.

• For recombination, $\rho$ (out of $\mu$) parent individuals
are used. We have therefore $\rho \le \mu$.

• $\lambda$ denotes the number of offspring generated in each
iteration.

• $\overset{+}{,}$ describes whether or not selection is additionally
based on the individuals’ age. An evolution strategy
applies either plus- or comma-selection. In plus-selection,
age is not taken into account and the $\mu$
best of $\mu + \lambda$ individuals are chosen. Selection is elitist
and, in effect, the parents are the $\mu$ all-time best
individuals. In comma-selection, individuals die out
after one iteration step and only the offspring (the
youngest individuals) survive to the next generation.
In that case, environmental selection chooses $\mu$ parents
from $\lambda$ offspring.

In a $(\mu, \lambda)$-ES, $\lambda \ge \mu$ must hold and the case $\lambda = \mu$
requires fitness-based mating selection or recombination.
In a $(\mu + \lambda)$-ES, $\lambda = 1$ is possible and known as
steady-state scenario.

Occasionally, a subscript to $\rho$ is used in order to
denote the type of recombination, e.g., $\rho_I$ or $\rho_W$ for
intermediate or weighted recombination, respectively.
Without a subscript, we tacitly assume intermediate
recombination, if not stated otherwise. The notation
has also been expanded to include the maximum age,
$\kappa$, of individuals as $(\mu, \kappa, \lambda)$-ES [44.11], where plus-selection
corresponds to $\kappa = \infty$ and comma-selection
corresponds to $\kappa = 1$.


44.2.6 Two Algorithm Templates


Algorithm 44.1 gives pseudocode for the evolution
strategy.

Algorithm 44.1 The $(\mu/\rho \overset{+}{,} \lambda)$-ES
 1: given $n, \rho, \mu, \lambda \in \mathbb{N}_+$
 2: initialize $P = \{(x_k, s_k, f(x_k)) \mid 1 \le k \le \mu\}$
 3: while not happy
 4:   for $k \in \{1, \dots, \lambda\}$
 5:     $(x_k, s_k) =$ recombine(select_mates($\rho$, $P$))
 6:     $s_k \leftarrow$ mutate_s($s_k$)
 7:     $x_k \leftarrow$ mutate_x($s_k$, $x_k$) $\in \mathbb{R}^n$
 8:   $P \leftarrow P \cup \{(x_k, s_k, f(x_k)) \mid 1 \le k \le \lambda\}$
 9:   $P \leftarrow$ select_by_age($P$) // identity for ‘+’
10:   $P \leftarrow$ select_$\mu$_best($\mu$, $P$) // by f-ranking


Given is a population, $P$, of at least $\mu$ individuals
$(x_k, s_k, f(x_k))$, $k = 1, \dots, \mu$. Vector $x_k \in \mathbb{R}^n$ is a solution
vector and $s_k$ contains the control or endogenous
strategy parameters, for example, a success counter or
a step-size that primarily serves to control the mutation
of $x$ (in Line 7). The values of $s_k$ may be identical for
all $k$. In each generation, first $\lambda$ offspring are generated
(Lines 4–7), each by recombination of $\rho \le \mu$ individuals
from $P$ (Line 5), followed by mutation of $s$ (Line 6)
and of $x$ (Line 7). The new offspring are added to $P$
(Line 8). Overaged individuals are removed from $P$
(Line 9), where individuals from the same generation
have, by definition, the same age. Finally, the best $\mu$
individuals are retained in $P$ (Line 10).

The mutation of the x-vector in Line 7 always in- 
volves a stochastic component. Lines 5 and 6 may have 
stochastic components as well. 
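
The template of Algorithm 44.1 can be made concrete in many ways. The following Python sketch is one possible instantiation under stated assumptions, not the authors' implementation: the strategy parameter $s$ is a single step-size, mates are picked uniformly at random, recombination is intermediate (with a geometric mean for the step-size), the step-size mutation is log-normal with an assumed learning rate `tau`, and a fixed iteration budget replaces the "not happy" condition. All function and parameter names are hypothetical.

```python
import math
import random


def sphere(x):
    """Test function f(x) = sum of squares, to be minimized."""
    return sum(xi * xi for xi in x)


def es_mu_rho_lambda(f, n, mu, rho, lam, plus=False, iterations=200, seed=1):
    """Sketch of the (mu/rho +, lambda)-ES template; returns the final population.

    Individuals are tuples (x, sigma, f(x), age).
    """
    if rho > mu or (not plus and lam < mu):
        raise ValueError("need rho <= mu and, for comma-selection, lam >= mu")
    rng = random.Random(seed)
    tau = 1.0 / math.sqrt(2.0 * n)          # assumed learning rate for sigma
    P = []                                  # Line 2: initial population
    for _ in range(mu):
        x = [rng.uniform(-3.0, 3.0) for _ in range(n)]
        P.append((x, 1.0, f(x), 0))
    for _ in range(iterations):             # Line 3: fixed budget as stopping rule
        offspring = []
        for _ in range(lam):                # Line 4
            mates = rng.sample(P, rho)      # Line 5: fitness-independent mating
            x = [sum(m[0][i] for m in mates) / rho for i in range(n)]
            s = math.exp(sum(math.log(m[1]) for m in mates) / rho)
            s *= math.exp(tau * rng.gauss(0.0, 1.0))            # Line 6
            x = [xi + s * rng.gauss(0.0, 1.0) for xi in x]      # Line 7
            offspring.append((x, s, f(x), 0))
        # Line 8: add the offspring, ageing the previous population by one
        P = [(x, s, fx, age + 1) for (x, s, fx, age) in P] + offspring
        if not plus:                        # Line 9: identity for plus-selection
            P = [ind for ind in P if ind[3] == 0]
        P = sorted(P, key=lambda ind: ind[2])[:mu]              # Line 10
    return P


final = es_mu_rho_lambda(sphere, n=5, mu=5, rho=2, lam=20)
best = min(ind[2] for ind in final)
```

With `plus=True` the best f-value in the population can never get worse from one iteration to the next, because parents compete with their offspring. With comma-selection this guarantee is given up in exchange for the ability to forget outdated strategy parameters.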

When select_mates in Line 5 selects $\rho = \mu$ individuals
from $P$, it reduces to the identity. If $\rho = \mu$
and recombination is deterministic, as is commonly the
case, the result of recombine is the same parental centroid
for all offspring. The computation of the parental
centroid can be done once before the for loop or as the
last step of the while loop, simplifying the initialization
of the algorithm. Algorithm 44.2 shows the pseudocode
in this case.


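Algorithm 44.2 The $(\mu/\mu \overset{+}{,} \lambda)$-ES
 1: given $n, \lambda \in \mathbb{N}_+$
 2: initialize $x \in \mathbb{R}^n$, $s$, $P = \{\}$
 3: while not happy
 4:   for $k \in \{1, \dots, \lambda\}$
 5:     $s_k =$ mutate_s($s$)
 6:     $x_k =$ mutate_x($s_k$, $x$)
 7:     $P \leftarrow P \cup \{(x_k, s_k, f(x_k))\}$
 8:   $P \leftarrow$ select_by_age($P$) // identity for ‘+’
 9:   $(x, s) \leftarrow$ recombine($P$, $x$, $s$)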


In Algorithm 44.2, only a single parental centroid
$(x, s)$ is initialized. Mutation takes this parental centroid
as input (notice that $s_k$ and $x_k$ in Lines 5 and 6 are
now assigned rather than updated) and recombination
is postponed to the end of the loop, computing in Line 9
the new parental centroid. While $(x_k, s_k)$ can contain all
necessary information for this computation, it is often
more transparent to use $x$ and $s$ as additional arguments
in Line 9. Selection based on $f$-values is now limited to
mating selection in procedure recombine (that is, procedure
select_$\mu$_best is omitted and $\mu$ is the number
of individuals in $P$ that are actually used by recombine).

Using a single parental centroid has become the 
most popular approach, because such algorithms are 
simpler to formalize, easier to analyze, and even per- 
form better in various circumstances as they allow for 
maximum genetic repair (see in the following). All 
instances of ES given in Sect. 44.3 are based on Al- 
gorithm 44.2. 
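
A minimal Python sketch of the centroid-based template of Algorithm 44.2 is given below, restricted to comma-selection. As before, the concrete choices are assumptions for illustration: the strategy parameter is a single step-size, its mutation is log-normal with an assumed learning rate `tau`, recombination is intermediate over the $\mu$ best offspring, and a fixed number of iterations replaces the stopping criterion.

```python
import math
import random


def sphere(x):
    """Test function f(x) = sum of squares, to be minimized."""
    return sum(xi * xi for xi in x)


def es_mu_mu_lambda(f, x, sigma, mu, lam, iterations=100, seed=1):
    """Sketch of the (mu/mu, lambda)-ES with a single parental centroid (x, sigma)."""
    if not 1 <= mu <= lam:
        raise ValueError("need 1 <= mu <= lam")
    rng = random.Random(seed)
    n = len(x)
    tau = 1.0 / math.sqrt(2.0 * n)      # assumed learning rate for sigma
    for _ in range(iterations):         # Line 3
        P = []                          # comma-selection: old individuals are dropped (Line 8)
        for _ in range(lam):            # Line 4
            s_k = sigma * math.exp(tau * rng.gauss(0.0, 1.0))       # Line 5
            x_k = [xi + s_k * rng.gauss(0.0, 1.0) for xi in x]      # Line 6
            P.append((x_k, s_k, f(x_k)))                            # Line 7
        # Line 9: fitness-based mating selection and intermediate recombination
        best = sorted(P, key=lambda ind: ind[2])[:mu]
        x = [sum(ind[0][i] for ind in best) / mu for i in range(n)]
        sigma = math.exp(sum(math.log(ind[1]) for ind in best) / mu)
    return x, sigma


x_final, sigma_final = es_mu_mu_lambda(sphere, [3.0, 3.0, 3.0], 1.0, mu=3, lam=12)
```

Because all offspring are generated from the same centroid, only `x` and `sigma` need to be initialized, which is the simplification the text refers to.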


44.2.7 Recombination Operators 


In ES, recombination combines information from several
parents to generate a single new offspring. Often,
multirecombination is used, where more than two parents
are recombined ($\rho > 2$). In contrast, in genetic
algorithms often two offspring are generated from the
recombination of two parents. In evolutionary programming,
recombination is generally not used. The most
important recombination operators used in evolution
strategies are the following:


• Discrete or dominant recombination, denoted by
$(\mu/\rho_D \overset{+}{,} \lambda)$, is also known as uniform crossover in
genetic algorithms. For each variable (component
of the $x$-vector), a single parent is drawn uniformly
from all $\rho$ parents to inherit the variable value. For
$\rho$ parents that all differ in each variable value, the
result is uniformly distributed across $\rho^n$ different
$x$-values. The result of discrete recombination depends
on the given coordinate system.

• Intermediate recombination, denoted by $(\mu/\rho_I \overset{+}{,} \lambda)$,
takes the average value of all $\rho$ parents (computes
the center of mass, the centroid).

• Weighted multirecombination [44.10, 12, 13], denoted
by $(\mu/\rho_W \overset{+}{,} \lambda)$, is a generalization of intermediate
recombination, usually with $\rho = \mu$. It takes
a weighted average of all $\rho$ parents. The weight
values depend on the fitness ranking, in that better
parents never get smaller weights than inferior
ones. With equal weights, intermediate recombination
is recovered. By using comma selection and
$\rho = \mu = \lambda$, where some of the weights may be zero,
weighted recombination can take over the role of
fitness-based environmental selection and negative
weights become a feasible option [44.12, 13]. The
sum of weights must be either one or zero, or recombination
must be applied to the vectors $x_k - x$
and the result added to $x$.


In principle, recombination operators from genetic 
algorithms, like one-point and two-point crossover or 
line recombination [44.14] can alternatively be used. 
However, they have been rarely applied in ES. 
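
The three recombination operators listed above can be sketched as follows. The function names and the example parents are assumptions made for this illustration. The last helper computes the variance effective selection mass $\mu_w$ from Sect. 44.1.1 for a given weight vector.

```python
import random


def discrete_recombination(parents, rng):
    """For each coordinate, copy the value of one uniformly chosen parent."""
    n = len(parents[0])
    return [rng.choice(parents)[i] for i in range(n)]


def intermediate_recombination(parents):
    """Centroid (arithmetic mean) of all parents."""
    rho = len(parents)
    n = len(parents[0])
    return [sum(p[i] for p in parents) / rho for i in range(n)]


def weighted_recombination(ranked_parents, weights):
    """Weighted average of parents that are sorted from best to worst.

    This sketch requires non-increasing weights that sum to one.
    """
    if len(weights) != len(ranked_parents):
        raise ValueError("one weight per parent is required")
    if any(a < b for a, b in zip(weights, weights[1:])):
        raise ValueError("better parents must not get smaller weights")
    if abs(sum(weights) - 1.0) > 1e-12:
        raise ValueError("weights must sum to one in this sketch")
    n = len(ranked_parents[0])
    return [sum(w * p[i] for w, p in zip(weights, ranked_parents)) for i in range(n)]


def effective_selection_mass(weights):
    """mu_w = (sum |w_k|)^2 / sum w_k^2."""
    return sum(abs(w) for w in weights) ** 2 / sum(w * w for w in weights)


# Hypothetical parents, already ranked from best to worst.
parents = [[0.0, 0.0], [2.0, 4.0], [4.0, 2.0]]
child = discrete_recombination(parents, random.Random(0))
centroid = intermediate_recombination(parents)
weighted = weighted_recombination(parents, [0.5, 0.3, 0.2])
mu_w = effective_selection_mass([0.5, 0.3, 0.2])
mu_w_equal = effective_selection_mass([1.0 / 3.0] * 3)
```

With equal weights the effective selection mass equals the number of parents, and any unequal weighting makes it smaller, in agreement with $\mu_w \le \mu$.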

In ES, the result of selection and recombination
is often deterministic (namely, if $\rho = \mu$ and recombination
is intermediate or weighted). This means that
eventually all offspring are generated by mutation from 
the same single solution vector (the parental centroid) 
as in Algorithm 44.2. This leads, for given variances, 
to maximum entropy because all offspring are inde- 
pendently drawn from the same normal distribution. 
With discrete recombination, the offspring distribution 
is generated from a mixture of normal distributions 
with different mean values. The resulting distribu- 
tion has lower entropy unless it has a larger overall 
variance. 

The role of recombination, in general, is to keep the 
variation in a population high. Discrete recombination 
directly introduces variation by generating different 
solutions. Their distance resembles the distance be- 
tween the parents. However, discrete recombination, 
as it depends on the given coordinate system, relies 
on separability: it can introduce variation successfully 
only if values of disrupted variables do not strongly depend
on each other. Solutions resulting from discrete
recombination lie on the vertices of an axis-parallel
box.

Fig. 44.1a–c Three two-dimensional multivariate normal distributions $\mathcal{N}(0, C) \sim C^{1/2} \mathcal{N}(0, I)$. The covariance matrix $C$
of the distribution is, from left to right, the identity $I$ (isotropic distribution), the diagonal matrix $\begin{pmatrix} 1/4 & 0 \\ 0 & 4 \end{pmatrix}$ (axis-parallel
distribution) and $\begin{pmatrix} 2.125 & 1.875 \\ 1.875 & 2.125 \end{pmatrix}$ with the same eigenvalues $(1/4, 4)$ as the diagonal matrix. Shown are in each subfigure
the mean at 0 as small black dot (a different mean solely changes the axis annotations), two eigenvectors of $C$ along
the principal axes of the ellipsoids (thin black lines), two ellipsoids reflecting the set of points $\{x : (x - 0)^T C^{-1} (x - 0) \in \{1, 4\}\}$
that represent the 1-$\sigma$ and 2-$\sigma$ lines of equal density, and 100 sampled points (however, a few of them are likely
to be outside of the area shown)

Intermediate and weighted multirecombination do 
not lead to variation within the new population as they 
result in the same single point for all offspring. How- 
ever, they do allow the mutation operator to introduce 
additional variation by means of genetic repair [44.15]. 
Recombinative averaging reduces the effective step
length taken in unfavorable directions by a factor of $\sqrt{\mu}$
(or $\sqrt{\mu_w}$ in the case of weighted recombination), but
leaves the step length in favorable directions essentially
unchanged, see also Sect. 44.4.2. This may allow increased
variation by enlarging mutations by a factor of
about $\mu$ (or $\mu_w$) as revealed in (44.16), to achieve maximal
progress.


44.2.8 Mutation Operators 


The mutation operator introduces (small) variations by
adding a point symmetric perturbation to the result
of recombination, say a solution vector $x \in \mathbb{R}^n$. This
perturbation is drawn from a multivariate normal distribution,
$\mathcal{N}(0, C)$, with zero mean (expected value)
and covariance matrix $C \in \mathbb{R}^{n \times n}$. Besides normally distributed
mutations, Cauchy mutations [44.16–18] have
also been proposed in the context of ES and evolutionary
programming. We have $x + \mathcal{N}(0, C) \sim \mathcal{N}(x, C)$,
meaning that $x$ determines the expected value of the
new offspring individual. We also have $x + \mathcal{N}(0, C) \sim
x + C^{1/2} \mathcal{N}(0, I)$, meaning that the linear transformation
$C^{1/2}$ generates the desired distribution from the vector
$\mathcal{N}(0, I)$ that has i.i.d. $\mathcal{N}(0, 1)$ components. (Using
the normal distribution has several advantages. The
$\mathcal{N}(0, I)$ distribution is the most convenient way to implement
an isotropic perturbation. The normal distribution
is stable: sums of independent normally distributed
random variables are again normally distributed. This
facilitates the design and analysis of algorithms remarkably.
Furthermore, the normal distribution has maximum
entropy under the given variances.)
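
The relation $x + \mathcal{N}(0, C) \sim x + C^{1/2} \mathcal{N}(0, I)$ translates directly into a sampling procedure. The sketch below is an illustration under one explicit assumption: instead of the symmetric square root it uses a lower-triangular Cholesky factor $A$ with $A A^T = C$, which yields the same distribution and is easy to compute with the standard library alone. The function names are hypothetical.

```python
import math
import random


def cholesky(C):
    """Lower-triangular A with A A^T = C for a symmetric positive definite C."""
    n = len(C)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = C[i][j] - sum(A[i][k] * A[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("C is not positive definite")
                A[i][j] = math.sqrt(s)
            else:
                A[i][j] = s / A[j][j]
    return A


def mutate(x, C, rng):
    """Return x + A z with z ~ N(0, I), i.e., one sample from N(x, C)."""
    n = len(x)
    A = cholesky(C)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]   # i.i.d. N(0, 1) components
    return [x[i] + sum(A[i][k] * z[k] for k in range(i + 1)) for i in range(n)]


# Example with a nondiagonal covariance matrix.
C = [[2.125, 1.875], [1.875, 2.125]]
A = cholesky(C)
rng = random.Random(3)
offspring = [mutate([0.0, 0.0], C, rng) for _ in range(100)]
```

In a practical implementation the factorization would be computed once per iteration rather than once per sample.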

Figure 44.1 shows different normal distributions
in dimension $n = 2$. Their lines of equal density
are ellipsoids. Any straight section through the
two-dimensional density recovers a two-dimensional 
Gaussian bell. Based on multivariate normal distri- 
butions, three different mutation operators can be 
distinguished: 


• Spherical/isotropic (Fig. 44.1a) where the covariance
matrix is proportional to the identity, i.e.,
the mutation distribution follows $\sigma \mathcal{N}(0, I)$ with
step-size $\sigma > 0$. The distribution is spherical and
invariant under rotations about its mean. In the following,
Algorithm 44.3 uses this kind of mutation.

• Axis-parallel (Fig. 44.1b) where the covariance matrix
is a diagonal matrix, i.e., the mutation distribution
follows $\mathcal{N}(0, \mathrm{diag}(\sigma)^2)$, where $\sigma$ is a vector
of coordinate-wise standard deviations and the diagonal
matrix $\mathrm{diag}(\sigma)^2$ has eigenvalues $\sigma_i^2$ with
eigenvectors $e_i$. The principal axes of the ellipsoid
are parallel to the coordinate axes. This case
includes the previous isotropic case. Below, Algorithms
44.4–44.6 implement this kind of mutation
distribution.

• General (Fig. 44.1c) where the covariance matrix
is symmetric and positive definite (i.e., $x^T C x > 0$
for all $x \ne 0$), generally nondiagonal and has
$(n^2 + n)/2$ degrees of freedom (control parameters).
The general case includes the previous
axis-parallel and spherical cases. Below, Algorithms
44.7 and 44.8 implement general multivariate
normally distributed mutations.
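
The three cases differ only in the structure of the covariance matrix. The short sketch below, with hypothetical helper names, builds the isotropic and axis-parallel matrices, counts the degrees of freedom of each case, and checks with a closed-form formula for symmetric 2×2 matrices that the nondiagonal example matrix of Fig. 44.1c has the same eigenvalues as the diagonal one.

```python
import math


def isotropic_covariance(sigma, n):
    """sigma^2 times the identity: one free parameter."""
    return [[sigma ** 2 if i == j else 0.0 for j in range(n)] for i in range(n)]


def axis_parallel_covariance(sigmas):
    """diag(sigma)^2: one free parameter per coordinate."""
    n = len(sigmas)
    return [[sigmas[i] ** 2 if i == j else 0.0 for j in range(n)] for i in range(n)]


def degrees_of_freedom(kind, n):
    """Number of control parameters of the covariance matrix."""
    return {"isotropic": 1, "axis-parallel": n, "general": (n * n + n) // 2}[kind]


def eigenvalues_2x2(C):
    """Eigenvalues (ascending) of a symmetric 2x2 matrix."""
    a, b, d = C[0][0], C[0][1], C[1][1]
    center = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    return (center - radius, center + radius)


C_isotropic = isotropic_covariance(1.0, 2)
C_axis_parallel = axis_parallel_covariance([0.5, 2.0])   # diag(1/4, 4)
C_general = [[2.125, 1.875], [1.875, 2.125]]
eig_axis_parallel = eigenvalues_2x2(C_axis_parallel)
eig_general = eigenvalues_2x2(C_general)
```

The quadratic growth of the general case with $n$ is what makes learning a full covariance matrix more expensive than adapting coordinate-wise step-sizes.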


In the first and the second cases, the variations of variables are independent of each other, that is, they are uncorrelated. This limits the usefulness of the operator in practice. The third case is incompatible with discrete recombination: for a narrow, diagonally oriented ellipsoid (not to be confused with a diagonal covariance matrix), a point resulting from selection and discrete recombination lies within this ellipsoid only if each coordinate is taken from the same parent (which happens with probability $1/\mu^{n-1}$) or from a parent with a very similar value in this coordinate. The narrower the ellipsoid, the more similar (i.e., correlated) the value needs to be. As another illustration, consider sampling, neutral selection, and discrete recombination based on Fig. 44.1c: after discrete recombination the points $(-2, 2)$ and $(2, -2)$ outside the ellipsoid have the same probability as the points $(2, 2)$ and $(-2, -2)$ inside the ellipsoid.

The mutation operators introduced are unbiased in several ways. They are all point symmetrical and have expectation zero. Therefore, mutation alone will almost certainly not lead to better fitness values in expectation. The isotropic mutation operator features the same distribution along any direction. The general mutation operator is, as long as $C$ remains unspecified, unbiased toward the choice of a Cartesian coordinate system, i.e., unbiased toward the representation of solutions $x$, which has also been referred to as invariance to affine coordinate system transformations [44.10]. This, however, depends on how $C$ is adapted (see the following).


44.3 Parameter Control


Controlling the parameters of the mutation operator is key to the design of ES. Consider the isotropic operator (Fig. 44.1a), where the step-size $\sigma$ is a scaling factor for the random vector perturbation. The step-size controls to a large extent the convergence speed. In situations where larger step-sizes lead to larger expected improvements, a step-size control technique should aim at increasing the step-size (and decreasing it in the opposite scenario).

The importance of step-size control is illustrated with a simple experiment. Consider a spherical function $f(x) = \|x\|^{\alpha}$, $\alpha > 0$, and a (1+1)-ES with constant step-size equal to $\sigma = 10^{-2}$, i.e., with mutations drawn from $10^{-2}\,\mathcal{N}(0, I)$. The convergence of the algorithm is depicted in Fig. 44.2 (constant $\sigma$ graphs).

We observe, roughly speaking, three stages: up to 600 function evaluations, progress toward the optimum is slow. At this stage, the fixed step-size is too small. Between 700 and 800 evaluations, fast progress toward the optimum is observed. At this stage, the step-size is close to optimal. Afterward, the progress decreases and approaches the rate of the pure random search algorithm, well illustrated on the bottom subfigure. At this stage the fixed step-size is too large and the probability to sample better offspring becomes very small.

The figure also shows runs of the (1+1)-ES with 
1/5th success rule step-size control (as described in 
Sect. 44.3.1) and the step-size evolution associated to 
one of these runs. The initial step-size is far too small 
and we observe that the adaptation technique increases 
the step-size in the first iterations. Afterward, step-size 
is kept roughly proportional to the distance to the op- 
timum, which is in fact optimal and leads to linear 
convergence on the top subfigure. 

Generally, the goal of parameter control is to drive the endogenous strategy parameters close to their optimal values. These optimal values, as we have seen for the step-size in Fig. 44.2, can significantly change over time or depending on the position in search space. In the most general case, the mutation operator has $(n^2 + n)/2$ degrees of freedom (Sect. 44.2.8). The conjecture is that in the desired scenario lines of equal density of the mutation operator resemble locally the lines of equal fitness [44.4, pp. 242f.]. In the case of convex-quadratic fitness functions this resemblance can be perfect and, apart from the step-size, optimal parameters do not change over time (as illustrated in Fig. 44.3 below).

Control parameters like the step-size can be stored on different levels. Each individual can have its own step-size value (like in Algorithms 44.4 and 44.5), or a single step-size is stored and applied to all individuals in the population. In the latter case, sometimes different populations with different parameter values are run in parallel [44.19].

[Figure 44.2: two panels (a) and (b), each plotting the distance to the optimum against the number of function evaluations; curves: random search, constant $\sigma$, adaptive step-size $\sigma$, and, in (a) only, the step-size $\sigma$.]

Fig. 44.2a,b Runs of the (1+1)-ES with constant step-size, of pure random search (uniform in $[-0.2, 1]^{10}$), and of the (1+1)-ES with 1/5th success rule (Algorithm 44.3) on a spherical function $f(x) = \|x\|^{\alpha}$, $\alpha > 0$ (because of invariance to monotonic $f$-transformation the same graph is observed for any $\alpha > 0$). For each algorithm, there are three runs in (a) and (b). The x-axis is linear in (a) and in log-scale in (b). For the (1+1)-ES with constant step-size, $\sigma$ equals $10^{-2}$. For the (1+1)-ES with 1/5th success rule, the initial step-size is chosen very small, at $10^{-9}$, and the parameter $d$ equals $1 + 10/3$. In (a) also the evolution of the step-size of one of the runs of the (1+1)-ES with 1/5th success rule is shown. All algorithms are initialized at 1. Eventually, the (1+1)-ES with 1/5th success rule reveals linear behavior in (a), while the other two algorithms eventually reveal linear behavior in (b).

In the following, six specific ES are outlined, each 
of them representing an important achievement in pa- 
rameter control. 


44.3.1 The 1/5th Success Rule 


The 1/5th success rule for step-size control is based 
on an important discovery made very early in the re- 
search of evolution strategies [44.1]. A similar rule 
had also been found independently before in [44.20]. 
As a control mechanism in practice, the 1/5th success 
rule has been mostly superseded by more sophisticated 
methods. However, its conceptual insight remains re- 
markably valuable. 

Consider a linear fitness function, for example, $f: x \mapsto x_1$ or $f: x \mapsto \sum_i x_i$. In this case, any point-symmetric mutation operator has a success probability of 1/2: in one half of the cases the perturbation improves the original solution, in the other half the solution deteriorates. By Taylor's formula, smooth functions become more and more linear as the neighborhood size decreases. Therefore, the success probability approaches 1/2 for step-size $\sigma \to 0$. On most nonlinear functions, the success rate is indeed a monotonically decreasing function of $\sigma$ and goes to zero for $\sigma \to \infty$. This suggests controlling the step-size by increasing it for large success rates and decreasing it for small ones. This mechanism can drive the step-size close to the optimal value.
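
The dependence of the success rate on the step-size can be checked with a small Monte Carlo experiment. The sketch below is an illustration under assumptions of our own (sphere function, dimension 10, parent at the all-ones vector, the particular step-sizes, and the helper name `success_rate`); it estimates the probability that an isotropic mutation is at least as good as the parent.

```python
import numpy as np

def success_rate(f, x, sigma, n_samples=20_000, rng=None):
    """Estimate P(f(x + sigma * N(0, I)) <= f(x)) by sampling."""
    rng = np.random.default_rng(0) if rng is None else rng
    z = rng.standard_normal((n_samples, x.size))
    values = np.apply_along_axis(f, 1, x + sigma * z)
    return float(np.mean(values <= f(x)))

def sphere(x):
    return float(np.dot(x, x))

x0 = np.ones(10)
step_sizes = (1e-4, 1e-1, 1.0, 10.0)
rates = [success_rate(sphere, x0, s) for s in step_sizes]
```

For the smallest step-size the sphere looks locally linear and the estimate is close to 1/2; for the largest step-size almost no offspring improves on the parent.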

Rechenberg [44.1] investigated two simple but quite different functions, the corridor function

$$ f: x \mapsto \begin{cases} x_1 & \text{if } |x_i| \le 1 \text{ for } i = 2, \dots, n \\ \infty & \text{otherwise,} \end{cases} $$

and the sphere function

$$ f: x \mapsto \sum_{i=1}^{n} x_i^2 . $$

He found optimal success rates for the (1+1)-ES with isotropic mutation to be $\approx 0.184 > 1/6$ and $\approx 0.270 < 1/3$, respectively (for $n \to \infty$) [44.1]. Optimality here means achieving the largest expected approach to the optimum in a single generation. This leads to approximately 1/5 as the success rate at which to switch between decreasing and increasing the step-size.


Algorithm 44.3 The (1+1)-ES with 1/5th Rule
1: given $n \in \mathbb{N}_+$, $d \approx \sqrt{n+1}$
2: initialize $x \in \mathbb{R}^n$, $\sigma > 0$
3: while not happy
4:     $x_1 = x + \sigma \times \mathcal{N}(0, I)$    // mutation
5:     $\sigma \leftarrow \sigma \times \exp^{1/d}\left(1_{f(x_1) \le f(x)} - 1/5\right)$
6:     if $f(x_1) \le f(x)$    // select if better
7:         $x = x_1$    // x-value of new parent

Algorithm 44.3 implements the (1+1)-ES with 1/5th success rule in a simple and effective way [44.21]. Lines 5-7 implement Line 9 from Algorithm 44.2, including selection in Line 8. Line 5 in Algorithm 44.3 updates the step-size $\sigma$ of the single parent. The step-size does not change if and only if the argument of exp is zero. While this cannot happen in a single generation, we can still find a stationary point for $\sigma$: $\log\sigma$ is unbiased if and only if the expected value of the argument of exp is zero. This is the case if $\mathrm{E}\,1_{f(x_1) \le f(x)} = 1/5$, in other words, if the probability of an improvement with $f(x_1) \le f(x)$ is 20%. Otherwise, $\log\sigma$ increases in expectation if the success probability is larger than 1/5 and decreases if the success probability is smaller than 1/5. Hence, Algorithm 44.3 indeed implements the 1/5th success rule.
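
A direct transcription of Algorithm 44.3 into Python might look as follows. This is a sketch under assumptions: the "while not happy" condition is replaced by a fixed evaluation budget, and the function name, the seed, and the returned history are illustrative choices, not part of the algorithm.

```python
import numpy as np

def one_plus_one_es(f, x0, sigma0, max_evals=2000, seed=1):
    """(1+1)-ES with 1/5th success rule, following Algorithm 44.3."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    n = x.size
    d = np.sqrt(n + 1)                  # damping parameter, d ~ sqrt(n + 1)
    sigma = float(sigma0)
    fx = f(x)
    history = []
    for _ in range(max_evals):          # stands in for "while not happy"
        x1 = x + sigma * rng.standard_normal(n)          # mutation
        f1 = f(x1)
        success = f1 <= fx
        sigma *= np.exp((float(success) - 0.2) / d)      # step-size update
        if success:                                      # select if better
            x, fx = x1, f1
        history.append((fx, sigma))
    return x, sigma, history

def sphere(x):
    return float(np.dot(x, x))

# Deliberately far too small initial step-size, as in the experiment of Fig. 44.2
x_best, sigma_final, history = one_plus_one_es(sphere, np.ones(10), sigma0=1e-9)
```

The recorded `history` holds the parent fitness and the step-size after each evaluation, so the initial step-size increase and the subsequent proportional decrease can be inspected directly.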


44.3.2 Self-Adaptation 


A seminal idea in the domain of ES is parameter 
control via self-adaptation [44.3]. In self-adaptation, 
new control parameter settings are generated similar 
to new x-vectors by recombination and mutation. Al- 
gorithm 44.4 presents an example with adaptation of 
n coordinate-wise standard deviations (individual step- 
sizes). 


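Algorithm 44.4 The $(\mu/\mu, \lambda)$-$\sigma$SA-ES
1: given $n \in \mathbb{N}_+$, $\lambda \ge 5n$, $\mu \approx \lambda/4 \in \mathbb{N}$, $\tau \approx 1/\sqrt{n}$, $\tau_i \approx 1/n^{1/4}$
2: initialize $x \in \mathbb{R}^n$, $\sigma \in \mathbb{R}^n_+$
3: while not happy
4:     for $k \in \{1, \dots, \lambda\}$
           // random numbers i.i.d. for all k
5:         $\xi_k = \tau\,\mathcal{N}(0, 1)$    // global step-size
6:         $\boldsymbol{\xi}_k = \tau_i\,\mathcal{N}(0, I)$    // coordinate-wise $\sigma$
7:         $z_k = \mathcal{N}(0, I)$    // x-vector change
           // mutation
8:         $\sigma_k = \sigma \circ \exp(\boldsymbol{\xi}_k) \times \exp(\xi_k)$
9:         $x_k = x + \sigma_k \circ z_k$
10:    $P = \mathrm{sel\_}\mu\mathrm{\_best}(\{(x_k, \sigma_k, f(x_k)) \mid 1 \le k \le \lambda\})$
       // recombination
11:    $\sigma = \frac{1}{\mu} \sum_{\sigma_k \in P} \sigma_k$
12:    $x = \frac{1}{\mu} \sum_{x_k \in P} x_k$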

First, for conducting the mutation, random events are drawn in Lines 5-7. In Line 8, the step-size vector for each individual undergoes (i) a mutation common for all components, $\exp(\xi_k)$, and (ii) a component-wise mutation with $\exp(\boldsymbol{\xi}_k)$. These mutations are unbiased, in that $\mathrm{E}\log\sigma_k = \log\sigma$. The mutation of $x$ in Line 9 uses the mutated vector $\sigma_k$. After selection in Line 10, intermediate recombination is applied to compute $x$ and $\sigma$ for the next generation. By taking the average over $\sigma_k$ we have $\mathrm{E}\sigma = \mathrm{E}\sigma_k$ in Line 11. However, the application of mutation and recombination on $\sigma$ introduces a moderate bias such that $\sigma$ tends to increase under neutral selection [44.22].

In order to achieve stable behavior of $\sigma$, the number of parents $\mu$ must be large enough, which is reflected in the setting of $\lambda$. A setting of $\tau \approx 1/4$ has been proposed in combination with $\xi_k$ being uniformly distributed across the two values in $\{-1, 1\}$ [44.2].
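
The self-adaptation scheme of Algorithm 44.4 can be sketched in vectorized form as below. The fixed generation budget, the default $\lambda = 5n$, the seed, and the function name are assumptions made for illustration; the learning rates follow the settings $\tau \approx 1/\sqrt{n}$ and $\tau_i \approx 1/n^{1/4}$ given in the algorithm.

```python
import numpy as np

def sigma_sa_es(f, x0, sigma0, lam=None, generations=200, seed=2):
    """(mu/mu, lambda)-sigma-SA-ES with coordinate-wise step-sizes (Algorithm 44.4)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    n = x.size
    lam = 5 * n if lam is None else lam
    mu = max(1, lam // 4)
    tau, tau_i = 1 / np.sqrt(n), 1 / n ** 0.25
    sigma = np.full(n, float(sigma0))
    for _ in range(generations):                        # stands in for "while not happy"
        xi = tau * rng.standard_normal((lam, 1))        # global factor, one per offspring
        xi_vec = tau_i * rng.standard_normal((lam, n))  # coordinate-wise factors
        z = rng.standard_normal((lam, n))
        sigmas = sigma * np.exp(xi_vec) * np.exp(xi)    # mutate step-sizes first
        xs = x + sigmas * z                             # then mutate x with them
        fitness = np.array([f(xk) for xk in xs])
        best = np.argsort(fitness)[:mu]                 # select the mu best
        sigma = sigmas[best].mean(axis=0)               # intermediate recombination
        x = xs[best].mean(axis=0)
    return x, sigma

def sphere(x):
    return float(np.dot(x, x))

x_start = np.ones(5)
x_final, sigma_final = sigma_sa_es(sphere, x_start, sigma0=0.3)
```

Note that each offspring carries its own mutated step-size vector, and only the step-sizes of selected offspring are averaged into the next parent.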


44.3.3 Derandomized Self-Adaptation 


Derandomized self-adaptation [44.23] addresses the problem of selection noise that occurs with self-adaptation of $\sigma$ as outlined in Algorithm 44.4. Selection noise refers to the possibility that very good offspring may be generated with poor strategy parameter settings and vice versa. The problem occurs frequently and has two origins:


• A small/large component in $|\sigma_k \circ z_k|$ (Line 9 in Algorithm 44.4) does not necessarily imply that the respective component of $\sigma_k$ is small/large. Selection of $\sigma$ is disturbed by the respective realizations of $z$.

• Selection of a small/large component of $|\sigma_k \circ z_k|$ does not imply that this is necessarily a favorable setting: more often than not, the sign of a component is more important than its size, and all other components influence the selection as well.


Due to selection noise, poor values are frequently inherited and we observe stochastic fluctuations of $\sigma$. Such fluctuations can in particular lead to very small values (very large values are removed by selection more quickly). The overall magnitude of these fluctuations can be implicitly controlled via the parent number $\mu$, because intermediate recombination (Line 11 in Algorithm 44.4) effectively reduces the magnitude of $\sigma$-changes and biases $\log\sigma$ to larger values.

For $\mu \ll n$, the stochastic fluctuations become prohibitive and therefore $\mu \approx \lambda/4 \ge 1.25n$ is chosen to make $\sigma$-self-adaptation reliable.

Derandomization addresses the problem of selection noise on $\sigma$ directly without resorting to a large parent number. The derandomized $(1, \lambda)$-$\sigma$SA-ES is outlined in Algorithm 44.5 and addresses selection noise twofold.


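Algorithm 44.5 Derandomized $(1, \lambda)$-$\sigma$SA-ES
1: given $n \in \mathbb{N}_+$, $\lambda \approx 10$, $\tau \approx 1/3$, $d \approx \sqrt{n}$, $d_i \approx n$
2: initialize $x \in \mathbb{R}^n$, $\sigma \in \mathbb{R}^n_+$
3: while not happy
4:     for $k \in \{1, \dots, \lambda\}$
           // random numbers i.i.d. for all k
5:         $\xi_k = \tau\,\mathcal{N}(0, 1)$
6:         $z_k = \mathcal{N}(0, I)$
           // mutation, re-using random events
7:         $x_k = x + \exp(\xi_k) \times \sigma \circ z_k$
8:         $\sigma_k = \sigma \circ \exp^{1/d_i}\left(\frac{|z_k|}{\mathrm{E}|\mathcal{N}(0,1)|} - 1\right) \times \exp^{1/d}(\xi_k)$
9:     $(x_1, \sigma_1, f(x_1)) \leftarrow \mathrm{select\_single\_best}(\{(x_k, \sigma_k, f(x_k)) \mid 1 \le k \le \lambda\})$
       // assign new parent
10:    $\sigma = \sigma_1$
11:    $x = x_1$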


Instead of introducing new variations in $\sigma$ by means of $\exp(\boldsymbol{\xi}_k)$, the variations from $z_k$ are directly used for the mutation of $\sigma$ in Line 8. The variations are dampened compared to their use in the mutation of $x$ (Line 7) via $d$ and $d_i$, thereby mimicking the effect of intermediate recombination on $\sigma$ [44.23, 24]. The order of the two mutation equations becomes irrelevant.

For Algorithm 44.5 also a $(\mu/\mu, \lambda)$ variant with recombination is feasible. However, in particular in the $(\mu/\mu_I, \lambda)$-ES, $\sigma$-self-adaptation tends to generate too small step-sizes. A remedy for this problem is to use nonlocal information for step-size control.
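
The derandomized update of Algorithm 44.5 can be sketched as follows. The generation budget, seed, and function name are illustrative assumptions; the constant $\mathrm{E}|\mathcal{N}(0,1)| = \sqrt{2/\pi}$ is the standard expectation of the absolute value of a standard normal variate.

```python
import numpy as np

E_ABS_N01 = np.sqrt(2 / np.pi)  # E|N(0, 1)|

def derandomized_sa_es(f, x0, sigma0, lam=10, generations=300, seed=3):
    """Derandomized (1, lambda)-sigma-SA-ES, following Algorithm 44.5."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    n = x.size
    tau, d, d_i = 1 / 3, np.sqrt(n), float(n)
    sigma = np.full(n, float(sigma0))
    for _ in range(generations):                  # stands in for "while not happy"
        xi = tau * rng.standard_normal((lam, 1))  # one global variate per offspring
        z = rng.standard_normal((lam, n))
        xs = x + np.exp(xi) * sigma * z           # mutation of x
        # mutation of sigma re-uses z and xi, damped by d_i and d
        sigmas = sigma * np.exp((np.abs(z) / E_ABS_N01 - 1) / d_i) * np.exp(xi / d)
        k = int(np.argmin([f(xk) for xk in xs]))  # select the single best
        x, sigma = xs[k], sigmas[k]               # assign new parent
    return x, sigma

def sphere(x):
    return float(np.dot(x, x))

x_start = np.ones(5)
x_final, sigma_final = derandomized_sa_es(sphere, x_start, sigma0=0.3)
```

In contrast to the previous sketch, no separate random numbers are drawn for the coordinate-wise step-size change: a large realized $|z_k|$ component directly enlarges the corresponding step-size of that offspring.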


44.3.4 Nonlocal Derandomized Step-Size 
Control (CSA) 


When using self-adaptation, step-sizes are associated with individuals and selected based on the fitness of each individual. However, step-sizes that serve individuals well by giving them a high likelihood to be selected are generally not step-sizes that maximize the progress of the entire population or the parental centroid $x$. We will see later that, for example, the optimal step-size may increase linearly with $\mu$ (Sect. 44.4.2 and (44.16)). With self-adaptation, on the other hand, the step-size of the $\mu$-th best offspring is typically even smaller than the step-size of the best offspring. Consequently, Algorithm 44.5 often assumes too small step-sizes and can be considerably improved by using nonlocal information about the evolution of the population. Instead of single (local) mutation steps $z$, an exponentially fading record, $s_\sigma$, of mutation steps is taken. This record, referred to as search path or evolution path, can be pictured as a sequence or sum of consecutive successful $z$-steps that is nonlocal in time and space. A search path carries information about the interrelation between single steps. This information can improve the adaptation and search procedure remarkably. Algorithm 44.6 outlines the $(\mu/\mu, \lambda)$-ES with cumulative path length control, also denoted as cumulative step-size adaptation (CSA), and additionally with nonlocal individual step-size adaptation [44.25, 26].


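Algorithm 44.6 The $(\mu/\mu, \lambda)$-ES with Search Path
1: given $n \in \mathbb{N}_+$, $\lambda \in \mathbb{N}$, $\mu \approx \lambda/4 \in \mathbb{N}$, $c_\sigma \approx \sqrt{\mu/(n + \mu)}$, $d \approx 1 + \sqrt{\mu/n}$, $d_i \approx 3n$
2: initialize $x \in \mathbb{R}^n$, $\sigma \in \mathbb{R}^n_+$, $s_\sigma = 0$
3: while not happy
4:     for $k \in \{1, \dots, \lambda\}$
5:         $z_k = \mathcal{N}(0, I)$    // i.i.d. for each k
6:         $x_k = x + \sigma \circ z_k$
7:     $P \leftarrow \mathrm{sel\_}\mu\mathrm{\_best}(\{(x_k, z_k, f(x_k)) \mid 1 \le k \le \lambda\})$
       // recombination and parent update
8:     $s_\sigma \leftarrow (1 - c_\sigma)\, s_\sigma + \sqrt{c_\sigma (2 - c_\sigma)}\, \frac{\sqrt{\mu}}{\mu} \sum_{z_k \in P} z_k$    // search path
9:     $\sigma \leftarrow \sigma \circ \exp^{1/d_i}\left(\frac{|s_\sigma|}{\mathrm{E}|\mathcal{N}(0,1)|} - 1\right) \times \exp^{c_\sigma/d}\left(\frac{\|s_\sigma\|}{\mathrm{E}\|\mathcal{N}(0, I)\|} - 1\right)$
10:    $x = \frac{1}{\mu} \sum_{x_k \in P} x_k$


In the $(\mu/\mu, \lambda)$-ES with search path, Algorithm 44.6, the factor $\xi_k$ for changing the overall step-size has disappeared (compared to Algorithm 44.5) and the update of $\sigma$ is postponed until after the for loop. Instead of the additional random variate $\xi_k$, the length of the search path $\|s_\sigma\|$ determines the global step-size change in Line 9. For the individual step-size change, $|z_k|$ is replaced by $|s_\sigma|$.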

Using a search path is justified in two ways. First, it implements a low-pass filter for selected $z$-steps, removing high-frequency (most likely noisy) information. Second, and more importantly, it utilizes information that is otherwise lost: even if all single steps have the same length, the length of $s_\sigma$ can vary, because it depends on the correlation between the directions of $z$-steps. If single steps point into similar directions, the path will be up to almost $\sqrt{2/c_\sigma}$ times longer than a single step and the step-size will increase. If they oppose each other, the path will be up to almost $\sqrt{2/c_\sigma}$ times shorter and the step-size will decrease. The same is true for single components of $s_\sigma$.

The factors $\sqrt{c_\sigma (2 - c_\sigma)}$ and $\sqrt{\mu}$ in Line 8 guarantee unbiasedness of $s_\sigma$ under neutral selection, as usual.
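
The effect of step correlation on the path length can be verified with a few lines of code. The sketch below applies the cumulation of Line 8 with $\mu = 1$ (so that $\frac{\sqrt{\mu}}{\mu}\sum z_k$ reduces to a single step) to unit-length steps; the value $c_\sigma = 0.1$, the number of steps, and the helper name `accumulate` are illustrative assumptions.

```python
import numpy as np

def accumulate(steps, c):
    """Exponentially fading record of steps: s <- (1 - c) s + sqrt(c (2 - c)) z."""
    s = np.zeros_like(steps[0], dtype=float)
    for z in steps:
        s = (1 - c) * s + np.sqrt(c * (2 - c)) * z
    return s

c = 0.1
z = np.array([1.0, 0.0])  # a single step of length one

# All steps point in the same direction
same = accumulate([z] * 500, c)
# Consecutive steps oppose each other
alternating = accumulate([z if i % 2 == 0 else -z for i in range(500)], c)

len_same = np.linalg.norm(same)                # approaches sqrt((2 - c) / c)
len_alternating = np.linalg.norm(alternating)  # approaches sqrt(c / (2 - c))
```

For small $c_\sigma$ the two limits are close to $\sqrt{2/c_\sigma}$ and $1/\sqrt{2/c_\sigma}$, which matches the "up to almost" statement above, although every single step has the same length.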

All ES described so far are of somewhat limited value, because they feature only isotropic or axis-parallel mutation operators. In the remainder we consider methods that entertain not only an $n$-dimensional step-size vector $\sigma$, but also correlations between variables for the mutation of $x$.


44.3.5 Addressing Dependences 
Between Variables 


The ES presented so far sample the mutation distribution independently in each component of the given coordinate system. The lines of equal density are either spherical or axis-parallel ellipsoids (compare Fig. 44.1). This is a major drawback, because problems with a long or elongated valley can be solved efficiently only if the valley is aligned with the coordinate system. In this section, we discuss ES that allow us to traverse non-axis-parallel valleys efficiently by sampling distributions with correlations.


Full Covariance Matrix 
Algorithms that adapt the complete covariance ma- 
trix of the mutation distribution (compare Sect. 44.2.8) 
are correlated mutations [44.3], the generating set 
adaptation [44.26], the covariance matrix adaptation 
(CMA) [44.27], a mutative invariant adaptation [44.28], 
and some instances of natural evolution strategies 
(NES) [44.29-31]. Correlated mutations and some nat- 
ural ES are however not invariant under changes of the 


coordinate system [44.10, 31,32]. In the next sections, 
we outline two ES that adapt the full covariance ma- 
trix reliably and are invariant under coordinate system 
changes: the covariance matrix adaptation evolution 
strategy (CMA-ES) and the exponential natural evolu- 
tion strategy (xNES). 


Restricted Covariance Matrix 

Algorithms that adapt nondiagonal covariance matrices, but are restricted to certain matrices, are the momentum adaptation [44.33], direction adaptation [44.26], main vector adaptation [44.34], and limited memory CMA-ES [44.35]. These variants are limited in their capability to shape the mutation distribution, but they might be advantageous for larger dimensional problems, say, for dimensions larger than 100.


44.3.6 Covariance Matrix Adaptation (CMA) 


The CMA-ES [44.10, 27,36] is a de facto standard 
in continuous domain evolutionary computation. The 
CMA-ES is a natural generalization of Algorithm 44.6 
in that the mutation ellipsoids are not constrained to be 
axis-parallel, but can take on a general orientation. The 
CMA-ES is also a direct successor of the generating set 
adaptation [44.26], replacing self-adaptation to control 
the overall step-size with cumulative step-size adapta- 
tion [44.37]. 

The $(\mu/\mu_w, \lambda)$-CMA-ES is outlined in Algorithm 44.7.


Algorithm 44.7 The $(\mu/\mu_w, \lambda)$-CMA-ES
1: given $n \in \mathbb{N}_+$, $\lambda \ge 5$, $\mu \approx \lambda/2$,
       $w_k = w'(k) / \sum_{k=1}^{\mu} w'(k)$,
       $w'(k) = \log(\lambda/2 + 1/2) - \log \mathrm{rank}(f(x_k))$,
       $\mu_w = 1 / \sum_{k=1}^{\mu} w_k^2$, $c_\sigma \approx \mu_w / (n + \mu_w)$,
       $d \approx 1 + \sqrt{\mu_w / n}$,
       $c_c \approx (4 + \mu_w/n) / (n + 4 + 2\mu_w/n)$,
       $c_1 \approx 2 / (n^2 + \mu_w)$, $c_\mu \approx \mu_w / (n^2 + \mu_w)$, $c_m = 1$
2: initialize $s_\sigma = 0$, $s_c = 0$, $C = I$, $\sigma \in \mathbb{R}_+$, $x \in \mathbb{R}^n$
3: while not happy
4:     for $k \in \{1, \dots, \lambda\}$
5:         $z_k = \mathcal{N}(0, I)$    // i.i.d. for all k
6:         $x_k = x + \sigma C^{1/2} \times z_k$
7:     $P = \mathrm{sel\_}\mu\mathrm{\_best}(\{(z_k, f(x_k)) \mid 1 \le k \le \lambda\})$
8:     $s_\sigma \leftarrow (1 - c_\sigma)\, s_\sigma + \sqrt{c_\sigma (2 - c_\sigma)}\, \sqrt{\mu_w} \sum_{z_k \in P} w_k z_k$    // search path for $\sigma$
9:     $s_c \leftarrow (1 - c_c)\, s_c + h_\sigma \sqrt{c_c (2 - c_c)}\, \sqrt{\mu_w} \sum_{z_k \in P} w_k C^{1/2} z_k$    // search path for $C$
10:    $x \leftarrow x + c_m \sigma C^{1/2} \sum_{z_k \in P} w_k z_k$
11:    $\sigma \leftarrow \sigma \exp^{c_\sigma/d}\left(\frac{\|s_\sigma\|}{\mathrm{E}\|\mathcal{N}(0, I)\|} - 1\right)$
12:    $C \leftarrow (1 - c_1 + c_h - c_\mu)\, C + c_1 s_c s_c^{\mathrm{T}} + c_\mu \sum_{z_k \in P} w_k C^{1/2} z_k (C^{1/2} z_k)^{\mathrm{T}}$
where $h_\sigma = 1_{\|s_\sigma\|^2/n < 2 + 4/(n+1)}$, $c_h = c_1 (1 - h_\sigma^2)\, c_c (2 - c_c)$, and $C^{1/2}$ is the unique symmetric positive definite matrix obeying $C^{1/2} \times C^{1/2} = C$. All $c$-coefficients are $\le 1$.


Two search paths are maintained, $s_\sigma$ and $s_c$. The first path, $s_\sigma$, accumulates steps in the coordinate system where the mutation distribution is isotropic and which can be derived by scaling in the principal axes of the mutation ellipsoid only. The path generalizes $s_\sigma$ from Algorithm 44.6 to nondiagonal covariance matrices and is used to implement cumulative step-size adaptation, CSA, in Line 11 (resembling Line 9 in Algorithm 44.6). Under neutral selection, $s_\sigma \sim \mathcal{N}(0, I)$ and $\log\sigma$ is unbiased.

The second path, $s_c$, accumulates steps, disregarding $\sigma$, in the given coordinate system. Whenever $s_\sigma$ is large and therefore $\sigma$ is increasing fast, the coefficient $h_\sigma$ prevents $s_c$ from getting large and quickly changing the distribution shape via $C$. Given $h_\sigma \equiv 1$, under neutral selection $s_c \sim \mathcal{N}(0, C)$. The coefficient $c_h$ in Line 12 corrects for the bias on $s_c$ introduced by events $h_\sigma = 0$. The covariance matrix update consists of a rank-1 update, based on the search path $s_c$, and a rank-$\mu$ update with $\mu$ nonzero recombination weights $w_k$. Under neutral selection, the expected covariance matrix equals the covariance matrix before the update.

The updates of $x$ and $C$ follow a common principle. The mean $x$ is updated such that the likelihood of successful offspring to be sampled again is maximized (or increased if $c_m < 1$). The covariance matrix $C$ is updated such that the likelihood of successful steps $(x_k - x)/\sigma$ to appear again, or the likelihood to sample (in the direction of) the path $s_c$, is increased. A more fundamental principle for the equations is given in the next section.

Using not only the $\mu$ best but all $\lambda$ offspring can be particularly useful for the rank-$\mu$ update of $C$ in Line 12, where negative weights $w_k$ for inferior offspring are advisable. Such an update has been introduced as active CMA [44.38].

The factor $c_m$ in Line 10 can be equally written as a mutation scaling factor $\kappa = 1/c_m$ in Line 6, compare [44.39]. This means that the actual mutation steps are larger than the inherited ones, resembling the derandomization technique of damping step-size changes to address selection noise as described in Sect. 44.3.3.

An elegant way to replace the step-size update in Line 11 is

$$ \sigma \leftarrow \sigma \exp^{c_\sigma/d/2}\left(\frac{\|s_\sigma\|^2}{n} - 1\right) , \qquad (44.1) $$

which is often used in theoretical investigations of this update, such as those presented in Sect. 44.4.2.

A single run of the $(5/5_w, 10)$-CMA-ES on a convex-quadratic function is shown in Fig. 44.3. For the sake of demonstration, the initial step-size is chosen far too small (a situation that should be avoided in practice) and increases quickly for the first 400 $f$-evaluations. After no more than 5500 $f$-evaluations the adaptation of $C$ is accomplished. Then the eigenvalues of $C$ (square roots of which are shown in the lower left) reflect the underlying convex-quadratic function and the convergence speed is the same as on the sphere function and about 60% of the speed of the (1+1)-ES as observed in Fig. 44.2. The resulting convergence speed is about 10000 times faster than without adaptation of $C$ and at least 1000 times faster compared to any of the algorithms from the previous sections.
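
Algorithm 44.7 can be transcribed into a compact NumPy sketch as shown below. It is meant as an illustration only, not as a reference implementation: the fixed iteration budget replaces "while not happy", $C^{1/2}$ is recomputed by an eigendecomposition in every iteration, $\mathrm{E}\|\mathcal{N}(0, I)\|$ is replaced by a commonly used approximation, and the function name, seed, and numerical safeguards are assumptions.

```python
import numpy as np

def cma_es(f, x0, sigma0, lam=10, max_iter=300, seed=4):
    """(mu/mu_w, lambda)-CMA-ES following Algorithm 44.7 (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    n = x.size
    mu = lam // 2
    w = np.log(lam / 2 + 0.5) - np.log(np.arange(1, mu + 1))
    w /= w.sum()                                   # positive weights, sum to one
    mu_w = 1 / np.sum(w ** 2)
    c_sigma = mu_w / (n + mu_w)
    d = 1 + np.sqrt(mu_w / n)
    c_c = (4 + mu_w / n) / (n + 4 + 2 * mu_w / n)
    c_1 = 2 / (n ** 2 + mu_w)
    c_mu = mu_w / (n ** 2 + mu_w)
    c_m = 1.0
    # Assumption: approximation of E||N(0, I)|| instead of the exact value
    chi_n = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n ** 2))
    s_sigma, s_c, C, sigma = np.zeros(n), np.zeros(n), np.eye(n), float(sigma0)
    for _ in range(max_iter):
        # Symmetric square root C^{1/2} from the eigendecomposition of C
        eigvals, B = np.linalg.eigh(C)
        sqrtC = (B * np.sqrt(np.maximum(eigvals, 1e-300))) @ B.T
        z = rng.standard_normal((lam, n))
        xs = x + sigma * z @ sqrtC                 # row k is x + sigma C^{1/2} z_k
        order = np.argsort([f(xk) for xk in xs])[:mu]
        z_sel = z[order]                           # selected steps, best first
        z_mean = w @ z_sel
        s_sigma = (1 - c_sigma) * s_sigma \
            + np.sqrt(c_sigma * (2 - c_sigma) * mu_w) * z_mean
        h = float(s_sigma @ s_sigma / n < 2 + 4 / (n + 1))
        s_c = (1 - c_c) * s_c + h * np.sqrt(c_c * (2 - c_c) * mu_w) * (sqrtC @ z_mean)
        x = x + c_m * sigma * (sqrtC @ z_mean)
        sigma *= np.exp(c_sigma / d * (np.linalg.norm(s_sigma) / chi_n - 1))
        c_h = c_1 * (1 - h ** 2) * c_c * (2 - c_c)
        y = z_sel @ sqrtC                          # rows are C^{1/2} z_k
        rank_mu = (w[:, None] * y).T @ y
        C = (1 - c_1 + c_h - c_mu) * C + c_1 * np.outer(s_c, s_c) + c_mu * rank_mu
        C = (C + C.T) / 2                          # guard against round-off asymmetry
    return x, sigma, C

def sphere(x):
    return float(np.dot(x, x))

x_final, sigma_final, C_final = cma_es(sphere, np.ones(5), sigma0=0.5)
```

Practical implementations avoid the cubic-cost decomposition in every iteration; the sketch trades efficiency for a one-to-one correspondence with the numbered lines of the algorithm.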


44.3.7 Natural Evolution Strategies 


The idea of using natural gradient learning [44.40] 
in ES has been proposed in [44.29] and further pur- 
sued in [44.31,41]. Natural evolution strategies (NES) 
put forward the idea that the update of all distribution 
parameters can be based on the same fundamental prin- 
ciple. NES have been proposed as a more principled 
alternative to CMA-ES and characterized by operat- 
ing on Cholesky factors of a covariance matrix. Only 
later was it discovered that also CMA-ES implements 
the underlying NES principle of natural gradient learn- 
ing [44.31, 42]. 

For simplicity, let the vector $\theta$ represent all parameters of the distribution to sample new offspring. In the case of a multivariate normal distribution as above, we have a bijective transformation between $\theta$ and mean and covariance matrix of the distribution, $\theta \leftrightarrow (x, \sigma^2 C)$.

Fig. 44.3a-d A single run of the $(5/5_w, 10)$-CMA-ES on the rotated ellipsoid function $x \mapsto \sum_{i=1}^{n} \alpha_i^2 y_i^2$ with $\alpha_i = 10^{3(i-1)/(n-1)}$, $y = Rx$, where $R$ is a random matrix with $R^{\mathrm{T}} R = I$, for $n = 10$. Shown is the evolution of various parameters against the number of function evaluations. (a) best (gray), median and worst fitness value that reveal the final convergence phase after about 5500 function evaluations where the ellipsoid function has been reduced to the simple sphere; minimal and maximal coordinate-wise standard deviation of the mutation distribution and in between (mostly hidden) the step-size $\sigma$ that is initialized far too small and increases quickly in the beginning, that increases afterward several times again by up to one order of magnitude and decreases with maximal rate during the last 1000 $f$-evaluations; axis ratio of the mutation ellipsoid (square root of the condition number of $C$) that increases from 1 to 1000 where the latter corresponds to $\alpha_n/\alpha_1$. (b) sorted principal axis lengths of the mutation ellipsoid disregarding $\sigma$ (square roots of the sorted eigenvalues of $C$, see also Fig. 44.1) that adapt to the (local) structure of the underlying optimization problem; they finally reflect almost perfectly the factors $\alpha_i^{-1}$ up to a constant factor. (c) $x$ (distribution mean) that is initialized with all ones and converges to the global optimum in zero while correlated movements of the variables can be observed. (d) standard deviations in the coordinates disregarding $\sigma$ (square roots of diagonal elements of $C$) showing the $R$-dependent projections of the principal axis lengths into the given coordinate system. The straight lines to the right of the vertical line at about 6300 only annotate the coordinates and do not reflect measured data.

We consider a probability density $p(.|\theta)$ over $\mathbb{R}^n$ parametrized by $\theta$ and a nonincreasing function $W^f_\theta: \mathbb{R} \to \mathbb{R}$. More specifically, $W^f_\theta: y \mapsto w\big(\Pr_{z \sim p(.|\theta)}(f(z) \le y)\big)$ computes the $p_\theta$-quantile, or cumulative distribution function, of $f(z)$ with $z \sim p(.|\theta)$ at point $y$, composed with a nonincreasing predefined weight function $w: [0, 1] \to \mathbb{R}$ (where $w(0) > w(1/2) = 0$ is advisable). The value of $W^f_\theta(f(x))$ is invariant under strictly monotonic transformations of $f$. For $x \sim p(.|\theta)$ the distribution of $W^f_\theta(f(x)) \sim w(\mathcal{U}[0, 1])$ depends only on the predefined $w$; it is independent of $\theta$ and $f$ and therefore also (time-)invariant under $\theta$-updates. Given $\lambda$ samples $x_k$, we have the rank-based consistent estimator

$$ W^f_\theta(f(x_k)) \approx w\left(\frac{\mathrm{rank}(f(x_k)) - 1/2}{\lambda}\right) . $$

We consider the expected $W^f_{\theta'}$-transformed fitness [44.43]

$$ J(\theta) = \mathrm{E}\big(W^f_{\theta'}(f(x))\big), \quad x \sim p(.|\theta) $$
$$ \phantom{J(\theta)} = \int_{\mathbb{R}^n} W^f_{\theta'}(f(x))\, p(x|\theta)\, \mathrm{d}x , \qquad (44.2) $$

where the expectation is taken under the given sample distribution. The maximizer of $J$ w.r.t. $p(.|\theta)$ is, for any fixed $W^f_{\theta'}$, a Dirac distribution concentrated on the minimizer of $f$. A natural way to update $\theta$ is therefore a gradient ascent step in the $\nabla_\theta J$ direction. However, the vanilla gradient $\nabla_\theta J$ depends on the specific parametrization chosen in $\theta$. In contrast, the natural gradient, denoted by $\tilde{\nabla}_\theta$, is associated to the Fisher metric that is intrinsic to $p$ and independent of the chosen $\theta$-parametrization. Developing $\tilde{\nabla}_\theta J(\theta)$ under mild assumptions on $f$ and $p(.|\theta)$ by exchanging differentiation and integration, recognizing that the gradient $\tilde{\nabla}_\theta$ does not act on $W^f_{\theta'}$, using the log-likelihood trick $\tilde{\nabla}_\theta p(.|\theta) = p(.|\theta)\, \tilde{\nabla}_\theta \ln p(.|\theta)$ and finally setting $\theta' = \theta$ yields

$$ \tilde{\nabla}_\theta J(\theta) = \mathrm{E}\big(W^f_\theta(f(x))\, \tilde{\nabla}_\theta \ln p(x|\theta)\big) . \qquad (44.3) $$

We set $\theta' = \theta$ because we will estimate $W^f_{\theta'}$ using the current samples that are distributed according to $p(.|\theta)$. A Monte Carlo approximation of the expected value by the average finally yields the comparatively simple expression

$$ \tilde{\nabla}_\theta J(\theta) \approx \frac{1}{\lambda} \sum_{k=1}^{\lambda} \underbrace{W^f_\theta(f(x_k))}_{\text{preference weight}}\; \underbrace{\tilde{\nabla}_\theta \ln p(x_k|\theta)}_{\text{intrinsic candidate direction}} \qquad (44.4) $$

for a natural gradient update of $\theta$, where $x_k \sim p(.|\theta)$ is sampled from the current distribution. The natural gradient can be computed as $\tilde{\nabla}_\theta = F_\theta^{-1} \nabla_\theta$, where $F_\theta$ is the Fisher information matrix expressed in $\theta$-coordinates. For the multivariate Gaussian distribution, $\tilde{\nabla}_\theta \ln p(x_k|\theta)$ can indeed be easily expressed and computed efficiently. We find that in CMA-ES (Algorithm 44.7), the rank-$\mu$ update (Line 12 with $c_1 = 0$) and the update in Line 10 are natural gradient updates of $C$ and $x$, respectively [44.31, 42], where the $k$-th largest $w_k$ is a consistent estimator for the $k$-th largest $W^f_\theta(f(x_k))$ [44.43].
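
The rank-based estimator and the Monte Carlo approximation (44.4) can be illustrated for the mean of an isotropic Gaussian. The sketch below rests on several assumptions that are not from the text: the weight function `w_log` is a hypothetical choice satisfying $w(0) > w(1/2) = 0$, the objective is the sphere function, and for the mean parameter the natural gradient of $\ln p(x_k|\theta)$ is taken to be $x_k - x$ (the standard result for a Gaussian mean).

```python
import numpy as np

def preference_weights(fitness, w):
    """Rank-based estimate of W(f(x_k)) as w((rank - 1/2) / lambda), rank 1 = best."""
    lam = len(fitness)
    ranks = np.empty(lam)
    ranks[np.argsort(fitness)] = np.arange(1, lam + 1)
    return w((ranks - 0.5) / lam)

def w_log(q):
    # Hypothetical nonincreasing weight function with w(1/2) = 0 < w(0)
    return np.maximum(0.0, -np.log(2 * q))

rng = np.random.default_rng(5)
mean, sigma, lam = np.array([3.0, -2.0]), 0.5, 200
xs = mean + sigma * rng.standard_normal((lam, 2))   # x_k ~ p(.|theta)
fit = np.sum(xs ** 2, axis=1)                       # sphere function values

W = preference_weights(fit, w_log)
# Monte Carlo estimate (44.4) of the natural gradient with respect to the mean
nat_grad_mean = (W[:, None] * (xs - mean)).mean(axis=0)

# Ranks, and hence the weights, are unchanged by a strictly increasing f-transformation
W_transformed = preference_weights(fit ** 3, w_log)
```

Because only ranks enter the weights, the estimated direction is the same for $f$ and for any strictly increasing transformation of $f$.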

While the natural gradient does not depend on the parametrization of the distribution, a finite step taken in the natural gradient direction does. This becomes relevant for the covariance matrix update, where natural ES take a different parametrization than CMA-ES. Starting from Line 12 in Algorithm 44.7, we find for $c_1 = c_h = 0$

$$ C \leftarrow (1 - c_\mu)\, C + c_\mu \sum_{z_k \in P} w_k\, C^{1/2} z_k (C^{1/2} z_k)^{\mathrm{T}} $$
$$ = C^{1/2} \left( (1 - c_\mu)\, I + c_\mu \sum_{z_k \in P} w_k z_k z_k^{\mathrm{T}} \right) C^{1/2} $$
$$ \overset{\sum w_k = 1}{=} C^{1/2} \left( I + c_\mu \sum_{z_k \in P} w_k \left( z_k z_k^{\mathrm{T}} - I \right) \right) C^{1/2} $$
$$ \overset{c_\mu \ll 1}{\approx} C^{1/2} \exp\left( c_\mu \sum_{z_k \in P} w_k \left( z_k z_k^{\mathrm{T}} - I \right) \right) C^{1/2} . \qquad (44.5) $$

The term bracketed between the matrices $C^{1/2}$ in the lower three lines is a multiplicative covariance matrix update expressed in the natural coordinates, where the covariance matrix is the identity and $C^{1/2}$ serves as coordinate system transformation into the given coordinate system. Only the lower two lines of (44.5) do not rely on the constraint $\sum_k w_k = 1$ in order to satisfy a stationarity condition on $C$. For a given $C$ on the right-hand side of (44.5), we have under neutral selection the stationarity condition $\mathrm{E}(C_{\mathrm{new}}) = C$ for the first three lines and $\mathrm{E}(\log(C_{\mathrm{new}})) = \log(C)$ for the last line, where log is the inverse of the matrix exponential exp. The last line of (44.5) is used in the exponential natural evolution strategy, xNES [44.31], and guarantees positive definiteness of $C$ even with negative weights, independent of $c_\mu$ and of the data $z_k$. The xNES is depicted in Algorithm 44.8.


Algorithm 44.8 The Exponential NES (xNES)
1: given $n \in \mathbb{N}_+$, $\lambda \ge 5$, $w_k = w'(k) / \sum_k |w'(k)|$,
       $w'(k) = \log(\lambda/2 + 1/2) - \log \mathrm{rank}(f(x_k))$,
       $\eta_c \approx (5 + \lambda)/(5 n^{1.5}) \le 1$, $\eta_\sigma \approx \eta_c$, $\eta_x \approx 1$
2: initialize $C^{1/2} = I$, $\sigma \in \mathbb{R}_+$, $x \in \mathbb{R}^n$
3: while not happy
4:     for $k \in \{1, \dots, \lambda\}$
5:         $z_k = \mathcal{N}(0, I)$    // i.i.d. for all k
6:         $x_k = x + \sigma C^{1/2} \times z_k$
7:     $P = \{(z_k, f(x_k)) \mid 1 \le k \le \lambda\}$
8:     $x \leftarrow x + \eta_x\, \sigma C^{1/2} \sum_{z_k \in P} w_k z_k$
9:     $\sigma \leftarrow \sigma \exp^{\eta_\sigma/2}\left( \sum_{z_k \in P} w_k \left( \frac{\|z_k\|^2}{n} - 1 \right) \right)$
10:    $C^{1/2} \leftarrow C^{1/2} \times \exp^{\eta_c/2}\left( \sum_{z_k \in P} w_k \left( z_k z_k^{\mathrm{T}} - \frac{\|z_k\|^2}{n} I \right) \right)$
ZEP n 


In xNES, sampling is identical to CMA-ES and environmental selection is omitted entirely. Line 9 resembles the step-size update in (44.1). Comparing the updates more closely, with $c_\sigma = 1$ (44.1) uses

$$ \frac{\mu_w \left\| \sum_k w_k z_k \right\|^2}{n} - 1 , $$

whereas xNES uses

$$ \sum_k w_k \left( \frac{\|z_k\|^2}{n} - 1 \right) $$

for updating $\sigma$. For $\mu = 1$ the updates are the same. For $\mu > 1$, the latter only depends on the lengths of the $z_k$, while the former depends on their lengths and directions. Finally, xNES expresses the update (44.5) in Line 10 on the Cholesky factor $C^{1/2}$, which does not remain symmetric in this case ($C = C^{1/2} \times C^{1/2\,\mathrm{T}}$ still holds). The term $-\|z_k\|^2/n$ keeps the determinant of $C^{1/2}$ (and thus the trace of $\log C^{1/2}$) constant and is of rather cosmetic nature. Omitting the term is equivalent to using $\eta_\sigma + \eta_c$ instead of $\eta_\sigma$ in Line 9.
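
The relation between the additive and the multiplicative form in (44.5) can be checked numerically. In the sketch below, the matrix sizes, the example weights, and the helper names `sym_expm` and `sym_sqrtm` are assumptions; the matrix exponential of a symmetric matrix is computed from its eigendecomposition.

```python
import numpy as np

def sym_expm(M):
    """Matrix exponential of a symmetric matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    return (vecs * np.exp(vals)) @ vecs.T

def sym_sqrtm(C):
    """Symmetric positive definite square root of C."""
    vals, vecs = np.linalg.eigh(C)
    return (vecs * np.sqrt(vals)) @ vecs.T

rng = np.random.default_rng(6)
n, mu, c_mu = 4, 5, 0.05
A = rng.standard_normal((n, n))
C = A @ A.T + n * np.eye(n)           # some symmetric positive definite C
R = sym_sqrtm(C)                      # C^{1/2}
z = rng.standard_normal((mu, n))
w = np.full(mu, 1 / mu)               # weights summing to one

y = z @ R                             # rows are C^{1/2} z_k
line1 = (1 - c_mu) * C + c_mu * (w[:, None] * y).T @ y
inner = sum(wk * (np.outer(zk, zk) - np.eye(n)) for wk, zk in zip(w, z))
line3 = R @ (np.eye(n) + c_mu * inner) @ R
line4 = R @ sym_expm(c_mu * inner) @ R          # close to line3 for small c_mu

# With negative weights and a large learning rate, the exponential form
# still yields a positive definite matrix.
w_neg = np.array([3.0, 2.0, 0.0, -2.0, -3.0])
inner_neg = sum(wk * (np.outer(zk, zk) - np.eye(n)) for wk, zk in zip(w_neg, z))
C_exp = R @ sym_expm(inner_neg) @ R
min_eig_exp = np.linalg.eigvalsh((C_exp + C_exp.T) / 2).min()
```

The first and third lines of (44.5) agree up to round-off when the weights sum to one, while the exponential form remains positive definite because the matrix exponential of a symmetric matrix has only positive eigenvalues.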

The exponential natural evolution strategy is a very elegant algorithm. Like CMA-ES, it can be interpreted as an incremental estimation of distribution algorithm [44.44]. However, it generally performs worse than CMA-ES because it does not use search paths for updating $\sigma$ and $C$.


44.3.8 Further Aspects 


Internal Parameters 
Adaptation and self-adaptation address the control of 
the most important internal parameters in ES. Yet, all 
algorithms presented have hidden and exposed param- 
eters in their implementation. Many of them can be 
set to reasonable and robust default values. The population size parameters
$\mu$ and $\lambda$, however, change the search characteristics of an
evolution strategy significantly. Larger values, in particular for the
parent number $\mu$, often help address highly multimodal or noisy problems
successfully.

In practice, several experiments or restarts are advisable, where different
initial conditions for $x$ and $\sigma$ can be employed. For exploring
different population sizes, a schedule with increasing population size
(IPOP) is advantageous [44.45-47], because runs with larger populations
typically take more function evaluations. Preceding long runs (large $\mu$
and $\lambda$) with short runs (small $\mu$ and $\lambda$) leads to a smaller
(relative) impairment of the later runs than vice versa.


Internal Computational Complexity 

Algorithms presented in Sects. 44.3.1-44.3.4 that sam- 
ple isotropic or axis-parallel mutation distributions have 
an internal computational complexity linear in the dimension. The internal
computational complexity of CMA-ES and xNES is, for constant population
size, cubic in the dimension due to the update of $C^{1/2}$. Typical
implementations of the CMA-ES, however, have quadratic complexity, as they
implement a lazy update scheme for $C^{1/2}$, where $C$ is decomposed into
$C^{1/2} C^{1/2\,T}$ only after about $n/\lambda$ iterations. An exact
quadratic update for CMA-ES has also been proposed [44.48].
While never considered in the literature, a lazy update 
for xNES to achieve quadratic complexity seems feasi- 
ble as well. 


Invariance 

Selection and recombination in ES are based solely 
on the ranks of offspring and parent individuals. As 
a consequence, the behavior of ES is invariant under order-preserving
(strictly monotonic) transformations of the fitness function value. In
particular, all spherical unimodal functions belong to the same function
class, of which the convex-quadratic sphere function is the most prominent
member. This function is more thoroughly investigated in Sect. 44.4.

All algorithms presented are invariant under translations, and Algorithms
44.3, 44.7, and 44.8 are invariant under rotations of the coordinate system,
provided that the initial $x$ is translated and rotated accordingly.
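The invariance under order-preserving transformations follows directly from rank-based selection, which the short sketch below illustrates. The population, the seed, and the two transformations (square root and exponential) are arbitrary choices made for illustration only.

```python
import math
import random


def sphere(x):
    """Convex-quadratic sphere function."""
    return sum(v * v for v in x)


def ranking(values):
    """Indices of the candidates sorted from best (smallest) to worst."""
    return sorted(range(len(values)), key=lambda k: values[k])


rng = random.Random(1)
population = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(8)]

f_values = [sphere(x) for x in population]
sqrt_values = [math.sqrt(f) for f in f_values]  # strictly increasing transform
exp_values = [math.exp(f) for f in f_values]    # another strictly increasing transform

# A rank-based ES sees the same ordering in all three cases and therefore
# selects the same parents.
print(ranking(f_values))
print(ranking(sqrt_values) == ranking(f_values), ranking(exp_values) == ranking(f_values))
```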

Parameter control can introduce yet further invariances. All algorithms
presented are scale invariant due to step-size adaptation. Furthermore,
ellipsoidal functions that are in the reach of the mutation operator of the
ES presented in Sects. 44.3.2-44.3.7 are eventually transformed,
effectively, into spherical functions. These ES are invariant under the
respective affine transformations of the search space, given the initial
conditions are chosen respectively.


Variants
Evolution strategies have been extended and combined with other approaches
in various ways. We mention here constraint handling [44.49, 50], fitness
surrogates [44.51], multiobjective variants [44.52, 53], and exploitation of
fitness values [44.54].


44.4 Theory


There is ample empirical evidence that on many unimodal functions ES with
step-size control, as those outlined in the previous section, converge fast
and with probability one to the global optimum. Convergence proofs
supporting this evidence are discussed in Sect. 44.4.3. On multimodal
functions, on the other hand, the probability to converge to the global
optimum (in a single run of the same strategy) is generally smaller than one
(but larger than zero), as suggested by observations and theoretical
results [44.55]. Without parameter control, on the other hand, elitist
strategies always converge to the essential global optimum, however at
a much slower rate (compare random search in Fig. 44.2). On a bounded domain
and with mutation variances bounded away from zero, nonelitist strategies
generate a subsequence of x-values converging to the essential global
optimum.

In this section, we use a time index $t$ to denote iteration and assume, for
notational convenience and without loss of generality (due to translation
invariance), that the optimum of $f$ is in $x^* = 0$. This simplifies
writing $x^{(t)} - x^*$ to simply $x^{(t)}$, and then $\|x^{(t)}\|$ measures
the distance to the optimum of the parental centroid in time step $t$.

Linear convergence plays a central role for ES. For a deterministic sequence
$x^{(t)}$, linear convergence (toward zero) takes place if there exists
a $c > 0$ such that

\[ \lim_{t \to \infty} \frac{\|x^{(t+1)}\|}{\|x^{(t)}\|} = \exp(-c) , \tag{44.6} \]

which means, loosely speaking, that for $t$ large enough, the distance to
the optimum decreases in every step by the constant factor $\exp(-c)$.
Taking the logarithm of (44.6), then exchanging the logarithm and the limit
and taking the Cesàro mean yields

\[ \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \log \frac{\|x^{(t+1)}\|}{\|x^{(t)}\|}
   = \lim_{T \to \infty} \frac{1}{T} \log \frac{\|x^{(T)}\|}{\|x^{(0)}\|} = -c . \tag{44.7} \]


For a sequence of random vectors, we define linear con- 
vergence based on (44.7) as follows. 


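Definition 44.1 Linear Convergence
The sequence of random vectors $x^{(t)}$ converges almost surely linearly
to 0 if there exists a $c > 0$ such that

\[ -c = \lim_{T \to \infty} \frac{1}{T} \log \frac{\|x^{(T)}\|}{\|x^{(0)}\|}
      = \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \log \frac{\|x^{(t+1)}\|}{\|x^{(t)}\|}
      \quad \text{a.s.} \tag{44.8} \]

The sequence converges in expectation linearly to 0 if there exists
a $c > 0$ such that

\[ -c = \lim_{t \to \infty} E \log \frac{\|x^{(t+1)}\|}{\|x^{(t)}\|} . \tag{44.9} \]

The constant $c$ is the convergence rate of the algorithm.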


Linear convergence, hence, means that asymptotically in $t$, the logarithm
of the distance to the optimum decreases linearly in $t$ like $-ct$. This
behavior has been observed in Fig. 44.2 for the (1+1)-ES with 1/5th success
rule on a unimodal spherical function.
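The linear decrease of the log-distance can be reproduced with a few lines of code. The sketch below is an assumption-laden illustration rather than the algorithm listing of this chapter: it uses a (1+1)-ES on the sphere function with a 1/5th success rule of the form $\sigma \leftarrow \sigma \exp((\mathbb{1}_{\mathrm{success}} - 1/5)/d)$, where the damping $d = \sqrt{n+1}$, the initial point, the seed, and the iteration budget are all choices made here. The convergence rate is estimated as in (44.8).

```python
import math
import random


def one_plus_one_es(n=10, iterations=2000, seed=1):
    """Run a (1+1)-ES with a 1/5th success rule on the sphere; return log ||x^(t)||."""
    rng = random.Random(seed)
    x = [1.0] * n
    fx = sum(v * v for v in x)
    sigma = 1.0
    d = math.sqrt(n + 1)  # assumed damping parameter
    log_dist = [0.5 * math.log(fx)]
    for _ in range(iterations):
        y = [v + sigma * rng.gauss(0.0, 1.0) for v in x]
        fy = sum(v * v for v in y)
        success = fy <= fx
        if success:
            x, fx = y, fy
        # Increase sigma on success, decrease on failure; stationary at success rate 1/5.
        sigma *= math.exp(((1.0 if success else 0.0) - 0.2) / d)
        log_dist.append(0.5 * math.log(fx))
    return log_dist


n = 10
log_dist = one_plus_one_es(n=n)
T = len(log_dist) - 1
# Estimate of c following (44.8): -(1/T) log(||x^(T)|| / ||x^(0)||).
c_hat = -(log_dist[T] - log_dist[0]) / T
print("estimated convergence rate c:", c_hat, " normalized n*c:", n * c_hat)
```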

Note that $\lambda$ function evaluations are performed per iteration and it
is then often useful to consider a convergence rate per function evaluation,
i.e., to normalize the convergence rate by $\lambda$.

The progress rate measures the reduction of the dis- 
tance to optimum within a single generation [44.1]. 


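Definition 44.2 Progress Rate
The normalized progress rate is defined as the expected relative reduction
of $\|x^{(t)}\|$

\[ \varphi^* = n\, E\!\left( \frac{\|x^{(t)}\| - \|x^{(t+1)}\|}{\|x^{(t)}\|}
      \,\middle|\, x^{(t)}, s^{(t)} \right)
   = n \left( 1 - E\!\left( \frac{\|x^{(t+1)}\|}{\|x^{(t)}\|}
      \,\middle|\, x^{(t)}, s^{(t)} \right) \right) , \tag{44.10} \]

where the expectation is taken over $x^{(t+1)}$, given $(x^{(t)}, s^{(t)})$.
In situations commonly considered in theoretical analyses, $\varphi^*$ does
not depend on $x^{(t)}$ and is expressed as a function of strategy
parameters $s^{(t)}$.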


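Definitions 44.1 and 44.2 are related, in that for a given $x^{(t)}$

\[ \varphi^* \le -n \log E \frac{\|x^{(t+1)}\|}{\|x^{(t)}\|} \tag{44.11} \]
\[ \phantom{\varphi^*} \le -n\, E \log \frac{\|x^{(t+1)}\|}{\|x^{(t)}\|} = nc . \tag{44.12} \]

The first inequality follows from $\log y \le y - 1$ and the second from
Jensen's inequality. Therefore, progress rate $\varphi^*$ and convergence
rate $nc$ do not agree in general, and we might observe convergence
($c > 0$) while $\varphi^* < 0$. However, for $n \to \infty$, we typically
have $\varphi^* = nc$ [44.56].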
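The chain of inequalities (44.11) and (44.12) can be observed in a small Monte Carlo experiment. The following sketch assumes a single step of an elitist (1+1)-ES on the sphere function from a parent at distance one, with normalized step-size $\sigma^* = n\sigma/\|x\|$; the dimension, step-size, sample size, and seed are arbitrary illustrative choices.

```python
import math
import random


def ratio_samples(n=10, sigma_star=1.2, samples=20000, seed=2):
    """Sample ||x^(t+1)|| / ||x^(t)|| for one (1+1)-ES step on the sphere, ||x^(t)|| = 1."""
    rng = random.Random(seed)
    sigma = sigma_star / n  # sigma = sigma* ||x|| / n with ||x|| = 1
    ratios = []
    for _ in range(samples):
        # Parent at (1, 0, ..., 0); offspring y = x + sigma * N(0, I).
        y = [(1.0 if i == 0 else 0.0) + sigma * rng.gauss(0.0, 1.0) for i in range(n)]
        dist = math.sqrt(sum(v * v for v in y))
        ratios.append(min(1.0, dist))  # elitist selection keeps the parent if y is worse
    return ratios


n = 10
ratios = ratio_samples(n=n)
mean_ratio = sum(ratios) / len(ratios)

phi_star = n * (1.0 - mean_ratio)                               # estimate of (44.10)
log_of_mean = -n * math.log(mean_ratio)                         # right-hand side of (44.11)
nc = -n * sum(math.log(r) for r in ratios) / len(ratios)        # left term of (44.12)

print(phi_star, log_of_mean, nc)
```

Because both inequalities also hold for the empirical distribution of the sampled ratios, the three printed numbers are non-decreasing from left to right regardless of the seed.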

The normalized progress rate $\varphi^*$ for ES has been extensively studied
in various situations, see Sect. 44.4.2. Scale-invariance and (sometimes
artificial) assumptions on the step-size typically ensure that the progress
rates do not depend on $t$.

Another way to describe how fast an algorithm approaches the optimum is to
count the number of function evaluations needed to reduce the distance to
the optimum by a given factor $1/\epsilon$ or, similarly, the runtime to hit
a ball of radius $\epsilon$ around the optimum, starting, e.g., from the
distance one.


Definition 44.3 Runtime
The runtime is the first hitting time of a ball around the optimum.
Specifically, the runtime in number of function evaluations as a function
of $\epsilon$ reads

\[ \lambda \times \min \left\{ t : \|x^{(t)}\| \le \epsilon \times \|x^{(0)}\| \right\}
   = \lambda \times \min \left\{ t : \frac{\|x^{(t)}\|}{\|x^{(0)}\|} \le \epsilon \right\} . \tag{44.13} \]

Linear convergence with rate $c$ as given in (44.9) implies that, for
$\epsilon \to 0$, the expected runtime divided by $\log(1/\epsilon)$ goes to
the constant $\lambda/c$.
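As a small illustration of Definition 44.3, the sketch below computes the runtime (44.13) for an idealized, deterministic sequence of distances that decreases exactly by the factor $\exp(-c)$ per iteration. The function name `runtime` and the values of $c$, $\lambda$, and $\epsilon$ are hypothetical choices for this example.

```python
import math


def runtime(distances, eps, lam=1):
    """lam * min{t : ||x^(t)|| <= eps * ||x^(0)||}, or None if the ball is never hit."""
    target = eps * distances[0]
    for t, dist in enumerate(distances):
        if dist <= target:
            return lam * t
    return None


c, lam, eps = 0.1, 10, 1e-3
# Idealized linear convergence: ||x^(t)|| = exp(-c t).
distances = [math.exp(-c * t) for t in range(200)]

evals = runtime(distances, eps, lam)
ratio = evals / math.log(1.0 / eps)
print("runtime in function evaluations:", evals)
print("runtime / log(1/eps):", ratio, " compared with lambda/c:", lam / c)
```

The remaining gap between the ratio and $\lambda/c$ stems from rounding the hitting time up to a whole iteration and vanishes as $\epsilon \to 0$.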


44.4.1 Lower Runtime Bounds 


Evolution strategies with a fixed number of parent and offspring individuals
cannot converge faster than linearly and with a convergence rate of
$O(1/n)$. This means that their runtime is lower bounded by a constant times
$\log(1/\epsilon^n) = n \log(1/\epsilon)$ [44.57-61]. This result can be
obtained by analyzing the branching factor of the tree of possible paths the
algorithm can take. It therefore holds for any optimization algorithm taking
decisions based solely on a bounded number of comparisons between fitness
values [44.57-59].

More specifically, the runtime of any $(1 \stackrel{+}{,} \lambda)$-ES with
isotropic mutations cannot be asymptotically faster than
$\propto n \log(1/\epsilon)\, \lambda / \log(\lambda)$ [44.62]. Considering
more restrictive classes of algorithms can provide more precise
nonasymptotic bounds [44.60, 61]. Different approaches address in particular
the (1+1)- and $(1,\lambda)$-ES and precisely characterize the fastest
convergence rate that can be obtained with isotropic normal distributions on
any objective function with any step-size adaptation mechanism
[44.56, 63-65].

Considering the sphere function, the optimal convergence rate is attained
with distance proportional step-size, that is, a step-size proportional to
the distance of the parental centroid to the optimum,
$\sigma = \mathrm{const} \times \|x\| = \sigma^* \|x\| / n$. Optimal
step-size and optimal convergence rate according to (44.8) and (44.9) can be
expressed in terms of expectation of some random variables that are easily
simulated numerically. The convergence rate of the (1+1)-ES with distance
proportional step-size is shown in Fig. 44.4 as a function of the normalized
step-size $\sigma^* = n\sigma/\|x\|$. The peak of each curve is the upper
bound for the convergence rate that can be achieved on any function with any
form of step-size adaptation. As for the general bound, the evolution
strategy converges linearly and the convergence rate $c$ decreases to zero
like $1/n$ for $n \to \infty$ [44.56, 65, 66], which is equivalent to linear
scaling of the runtime in the dimension. The asymptotic limit for the
convergence rate of the (1+1)-ES, as shown in the lowest curve in Fig. 44.4,
coincides with the progress rate expression given in the next section.

Fig. 44.4 Normalized convergence rate $nc$ versus normalized step-size
$n\sigma/\|x\|$ of the (1+1)-ES with distance proportional step-size for
$n = 2, 3, 5, 10, 20, \infty$ (top to bottom). The peaks of the graphs
represent the upper bound for the convergence rate of the (1+1)-ES with
isotropic mutation (corresponding to the lower runtime bound). The limit
curve for $n$ to infinity (lowest curve) reveals the optimal normalized
progress rate of $\varphi^* \approx 0.202$ of the (1+1)-ES on sphere
functions for $n \to \infty$


44.4.2 Progress Rates 


This section presents analytical approximations to progress rates of ES for
sphere, ridge, and cigar functions in the limit $n \to \infty$. Both
one-generation results and those that consider multiple time steps and
cumulative step-size adaptation are considered.

The first analytical progress rate results date back 
to the early work of Rechenberg [44.1] and Schwe- 
fel [44.3], who considered the sphere and corridor mod- 
els and very simple strategy variants. Further results 
have since been derived for various ridge functions, sev- 
eral classes of convex quadratic functions, and more 
general constrained linear problems. The strategies that 
results are available for have increased in complex- 
ity as well and today include multiparent strategies 
employing recombination as well as several step-size 
adaptation mechanisms. Only strategy variants with isotropic mutation
distributions have been considered up to this point. However, parameter
control strategies that successfully adapt the shape of the mutation
distribution (such as CMA-ES) effectively transform ellipsoidal functions
into (almost) spherical ones, thus lending extra relevance to the analysis
of sphere and sphere-like functions.

Fig. 44.5 Decomposition of mutation vector $z$ into a component $z_\odot$ in
the direction of the negative of the gradient vector of the objective
function and a perpendicular component $z_\ominus$

The simplest convex quadratic functions to be optimized are variants of the
sphere function (see also the discussion of invariance in Sect. 44.3.8)

\[ f(x) = \|x\|^2 = \sum_{i=1}^{n} x_i^2 = R^2 , \]

where $R$ denotes the distance from the optimal solution. Expressions for
the progress rate of ES on sphere functions can be computed by decomposing
mutation vectors into two components $z_\odot$ and $z_\ominus$ as
illustrated in Fig. 44.5. Component $z_\odot$ is the projection of $z$ onto
the negative of the gradient vector $\nabla f$ of the objective function. It
contributes positively to the fitness of offspring candidate solution

\[ y = x + z \]

if and only if

\[ -\nabla f(x) \cdot z > 0 . \]

Component $z_\ominus = z - z_\odot$ is perpendicular to the gradient
direction and contributes negatively to the offspring fitness. Its expected
squared length exceeds that of $z_\odot$ by a factor of $n - 1$. Considering
normalized quantities $\sigma^* = \sigma n / R$ and
$\varphi^* = \varphi n / R$ allows giving concise mathematical
representations of the scaling properties of various ES on spherical
functions as shown below.
Constant $\sigma^*$ corresponds to the distance proportional step-size from
Sect. 44.4.1.

(1+1)-ES on Sphere Functions
The normalized progress rate of the (1+1)-ES on sphere functions is

\[ \varphi^* = \frac{\sigma^*}{\sqrt{2\pi}}\, e^{-\sigma^{*2}/8}
   - \frac{\sigma^{*2}}{4} \left[ 1 - \mathrm{erf}\!\left( \frac{\sigma^*}{\sqrt{8}} \right) \right] \tag{44.14} \]

in the limit of $n \to \infty$ [44.1]. One half of the expression in square
brackets is the success probability (i.e., the probability that the
offspring candidate solution is superior to its parent and thus replaces
it). The first term in (44.14) is the contribution to the normalized
progress rate from the component $z_\odot$ of the mutation vector that is
parallel to the gradient vector. The second term results from the component
$z_\ominus$ that is perpendicular to the gradient direction.

The black curve in Fig. 44.4 illustrates how the normalized progress rate of
the (1+1)-ES on sphere functions in the limit $n \to \infty$ depends on the
normalized mutation strength. For small normalized mutation strengths, the
normalized progress rate is small as the short steps that are made do not
yield significant progress. The success probability is nearly one-half. For
large normalized mutation strengths, progress is near zero as the
overwhelming majority of steps result in poor offspring that are rejected.
The normalized progress rate assumes a maximum value of
$\varphi^* = 0.202$ at normalized mutation strength $\sigma^* = 1.224$. The
range of step-sizes for which close to optimal progress is achieved is
referred to as the evolution window [44.1]. In the runs of the (1+1)-ES with
constant step-size shown in Fig. 44.2, the normalized step-size initially is
to the left of the evolution window (large relative distance to the optimal
solution) and in the end to its right (small relative distance to the
optimal solution), achieving maximal progress at a point in between.
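The maximum of (44.14) and the associated success probability are easy to reproduce numerically. The sketch below evaluates the formula on a grid; the grid range and resolution are arbitrary choices, and the success probability is taken as $\tfrac{1}{2}[1 - \mathrm{erf}(\sigma^*/\sqrt{8})]$, which tends to one-half for small $\sigma^*$ as stated above.

```python
import math


def progress_rate_1p1(sigma_star):
    """Normalized progress rate (44.14) of the (1+1)-ES on the sphere for n -> infinity."""
    gain = sigma_star / math.sqrt(2.0 * math.pi) * math.exp(-sigma_star ** 2 / 8.0)
    bracket = 1.0 - math.erf(sigma_star / math.sqrt(8.0))
    return gain - sigma_star ** 2 / 4.0 * bracket


def success_probability(sigma_star):
    """One half of the bracket in (44.14)."""
    return 0.5 * (1.0 - math.erf(sigma_star / math.sqrt(8.0)))


# Grid search over normalized mutation strengths in (0, 4].
grid = [k / 1000.0 for k in range(1, 4001)]
best_sigma = max(grid, key=progress_rate_1p1)
best_phi = progress_rate_1p1(best_sigma)
best_ps = success_probability(best_sigma)

print("optimal sigma*:", best_sigma)
print("maximal progress rate:", best_phi)
print("success probability at the optimum:", best_ps)
```

The success probability at the optimum is close to 0.27, the value that reappears later as the optimal truncation ratio of the $(\mu/\mu,\lambda)$-ES.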


$(\mu/\mu, \lambda)$-ES on Sphere Functions
The normalized progress rate of the $(\mu/\mu,\lambda)$-ES on sphere
functions is described by

\[ \varphi^* = \sigma^* c_{\mu/\mu,\lambda} - \frac{\sigma^{*2}}{2\mu} \tag{44.15} \]

in the limit $n \to \infty$ [44.2]. The term $c_{\mu/\mu,\lambda}$ is the
expected value of the average of the $\mu$ largest order statistics of
$\lambda$ independent standard normally distributed random numbers. For
$\lambda$ fixed, $c_{\mu/\mu,\lambda}$ decreases with increasing $\mu$. For
the fixed truncation ratio $\mu/\lambda$, $c_{\mu/\mu,\lambda}$ approaches
a finite limit value as $\lambda$ and $\mu$ increase [44.8, 15].

Fig. 44.6 Maximal normalized progress per offspring of the
$(\mu/\mu,\lambda)$-ES on sphere functions for $n \to \infty$ plotted
against the truncation ratio. The curves correspond to, from bottom to top,
$\lambda = 4, 10, 40, 100, \infty$. The dotted line represents the maximal
progress rate of the (1+1)-ES


It is easily seen from (44.15) that the normalized progress rate of the
$(\mu/\mu,\lambda)$-ES is maximized by normalized mutation strength

\[ \sigma^* = \mu c_{\mu/\mu,\lambda} . \tag{44.16} \]

The normalized progress rate achieved with that setting is

\[ \varphi^* = \frac{\mu c_{\mu/\mu,\lambda}^2}{2} . \tag{44.17} \]

The progress rate is negative if $\sigma^* > 2 \mu c_{\mu/\mu,\lambda}$.
Figure 44.6 illustrates how the optimal normalized progress rate per
offspring depends on the population size parameters $\mu$ and $\lambda$. Two
interesting observations can be made from the figure:


• For all but the smallest values of $\lambda$, the $(\mu/\mu,\lambda)$-ES
  with $\mu > 1$ is capable of significantly more rapid progress per
  offspring than the $(1,\lambda)$-ES. This contrasts with findings for the
  $(\mu/1,\lambda)$-ES, the performance of which on sphere functions for
  $n \to \infty$ monotonically deteriorates with increasing $\mu$ [44.8].

• For large $\lambda$, the optimal truncation ratio is
  $\mu/\lambda = 0.27$, and the corresponding progress per offspring is
  0.202. Those values are identical to the optimal success probability and
  resulting normalized progress rate of the (1+1)-ES. Beyer [44.8] shows
  that the correspondence is no coincidence and indeed exact. The step-sizes
  that the two strategies employ differ widely, however. The optimal
  step-size of the (1+1)-ES is 1.224; that of the $(\mu/\mu,\lambda)$-ES is
  $\mu c_{\mu/\mu,\lambda}$ and for fixed truncation ratio $\mu/\lambda$
  increases (slightly superlinearly) with the population size. For example,
  optimal step-sizes of $(\mu/\mu,4\mu)$-ES for $\mu \in \{1,2,3\}$ are
  1.029, 2.276, and 3.538, respectively. If offspring candidate solutions
  can be evaluated in parallel, the $(\mu/\mu,\lambda)$-ES is preferable to
  the (1+1)-ES, which does not benefit from the availability of parallel
  computational resources.
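The progress coefficients $c_{\mu/\mu,\lambda}$ and the optimal step-sizes quoted in the list above can be recomputed from the definition as the mean of the $\mu$ largest expected order statistics. The sketch below is one possible way to do so, using the standard density of the $k$-th largest of $\lambda$ normal samples and a plain trapezoidal rule; the integration range and resolution are assumptions chosen for simplicity, not for efficiency.

```python
import math


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def expected_kth_largest(k, lam, lo=-10.0, hi=10.0, steps=20000):
    """Expected value of the k-th largest of lam standard normal samples."""
    # Density: lam * C(lam-1, k-1) * Phi(x)^(lam-k) * (1-Phi(x))^(k-1) * pdf(x).
    coeff = lam * math.comb(lam - 1, k - 1)
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        p = normal_cdf(x)
        g = x * normal_pdf(x) * p ** (lam - k) * (1.0 - p) ** (k - 1)
        total += g if 0 < i < steps else 0.5 * g  # trapezoidal end-point weights
    return coeff * h * total


def progress_coefficient(mu, lam):
    """c_{mu/mu,lambda}: mean of the mu largest expected order statistics."""
    return sum(expected_kth_largest(k, lam) for k in range(1, mu + 1)) / mu


def progress_rate(sigma_star, mu, c):
    """Equation (44.15)."""
    return sigma_star * c - sigma_star ** 2 / (2.0 * mu)


# Optimal step-sizes mu * c_{mu/mu,4mu} from (44.16) for mu = 1, 2, 3.
optimal_steps = [mu * progress_coefficient(mu, 4 * mu) for mu in (1, 2, 3)]
print(optimal_steps)
```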


Equation (44.15) holds in the limit $n \to \infty$ for any finite value of
$\lambda$. In finite but high dimensional search spaces, it can serve as an
approximation to the normalized progress rate of the $(\mu/\mu,\lambda)$-ES
on sphere functions in the vicinity of the optimal step-size provided that
$\lambda$ is not too large. A better approximation for finite $n$ is derived
in [44.8, 15] (however compare also [44.56]).

The improved performance of the $(\mu/\mu,\lambda)$-ES for $\mu > 1$
compared to the strategy that uses $\mu = 1$ is a consequence of the factor
$\mu$ in the denominator of the term in (44.15) that contributes negatively
to the normalized progress rate. The components $z_\odot$ of mutation
vectors selected for survival are correlated and likely to point in the
direction opposite to the gradient vector. The perpendicular components
$z_\ominus$ in the limit $n \to \infty$ have no influence on whether
a candidate solution is selected for survival and are thus uncorrelated. The
recombinative averaging of mutation vectors results in a length of the
$z_\odot$-component similar to those of individual mutation vectors.
However, the squared length of the components perpendicular to the gradient
direction is reduced by a factor of $\mu$, resulting in the reduction of the
negative term in (44.15) by a factor of $\mu$. Beyer [44.15] has coined the
term genetic repair for this phenomenon.

Weighted recombination (compare Algorithms 44.7 and 44.8) can significantly
increase the progress rate of $(\mu/\mu,\lambda)$-ES on sphere functions. If
$n$ is large, the $k$-th best candidate solution is optimally associated
with a weight proportional to the expected value of the $k$-th largest order
statistic of a sample of $\lambda$ independent standard normally distributed
random numbers. The resulting optimal normalized progress rate per offspring
candidate solution for large values of $\lambda$ then approaches a value of
0.5, exceeding that of optimal unweighted recombination by a factor of
almost two and a half [44.13]. The weights are symmetric about zero. If only
positive weights are employed and $\mu = \lfloor \lambda/2 \rfloor$, the
optimal normalized progress rate per offspring with increasing $\lambda$
approaches a value of 0.25. The weights in Algorithms 44.7 and 44.8 closely
resemble those positive weights.


$(\mu/\mu, \lambda)$-ES on Noisy Sphere Functions

Noise in the objective function is most commonly modeled as being Gaussian.
If evaluation of a candidate solution $x$ yields a noisy objective function
value $f(x) + \sigma_\epsilon \mathcal{N}(0, 1)$, then inferior candidate
solutions will sometimes be selected for survival and superior ones
discarded. As a result, progress rates decrease with increasing noise
strength $\sigma_\epsilon$. Introducing normalized noise strength
$\sigma_\epsilon^* = \sigma_\epsilon n / (2R^2)$, in the limit
$n \to \infty$, the normalized progress rate of the $(\mu/\mu,\lambda)$-ES
on noisy sphere functions is

\[ \varphi^* = \frac{\sigma^* c_{\mu/\mu,\lambda}}{\sqrt{1 + \vartheta^2}}
   - \frac{\sigma^{*2}}{2\mu} , \tag{44.18} \]

where $\vartheta = \sigma_\epsilon^* / \sigma^*$ is the noise-to-signal
ratio that the strategy operates under [44.67]. Noise does not impact the
term that contributes negatively to the strategy's progress. However, it
acts to reduce the magnitude of the positive term stemming from the
contributions of mutation vectors parallel to the gradient direction. Note
that unless the noise scales such that $\sigma_\epsilon^*$ is independent of
the location in search space (i.e., the standard deviation of the noise term
increases in direct proportion to $f(x)$, such as in a multiplicative noise
model with constant noise strength), (44.18) describes progress in single
time steps only rather than a rate of convergence.

Figure 44.7 illustrates for different offspring population sizes $\lambda$
how the optimal progress rate per offspring depends on the noise strength.
The curves have been obtained from (44.18) for optimal values of $\sigma^*$
and $\mu$. As the averaging of mutation vectors results in a vector of
reduced length, increasing $\lambda$ (and $\mu$ along with it) allows the
strategy to operate using larger and larger step-sizes. Increasing the
step-size reduces the noise-to-signal ratio $\vartheta$ that the strategy
operates under and thereby reduces the impact of noise on selection for
survival. Through genetic repair, the $(\mu/\mu,\lambda)$-ES thus implicitly
implements the rescaling of mutation vectors proposed in [44.2] for the
$(1,\lambda)$-ES in the presence of noise. Compare $c_m$ and $\eta_x$ in
Algorithms 44.7 and 44.8 that, for values smaller than one, implement the
explicit rescaling. It needs to be emphasized though that in
finite-dimensional search spaces, the ability to increase $\lambda$ without
violating the assumptions made in the derivation of (44.18) is severely
limited. Nonetheless, the benefits resulting from genetic repair are
significant, and the performance of the $(\mu/\mu,\lambda)$-ES is much more
robust in the presence of noise than that of the (1+1)-ES.


Cumulative Step-Size Adaptation 
All progress rate results discussed up to this point 
consider single time steps of the respective ES only. 
Analyses of the behavior of ES that include some form 
of step-size adaptation are considerably more difficult. 
Even for objective functions as simple as sphere func- 
tions, the state of the strategy is described by several 
variables with nonlinear, stochastic dynamics, and sim- 
plifying assumptions need to be made in order to arrive 
at quantitative results. 

Fig. 44.7 Optimal normalized progress rate per offspring of the
$(\mu/\mu,\lambda)$-ES on noisy sphere functions for $n \to \infty$ plotted
against the normalized noise strength. The solid lines depict results for,
from bottom to top, $\lambda = 4, 10, 40, 100, \infty$ and optimally chosen
$\mu$. The dashed line represents the optimal progress rate of the (1+1)-ES
(after [44.68])

In the following, we consider the $(\mu/\mu,\lambda)$-ES with cumulative
step-size adaptation (Algorithm 44.6 with (44.1) in place of Line 9 for
mathematical convenience) and parameters set such that $c_\sigma \to 0$ as
$n \to \infty$ and $d = \Theta(1)$. The state of the strategy on noisy
sphere functions with $\sigma_\epsilon^* = \mathrm{const}$ (i.e., noise that
decreases in strength as the optimal solution is approached) is described by
the distance $R$ of the parental centroid from the optimal solution,
normalized step-size $\sigma^*$, the length of the search path $s$ parallel
to the direction of the gradient vector of the objective function, and that
path's overall squared length. After initialization effects have faded, the
distribution of the latter three quantities is time invariant. Mean values
of the time invariant distribution can be approximated by computing expected
values of the variables after a single iteration of the strategy in the
limit $n \to \infty$ and imposing the condition that those be equal to the
respective values before that iteration. Solving the resulting system of
equations for $\sigma_\epsilon^* < \sqrt{2}\, \mu c_{\mu/\mu,\lambda}$
yields

\[ \sigma^* = \mu c_{\mu/\mu,\lambda}
   \sqrt{ 2 - \left( \frac{\sigma_\epsilon^*}{\mu c_{\mu/\mu,\lambda}} \right)^2 } \tag{44.19} \]

for the average normalized mutation strength assumed by the
strategy [44.69, 70]. The corresponding normalized progress rate

\[ \varphi^* = \frac{\sqrt{2} - 1}{2}\, \mu c_{\mu/\mu,\lambda}^2
   \left[ 2 - \left( \frac{\sigma_\epsilon^*}{\mu c_{\mu/\mu,\lambda}} \right)^2 \right] \tag{44.20} \]

is obtained from (44.18). Both the average mutation strength and the
resulting progress rate are plotted against the noise strength in Fig. 44.8.
For small noise strengths, cumulative step-size adaptation generates
mutation strengths that are larger than optimal. The evolution window
continually shifts toward smaller values of the step-size, and adaptation
remains behind its target. However, the resulting mutation strengths achieve
progress rates within 20% of optimal ones. For large noise strengths, the
situation is reversed and the mutation strengths generated by cumulative
step-size adaptation are smaller than optimal. However, increasing the
population size parameters $\mu$ and $\lambda$ allows shifting the operating
regime of the strategy toward the left-hand side of the graphs in Fig. 44.8,
where step-sizes are near optimal. As above, it is important to keep in mind
the limitations of the results derived in the limit $n \to \infty$. In
finite-dimensional search spaces the ability to compensate for large amounts
of noise by increasing the population size is more limited than (44.19)
and (44.20) suggest.
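The relation between (44.18) and the step-size generated by cumulative step-size adaptation can be verified with a short calculation. In the sketch below, substituting (44.19) into (44.18) gives the closed form $\varphi^* = \tfrac{\sqrt{2}-1}{2} \mu c^2 [2 - (\sigma_\epsilon^*/(\mu c))^2]$ with $c = c_{\mu/\mu,\lambda}$. The function names are hypothetical, and the value $c \approx 1.179$ for $\mu = 3$, $\lambda = 12$ is inferred from the optimal step-size 3.538 quoted earlier.

```python
import math


def noisy_progress(sigma_star, noise_star, mu, c):
    """Normalized progress rate (44.18) on the noisy sphere for n -> infinity."""
    theta = noise_star / sigma_star  # noise-to-signal ratio
    return sigma_star * c / math.sqrt(1.0 + theta ** 2) - sigma_star ** 2 / (2.0 * mu)


def csa_sigma(noise_star, mu, c):
    """Average normalized mutation strength (44.19); needs noise_star < sqrt(2)*mu*c."""
    s = noise_star / (mu * c)
    return mu * c * math.sqrt(2.0 - s ** 2)


def csa_progress(noise_star, mu, c):
    """Closed form obtained by inserting (44.19) into (44.18)."""
    s = noise_star / (mu * c)
    return (math.sqrt(2.0) - 1.0) / 2.0 * mu * c ** 2 * (2.0 - s ** 2)


mu, c = 3, 1.179  # c_{3/3,12}, inferred from the optimal step-size 3.538 = mu * c

for noise_star in (0.0, 1.0, 3.0):
    direct = noisy_progress(csa_sigma(noise_star, mu, c), noise_star, mu, c)
    print(noise_star, direct, csa_progress(noise_star, mu, c))

# Without noise, the optimal progress rate is mu * c^2 / 2, see (44.17).
ratio_to_optimum = csa_progress(0.0, mu, c) / (mu * c ** 2 / 2.0)
print("fraction of optimal progress without noise:", ratio_to_optimum)
```

The noise-free ratio evaluates to $2(\sqrt{2} - 1) \approx 0.83$, in line with the statement that progress rates stay within 20% of the optimal ones for small noise strengths.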


Parabolic Ridge Functions
A class of test functions that poses difficulties very different from those
encountered in connection with sphere functions are ridge functions,

\[ f(x) = x_1 + \xi \left( \sum_{i=2}^{n} x_i^2 \right)^{\alpha/2} = x_1 + \xi R^{\alpha} , \]

which include the parabolic ridge for $\alpha = 2$. The $x_1$-axis is
referred to as the ridge axis, and $R$ denotes the distance from that axis.
Progress can be made by minimizing the distance from the ridge axis or by
proceeding along it. The former requires decreasing step-sizes and is
limited in its effect as $R \ge 0$. The latter allows indefinite progress
and requires that the step-size does not decrease to zero. Short- and
long-term goals may thus be conflicting, and inappropriate step-size
adaptation may lead to stagnation.

Fig. 44.8a,b Normalized mutation strength and normalized progress rate of
the $(\mu/\mu,\lambda)$-ES with cumulative step-size adaptation on noisy
sphere functions for $n \to \infty$ plotted against the normalized noise
strength. The dashed lines depict optimal values

As an optimal solution to the ridge problem does not exist, the progress
rate $\varphi$ of the $(\mu/\mu,\lambda)$-ES on ridge functions is defined
as the expectation of the step made in the direction of the negative ridge
axis. For constant step-size, the distance $R$ of the parental centroid from
the ridge axis assumes a time-invariant limit distribution. An approximation
to the mean value of that distribution can be obtained by identifying that
value of $R$ for which the expected change is zero. Using this value yields

\[ \varphi = \frac{2 \mu c_{\mu/\mu,\lambda}^2}
   {n \xi \left( 1 + \sqrt{ 1 + \left( \dfrac{2 \mu c_{\mu/\mu,\lambda}}{n \xi \sigma} \right)^2 } \right)} \tag{44.21} \]

for the progress rate of the $(\mu/\mu,\lambda)$-ES on parabolic ridge
functions [44.71]. The strictly monotonic behavior of the progress rate,
increasing from a value of zero for $\sigma = 0$ to
$\varphi = \mu c_{\mu/\mu,\lambda}^2 / (n\xi)$ for $\sigma \to \infty$, is
fundamentally different from that observed on sphere functions. However, the
derivative of the progress rate with regard to the step-size for large
values of $\sigma$ tends to zero. The limited time horizon of any search as
well as the intent of using ridge functions as local rather than global
models of practically relevant objective functions both suggest that it may
be unwise to increase the step-size without bounds.

The performance of cumulative step-size adaptation on parabolic ridge
functions can be studied using the same approach as described above for
sphere functions, yielding

\[ \sigma = \frac{\mu c_{\mu/\mu,\lambda}}{\sqrt{2}\, n \xi} \tag{44.22} \]

for the (finite) average mutation strength [44.72]. From (44.21), the
corresponding progress rate

\[ \varphi = \frac{\mu c_{\mu/\mu,\lambda}^2}{2 n \xi} \tag{44.23} \]

is greater than half of the progress rate attained with any finite step
size.
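The claims about the parabolic ridge can be checked by evaluating the progress rate directly. The sketch below assumes the form $\varphi(\sigma) = 2\mu c^2 / \big(n\xi\,(1 + \sqrt{1 + (2\mu c/(n\xi\sigma))^2})\big)$ for (44.21) and the step-size $\sigma = \mu c/(\sqrt{2}\, n \xi)$ for (44.22), with $c = c_{\mu/\mu,\lambda}$; the parameter values are arbitrary illustrative choices.

```python
import math


def ridge_progress(sigma, mu, c, n, xi):
    """Progress rate (44.21) of the (mu/mu,lambda)-ES on the parabolic ridge."""
    a = 2.0 * mu * c / (n * xi * sigma)
    return 2.0 * mu * c ** 2 / (n * xi * (1.0 + math.sqrt(1.0 + a ** 2)))


mu, c, n, xi = 3, 1.179, 40, 1.0  # illustrative values; c approximates c_{3/3,12}

# Supremum of the progress rate, approached for sigma -> infinity.
limit = mu * c ** 2 / (n * xi)

# Step-size generated by cumulative step-size adaptation, (44.22).
sigma_csa = mu * c / (math.sqrt(2.0) * n * xi)
phi_csa = ridge_progress(sigma_csa, mu, c, n, xi)

print("progress with CSA step-size:", phi_csa, " half of the supremum:", limit / 2.0)

# The progress rate increases monotonically with the step-size.
steps = [0.001 * 2 ** k for k in range(20)]
values = [ridge_progress(s, mu, c, n, xi) for s in steps]
print(all(a < b for a, b in zip(values, values[1:])))
```

With the assumed step-size the square root evaluates to exactly 3, so the progress rate equals one half of the supremum, consistent with (44.23).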


Cigar Functions
While parabolic ridge functions provide an environment for evaluating
whether step-size adaptation mechanisms are able to avoid stagnation, the
ability to make continual meaningful positive progress with some constant
nonzero step-size is, of course, atypical for practical optimization
problems. A class of ridge-like functions that requires continual adaptation
of the mutation strength and is thus a more realistic model of problems
requiring ridge following are cigar functions:

\[ f(x) = x_1^2 + \xi \sum_{i=2}^{n} x_i^2 = x_1^2 + \xi R^2 , \]

with parameter $\xi \ge 1$ being the condition number of the Hessian matrix.
Small values of $\xi$ result in sphere-like characteristics, large values in
ridge-like ones. As above, $R$ measures the distance from the $x_1$-axis.

Assuming successful adaptation of the step-size, ES exhibit linear
convergence on cigar functions. The expected relative per iteration change
in the objective function value of the population centroid is referred to as
the quality gain $\Delta$ and determines the rate of convergence. In the
limit $n \to \infty$ it is described by

\[ \Delta^* = \begin{cases}
   \dfrac{\sigma^{*2}}{2\mu(\xi - 1)}
      & \text{if } \sigma^* < 2 \mu c_{\mu/\mu,\lambda} \dfrac{\xi - 1}{\xi} , \\[2ex]
   c_{\mu/\mu,\lambda} \sigma^* - \dfrac{\sigma^{*2}}{2\mu}
      & \text{otherwise} ,
   \end{cases} \]

where $\sigma^* = \sigma n / R$ and $\Delta^* = \Delta n / 2$ [44.73]. That
relationship is illustrated in Fig. 44.9 for several values of the
conditioning parameter. The parabola for $\xi = 1$ reflects the simple
quadratic relationship for sphere functions seen in (44.15). (For the case
of sphere functions, normalized progress rate and normalized quality gain
are the same.) For cigar functions with large values of $\xi$, two separate
regimes can be identified. For small step-sizes, the quality gain of the
strategy is limited by the size of the steps that can be made in the
direction of the $x_1$-axis. The $x_1$-component of the population centroid
virtually never changes sign. The search process resembles one of ridge
following, and we refer to the regime as the ridge regime. In the other
regime, the step-size is such that the quality gain of the strategy is
effectively limited by the ability to approach the optimal solution in the
subspace spanned by the $x_2, \dots, x_n$-axes. The $x_1$-component of the
population centroid changes sign much more frequently than in the ridge
regime, as is the case on sphere functions. We thus refer to the regime as
the sphere regime.

The approach to the analysis of the behavior of cumulative step-size
adaptation explained above for sphere and parabolic ridge functions can be
applied to cigar functions as well, yielding

\[ \sigma^* = \sqrt{2}\, \mu c_{\mu/\mu,\lambda} \]

for the average normalized mutation strength generated by cumulative
step-size adaptation [44.73]. The corresponding normalized quality gain is

\[ \Delta^* = \begin{cases}
   (\sqrt{2} - 1)\, \mu c_{\mu/\mu,\lambda}^2
      & \text{if } \xi < \dfrac{\sqrt{2}}{\sqrt{2} - 1} , \\[2ex]
   \dfrac{\mu c_{\mu/\mu,\lambda}^2}{\xi - 1}
      & \text{otherwise} .
   \end{cases} \]

Both are compared with optimal values in Fig. 44.10. For small condition
numbers, $(\mu/\mu,\lambda)$-ES operate in the sphere regime and are within
20% of the optimal quality gain as seen earlier. For large condition
numbers, the strategy operates in the ridge regime and achieves a quality
gain within a factor of 2 of the optimal one, in accordance with the
findings for the parabolic ridge above.

Fig. 44.9 Normalized quality gain of $(\mu/\mu,\lambda)$-ES on cigar
functions for $n \to \infty$ plotted against the normalized mutation
strength for $\xi \in \{1, 4, 100\}$. The vertical line represents the
average normalized mutation strength generated by cumulative step-size
adaptation (axes: mutation strength $\sigma^*/(\mu c_{\mu/\mu,\lambda})$,
quality gain $\Delta^*/(\mu c_{\mu/\mu,\lambda}^2)$)


Fig. 44.10a,b Normalized mutation strength and normalized
quality gain of the (μ/μ, λ)-ES with cumulative step-size
adaptation on cigar functions for n → ∞ plotted against the
condition number of the cigar. The dashed curves represent
optimal values


Further Work
Further research regarding the progress rate of ES in
different test environments includes work analyzing the
behavior of mutative self-adaptation for linear [44.22],
spherical [44.74], and ridge functions [44.75].
Hierarchically organized ES have been studied when applied
to both parabolic ridge and sphere functions [44.76,
77]. Several step-size adaptation techniques have been
compared for ridge functions, including, but not limited
to, parabolic ones [44.78]. A further class of convex
quadratic functions for which quality gain results have
been derived is characterized by the occurrence of only
two distinct eigenvalues of the Hessian, both of which
occur with high multiplicity [44.79, 80].

An analytical investigation of the behavior of the 
(1+1)-ES on noisy sphere functions finds that failure 
to re-evaluate the parental candidate solution results in 
the systematic overvaluation of the parent and thus in 
potentially long periods of stagnation [44.68]. Contrary 
to what might be expected, the increased difficulty of 
replacing parental candidate solutions can have a pos- 
itive effect on progress rates as it tends to prevent the 
selection for survival of offspring candidate solutions 
solely due to favorable noise values. The convergence 
behavior of the (1+1)-ES on finite-dimensional sphere 
functions is studied by Jebalia et al. [44.81] who show 
that the additive noise model is inappropriate in fi- 
nite dimensions unless the parental candidate solution 
is re-evaluated, and who suggest a multiplicative noise 
model instead. An analysis of the behavior of (μ, λ)-ES
(without recombination) for noisy sphere functions
finds that in contrast to the situation in the absence of
noise, strategies with μ > 1 can outperform (1, λ)-ES
if there is noise present [44.82]. The use of nonsingleton
populations increases the signal-to-noise ratio and
thus allows for more effective selection of good
candidate solutions. The effects of non-Gaussian forms of
noise on the performance of (μ/μ, λ)-ES applied to the
optimization of sphere functions have also been
investigated [44.83].

Finally, there are some results regarding the 
optimization of time-varying objectives [44.84] as 
well as analyses of simple constraint handling
techniques [44.85–87].


44.4.3 Convergence Proofs 


In the previous section, we have described theoretical 
results that involve approximations in their derivation 
and consider the limit for n → ∞. In this section, exact
results are discussed. 

Convergence proofs with only mild assumptions on 
the objective function are easy to obtain for ES with 
a step-size that is effectively bounded from below and 
above (and, for nonelitist strategies, when addition- 
ally the search space is bounded) [44.12, 64]. In this 
case, the expected runtime to reach an ε-ball around
the global optimum (see also Definition 44.3) cannot
be faster than proportional to 1/ε^n, as obtained with pure
random search, for ε → 0 or n → ∞. If the mutation distribution
is not normal and exhibits a singularity at zero,
convergence can be much faster than with random search even
when the step-size is bounded away from zero [44.88].
Similarly, convergence proofs can be obtained for
adaptive strategies that include provisions for using a fixed
step-size and covariance matrix with some constant
probability.
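
The 1/ε^n rate quoted above for pure random search can be made concrete with a short calculation. As a sketch only, assume a setting the text does not spell out: candidate solutions are drawn independently and uniformly from a bounded search space D ⊂ R^n of volume V that contains the ε-ball B_ε around the global optimum, and T_ε denotes the number of samples drawn until the ball is hit for the first time. T_ε is then geometrically distributed, and for fixed n

```latex
\Pr[x \in B_\varepsilon]
  = \frac{\operatorname{vol}(B_\varepsilon)}{V}
  = \frac{\pi^{n/2}}{\Gamma(n/2 + 1)} \, \frac{\varepsilon^{n}}{V} ,
\qquad
\mathrm{E}[T_\varepsilon]
  = \frac{1}{\Pr[x \in B_\varepsilon]}
  = \frac{\Gamma(n/2 + 1)\, V}{\pi^{n/2}} \, \frac{1}{\varepsilon^{n}}
  \propto \frac{1}{\varepsilon^{n}} .
```

Halving ε therefore multiplies the expected number of samples by 2^n under these assumptions.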

Convergence proofs for strategy variants that do 
not explicitly ensure that long steps are sampled for 
a sufficiently long time typically require much stronger 
restrictions on the set of objective functions that they 
hold for. Such proofs, however, have the potential to 
reveal much faster, namely linear convergence.
Evolution strategies with the artificial distance proportional
step-size, σ = const × ‖x‖, exhibit, as shown above,
linear convergence on the sphere function with an
associated runtime proportional to log(1/ε) [44.63, 65,
81, 89]. This result can be easily proved by using a law
of large numbers, because ‖x^(t+1)‖/‖x^(t)‖ are
independent and identically distributed for all t.
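
The linear convergence described above can be observed in a small experiment. The following is a minimal sketch, not the construction used in the cited proofs: it assumes a (1+1)-ES, dimension n = 10, and the constant 0.12 (roughly 1.2/n); all three choices are illustrative assumptions. Because the step-size is tied to ‖x‖, the logarithm of ‖x‖ should decrease by about the same amount per iteration on average.

```python
import math
import random


def sphere(x):
    """Sphere function: sum of squared coordinates."""
    return sum(v * v for v in x)


def norm(x):
    """Euclidean distance of x from the optimum at the origin."""
    return math.sqrt(sphere(x))


def one_plus_one_es(n=10, const=0.12, iterations=2000, seed=1):
    """(1+1)-ES with the artificial step-size sigma = const * ||x||.

    Returns the list of log(||x^(t)||) for t = 0, ..., iterations.
    The values of n, const, and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    x = [1.0] * n
    log_norms = [math.log(norm(x))]
    for _ in range(iterations):
        sigma = const * norm(x)          # distance-proportional step-size
        y = [v + sigma * rng.gauss(0.0, 1.0) for v in x]
        if sphere(y) <= sphere(x):       # elitist (plus) selection
            x = y
        log_norms.append(math.log(norm(x)))
    return log_norms


log_norms = one_plus_one_es()
# Average change of log ||x|| per iteration: an estimate of the
# (negative) linear convergence rate.
rate = (log_norms[-1] - log_norms[0]) / (len(log_norms) - 1)
print(f"average change of log ||x|| per iteration: {rate:.4f}")
```

A negative, roughly constant rate means that the number of iterations needed to reduce ‖x‖ by a factor ε grows like log(1/ε), which is the runtime stated in the text.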

Without the artificial choice of step-size, σ/‖x‖
becomes a random variable. If this random variable
is a homogeneous Markov chain and stable enough
to satisfy the law of large numbers, linear
convergence is maintained [44.64, 89]. The stability of the
Markov chain associated with the self-adaptive (1, λ)-ES
on the sphere function has been shown in dimension
45. Estimation of Distribution Algorithms


Martin Pelikan, Mark W. Hauschild, Fernando G. Lobo 


Estimation of distribution algorithms (EDAs) guide 
the search for the optimum by building and sam- 
pling explicit probabilistic models of promising 
candidate solutions. However, EDAs are not only 
optimization techniques; besides the optimum or 
its approximation, EDAs provide practitioners with 
a series of probabilistic models that reveal a lot 
of information about the problem being solved. 
This information can in turn be used to design 
problem-specific neighborhood operators for lo- 
cal search, to bias future runs of EDAs on similar 
problems, or to create an efficient computational 
model of the problem. This chapter provides an 
introduction to EDAs as well as a number of point- 
ers for obtaining more information about this class 
of algorithms. 
45.1.1 Problem Definition
45.1.2 EDA Procedure
45.1.3 Simulation of an EDA by Hand

45.2 Taxonomy of EDA Models
45.2.1 Classification Based on Problem Decomposition
45.2.2 Classification Based on Local Distributions in Graphical Models


Estimation of distribution algorithms (EDAs) [45.1–8],
also called probabilistic model-building genetic
algorithms (PMBGAs) and iterated density estimation
evolutionary algorithms (IDEAs), view optimization as
a series of incremental updates of a probabilistic model,
starting with the model encoding the uniform
distribution over admissible solutions and ending with the
model that generates only the global optima. In the
past decade and a half, EDAs have been applied to
many challenging optimization problems [45.9–21]. In
many of these studies, EDAs were shown to solve
problems that were intractable with other techniques
or for which no other technique could achieve comparable
results. However, the motive for the use of EDAs in
practice is not only that these algorithms can solve
difficult optimization problems, but that in addition to
the optimum or its approximation EDAs provide
practitioners with a compact computational model of the
problem represented by a series of probabilistic
models [45.22–24]. These probabilistic models reveal a lot
of information about the problem domain, which can
in turn be used to bias optimization of similar
problems, create problem-specific neighborhood operators,
and perform many other tasks. While many metaheuristics exist
that essentially sample implicit probability distributions
by using a combination of stochastic search operators,
the insight into the problem represented by a series
of explicit probabilistic models of promising
candidate solutions gives EDAs an edge over most other
metaheuristics.

This chapter provides an introduction to EDAs.
Additionally, the chapter presents numerous pointers for
obtaining additional information about this class of
algorithms.

The chapter is organized as follows. Section 45.1
outlines the basic procedure of an EDA. Section 45.2
presents a taxonomy of EDAs based on the type of
decomposition encoded by the model and the type of
local distributions used in the model. Section 45.3
reviews some of the most popular EDAs. Section 45.4
discusses major research directions and the past results
in theoretical modeling of EDAs. Section 45.5
focuses on efficiency enhancement techniques for EDAs.
Section 45.6 gives pointers for obtaining additional
information about EDAs. Section 45.7 summarizes and
concludes the chapter.


45.1 Basic EDA Procedure
45.1.1 Problem Definition


An optimization problem may be defined by
specifying (1) a set of potential solutions to the problem and
(2) a procedure for evaluating the quality of these
solutions. The set of potential solutions is often defined
using a general representation of admissible solutions
and a set of constraints. The procedure for evaluating
the quality of candidate solutions can either be defined
as a function that is to be minimized or maximized
(often referred to as an objective function or fitness
function) or as a partial ordering operator. The task is
to find a solution from the set of potential solutions
that maximizes quality as defined by the evaluation
procedure.

As an example, let us consider the quadratic
assignment problem (QAP), which is one of the fundamental
NP-hard combinatorial problems [45.25]. In QAP, the
input consists of distances between n locations and
flows between n facilities. The task is to find a
one-to-one assignment of facilities to locations so that the
overall cost is minimized. The cost for a pair of
locations is defined as the product of the distance between
these locations and the flow between the facilities
assigned to these locations; the overall cost is the sum of
the individual costs for all pairs of locations. Therefore,
in QAP, potential solutions are defined as permutations
that define assignments of facilities to locations and
the solution quality is evaluated using the cost
function discussed above. The task is to minimize the cost.
As another example, consider the maximum
satisfiability problem for propositional logic formulas defined in
the conjunctive normal form with 3 literals per clause
(MAX3SAT). In MAX3SAT, each potential solution
defines one interpretation of propositions (making each
proposition either true or false), and the quality of
a solution is measured by the number of clauses that are
satisfied by the specific interpretation. The task is to find
an interpretation that maximizes the number of satisfied
clauses.
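
Both evaluation procedures are short enough to write down directly. The sketch below is illustrative only: the function names, the encoding of an assignment as a list `assignment[location] = facility`, the encoding of a literal as a pair `(proposition index, required truth value)`, and the small data sets are all assumptions made for this example. The QAP sum runs over ordered pairs of locations, which is one common convention.

```python
def qap_cost(assignment, distance, flow):
    """Overall QAP cost of a one-to-one assignment (to be minimized).

    assignment[a] is the facility placed at location a (a permutation of
    0..n-1); distance and flow are n x n matrices given as nested lists.
    """
    n = len(assignment)
    return sum(
        distance[a][b] * flow[assignment[a]][assignment[b]]
        for a in range(n)
        for b in range(n)
    )


def max3sat_quality(interpretation, clauses):
    """Number of clauses satisfied by an interpretation (to be maximized).

    interpretation[k] is the truth value of proposition k; a clause is
    a tuple of literals (k, required_value) and is satisfied when at
    least one of its literals matches the interpretation.
    """
    return sum(
        any(interpretation[k] == required for k, required in clause)
        for clause in clauses
    )


# Hypothetical toy instances, for illustration only.
distance = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
flow = [[0, 5, 2], [5, 0, 1], [2, 1, 0]]
clauses = [
    ((0, True), (1, False), (2, True)),
    ((0, False), (1, True), (2, True)),
    ((0, False), (1, False), (2, False)),
]

print(qap_cost([0, 1, 2], distance, flow))
print(qap_cost([1, 0, 2], distance, flow))
print(max3sat_quality([True, True, False], clauses))
```

In both cases the set of potential solutions (permutations, truth assignments) and the evaluation procedure are all that an optimizer needs to know about the problem.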

Without additional assumptions about the problem, 
one way to find the optimum is to repeat three main 
steps: 


• Generate candidate solutions.

• Evaluate the generated solutions.

• Update the procedure for generating new candidate
  solutions according to the results of the evaluation.


Ideally, the quality of generated solutions would 
improve over time and after a reasonable number of 
iterations, the execution of these three steps would gen- 
erate the global optimum or its accurate approximation. 
Different algorithms implement the above three steps in 
different ways, but the key idea remains the same:
iteratively update the procedure for generating candidate
solutions so that generated candidate solutions continu- 
ally improve in quality. 


45.1.2 EDA Procedure 


In EDAs, the central idea is to maintain an explicit prob- 
abilistic model that represents a probability distribution 
over candidate solutions. In each iteration, the model is 
adjusted based on the results of the evaluation of can- 
didate solutions so that it will generate better candidate 
solutions in the subsequent iterations. Note that using
an explicit probabilistic model makes EDAs quite
different from many other metaheuristics, such as genetic
algorithms [45.26, 27] or simulated annealing [45.28,
29], in which the probability distribution used to
generate new candidate solutions is often defined implicitly
by a search operator or a combination of several search
operators. Researchers often distinguish two main types
of EDAs:


• Population-based EDAs. Population-based EDAs
  maintain a population (multiset) of candidate
  solutions, starting with a population generated at
  random according to the uniform distribution over all
  admissible solutions. Each iteration starts by
  creating a population of promising candidate solutions
  using the selection operator, which gives preference
  to solutions of higher quality. Any popular selection
  method for evolutionary algorithms can be used,
  such as truncation or tournament selection [45.30,
  31]. For example, truncation selection selects the
  top τ% members of the population. A probabilistic
  model is then built for the selected solutions. New
  solutions are created by sampling the distribution
  encoded by the built model. The new solutions are
  then incorporated into the original population using
  a replacement operator. In full replacement, for
  example, the entire original population of candidate
  solutions is replaced by the new ones. Pseudocode
  for a population-based EDA is shown in
  Algorithm 45.1.

• Incremental EDAs. In incremental EDAs, the
  population of candidate solutions is fully replaced by
  a probabilistic model. The model is initialized so
  that it encodes the uniform distribution over all
  admissible solutions. The model is then updated
  incrementally by repeating the process of (1)
  sampling several candidate solutions from the current
  model and (2) adjusting the model based on the
  evaluation of these candidate solutions and their
  comparison so that the model becomes more likely
  to generate high-quality solutions in subsequent
  iterations. Pseudocode for an incremental EDA is
  shown in Algorithm 45.2.


Algorithm 45.1 Population-based estimation of
distribution algorithm

t ← 0
generate population P(0) of random solutions
while termination criteria not satisfied do
    evaluate all candidate solutions in P(t)
    select promising solutions S(t) from P(t)
    build a probabilistic model M(t) for S(t)
    generate new solutions O(t) by sampling M(t)
    create P(t + 1) by combining O(t) and P(t)
    t ← t + 1
end while


Algorithm 45.2 Incremental estimation of
distribution algorithm

t ← 0
initialize model M(0) to represent the uniform
    distribution over admissible solutions
while termination criteria not satisfied do
    generate population P(t) of candidate solutions
        by sampling M(t)
    evaluate all candidate solutions in P(t)
    create new model M(t + 1) by adjusting M(t)
        according to evaluated P(t)
    t ← t + 1
end while


Incremental EDAs often generate only a few can- 
didate solutions at a time, whereas population-based 
EDAs often work with a large population of candidate 
solutions, building each model from scratch. Nonethe- 
less, it is easy to see that the two approaches are 
essentially the same because even the population-based 
EDAs can be reformulated in an incremental-based 
manner. 

The main components of a population-based EDA 
thus include: 


(1) A selection operator for selecting promising solu- 
tions. 

(2) An assumed class of probabilistic models to use for 
modeling and sampling. 

(3) A procedure for learning a probabilistic model for 
the selected solutions. 

(4) A procedure for sampling the built probabilistic 
model. 

(5) A replacement operator for combining the popula- 
tions of old and new candidate solutions. 


The main components of an incremental EDA in- 
clude: 


(1) An assumed class of probabilistic models. 

(2) A procedure for adjusting the probabilistic model 
based on new candidate solutions and their evalua- 
tions. 

(3) A procedure for sampling the probabilistic model. 
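
An incremental EDA needs only the three components above. As a hedged sketch, the code below uses a vector of per-bit probabilities as the model class and a PBIL-style update that shifts each probability toward the best sampled solution; the function name, the learning rate of 0.1, the ten samples per step, and the bit-counting objective are assumptions chosen for illustration.

```python
import random


def incremental_eda(evaluate, n_bits, samples_per_step=10,
                    learning_rate=0.1, iterations=300, seed=0):
    """Incremental EDA on binary strings with a PBIL-style update.

    All default parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    # (1) model class: independent probability of a 1 in each position,
    # initialized to the uniform distribution over all bit strings
    model = [0.5] * n_bits
    for _ in range(iterations):
        # (3) sample a few candidate solutions from the current model
        candidates = [[1 if rng.random() < p else 0 for p in model]
                      for _ in range(samples_per_step)]
        # (2) adjust the model toward the best sampled candidate
        best = max(candidates, key=evaluate)
        model = [(1.0 - learning_rate) * p + learning_rate * bit
                 for p, bit in zip(model, best)]
    return model


model = incremental_eda(evaluate=sum, n_bits=20)
print([round(p, 2) for p in model])
```

No population is stored between iterations; the probability vector alone carries the state of the search.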
The procedure for learning a probabilistic model
usually requires two subcomponents: a metric for eval- 
uating the probabilistic models from the assumed class, 
and a search procedure for choosing a particular model 
based on the metric used. EDAs differ mainly in 
the class of probabilistic models and the procedures 
used for evaluating candidate models and searching for 
a good model. 

The general outline of a population-based EDA is 
quite similar to that of a traditional evolutionary al- 
gorithm (EA) [45.32]; both guide the search toward 
promising solutions by iteratively performing selection 
and variation, the two key ingredients of any EA. In par- 
ticular, components (1) and (5) are precisely the same 
as those used in other EAs. Components (2), (3), and 
(4), however, are unique to EDAs, and constitute their 
way of producing variation, as opposed to using recom- 
bination and mutation operators as is often done with 
other EAs. 

As we shall see, this alternative perspective opens 
a way for designing search procedures from principled 
grounds by bringing to the evolutionary computation 
domain a vast body of knowledge from the machine 
learning literature, and in particular from probabilis- 
tic graphical models. The key idea of EDAs is to look 
at a population of previously visited good solutions as 
data, learn a model (or theory) of that data, and use 
the resulting model to infer where other good solutions 
might be. This approach is powerful, allowing a search 
algorithm to learn and adapt itself with respect to the 
optimization problem being solved, while it is being 
solved. 


45.1.3 Simulation of an EDA by Hand 


To better understand the EDA procedure, this section 
presents a simple EDA simulation by hand. The purpose 
of presenting the simulation is to clarify the components 
of the basic EDA procedure and to build intuition about 
the dynamics of an EDA run. 

The simulation assumes that candidate solutions are
represented by binary strings of fixed length n > 0. The
objective function to maximize is onemax, which is
defined as the sum of the bits in the input binary string
(X_1, X_2, ..., X_n)

$$f_{\text{onemax}}(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} X_i . \tag{45.1}$$

The quality of a candidate solution improves with the
number of 1s in the input string, and the optimum is the
string of all 1s.


To model and sample candidate solutions, the
simulation uses a probability vector [45.1, 6, 33]. A
probability vector p for n-bit binary strings has n components,
p = (p_1, p_2, ..., p_n). The component p_i represents the
probability of observing a 1 in position i of a solution
string. To learn the probability vector, p_i is set to the
proportion of 1s in position i observed in the selected
set of solutions. To sample a new candidate solution
(X_1, X_2, ..., X_n), the components of the probability
vector are polled and each X_i is set to 1 with probability p_i,
and to 0 with probability 1 − p_i.

The expected outcome of the learning and sampling 
of the probability vector is that the population of se- 
lected solutions and the population of new candidate 
solutions have the same proportion of 1s in each po- 
sition. However, since the sampling considers each new 
candidate solution independently of others, the actual 
proportions may vary a little from their expected val- 
ues. The probability-vector EDA described above is 
typically referred to as the univariate marginal distri- 
bution algorithm (UMDA) [45.6]; other EDAs based on 
the probability vector model [45.1, 33, 34] will be dis- 
cussed in Sect. 45.3.1. 

To keep the simulation simple, we consider a 5-bit
onemax, a population of size N = 6, and truncation
selection with threshold τ = 50%. Recall that truncation
selection with τ = 50% selects the top half of the
current population.

Figure 45.1 shows the first two iterations of the 
EDA simulation. The initial population of candidate 
solutions is generated at random. Truncation selection 
then selects the best 50% of candidate solutions based 
on their evaluation using onemax to form the set of 
promising solutions. Next, the probability vector is cre- 
ated based on the selected solutions and the distribution 
encoded by the probability vector is sampled to gener- 
ate new candidate solutions. The resulting population 
replaces the original population and the procedure re- 
peats. 
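
The same steps can be traced in code. This is a minimal sketch of the setup just described (5-bit onemax, population of 6, truncation selection of the top half, full replacement); the function name `simulate` and the random seed are assumptions, and because the initial population is random, the strings it produces will generally differ from those printed in Fig. 45.1.

```python
import random

N_BITS = 5       # 5-bit onemax
POP_SIZE = 6     # population size N = 6
N_SELECTED = 3   # truncation selection keeps the top 50%


def onemax(bits):
    """Number of 1s in the bit string."""
    return sum(bits)


def simulate(iterations=2, seed=3):
    """Return (population, selected, probability vector) per iteration."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(N_BITS)]
                  for _ in range(POP_SIZE)]
    history = []
    for _ in range(iterations):
        selected = sorted(population, key=onemax, reverse=True)[:N_SELECTED]
        # p_i = proportion of 1s in position i among the selected solutions
        p = [sum(column) / N_SELECTED for column in zip(*selected)]
        history.append((population, selected, p))
        # sample a new population; it fully replaces the old one
        population = [[1 if rng.random() < p_i else 0 for p_i in p]
                      for _ in range(POP_SIZE)]
    return history


for t, (population, selected, p) in enumerate(simulate()):
    average = sum(map(onemax, population)) / POP_SIZE
    print(f"iteration {t}: average fitness before selection {average:.2f}")
    print("  selected:", ["".join(map(str, s)) for s in selected])
    print("  probability vector:", [round(p_i, 2) for p_i in p])
```

With only three selected strings, each p_i can take just the values 0, 1/3, 2/3, or 1, so a position in which all selected strings agree is fixed from then on.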

In both iterations of the simulation, the average ob- 
jective-function value in the new population is greater 
than the average value in the population before se- 
lection. The increase in the average quality of the 
population is good news for us because we want to max- 
imize the objective function, but why does this happen? 
Since for onemax the solutions with more 1s are better 
than those with fewer 1s, selection should increase the 
number of 1s in the population. The learning and sam- 
pling of the probability vector is not expected to create 
or destroy any bits and that is why the new population 
of candidate solutions should contain more 1s than the 
original population (both in the proportion and in the
actual number). Since the onemax value increases with the
number of 1s, we can expect the overall quality of the 
population to increase over time. Ideally, every itera- 
tion should increase the objective-function values in the 
population unless no improvement is possible. 
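
The claim that learning and sampling neither create nor destroy 1s in expectation can be written out in one line. Let S denote the set of selected solutions and p_i the proportion of 1s in position i within S (the symbol S follows Algorithm 45.1; treating it as a multiset of bit strings s = (s_1, ..., s_n) is an assumption of this sketch). Since each sampled bit X_i equals 1 with probability p_i, linearity of expectation gives

```latex
\mathrm{E}\bigl[ f_{\text{onemax}}(X_1, \ldots, X_n) \bigr]
  = \sum_{i=1}^{n} \mathrm{E}[X_i]
  = \sum_{i=1}^{n} p_i
  = \sum_{i=1}^{n} \frac{1}{|S|} \sum_{s \in S} s_i
  = \frac{1}{|S|} \sum_{s \in S} f_{\text{onemax}}(s) .
```

The expected quality of a newly sampled solution therefore equals the average quality of the selected set, which truncation selection makes at least as large as the average of the population before selection.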
Nonetheless, the increase of the average objective- 
function value tells only half the story. A similar in- 
crease in the quality of the population in the first 
iteration would be achieved by just repeating selec- 
tion alone without the use of the probabilistic model. 
However, by applying selection alone, no new solutions 
are ever created and the resulting algorithm produces 
no variation at all (i.e., there is no exploration of new 
candidate solutions). Since the initial population is gen- 
erated at random, the EDA with selection alone would 
be just a poor algorithm for obtaining the best solution 
from the initial population. The learning and sampling
of the probabilistic model provides a mechanism for
both (1) improving quality of new candidate solutions
(under certain assumptions), and (2) facilitating
exploration of the set of admissible solutions.

What we have seen in this simulation was an
example of the simplest kind of EDAs. The assumed class of
probabilistic models, the probability vector, has a fixed
structure. Under these circumstances, the procedure for
learning it becomes trivial because there are really no
alternative models to choose from. This class of EDAs
is quite limited in what it can do. As we shall see in
a moment, there are other classes of EDAs that allow
richer probabilistic models capable of capturing
interactions among the variables of a given problem. More
importantly, these interactions can be learned
automatically on a problem by problem basis. This of course
results in a more complex model building procedure,
but the extra effort has been shown to be well worth
it, especially when solving more difficult optimization
problems [45.4, 5, 8, 22, 35–37].

Fig. 45.1 Simple simulation of an
EDA based on the probability-vector
model for onemax. The fitness values
of candidate solutions are shown
inside parentheses


45.2 Taxonomy of EDA Models


This section provides a high-level overview of the
distinguishing characteristics of probabilistic models. The
characteristics are discussed with respect to (1) the
types of interactions covered by the model and (2) the
types of local distributions. This section only focuses
on the key characteristics of the probabilistic models;
a more detailed overview of EDAs for various
representations of candidate solutions will be covered by the
following sections.


45.2.1 Classification Based on Problem Decomposition


To make the estimation and sampling tractable with 
reasonable sample sizes, most EDAs use probabilistic 
models that decompose the problem using uncondi- 
tional or conditional independence. The way in which 
a model decomposes the problem provides one impor- 
tant characteristic that distinguishes different classes 
of probabilistic models. Classification of probabilistic 
models based on the way they decompose a prob- 
lem is relevant regardless of the types of the under- 
lying distributions or the representation of problem 
variables. 

Most EDAs assume that candidate solutions are rep- 
resented by fixed-length vectors of variables and they 
use graphical models to represent the underlying prob- 
lem structure. Graphical models allow practitioners to 
represent both direct dependencies between problem 
variables as well as independence assumptions. One 
way to classify graphical models is to consider a hi- 
erarchy of model types based on the complexity of 
a model (see Fig. 45.2 for illustrative examples) [45.3,
4, 7]:


• No dependencies. In models that assume full
  independence, every variable is assumed to be
  independent of any other variable. That is, the
  probability distribution P(X_1, X_2, ..., X_n) of the vector
  (X_1, X_2, ..., X_n) of n variables is assumed to
  consist of a product of the distributions of individual
  variables

  $$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i) . \tag{45.2}$$

  The simulation presented in Sect. 45.1.3 was based
  on a model that assumed full independence of
  binary problem variables. EDAs based on univariate
  models that assume full independence of problem
  variables include the equilibrium genetic algorithm
  (EGA) [45.33], the population-based incremental
  learning (PBIL) [45.1], the UMDA [45.6], the
  compact genetic algorithm (cGA) [45.34], the stochastic
  hill climbing with learning by vectors of normal
  distributions [45.38], and the continuous PBIL [45.39].
• Pairwise dependencies. In this class of models,
  dependencies between variables form a tree or forest
  graph. In a tree graph, each variable except for the
  root of the tree is conditioned on its parent in a tree
  that contains all variables. A forest graph, on the
  other hand, is a collection of disconnected trees.
  Again, the forest contains all problem variables.
  Denoting by R the set of roots of the trees in a
  forest, and by X = (X_1, X_2, ..., X_n) the entire vector of
  variables, the distribution from this class can be
  expressed as

  $$P(X_1, X_2, \ldots, X_n) = \prod_{X_i \in R} P(X_i) \times \prod_{X_i \in X \setminus R} P(X_i \mid \mathrm{parent}(X_i)) . \tag{45.3}$$

  A special type of a tree model is sometimes
  distinguished, in which the variables form a sequence
  (or a chain), and each variable except for the first
  one depends directly on its predecessor. Denoting
  by π(i) the index of the ith variable in the sequence,
  the distribution is given by

  $$P(X_1, X_2, \ldots, X_n) = P(X_{\pi(1)}) \prod_{i=2}^{n} P(X_{\pi(i)} \mid X_{\pi(i-1)}) . \tag{45.4}$$

  EDAs based on models with pairwise dependencies
  include the mutual information maximizing input
  clustering (MIMIC) [45.36], EDA based on
  dependency trees [45.35], and the bivariate marginal
  distribution algorithm (BMDA) [45.40].

Fig. 45.2a-f Illustrative examples of graphical models. Problem
variables are displayed as circles and dependencies are shown as
edges between variables or clusters of variables. (a) Univariate
model. (b) Chain model. (c) Forest model. (d) Marginal product model.
(e) Bayesian network. (f) Markov network

• Multivariate dependencies. Multivariate models
  represent dependencies using either directed acyclic
  graphs or undirected graphs. Two representative
  models are popular in EDAs: (1) Bayesian networks
  and (2) Markov networks. A Bayesian network
  is represented by a directed acyclic graph where
  each node corresponds to a variable and each edge
  defines a direct conditional dependence. The
  probability distribution encoded by a Bayesian network
  can be written as

  $$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{parents}(X_i)) . \tag{45.5}$$

  A Bayesian network represents problem
  decomposition by conditional independence assumptions;
  each variable is assumed to be independent of any
  of its antecedents in the ancestral ordering of the
  variables, given the values of the variable's parents.
  Note that all models discussed thus far were
  special cases of Bayesian networks. In fact, a Bayesian
  network can represent an arbitrary multivariate
  distribution; however, for such a model to be practical,
  it is often desirable to consider Bayesian networks
  of limited complexity.

  In Markov networks (Markov random field
  models), two variables are assumed to be independent of
  each other given a subset of variables defining the
  condition if every path between these variables is
  separated by one or more variables in the condition.
  A special subclass of multivariate models is
  sometimes considered in which the variables are divided
  into disjoint clusters, which are independent of each
  other. These models are called marginal product
  models (MPM). Polytrees also represent a subclass
  of multivariate models in which a directed acyclic
  graph is used as the basic dependency structure but
  the graph is restricted so that at most one undirected
  path exists between any two vertices.
  EDAs based on models with multivariate
  dependencies include the factorized distribution algorithm
  (FDA) [45.37], the learning FDA (LFDA) [45.37],
  the estimation of Bayesian network algorithm
  (EBNA) [45.41], the Bayesian optimization
  algorithm (BOA) [45.42, 43] and its hierarchical version
  (hBOA) [45.44], the extended compact genetic
  algorithm (ECGA) [45.45], the polytree EDA [45.46],
  the continuous iterated density estimation
  algorithm [45.47], the estimation of multivariate
  normal algorithm (EMNA) [45.48], and the real-coded
  BOA (rBOA) [45.49].

• Full dependence. Models may be used that do
not make any independence assumptions. However, 
such models must typically impose a number of 
other restrictions on the distribution to ensure that 
the models remain tractable for a moderate-to-large 
number of variables. 
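
The factorizations (45.2) to (45.5) share one computational pattern: multiply local conditional probabilities to evaluate a solution, and sample each variable after its parents to generate one. The sketch below illustrates this for binary variables. The data layout (`parents[i]` as a tuple of parent indices, `cpt[i]` as a dictionary from parent values to the probability of a 1), the function names, and all probability values are assumptions made for this example.

```python
import random


def joint_probability(x, parents, cpt):
    """P(x) as the product of P(x_i | parents(x_i)), as in (45.5).

    parents[i] holds the indices of the parents of variable i; cpt[i]
    maps a tuple of parent VALUES to P(X_i = 1 | those values).
    """
    prob = 1.0
    for i, value in enumerate(x):
        p_one = cpt[i][tuple(x[j] for j in parents[i])]
        prob *= p_one if value == 1 else 1.0 - p_one
    return prob


def ancestral_sample(parents, cpt, order, rng):
    """Sample every variable after its parents, following `order`."""
    x = [None] * len(parents)
    for i in order:
        p_one = cpt[i][tuple(x[j] for j in parents[i])]
        x[i] = 1 if rng.random() < p_one else 0
    return x


# Univariate model (45.2): no variable has a parent.
uni_parents = [(), (), ()]
uni_cpt = [{(): 0.5}, {(): 0.8}, {(): 0.1}]

# Chain model (45.4): X_0 -> X_1 -> X_2.
chain_parents = [(), (0,), (1,)]
chain_cpt = [
    {(): 0.5},
    {(0,): 0.1, (1,): 0.9},   # P(X_1 = 1 | X_0 = 0), P(X_1 = 1 | X_0 = 1)
    {(0,): 0.2, (1,): 0.7},   # P(X_2 = 1 | X_1 = 0), P(X_2 = 1 | X_1 = 1)
]

print(joint_probability([1, 1, 0], uni_parents, uni_cpt))      # 0.5 * 0.8 * 0.9
print(joint_probability([1, 1, 0], chain_parents, chain_cpt))  # 0.5 * 0.9 * 0.3
print(ancestral_sample(chain_parents, chain_cpt, [0, 1, 2], random.Random(0)))
```

The univariate and chain models differ only in their `parents` lists, which reflects the remark above that the simpler model classes are special cases of Bayesian networks.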


There are two additional types of probabilistic mod- 
els that have been used in EDAs and that provide 
a somewhat different mechanism for decomposing the 
problem: 


• Grammar models. Some EDAs use stochastic or
deterministic grammars to represent the probabil- 
ity distribution over candidate solutions. The ad- 
vantage of grammars is that they allow model- 
ing of variable-length structures. Because of this, 
grammar distributions are mostly used as the ba- 
sis for implementing genetic programming using 
EDAs [45.50], which represents candidate solu- 
tions using labeled trees of variable size. Gram- 
mar models are used, for example, in the proba- 
bilistic-grammar based EDA for genetic program- 
ming [45.51], the program distribution estimation 
with grammar model (PRODIGY) [45.52], or the 
EDA based on probabilistic grammars with latent 
annotations [45.53]. 

• Feature-based models. Feature-based models
  encode the distribution of the neighborhood of a
  candidate solution using position-independent
  substructures, which can be found in a variety of
  positions in fixed-length or variable-length
  solutions. This approach is used in the feature-based
  BOA [45.54]. Other features may be discovered,
encoded, and used for guiding the exploration of 
the space of candidate solutions. Model-directed 
neighborhood structures are also used in other EDA 
variants, as will be discussed in Sect. 45.5.2. 


45.2.2 Classification Based on Local Distributions in Graphical Models


Regardless of how a graphical model decomposes the 
problem, each model must also assume one or more 
classes of distributions to encode local conditional 
and marginal distributions. Some of the most common 
classes of local distributions are discussed below: 


• Probability tables. For discrete representations,
conditional and marginal probabilities can be en- 
coded using probability tables, which define a prob- 
ability for each relevant combination of values in 
each conditional or marginal probability term. This 
was the case, for example, in the simulation in 
Sect. 45.1.3, in which the probability distribution 
for each string position i was represented by the 
probability $p_i$ of a 1; the probability of a 0 in
the same position was simply $1 - p_i$. As another
example, in Bayesian networks, for each variable 
a probability table can be used to define conditional 
probabilities of any value of the variable given any 
combination of values of the variable’s parents. 
While probability tables cannot directly represent 
continuous probability distributions, they can be 
used even for real-valued representations in combi- 
nation with a discretization method that maps real- 
valued variables into discrete categories; each of 
the discrete categories can then be represented using
a single probability entry. Probability tables are
used, for example, in UMDA [45.6], BOA [45.43]
and ECGA [45.45]. An example conditional probability
table is shown in Fig. 45.3.

Fig. 45.3 A conditional probability table for p(X1 | X2, X3, X4) and
a corresponding decision tree that reduces the number of parameters
(probabilities) from 8 to 4

• Decision trees or graphs, default tables. To avoid
excessively large probability tables when many 
probabilities are either similar or negligible, more 
advanced local structures such as decision trees, de- 
cision graphs, or default tables may be used. In 
decision trees, for example, probabilities are stored 
in leaves of a decision tree in which each internal 
node represents a test on a variable and the children 
of the node correspond to the different outcomes of 
the test. Decision trees and decision graphs can also 
be used in combination with real-valued variables, 
in which the leaves store a continuous distribution 
in some way. More advanced structures such as de- 
cision trees and decision graphs are used, for exam- 
ple, in the decision-graph BOA (dBOA) [45.55], the 
hierarchical BOA (hBOA) [45.44], and the mixed 
BOA (mBOA) [45.56]. An example decision tree 
for representing conditional probabilities is shown 
in Fig. 45.3. 

• Multivariate, continuous distributions. The normal
distribution is by far the most popular distribution
used in EDAs to represent univariate
or multivariate distributions of real-valued vari- 
ables. A multivariate normal distribution can en- 
code a linear correlation between the variables 
using the covariance matrix, but it is often ineffi- 
cient in representing many other types of interac- 
tions [45.56, 57]. Normal distributions were used in 
many EDAs for real-valued vectors [45.38, 39, 47, 
48], although in many real-valued EDAs more ad- 
vanced distributions were used as well. Examples 
of multivariate normal distributions are shown in 
Fig. 45.4a-c. 

• Mixtures of distributions. A mixture distribution
consists of multiple components. Each compo- 
nent is represented by a specific local probabilistic 
model, such as a normal distribution, and each 
component is assigned a probability. Mixture dis- 
tributions were used in EDAs especially to en- 
able EDAs for real-valued representations to deal 
with real-valued distributions with multiple basins 
of attraction, in which a single-peak distribution 
does not suffice. Mixture distributions were used, 
for example, in the real-valued iterated density 
estimation algorithms [45.47] or the real-coded 
BOA [45.49]. The use of mixture distributions is 
more popular in EDAs for real-valued representations,
although mixture distributions were also
used to represent distributions over discrete representations
in which the population consists of
multiple dissimilar clusters [45.58] and in multiobjective
EDAs [45.59, 60]. An example of a mixture
of normal kernel distributions is shown in
Fig. 45.4d.

Fig. 45.4a-d Local models for continuous distributions over real-valued variables. (a) Multivariate normal distribution
with equal standard deviations and no covariance, (b) Multivariate normal distribution with arbitrary standard deviations
for each variable (diagonal covariance matrix), (c) Multivariate normal distribution with an arbitrary (nondiagonal)
covariance matrix, (d) Joint normal kernels distribution

• Histograms. In a number of EDAs for real-valued
representations, to encode local distributions, real-valued
variables or sets of such variables are divided
into rectangular regions using a histogram-like
model, and a separate probabilistic model is
used to represent the distribution in each region.
Histogram models can be seen as a special sub- 
class of the decision-tree models for real-valued 
variables. In real-valued EDAs, histograms were 
used, for example, in the histogram-based con- 
tinuous EDA [45.61]. Histogram models can also 
be used for other representations; for example, 
when optimizing permutations, histograms can be 
used to represent different relative ordering con- 
straints and their importance with respect to solu- 
tion quality [45.62, 63]. 
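
To make the difference between probability tables and decision trees concrete, the following minimal Python sketch encodes p(X1 = 1 | X2, X3, X4) once as a four-leaf decision tree and once as the equivalent full table, mirroring the reduction from 8 to 4 parameters mentioned for Fig. 45.3. The function name `tree_prob`, the tree shape, and all probability values are illustrative assumptions, not values taken from the figure.

```python
from itertools import product

def tree_prob(x2, x3, x4):
    """Decision tree for p(X1 = 1 | X2, X3, X4) with four leaves.

    The tree shape and leaf probabilities are made-up illustrative values.
    """
    if x2 == 0:
        return 0.75                      # leaf 1: X3 and X4 are never tested
    if x3 == 0:
        return 0.25                      # leaf 2: X4 is never tested
    return 0.40 if x4 == 0 else 0.90     # leaves 3 and 4

# Expanding the tree yields the equivalent full table: one entry per
# combination of parent values, i.e. 2**3 = 8 stored probabilities.
full_table = {parents: tree_prob(*parents)
              for parents in product((0, 1), repeat=3)}

num_table_params = len(full_table)               # entries in the full table
num_tree_params = len(set(full_table.values()))  # distinct leaves of the tree
```

The table stores repeated values wherever the tree ignores a parent, which is exactly the redundancy that local structures remove.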
45.3 Overview of EDAs


This section gives an overview of EDAs based on the
representation of candidate solutions, although some of
the EDAs can be used across several representations.
Due to the large volume of work on EDAs in the past
two decades, we do not aim to list every single variant
of an EDA discussed in the past; instead, we focus on
some of the most important representatives.


45.3.1 EDAs for Fixed-Length Strings over Finite Alphabets


EDAs for candidate solutions represented by fixed- 
length strings over a finite alphabet can use a vari- 
ety of model types, from simple univariate models to 
complex Bayesian networks with local structures. This 
section reviews some of the work in this area. Candi- 
date solutions are assumed to be represented by binary 
strings of fixed length n, although most methods pre- 
sented here can be extended to optimization of strings 
over an arbitrary finite alphabet. The section classifies 
EDAs based on the order of interactions in the under- 
lying dependency model along the lines discussed in 
Sect. 45.2.1 [45.3, 4, 7]. 


No Interactions 

The EGA [45.33] and the population-based incremental 
learning (PBIL) [45.1] replace the population of candi- 
date solutions represented as fixed-length binary strings 
by a probability vector $(p_1, p_2, \ldots, p_n)$, where n is the
number of bits in a string and $p_i$ denotes the probability
of a 1 in the ith position of solution strings. Each $p_i$ is
initially set to 0.5, which corresponds to a uniform distribution
over the set of all solutions. In each iteration,
PBIL generates s candidate solutions according to the
current probability vector, where $s \geq 2$ denotes the selection
pressure. Each value is generated independently
of its context (remaining bits) and thus no interactions 
are considered (Fig. 45.2a). The best solution from the 
generated set of s solutions is then used to update the 
probability-vector entries using 


$$p_i = p_i + \lambda (x_i - p_i),$$


where $\lambda \in (0, 1)$ is the learning rate (say, 0.02), and $x_i$
is the ith bit of the best solution. Using the above update
rule, the probability $p_i$ of a 1 in the ith position
increases if the best solution contains a 1 in that position 
and decreases otherwise. In other words, probability- 
vector entries move toward the best solution and, consequently,
the probability of generating this solution
increases. The process of generating new solutions and
updating the probability vector is repeated until some 
termination criteria are met; for instance, the run can 
be terminated if all probability-vector entries are suffi- 
ciently close to either 0 or 1. 
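
The PBIL loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the fitness function is OneMax (the number of 1s in the string), and the names `pbil_update` and `pbil_onemax`, the default parameter values, and the 0.05/0.95 termination thresholds are all hypothetical choices made for the example.

```python
import random

def pbil_update(p, best, lam):
    """Shift every probability-vector entry toward the best solution."""
    return [p_i + lam * (x_i - p_i) for p_i, x_i in zip(p, best)]

def pbil_onemax(n=20, s=10, lam=0.02, max_iters=5000, seed=0):
    """Run PBIL on OneMax (assumed example fitness: number of 1s)."""
    rng = random.Random(seed)
    p = [0.5] * n                                  # uniform initial distribution
    for _ in range(max_iters):
        # Sample s candidate solutions, each bit independently of the others.
        samples = [[1 if rng.random() < p_i else 0 for p_i in p]
                   for _ in range(s)]
        best = max(samples, key=sum)               # best of the s samples
        p = pbil_update(p, best, lam)
        # Stop once every entry is close to either 0 or 1 (assumed thresholds).
        if all(p_i < 0.05 or p_i > 0.95 for p_i in p):
            break
    return p
```

Because each update is a convex combination of the old entry and a bit value, the entries always stay within (0, 1) and never reach 0 or 1 exactly.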

Prior work also refers to PBIL as hill
climbing with learning (HCwL) [45.64] and as the incremental
univariate marginal distribution algorithm
(IUMDA) [45.65].

PBIL is an incremental EDA, because it proceeds 
by executing incremental updates of the model using 
a small sample of candidate solutions. However, there is 
a strong correlation between the learning rate in PBIL 
and the population size in population-based EDAs or 
other evolutionary algorithms; essentially, decreasing 
the learning rate $\lambda$ corresponds to increasing the population
size.

The cGA [45.34, 66] reduces the gap between PBIL 
and traditional steady-state genetic algorithms. Like 
PBIL, cGA replaces the population by a probability 
vector and all entries in the probability vector are ini- 
tialized to 0.5. Each iteration updates the probability 
vector by mimicking the effect of a single competition 
between two sampled solutions, where the best replaces 
the worst, in a hypothetical population of size N. Denoting
the bit in the ith position of the best and worst of
the two sampled solutions by $x_i$ and $y_i$, respectively, the
probability-vector entries are updated as follows:


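$$p_i = \begin{cases}
p_i + \dfrac{1}{N} & \text{if } x_i = 1 \text{ and } y_i = 0 \\
p_i - \dfrac{1}{N} & \text{if } x_i = 0 \text{ and } y_i = 1 \\
p_i & \text{otherwise.}
\end{cases}$$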


Although cGA uses the probability vector instead 
of a population, updates of the probability vec- 
tor correspond to replacing one candidate solution 
by another one using a population of size N and 
shuffling the resulting population using a univariate 
model that assumes full independence of problem 
variables. 
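
A minimal Python sketch of the cGA update may help to see how the competition between two sampled solutions moves the probability vector. OneMax is assumed as the fitness function, the names `cga_update` and `cga_onemax` are hypothetical, and the clamping of entries to [0, 1] is a numerical guard added for this example rather than part of the rule stated above.

```python
import random

def cga_update(p, winner, loser, N):
    """One cGA step: shift entries by 1/N where winner and loser differ."""
    updated = []
    for p_i, x_i, y_i in zip(p, winner, loser):
        if x_i == 1 and y_i == 0:
            p_i += 1.0 / N
        elif x_i == 0 and y_i == 1:
            p_i -= 1.0 / N
        # Clamping is a guard added here against floating-point overshoot.
        updated.append(min(1.0, max(0.0, p_i)))
    return updated

def cga_onemax(n=16, N=50, iters=4000, seed=0):
    """Run cGA on OneMax (assumed example fitness: number of 1s)."""
    rng = random.Random(seed)
    p = [0.5] * n
    for _ in range(iters):
        a = [1 if rng.random() < p_i else 0 for p_i in p]
        b = [1 if rng.random() < p_i else 0 for p_i in p]
        # Single competition between the two sampled solutions.
        winner, loser = (a, b) if sum(a) >= sum(b) else (b, a)
        p = cga_update(p, winner, loser, N)
    return p
```

Positions where both solutions agree are left untouched, so only the bits that actually decided the competition influence the model.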

The UMDA [45.6] maintains a population of so- 
lutions. Each iteration of UMDA starts by selecting 
a population of promising solutions using an arbitrary 
selection method of evolutionary algorithms. A prob- 
ability vector is then computed using the selected 
population of promising solutions and new solutions 
are generated by sampling the probability vector. The 
new solutions replace the old ones and the process is
repeated until termination criteria are met. Although
UMDA, unlike PBIL and cGA, uses a probabilistic model as an
intermediate step between the original and new populations,
the performance, dynamics, and limitations
of PBIL, cGA, and UMDA are similar.
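
The select, estimate, and sample cycle of UMDA can be sketched as follows. The sketch assumes OneMax as the fitness function and truncation selection of the better half of the population; the text allows an arbitrary selection method, so this is only one possible choice. The names `estimate_marginals` and `umda_onemax` and all default parameters are hypothetical.

```python
import random

def estimate_marginals(selected):
    """Probability of a 1 in each position of the selected population."""
    m = len(selected)
    return [sum(ind[i] for ind in selected) / m
            for i in range(len(selected[0]))]

def umda_onemax(n=20, pop_size=100, iters=30, seed=1):
    """Run UMDA on OneMax (assumed example fitness: number of 1s)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(iters):
        # Truncation selection (an assumption): keep the better half.
        selected = sorted(pop, key=sum, reverse=True)[: pop_size // 2]
        p = estimate_marginals(selected)
        # New solutions replace the old population entirely.
        pop = [[1 if rng.random() < p_i else 0 for p_i in p]
               for _ in range(pop_size)]
    return max(pop, key=sum)
```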

PBIL, cGA, and UMDA can solve problems decomposable
into subproblems of order 1 in a linear or
quadratic number of fitness evaluations. However, if de- 
composition into single-bit subproblems misleads the 
decision making away from the optimum, these algo- 
rithms scale up poorly with problem size [45.60, 67, 
68]. 


Pairwise Interactions 
EDAs based on pairwise probabilistic models, such 
as a chain, a tree or a forest, represent the first step 
toward EDAs being capable of learning variable inter- 
actions and therefore solving decomposable problems 
of bounded order (difficulty) in a scalable manner. 

The MIMIC algorithm [45.36] uses a chain distri- 
bution (Fig. 45.2b) specified by 


(1) an ordering of string positions (variables), 

(2) a probability of a 1 in the first position of the chain, 
and 

(3) conditional probabilities of every other position 
given the value in the previous position in the chain. 


A chain probabilistic model encodes the probabil- 
ity distribution where all positions except the first are 
conditionally dependent on the previous position in the 
chain. After selecting promising solutions and com- 
puting marginal and conditional probabilities, MIMIC 
uses a greedy algorithm to maximize mutual informa- 
tion between the adjacent positions in the chain. In this 
fashion, the Kullback—Leibler divergence [45.69] be- 
tween the chain and actual distributions is minimized. 
Nonetheless, the greedy algorithm does not guarantee 
global optimality of the constructed model (with respect 
to Kullback—Leibler divergence). The greedy algorithm 
starts in the position with the minimum unconditional 
entropy. The chain is expanded by adding a new posi- 
tion that minimizes the conditional entropy of the new 
variable given the last variable in the chain. Once the 
full chain is constructed for the selected population of 
promising solutions, new solutions are generated by 
sampling the distribution encoded by the chain. The use 
of pairwise interactions was one of the most important 
steps in the development of EDAs capable of solving 
decomposable problems of bounded difficulty scalably. 


MIMIC was the first discrete EDA not only to learn and
use a fixed set of statistics, but also to
identify the statistics that should be considered to
solve the problem efficiently.
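
The greedy chain construction described above can be sketched as follows: start with the position of minimum unconditional entropy and repeatedly append the position with minimum conditional entropy given the last position in the chain. This is a simplified illustration that follows the description in the text; the function names are hypothetical and ties are broken by the lowest position index, which is an assumption.

```python
import math
from collections import Counter

def entropy(counts, total):
    """Shannon entropy (in bits) of an empirical distribution."""
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def marginal_entropy(pop, i):
    return entropy(Counter(x[i] for x in pop).values(), len(pop))

def conditional_entropy(pop, j, i):
    """H(X_j | X_i) = H(X_i, X_j) - H(X_i)."""
    joint = entropy(Counter((x[i], x[j]) for x in pop).values(), len(pop))
    return joint - marginal_entropy(pop, i)

def mimic_chain(pop):
    """Greedy ordering of string positions for the chain model."""
    remaining = list(range(len(pop[0])))
    first = min(remaining, key=lambda i: marginal_entropy(pop, i))
    chain = [first]
    remaining.remove(first)
    while remaining:
        last = chain[-1]
        nxt = min(remaining, key=lambda j: conditional_entropy(pop, j, last))
        chain.append(nxt)
        remaining.remove(nxt)
    return chain
```

For a selected population in which one position is an exact copy of another, the copy has zero conditional entropy and is therefore placed next to its source in the chain.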

Baluja and Davies [45.35] use dependency trees 
(Fig. 45.2c) to model promising solutions. Like in 
PBIL, the population is replaced by a probability vec- 
tor but in this case the probability vector contains all 
pairwise probabilities. The probabilities are initialized 
to 0.25. Each iteration adjusts the probability vector ac- 
cording to new promising solutions acquired on the fly. 
A dependency tree encodes the probability distribution 
where every variable except for the root is condi- 
tioned on the variable’s parent in the tree. A variant 
of Prim’s algorithm for finding the minimum spanning 
tree [45.70] can be used to construct an optimal tree 
distribution. Here the task is to find a tree that maxi- 
mizes mutual information between parents (nodes with 
successors) and their children (successors). This can be 
done by first randomly choosing a variable to form the
root of the tree, and then attaching new variables to the existing
tree so that the mutual information between the
parent of the new variable and the variable itself is maximized.
In this way, the Kullback–Leibler divergence
between the tree and actual distributions is minimized 
as shown in [45.71]. Once a full tree is constructed, 
new solutions are generated according to the distribu- 
tion encoded by the constructed dependency tree and 
the conditional probabilities computed from the proba- 
bility vector. 
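
The tree construction can be sketched with a Prim-style loop that grows the tree from a root and always attaches the outside variable with the largest mutual information to some variable already in the tree. For simplicity, the sketch estimates mutual information directly from a sample of promising solutions rather than from an incrementally updated vector of pairwise probabilities, and it takes the root as a parameter instead of choosing it at random; both are assumptions of this example, as are the function names.

```python
import math
from collections import Counter

def mutual_information(pop, i, j):
    """Empirical mutual information (in bits) between positions i and j."""
    n = len(pop)
    ci = Counter(x[i] for x in pop)
    cj = Counter(x[j] for x in pop)
    cij = Counter((x[i], x[j]) for x in pop)
    return sum(c / n * math.log2(c * n / (ci[a] * cj[b]))
               for (a, b), c in cij.items())

def build_dependency_tree(pop, root=0):
    """Return a dict mapping each variable to its parent (root maps to None)."""
    n_vars = len(pop[0])
    parent = {root: None}
    while len(parent) < n_vars:
        # Best (mutual information, parent, child) edge leaving the tree.
        _, p, c = max((mutual_information(pop, p, c), p, c)
                      for p in parent
                      for c in range(n_vars) if c not in parent)
        parent[c] = p
    return parent
```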

The BMDA [45.40] uses a forest distribution 
(a set of mutually independent dependency trees, see 
Fig. 45.2c). This class of models is even more general 
than the class of dependency trees, because any forest 
that contains two or more disjoint trees cannot be gen- 
erally represented by a tree. As a measure to determine 
whether to connect two variables, BMDA uses Pearson's
chi-square test [45.72]. This measure is also used
to discriminate the remaining dependencies in order to 
construct the final model. To learn a model, BMDA uses 
a variant of Prim’s algorithm [45.70]. 

Pairwise models capture some interactions in 
a problem with reasonable computational overhead. 
EDAs with pairwise probabilistic models can identify, 
propagate, and juxtapose partial solutions of order 2, 
and therefore they work well on problems decompos- 
able into subproblems of order at most two [45.35, 36, 
40, 65, 73]. Nonetheless, capturing only some pairwise 
interactions has still been shown to be insufficient for 
solving all decomposable problems of bounded diffi- 
culty scalably [45.40, 73]. 
Multivariate Interactions
General multivariate models enable powerful
EDAs capable of solving problems of bounded difficulty
quickly, accurately, and reliably [45.4, 5, 8, 22,
37]. On the other hand, learning distributions with mul- 
tivariate interactions necessitates more complex model- 
learning algorithms that require significant computa- 
tional time and still do not guarantee global optimality 
of the resulting model. Nonetheless, many difficult 
problems are intractable using simple models and the 
use of complex models and algorithms is necessary. 

The FDA [45.74] uses a fixed factorized distribution 
throughout the whole run. The model is allowed to con- 
tain multivariate marginal and conditional probabilities, 
but FDA learns only the probabilities, not the structure 
(dependencies and independencies). To solve a problem 
using FDA, we must first decompose the problem and 
then factorize the decomposition. While it is useful to 
incorporate prior information about the regularities in 
the search space, FDA necessitates that the practitioner 
is able to decompose the problem using a probabilistic 
model ahead of time. FDA does not learn what statistics 
are important to process within the EDA framework; it
must be given that information in advance. A variant of
FDA where probabilistic models are restricted to poly- 
trees was also proposed [45.46]. 

The ECGA [45.45] uses an MPM that partitions 
the variables into disjoint subsets (Fig. 45.2d). Each 
partition (subset) is treated as a single variable and dif- 
ferent partitions are considered to be mutually indepen- 
dent. To decide between alternative MPMs, ECGA uses 
a variant of the minimum description length (MDL) 
metric [45.75-77], which favors models that allow 
higher compression of data (in this case, the selected set 
of promising solutions). More specifically, the Bayesian 
information criterion (BIC) [45.78] is used. To find 
a good model, ECGA uses a greedy algorithm that 
starts with each variable forming one partition (like in 
UMDA). Each iteration of the greedy algorithm merges 
two partitions that maximize the improvement of the 
model with respect to BIC. If no more improvement 
is possible, the current model is used. ECGA provides
a robust and scalable solution for problems that can be
decomposed into independent subproblems of bounded
order (separable problems) [45.79-81]. However, many
real-world problems contain overlapping dependencies, 
which cannot be accurately modeled by dividing the 
variables into disjoint partitions; this can result in poor 
performance of ECGA. 
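
The greedy model search used to build an MPM can be sketched as below. The scoring function is a generic BIC-style cost for binary variables (compressed data size plus a penalty of half the log of the sample size per parameter); it is an assumed stand-in and may differ in detail from the metric used in the cited work. The names `group_cost` and `learn_mpm` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def group_cost(pop, group):
    """BIC-style cost of one partition: data cost plus parameter penalty."""
    n = len(pop)
    counts = Counter(tuple(x[i] for i in group) for x in pop)
    data_bits = -sum(c * math.log2(c / n) for c in counts.values())
    n_params = 2 ** len(group) - 1          # binary variables assumed
    return data_bits + 0.5 * math.log2(n) * n_params

def learn_mpm(pop):
    """Greedy merging: start from singletons, merge while the cost drops."""
    groups = [(i,) for i in range(len(pop[0]))]
    while len(groups) > 1:
        best_gain, best_pair = 0.0, None
        for a, b in combinations(groups, 2):
            gain = (group_cost(pop, a) + group_cost(pop, b)
                    - group_cost(pop, a + b))
            if gain > best_gain:
                best_gain, best_pair = gain, (a, b)
        if best_pair is None:               # no merge improves the model
            break
        a, b = best_pair
        groups = ([g for g in groups if g not in (a, b)]
                  + [tuple(sorted(a + b))])
    return sorted(groups)
```

On a sample in which two positions always agree and a third is independent of them, merging the two correlated positions lowers the cost, while merging in the independent one only adds parameters.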

The dependency-structure matrix genetic algorithm
(DSMGA) [45.82-84] uses a class of models similar to that of
ECGA, splitting the variables into independent clusters
or linkage groups. However, DSMGA builds models via
dependency structure matrix clustering techniques.

The BOA [45.42] builds a Bayesian network for the 
population of promising solutions (Fig. 45.2e) and sam- 
ples the built network to generate new candidate solu- 
tions. BOA uses the Bayesian—Dirichlet metric subject 
to a maximum model-complexity constraint [45.85- 
87] to discriminate competing models, but other met- 
rics (such as BIC) have been analyzed in BOA as 
well [45.88]. In all variants of BOA, the model is con- 
structed by a greedy algorithm that iteratively adds 
a new dependency in the model that maximizes the 
model quality. Other elementary graph operators — such 
as edge removals and reversals — can be incorporated, 
but edge additions are most important. The construction 
is terminated when no more improvement is possible. 
The greedy algorithm used to learn a model in BOA is 
similar to the one used in ECGA. However, Bayesian 
networks can encode more complex dependencies and 
independencies than models used in ECGA. Therefore, 
BOA is also applicable to problems with overlapping 
dependencies. BOA uses a class of models equivalent to that
of FDA; however, BOA learns both the structure
and the probabilities of the model. Although BOA 
does not require problem-specific knowledge in ad- 
vance, prior information about the problem can be 
incorporated using Bayesian statistics, and the relative 
influence of prior information and the population of 
promising solutions can be tuned by the user [45.89, 
90]. 

A discussion of the use of Bayesian networks 
as an extension to tree models can also be found 
in Baluja’s and Davies’ work [45.91]. An EDA that 
uses Bayesian networks to model promising solutions 
was independently developed by Etxeberria and Larrañaga
[45.41], who called it the EBNA. Mühlenbein
and Mahnig [45.37] improved the original FDA by
using Bayesian networks together with the greedy algorithm
for learning the networks described above; the
modification of FDA was named the learning FDA (LFDA). An incremental
version of BOA, the incremental BOA (iBOA),
was proposed by Pelikan et al. [45.92].

The hierarchical BOA (hBOA) [45.44] extends 
BOA by employing local structures to represent lo- 
cal distributions instead of using standard conditional 
probability tables. This enables hBOA to more effi- 
ciently represent distributions with high-order interac- 
tions. Furthermore, hBOA incorporates a niching tech- 
nique called restricted tournament selection [45.93] to 
ensure effective diversity preservation. The two extensions
enable hBOA to solve problems decomposable
into subproblems of bounded order over a number of
levels of difficulty of a hierarchy [45.44, 94].

Markov networks are yet another class of models 
that can be used to identify and use multivariate in- 
teractions in EDAs. Markov networks are undirected 
graphical models (Fig. 45.2f). Compared to Bayesian 
networks, Markov networks may sometimes cover the 
same distribution using fewer edges in the dependency 
model, but the sampling of these models becomes 
more complicated than the sampling of Bayesian net- 
works. Markov networks are used, for example, in the 
Markov network EDA (MN-EDA) [45.95] and the den- 
sity estimation using Markov random fields algorithm 
(DEUM) [45.96, 97]. 

Helmholtz machines used in the Bayesian evolu- 
tionary algorithm proposed by Zhang and Shin [45.98] 
can also encode multivariate interactions. Helmholtz 
machines encode interactions by introducing new, hid- 
den variables, which are connected to every variable. 

EDAs that use models capable of covering multi- 
variate interactions can solve a wide range of prob- 
lems in a scalable manner; promising results were 
reported on a broad range of problems, including 
several classes of spin-glass systems [45.22, 99-101], 
graph partitioning [45.90, 102, 103], telecommunica- 
tion network optimization [45.104], silicon cluster op- 
timization [45.80], scheduling [45.105], forest man- 
agement [45.13], ground water remediation system 
design [45.106, 107], multiobjective knapsack [45.20], 
and others. 


45.3.2 EDAs for Real-Valued Vectors 


There are two basic approaches to extending EDAs for 
discrete, fixed-length strings to other domains such as 
real-valued vectors: 


• Map the other representation to the domain of fixed-length
discrete strings, solve the discrete problem,
and map the solution back to the problem’s original 
representation. 

• Extend or modify the class of probabilistic models
to other domains. 


A number of studies have been published about the 
mapping of real-valued representations into a discrete 
one in evolutionary computation [45.26, 108-111]; this
section focuses on EDAs from the second category. The 
approaches are classified along the lines presented in 
Sect. 45.2 [45.7, 22]. 


Single-Peak Normal Distributions 

The stochastic hill climbing with learning by vec- 
tors of normal distributions (SHCLVND) [45.38] is 
a straightforward extension of PBIL to vectors of 
real-valued variables using a normal distribution to 
model each variable. SHCLVND replaces the population
of real-valued solutions by a vector of means $\mu =
(\mu_1, \ldots, \mu_n)$, where $\mu_i$ denotes the mean of the distribution
for the ith variable. The same standard deviation $\sigma$
is used for all variables. See Fig. 45.4a for an example
model. In each generation (iteration), a random set of
solutions is first generated according to $\mu$ and $\sigma$. The
best solution out of this set is then used to update
the entries in $\mu$ by shifting each $\mu_i$ toward the value
of the ith variable in the best solution using an update
rule similar to the one used in PBIL. Additionally, each
generation reduces the standard deviation to make the 
future exploration of the search space narrower. A sim- 
ilar algorithm was independently developed by Sebag 
and Ducoulombier [45.39], who also discussed several 
approaches to evolving a standard deviation for each 
variable. 
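
A minimal Python sketch of the SHCLVND loop follows. It assumes a minimization problem, a PBIL-style shift of the means with a learning rate `lam`, and a geometric reduction of the shared standard deviation by a factor `shrink` per generation; the function name, the parameter names, and the default values are all hypothetical choices for illustration.

```python
import random

def shclvnd(f, n=3, samples=20, lam=0.1, sigma=1.0, shrink=0.98,
            iters=300, seed=0):
    """Minimize f over real vectors of length n (assumed setup)."""
    rng = random.Random(seed)
    mu = [0.0] * n                       # vector of means, one per variable
    for _ in range(iters):
        # Sample candidates from independent normals sharing one sigma.
        cand = [[rng.gauss(m, sigma) for m in mu] for _ in range(samples)]
        best = min(cand, key=f)
        # Shift each mean toward the best solution, as in the PBIL rule.
        mu = [m + lam * (b - m) for m, b in zip(mu, best)]
        sigma *= shrink                  # narrow future exploration
    return mu, sigma
```

For example, `shclvnd(lambda x: sum((v - 1.0) ** 2 for v in x))` searches for the minimum of a shifted sphere function.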


Mixtures of Normal Distributions 
The probability density function of a normal distribution
is centered around its mean and decreases exponentially
with the squared distance from the mean. If there are
multiple clouds of values, a normal distribution must ei- 
ther focus on only one of these clouds, or it can embrace 
multiple clouds at the expense of including the low- 
density area between them. In both cases, the resulting 
distribution cannot model the data accurately. One way 
of extending standard single-peak normal-distribution 
models to enable coverage of multiple groups of sim- 
ilar points is to use a mixture of normal distributions. 
Each component of the mixture of normal distributions 
is a normal distribution by itself. A coefficient is spec- 
ified for each component of the mixture to denote the 
probability that a random point belongs to this compo- 
nent. The probability density function of a mixture is 
thus computed by multiplying the density function of 
each mixture component by the probability that a ran- 
dom point belongs to the component, and adding these 
weighted densities together. 
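
The weighted sum described above can be written directly in code. The sketch below covers the univariate case only; the function names and the `(weight, mean, standard deviation)` tuple layout of a component are assumptions made for this example.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a univariate normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, components):
    """Density of a mixture; components are (weight, mu, sigma) tuples.

    The weights are assumed to be nonnegative and to sum to 1.
    """
    return sum(w * normal_pdf(x, mu, sigma) for w, mu, sigma in components)

# Two well-separated clouds of values, each covered by its own component.
bimodal = [(0.5, -3.0, 1.0), (0.5, 3.0, 1.0)]
```

With the `bimodal` example, the density is high near each of the two means and low in the region between them, which a single normal distribution cannot reproduce.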

Gallagher et al. [45.112, 113] extended EDAs
based on single-peak normal distributions by using an 
adaptive mixture of normal distributions to model each 
variable. The parameters of the mixture (including the 
number of components) evolve based on the discov- 
ered promising solutions. Using mixture distributions 
is a significant improvement compared to single-peak 
normal distributions, because mixtures allow simultaneous
exploration of multiple basins of attraction for
each variable. 

Within the IDEA framework, Bosman and 
Thierens [45.47] proposed IDEAs using the joint 
normal kernels distribution, where a single normal 
distribution is placed around each selected solution 
(Fig. 45.4d). A joint normal kernels distribution 
can be therefore seen as an extreme use of mixture 
distributions with one mixture component per point 
in the training sample. The variance of each normal 
distribution can be fixed to a relatively small value, but 
it is preferable to adapt variances according to
the current state of search. Using kernel distributions 
corresponds to using a fixed zero-mean normally 
distributed mutation for each promising solution as 
is often done in evolution strategies [45.114]. That 
is why it is possible to directly take up strategies for 
adapting the variance of each kernel from evolution 
strategies [45.114-117]. 
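
Sampling from a joint normal kernels distribution is simple to sketch: pick one of the selected solutions uniformly at random and perturb it with zero-mean normal noise. The sketch assumes a single fixed standard deviation `sigma` shared by all kernels and all variables; the function name is hypothetical.

```python
import random

def sample_kernel_distribution(selected, sigma, count, rng):
    """Draw `count` points from equally weighted normal kernels."""
    new_points = []
    for _ in range(count):
        center = rng.choice(selected)    # each kernel has weight 1/len(selected)
        # Zero-mean normal perturbation of the chosen solution.
        new_points.append([rng.gauss(c, sigma) for c in center])
    return new_points
```

A call such as `sample_kernel_distribution([[0.0, 0.0], [10.0, 10.0]], 0.01, 50, random.Random(0))` yields points clustered tightly around the two selected solutions, which makes the analogy with a fixed mutation operator explicit.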


Joint Normal Distributions and Their Mixtures 
What changes when instead of fitting each variable 
with a separate normal distribution or a mixture of nor- 
mal distributions, groups of variables are considered 
together? Let us first consider using a single-peak nor- 
mal distribution. In multivariate domains, a joint normal 
distribution can be defined by a vector of n means 
(one mean per variable) and a covariance matrix of 
size $n \times n$. Diagonal elements of the covariance matrix
specify the variances for all variables, whereas nondi- 
agonal elements specify linear dependencies between 
pairs of variables. Considering each variable separately 
corresponds to setting all nondiagonal elements in a co- 
variance matrix to 0. Using different deviations for 
different variables allows for squeezing or stretching 
the distribution along the axes. On the other hand, us- 
ing nondiagonal entries in the covariance matrix allows 
rotating the distribution around its mean. Figures 45.4b
and c illustrate the difference between a joint normal
distribution using only diagonal elements of the covari- 
ance matrix and a distribution using the full covariance 
matrix. Therefore, using a covariance matrix introduces 
another degree of freedom and improves the expressive- 
ness of a distribution. Again, one can use a number 
of joint normal distributions in a mixture, where each 
component consists of its mean, covariance matrix, and 
weight. 
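
As an illustration of a single joint normal model with a full covariance matrix, the following sketch estimates the mean vector and the maximum-likelihood covariance matrix from a set of points and draws new points via a Cholesky factorization. It uses only the standard library; the function names are hypothetical, and the factorization assumes a positive definite covariance matrix.

```python
import math
import random

def estimate_gaussian(points):
    """Maximum-likelihood mean vector and covariance matrix."""
    n, d = len(points), len(points[0])
    mean = [sum(p[i] for p in points) / n for i in range(d)]
    cov = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / n
            for j in range(d)] for i in range(d)]
    return mean, cov

def cholesky(a):
    """Lower-triangular L with L L^T = a (a must be positive definite)."""
    d = len(a)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def sample_gaussian(mean, cov, rng):
    """Draw one point: mean + L z, with z a vector of standard normals."""
    L = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    return [m + sum(L[i][k] * z[k] for k in range(i + 1))
            for i, m in enumerate(mean)]
```

Setting the nondiagonal entries of `cov` to zero reduces this model to one independent normal distribution per variable.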

A joint normal distribution including a full or 
partial covariance matrix was used within the IDEA 
framework [45.47] and in the estimation of Gaussian
networks algorithm (EGNA) [45.48]. Both these algorithms
can be seen as extensions of EDAs that model
each variable by a single normal distribution; the extensions
also allow the use of nondiagonal elements of the covariance
matrix.

Bosman and Thierens [45.118] proposed mixed 
IDEAs as an extension of EDAs that use a mixture 
of normal distributions to model each variable. Mixed 
IDEAs allow multiple variables to be modeled by a sep- 
arate mixture of joint normal distributions. At one 
extreme, each variable can have a separate mixture; 
at another extreme, one mixture of joint distributions 
covering all the variables is used. Although learning
such a general class of distributions is quite difficult and
requires a large number of samples for reasonable
accuracy, good results were reported on single-objective
[45.118] as well as multiobjective problems [45.59,
119, 120]. Using mixture models for all variables was 
also proposed as a technique for reducing model com- 
plexity in discrete EDAs [45.58]. 

Real-valued EDAs presented so far are applicable 
to real-valued optimization problems without requiring 
differentiability or continuity of the underlying prob- 
lem. However, if it is possible to at least partially 
differentiate the problem, gradient information can be 
used to incorporate some form of gradient-based local 
search and the performance of real-valued EDAs can 
be significantly improved. A study on combining real- 
valued EDAs within the IDEA framework with gra- 
dient-based local search can be found, for example, 
in [45.121]. 

One of the crucial limitations of using estimation 
of real-valued distributions is that real-valued EDAs 
have a tendency to lose diversity too fast even when the 
problem is relatively easy to solve [45.122]; for exam- 
ple, maximum likelihood estimation and sampling of 
a normal distribution will lead to diversity loss even 
while climbing a simple linear slope. That is why sev- 
eral EDAs were proposed that aim to control variance 
of the probabilistic model so that the loss of variance is 
avoided and yet the effective exploration is not ham- 
pered by an overly large variance of the model. For 
example, the adapted maximum-likelihood Gaussian 
model iterated density-estimation evolutionary algo- 
rithm (AMaLGaM) scales up the covariance matrix 
to prevent premature convergence on slopes [45.123, 
124]. 
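
The diversity loss on a linear slope is easy to reproduce in a small experiment. The sketch below maximizes f(x) = x in one dimension with a single normal distribution, truncation selection, and maximum-likelihood re-estimation of the mean and standard deviation, without any variance control. The function name, the selection scheme, and the parameter values are assumptions chosen for illustration.

```python
import random
import statistics

def gaussian_eda_on_slope(pop_size=200, keep=0.3, iters=30, seed=0):
    """Maximize f(x) = x with a 1-D normal model and no variance scaling."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    history = [sigma]
    for _ in range(iters):
        pop = [rng.gauss(mu, sigma) for _ in range(pop_size)]
        # Truncation selection: keep the largest values (best on the slope).
        selected = sorted(pop, reverse=True)[: int(keep * pop_size)]
        mu = statistics.fmean(selected)
        sigma = statistics.pstdev(selected)   # maximum-likelihood estimate
        history.append(sigma)
    return mu, history
```

In such a run the standard deviation recorded in `history` shrinks from generation to generation, so the mean stalls at a finite value even though the slope continues indefinitely. Scaling up the variance when improvements are found far from the mean is one way to counteract this effect.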


Other Real-Valued EDAs 
Using normal distributions is not the only approach to 
modeling real-valued distributions. Other density functions
are frequently used to model real-valued probability
distributions, including histogram distributions,
interval distributions, and others. A brief review of real-valued
EDAs that use distributions other than normal distributions or
their mixtures follows.

In the algorithm proposed by Servet et al. [45.125],
an interval (a_i, b_i) and a number z_i ∈ (0, 1) are stored
for each variable. By z_i, the probability that the ith
variable is in the upper half of (a_i, b_i) is denoted. Each
z_i is initialized to 0.5. To generate a new candidate
solution, the value of each variable is selected randomly
from the corresponding interval. The best solution is then
used to update the value of each z_i. If the value of the
ith variable of the best solution is in the lower half of
(a_i, b_i), z_i is shifted toward 0; otherwise, z_i is
shifted toward 1. When z_i gets close to 0, interval
(a_i, b_i) is reduced to its lower half; if z_i gets close
to 1, interval (a_i, b_i) is reduced to its upper half.
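
The interval-halving scheme described above can be sketched as follows. This is an illustration under stated assumptions rather than the published algorithm: the learning rate, the threshold that decides when z is "close" to 0 or 1, the reset of z to 0.5 after each halving, and the function name are all hypothetical choices, and the objective is minimized.

```python
import random


def interval_eda(f, bounds, pop_size=50, generations=200,
                 rate=0.1, threshold=0.1, seed=0):
    """Minimize f over a box using one interval and one z value per variable."""
    rng = random.Random(seed)
    lo = [a for a, _ in bounds]
    hi = [b for _, b in bounds]
    n = len(bounds)
    z = [0.5] * n                          # each z starts at 0.5
    best, best_val = None, float("inf")
    for _ in range(generations):
        # Each variable is drawn uniformly from its current interval.
        population = [[rng.uniform(lo[i], hi[i]) for i in range(n)]
                      for _ in range(pop_size)]
        gen_best = min(population, key=f)
        if f(gen_best) < best_val:
            best, best_val = gen_best, f(gen_best)
        for i, x in enumerate(gen_best):
            mid = 0.5 * (lo[i] + hi[i])
            # Shift z toward 0 if the best value lies in the lower half,
            # toward 1 otherwise.
            target = 0.0 if x < mid else 1.0
            z[i] += rate * (target - z[i])
            if z[i] < threshold:           # close to 0: keep the lower half
                hi[i], z[i] = mid, 0.5
            elif z[i] > 1.0 - threshold:   # close to 1: keep the upper half
                lo[i], z[i] = mid, 0.5
    return best, best_val, list(zip(lo, hi))
```

A call such as `interval_eda(lambda x: sum(v * v for v in x), [(-1.0, 1.0)] * 2)` returns the best solution found, its objective value, and the final intervals.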

EDAs proposed in [45.47, 126] use empirical his- 
tograms to model each variable as opposed to using 
a single normal distribution or a mixture of normal dis- 
tributions. In these approaches, a histogram for each 
single variable is constructed. New points are then gen- 
erated according to the distribution encoded by the 
histograms for all variables. The sampling of a his- 
togram proceeds by first selecting a particular bin based 
on its relative frequency, and then generating a ran- 
dom point from the interval corresponding to the bin. 
It is straightforward to replace the histograms in the 
above methods by various classification and discretiza- 
tion methods of statistics and machine learning (such as 
k-means clustering) [45.108]. 
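
The two-step histogram sampling described above can be illustrated with a short sketch. It assumes equal-width bins over a known range [lo, hi] and values that lie inside that range; the function names are hypothetical.

```python
import random


def build_histogram(values, lo, hi, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        # Values equal to hi are assigned to the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


def sample_histogram(counts, lo, hi, rng):
    """Pick a bin by relative frequency, then a uniform point inside it."""
    n_bins = len(counts)
    width = (hi - lo) / n_bins
    idx = rng.choices(range(n_bins), weights=counts, k=1)[0]
    left = lo + idx * width
    return rng.uniform(left, left + width)
```

For a multivariate problem, one such histogram would be built and sampled independently for each variable.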

Pelikan et al. [45.111, 127] use an adaptive mapping
from the continuous domain to the discrete one
in combination with discrete EDAs. The population 
of promising solutions is first discretized using equal- 
width histograms, equal-height histograms, k-means 
clustering, or other classification techniques. A popu- 
lation of promising discrete solutions is then selected. 
New points are created by applying a discrete recombi- 
nation operator to the selected population of promising 
discrete solutions. For example, new solutions can be 
generated by building and sampling a Bayesian net- 
work like in BOA. The resulting discrete solutions are 
then mapped back into the continuous domain by sam- 
pling each class (a bin or a cluster) using the original 
values of the variables in the selected population of con- 
tinuous solutions (before discretization). The resulting 
solutions are perturbed using one of the adaptive mu- 
tation operators of evolution strategies [45.114-117]. 
In this way, competent discrete EDAs can be combined
with advanced methods based on adaptive local
search in the continuous domain. A related approach
was proposed by Chen and Chen [45.109], who introduced
a split-on-demand adaptive discretization method
for use in combination with ECGA and reported promising
results on several benchmarks and one real-world
problem.

The mixed Bayesian optimization algorithm 
(mBOA) developed by Ocenasek and Schwarz [45.56] 
models vectors of real-valued variables using an 
extension of Bayesian networks with local structures. 
A model used in mBOA consists of a decision tree for 
each variable. Each internal node in the decision tree 
for a variable is a test on the value of another variable. 
Each test on a variable is specified by a particular value, 
which is also included in the node. The test considers 
two cases: the value of the variable is greater than or
equal to the value in the node, or it is smaller. Each internal
node has two children, each child corresponding to 
one of the two results of the test specified in this node. 
Leaves in a decision tree thus correspond to rectangular 
regions in the search space. For each leaf, the decision 
tree for the variable specifies a single-variable mixture 
of normal distributions centered around the values 
of this variable in the solutions consistent with the 
path to the leaf. Thus, for each variable, the model in 
mBOA divides the space reduced to other variables 
into rectangular regions, and it uses a single-variable
normal-kernel distribution to model the variable in
each region. The adaptive variant of mBOA (amBOA)
[45.128] extends mBOA by employing variance
adaptation with the goal of maximizing effective- 
ness of the search for the optimum on real-valued 
problems. 


45.3.3 EDAs for Genetic Programming 


In genetic programming [45.129], the task is to solve 
optimization problems with candidate solutions repre- 
sented by labeled trees that encode computer programs 
or symbolic expressions. Internal nodes of a tree repre- 
sent functions or commands; leaves represent functions 
with no arguments, variables, and constants. There are 
two key challenges that one must deal with when apply- 
ing EDAs to genetic programming. Firstly, the length 
of programs is expected to vary and it is difficult to es- 
timate how large the solution will be without solving 
the problem first. Secondly, small changes in parent- 
child relationships often lead to large changes in the 
performance of a candidate solution, and often the relationship
between nodes in the program trees is more
important than their actual position. Despite these challenges,
even in this problem domain, EDAs have been
quite successful. In this section, we briefly outline some 
EDAs for genetic programming. 

The probabilistic incremental program evolution 
(PIPE) algorithm [45.130, 131] uses a probabilistic
model in the form of a tree of a specified maximum al- 
lowable size. Nodes in the model specify probabilities 
of functions and terminals. PIPE does not capture any 
interactions between the nodes in the model. The model 
is updated by adjusting the probabilities based on the 
population of selected solutions using an update rule 
similar to the one in PBIL [45.1]. New program trees 
are generated in a top-down fashion starting in the root 
and continuing to lower levels of the tree. More specif- 
ically, if the model generates a function in a node and 
that function requires additional arguments, the succes- 
sors (children) of the node are generated to form the 
arguments of the function. If a terminal is generated, the 
generation along this path terminates. An extension of 
PIPE named hierarchical probabilistic incremental pro- 
gram evolution (H-PIPE) was later proposed [45.132]. 
In H-PIPE, nodes of a model are allowed to contain sub- 
routines, and both the subroutines as well as the overall 
program are evolved. 
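
The top-down generation of program trees from a tree-shaped model can be sketched as follows. This is a simplified illustration, not the PIPE implementation: the function and terminal sets are hypothetical, model nodes are created lazily with uniform probabilities, and the update is a plain PBIL-style shift toward one selected program rather than the exact rule used in PIPE.

```python
import random

FUNCTIONS = {"+": 2, "*": 2, "neg": 1}   # hypothetical functions: name -> arity
TERMINALS = ["x", "1"]                   # hypothetical terminals


def node_distribution(model, path, depth, max_depth):
    """Return the symbol probabilities of a model node, creating it if needed."""
    if path not in model:
        # At the maximum depth only terminals are allowed.
        symbols = TERMINALS if depth >= max_depth else TERMINALS + list(FUNCTIONS)
        model[path] = {s: 1.0 / len(symbols) for s in symbols}
    return model[path]


def sample_program(model, rng, max_depth=4, path=(), depth=0):
    """Sample a program tree from the root downward."""
    dist = node_distribution(model, path, depth, max_depth)
    symbol = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
    arity = FUNCTIONS.get(symbol, 0)     # terminals have arity 0
    children = [sample_program(model, rng, max_depth, path + (i,), depth + 1)
                for i in range(arity)]
    return (symbol, *children) if children else symbol


def update_model(model, program, rate=0.1, path=()):
    """Shift node probabilities toward the symbols used by a selected program."""
    symbol = program[0] if isinstance(program, tuple) else program
    dist = model[path]
    for s in dist:
        dist[s] += rate * ((1.0 if s == symbol else 0.0) - dist[s])
    if isinstance(program, tuple):
        for i, child in enumerate(program[1:]):
            update_model(model, child, rate, path + (i,))
```

Each model node is addressed by the path of child indices leading to it, so every node keeps its own distribution and no interactions between nodes are represented.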

Handley [45.133] used tree probabilistic models to 
represent populations of programs (trees) in genetic 
programming. Although the goal of this work was to 
compress the population of computer programs in ge- 
netic programming, Handley’s approach can be used 
within the EDA framework to model and sample can- 
didate solutions represented by computer programs or 
symbolic expressions. A similar model was used in es- 
timation of distribution programming (EDP) [45.134], 
which extended PIPE by employing parent-child de- 
pendencies in candidate labeled trees. Specifically, in 
EDP the content of each node is conditioned on the 
node’s parent. 

The extended compact genetic programming 
(ECGP) [45.135] assumes a maximum tree of max- 
imum branching like PIPE. Nonetheless, ECGP 
uses an MPM which partitions nodes into clusters 
of strongly correlated nodes. This allows ECGP to 
capture and exploit interactions between nodes in 
program trees, and solve problems that are difficult for 
conventional genetic programming and PIPE. There 
are three main characteristics that distinguish ECGP from
EDP. ECGP is able to capture dependencies between
more than two nodes, it learns the dependency structure
based on the promising candidate trees, and it is not
restricted to the dependencies between parents and
their children. On the other hand, ECGP is somewhat
limited in its ability to efficiently encode long-range 
interactions compared to probabilistic models that 
do not assume that groups of variables must be fully 
independent of each other. 

Looks et al. [45.136] proposed to use Bayesian
networks to model and sample program trees. Com- 
binatory logic is used to represent program trees in 
a unified manner. Program trees translated with combi- 
natory logic are then modeled with Bayesian networks 
of BOA, EBNA, and LFDA. Contrary to most other 
EDAs for genetic programming presented in this sec- 
tion, in the approach of Looks et al. the size of computer 
programs is not limited, but solutions are allowed to 
grow over time. Looks later developed a more power- 
ful framework for competent program evolution using 
EDAs, which was named meta-optimizing semantic 
evolutionary search (MOSES) [45.54, 137, 138]. The 
key facets of MOSES include the division of the pop- 
ulation into demes, the reduction of the problem of 
evolving computer programs to the one of building 
a representation with tunable features (knobs), and the 
use of hierarchical BOA [45.44] or another competent 
evolutionary algorithm to model demes and sample new 
candidate program solutions. 

Several EDAs for genetic programming used proba- 
bilistic models based on grammar rules [45.51, 52, 139, 
140]. Most grammar-based EDAs for genetic programming
use a context-free grammar. The stochastic grammar-based
genetic programming (SG-GP) [45.140]
started with a fixed context-free grammar with
a default probability for each rule; the probabilities at- 
tached to the different rules were gradually adjusted 
based on the best candidate programs. The program 
evolution with explicit learning (PEEL) [45.139] used 
a probabilistic L-system with rules applicable at spe- 
cific depths and locations; the probabilities of the rules 
were adapted using a variant of ant colony optimiza- 
tion (ACO) [45.141]. Another grammar-based EDA for 
genetic programming was proposed by Bosman and de 
Jong [45.51], who used a context-free grammar that 
was initialized to a minimum stochastic context-free 
grammar and adjusted to better fit promising candidate 
solutions by expanding rules and incorporating depth 
information into the rules. Grammar model-based pro- 
gram evolution (GMPE) [45.52, 142] also uses a proba- 
bilistic context-free grammar. In GMPE, new rules are 
allowed to be created and old rules may be eliminated 
from the model. A variant of the minimum-message- 
length metric is used in GMPE to compare grammars 
according to their quality. Tanev [45.143] incorporated
stochastic context-sensitive grammars into
grammar-guided genetic programming [45.144-146].


45.3.4 EDAs for Permutation Problems 


In many problems, candidate solutions are most nat- 
urally represented by permutations. This is the case, 
for example, in many scheduling or facility location 
problems. These types of problems often contain two 
specific types of features or constraints that EDAs need 
to capture. The first is the absolute position of a symbol 
in a string and the second is the relative ordering of spe- 
cific symbols. In some problems, such as the traveling- 
salesman problem, relative ordering constraints matter 
the most. In others, such as the QAP, both the relative 
ordering and the absolute positions matter. 

One approach to permutation problems is to apply 
an EDA for problems not involving permutations in 
combination with a mapping function between the EDA 
representation and the admissible permutations. For ex- 
ample, one may use the random key encoding [45.147] 
to transfer the problem of finding a good permutation 
into the problem of finding a high-quality real-valued 
vector, allowing the use of EDAs for optimization of 
real-valued vectors in solving permutation-based prob- 
lems [45.148, 149]. Random key encoding represents 
a permutation as a vector of real numbers. The permu- 
tation is defined by the reordering of the values in the 
vector that sorts the values in ascending order. The main 
advantage of using random keys is that any real-valued 
vector defines a valid permutation and any EDA capable 
of solving problems defined on vectors of real numbers 
can thus be used to solve permutation problems. However,
since these EDAs do not process the aforementioned
types of regularities in permutation problems directly,
their performance can often be poor [45.148, 150]. That
is why several EDAs were developed that aim to encode 
either type of constraints for permutation problems ex- 
plicitly. 
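
The decoding step of the random key encoding is short enough to show in full. In this sketch the function name is hypothetical, and ties between equal keys are broken by position, which is one possible convention.

```python
def random_keys_to_permutation(keys):
    """Return the index order that sorts the keys in ascending order."""
    return sorted(range(len(keys)), key=lambda i: keys[i])


# The smallest key is at index 1, the next at index 2, the largest at index 0.
print(random_keys_to_permutation([0.7, 0.1, 0.4]))  # prints [1, 2, 0]
```

Because any real-valued vector decodes to some valid permutation, no repair step is needed after sampling.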

To solve problems where candidate solutions are 
permutations of a string, Bengoetxea et al. [45.151] 
start with a Bayesian network model built using the 
same approach as in EBNA [45.41]. However, the sam- 
pling method is changed to ensure that only valid 
permutations are generated. This approach was shown 
to have promise in solving the inexact graph match- 
ing problem. In much the same way, the dependency- 
tree EDA (dtEDA) of Pelikan et al. [45.152] starts with 
a dependency-tree model [45.35, 71] and modifies the
sampling to ensure that only valid permutations are 


generated. dtEDA for permutation problems was used 
to solve structured QAPs with great success [45.152]. 
Bayesian networks and tree models are capable of 
encoding both the absolute position and the relative or- 
dering constraints, although for some problem types, 
such models may turn out to be rather inefficient. 

Bosman and Thierens [45.148] extended the real- 
valued EDA to the permutation domain by storing the 
dependencies between different positions in a permu- 
tation in the induced chromosome element exchanger 
(ICE). ICE works by first using a real-valued EDA, 
which encodes permutations as real-valued vectors us- 
ing the random keys encoding. ICE extends the real- 
valued EDA by using a specialized crossover operator. 
By applying the crossover directly to permutations in- 
stead of simply sampling the model, relative ordering is 
taken into account. The resulting algorithm was shown 
to outperform many real-valued EDAs that use the ran- 
dom key encoding alone [45.148]. 

The edge-histogram-based sampling algorithm 
(EHBSA) [45.63, 153] works by creating an edge his- 
togram matrix (EHM). For each pair of symbols, EHM 
stores the probabilities that one of these symbols will 
follow the other one in a permutation. To generate new 
solutions, EHBSA starts with a randomly chosen sym- 
bol. EHM is then sampled repeatedly to generate new 
symbols in the solution, normalizing the probabilities 
based on what values have already been generated. 
EHM does not take into account absolute positions 
at all; in order to address problems in which abso- 
lute positions are important, EHBSA was extended 
to use templates [45.153]. To generate new solutions, 
first a random string from the population was picked 
as a template. New solutions were then generated by 
removing random parts of the template string and gen- 
erating the missing parts by sampling from EHM. The 
resulting algorithm was shown to be better than most 
other EDAs on the traveling salesman problem. In 
another study, the node-histogram based sampling algo- 
rithm (NHBSA) was proposed by Tsutsui et al. [45.63], 
which used a model capable of storing node frequencies 
in each position (thereby encoding absolute position 
constraints) and also used a template. 
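
The basic edge-histogram idea, without templates, can be sketched as follows. Several details are assumptions made for illustration: the matrix stores counts rather than normalized probabilities (normalization happens implicitly at sampling time), edges are treated as symmetric and permutations as cyclic, and a small constant epsilon keeps unseen edges possible. The function names are hypothetical.

```python
import random


def build_ehm(population, epsilon=0.1):
    """Edge histogram matrix: how often two symbols are adjacent."""
    n = len(population[0])
    ehm = [[0.0 if i == j else epsilon for j in range(n)] for i in range(n)]
    for perm in population:
        # Pair every symbol with its successor, wrapping around at the end.
        for a, b in zip(perm, perm[1:] + perm[:1]):
            ehm[a][b] += 1.0
            ehm[b][a] += 1.0
    return ehm


def sample_permutation(ehm, rng):
    """Build a permutation symbol by symbol from the edge histogram."""
    n = len(ehm)
    current = rng.randrange(n)            # randomly chosen first symbol
    perm = [current]
    remaining = set(range(n)) - {current}
    while remaining:
        candidates = sorted(remaining)
        # Restrict the row of the current symbol to symbols not yet used.
        weights = [ehm[current][c] for c in candidates]
        current = rng.choices(candidates, weights=weights, k=1)[0]
        perm.append(current)
        remaining.remove(current)
    return perm
```

Permutations are expected as lists of the integers 0 to n-1. Restricting the weights to unused symbols guarantees that every sampled string is a valid permutation.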

Zhang et al. [45.154-156] proposed to use
guided mutation to optimize both permutation problems
[45.154] and graph problems [45.156]. In
guided mutation, the parts of the solution that are to 
be modified using a stochastic neighborhood operator 
are identified by analyzing a probabilistic model of the 
population of promising candidate solutions.


45.4 EDA Theory


Along with the design and application of EDAs, the 
theoretical understanding of these algorithms has im- 
proved significantly since the first EDAs were pro- 
posed. One way to classify key areas of theoretical 
study of EDAs follows [45.3]: 


• Convergence proofs. Some of the most important
results in EDA theory focus on the number of iterations
of an EDA on a particular class of problems
or the conditions that allow EDAs to provably converge
to a global optimum. The convergence time
(number of iterations until convergence) of UMDA
on onemax for selection methods with fixed selection
intensity was derived by Mühlenbein and
Schlierkamp-Voosen [45.157]. The convergence of
FDA on separable additively decomposable functions
(ADFs) was explored by Mühlenbein and
Mahnig [45.158], who developed an exact formula
for convergence time when using fitness-propor- 
tionate selection. Since, in practice, fitness-propor- 
tionate selection is rarely used because of its sensi- 
tivity to some linear and many other transformations 
of the objective function, truncation selection was 
also examined and an equation was derived giv- 
ing the approximate time to convergence from the 
analysis of the onemax function. Later, Mühlenbein
and Mahnig [45.37] adapted the theoretical model 
to the class of general ADFs where subproblems 
were allowed to interact. Under the assumption 
of Boltzmann selection, theory of graphical mod- 
els was used to derive sufficient conditions for an 
FDA model so that FDA with a large enough pop- 
ulation is guaranteed to converge to a model that 
generates only the global optima. Zhang [45.159] 
analyzed stability of fixed points of limit models of 
UMDA and FDA, and showed that at least for some 
problems the chance of converging to the global 
optimum is indeed increased when using higher or- 
der models of FDA rather than only the probability 
vector of UMDA. Convergence properties of PBIL 
were studied, for example, in [45.64, 160, 161]. 

• Population sizing. The convergence proofs mentioned
above assumed infinite populations in order
to simplify calculations. However, in practice, using
an infinite population is not possible and the
choice of an adequate population size is crucial,
as it is for other population-based evolutionary
algorithms [45.162-165]. Using a population that is
too small can lead to convergence to solutions of
low quality and inability to reliably find the global 
optimum. On the other hand, using a population 
that is too large can lead to an increased complex- 
ity of building and sampling probabilistic models, 
evaluating populations, and executing other EDA 
components. Similar to genetic algorithms, EDAs 
must have a population size sufficiently large to pro- 
vide an adequate initial supply of partial solutions in 
an adequate problem decomposition [45.163, 166] 
and to ensure that good decisions are made between 
competing partial solutions [45.165]. However, the 
population must also be large enough for EDAs 
to make good decisions about the presence or the 
absence of statistically significant variable interac- 
tions. To examine this topic, Pelikan et al. [45.166] 
analyzed the population size required for BOA to 
solve decomposable problems of bounded difficulty 
with uniformly and nonuniformly scaled subprob- 
lems. The results showed that the population sizes 
required grew nearly linearly with the number of 
subproblems (or problem size). The results also 
showed that the approximate number of evaluations 
grew subquadratically for uniformly scaled sub- 
problems but was quadratic on some nonuniformly 
scaled subproblems. Yu et al. [45.167] refined the 
model of Pelikan et al. [45.166] to provide a more 
accurate bound for the adequate population size in 
multivariate entropy-based EDAs such as ECGA 
and BOA, and also examined the effects of the se- 
lection pressure on the population size. Population 
sizing was also empirically analyzed in FDA by 
Mühlenbein [45.168].

• Diversity loss. Stochastic errors in sampling can
lead to a loss of diversity that may sometimes ham- 
per EDA performance. Shapiro [45.169] examined 
the susceptibility of UMDA to diversity loss and 
discussed how it is necessary to set the learning 
parameters in such a way that this does not hap- 
pen. Bosman et al. [45.170] examined diversity loss 
in EDAs for solving real-valued problems and the 
approaches to alleviating this difficulty. The re- 
sults showed that due to diversity loss some of 
the state-of-the-art EDAs for real-valued problems 
could still fail on slope-like regions in the search 
space. The authors proposed using anticipated mean 
shift (AMS) to shift the mean of new solutions each 
generation in order to effectively maintain diversity. 

• Memory complexity. Another factor of importance
in EDA problem solving is the memory required to
solve the problem. Gao and Culberson [45.171] examined
the space complexity of FDA and BOA
on additively decomposable functions where over- 
lap was allowed between subfunctions. Gao and 
Culberson [45.171] proved that the space complex- 
ity of FDA and BOA is exponential in the problem 
size even with very sparse interaction between vari- 
ables. While these results are somewhat negative, 
the authors point out that this only shows that EDAs 
have limitations and work best when the interaction 
structure is of bounded size. Note that one way to 
reduce the memory complexity of EDAs is to use in- 
cremental EDAs, such as PBIL [45.1], cGA [45.34] 
or the incremental Bayesian optimization algorithm 
(iBOA) [45.92]. 

• Model accuracy. A number of studies examined
the accuracy of models in EDAs. Hauschild
et al. [45.172] analyzed the models generated by 
hBOA when solving concatenated traps, random ad- 
ditively decomposable problems, hierarchical traps 
and two-dimensional Ising spin glasses. The mod- 
els generated were then compared to the underlying 
problem structure by analyzing the number of spu- 
rious and correct dependencies. The results showed 
that the models corresponded closely to the structure
of the underlying problems and that the models
did not change significantly between consecutive
iterations of hBOA. The relationship between the
probabilistic models learned by BOA and the underlying
problem structure was also explored by Lima
et al. [45.173]. One of the most important contributions
of this study was to demonstrate the dramatic
effect that selection has on spurious dependencies. 
The results showed that model accuracy was signif- 
icantly improved when using truncation selection 
compared to tournament selection. Motivated by 
these results, the authors modified the complexity
penalty of BOA model building to take into
account tournament sizes when using tournament
selection. Echegoyen et al. [45.174] also
analyzed the structural accuracy of the models us- 
ing EBNA on concatenated traps, two variants of 
Ising spin glass and MAXSAT. In this work, two 
variations of EBNA were compared, one that was 
given the complete model structure based on the 
underlying problem and another that learned the 
approximate structure. The authors then examined 
the probability at any generation that the models 
would generate the optimal solution. The results 
showed that it was not strictly necessary to have 
all the interactions that were in the complete model 
in order to solve the problems. Finally, the effects 
of spurious linkages on EDA performance were 
examined by Radetic and Pelikan [45.175]. The au- 
thors started by proposing a theoretical model to 
describe the effects of spurious (unnecessary) de- 
pendencies on the population sizing of EDAs. This 
model was then tested empirically on onemax and 
the results showed that while it would be expected 
that spurious dependencies would have little effect 
on population size, when niching was included the 
effects were substantial. 
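
The convergence time of UMDA on onemax and its dependence on the population size can also be explored empirically. The following sketch assumes truncation selection, a finite population, and illustrative parameter values; it does not reproduce any of the analytical results cited above, and the function name is hypothetical.

```python
import random


def umda_onemax(n_bits=30, pop_size=200, keep=0.5,
                max_generations=200, seed=0):
    """Run UMDA on onemax; return (generations used, optimum found?)."""
    rng = random.Random(seed)
    p = [0.5] * n_bits                    # probability vector of the model
    n_keep = int(pop_size * keep)
    for generation in range(1, max_generations + 1):
        # Sample each bit independently from the probability vector.
        population = [[1 if rng.random() < p_i else 0 for p_i in p]
                      for _ in range(pop_size)]
        # Onemax fitness is the number of ones; the best come first.
        population.sort(key=sum, reverse=True)
        if sum(population[0]) == n_bits:
            return generation, True
        # Truncation selection followed by re-estimation of the marginals.
        selected = population[:n_keep]
        p = [sum(ind[i] for ind in selected) / n_keep for i in range(n_bits)]
    return max_generations, False
```

Repeating the run with smaller values of `pop_size` and different seeds gives a rough picture of how an undersized population can cause some positions to fix on the wrong value before the optimum is sampled.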


45.5 Efficiency Enhancement Techniques for EDAs 


EDAs can solve many classes of important problems 
in a robust and scalable manner, oftentimes requiring 
only a low-order polynomial growth of the number of 
function evaluations with respect to the number of decision
variables [45.4, 5, 8, 22, 74, 166, 176]. However,
even a low-order polynomial complexity is sometimes 
insufficient for practical application of EDAs especially 
when the number of decision variables is extremely 
large, when evaluation of candidate solutions is compu- 
tationally expensive, or when there are many conflicting 
objectives to optimize. The good news is that a number 
of approaches exist that can be used to further en- 
hance efficiency of EDAs. Some of these techniques can 
be adopted from genetic and evolutionary algorithms 
with little or no change. However, some techniques 
are directly targeted at EDAs because these techniques
exploit some of the unique advantages of EDAs over
most other metaheuristics. Specifically, some efficiency 
enhancements capitalize on the facts that the use of 
probabilistic models in EDAs provides a rigorous and 
flexible framework for incorporating prior knowledge 
about the problem into optimization, and that EDAs 
provide practitioners with a series of probabilistic mod- 
els that reveal a lot of information about the problem. 
This section reviews some of the most important ef- 
ficiency enhancement techniques for EDAs with main 
focus on techniques developed specifically for EDAs. 


45.5.1 Parallelization 


One of the most straightforward approaches to speeding
up any algorithm is to distribute the computation
over a number of computational nodes so that several
computational tasks can be executed in parallel. There 
are two main bottlenecks of EDAs that are typically 
addressed by parallelization: (1) fitness evaluation, and 
(2) model building and sampling. If fitness evaluation is 
computationally expensive, a master-slave architecture 
can be used for distributing fitness evaluations and col- 
lecting the results [45.177]. If most computational time 
is spent in model building and sampling, model building 
and sampling should be parallelized [45.4, 178, 179]. 
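
Distributing fitness evaluations from a coordinating process to a pool of workers can be sketched with the Python standard library. The function name and the worker count below are hypothetical, and a thread pool is used only to keep the example self-contained; threads help mainly when the fitness function waits on an external simulator or on I/O, whereas CPU-bound fitness functions written in pure Python would normally call for a process pool or separate machines.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_population(fitness, population, workers=4):
    """Evaluate all candidate solutions in parallel, preserving their order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() hands the solutions to the workers and collects the results
        # in the same order as the input population.
        return list(pool.map(fitness, population))
```

For example, `evaluate_population(sum, [[1, 0, 1], [1, 1, 1]])` evaluates onemax on two bit strings and returns their fitness values in population order.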

Many parallelization techniques and much of the 
theory can be adopted from research on parallelization 
in genetic and evolutionary algorithms [45.177]. In the 
context of EDAs, parallelization of model building was 
discussed, for example, by Ocenasek et al. [45.178-181]
who proposed the parallel BOA, mBOA and
hBOA, and by Larrañaga and Lozano [45.4] who 
parallelized model building in EBNA. One of the 
most impressive results in parallelization of EDAs was 
published by Sastry et al. [45.14, 182] who proposed 
a highly efficient, fully parallelized implementation of 
cGA to solve large-scale problems with millions to bil- 
lions of variables even with a substantial amount of 
external noise in the objective function. 


45.5.2 Hybridization 


An optimization hybrid combines two or more optimizers
in a single procedure [45.183-185]. Typically,
a global procedure and a local procedure are combined; 
the global procedure is expected to find promising re- 
gions and the local procedure is expected to find local 
optima quickly within reasonable basins of attraction. 
Global and local search are used in concert to find good 
solutions faster and more reliably than would be possi- 
ble using either procedure alone. 

Numerous studies have proposed to combine EDAs 
with variants of local search both in the discrete 
domain [45.22, 99, 186] and in the real-valued domain
[45.187]. The main reason for combining EDAs
with local search is that by reducing the search space 
to the local optima, the structure of the problem can 
be identified more easily and the population-sizing re- 
quirements can be significantly decreased [45.22, 99]. 
Furthermore, the search reduces to the space of basins 
of attraction around each local optimum as opposed to 
the space of all admissible solutions. 

However, hybridization of EDAs is not restricted to 
the combination of an EDA with simple local search. 
As was already pointed out, probabilistic models often 
contain a lot of information about the problem. By mining
these models for information about the structure and
other properties of the problem landscape, decisions 
can be made about the nature and likely effectiveness 
of particular local search procedures and appropriate 
neighborhood structures for those procedures [45.188- 
192]. In turn, subsequent local search as well as the 
coordination of the global and local search in a hybrid 
can be managed so that excellent solutions are found 
quickly, reliably and accurately. 

There are two main approaches to the design of 
EDA-based (model-directed) hybrids with advanced 
neighborhoods: (1) belief propagation, which uses the
probabilistic model to generate the most likely
instance [45.189-191], and (2) local search with an advanced
neighborhood structure derived from an EDA
model [45.188, 192]. However, it is important to note 
that the use of EDA models is not limited to advanced 
neighborhood structures or belief propagation, and one 
may envision the use of probabilistic models to control 
the division of time resources between the global and 
local searcher and in a number of other tasks. 

Local search based on advanced neighborhood 
structures in a hill-climbing like procedure [45.193, 
194] is strongly related to model-directed hybridization 
using EDAs, although in this approach no estimation 
of distributions takes place. The basic idea is to use 
a linkage learning approach to detect important inter- 
actions between problem variables, and then run a local 
search based on a neighborhood defined by the under- 
lying problem decomposition. 


45.5.3 Time Continuation 


To achieve the same solution quality, one may run an 
EDA or another population-based metaheuristic with 
a large population for one convergence epoch, or run the 
algorithm with a small population for a large number 
of convergence epochs with controlled restarts between 
these epochs [45.195]. Similar tradeoffs are involved 
in the design of efficient and reliable hybrid proce- 
dures where an appropriate division of computational 
resources between the component algorithms is critical. 
The term time continuation is used to refer to the trade- 
offs involved [45.196]. 

Two important studies related to time continuation
in EDAs were published by Sastry and Goldberg
[45.197, 198]. Based on a theoretical model of an
ECGA-based hybrid, Sastry showed that under certain 
assumptions, the neighborhoods created from EDA- 
built models provide sufficient information for local 
search to succeed on its own even on classes of problems
for which local search with standard neighborhoods
performs poorly. However, in many other cases,
EDA-driven search in a hybrid with local search based 
on the adaptive neighborhood should perform better, es- 
pecially if the structure of the problem is complex and 
the problem is affected by external noise. 

One of the promising research directions related 
to time continuation in EDAs is to mine probabilistic 
models discovered by EDAs to find an optimal way 
to exploit time continuation tradeoffs, be it in an EDA 
alone or in an EDA-based hybrid. 


45.5.4 Using Prior Knowledge and Learning 
from Experience 


The use of prior knowledge has long been studied
and applied in optimization. For example, promising partial
solutions may be used to bias the initial population of 
candidate solutions, specialized search operators can be 
designed to solve a particular class of problems, or rep- 
resentations can be biased in order to make the search 
for the optimum an easier task. However, one of the lim- 
itations of most of these approaches is that the prior 
knowledge must be incorporated by hand and the ap- 
proaches are limited to one specific problem domain. 
The use of probabilistic models provides EDAs with 
a unique framework for incorporating prior knowledge 
into optimization because of the possibility of using 
Bayesian statistics to combine prior knowledge with 
data in the learning of probabilistic models [45.23, 90, 
199-201]. Furthermore, the use of probabilistic models 
in EDAs provides a basis for learning from previ- 
ous runs in order to solve new problem instances of 
similar type with increased speed, accuracy, and reliability [45.22-24]. Practitioners can thus incorporate
two sources of bias into EDAs: (1) prior knowledge and 
(2) information obtained from models from prior EDA 
runs on similar problems (or runs of some other algo- 
rithm); these two sources can of course be combined 
using Bayesian statistics or in another way [45.23, 
24, 90, 200]. Then, the bias can be incorporated into 
EDAs either by restricting the class of allowable mod- 
els [45.199] or by increasing scores of models that ap- 
pear to be more likely than others [45.23, 24, 90, 200]. 
For example, Hauschild et al. [45.24, 89] proposed 
to use a probability coincidence matrix to store prob- 
abilities of Bayesian-network dependencies between 
pairs of problem variables in prior hBOA runs and to 
bias the model building in hBOA on future problem in- 
stances of similar type using the matrix. Other related 


approaches were proposed [45.24, 200] that were based 
on combining a distance metric on problem variables 
and the pool of models obtained in previous runs on 
problems of similar type. The use of a distance metric 
in combination with prior EDA runs is somewhat more 
broadly applicable and promises to be more useful for 
practitioners. One of the main reasons for this is that 
this approach allows the use of bias derived from prior 
runs on problems of smaller size to bias optimization of 
larger problems. Furthermore, the approach is applica- 
ble even in cases where the meaning of a variable and its 
context change significantly from one problem instance 
to another. 


45.5.5 Fitness Evaluation Relaxation 


To reduce the number of objective function evaluations, a model of the objective function can be built [45.202-204]. While models of the objective function can be created for any optimization method, EDAs
enable the use of probabilistic models for creating 
relatively complex computational models of the prob- 
lem in a fully automated manner. Specifically, if an 
advanced EDA is used that contains a complex prob- 
abilistic model, the model can be mined to provide a set 
of statistics that can be estimated for an accurate, ef- 
ficient computational model of the objective function. 
The model can then be used to replace some of the evaluations, possibly most of them. It was shown that the
use of adequate models of the objective function can 
yield multiplicative speedups of several tens [45.202- 
204]. 


45.5.6 Incremental and Sporadic Model 
Building 


With sporadic model-building, the probabilities (pa- 
rameters) of the model are updated in every itera- 
tion, but the structure of the probabilistic model is 
rebuilt only once in every few iterations (genera- 
tions) [45.205]. Sporadic model building was shown 
to yield significant speedups that increased with prob- 
lem size, mainly because building model structure 
is the most computationally expensive part of model 
building, but the model structure often changes only slightly between consecutive iterations of an EDA. With incremental model building, the model is built incrementally
starting from the structure discovered in the previous 
iteration [45.41]. This can often reduce computational 
resources required to learn an accurate model.


45.6 Starting Points for Obtaining Additional Information


This section provides pointers for obtaining additional 
information about EDAs. 


45.6.1 Introductory Books and Tutorials 


Numerous books and other publications exist that 
provide an introduction to EDAs and additional starting
points. The following list of references includes some 
of them: [45.2-5, 7, 8, 22, 206]. 


45.6.2 Software 


The following list includes some of the popular 
EDA implementations available online. These im- 
plementations should provide a good starting point 
for the interested reader. Entries in the list are or- 
dered alphabetically. Note that the list is not exhaus- 
tive: 


• Adapted maximum-likelihood Gaussian model iterated density estimation evolutionary algorithm (AMaLGaM) [45.124]: http://homepages.cwi.nl/~bosman/source_code.php

• Bayesian optimization algorithm (BOA) [45.43]; BOA with decision graphs [45.55]; dependency-tree EDA [45.35]: http://medal-lab.org/

• Demos of aggregation pheromone system (APS) [45.207] and histogram-based EDAs for permutation-based problems (EHBSA) [45.63]: http://www.hannan-u.ac.jp/~tsutsui/research-e.html

• Distribution estimation using Markov random fields (DEUM) [45.96, 97]: http://sidshakya.com/Downloads/Main.html

• Extended compact genetic algorithm [45.45], χ-ary ECGA, BOA [45.43], BOA with decision trees/graphs [45.55], and others: http://illigal.org/

• Mixed BOA (mBOA) [45.56], adaptive mBOA (amBOA) [45.128]: http://jiri.ocenasek.com/

• Probabilistic incremental program evolution (PIPE) [45.131]: ftp://ftp.idsia.ch/pub/rafal/

• Real-coded BOA (rBOA) [45.49], multiobjective rBOA [45.208]: http://www.evolution.re.kr/

• Regularity model based multiobjective EDA (RM-MEDA) [45.209]; hybrid of differential evolution and EDA [45.210]; model-based multiobjective evolutionary algorithm (MMEA) [45.155], and others: http://cswww.essex.ac.uk/staff/qzhang/mypublication.htm


45.6.3 Journals 


The following journals are key venues for papers on 
EDAs and evolutionary computation, although papers 
on EDAs can be found in many other journals focusing 
on optimization, artificial intelligence, machine learn- 
ing, and applications: 


• Evolutionary Computation (MIT Press): http://www.mitpressjournals.org/loi/evco

• Evolutionary Intelligence (Springer): http://www.springer.com/engineering/journal/12065

• Genetic Programming and Evolvable Machines (Springer): http://www.springer.com/computer/ai/journal/10710

• IEEE Transactions on Evolutionary Computation (IEEE Press): http://ieeexplore.ieee.org/servlet/opac?punumber=4235

• Natural Computing (Springer): http://www.springer.com/computer/theoretical+computer+science/journal/11047

• Swarm and Evolutionary Computation (Elsevier): http://www.journals.elsevier.com/swarm-and-evolutionary-computation/


45.6.4 Conferences 


The following conferences provide the most important 
venues for publishing papers on EDAs and evolutionary 
computation, although, as with journals, papers
on EDAs are often published in other venues: 


• ACM SIGEVO Genetic and Evolutionary Computation Conference (GECCO)

• European Workshops on Applications of Evolutionary Computation (EvoWorkshops)

• IEEE Congress on Evolutionary Computation (CEC)

• Main European Events on Evolutionary Computation (EvoStar)

• Parallel Problem Solving in Nature (PPSN)

• Simulated Evolution and Learning (SEAL)


45.7 Summary and Conclusions


EDAs are a class of stochastic optimization algorithms 
that have been gaining popularity due to their ability to 
solve a broad array of complex problems with excellent 
performance and scalability. Moreover, while many of 
these algorithms have been shown to perform well with 
little or no problem-specific information, such informa- 
tion can be used advantageously if available. 

EDAs have their roots in the fields of evolutionary 
computation and machine learning. From evolutionary 
computation, EDAs borrow the idea of using a pop- 
ulation of solutions that evolves through iterations of 
selection and variation. From machine learning, EDAs 
borrow the idea of learning models from data, and they 
use the resulting models to guide the search for better 
solutions. This approach is powerful especially because 
it allows the search algorithm to adapt to the problem 
being solved, giving EDAs the possibility of being an 
effective black-box search algorithm. Since most real- 
world problems have some sort of inherent structure (as 
opposed to being completely random), there is a hope 
that EDAs can learn such a structure, or at least parts of 
it, and put that knowledge to good use in searching for 
optima. 

Another key characteristic of EDAs, and one that 
sets them apart from other metaheuristics, lies in the


46. Parallel Evolutionary Algorithms


Dirk Sudholt 

Evolutionary algorithms (EAs) have given rise to many parallel variants, fuelled by the rapidly increasing number of CPU cores and the ready availability of computation power through GPUs and cloud computing. A very popular approach is to parallelize evolution in island models, or coarse-grained EAs, by evolving different populations on different processors. These populations run independently most of the time, but they periodically communicate genetic information to coordinate search. Many applications have shown that island models can speed up computation significantly, and that parallel populations can further increase solution diversity.

The aim of this book chapter is to give a gentle introduction into the design and analysis of parallel evolutionary algorithms, in order to understand how parallel EAs work, and to explain when and how speedups over sequential EAs can be obtained.

Understanding how parallel EAs work is a challenging goal as they represent interacting stochastic processes, whose dynamics are determined by several parameters and design choices. This chapter uses a theory-guided perspective to explain how key parameters affect performance, based on recent advances on the theory of parallel EAs. The presented results give insight into the fundamental working principles of parallel EAs, assess the impact of parameters and design choices on performance, and contribute to an informed design of effective parallel EAs.

46.1 Parallel Models
46.1.1 Master-Slave Models
46.1.2 Independent Runs
46.1.3 Island Models
46.1.4 Cellular EAs
46.1.5 A Unified Hypergraph Model for Population Structures
46.1.6 Hybrid Models
46.2 Effects of Parallelization
46.2.1 Performance Measures for Parallel EAs
46.2.2 Superlinear Speedups
46.3 On the Spread of Information in Parallel EAs
46.3.1 Logistic Models for Growth Curves
46.3.2 Rigorous Takeover Times
46.3.3 ...
46.3.4 Propagation
46.4 Examples Where Parallel EAs ...
46.4.1 Independent Runs
46.4.2 Offspring Populations
46.4.3 Island Models
46.4.4 Crossover Between ...
46.5 Speedups by Parallelization
46.5.1 A General Method for Analyzing Parallel EAs
46.5.2 Speedups in Combinatorial Optimization
46.5.3 Adaptive Numbers of Islands
46.6 Conclusions
46.6.1 Further Reading
References


Recent years have witnessed the emergence of a huge number of parallel computer architectures. Almost every desktop or notebook PC, and even mobile phones, come with several CPU cores built in. Also GPUs have been discovered as a source of massive computation power at no extra cost. Commercial IT solutions often use clusters with hundreds and thousands of CPU cores and cloud computing has become an affordable and convenient way of gaining CPU power.

With these resources readily available, it has be- 
come more important than ever to design algorithms 
that can be implemented effectively in a parallel architecture. Evolutionary algorithms (EAs) are popular
general-purpose metaheuristics inspired by the natural 
evolution of species. By using operators like mutation, 
recombination, and selection, a multi-set of solutions — 
the population — is evolved over time. The hope is that 
this artificial evolution will explore vast regions of the 
search space and yet use the principle of survival of 
the fittest to generate good solutions for the problem 
at hand. Countless applications as well as theoretical 
results have demonstrated that these algorithms are ef- 
fective on many hard optimization problems. 

One of many advantages of EAs is that they are 
easy to parallelize. The process of artificial evolution 
can be implemented on parallel hardware in various 
ways. It is possible to parallelize specific operations, or 
to parallelize the evolutionary process itself. The latter 
approach has led to a variety of search algorithms called 
island models or cellular evolutionary algorithms. They 
differ from a sequential implementation in that evo- 
lution happens in a spatially structured network. Sub- 
populations evolve on different processors and good 
solutions are communicated between processors. The 
spread of information can be tuned easily via key pa- 
rameters of the algorithm. A slow spread of information 
can lead to a larger diversity in the system, hence in- 
creasing exploration. 

Many applications have shown that parallel EAs can 
speed up computation and find better solutions, com- 
pared to a sequential EA. This book chapter reviews 
the most common forms of parallel EAs. We highlight 
what distinguishes parallel EAs from sequential EAs. 
We also make an effort to understand the search dynamics of parallel EAs. This addresses a very hot topic since, as of today, even the impact of the most basic parameters of a parallel evolutionary algorithm is not well understood.

The chapter has a particular emphasis on theoretical 
results. This includes runtime analysis, or computa- 


tional complexity analysis. The goal is to estimate the 
expected time until an EA finds a satisfactory solution 
for a particular problem, or problem class, by rigorous 
mathematical studies. This area has led to very fruit- 
ful results for general EAs in the last decade [46.1, 
2]. Only recently have researchers turned to investigat- 
ing parallel evolutionary algorithms from this perspec- 
tive [46.3-7]. The results help to get insight into the 
search behavior of parallel EAs and how parameters 
and design choices affect performance. The presenta- 
tion of these results is kept informal in order to make 
it accessible to a broad audience. Instead of present- 
ing theorems and complete formal proofs, we focus on 
key ideas and insights that can be drawn from these 
analyses. 

The outline of this chapter is as follows. In 
Sect. 46.1 we first introduce parallel models of evo- 
lutionary algorithms, along with a discussion of key 
design choices and parameters. Section 46.2 considers 
performance measures for parallel EAs, particularly no- 
tions for speedup of a parallel EA when compared to 
sequential EAs. 

Section 46.3 deals with the spread of information 
in parallel EAs. We review various models used to de- 
scribe how the number of good solutions increases in a 
parallel EA. This also gives insight into the time until 
the whole system is taken over by good solutions, the 
so-called takeover time. 

In Sect. 46.4 we present selected examples 
where parallel EAs were shown to outperform se- 
quential evolutionary algorithms. Drastic speedups 
were shown on illustrative example functions. This 
holds for various forms of parallelization, from in- 
dependent runs to offspring populations and island 
models. 

Section 46.5 finally reviews a general method for 
estimating the expected running time of parallel EAs. 
This method can be used to transfer bounds for a se- 
quential EA to a corresponding parallel EA, in an 
automated fashion. We go into a bit more detail here, 
in order to enable the reader to apply this method 
by her-/himself. Illustrative example applications are 
given that also include problems from combinatorial 
optimization. 

The chapter finishes with conclusions in Sect. 46.6 and pointers to further literature on parallel evolutionary algorithms.


46.1 Parallel Models


46.1.1 Master-Slave Models


There are many ways to use parallel machines.
A simple way of using parallelization is to execute 
operations on separate processors. This can concern 
variation operators like mutation and recombination as 
well as function evaluations. In fact, it makes most 
sense for function evaluations as these operations can be 
performed independently and they are often among the 
most expensive operations. This kind of architecture is
known as the master-slave model. One machine represents
the master and it distributes the workload for executing 
operations to several other machines called slaves. It is 
well suited for the creation of offspring populations as 
offspring can be created and evaluated independently, 
after suitable parents have been selected. 

The system is typically synchronized: the master 
waits until all slaves have completed their operations 
before moving on. However, it is possible to use asyn- 
chronous systems where the master does not wait for 
slaves that take too long. 

The behavior of synchronized master-slave mod- 
els is not different from their sequential counterparts. 
The implementation is different, but the algorithm — and 
therefore search behavior — is the same. 
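
As a minimal sketch of the synchronized master-slave idea, the following Python fragment lets a master hand out function evaluations to a pool of workers and wait for all of them. The toy fitness function `onemax`, the helper name `evaluate_master_slave`, and the use of threads as stand-ins for slave machines are assumptions made purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def onemax(bits):
    """Toy fitness function (an assumption for illustration): number of ones."""
    return sum(bits)


def evaluate_master_slave(population, fitness, workers=4):
    """The master distributes evaluations to workers and waits for all results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps the input order and is fully consumed before the pool
        # closes, which mirrors the synchronized variant: the master moves on
        # only once every worker has finished.
        return list(pool.map(fitness, population))


population = [[0, 1, 1, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
parallel = evaluate_master_slave(population, onemax)
sequential = [onemax(x) for x in population]
```

Both lists contain the same fitness values in the same order, which reflects the point made above: only the implementation changes, not the algorithm.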


46.1.2 Independent Runs 


Parallel machines can also be used to simulate differ- 
ent, independent runs of the same algorithm in parallel. 
Such a system is very easy to set up as no communication during the runtime is required. Only after all runs have stopped do the results need to be collected, and the best solution (or a selection of different high-quality solutions) is output.

Alternatively, all machines can periodically com- 
municate their current best solutions so that the system 
can be stopped as soon as a satisfactory solution has 
been found. As for master-slave models, this pre- 
vents us from having to wait until the longest run has 
finished. 

Despite their simplicity, independent runs can be quite
effective. Consider a setting where a single run of 
an algorithm has a particular success probability, i.e., 
a probability of finding a satisfactory solution within 
a given time frame. Let this probability be denoted p. 
By using several independent runs, this success prob- 
ability can be increased significantly. This approach is 
commonly known as probability amplification. 


The probability that in $\lambda$ independent runs no run is successful is $(1-p)^{\lambda}$. The probability that there is at least one successful run among these is, therefore,

$$1-(1-p)^{\lambda}. \tag{46.1}$$

Figure 46.1 illustrates this amplified success probability for various choices of $\lambda$ and $p$.
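
The amplified success probability in (46.1) is easy to evaluate numerically. The sketch below is only an illustration; the function names `amplified_success` and `runs_needed` are not from the chapter, and `runs_needed` assumes that both the single-run success probability and the target lie strictly between 0 and 1.

```python
import math


def amplified_success(p, runs):
    """Probability that at least one of `runs` independent runs succeeds."""
    return 1.0 - (1.0 - p) ** runs


def runs_needed(p, target):
    """Smallest number of runs whose amplified success reaches `target`.

    Obtained by solving 1 - (1 - p)**runs >= target for runs;
    assumes 0 < p < 1 and 0 < target < 1.
    """
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


for p in (0.05, 0.1, 0.3):
    print(p, [round(amplified_success(p, runs), 3) for runs in (1, 5, 10, 25)])
```

For small p and few runs, the value stays close to the product of p and the number of runs, which is the near-linear regime; for many runs the values flatten out towards 1.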

We can see that for a small number of proces- 
sors the success probability increases almost linearly. 
If the number of processors is large, a saturation effect 
occurs. The benefit of using ever more processors de- 
creases with the number of processors used. The point 
where saturation happens depends crucially on p; for 
smaller success probabilities saturation happens only 
with a fairly large number of processors. 

Furthermore, independent runs can be set up with 
different initial conditions or different parameters. This 
is useful to effectively explore the parameter space and 
to find good parameter settings in a short time. 


46.1.3 Island Models 


Independent runs suffer from obvious drawbacks: once 
a run reaches a situation where its population has be- 
come stuck in a difficult local optimum, it will most 
likely remain stuck forever. This is unfortunate since 
other runs might reach more promising regions of the 
search space at the same time. It makes more sense to establish some form of communication between the different runs to coordinate search, so that runs that have reached low-quality solutions can join in on the search in more promising regions.

Fig. 46.1 Plots of the amplified success probability $1-(1-p)^{\lambda}$ of a parallel system with $\lambda$ independent runs, each having success probability $p$ (horizontal axis: number of independent runs; vertical axis: amplified success probability)

In island models, also called distributed EAs, the 
coarse-grained model, or the multi-deme model, the 
population of each run is regarded as an island. One often speaks of islands as subpopulations that together
form the population of the whole island model. Is- 
lands evolve independently as in the independent run 
model, for most of the time. However, periodically solu- 
tions are exchanged between islands in a process called 
migration. 

The idea is to have a migration topology, a directed 
graph with islands as its nodes and directed edges con- 
necting two islands. At certain points of time selected 
individuals from each island are sent off to neighbored 
islands, i.e., islands that can be reached by a directed
edge in the topology. These individuals are called mi- 
grants and they are included in the target island after 
a further selection process. This way, islands can com- 
municate and compete with one another. Islands that 
get stuck in low-fitness regions of the search space can 
be taken over by individuals from more successful is- 
lands. This helps to coordinate search, focus on the 
most promising regions of the search space, and use the 
available resources effectively. An example of an island 
model is given in Fig. 46.2. Algorithm 46.1 shows the 
general scheme of a basic island model. 


Algorithm 46.1 Scheme of an island model with migration interval $\tau$

1: Initialize a population made up of subpopulations or islands, $P^{(0)} = \{P_1^{(0)}, \dots, P_m^{(0)}\}$.
2: Let $t := 1$.
3: loop
4:   for each island $i$ do in parallel
5:     if $t \bmod \tau = 0$ then
6:       Send selected individuals from island $P_i^{(t)}$ to selected neighbored islands.
7:       Receive immigrants $I_i^{(t)}$ from islands for which island $P_i^{(t)}$ is a neighbor.
8:       Replace $P_i^{(t)}$ by a subpopulation resulting from a selection among $P_i^{(t)}$ and $I_i^{(t)}$.
9:     end if
10:    Produce $P_i^{(t+1)}$ by applying reproduction operators and selection to $P_i^{(t)}$.
11:  end for
12:  Let $t := t+1$.
13: end loop

Fig. 46.2 Sketch of an island model with six islands and an example topology
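
The scheme in Algorithm 46.1 leaves most design choices open. The following Python sketch fixes them in one particular way, purely as an illustration: a OneMax fitness function, a unidirectional ring, pollination of the best individual, replacement of the worst individual, and a sequential loop that simulates the parallel islands. All function names and default parameter values are assumptions, not part of the algorithm as stated.

```python
import random


def onemax(x):
    """Toy fitness function: number of ones in the bit string."""
    return sum(x)


def mutate(x, rng):
    """Flip each bit independently with probability 1/n."""
    n = len(x)
    return [b ^ 1 if rng.random() < 1.0 / n else b for b in x]


def island_model(n=30, islands=4, mu=5, tau=10, generations=200, seed=1):
    rng = random.Random(seed)
    pops = [[[rng.randint(0, 1) for _ in range(n)] for _ in range(mu)]
            for _ in range(islands)]
    history = []  # best fitness over all islands after each generation
    for t in range(1, generations + 1):
        if t % tau == 0:
            # Migration: every island sends a copy of its best individual
            # to its successor on a unidirectional ring (pollination).
            migrants = [list(max(pop, key=onemax)) for pop in pops]
            for i, pop in enumerate(pops):
                immigrant = migrants[(i - 1) % islands]
                worst = min(range(mu), key=lambda j: onemax(pop[j]))
                if onemax(immigrant) >= onemax(pop[worst]):
                    pop[worst] = immigrant
        for pop in pops:
            # One reproduction step per island: mutate a random parent and
            # let the offspring replace the worst individual if not worse.
            child = mutate(rng.choice(pop), rng)
            worst = min(range(mu), key=lambda j: onemax(pop[j]))
            if onemax(child) >= onemax(pop[worst]):
                pop[worst] = child
        history.append(max(onemax(x) for pop in pops for x in pop))
    return pops, history


pops, history = island_model()
print(history[0], history[-1])
```

Increasing `tau` in this sketch moves the system towards independent runs, while a value of 1 spreads good solutions around the ring as fast as the topology allows.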


There are many design choices that affect the be- 
havior of such an island model: 


• Emigration policy. When migrants are sent, they can be removed from the sending island. Alternatively, copies of selected individuals can be emigrated. The latter is often called pollination. Also the selection of migrants is important. One might select the best, worst, or random individuals.

• Immigration policy. Immigrants can replace the worst individuals in the target population, random individuals, or be subjected to the same kind of selection used within the islands for parent selection or selection for replacement. Crowding mechanisms can be used, such as replacing the most similar individuals. In addition, immigrants can be recombined with individuals present on the island before selection.

• Migration interval. The time interval between migrations determines the speed at which information is spread throughout an island model. Its reciprocal is often called migration frequency. Frequent migrations imply a rapid spread of information, while rare migrations allow for more exploration. Note that a migration interval of $\infty$ yields independent runs as a special case.

• Number of migrants. The number of migrants, also called migration size, is another parameter that determines how quickly an island can be taken over by immigrants.

• Migration topology. Also the choice of the migration topology impacts search behavior. The topology can be a directed or undirected graph; after all, undirected graphs can be seen as special cases of directed graphs. Common topologies include unidirectional rings (a ring with directed edges only in one direction), bidirectional rings, torus or grid graphs, hypercubes, scale-free graphs [46.8], random graphs [46.9], and complete graphs. Figure 46.3 sketches some of these topologies. An important characteristic of a topology $T = (V, E)$ is its diameter: the maximum number of edges on any shortest path between two vertices. Formally, $\mathrm{diam}(T) = \max_{u,v \in V} \mathrm{dist}(u, v)$, where $\mathrm{dist}(u, v)$ is the graph distance, the number of edges on a shortest path from $u$ to $v$. The diameter gives a good indication of the time needed to propagate information throughout the topology. Rings and torus graphs have large diameters, while hypercubes, complete graphs, and many scale-free graphs have small diameters.
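
To make the notion of diameter concrete, the sketch below builds a few of the topologies mentioned above as adjacency sets and computes their diameters by breadth-first search from every vertex. The helper names and the representation of a topology as a dictionary of successor sets are assumptions chosen for illustration.

```python
from collections import deque


def ring(n, bidirectional=False):
    """Ring on n vertices; edges point from v to v+1 (and back if bidirectional)."""
    adj = {v: {(v + 1) % n} for v in range(n)}
    if bidirectional:
        for v in range(n):
            adj[v].add((v - 1) % n)
    return adj


def complete(n):
    """Complete graph on n vertices."""
    return {v: set(range(n)) - {v} for v in range(n)}


def hypercube(d):
    """d-dimensional hypercube: vertices are neighbors if they differ in one bit."""
    return {v: {v ^ (1 << k) for k in range(d)} for v in range(2 ** d)}


def diameter(adj):
    """Maximum over all ordered pairs (u, v) of the graph distance dist(u, v)."""
    best = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        if len(dist) < len(adj):
            return float("inf")  # some vertex cannot be reached from source
        best = max(best, max(dist.values()))
    return best


for name, topology in [("unidirectional ring", ring(16)),
                       ("bidirectional ring", ring(16, bidirectional=True)),
                       ("hypercube", hypercube(4)),
                       ("complete graph", complete(16))]:
    print(name, diameter(topology))
```

With 16 islands in each case, the rings have the largest diameters and the complete graph the smallest, in line with the ordering described above.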


Island models with non-complete topologies are 
also called stepping stone models. The impact of these 
design choices will be discussed in more detail in 
Sect. 46.3. 

If all islands run the same algorithm under identical 
conditions, we speak of a homogeneous island model. 
Heterogeneous island models contain islands with dif- 
ferent characteristics. Different algorithms might be 
used, different representations, objective functions, or 
parameters. Using heterogeneous islands might be use- 
ful if one is not sure what the best algorithm is for 
a particular problem. It also makes sense in the context 
of multiobjective optimization or when a diverse set of 
solutions is sought, as the islands can reflect different 
objective functions, or variations of the same objective 
functions, with an emphasis on different criteria. 

Skolicki [46.10] proposed a two-level view of search 
dynamics in island models. The term intra-island evolu- 
tion describes the evolutionary process that takes place 
within each island. On a higher level, inter-island evolu- 
tion describes the interaction between different islands. 
He argues that islands can be regarded as individuals in a higher-level evolution. Islands compete with one another and islands can take over other islands, just like individuals can replace other individuals in a regular population. One conclusion is that with this perspective an island model looks more like a compact entity.

Fig. 46.3 Sketches of common topologies: a unidirectional ring, a torus, and a complete graph. Other common topologies include bidirectional rings where all edges are undirected and grid graphs where the edges wrapping around the torus are removed

The two levels of evolution obviously interact with 
one another. Which level is more important is deter- 
mined by the migration interval and the other parame- 
ters of the system that affect the spread of information. 


46.1.4 Cellular EAs 


Cellular EAs represent a special case of island mod- 
els with a more fine-grained form of parallelization. 
As in the island model, we have islands connected
by a fixed topology. Rings and two-dimensional torus 
graphs are the most common choice. The most striking 
characteristic is that each island only contains a single 
individual. Islands are often called cells in this context, 
which explains the term cellular EA. Each individual is 
only allowed to mate with its neighbors in the topology. 
This kind of interaction happens in every generation. 
This corresponds to a migration interval of 1 in the context of island models. Figure 46.4 shows a sketch of a cellular EA. A scheme of a cellular EA is given in Algorithm 46.2.

Fig. 46.4 Sketch of a cellular EA on a 7×7 grid graph. The dashed line indicates the neighborhood of the highlighted cell


Algorithm 46.2 Scheme of a cellular EA

1: Initialize all cells to form a population $P^{(0)} = \{P_1^{(0)}, \dots, P_m^{(0)}\}$. Let $t := 0$.
2: loop
3:   for each cell $i$ do in parallel
4:     Select a set $S_i$ of individuals from $P^{(t)}$ out of all cells neighbored to cell $i$.
5:     Create a set $R_i$ by applying reproduction operators to $S_i$.
6:     Create $P_i^{(t+1)}$ by selecting an individual from $\{P_i^{(t)}\} \cup R_i$.
7:   end for
8:   Let $t := t+1$.
9: end loop
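
One possible instantiation of Algorithm 46.2 is sketched below in Python. It is an illustration under several assumptions that the scheme itself does not prescribe: a torus topology with four neighbors per cell, a OneMax fitness function, two randomly chosen neighbors as parents, uniform crossover followed by bitwise mutation, and a synchronous update in which offspring are written to a temporary population.

```python
import random


def onemax(x):
    """Toy fitness function: number of ones in the bit string."""
    return sum(x)


def neighbors(r, c, rows, cols):
    """North, south, west, and east neighbors on a torus."""
    return [((r - 1) % rows, c), ((r + 1) % rows, c),
            (r, (c - 1) % cols), (r, (c + 1) % cols)]


def cellular_ea(rows=5, cols=5, n=20, generations=50, seed=3):
    rng = random.Random(seed)
    grid = {(r, c): [rng.randint(0, 1) for _ in range(n)]
            for r in range(rows) for c in range(cols)}
    best_per_generation = []
    for _ in range(generations):
        new_grid = {}  # temporary population for the synchronous update
        for (r, c), current in grid.items():
            # S_i: two parents drawn from the neighboring cells.
            mum, dad = (grid[rng.choice(neighbors(r, c, rows, cols))]
                        for _ in range(2))
            # R_i: a single offspring from uniform crossover and mutation.
            child = [rng.choice(pair) for pair in zip(mum, dad)]
            child = [b ^ 1 if rng.random() < 1.0 / n else b for b in child]
            # P_i^(t+1): the better of the current individual and the offspring.
            new_grid[(r, c)] = child if onemax(child) >= onemax(current) else current
        grid = new_grid
        best_per_generation.append(max(onemax(x) for x in grid.values()))
    return grid, best_per_generation


grid, best_per_generation = cellular_ea()
print(best_per_generation[0], best_per_generation[-1])
```

Because all parents are read from the old grid and all offspring are written to `new_grid`, no cell ever mates with offspring created in the same generation.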


Cellular EAs yield a much more fine-grained sys- 
tem; they have therefore been called fine-grained mod- 
els, neighborhood models, or diffusion models. The 
difference to island models is that no evolution takes 
place on the cell itself, i.e., there is no intra-island 
evolution. Improvements can only be obtained by cells 
interacting with one another. It is, however, possible 
that an island can interact with itself. 

In terms of the two-level view on island models, 
in cellular EAs the intra-island dynamics have effec- 
tively been removed. After all, each island only contains 
a single individual. Fine-grained models are well suited 
for investigations of inter-island dynamics. In fact, the 
first runtime analyses considered fine-grained island 
models, where each island contains a single individ- 
ual [46.4, 5]. Other studies dealt with fine-grained sys- 
tems that use a migration interval larger than 1 [46.3, 6, 
7]. 

For replacing individuals the same strategies as 
listed for island models can be used. All cells can 
be updated synchronously, in which case we speak of 
a synchronous cellular EA. A common way of imple- 
menting this is to create a new, temporary population. 
All parents are taken from the current population and 
new individuals are written into the temporary popula- 
tion. At the end of the process, the current population is 
replaced by the temporary population. 

Alternatively, cells can be updated sequentially, re- 
sulting in an asynchronous cellular EA. This is likely 
to result in a different search behavior as individu- 


als can mate with offspring of their neighbors. Alba 
et al. [46.11] define the following update strategies. The 
terms are tailored towards two-dimensional grids or 
torus graphs as they are inspired by cellular automata. 
It is, however, easy to adapt these strategies to arbitrary 
topologies: 


• Uniform choice: the next cell to be updated is chosen uniformly at random.

• Fixed line sweep: the cells are updated sequentially, line by line in a grid/torus topology.

• Fixed random sweep: the cells are updated sequentially, according to some fixed order. This order is determined by a permutation of all cells. This permutation is created uniformly at random during initialization and kept throughout the whole run.

• New random sweep: this strategy is like fixed random sweep, but after each sweep is completed a new permutation is created uniformly at random.
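
The four update strategies can be written as small functions that return the order in which cells are updated. The sketch below is only an illustration: cells are assumed to be numbered 0 to m-1 line by line, each call is assumed to cover m single-cell updates, and the function names are hypothetical.

```python
import random


def uniform_choice(m, rng):
    """m updates, each cell drawn uniformly at random (repetitions possible)."""
    return [rng.randrange(m) for _ in range(m)]


def fixed_line_sweep(m):
    """Cells in their natural order, i.e., line by line for row-major numbering."""
    return list(range(m))


def make_fixed_random_sweep(m, rng):
    """Draw one random permutation at initialization and reuse it for every sweep."""
    order = list(range(m))
    rng.shuffle(order)
    return lambda: list(order)


def new_random_sweep(m, rng):
    """Draw a fresh random permutation for each sweep."""
    order = list(range(m))
    rng.shuffle(order)
    return order


rng = random.Random(0)
fixed_random_sweep = make_fixed_random_sweep(9, rng)
print(uniform_choice(9, rng))
print(fixed_line_sweep(9))
print(fixed_random_sweep(), fixed_random_sweep())
print(new_random_sweep(9, rng))
```

Only `uniform_choice` can return the same cell twice and omit others; the three sweep variants always return a permutation of all cells.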


A time step or generation is defined as the time 
needed to update m cells, m being the number of cells 
in the grid. The last three strategies ensure that within 
each time step each cell is updated exactly once. This 
yields a much more balanced treatment for all cells. 
With the uniform choice model it is likely that some
cells must wait for a long time before being updated. In 
the limit, the waiting time for updates follows a Poisson 
distribution. Consider the random number of updates 
until the last cell has been updated at least once. This 
random process is known as the coupon collector prob- 
lem [46.12, page 32], as it resembles the process of 
collecting coupons, which are drawn uniformly at ran- 
dom. A simple analysis shows that the expected number 
of updates until the last cell has been updated in the uni- 
form choice model (or all coupons have been collected) 
equals 


$$m \cdot \sum_{i=1}^{m} \frac{1}{i} \approx m \cdot \ln(m).$$

This is equivalent to

$$\sum_{i=1}^{m} \frac{1}{i} \approx \ln m$$


time steps, which can be significantly larger than 1, the 
time for completing a sweep in any given order. 
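
The coupon collector bound can be checked numerically. The following sketch computes the exact expectation m · (1/1 + 1/2 + ... + 1/m) and compares it with a direct simulation of the uniform choice model; the function names are assumptions made for this illustration.

```python
import random
from fractions import Fraction


def expected_updates(m: int) -> float:
    """Exact expected number of updates until all m cells were updated once:
    m times the m-th harmonic number."""
    return float(m * sum(Fraction(1, i) for i in range(1, m + 1)))


def simulate_updates(m: int, rng: random.Random) -> int:
    """Draw cells uniformly at random until every cell was drawn at least once."""
    seen = set()
    updates = 0
    while len(seen) < m:
        seen.add(rng.randrange(m))
        updates += 1
    return updates


if __name__ == "__main__":
    rng = random.Random(0)
    m = 100
    runs = [simulate_updates(m, rng) for _ in range(1000)]
    print("exact expectation:", expected_updates(m))
    print("simulated average:", sum(runs) / len(runs))
    print("in time steps:    ", expected_updates(m) / m)
```

Dividing the number of updates by m gives the number of time steps, which grows like ln m, while any sweep strategy needs exactly one time step.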

Cellular EAs are often compared to cellular automata.
In the context of the latter, it is common practice
to consider a two-dimensional grid and different neigh- 
borhoods. The neighborhood in Fig. 46.2 is called the 
von Neumann neighborhood or Linear 5. It includes 
the cell itself and its four neighbors along the directions 
north, south, west, and east. The Moore neighborhood 
or Compact 9 in addition also contains the four cells 
to the north west, north east, south west, and south 
east. Also larger neighborhoods are common, contain- 
ing cells that are further away from the center cell. 

Note that using a large neighborhood on a two- 
dimensional grid is equivalent to considering a graph 
where, starting with a torus graph, for each vertex 
edges to nearby vertices have been added. We will, 
therefore, in the remainder of this chapter stick to the 
common notion of neighbors in a graph (i.e., vertices 
connected by an edge), unless there is a good reason 
not to. 


46.1.5 A Unified Hypergraph Model 
for Population Structures 


Sprave [46.13] proposed a unified model for population
structures. It is based on hypergraphs, an extension
of graphs where edges can connect more than two ver- 
tices. We present an informal definition to focus on 
the ideas; for formal definitions we refer to [46.13]. 
A hypergraph contains a set of vertices and a collec- 
tion of hyperedges. Each hyperedge is a non-empty set 
of vertices. Two vertices are neighbored in the hyper- 
graph if there is a hyperedge that contains both vertices. 
Note that the special case where each hyperedge con- 
tains two different vertices results in an undirected 
graph. 

In Sprave’s model each vertex represents an indi- 
vidual. Hyperedges represent the set of possible parents 
for each individual. The model unifies various common 
population models: 


• Panmictic populations: for panmictic populations we have
  a set of vertices V and there is a single hyperedge that
  equals the whole vertex set. This reflects the fact that
  in a panmictic population each individual has all
  individuals as potential parents.

• Island models with migration: if migration is understood
  in the sense that individuals are removed, the set of
  potential parents for an individual contains all potential
  immigrants as well as all individuals from its own island,
  except for those that emigrate.

• Island models with pollination: if pollination is used,
  the set of potential parents contains all immigrants and
  all individuals on its own island.

• Cellular EAs: for each individual, the potential parents
  are its neighbors in the topology.


In the case of coarse-grained models, the hypergraph may
depend on time. More precisely, we have different sets of
potential parents when migration is used, compared to
generations without migration. Sprave considers this by
defining a dynamic population structure: instead of
considering a single, fixed hypergraph, we consider
a sequence of hypergraphs over time.


46.1.6 Hybrid Models


It is also possible to combine several of the above
approaches. For instance, one can imagine an island model
where each island runs a cellular EA to further promote
diversity. Or one can think of hierarchical island models
where islands are island models themselves. In such
a system it makes sense that the inner-layer island models
use more frequent migrations than the outer-layer island
model. Island models and cellular EAs can also be
implemented as master-slave models to achieve a better
speedup.


46.2 Effects of Parallelization


An obvious effect of parallelization is that the
computation time can be reduced by using multiple
processors. This section describes performance measures
that can be used to define this speedup. We also consider
beneficial effects of using parallel EAs that can lead to
superlinear speedups.


46.2.1 Performance Measures 
for Parallel EAs 


The computation time of a parallel EA can be defined 
in various ways. It makes sense to use wall-clock time 
as the performance measure as this accounts for the 
overhead by parallelization. Under certain conditions,
it is also possible to use the number of generations or 
function evaluations. This is feasible if these measures 
reflect the real running time in an adequate way, for in- 
stance if the execution of a generation (or a function 
evaluation) dominates the computational effort, includ- 
ing the effort for coordinating different machines. It is 
also feasible if one can estimate the overhead or the 
communication costs separately. 

We consider settings where an EA is run until a cer- 
tain goal is fulfilled. Goals can be reaching a global or 
local optimum or reaching a certain minimum fitness. In 
such a setting the goal is fixed and the running time of 
the EA can vary. This is in contrast to setups where the 
running time is fixed to a predetermined number of gen- 
erations and then the quality or accuracy of the obtained 
solutions is compared. As Alba pointed out [46.14], per- 
formance comparisons of parallel and sequential EAs 
only make sense if they reach the same accuracy. In 
the following, we focus on the former setting where the 
same goal is used. 

Still, defining speedup formally is far from trivial. 
It is not at all clear against what algorithm a parallel al- 
gorithm should be compared. However, this decision is 
essential to clarify the meaning of speedup. Not clarify- 
ing it, or using the wrong comparison, can easily yield 
misleading results and false claims. We present a taxon- 
omy inspired by Alba [46.14], restricted to cases where 
a fixed goal is given: 


• Strong speedup: the parallel run time of a parallel
  algorithm is compared against the sequential run time of
  the best known sequential algorithm. It was called
  absolute speedup by Barr and Hickman [46.15]. This measure
  captures how far parallelization can improve upon the best
  known algorithms. However, it is often difficult to
  determine the best sequential algorithm. Most researchers,
  therefore, do not use strong speedup [46.14].

• Weak speedup: the parallel run time of an algorithm is
  compared against its own sequential run time. This gives
  rise to two subcases where the notion of its own
  sequential run time is made precise:

  – Single machine/panmixia: the parallel EA is compared
    against a canonical, panmictic version of it, running on
    a single machine. For instance, we might compare an
    island model with m islands against an EA running
    a single island. The EA run on each island is the same
    in both cases.

  – Orthodox: the parallel EA running on m machines is
    compared against the same parallel EA running on
    a single machine. This kind of speedup was called
    relative speedup by Barr and Hickman [46.15].


Given these fundamental differences, it is essential
for researchers to clarify their notion of speedup.

Having clarified the comparison, we can now define the
speedup and other measures. Let $T_m$ denote the time for
m machines to reach the goal. Let $T_1$ denote the time for
a single machine, where the algorithm is chosen according
to one of the definitions of speedup given above.

The idea is to consider the ratio of $T_m$ and the time for
a single machine, $T_1$, as speedup. However, as we are
dealing with randomized algorithms, $T_1$ and $T_m$ are
random variables and so the ratio of both is a random
variable as well. It makes more sense to consider the ratio
of expected times for both the parallel and the sequential
algorithm as speedup

$$s_m = \frac{E(T_1)}{E(T_m)}.$$


Note that $T_1$ and $T_m$ might have very dissimilar
probability distributions. Even when both are re-scaled
appropriately to obtain the best possible match between the
two, they might still have different shapes and different
variances. In some cases it might make sense to consider
the median or other statistics instead of the expectation.

According to the speedup $s_m$ we distinguish the following
cases:


• Sublinear speedup: if $s_m < m$ we speak of a sublinear
  speedup. This implies that the total computation time
  across all machines is larger than the total computation
  time of the single machine (assuming no idle times in the
  parallel algorithm).

• Linear speedup: the case $s_m = m$ is known as linear
  speedup. There, the parallel and the sequential algorithm
  have the same total time. This outcome is very desirable
  as it means that parallelization does not come at a cost.
  There is no noticeable overhead in the parallel algorithm.

• Superlinear speedup: if $s_m > m$ we have a superlinear
  speedup. The total computation time of the parallel
  algorithm is even smaller than that of the single machine.
  This case is considered in more detail in the following
  section.

Fig. 46.5a,b Total effort for executing an operation on a single, panmictic population of size μ = 100 (sequential
algorithm) and a parallel algorithm with m processors and m subpopulations of size μ/m = 100/m each. The effort on
a population of size n is assumed to be n ln n (a) and n² (b). Note that no overhead is considered for the parallel algorithm


Speedup is the best known measure, but not the only 
one used regularly. For the sake of completeness, we 
mention other measures. The efficiency is a normaliza- 
tion of the speedup 


$$e_m = \frac{s_m}{m}.$$


Obviously, $e_m = 1$ is equivalent to a linear speedup.
Lower efficiencies correspond to sublinear speedups, 
higher ones to superlinear speedups. 

Another measure is called incremental efficiency and it
measures the speedup when moving from m − 1 processors to
m processors

$$ie_m = \frac{(m-1) \cdot E(T_{m-1})}{m \cdot E(T_m)}.$$

There is also a generalized form where m − 1 is replaced by
m′ < m in the above formula. This reflects the speedup when
going from m′ processors to m processors.
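
As a minimal sketch, the measures of this section can be estimated from samples of observed running times by replacing each expectation with a sample mean. The function names, the use of the sample mean as an estimator, and the tolerance in `classify` are assumptions made for this illustration.

```python
from statistics import mean
from typing import Sequence


def speedup(t1: Sequence[float], tm: Sequence[float]) -> float:
    """Estimate E(T_1) / E(T_m) from samples of running times."""
    return mean(t1) / mean(tm)


def efficiency(t1: Sequence[float], tm: Sequence[float], m: int) -> float:
    """Speedup normalized by the number of machines m."""
    return speedup(t1, tm) / m


def incremental_efficiency(t_prev: Sequence[float],
                           t_m: Sequence[float], m: int) -> float:
    """((m - 1) * E(T_{m-1})) / (m * E(T_m)), estimated from samples."""
    return ((m - 1) * mean(t_prev)) / (m * mean(t_m))


def classify(s_m: float, m: int, tol: float = 1e-9) -> str:
    """Label a speedup as sublinear, linear, or superlinear."""
    if abs(s_m - m) <= tol:
        return "linear"
    return "superlinear" if s_m > m else "sublinear"


if __name__ == "__main__":
    # Hypothetical wall-clock times (seconds) for reaching the same goal.
    t1 = [100.0, 120.0]   # single machine
    t4 = [25.0, 30.0]     # four machines
    s = speedup(t1, t4)
    print(s, efficiency(t1, t4, 4), classify(s, 4))
```

Which algorithm produces the single-machine samples depends on the notion of speedup chosen (strong, single machine/panmixia, or orthodox); the code itself does not encode that choice.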


46.2.2 Superlinear Speedups 


At first glance superlinear speedups seem astonish- 
ing. How can a parallel algorithm have a smaller 
total computation time than a sequential counterpart? 
After all, parallelization usually comes with signif- 
icant overhead that slows down the algorithm. The 
existence of superlinear speedups has been a matter of
controversy in the literature. However, there are
convincing reasons why a superlinear speedup might 
occur. 

Alba [46.14] mentions physical sources as one pos- 
sible reason. A parallel machine might have more 
resources in terms of memory or caches. When moving 
from a single machine to a parallel one, the algorithm 
might — purposely or not — make use of these additional 
resources. Also, each machine might only have to deal 
with smaller data packages. It might be that the smaller 
data fits into the cache while this was not the case for 
the single machine. This can make a significant perfor- 
mance difference. 

When comparing a single panmictic population 
against smaller subpopulations, it might be easier to 
deal with the subpopulations. This holds even when the 
total population sizes of both systems are the same. In 
particular, a parallel system has an advantage if oper- 
ations need time which grows faster than linearly with 
the size of the (sub)population. 

We give two illustrative examples. Compare a single
panmictic population of size μ with m subpopulations of
size μ/m each. Some selection mechanisms, like ranking
selection, might have to sort the individuals in the
population according to their fitness. In a straightforward
implementation one might use well-known sorting algorithms
such as (randomized) QuickSort, MergeSort, or HeapSort. All
of these are known to take time Θ(n ln n) for sorting
n elements, on average. Let us disregard the hidden
constant and the randomness of randomized QuickSort and
assume that the time is precisely n ln n.

Now the effort of sorting the panmictic population is
μ ln μ. The total effort for sorting m populations of size
μ/m each is

$$m \cdot \frac{\mu}{m} \cdot \ln\left(\frac{\mu}{m}\right) = \mu \cdot \ln\left(\frac{\mu}{m}\right) = \mu \ln(\mu) - \mu \cdot \ln(m).$$

So, the parallel system executes this operation faster,
with a difference of μ · ln(m) time steps in terms of the
total computation time.

This effect becomes more pronounced the more 
expensive operations are used (with respect to the popu- 
lation size). Assume that some selection mechanism or 
diversity mechanism is used, which compares every in- 
dividual against every other one. Then the effort for the 
panmictic population is roughly μ² time steps. However,
for the parallel EA and its subpopulations the total effort
would only be

$$m \cdot \left(\frac{\mu}{m}\right)^2 = \frac{\mu^2}{m}.$$


This is faster than the panmictic EA by a factor of m. 

The above two growth curves are actually very typ- 
ical running times for operations that take more than 
linear time. A table with time bounds for common 
selection mechanisms can be found in Goldberg and 
Deb [46.16]. Figure 46.5 shows plots for the total effort
in both scenarios for a population size of μ = 100.
One can see that even with a small number of pro- 
cessors the total effort decreases quite significantly. 
To put this into perspective, most operations require 
only linear time. Also the overhead by paralleliza- 
tion was not accounted for. However, the discussion 
gives some hints as to why the execution time for 
smaller subpopulations can decrease significantly in 
practice. 
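
The two examples can be reproduced with a few lines of code. The sketch below evaluates the total effort m · cost(μ/m) for the idealized cost functions n ln n and n², ignoring any parallelization overhead as in the discussion above; the function names are assumptions made for this illustration.

```python
import math
from typing import Callable


def n_log_n(n: float) -> float:
    """Idealized sorting cost n * ln(n), without hidden constants."""
    return n * math.log(n) if n > 0 else 0.0


def n_squared(n: float) -> float:
    """Idealized cost of comparing every individual against every other one."""
    return n * n


def total_effort(mu: float, m: int, cost: Callable[[float], float]) -> float:
    """Total effort over m subpopulations of size mu / m (no overhead)."""
    return m * cost(mu / m)


if __name__ == "__main__":
    mu = 100
    for m in (1, 2, 4, 5, 10):
        print(m,
              round(total_effort(mu, m, n_log_n), 1),
              round(total_effort(mu, m, n_squared), 1))
```

For the n ln n cost the saving relative to m = 1 is μ · ln(m); for the quadratic cost the total effort shrinks by a factor of m.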


46.3 On the Spread of Information in Parallel EAs 


In order to understand how parallel EAs work, it is 
vital to get an idea on how quickly information is 
propagated. The spread of information is the most 
distinguishing aspect of parallel EAs, particularly dis- 
tributed EAs. This includes island models and cellular 
EAs. Many design choices can tune the speed at which 
information is transmitted: the topology, the migration 
interval, the number of migrants, and the policies for 
emigration and immigration. 


46.3.1 Logistic Models for Growth Curves 


Many researchers have turned to investigating the selec- 
tion pressure in distributed EAs in a simplified model. 
Assume that in the whole system we only have two 
types of solutions: current best individuals and worse 
solutions. No variation is used, i.e., we consider EAs 
using neither mutation nor crossover. The question is 
the following. Using only selection and migration, how 
long does it take for the best solutions to take over the 
whole system? This time, starting from a single best so- 
lution, is referred to as takeover time. 

It is strongly related to the study of growth curves: 
how the number of best solutions increases over time. 
The takeover time is the first point of time at which the 
number of best solutions has grown to the whole popu- 
lation. 


Growth curves are determined by both inter-island 
dynamics and intra-island dynamics: how quickly cur- 
rent best solutions spread in one island’s population, 
and how quickly they populate neighbored islands, un- 
til the whole topology is taken over. Both dynamics are 
linked: intra-island dynamics can have a direct impact 
on inter-island dynamics as the fraction of best indi- 
viduals can decide how many (if any) best individuals 
emigrate. 

For intra-island dynamics one can consider results 
on panmictic EAs. Logistic curves have been proposed 
and found to fit simulations of takeover times very well 
for common selection schemes [46.16]. These curves 
are defined by the following equation. If P(t) is the
proportion of best individuals in the population at time t,
then

$$P(t) = \frac{1}{1 + \left(\frac{1}{P(0)} - 1\right) e^{-at}},$$

where a is called the growth coefficient. One can see that
the proportion of best individuals increases exponentially,
but then the curve saturates as the proportion approaches 1.

Sarma and De Jong [46.17] considered growth curves in
cellular EAs. They presented a detailed empirical study of
the effects of the neighborhood size and the shape of the
neighborhood for different selection schemes. They showed
that logistic curves as defined above can model the growth
curves in cellular EAs reasonably well.

Alba and Luque [46.18] proposed a logistic model called LOG
tailored towards distributed EAs with periodic migration.
If τ denotes the migration interval and m is the number of
islands, then

$$P_{\mathrm{LOG}}(t) = \sum_{i=0}^{m-1} \frac{1/m}{1 + a \cdot e^{-b(t - \tau \cdot i)}}.$$

In this model a and b are adjustable parameters. The model
counts subsequent increases of the proportion of best
individuals during migrations. However, it does not include
any information about the topology and the authors admit
that it only works appropriately on the ring topology
[46.19, Section 4.2]. They, therefore, present an even more
detailed model called TOP, which includes the diameter
diam(T) of the topology T:

$$P_{\mathrm{TOP}}(t) = \sum_{i=0}^{\mathrm{diam}(T)-1} \frac{1/m}{1 + a \cdot e^{-b(t - \tau \cdot i)}} + \frac{(m - \mathrm{diam}(T))/m}{1 + a \cdot e^{-b(t - \tau \cdot \mathrm{diam}(T))}}.$$


Simulations show that this model yields very accurate 
fits for ring, star, and complete topologies [46.19, Sec- 
tion 4.3]. 
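
To make the shape of these models concrete, the following sketch evaluates the logistic curve and the LOG and TOP models. It assumes the forms P(t) = 1 / (1 + (1/P(0) − 1) e^(−at)), a sum of m shifted logistic steps of height 1/m for LOG, and diam(T) such steps plus one step of height (m − diam(T))/m for TOP; the function names and the parameter values in the usage part are illustrative assumptions, not fitted values from the cited studies.

```python
import math


def _step(x: float, a: float) -> float:
    """1 / (1 + a * e^(-x)), guarded against overflow for very negative x."""
    if x < -700.0:
        return 0.0
    return 1.0 / (1.0 + a * math.exp(-x))


def logistic(t: float, a: float, p0: float) -> float:
    """Logistic growth curve with growth coefficient a and P(0) = p0."""
    return 1.0 / (1.0 + (1.0 / p0 - 1.0) * math.exp(-a * t))


def p_log(t: float, a: float, b: float, tau: float, m: int) -> float:
    """LOG model: one logistic step of height 1/m per island."""
    return sum(_step(b * (t - tau * i), a) / m for i in range(m))


def p_top(t: float, a: float, b: float, tau: float, m: int, diam: int) -> float:
    """TOP model: diam steps of height 1/m plus one final, larger step."""
    head = sum(_step(b * (t - tau * i), a) / m for i in range(diam))
    tail = ((m - diam) / m) * _step(b * (t - tau * diam), a)
    return head + tail


if __name__ == "__main__":
    # Illustrative parameters only.
    for t in range(0, 101, 20):
        print(t, round(p_log(t, 5.0, 0.5, 10.0, 8), 3),
              round(p_top(t, 5.0, 0.5, 10.0, 8, 4), 3))
```

Under these assumed forms both models approach 1 for large t, and TOP coincides with LOG when diam(T) = m − 1, as for a unidirectional ring.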

Luque and Alba [46.19, Section 4.3] proceed by 
analyzing the effect of the migration interval and the 
number of migrants. With a large migration interval, 
the growth curves tend to make jumps during migration 
and flatten out quickly to form plateaus during periods 
without migration. The resulting curves look like step 
functions, and the size of these steps varies with the mi- 
gration interval. 

Varying the number of migrants changes the slope 
of these steps. A large number of migrants has a bet- 
ter chance of transmitting best individuals than a small 
number of migrants. However, the influence of the num- 
ber of migrants was found to be less drastic than the 
impact of the migration interval. When a medium or 
large migration frequency is used, the impact of the 
number of migrants is negligible [46.19, Section 4.5]. 
The same conclusion was made earlier by Skolicki and 
De Jong [46.20]. 

Luque and Alba also presented experiments with a model
based on Sprave’s hypergraph formulation of distributed EAs
[46.13]. This model gave a better fit than the simple
logistic model LOG, but it was less accurate than the model
TOP that included the diameter.

For the sake of completeness, we also mention that 
Giacobini et al. [46.21] proposed an improved model 
for asynchronous cellular EAs, which is not based on 
logistic curves. 


46.3.2 Rigorous Takeover Times 


Rudolph [46.22, 23] rigorously analyzed takeover times 
in panmictic populations, for various selection schemes. 
In [46.22] he also dealt with the probability that the best
solution takes over the whole population; this is not
evident for non-elitist algorithms. In [46.23] Rudolph
considered selection schemes made elitist by undoing the
last selection in case the best solution would become
extinct otherwise. Under this scheme the expected takeover
time in a population of size μ is O(μ log μ).

In [46.24] Rudolph considered spatially structured 
populations in a fine-grained model. Each population 
has size 1, therefore vertices in the migration topology 
can be identified with individuals. Migration happens in 
every generation. Assume that initially only one vertex i 
in the topology is a best individual. If in every gen- 
eration each non-best vertex is taken over by the best 
individual in its neighborhood, then the takeover time 
from vertex i equals 


$$\max_{j \in V} \operatorname{dist}(i, j),$$


where V is the set of vertices and dist(i, j) denotes the
graph distance, the number of edges on a shortest path from
i to j.

Rudolph defines the takeover time in a setting where the
initial best solution has the same chance of evolving at
every vertex. Then

$$\frac{1}{|V|} \sum_{i \in V} \max_{j \in V} \operatorname{dist}(i, j)$$

is the expected takeover time if, as above, best solutions
are always propagated to their neighbors with probabil- 
ity 1. If this probability is lower, the expected takeover 
time might be higher. The above formula still represents 
a lower bound. Note that in non-elitist EAs it is possible 
that all best solutions might get lost, leading to a posi- 
tive extinction probability [46.24]. 

Note that $\max_{j \in V} \operatorname{dist}(i, j)$ is bounded by the
diameter of the topology. The diameter is hence a trivial
lower bound on the takeover times. Rudolph [46.24]
conjectures that the diameter is more important than the
selection mechanism used in the distributed EA. 
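
For the fine-grained model with certain propagation, the takeover time from a vertex is simply its largest graph distance to any other vertex, which a breadth-first search computes directly. The sketch below assumes a topology given as a dictionary mapping each vertex to the list of vertices it sends to, and averages over a uniformly random starting vertex; the function names and the `ring` helper are assumptions made for this illustration.

```python
from collections import deque
from typing import Dict, List

Graph = Dict[int, List[int]]


def distances_from(adj: Graph, source: int) -> Dict[int, int]:
    """Breadth-first search: number of edges on a shortest path from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def takeover_time_from(adj: Graph, i: int) -> int:
    """max_j dist(i, j); every vertex must be reachable from i."""
    dist = distances_from(adj, i)
    if len(dist) < len(adj):
        raise ValueError("some vertices are unreachable from the start vertex")
    return max(dist.values())


def expected_takeover_time(adj: Graph) -> float:
    """Average of max_j dist(i, j) over a uniformly random start vertex i."""
    return sum(takeover_time_from(adj, i) for i in adj) / len(adj)


def diameter(adj: Graph) -> int:
    """Largest takeover time over all start vertices."""
    return max(takeover_time_from(adj, i) for i in adj)


def ring(m: int, bidirectional: bool = True) -> Graph:
    """Ring on m vertices; vertex v sends to v+1 (and v-1 if bidirectional)."""
    return {v: [(v + 1) % m] + ([(v - 1) % m] if bidirectional else [])
            for v in range(m)}


if __name__ == "__main__":
    print(expected_takeover_time(ring(8)), diameter(ring(8, bidirectional=False)))
```

On vertex-transitive topologies such as rings every start vertex gives the same value, so the average equals the diameter; on irregular graphs the two can differ.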

In [46.25] the author generalizes the above argu- 
ments to coarse-grained models. Islands can contain 
larger populations and migration happens with a fixed 
frequency. In his model the author assumes that in 
each island new best individuals can only be gener- 
ated by immigration. Migration always communicates 
best individuals. Hence, the takeover time boils down 
to a deterministic time until the last island has been 
reached, plus a random component for the time until 
all islands have been taken over completely. 

Rudolph [46.25] gives tight bounds for unidirec- 
tional rings, based on the fact that each island with 
a best individual will send one such individual to each 
neighbored island. Hence, on the latter island the num- 
ber of best individuals increases by 1, unless the island 
has been taken over completely. For denser topologies he
gives a general upper bound, which may not be tight for all
graphs. If there is an island that receives best
individuals from k > 1 other islands, the number of best
individuals increases by k. (The number k could even
increase over time.) It was left as an open problem to
derive tighter bounds for interesting topologies other than
unidirectional rings.

Other researchers followed up on Rudolph’s sem- 
inal work. Giacobini et al. [46.26] presented theoret- 
ical and empirical results for the selection pressure 
on ring topologies, or linear cellular EAs. Giacobini 
et al. [46.27] did the same for toroidal cellular EAs.
In particular, they considered takeover times for asyn- 
chronous cellular EAs, under various common update 
schemes. Finally, Giacobini et al. investigated growth 
curves for small-world graphs [46.9]. 

The assumption from Rudolph’s model that only
immigration can create new best individuals is not al- 
ways realistic. If standard mutation operators are used, 
there is a constant probability of creating a clone of a se- 
lected parent simply by not flipping any bits. This can 
lead to a rapid increase in the number of high-fitness 
individuals. 

This argument on the takeover of good solutions 
in panmictic populations has been studied as part of 
rigorous runtime analyses of population-based EAs. 
Witt [46.28] considered a simple (μ+1) EA with uniform
parent selection, standard bit mutations, no crossover, and
cut selection at the end of the generation. From his work
it follows that good solutions take over the population in
expected time O(μ log μ). More precisely, if currently
there is at least one individual with current best
fitness i, then after O(μ log μ) generations all
individuals in the population will have fitness at least i.

Sudholt [46.29, Lemma 2] extended these arguments to
a (μ+λ) EA and proved an upper bound of
O(μ/λ · log μ + log μ). Note that, in contrast to other
studies of takeover times, both results apply to real EAs
that actually use mutation. Extending these arguments
to distributed EAs is an interesting topic for future 
work. 


46.3.3 Maximum Growth Curves 


Now, we consider inter-island dynamics in more de- 
tail. Assume for simplicity that intra-island takeover 
happens quickly: after each migration transmitting at 
least one best solution, the target island is completely 
taken over by best solutions before the next migration. We
start with only one island containing a best solution,
assuming that all individuals on this island are best
solutions. We call such an island an optimal island. If
migrants are not subject to variation while emigrating or
immigrating, we will always select best solutions for
migration and, hence, successfully transmit best solutions.

Fig. 46.6 Plots of growth curves in an island model with
64 islands. We assume that in between two migrations all
islands containing a current best solution completely take
over all neighbored islands in the topology. Initially, one
island contains a current best solution and all other
islands are worse. The curves show the fraction of current
best solutions in the system for different topologies:
a unidirectional ring, a bidirectional ring, a square
torus, a hypercube, and a complete graph

These assumptions give rise to a deterministic 
spread of best solutions: after each migration, each opti- 
mal island will turn all neighbored islands into optimal 
islands. This is very similar to Rudolph’s model [46.25], 
but it also accounts for a rapid intra-island takeover in 
between migrations. 

We consider growth curves on various graph 
classes: unidirectional and bidirectional rings, square 
torus graphs, hypercubes, and complete graphs. Fig- 
ure 46.6 shows these curves for all these graphs 
on 64 vertices. The torus graph has side lengths
8 × 8. The hypercube has dimension 6. Each vertex
has a label made of 6 bits. All possible values for
this bit string are present in the graph. Two vertices 
are neighbored if their labels differ in exactly one 
bit. 

For the unidirectional ring, after i − 1 migrations we have
exactly i optimal islands, if i ≤ m. The growth curve is,
therefore, linear. For the bidirectional ring, information
spreads twice as fast because it can spread in two
directions. After i − 1 migrations we have 2i − 1 optimal
islands if 2i − 1 ≤ m.

The torus allows communication in two dimensions. 
After one migration there are 1+4 = 5 optimal islands. 
After two migrations this number is 1 + 4 + 8, and after
three migrations it is 1 + 4 + 8 + 12. In general, after
i − 1 migrations we have

$$1 + \sum_{j=1}^{i-1} 4j = 1 + 2i(i-1) = 1 + 2i^2 - 2i$$


optimal islands, as long as the optimal islands can freely 
spread out in all four directions, north, south, west, and 
east. At some point the ends of the region of optimal islands
will meet, i.e., the northern tip meets the southern
one and the same goes for west and east. Afterwards, we 
observe regions of non-optimal islands that constantly 
shrink, until all islands are optimal. The growth curve 
for the torus is hence quadratic at first and then it starts 
to saturate. The deterministic growth on torus graphs 
was also considered in [46.30]. 

For the hypercube, we can without loss of generality assume
that the initial optimal island has a label containing only
zeros. After one migration all islands whose label contains
a single one become optimal. After two migrations the same
holds for all islands with two ones, and so on. The number
of optimal islands after i migrations in a d-dimensional
hypercube (i.e., $m = 2^d$) is hence
$\sum_{j=0}^{i} \binom{d}{j}$. This number is close to d
during the first migrations and then at some point starts 
to saturate. The complete graph is the simplest one to 
analyze here as it will be completely optimal after one 
migration. 
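
The deterministic growth curves described above can be reproduced by spreading optimality one hop per migration. The following sketch builds the five topologies on 64 vertices and counts the optimal islands after each migration; the adjacency-list representation and all function names are assumptions made for this illustration.

```python
from typing import Dict, List

Graph = Dict[int, List[int]]


def growth_curve(adj: Graph, start: int = 0) -> List[int]:
    """Number of optimal islands after 0, 1, 2, ... migrations."""
    optimal = {start}
    frontier = {start}
    counts = [1]
    while True:
        # Every island that became optimal last round converts its neighbors.
        frontier = {v for u in frontier for v in adj[u]} - optimal
        if not frontier:
            return counts
        optimal |= frontier
        counts.append(len(optimal))


def ring(m: int, bidirectional: bool) -> Graph:
    return {v: [(v + 1) % m] + ([(v - 1) % m] if bidirectional else [])
            for v in range(m)}


def torus(side: int) -> Graph:
    adj = {}
    for r in range(side):
        for c in range(side):
            adj[r * side + c] = [((r + 1) % side) * side + c,
                                 ((r - 1) % side) * side + c,
                                 r * side + (c + 1) % side,
                                 r * side + (c - 1) % side]
    return adj


def hypercube(d: int) -> Graph:
    # Two labels are neighbored if they differ in exactly one bit.
    return {v: [v ^ (1 << b) for b in range(d)] for v in range(2 ** d)}


def complete(m: int) -> Graph:
    return {v: [u for u in range(m) if u != v] for v in range(m)}


if __name__ == "__main__":
    for name, g in [("unidirectional ring", ring(64, False)),
                    ("bidirectional ring", ring(64, True)),
                    ("torus 8 x 8", torus(8)),
                    ("hypercube d = 6", hypercube(6)),
                    ("complete graph", complete(64))]:
        curve = growth_curve(g)
        print(name, "takeover after", len(curve) - 1, "migrations:", curve[:8])
```

Dividing each count by 64 gives the fraction of current best solutions in the system for the respective topology.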

These arguments and Fig. 46.6 show that the 
growth curves can depend tremendously on the mi- 
gration topology. For sparse topologies like rings or 
torus graphs, in the beginning the growth is linear or 
quadratic, respectively. This is much slower than the 
exponential growth observed in logistic curves. Further- 
more, for the ring there is no saturation; linear curves 
are quite dissimilar to logistic curves. 

This suggests that logistic curves might not be the 
best models for growth curves across all topologies. 
The plots by Luque and Alba [46.19, Section 4.3] show 
a remarkably good overall fit for their TOP model. 
However, this might be due to the optimal choice of 
the parameters a and b and the fact that logistic curves 
are easily adaptable to various curves of roughly sim- 
ilar shape. We believe that it is possible to derive even 
more accurate models for common topologies, based on 
results by Giacobini et al. [46.9, 26, 27]. This is an in- 
teresting challenge for future work. 


46.3.4 Propagation 


So far, we have only considered models where migra- 
tion always successfully transmits best individuals. For 
non-trivial selection of emigrants, this is not always 
given. Also if crossover is used during migration, due 
to disruptive effects migration is not always successful. 
If we consider randomized migration processes, things 
become more interesting. 

Rowe et al. [46.31] considered a model of propa- 
gation in networks. Consider a network where vertices 
are either informed or not. In each round, each in- 
formed vertex tries to inform each of its neighbors. 
Every such trial is successful with a given probability p, 
and then the target island becomes informed. These 
decisions are made independently. Note that an unin- 
formed island might obtain a probability larger than p 
of becoming informed, in case several informed islands 
try to inform it. The model is inspired by models from 
epidemiology; it can be used to model the spread of 
a disease. 

The model of propagation of information directly
applies to our previous setting where the network is the 
migration topology and p describes the probability of 
successfully migrating a current best solution. Note that 
when looking for estimations of growth curves and up- 
per bounds on the takeover time, we can assume that p 
is a lower bound on the actual probability of a success- 
ful transmission. Then the model becomes applicable 
to a broader range of settings, where islands can have 
different transmission probabilities. 

On some graphs like unidirectional rings, we can 
just multiply our growth curves by p to reflect the ex- 
pected number of optimal islands after a certain time. It 
then follows that the time for taking over all m islands 
is by a factor of 1/p larger than in the previous, deter- 
ministic model. 

However, this reasoning does not hold in general. 
Multiplying the takeover time in the deterministic set- 
ting by 1/p does not always give the expected takeover 
time in the random model. Consider a star graph (or 
hub), where initially only the center vertex is informed. 
In the deterministic case p = 1, the takeover time is 
clearly 1. However, if 0 < p < 1, the time until the last 
vertex is informed is given by the maximum of n − 1
independent geometric random variables with parameter p.
For constant p, this time is of order Θ(log n), i.e., the
time until the last vertex is informed is much larger 
than the expected time for any specific island to be 
informed. 

Rowe et al. [46.31] presented a detailed analysis of 
hubs. They also show how to obtain a general upper 
bound that holds for all graphs. For every graph G 
with n vertices and diameter diam(G) the expected 
takeover time is bounded by 


$$O\left(\frac{\mathrm{diam}(G) + \log n}{p}\right).$$


Both terms diam(G) and log n make sense. The diameter
describes what distance needs to be overcome in order
to inform all vertices in the network. The factor 1/p
gives the expected time until a next vertex is informed,
assuming that it has only one informed neighbor. We also
obtain diam(G) (without a factor 1/p) as a lower bound
on the takeover time. The additive term + log n is
necessary to account for a potentially large variance,
as seen in the example for star graphs.

If the diameter of the graph is Ω(log n), we can drop
the + log n term in the asymptotic bound, leading to
an upper bound of O(diam(G)/p).
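
This step can be made explicit. The following short calculation assumes that the general bound has the form O((diam(G) + log n)/p) and that diam(G) ≥ c · log n for some constant c > 0; the constant c is introduced here only for illustration.

$$\frac{\operatorname{diam}(G) + \log n}{p} \le \frac{\operatorname{diam}(G) + \operatorname{diam}(G)/c}{p} = \left(1 + \frac{1}{c}\right) \cdot \frac{\operatorname{diam}(G)}{p} = O\left(\frac{\operatorname{diam}(G)}{p}\right).$$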


Interestingly, the concept of propagation also ap- 
pears in other contexts. When solving shortest paths 
problems in graphs, metaheuristics like evolutionary 
algorithms [46.32-34] and ant colony optimization 
(ACO) [46.35,36] tend to propagate shortest paths 
through the graph. In the single-source shortest paths 
problem (SSSP) one is looking for shortest paths from 
a source vertex to all other vertices of the graph. The 
EAs and ACO algorithms tend to find shortest paths 
first for vertices that are close to the source, in a sense 
that their shortest paths only contain few edges. If 
these shortest paths are found, it enables the algo- 
rithm to find shortest paths for vertices that are further 
away. 

When a shortest path to vertex u has been found and
a shortest path to v uses the edge {u, v}, it is easy to
find a shortest path for v. In the case of evolutionary
algorithms, an EA only needs to assign u as a predecessor
of v on the shortest path in a lucky mutation in order to
find a shortest path to v. In the case of ACO, pheromones
enable an ant to follow a shortest path between the source
and u, and so it only has to decide to travel between u
and v to find a shortest path to v, with good probability.

Doerr et al. [46.34, Lemma 3] used tail bounds
to prove that the time for propagating shortest paths
with an EA is highly concentrated. If the graph has
diameter diam(G) ≥ log n, the EA with high probability
finds all shortest paths in time O(diam(G)/p),
where p = Θ(n^{-2}) in this case. This result is similar
to the one obtained by Rowe et al. [46.31]; asymptot- 
ically, both bounds are equal. However, the result by 
Doerr et al. [46.33] also allows for conclusions about 
growth curves. 

Lässig and Sudholt [46.6, Theorem 3] introduced
yet another argument for the analysis of propagation 
times. They considered layers of vertices. The i-th layer 
contains all vertices that have shortest paths of at most i 
edges, and that are not on any smaller layer. They bound 
the time until information is propagated throughout all 
vertices of a layer. This is feasible since all vertices 
in layer i are informed with probability at least p if 
all vertices in layers 1, ..., i − 1 are informed. If n_i is
the number of vertices in layer i, the time until the last
vertex in this layer is informed is O(n_i · log n_i). This
gives a bound for the total takeover time of O(diam(G) ·
ln(en/diam(G))). For small (diam(G) = O(1)) or large
(diam(G) = Ω(n)) diameters, we get the same asymptotic
bound as before. For other values it is slightly
worse. 

However, the layering of vertices allows for inclu- 
sion of intra-island effects. Assume that the transmis-
sion probability p only applies once islands have been
taken over (to a significantly large degree) by best indi- 
viduals. This is a realistic setting as with only a single 
best individual the probability of selecting it for emigra- 
tion (or pollination, to be precise) might be very small. 
If all islands need time T_intra in order to reach this stage
after the first best individual has reached the island, we
obtain an upper bound of

$$O\bigl(\operatorname{diam}(G) \cdot \ln(en/\operatorname{diam}(G))\bigr) + \operatorname{diam}(G) \cdot T_{\mathrm{intra}}$$

for the takeover time.


46.4 Examples Where Parallel EAs Excel 


Parallel EAs have been applied to a very broad range 
of problems, including many NP-hard problems from 
combinatorial optimization. The present literature is 
immense; already early surveys like the one by Alba 
and Troya [46.37] present long lists of applications 
of parallel EAs. Further applications can be found 
in [46.38–40]. Research on and applications of parallel
metaheuristics have increased in recent years, due to
the emergence of parallel computer architectures.

Crainic and Hail [46.41] review applications of 
parallel metaheuristics, with a focus on graph color- 
ing, partitioning problems, covering problems, Steiner 
tree problems, satisfiability problems, location and 
network design, as well as the quadratic assignment 
problem with its famous special cases: the traveling
salesman problem and vehicle routing problems.
Luque and Alba [46.19] present selected applications 
for natural language tagging, the design of combina- 
torial logic circuits, the workforce planning problem, 
and the bioinformatics problem of assembling DNA 
fragments. 

The literature is too vast to be reviewed in this 
section. Also, for many hard practical problems it is 
often hard to determine the effect that parallelization 
has on search dynamics. The reasons behind the suc- 
cess of parallel models often remain elusive. We follow 
a different route and describe theoretical studies of evo- 
lutionary algorithms where parallelization was proven 
to be helpful. This concerns illustrative toy functions 
as well as problems from combinatorial optimization. 
All following settings are well understood and allow 
us to gain insights into the effect of parallelization. 
We consider parallel variants of the simplest evolutionary
algorithm, the (1+1) evolutionary algorithm, or
(1+1) EA for short. It is described in Algorithm 46.3
and it only uses mutation and selection in a population
containing just one current search point. We are interested
in the optimization time, defined as the number
of generations until the algorithm first finds a global
optimum. Unless noted otherwise, we consider
pseudo-Boolean optimization: the search space contains all bit
strings of length n and the task is to maximize a function
f: {0,1}^n → ℝ. We use the common notation x =
x_1 ... x_n for bit strings.


Algorithm 46.3 (1+1) EA for maximizing f: {0,1}^n → ℝ
1: Initialize x ∈ {0,1}^n uniformly at random.
2: loop
3:   Create x′ by copying x and flipping each bit
     independently with probability 1/n.
4:   if f(x′) ≥ f(x) then x := x′.
5: end loop


The presentation in this section is kept informal. For 
theorems with precise results, including all precondi- 
tions, we refer to the respective papers. 
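
Algorithm 46.3 can be written down in a few lines. The following Python sketch is one possible rendering; the stopping criterion (a user-supplied optimality test and a generation budget) is an addition made here so that the loop terminates, and the function and parameter names are illustrative.

```python
import random


def one_plus_one_ea(f, n, is_optimal, max_generations, rng=None):
    """(1+1) EA for maximizing f over bit strings of length n.

    Returns (x, generations), where generations is the number of
    generations until an optimum was first found, or None if the
    budget max_generations was exhausted first.
    """
    if rng is None:
        rng = random.Random()
    # Initialize x uniformly at random.
    x = [rng.randint(0, 1) for _ in range(n)]
    fx = f(x)
    generation = 0
    while not is_optimal(x):
        if generation == max_generations:
            return x, None
        # Flip each bit independently with probability 1/n.
        y = [1 - bit if rng.random() < 1.0 / n else bit for bit in x]
        fy = f(y)
        # Accept the offspring if it is not worse than the parent.
        if fy >= fx:
            x, fx = y, fy
        generation += 1
    return x, generation


if __name__ == "__main__":
    # OneMax: the fitness is the number of ones.
    n = 20
    x, t = one_plus_one_ea(sum, n, lambda s: sum(s) == n, 100_000,
                           random.Random(1))
    print("optimization time:", t)
```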


46.4.1 Independent Runs 


Independent runs prove useful if the running time has 
a large variance. The reason is that the optimization 
time equals the time until the fastest run has found 
a global optimum. 

The variance can be particularly large in the case 
when the objective function yields local optima that 
are very hard to overcome. Bimodal functions contain 
two local optima, and typically only one is a global 
optimum. One such example was already analyzed the- 
oretically in the seminal runtime analysis paper by 
Droste et al. [46.42]. 

We review the analysis of a similar function that 
leads to a simpler analysis. The function TwoMax was 
considered by Friedrich et al. [46.43] in the context of 
diversity mechanisms. It is a function of unitation: the 
fitness only depends on the number of bits set to 1. The 
function contains two symmetric slopes that increase 
linearly with the distance to n/2. Only one of these 
slopes leads to a global optimum. Formally, the function
is defined as the maximum of OneMax := Σ_{i=1}^{n} x_i and
its symmetric cousin ZeroMax := Σ_{i=1}^{n} (1 − x_i), with an
additional fitness bonus for the all-ones bit string:

$$\mathrm{TwoMax}(x) := \max\left\{ \sum_{i=1}^{n} x_i,\ \sum_{i=1}^{n} (1 - x_i) \right\} + \prod_{i=1}^{n} x_i .$$


See Fig. 46.7 for a sketch. 

The (1 + 1) EA reaches either a local optimum or 
a global optimum in expected time O(n log n). Due to
the perfect symmetry of the function on the remainder 
of the search space, the probability that this is the global 
optimum is exactly 1/2. If a local optimum is reached, 
the (1+ 1) EA has to flip all bits in one mutation in 
order to reach the global optimum. The probability for 
this event is exactly n^{-n}.

The authors consider deterministic crowding
[46.43] in a population of size μ as a diversity
mechanism. It has the same search behavior as μ
independent runs of the (1+1) EA, except that the
running time is counted in a different way. Their result 
directly transfers to this parallel model. The only 
assumption is that the number of independent runs is 
polynomially bounded in n. 

The probability of finding a global optimum after
O(n log n) generations of the parallel system is
amplified to 1 − 2^{-m}. This means that only with
probability 2^{-m} we arrive at a situation where the parallel
EA needs to escape from a local optimum. When all m
islands are in this situation, the probability that at least
one island makes this jump in one generation is

$$1 - \left(1 - n^{-n}\right)^m = \Theta\left(m \cdot n^{-n}\right),$$

where the last equality holds since m is asymptotically
smaller than n^n.

Fig. 46.7 Plots of the bimodal function TwoMax as defined in [46.43]

This implies that the expected number of genera- 
tions of a parallel system with m independent runs is 


$$O(n \log n) + 2^{-m} \cdot \Theta\left(\frac{n^n}{m}\right).$$


We can see from this formula that the number of runs m 
has an immense impact on the expected running time. 
Increasing the number of runs by 1 decreases the sec- 
ond summand by more than a factor of 2. The speedup 
is, therefore, exponential, up to a point where the running
time is dominated by the first term O(n log n).
Note in particular that log(n^n) = n log n processors
are sufficient to decrease the expected running time
to O(n log n).
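
To see where the number n log n comes from, consider the second summand 2^{-m} · Θ(n^n/m) of the bound. Taking logarithms to base 2 (an assumption made here for concreteness), it is at most a constant as soon as 2^m ≥ n^n:

$$2^{-m} \cdot \frac{n^n}{m} \le 1 \quad \text{whenever} \quad m \ge \log_2\left(n^n\right) = n \log_2 n .$$

With this many runs the total bound is dominated by the first term O(n log n).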

This is a very simple example of a superlinear 
speedup, with regard to the optimization time. 
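
The amplification effect is easy to observe experimentally. The following Python sketch defines TwoMax, runs a (1+1) EA until it sits on one of the two peaks, and estimates how often at least one of m independent runs reaches the all-ones string without needing the all-bits jump. The function names and the small problem size are choices made for this illustration; the estimate can be compared with 1 − 2^{-m}.

```python
import random


def twomax(x):
    """TwoMax: max(OneMax, ZeroMax) plus a bonus of 1 for the all-ones string."""
    ones = sum(x)
    n = len(x)
    return max(ones, n - ones) + (1 if ones == n else 0)


def run_to_peak(n, rng):
    """Run a (1+1) EA on TwoMax until it reaches one of the two peaks.

    Returns True for the global optimum (all ones) and False for the
    local optimum (all zeros).
    """
    x = [rng.randint(0, 1) for _ in range(n)]
    while 0 < sum(x) < n:
        y = [1 - bit if rng.random() < 1.0 / n else bit for bit in x]
        if twomax(y) >= twomax(x):
            x = y
    return sum(x) == n


def parallel_success_rate(n, m, trials, seed=0):
    """Fraction of trials in which at least one of m independent runs
    climbs the slope leading to the global optimum."""
    rng = random.Random(seed)
    hits = sum(
        any(run_to_peak(n, rng) for _ in range(m)) for _ in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    for m in (1, 2, 4, 8):
        print(m, parallel_success_rate(10, m, trials=300), 1 - 2.0 ** -m)
```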

The observed effects also occur in combinatorial 
optimization. Witt [46.44] analyzed the (1+ 1) EA on 
the NP-hard PARTITION problem. The task can be 
regarded as scheduling on two machines: given a se- 
quence of jobs, each with a specific effort, the goal is to 
distribute the jobs on two machines so that the largest
execution time (the makespan) is minimized. 

On worst-case instances the (1 + 1) EA has a con- 
stant probability of getting stuck in a bad local op- 
timum. The expected time to find a solution with 
a makespan of less than (4/3 − ε) · OPT is n^{Ω(n)}, where
ε > 0 is an arbitrary constant and OPT is the value of
the optimal solution. 

However, if the (1+ 1) EA is lucky, it can, in- 
deed, achieve a good approximation of the global 
optimum. Assume we are aiming at a solution with 
a makespan of at most (1 + ε) · OPT, for some ε > 0
we can choose. Witt's analysis shows that then
2^{(e log e + e) · ⌈2/ε⌉ ln(4/ε) + O(1/ε)} parallel runs output
a solution of this quality with probability at least 3/4. (This
probability can be further amplified quite easily by
using more runs.) Each run takes time O(n ln(1/ε)).
The parallel model represents what is known as
a polynomial-time randomized approximation scheme
(PRAS). The desired approximation quality (1 + ε) can
be specified, and if ε is fixed, the total computation
time is bounded by a polynomial in n. This was the
first example showing that parallel runs of a randomized
search heuristic constitute a PRAS for an NP-hard problem.


46.4.2 Offspring Populations


Using offspring populations in a master-slave architec- 
ture can decrease the parallel running time and lead 
to a speedup. We will discuss this issue further in 
Sect. 46.5 as offspring populations are very similar to 
island models on complete topologies. For now, we 
present one example where offspring populations de- 
crease the optimization time very drastically. 

Jansen et al. [46.45] compared the (1+1) EA
against a variant (1+λ) EA that creates λ offspring in
parallel and compares the current search point against
the best offspring. They constructed a function SufSamp
where offspring populations have a significant
advantage. We refrain from giving a formal definition, 
but instead describe the main ideas. The vast majority 
of all search points tend to lead an EA towards the start 
of a path through the search space. The points on this 
path have increasing fitness, thus encouraging an EA to 
follow it. All points outside the path are worse, so the 
EA will stay on the path. 

The path leads to a local optimum at the end. How- 
ever, the function also includes a number of smaller 
paths that branch off the main path, see Fig. 46.8. All 
these paths lead to global optima, but they are diffi- 
cult to discover. This makes a difference between the 
(1+1) EA and the (1+λ) EA for sufficiently large λ.
The (1 + 1) EA typically follows the main path without 
discovering the smaller paths branching off. At the end 
of the main path it thus becomes stuck in a local opti- 
mum. The analysis in [46.45] shows that the (1+ 1) EA 
needs superpolynomial time, with high probability. 

In contrast, the (1+λ) EA performs a more thorough
search as it progresses on the main path. The many
offspring tend to discover at least one of the smaller 
branches. The fitness on the smaller branches is larger 
than the fitness of the main path, so the EA will move 
away from the main path and follow a smaller path. It 
then finds a global optimum in polynomial time, with 
high probability. 
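
For reference, the (1+λ) EA differs from the (1+1) EA only in the offspring step. The Python sketch below is a minimal version under the same assumptions as before (standard bit mutation with rate 1/n, a user-supplied optimality test, and a generation budget); it does not implement SufSamp, and all names are illustrative.

```python
import random


def mutate(x, rng):
    """Standard bit mutation: flip each bit independently with probability 1/n."""
    n = len(x)
    return [1 - bit if rng.random() < 1.0 / n else bit for bit in x]


def one_plus_lambda_ea(f, n, lam, is_optimal, max_generations, rng=None):
    """(1+lambda) EA: create lam offspring per generation and compare
    the current search point against the best of them."""
    if rng is None:
        rng = random.Random()
    x = [rng.randint(0, 1) for _ in range(n)]
    generation = 0
    while not is_optimal(x):
        if generation == max_generations:
            return x, None
        # In a master-slave architecture these lam evaluations run in parallel.
        offspring = [mutate(x, rng) for _ in range(lam)]
        best = max(offspring, key=f)
        if f(best) >= f(x):
            x = best
        generation += 1
    return x, generation


if __name__ == "__main__":
    n = 20
    x, t = one_plus_lambda_ea(sum, n, 8, lambda s: sum(s) == n, 100_000,
                              random.Random(1))
    print("generations:", t)
```

Note that the returned count is the number of generations, not the number of function evaluations, in line with the parallel running time considered in this section.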

Interestingly, this construction can be easily adapted 
to show an opposite result. We replace the local optimum
at the end of the main path by a global optimum
and replace all global optima at the end of the smaller
branches by local optima. This yields another function
SufSamp′, also shown in Fig. 46.8. By the same reasoning
as above, the (1+λ) EA will become stuck and the
(1+1) EA will find a global optimum in polynomial
time, with high probability.

While the example is clearly constructed and artifi- 
cial, it can be seen as a cautionary tale. The reader might 
be tempted to think that using offspring populations in- 
stead of creating a single offspring can never increase 
the number of generations needed to find the optimum. 
After all, evolutionary search with offspring population 
is more intense and improvements can be found more 
easily. As we focus on the number of generations (and 
do not count the effort for creating λ offspring), it is
tempting to claim that offspring populations are never 
disadvantageous. 

The second example shows that this claim — how- 
ever obvious it may seem — does not hold for general 
problem classes. Note that this statement is also implied 
by the well-known no free lunch theorems [46.46], but 
the above results are much stronger and more concrete. 


46.4.3 Island Models 


The examples so far have shown that a more thorough 
search — by independent runs or increased sampling of 
offspring — can lead to more efficient running times. 
Lässig and Sudholt [46.3] presented a first example 
where communication makes the difference between 
exponential and polynomial running times, in a typi- 
cal run. They constructed a family of problems called 
LOLZ_{z,b,ℓ} where a simple island model finds the optimum
in polynomial time, with high probability. This
holds for a proper choice of the migration interval and 
any migration topology that is not too sparse. The is- 
lands run (1+ 1) EAs, hence the island model resembles 
a fine-grained model. 

In contrast, both a panmictic population and
independent islands need exponential time, with high
probability. This shows that the weak speedup versus
panmixia is superlinear, even exponential (when considering
speedups with respect to the typical running
time instead of the expected running time). Unlike previous
examples, it also shows that more sophisticated
means of parallelization can be better than independent
runs.

Fig. 46.8 Sketches of the functions SufSamp (left) and SufSamp′ (right).
The fitness is indicated by the color

The basic idea of this construction is as follows. An
EA can increase the fitness of its current solutions by
gathering a prefix of bits with the same value. Generally,
a prefix of i leading ones yields the same fitness as
a prefix of i leading zeros. The EA has to make a decision
whether to collect leading ones (LOs) or leading
zeros (LZs). This not only holds for the (1+1) EA but
also for a (not too large) panmictic population as genetic
drift will lead the whole population to either leading
ones or leading zeros.

In the beginning, both decisions are symmetric.
However, after a significant prefix has been gathered,
symmetry is broken: after the prefix has reached
a length of z, z being a parameter of the function, only
leading ones lead to a further fitness increase. If the EA
has gone for leading zeros, it becomes stuck in a local
optimum. The parameter z determines the difficulty of
escaping from this local optimum.

This construction is repeated on several blocks of
the bit string that need to be optimized one-by-one.
Each block has length ℓ. Only if the right decision towards
the leading ones is made on the first block, can
the block be filled with further leading ones. Once the
first block contains only leading ones, the fitness depends
on the prefix in the second block, and a further
decision between leading ones and leading zeros needs
to be made. Table 46.1 illustrates the problem definition.

So, the problem requires an EA to make several
decisions in succession. The number of blocks, b, is
another parameter that determines how many decisions
need to be made. Panmictic populations will sooner or
later make a wrong decision and become stuck in some
local optimum. If b is not too small, the same holds for
independent runs.

Table 46.1 Examples of solutions for the function LOLZ with four blocks and z = 3, along with their fitness values. All
blocks have to be optimized from left to right. The sketch shows in bold all bits that are counted in the fitness evaluation.
Note how in x3 in the third block only the first z = 3 zeros are counted. Further 0-bits are ignored. The only way to
escape from this local optimum is to flip all z 0-bits in this block simultaneously

Solution  Block 1   Block 2   Block 3   Block 4   Fitness
x1        11110011  11010100  11010110  01011110  LOLZ(x1) = 4
x2        11111111  11010100  11010110  01011110  LOLZ(x2) = 10
x3        11111111  11111111  00000110  01011110  LOLZ(x3) = 19

However, an island model can effectively communicate
the right decisions on blocks to other islands.
Islands that have become stuck in a local optimum can


be taken over by other islands that have made the cor- 
rect decision. These dynamics make up the success of 
the island model as it can be shown to find global op- 
tima with high probability. A requirement is, though, 
that the migration interval is carefully tuned so that 
migration only transmits the right information. If mi- 
gration happens before the symmetry between leading 
ones and leading zeros is broken, it might be that islands 
with leading zeros take over islands with leading ones. 
Lässig and Sudholt [46.3] give sufficient conditions un- 
der which this does not happen, with high probability. 
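
The informal description of LOLZ given earlier can be turned into a small fitness function. The Python sketch below is inferred from that description and from the examples in Table 46.1; it is not the formal definition from [46.3], and the function names are illustrative. Within the current block it counts leading ones, or leading zeros capped at z, and it only moves on to the next block once the current block consists of ones only.

```python
def leading(block, bit):
    """Number of leading entries of block that are equal to bit."""
    count = 0
    for b in block:
        if b != bit:
            break
        count += 1
    return count


def lolz(x, z, num_blocks, block_len):
    """Sketch of LOLZ with parameters z, b = num_blocks, l = block_len."""
    assert len(x) == num_blocks * block_len
    fitness = 0
    for i in range(num_blocks):
        block = x[i * block_len:(i + 1) * block_len]
        ones = leading(block, 1)
        zeros = min(leading(block, 0), z)  # at most z leading zeros count
        fitness += max(ones, zeros)
        if ones < block_len:
            break  # later blocks only count once this block is all ones
    return fitness


def bits(s):
    """Parse a string such as '1111 0000' into a list of bits."""
    return [int(c) for c in s if c in "01"]


if __name__ == "__main__":
    x1 = bits("11110011 11010100 11010110 01011110")
    x2 = bits("11111111 11010100 11010110 01011110")
    x3 = bits("11111111 11111111 00000110 01011110")
    for x in (x1, x2, x3):
        print(lolz(x, z=3, num_blocks=4, block_len=8))  # 4, 10, 19
```

The three values agree with the fitness values listed in Table 46.1.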

An interesting finding is also how islands can regain 
independence. During migration, genetic information 
about future blocks is transmitted. Hence, after migra- 
tion all islands contain the same genotype on future 
blocks. This is a real threat as this dependence might 
imply that all islands make the same decision after mov- 
ing on to the next block. Then all diversity would be 
lost. 

However, under the conditions given in [46.3] there 
is a period of independent evolution following mi- 
gration, before any island moves on to a new block. 
During this period of independence, the genotypes of 
future blocks are subjected to random mutations, inde- 
pendently for each island. The reader might think of 
moving particles in some space. Initially, all particles are in
the same position. However, then particles start moving
around randomly. Naturally, they will spread out and 
separate from one another. After some time the distri- 
bution of particles will resemble a uniform distribution. 
In particular, an observer would not be able to distin- 
guish whether the positions of particles were obtained 
by this random process or by simply drawing them from 
a uniform distribution. 

The same effect occurs with bits of future blocks; 
after some time all bits of a future block will be in- 
distinguishable from a random bit string. This shows 
that independence can not only be gained by indepen- 
dent runs, but also by periods of independent evolution. 
One could say that the island model combines the 
advantages of two worlds: independent evolution and 
selection pressure through migration. The island model
is only successful because it can use both migration and
periods of independent evolution.

Fig. 46.9 Sketch of the graph G′. The top shows a configuration where a decision at v* has to be made. The three
configurations below show the possible outcomes. All these transitions occur with equal probability, but only the one on
the bottom right leads to a solution where rotations are necessary

The theoretical results [46.3] were complemented 
by experiments in [46.47]. The aim was to look at what 
impact the choice of the migration topology and the 
choice of the migration interval have on performance, 
regarding the function LOLZ. The theoretical results 
made a statement about a broad class of dense topolo- 
gies, but required a very precise migration interval. The 
experiments showed that the island model is far more 
robust with respect to the migration interval than sug- 
gested by theory. 

Depending on the migration interval, some topolo- 
gies were better than others. The topologies involved 
were a bidirectional ring, a torus with edges wrapping 
around, a hypercube graph, and the complete graph. We 
considered the success rate of the island model, stop- 
ping it as soon as all islands had reached local or global 
optima. We then performed statistical tests comparing 
these success rates. For small migration intervals, i.e.,
frequent migrations, sparse topologies were better than
dense ones. For large migration intervals, i.e., rare
migrations, the effect was the opposite. This effect was
expected; however, we also found that the torus was 
generally better than the hypercube. This is surprising, 
as both have a similar density. Table 46.2 shows the 
ranking obtained for commonly used topologies. 

Superlinear speedups with island models also occur
in simpler settings. Lässig and Sudholt [46.6] also
considered island models for the Eulerian cycle problem.
Given an undirected Eulerian graph, the task is to
find a Eulerian cycle, i.e., a traversal of the graph on
which each edge is traversed exactly once. This problem
can be solved efficiently by tailored algorithms, but
it served as an excellent test bed for studying the
performance of evolutionary algorithms [46.48–51].
Instead of bit strings, the problem representation by 
Neumann [46.48] is based on permutations of the edges 
of the graph. Each such permutation gives rise to a walk: 
starting with the first edge, a walk is the longest se- 
quence of edges such that two subsequent edges in the 
permutation share a common vertex. The walk encoded 
by the permutation ends when the next edge does not 
share a vertex with the current one. A walk that con- 
tains all edges represents a Eulerian cycle. The length 
of the walk gives the fitness of the current solution. 
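
The fitness evaluation can be sketched as follows. The Python code below is one interpretation of the description above, not Neumann's formal definition: edges are given as pairs of vertices, the first edge may be traversed in either direction, and the walk is extended as long as the next edge in the permutation starts at the current end vertex. The function name is illustrative.

```python
def walk_length(edges):
    """Length of the walk encoded by a permutation of undirected edges.

    Each edge is a pair (u, v). Both orientations of the first edge are
    tried, and the longer resulting walk is returned.
    """
    if not edges:
        return 0
    a, b = edges[0]
    best = 0
    for start_end in (a, b):
        current = start_end  # vertex at which the walk currently ends
        length = 1
        for u, v in edges[1:]:
            if u == current:
                current = v
            elif v == current:
                current = u
            else:
                break  # next edge does not continue the walk
            length += 1
        best = max(best, length)
    return best


if __name__ == "__main__":
    triangle = [(0, 1), (1, 2), (2, 0)]
    print(walk_length(triangle))                   # all three edges form a cycle
    print(walk_length([(0, 1), (2, 3), (1, 2)]))   # the walk stops after one edge
```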

Table 46.2 Performance comparison according to success
rates for commonly used migration topologies. The notation
A < B means that topology A has a significantly smaller
success rate than topology B

Migration interval            Ranking
Small migration intervals     K_μ < hypercube < torus < ring
Medium migration intervals    hypercube < K_μ < ring < torus
High migration intervals      ring < torus < hypercube < K_μ

Neumann [46.48] considered a simple instance that
consists of two cycles of equal size, connected by one
common vertex v* (Fig. 46.9). The instance is interesting
as it represents a worst case for the time until an
improvement is found. This is with respect to randomized
local search (RLS) working on this representation.
RLS works like the (1+1) EA, but it only uses local
mutations. As the mutation operator it uses jumps:
an edge is selected uniformly at random and then it is
moved to a (different) target position chosen uniformly
at random. All edges in between the two positions are
shifted accordingly.
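
The jump operator is straightforward to implement on a list-based permutation. The following Python sketch follows the description above; the function name and the use of a list are choices made for this illustration.

```python
import random


def jump(perm, rng):
    """Jump mutation: move a random element to a different random position.

    All elements between the two positions are shifted by one place.
    The input permutation is left unchanged; a new list is returned.
    """
    n = len(perm)
    if n < 2:
        return list(perm)
    i = rng.randrange(n)        # position of the element to move
    j = rng.randrange(n - 1)    # target position, chosen among the others
    if j >= i:
        j += 1
    result = list(perm)
    item = result.pop(i)
    result.insert(j, item)      # item ends up at index j
    return result


if __name__ == "__main__":
    rng = random.Random(0)
    edges = ["e1", "e2", "e3", "e4", "e5"]
    print(jump(edges, rng))
```

RLS would apply one such jump per generation and keep the result if the encoded walk is not shorter than before.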

On the considered instance RLS typically starts 
constructing a walk within one of these cycles, either by 
appending edges to the end of the walk or by prepend- 
ing edges to the start of the walk. When the walk 
extends to v* for the first time, a decision needs to be 
made. RLS can extend the walk into the opposite
cycle (Fig. 46.9). In this case, RLS can simply extend
both ends of the walk until a Eulerian cycle is formed.
The expected time until this happens is Θ(m^3), where m
denotes the number of edges.

However, if another edge in the same cycle is added 
at v*, the walk will evolve into one of the two cycles 
that make up the instance. It is not possible to add fur- 
ther edges to the current walk, unless the current walk 
starts and ends in v*. However, the walk can be rotated 
so that the start and end vertex of the walk is moved to 
a neighboring vertex. Such an operation takes expected
time Θ(m^2). Note that the fitness after a rotation is the
same as before. Rotations that take the start and end
closer to v* are as likely as rotations that move it away
from v*. The start and end of the walk hence performs
a fair random walk, and Θ(m^2) rotations are needed on
average in order to reach v*. The total expected time for
rotating the cycle is hence Θ(m^4).

Summarizing, if RLS makes the right decision then
expected time O(m^3) suffices in total. However, if rotations
become necessary the expected time increases
to O(m^4). Now consider an island model with μ islands
running RLS. If islands evolve independently for
at least τ ≥ m^3 generations, all mentioned decisions are
made independently, with high probability. The probability
of making a wrong decision is 1/3, hence with μ
islands the probability that all islands make the wrong
decision is 3^{-μ}. The expected time can be shown to be

$$O\left(m^3 + 3^{-\mu} \cdot m^4\right).$$

The choice μ := log_3 m yields an expectation of Θ(m^3),
and every value up to log_3 m leads to a superlinear
speedup, asymptotically speaking. Technically, the
speedup is even exponential.
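
Plugging the choice μ = log_3 m into the second term shows why the bound collapses. This is a short check, assuming a bound of the form m^3 + 3^{-μ} · m^4 with μ denoting the number of islands and m the number of edges:

$$3^{-\log_3 m} \cdot m^4 = \frac{m^4}{m} = m^3, \qquad \text{so} \qquad m^3 + 3^{-\mu} \cdot m^4 = 2 m^3 = O\left(m^3\right).$$

For comparison, a single island (μ = 1) gives m^3 + m^4/3, which is of order m^4, so log_3 m islands reduce the bound by a factor of order m.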

Interestingly, this good performance only holds if
migration is used rarely, or if independent runs are used.
If migration is used too frequently, the island model
rapidly loses diversity. If T is any strongly connected
topology and diam(T) is its diameter, we have the following.
If

$$\tau \cdot \operatorname{diam}(T) \cdot \mu = O\left(m^2\right),$$

then there is a constant probability that the island that
first arrives at a decision at v* propagates this solution
throughout the whole island model, before any other
island can make an improvement. This results in an
expected running time of Ω(m^4/log(μ)). This is almost
Θ(m^4), even for very large numbers of islands.
The speedup is, therefore, logarithmic at best, or even
worse. This natural example shows that the choice of
the migration interval can make a difference between
exponential and logarithmic speedups.


46.4.4 Crossover Between Islands 


It has long been known that island models can also 
be useful in the context of crossover. Crossover usu- 
ally requires a good diversity in the population to work 
properly. Due to the higher diversity between different 
islands, compared to panmixia, recombining individu- 
als from different islands is promising. 

Watson and Jansen [46.52] presented and analyzed 
a royal road function for crossover: a function where 
crossover drastically outperforms mutation-based evo- 
lutionary algorithms. In contrast to previous theoreti- 
cally studied examples [46.53-57], their goal was to 
construct a function with a clear building-block struc- 
ture. In order to prove that a GA was able to assemble 
all building blocks, they resorted to an island model 
with a very particular migration topology. In their 
single-receiver model all islands except one evolve 
independently. Each island sends its migrants to a designated
island called the receiver (Fig. 46.10). This way,
all sending islands are able to evolve the right building 
blocks, and the receiver is used to assemble all these 
building blocks to obtain the optimum. 


Fig. 46.10 The topology for Watson and Jansen’s single- 
receiver model (after [46.52])

Fig. 46.11 Vertex cover instance with bipartite graphs. The brown vertices denote selected vertices. In this configuration
the second component shows a locally optimal configuration while all other components are globally optimal 


This idea was picked up later on by Neumann 
et al. [46.7] in a more detailed study of crossover in is- 
land models. We describe parts of their results, as their 
problem is more illustrative than the one by Watson 
and Jansen. The former authors considered instances of 
the NP-hard Vertex cover problem. Given an undirected 
graph, the goal is to select a subset of vertices such that
each edge has at least one selected endpoint. We say that
an edge is covered if this property holds for it. The
objective is to minimize the number
of selected vertices. The problem has a simple and
natural binary representation where each bit indicates 
whether a corresponding vertex is selected or not. 

Prior work by Oliveto et al. [46.58] showed that
evolutionary algorithms with panmictic populations
fail even on simply structured instance classes like
copies of bipartite graphs. An example is shown in
Fig. 46.11. Consider a single complete bipartite graph,
i.e., two sets of vertices such that each vertex in one set
is connected to every vertex in the other set. If both sets
have different sizes, the smaller set is an optimal Vertex
cover. The larger set is another Vertex cover. It is, in
fact, a non-optimal local optimum which is hard to overcome:
the majority of bits have to flip in order to escape.
If the instance consists of several independent copies of 
bipartite graphs, it is very likely that a panmictic EA 
will evolve a locally optimal configuration on at least 
one of the bipartite graphs. Then the algorithm fails to 
find a global optimum. 
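
The structure of a single component can be verified on a tiny instance. The Python sketch below builds a complete bipartite graph, checks that both sides are vertex covers in the standard edge-based sense, and confirms by brute force that the smaller side is optimal. The vertex numbering and function names are assumptions made for this illustration.

```python
from itertools import product


def complete_bipartite_edges(a, b):
    """Edges of K_{a,b}: left vertices 0..a-1, right vertices a..a+b-1."""
    return [(u, a + v) for u, v in product(range(a), range(b))]


def is_vertex_cover(edges, selected):
    """True if every edge has at least one endpoint in selected."""
    return all(u in selected or v in selected for u, v in edges)


def min_vertex_cover_size(num_vertices, edges):
    """Brute force over all bit strings; only feasible for tiny graphs."""
    best = num_vertices
    for bits in product((0, 1), repeat=num_vertices):
        selected = {v for v, bit in enumerate(bits) if bit}
        if is_vertex_cover(edges, selected):
            best = min(best, len(selected))
    return best


if __name__ == "__main__":
    a, b = 2, 4
    edges = complete_bipartite_edges(a, b)
    left, right = set(range(a)), set(range(a, a + b))
    print(is_vertex_cover(edges, left))    # smaller side covers all edges
    print(is_vertex_cover(edges, right))   # larger side covers all edges, too
    print(min_vertex_cover_size(a + b, edges))
```

Removing any single vertex from the larger side leaves some edge uncovered, which illustrates why that configuration cannot be improved by small changes.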


Island models perform better. Assume the topology
is the single-receiver model. In each migration
a 2-point crossover is performed between migrants and
the individual on the target island. All islands have
population size 1 for simplicity. We also assume that
the bipartite subgraphs are encoded in such a way
that each subgraph forms one block in the bit string.
This is a natural assumption as all subgraphs can be
clearly identified as building blocks. In addition, Jansen
et al. [46.59] presented an automated way of encoding
graphs in a crossover-friendly way, based on the degrees
of vertices.

The analysis in [46.7] shows the following. Assume
that the migration interval is at least τ ≥ n^{1+ε}
for some positive constant ε > 0. This choice implies
that all islands will evolve to configurations where
all bipartite graphs are either locally optimal or globally
optimal. With probability 1 − e^{−Ω(m)} we have that
for each bipartite graph at least a constant fraction
of all sender islands will have the globally optimal
configuration.

All that is left to do for the receiver island is to
rely on crossover combining all present good building
blocks. As two-point crossover can select one block
from an immigrant and the remainder from the current
solution on the receiver island, all good building blocks
have a good chance to be obtained. The island model
finds a global optimum within a polynomial number of
generations, with probability 1 − e^{−Ω(min{n^{ε/2}, m})}.


46.5 Speedups by Parallelization


46.5.1 A General Method for Analyzing Parallel EAs


We now finally discuss a method for estimating the
speedup by parallelization. Assume that, instead of running
a single EA, we run an island model where each
island runs the same EA. The question is by how much
the expected optimization time (i.e., the number of
generations until a global optimum is found) decreases,
compared to the single, panmictic EA. Recall that this
speedup is called weak orthodox speedup [46.14].

In the following we sometimes speak of the ex- 
pected parallel optimization time to emphasize that 
we are dealing with a parallel system. If the num- 
ber of islands and the population size on each island 
is fixed, we can simply multiply this time by a fixed 
factor to obtain the expected number of function evalu- 
ations. 
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Lässig and Sudholt [46.4] presented a method for 
estimating the expected optimization time of island 
models. It combines growth curves with a well-known 
method for the analysis of evolutionary algorithms. 
The fitness-level method or method of f-based parti- 
tions [46.60] is a simple, yet powerful technique. The 
idea is to partition the search space into non-empty 
sets Aj, A2, . . ., Am such that the following holds: 


• for each $1 \le i < m$ each search point in $A_i$ has
a strictly worse fitness than each search point
in $A_{i+1}$, and

• $A_m$ contains all global optima.


The described ordering with respect to the fitness f
is often denoted


$$A_1 <_f A_2 <_f \cdots <_f A_m .$$


Note that $A_m$ can also be redefined towards containing all search points of some desired quality if the goal is not global optimization.

We say that a population-based algorithm A (including populations of size 1) is in $A_i$ or on fitness level i if the best search point in the population is in $A_i$. Now, assume that we know that $s_i$ is a lower bound on the probability that the algorithm finds a solution in $A_{i+1} \cup \cdots \cup A_m$ if it is currently in $A_i$. Then the reciprocal $1/s_i$ is an upper bound on the expected time until this event happens. If the algorithm is elitist (i.e., it never loses the current best solution), then it will never decrease its current fitness level. A sufficient condition for finding an optimal solution is that all sets $A_1, A_2, \dots, A_{m-1}$ are left in the described manner at least once. This implies the following bound on the expected optimization time.


Theorem 46.1 Wegener [46.60]

Consider an elitist EA and assume a fitness-level partition $A_1 <_f \cdots <_f A_m$, where $A_m$ is the set of global optima. Let $s_i$ be a lower bound for the probability that in one generation the EA finds a search point in $A_{i+1} \cup \cdots \cup A_m$ if the best individual in the parent population is in $A_i$. Then the expected optimization time is bounded by


$$\sum_{i=1}^{m-1} \frac{1}{s_i} .$$
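
To illustrate how the bound of Theorem 46.1 is evaluated, the following minimal sketch sums the reciprocals of given probability bounds. The function name `fitness_level_bound` and the example list of bounds (n levels, each with $s_i = 1/(en)$) are illustrative assumptions, not part of the theorem.

```python
import math

def fitness_level_bound(s):
    """Upper bound of Theorem 46.1: sum of 1/s_i over all non-optimal levels."""
    if any(not (0.0 < p <= 1.0) for p in s):
        raise ValueError("each s_i must be a probability in (0, 1]")
    return sum(1.0 / p for p in s)

n = 100
# Assumed example: n non-optimal levels, each left with probability >= 1/(e*n).
uniform_levels = [1.0 / (math.e * n)] * n
print(fitness_level_bound(uniform_levels))  # e * n^2, up to rounding
```

Any other list of lower bounds can be passed in the same way; the quality of the result depends entirely on how good the bounds $s_i$ are.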


The above bound applies to all elitist algorithms. It is generally applicable and often quite versatile, as we can freely choose the partition $A_1, \dots, A_m$. The challenge is to find such a partition and to find corresponding probability bounds $s_1, \dots, s_{m-1}$ for finding improvements. Many papers have shown that this method, applied explicitly or implicitly, yields tight bounds on the expected optimization time of EAs for various problems [46.32, 42, 48]. It can also be used as part of a more general analysis [46.61, 62].

We are being pessimistic in assuming that every fitness level has to be left. In reality, several fitness levels might be skipped. The fitness-level method often yields good bounds if not too many levels are skipped, and if the probability bounds $s_i$ are good estimates for the real probabilities of finding a better fitness-level set. Note that the lower bound $s_i$ must apply regardless of the precise search point(s) in $A_i$ present in the population, hence we need to consider the worst-case probability of escaping from $A_i$.

Nevertheless, the fitness-level method often yields 
tight bounds. Sudholt [46.63] recently developed 
a lower-bound method based on fitness levels, which 
in each case shows that the upper bound is tight. Also, 
Lehre [46.64] recently presented an extension of the 
method to non-elitist algorithms. Asymptotically, the 
same bound as in Theorem 46.1 applies, if some ad- 
ditional conditions on the selection pressure and the 
population size are fulfilled. For the sake of simplicity, 
we focus on elitist algorithms in the following. 

If $s_i$ denotes the probability of a single offspring finding an improvement, this probability can be increased by using $\lambda$ offspring in parallel. We have already seen in Sect. 46.1 how $\lambda$ independent trials can increase or amplify a success probability p to $1 - (1-p)^\lambda$. The same reasoning applies to the probability $s_i$ for finding an improvement on the current best level. Figure 46.1 has shown how this probability increases with the number of trials. Figure 46.12 shows how the expected time for having a success decreases with the number of offspring. In fact, the curves in Fig. 46.12 are just reciprocals of those in the previous Fig. 46.1.
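
The amplification of a success probability and the resulting expected parallel time can be sketched numerically as follows. The function names `amplified` and `expected_parallel_time` and the chosen values of p are illustrative assumptions.

```python
def amplified(p, lam):
    """Probability that at least one of lam independent trials succeeds."""
    return 1.0 - (1.0 - p) ** lam

def expected_parallel_time(p, lam):
    """Expected number of generations until some offspring has a success."""
    return 1.0 / amplified(p, lam)

for p in (0.1, 0.01):
    for lam in (1, 2, 4, 8):
        speedup = expected_parallel_time(p, 1) / expected_parallel_time(p, lam)
        print(f"p={p}, lambda={lam}: speedup {speedup:.2f}")
```

The printed speedups never exceed the number of trials, and they come closer to it the smaller p is.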

Figure 46.12 shows that the speedup can be close to linear (in a strict, non-asymptotic sense), especially for low success probabilities. As the probability of increasing the current fitness level i is at least $1 - (1 - s_i)^\lambda$, we obtain the following.


Theorem 46.2

Consider an elitist EA creating $\lambda$ offspring independently in each generation. Assume a fitness-level partition $A_1 <_f \cdots <_f A_m$, where $A_m$ is the set of global optima. Let $s_i$ be a lower bound for the probability that in one generation a single offspring finds a search point in $A_{i+1} \cup \cdots \cup A_m$ if the best individual in the parent population is in $A_i$. Then the expected optimization time is bounded by


$$\sum_{i=1}^{m-1} \frac{1}{1-(1-s_i)^\lambda} \;\le\; m - 1 + \frac{1}{\lambda}\sum_{i=1}^{m-1} \frac{1}{s_i} .$$


Note that the first bound for $\lambda = 1$ reproduces the previous upper bound from Theorem 46.1. For the second bound we used

$$\frac{1}{1-(1-s_i)^\lambda} \;\le\; 1 + \frac{1}{\lambda s_i} \tag{46.2}$$

where the inequality was proposed by Jon Rowe (personal communication, 2011); it can be proven by a simple induction.
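
As a numerical sanity check (not a proof), the following sketch evaluates both bounds of Theorem 46.2 and both sides of inequality (46.2). The function names and the tested values are illustrative assumptions.

```python
def exact_term(s, lam):
    """Left-hand side of (46.2): 1 / (1 - (1 - s)^lam)."""
    return 1.0 / (1.0 - (1.0 - s) ** lam)

def simple_term(s, lam):
    """Right-hand side of (46.2): 1 + 1 / (lam * s)."""
    return 1.0 + 1.0 / (lam * s)

def offspring_bounds(s_list, lam):
    """Both bounds of Theorem 46.2; s_list holds s_1, ..., s_{m-1}."""
    first = sum(exact_term(s, lam) for s in s_list)
    second = len(s_list) + sum(1.0 / s for s in s_list) / lam
    return first, second

for s in (0.001, 0.1, 0.5, 1.0):
    for lam in (1, 2, 10, 100):
        print(s, lam, exact_term(s, lam) <= simple_term(s, lam))
```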

Our estimate of the probability for an improvement 
increases with the number of islands on the current best 
fitness level. In a spatially structured EA these growth 
curves are non-trivial. Especially with a sparse migra- 
tion topology, information about the current best fitness 
level is typically propagated quite slowly. The increased 
exploration slows down exploitation. Still, even sparse 
topologies lead to drastically improved upper bounds, 
when compared to the simple bound for a sequential 
EA from Theorem 46.1. The precise bounds crucially 
depend on the particular topology. 


Fig. 46.12 Plots of the expected parallel time until an offspring population of size $\lambda$ has a success, if each offspring independently has a success probability of p (horizontal axis: number of independent trials; vertical axis: expected parallel time). The dashed lines indicate a perfect linear speedup


We first consider a setting where migration always transmits the current best fitness level and migration occurs in every generation. It is possible to adapt the results to account for larger migration intervals. One way of doing this is to redefine $s_i$ to represent a lower bound on the probability of finding an improvement in a time period between migrations. Then we obtain an upper bound on the expected number of migrations. For the sake of simplicity, we only consider the case $\tau = 1$ in the following.

The following theorem was presented in Lässig and Sudholt [46.6]; it is a refined special case of previous results [46.4]. The main proof idea is to combine the investigation of growth curves with the consideration of amplified success probabilities.


Theorem 46.3 Lässig and Sudholt [46.6]

Consider an island model with $\mu$ islands where each island runs an elitist EA. In every iteration each island sends copies of its best individual to all neighboring islands (i.e., $\tau = 1$). Each island incorporates the best out of its own individuals and its immigrants.

For every partition $A_1 <_f \cdots <_f A_m$, if $s_i$ is a lower bound for the probability that in one generation an island in $A_i$ finds a search point in $A_{i+1} \cup \cdots \cup A_m$, then the expected parallel optimization time is bounded by:

1. $2\sum_{i=1}^{m-1} \frac{1}{s_i^{1/2}} + \frac{1}{\mu}\sum_{i=1}^{m-1} \frac{1}{s_i}$ for every unidirectional ring (a ring with edges in one direction) or any other strongly connected topology,

2. $3\sum_{i=1}^{m-1} \frac{1}{s_i^{1/3}} + \frac{1}{\mu}\sum_{i=1}^{m-1} \frac{1}{s_i}$ for every undirected grid or torus graph with side lengths at least $\sqrt{\mu} \times \sqrt{\mu}$,

3. $m - 1 + \frac{1}{\mu}\sum_{i=1}^{m-1} \frac{1}{s_i}$ for the complete topology $K_\mu$.


Note that the bound for the complete topology $K_\mu$ is equal to the upper bound for offspring populations, Theorem 46.2 (with $\lambda = \mu$). This makes sense as an island model with a complete topology propagates the current best fitness level like an offspring population.
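
The three bounds of Theorem 46.3 are straightforward to evaluate for a given list of probability bounds. The sketch below does so; the function names and the example values (64 levels with $s_i = 1/(64e)$ and four islands) are illustrative assumptions.

```python
import math

def speedup_term(s, mu):
    """Second additive term: (1/mu) * sum of 1/s_i."""
    return sum(1.0 / p for p in s) / mu

def ring_bound(s, mu):
    """Bound 1: unidirectional ring or any strongly connected topology."""
    return 2.0 * sum(p ** -0.5 for p in s) + speedup_term(s, mu)

def torus_bound(s, mu):
    """Bound 2: undirected grid or torus."""
    return 3.0 * sum(p ** (-1.0 / 3.0) for p in s) + speedup_term(s, mu)

def complete_bound(s, mu):
    """Bound 3: complete topology; len(s) equals m - 1."""
    return len(s) + speedup_term(s, mu)

n, mu = 64, 4
s = [1.0 / (math.e * n)] * n   # assumed example values
print(ring_bound(s, mu), torus_bound(s, mu), complete_bound(s, mu))
```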

All bounds in Theorem 46.3 consist of two additive terms. The second term


$$\frac{1}{\mu}\sum_{i=1}^{m-1} \frac{1}{s_i}$$


represents a perfect linear speedup, compared to the upper bound from Theorem 46.1. The larger we choose the number of islands $\mu$, the smaller this term becomes. The first additive term is related to the growth curves of the current best fitness level in the island model. The denser the topology, the faster information is spread, and the smaller this term becomes. Note that it is independent of $\mu$. It can be regarded as the term limiting the degree of parallelizability. We can increase the number of islands in order to decrease the second term, but we cannot decrease the first term by changing $\mu$.

This allows for immediate conclusions about cases where we obtain an asymptotic linear speedup over a single-island EA. For all choices of $\mu$ where the second term is asymptotically no smaller than the first term, the upper bound is smaller than the upper bound from Theorem 46.1 by a factor of order $\mu$. This is an asymptotic linear speedup if the upper bound from Theorem 46.1 is asymptotically tight. (If it is not, we can only compare upper bounds for a sequential and a parallel EA.)

We illustrate this with a simple and well-known test function from pseudo-Boolean optimization. The algorithm considered is an island model where each island runs a (1 + 1) EA; the island model is also called parallel (1 + 1) EA. The function


$$\mathrm{LO}(x) := \sum_{i=1}^{n} \prod_{j=1}^{i} x_j \qquad \text{(LeadingOnes)}$$


counts the number of leading ones in the bit string. We choose the canonical partition where $A_i$ contains all search points with fitness i, i.e., i leading ones. For any set $A_i$, $0 \le i \le n-1$, we use the following lower bound on the probability for an improvement.

An improvement occurs if the first 0-bit is flipped from 0 to 1 and no other bit flips. The probability of flipping the mentioned 0-bit is 1/n as each bit is flipped independently with probability 1/n. The probability of not flipping any other bit is $(1 - 1/n)^{n-1}$. We use the common estimate $(1 - 1/n)^{n-1} \ge 1/e$, where e = exp(1) = 2.718..., so the probability of an improvement is at least $s_i \ge 1/(en)$ for all $0 \le i \le n-1$. Plugging this into Theorem 46.3, the second term is $\frac{1}{\mu} \cdot en^2$ for all bounds. The first terms are

$$2n \cdot (en)^{1/2} = 2e^{1/2} n^{3/2}$$

for the ring,

$$3n \cdot (en)^{1/3} = 3e^{1/3} n^{4/3}$$

for the torus, and n for the complete graph, respectively.
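
The estimate $s_i \ge 1/(en)$ and the resulting terms can be checked numerically. In the sketch below the function names `improvement_probability` and `lo_terms` are illustrative assumptions; the formulas are the ones derived above.

```python
import math

def improvement_probability(n):
    """Exact probability of flipping one specific bit and none of the other n-1."""
    return (1.0 / n) * (1.0 - 1.0 / n) ** (n - 1)

def lo_terms(n, mu):
    """Terms of Theorem 46.3 for n fitness levels with s_i = 1/(e*n)."""
    s = 1.0 / (math.e * n)
    second = n / (mu * s)                       # (1/mu) * e * n^2
    first = {
        "ring": 2.0 * n * s ** -0.5,            # 2 e^(1/2) n^(3/2)
        "torus": 3.0 * n * s ** (-1.0 / 3.0),   # 3 e^(1/3) n^(4/3)
        "complete": float(n),
    }
    return first, second

first, second = lo_terms(100, 10)
print(first, second)
```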

For the ring, choosing $\mu = O(n^{1/2})$ islands results in an expected parallel time of $O(n^2/\mu)$ as the second term is asymptotically not smaller than the first one. This is asymptotically smaller by a factor of $1/\mu$ than the expected optimization time of a single (1 + 1) EA, $O(n^2)$ [46.42]. Hence, each choice of $\mu$ up to $\mu = O(n^{1/2})$ gives a linear speedup. For the torus we obtain a linear speedup for $\mu = O(n^{2/3})$ in the same fashion. For the complete graph this even holds for $\mu = O(n)$. One can see here that the island model can decrease the expected parallel running time by significant polynomial factors.
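
For readers who want to experiment, the parallel (1 + 1) EA on LO with a complete topology and migration in every generation can be simulated in a few lines. This is a minimal sketch under stated assumptions: the function names, the uniformly random initialization and the tie-breaking rule (an island keeps its own individual unless an immigrant is strictly better) are illustrative choices and are not prescribed by the analysis.

```python
import random

def leading_ones(x):
    """Number of leading ones in the bit list x."""
    count = 0
    for bit in x:
        if bit != 1:
            break
        count += 1
    return count

def mutate(x, rng):
    """Standard bit mutation: flip each bit independently with probability 1/n."""
    n = len(x)
    return [1 - bit if rng.random() < 1.0 / n else bit for bit in x]

def parallel_one_plus_one_ea(n, mu, seed=0):
    """Return the number of generations until some island holds the optimum."""
    rng = random.Random(seed)
    islands = [[rng.randint(0, 1) for _ in range(n)] for _ in range(mu)]
    generations = 0
    while max(leading_ones(x) for x in islands) < n:
        generations += 1
        # One (1+1) EA step per island: accept the offspring if it is not worse.
        for k in range(mu):
            offspring = mutate(islands[k], rng)
            if leading_ones(offspring) >= leading_ones(islands[k]):
                islands[k] = offspring
        # Complete topology: every island receives a copy of the best individual.
        best = max(islands, key=leading_ones)
        best_fitness = leading_ones(best)
        islands = [list(best) if leading_ones(x) < best_fitness else x
                   for x in islands]
    return generations

print(parallel_one_plus_one_ea(30, 4))
```

Averaging the returned value over many seeds for different values of `mu` gives an empirical impression of the speedup.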

Table 46.3 lists expected parallel optimization time bounds for several well-known pseudo-Boolean functions. The above analysis for LO generalizes to all unimodal functions. A function is called unimodal here if every non-optimal search point has a better Hamming neighbor, i.e., a better search point can be reached by flipping exactly one specific bit. $\mathrm{ONEMAX}(x) = \sum_{i=1}^{n} x_i$ counts the number of ones, hence modeling a simple hill climbing task. Finally, $\mathrm{Jump}_k$ [46.42] is a multimodal function of tunable difficulty. An EA typically has to make a jump by flipping k bits simultaneously, where $2 \le k \le n$. The (1 + 1) EA has an expected optimization time of $O(n^k)$, hence growing rapidly with increasing k.


Table 46.3 Upper bounds for expected parallel optimization times (number of generations) for the (1 + 1) EA and the corresponding island model with $\mu$ islands in pseudo-Boolean optimization. The last but one column is for any unimodal function with d function values. The number of function evaluations in the island model is larger than the number of generations by a factor of $\mu$

Algorithm                                  ONEMAX                        LO                            Unimodal, d values            $\mathrm{Jump}_k$, $k \ge 3$
(1+1) EA                                   $O(n \log n)$ [46.42]         $O(n^2)$ [46.42]              $O(nd)$                       $O(n^k)$ [46.42]
Island model on ring                       $O(n + (n \log n)/\mu)$       $O(n^{3/2} + n^2/\mu)$        $O(dn^{1/2} + dn/\mu)$        $O(n^{k/2} + n^k/\mu)$
Island model on torus                      $O(n + (n \log n)/\mu)$       $O(n^{4/3} + n^2/\mu)$        $O(dn^{1/3} + dn/\mu)$        $O(n^{k/3} + n^k/\mu)$
Island model on $K_\mu$ / $(1+\mu)$ EA     $O(n + (n \log n)/\mu)$       $O(n + n^2/\mu)$              $O(d + dn/\mu)$               $O(n + n^k/\mu)$

One can see that the island model leads to drasti- 
cally reduced parallel optimization times. This particu- 
larly holds for problems where improvements are hard 
to find. 

We remark that Lässig and Sudholt [46.4] also considered parallel EAs where migration is not always successful in transmitting information about the current best fitness level. This includes the case where crossover is used during migration and crossover has a certain probability of being disruptive. We do obtain upper bounds on the expected optimization time if we know a lower bound $p^+$ on the probability of a successful transmission. The bounds depend on $p^+$; the degree of this dependence is determined by the topology. For simplicity we only focus on the deterministic case here.


46.5.2 Speedups in Combinatorial 
Optimization 


The techniques are also applicable in combinatorial optimization. We review two examples here, presented in [46.6]. Scharnow et al. [46.32] considered the classical sorting problem as an optimization problem: given a sequence of n distinct elements from a totally ordered set, sorting is the problem of maximizing sortedness. Without loss of generality the elements are 1, ..., n; then the aim is to find the permutation $\pi_{\mathrm{opt}}$ such that $(\pi_{\mathrm{opt}}(1), \dots, \pi_{\mathrm{opt}}(n))$ is the sorted sequence.

The search space is the set of all permutations $\pi$ on 1, ..., n. Two different operators are used for mutation. An exchange chooses two indices $i \ne j$ uniformly at random from $\{1, \dots, n\}$ and exchanges the entries at positions i and j. A jump chooses two indices in the same fashion. The entry at i is put at position j and all entries in between are shifted accordingly. For instance, a jump with i = 2 and j = 5 would turn (1, 2, 3, 4, 5, 6) into (1, 3, 4, 5, 2, 6).
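
The two elementary operations can be sketched as follows. Positions are 1-based as in the text; the function names and the tuple representation are illustrative assumptions.

```python
def exchange(perm, i, j):
    """Swap the entries at positions i and j (1-based)."""
    result = list(perm)
    result[i - 1], result[j - 1] = result[j - 1], result[i - 1]
    return tuple(result)

def jump(perm, i, j):
    """Move the entry at position i to position j; entries in between shift."""
    result = list(perm)
    entry = result.pop(i - 1)
    result.insert(j - 1, entry)
    return tuple(result)

print(jump((1, 2, 3, 4, 5, 6), 2, 5))  # (1, 3, 4, 5, 2, 6), as in the text
```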

The (1 + 1) EA draws S according to a Poisson distribution with parameter $\lambda = 1$ and then performs S + 1 elementary operations. Each operation is either an exchange or a jump, where the decision is made independently and uniformly for each elementary operation. The resulting offspring replaces its parent if its fitness is not worse. The fitness function $f_{\pi_{\mathrm{opt}}}(\pi)$ describes the sortedness of $(\pi(1), \dots, \pi(n))$. As in [46.32], we consider the following measures of sortedness:


• INV($\pi$) measures the number of pairs (i, j), $1 \le i < j \le n$, such that $\pi(i) < \pi(j)$ (pairs in correct order),

• HAM($\pi$) measures the number of indices i such that $\pi(i) = i$ (elements at the correct position),

• LAS($\pi$) equals the largest k such that $\pi(i_1) < \cdots < \pi(i_k)$ for some $i_1 < \cdots < i_k$ (length of the longest ascending subsequence),

• EXC($\pi$) equals the minimal number of exchanges (of pairs $\pi(i)$ and $\pi(j)$) to sort the sequence, leading to a minimization problem.
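
The four measures can be computed directly from a permutation given as a sequence of the values 1, ..., n. The sketch below is one possible implementation; the function names are illustrative assumptions. EXC uses the standard fact that the minimal number of exchanges equals n minus the number of cycles of the permutation, and LAS uses the usual binary-search technique for longest increasing subsequences.

```python
from bisect import bisect_left

def inv(perm):
    """Number of pairs i < j with perm[i] < perm[j] (pairs in correct order)."""
    n = len(perm)
    return sum(1 for i in range(n) for j in range(i + 1, n) if perm[i] < perm[j])

def ham(perm):
    """Number of elements that are at their correct position."""
    return sum(1 for pos, value in enumerate(perm, start=1) if value == pos)

def las(perm):
    """Length of the longest ascending subsequence."""
    tails = []  # tails[k]: smallest last element of an ascending run of length k+1
    for value in perm:
        k = bisect_left(tails, value)
        if k == len(tails):
            tails.append(value)
        else:
            tails[k] = value
    return len(tails)

def exc(perm):
    """Minimal number of exchanges needed to sort: n minus number of cycles."""
    n = len(perm)
    seen = [False] * n
    cycles = 0
    for start in range(n):
        if not seen[start]:
            cycles += 1
            k = start
            while not seen[k]:
                seen[k] = True
                k = perm[k] - 1
    return n - cycles

print(inv((2, 1, 3)), ham((2, 1, 3)), las((2, 1, 3)), exc((2, 1, 3)))
```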


The expected optimization time of the (1 + 1) EA is $\Omega(n^2)$ and $O(n^2 \log n)$ for all fitness functions. The upper bound is tight for LAS, and it is believed to be tight for INV, HAM, and EXC as well [46.32]. Theorem 46.3 yields the following results, also shown in Table 46.4. For INV, all topologies guarantee a linear speedup only in case $\mu = O(\log n)$ and the bound $O(n^2 \log n)$ for the (1 + 1) EA is tight. The other functions allow for linear speedups up to $\mu = O(n^{1/2} \log n)$ (ring), $\mu = O(n^{2/3} \log n)$ (torus), and $\mu = O(n \log n)$ ($K_\mu$), respectively (again assuming tightness, otherwise up to a factor of log n). Note how the results improve with the density of the topology. HAM, LAS, and EXC yield much better guarantees for the island model than INV. This is surprising as there is no visible performance difference for a single (1 + 1) EA.


Table 46.4 Upper bounds for expected parallel optimization times for the (1 + 1) EA and the corresponding island model with $\mu$ islands for sorting n objects

Algorithm                                  INV                               HAM, LAS, EXC
(1+1) EA                                   $O(n^2 \log n)$ [46.32]           $O(n^2 \log n)$ [46.32]
Island model on ring                       $O(n^2 + (n^2 \log n)/\mu)$       $O(n^{3/2} + (n^2 \log n)/\mu)$
Island model on torus                      $O(n^2 + (n^2 \log n)/\mu)$       $O(n^{4/3} + (n^2 \log n)/\mu)$
Island model on $K_\mu$ / $(1+\mu)$ EA     $O(n^2 + (n^2 \log n)/\mu)$       $O(n + (n^2 \log n)/\mu)$


An explanation is that INV leads to $\binom{n}{2}$ non-optimal fitness levels that are quite easy to overcome. HAM, LAS, and EXC have only n non-optimal fitness levels that are more difficult. For a single EA both settings are equally difficult, leading to asymptotically equal expected times (assuming all upper bounds are tight). However, the latter setting is easier to parallelize than the former as it is easier to amplify small success probabilities.


Table 46.5 Worst-case expected parallel optimization times for the (1 + 1) EA and the corresponding island model with $\mu$ islands for the SSSP on graphs with n vertices and m edges. The value $\ell$ is the maximum number of edges on any shortest path from the source to any vertex and $\ell^* := \max\{\ell, \ln n\}$. The second lines show a range of $\mu$-values yielding a linear speedup, apart from a factor $\ln(en/\ell)$

Algorithm                                  Vertex-based mutation [46.32]                                Edge-based mutation [46.65]
(1+1) EA                                   $O(n^2 \ell^*)$ [46.34]                                      $O(m \ell^*)$ [46.65]
Island model on ring                       $O(n^{3/2}\ell^{1/2} + n^2 \ell \ln(en/\ell)/\mu)$           $O(m^{1/2} n^{1/2} \ell^{1/2} + m \ell \ln(en/\ell)/\mu)$
                                           $\mu = O((n\ell)^{1/2})$                                     $\mu = O((m/n \cdot \ell)^{1/2})$
Island model on torus                      $O(n^{4/3}\ell^{1/3} + n^2 \ell \ln(en/\ell)/\mu)$           $O(m^{1/3} n^{2/3} \ell^{1/3} + m \ell \ln(en/\ell)/\mu)$
                                           $\mu = O((n\ell)^{2/3})$                                     $\mu = O((m/n \cdot \ell)^{2/3})$
Island model on $K_\mu$ / $(1+\mu)$ EA     $O(n + n^2 \ell \ln(en/\ell)/\mu)$                           $O(n + m \ell \ln(en/\ell)/\mu)$
                                           $\mu = O(n\ell)$                                             $\mu = O(m/n \cdot \ell)$


We also consider parallel variants of the (1 + 1) EA for the single source shortest path problem (SSSP) [46.32]. An SSSP instance is given by an undirected connected graph with vertices $\{1, \dots, n\}$ and a distance matrix $D = (d_{ij})_{1 \le i,j \le n}$, where $d_{ij} \in \mathbb{R}_0^+ \cup \{\infty\}$ defines the length value for given edges from node i to node j. We are searching for shortest paths from a node s (without loss of generality s = n) to each other node $1 \le i \le n-1$.

A candidate solution is represented as a shortest paths tree, a tree rooted at s with directed shortest paths to all other vertices. We define a search point x as a vector of length n − 1, where position i describes the predecessor node $x_i$ of node i in the shortest path tree. Note that infeasible solutions are possible if the predecessors do not encode a tree. An elementary mutation chooses a vertex i uniformly at random and replaces its predecessor $x_i$ by a vertex chosen uniformly at random from $\{1, \dots, n\} \setminus \{i, x_i\}$. We call this a vertex-based mutation. Doerr et al. [46.65] proposed an edge-based mutation operator. An edge is chosen uniformly at random, and the edge is made a predecessor edge for its end node.

The (1 + 1) EA uses either vertex-based mutations or edge-based ones. It creates an offspring using S elementary mutations, where S is chosen according to a Poisson distribution with $\lambda = 1$. The offspring is accepted if no distance to any vertex has become worse.
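
A single vertex-based elementary mutation can be sketched as follows. The function name, the list representation (entry `x[i-1]` holds the predecessor of vertex i) and the assumption that the mutated vertex is drawn from 1, ..., n − 1 (the positions of the search point) are illustrative choices; the sketch does not check feasibility or evaluate distances.

```python
import random

def vertex_based_mutation(x, n, rng):
    """Return a copy of x in which one vertex has a new random predecessor.

    x has length n - 1; x[i - 1] is the predecessor of vertex i and the
    source is vertex n. Requires n >= 3 so that a new predecessor exists.
    """
    y = list(x)
    i = rng.randint(1, n - 1)  # vertex whose predecessor is replaced
    # New predecessor: any vertex except i itself and the current predecessor.
    candidates = [v for v in range(1, n + 1) if v != i and v != y[i - 1]]
    y[i - 1] = rng.choice(candidates)
    return y

rng = random.Random(0)
print(vertex_based_mutation([4, 1, 2], 4, rng))
```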

Applying Theorem 46.3 along with a layering argument as described at the end of Sect. 46.3.4 yields the bounds on the expected parallel optimization time shown in Table 46.5.

The upper bounds for the island models with constant $\mu$ match the expected time of the (1 + 1) EA if $\ell = O(1)$ or $\ell = \Omega(n)$ as then $\ell \ln(en/\ell) = O(\ell^*)$. In other cases, the upper bounds are off by a factor of $\ln(en/\ell)$. Table 46.5 also shows a range of $\mu$-values for which the speedup is linear (if $\ell = O(1)$ or $\ell = \Omega(n)$) or almost linear, that is, when disregarding the $\ln(en/\ell)$ term.

Note how the possible speedups significantly increase with the density of the topology. The speedups also depend on the graph instance and the maximum number of edges $\ell$ on any shortest path. For a single (1 + 1) EA edge-based mutations are more effective than vertex-based mutations [46.65]. Island models with edge-based mutations cannot be parallelized as effectively as those with vertex-based mutations if the graph is sparse, i.e., $m = o(n^2)$. Then the number of islands that guarantees a linear speedup is smaller for edge-based mutations than for vertex-based mutations. The reason is that with a more efficient mutation operator there is less potential for further speedups with a parallel EA.


Table 46.6 Asymptotic bounds for expected parallel running times and expected sequential running times for the parallel (1 + 1) EA with adaptive population models

Function                          Scheme    Sequential         Parallel
ONEMAX                            A         $O(n \log n)$      $O(n \log n)$
                                  B         $O(n \log n)$      $O(n)$
LO                                A         $O(n^2)$           $O(n \log n)$
                                  B         $O(n^2)$           $O(n)$
Unimodal f with d f-values        A         $O(dn)$            $O(d \log n)$
                                  B         $O(dn)$            $O(d + \log n)$
$\mathrm{Jump}_k$ with $k \ge 2$  A         $O(n^k)$           $O(n \log n)$
                                  B         $O(n^k)$           $O(n + k \log n)$


46.5.3 Adaptive Numbers of Islands 


Theorem 46.3 presents a powerful tool for deter- 
mining the number of islands that give an asymp- 
totic linear speedup. However, it would be even more 
desirable to have an adaptive system that automati- 
cally finds the ideal number of islands throughout the 
run. 

In [46.5] Lässig and Sudholt proposed and analyzed 
two simple adaptive schemes for choosing the number 
of islands. Both schemes check whether in the cur- 
rent generation some island has found an improvement 
over the current best fitness in the system. If no is- 
land has found an improvement, the number of islands 
is doubled. This can be implemented, for instance, by 
copying each island. New processors can be allocated 
to host these islands in large clusters or by using cloud 
computing. 

If some island has found an improvement, the 
number of islands is reduced by removing selected 
islands from the system and de-allocating resources. 
Both schemes differ in the way they decrease the 
number of islands. The first scheme, simply called 
Scheme A, only keeps one island containing a cur- 
rent best solution. Scheme B halves the number of 
islands. Both schemes use complete topologies, so all 
remaining islands will contain current best individuals 
afterwards. 
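
The behavior of both schemes can be explored with a small abstract simulation. The sketch below makes several simplifying assumptions that are not part of the original analysis: each island leaves level i independently with exactly the given probability `s[i]`, an improvement always leads to the next level (no levels are skipped), and the function name `adaptive_islands` is hypothetical.

```python
import random

def adaptive_islands(s, scheme, seed=0):
    """Simulate Scheme "A" or "B"; return (parallel_time, sequential_time)."""
    rng = random.Random(seed)
    mu, level = 1, 0
    parallel = sequential = 0
    while level < len(s):
        parallel += 1        # one generation
        sequential += mu     # one function evaluation per island
        improved = any(rng.random() < s[level] for _ in range(mu))
        if improved:
            level += 1
            # Scheme A keeps a single island, Scheme B halves the number.
            mu = 1 if scheme == "A" else max(1, mu // 2)
        else:
            mu *= 2          # no improvement: double the number of islands
    return parallel, sequential

for scheme in ("A", "B"):
    print(scheme, adaptive_islands([0.01] * 50, scheme))
```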

Both mechanisms lead to optimal speedups in many 
cases. Doubling the number of islands may seem aggressive,
but the analysis shows that the probability of
allocating far more islands than necessary is very
small. The authors considered the expected sequential
optimization time, defined as the number of function 
evaluations, to measure the total effort over time. With 
both schemes it is guaranteed that the expected se- 
quential time does not exceed the simple bound for 
a sequential EA from Theorem 46.1, asymptotically. 
The expected parallel times on each fitness level can, 
roughly speaking, be replaced by their logarithms. 

The following is a slight simplification of results 
in [46.5]. 


Theorem 46.4 Lässig and Sudholt [46.5]

Given an f-based partition $A_1, \dots, A_m$ and lower bounds $s_1, \dots, s_{m-1}$ on the probability of a single island finding an improvement, the expected sequential times for island models using a complete topology and either Scheme A or Scheme B are bounded by


If each set $A_i$ contains only a single fitness value then also the expected parallel time is bounded by


$$4\sum_{i=1}^{m-1} \log\left(\frac{2}{s_i}\right) .$$


Actually, for Scheme A we can obtain slightly better 
constants than the ones stated in Theorem 46.4. How- 
ever, with a more detailed analysis one can show that 
Scheme B can perform much better than Scheme A. 
The work of Lässig and Sudholt [46.5] contains a more
refined upper bound for Scheme B. We only show a special
case where the fitness levels become increasingly
harder. Then it makes sense to only halve the number 
of islands when an improvement is found, instead of re- 
setting the number of islands to 1. 


Theorem 46.5 Lässig and Sudholt [46.5]

Consider an f-based partition $A_1, \dots, A_m$, where each set $A_i$ contains only a single fitness value and the probability bounds satisfy $s_1 \ge s_2 \ge \cdots \ge s_{m-1}$. Then the expected parallel running time for an island model using a complete topology and Scheme B is bounded by


$$3(m-2) + \log\left(\frac{1}{s_{m-1}}\right) .$$


Example applications for a parallel (1 + 1) EA in Table 46.6 show that Scheme B can automatically lead to the same speedups as when using an optimal number of islands. This holds for ONEMAX, LO, and the general bound for unimodal functions. For $\mathrm{Jump}_k$ it also holds in the most relevant cases, when $k = O(n/\log n)$, as then the expected parallel time is O(n).
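
The parallel-time bounds of Theorems 46.4 and 46.5 can be evaluated for concrete probability bounds as sketched below. The function names are illustrative, and the base of the logarithm is an assumption made here: it is not stated in the text above, and base 2 is used in this sketch.

```python
import math

def parallel_bound_thm_46_4(s):
    """4 * sum of log(2 / s_i); s holds s_1, ..., s_{m-1} (log base 2 assumed)."""
    return 4.0 * sum(math.log2(2.0 / p) for p in s)

def parallel_bound_thm_46_5(s):
    """3(m - 2) + log(1 / s_{m-1}) for Scheme B (log base 2 assumed)."""
    if any(a < b for a, b in zip(s, s[1:])):
        raise ValueError("requires s_1 >= s_2 >= ... >= s_{m-1}")
    m = len(s) + 1
    return 3.0 * (m - 2) + math.log2(1.0 / s[-1])

n = 100
lo_bounds = [1.0 / (math.e * n)] * n   # the bounds used for LO earlier
print(parallel_bound_thm_46_4(lo_bounds), parallel_bound_thm_46_5(lo_bounds))
```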

We conclude that simply doubling or halving the
number of islands represents a simple and effective
mechanism for finding optimal parameters adaptively.


46.6 Conclusions


Parallel evolutionary algorithms can effectively reduce
computation time and at the same time lead to an
increased exploration and better diversity, compared to
sequential evolutionary algorithms.

We have surveyed various forms of parallel EAs, 
from independent runs to island models and cellular 
EAs. Different lines of research have been discussed that 
give insight into the working principles behind parallel 
EAs. This includes the spread of information, growth 
curves for current best solutions, and takeover times. 

A recurring theme was the possible speedup that can 
be achieved with parallel EAs. We have elaborated on 
the reasons why superlinear speedups are possible in 
practice. Rigorous runtime analysis has given examples 
where parallel EAs excel over sequential algorithms, 
with regard to the number of generations or the num- 
ber of function evaluations until a global optimum is 
found. The final section has covered a method for esti- 
mating the expected parallel optimization time of island 
models. The method is easy to apply as we can auto- 
matically transfer existing analyses for sequential EAs 
to a parallel version thereof. Examples have been given 
for pseudo-Boolean optimization and combinatorial op- 
timization. The results have also led to the discovery of 
a simple, yet surprisingly powerful adaptive scheme for 
choosing the number of islands. 

There are many possible avenues for future work. In 
the light of the development in computer architecture, it 
is important to develop parallel EAs that can run effec- 
tively on many cores. It also remains a crucial issue to 
increase our understanding of how design choices and 
parameters affect the performance of parallel EAs. Rig- 
orous runtime analysis has emerged recently as a new 
line of research that can give novel insights in this re- 
spect and opens new roads. The present results should 
be extended towards further algorithms, further prob- 
lems, and more detailed cost models that reflect the 
costs for communication in parallel architectures. It 
would also be interesting to derive further rigorous re- 
sults on takeover times in settings where propagation 
through migration is probabilistic. Finally, it is impor- 
tant to bring theory and practice together in order to 
create synergetic effects between the two areas. 


46.6.1 Further Reading 


This book chapter does not claim to be comprehen- 
sive. In fact, parallel evolutionary algorithms represent 


a vast research area with a long history. Early vari- 
ants of parallel evolutionary algorithms were devel- 
oped, studied, and applied more than 20 years ago. 
We, therefore, point the reader to references that may 
complement this chapter. Cantú-Paz [46.66] presented a
review of early literature and the history of parallel
EAs. The survey by Alba and Troya [46.37] contains 
detailed overviews of parallel EAs and their character- 
istics. 

This chapter does not cover implementation de- 
tails of parallel evolutionary algorithms. We refer to 
the excellent survey by Alba and Tomassini [46.38]. 
This survey also includes an overview of the theory 
of parallel EAs. The emphasis is different from this 
chapter and it can be used to complement this chap- 
ter. 

Tomassini’s text book [46.67] describes various 
forms of parallel EAs like island models, cellular 
EAs, and coevolution. It also presents many mathe- 
matical and experimental results that help understand 
how parallel EAs work. Furthermore, it contains an 
appendix dealing with the implementation of parallel 
EAs. 

The book edited by Alba et al. [46.39] takes
a broader scope on parallel models that also in- 
clude parallel evolutionary multiobjective optimization 
and parallel variants of swarm intelligence algorithms 
like particle swarm optimization and ant colony opti- 
mization. The book contains a part on parallel hard- 
ware as well as a number of applications of parallel 
metaheuristics. 

Alba’s edited book on parallel metaheuris- 
tics [46.40] has an even broader scope. It covers 
parallel variants of many common metaheuristics such 
as genetic algorithms, genetic programming, evolu- 
tion strategies, ant colony optimization, estimation- 
of-distribution algorithms, scatter search, variable- 
neighborhood search, simulated annealing, tabu 
search, greedy randomized adaptive search procedures 
(GRASPs), hybrid metaheuristics, multiobjective 
optimization, and heterogeneous metaheuristics. 

The most recent text book was written by Luque and 
Alba [46.19]. It provides an excellent introduction into 
the field, with hands-on advice on how to present results 
for parallel EAs. Theoretical models of selection pres- 
sure in distributed GAs are presented. A large part of 
the book then reviews selected applications of parallel 
GAs.


47. Learning Classifier Systems


Martin V. Butz 


Learning Classifier Systems (LCSs) essentially 
combine fast approximation techniques with evo- 
lutionary optimization techniques. Despite their 
somewhat misleading name, LCSs are not only sys- 
tems suitable for classification problems, but may 
be rather viewed as a very general, distributed 
optimization technique. Essentially, LCSs have very 
high potential to be applied in any problem do- 
main that is best solved or approximated by means 
of a distributed set of local approximations, or 
predictions. The evolutionary component is de- 
signed to optimize a partitioning of the problem 
domain for generating maximally useful predic- 
tions within each subspace of the partitioning. 
The predictions are generated and adapted by 
the approximation technique. Generally any form 
of spatial partitioning and prediction are pos- 
sible — such as a Gaussian-based partitioning 
combined with linear approximations, yielding 
a Gaussian mixture of linear predictions. In fact, 
such a solution is developed and optimized by 
XCSF (XCS for function approximation). The LCSs XCS 
(X classifier system) and the function approximation 
version XCSF are probably the best-known 
LCS architectures to date. Their optimization 
technique is very well balanced with the 
approximation technique: as long as the approximation 
technique yields reasonably good solutions 
and fast evaluations of these solutions, the evolutionary 
component will pick up on the evaluation 
signal and optimize the partitioning. This chapter 
provides historical background on LCSs. Then XCS 
and XCSF are introduced in detail, providing enough 
information to be able to implement, understand, 
and apply these systems. Further LCS architectures 
are surveyed and their potential for future research 
and for applications is discussed. The conclusions 
provide an outlook on the many possible future 
LCS applications and developments. 


Learning classifier systems (LCSs) are machine learning 
algorithms that combine gradient-based approximation 
with evolutionary optimization. Due to this 
flexibility, LCSs have been successfully applied to 
classification and data mining problems, reinforcement 
learning (RL) problems, regression problems, cognitive 
map learning, and even robot control problems. 


The main feature of LCSs is their innovative combina- 
tion of two learning principles; whereas gradient-based 
approximation adapts local, predictive approximations 
of target function values, evolutionary optimization 
structures individual classifiers to enable the formation 
of effectively distributed and accurate approximations. 
The two learning methods interact bidirectionally in 
that the gradient-based approximations yield local fitness 
quality estimates of the generated approximations, 
which the evolutionary optimization technique uses 
for optimizing classifier structures. Concurrently, the 
evolutionary optimization technique is generating new 
classifier structures, which again need to be evaluated 
by the gradient-based approach in competition with the 
other, locally overlapping, interacting classifiers. 

Due to the innovative combination of two learning 
and optimization techniques, LCSs are often perceived 
as being hard to understand. Facet-wise analyses of 
the individual LCS components and their interactions, 
however, give both mathematical scalability bounds 
for learning and an intuitive understanding of the sys- 
tems in general. Moreover, the currently most common 
LCS, which is the XCS classifier system (note that 
the X in XCS does not really encode any particular 
acronym according to the system creator Wilson), is 
comparatively easy to understand, to tune, and to ap- 
ply. Thus, the core of this chapter focuses on XCS, 
gives a facet-wise overview of its functionality, de- 
tails several enhancements, and highlights various suc- 
cessful application domains. However, XCS is also 
compared with other LCS architectures and LCSs in 
general are compared with other machine learning techniques. 

This chapter starts with a historical perspective, 
providing information on the beginnings of LCSs and 
establishing some terminology background. We then introduce 
the XCS classifier system, providing a detailed 
system overview as well as theoretical and facet-wise 
conceptual insights on its performance. Tricks 
and tweaks for tuning the system to the 
problem at hand are also discussed. Next, the XCS counterpart 
for regression problems, XCSF, is introduced. Focusing then on 
the application side, LCS applications to data mining 
tasks and to behavioral learning and cognitive modeling 
tasks are surveyed. We cover various LCS architectures 
that have been successfully applied in the data mining 
realm. With respect to behavioral learning, we point out 
the relation of LCSs to reinforcement learning. Moreover, 
we cover anticipatory learning classifier systems 
(ALCSs), which learn predictive schema models of 
the environment rather than reward prediction maps, 
and we introduce the modified XCSF version that can 
effectively learn a redundant forward-inverse kinematics 
model of a robot arm. A summary and conclusions 
wrap up the chapter. 


47.1 Background 


Learning classifier systems (LCS) were proposed over 
30 years ago by Holland [47.1-3]. Originally, Holland 
and Reitman actually called LCSs cognitive systems 
[47.4], focusing on problems related to reinforcement 
learning (RL) [47.5,6]. Their cognitive system 
developed a memory of classifiers, where each classifier 
consisted of a condition part (taxon), an action part 
(originally consisting of a message and an effector bit), 
a payoff prediction part, and several other parameters 
that stored the age, the application frequency, and the 
attenuation of the classifier. 

Concurrently with the development of temporal difference 
learning techniques in RL, such as the now 
well-known state-action-reward-state-action (SARSA) 
algorithm [47.6], Holland and Reitman introduced 
the bucket brigade algorithm [47.4, 7], which also distributes 
reward backwards in time with a discounting 
mechanism. In addition, the attenuation parameter in 
a classifier realized something similar to an eligibility 
trace in RL, distributing a currently encountered reward 
also to classifiers that were active several time 
steps ago and that thus indirectly led to gaining the currently 
experienced reward. Meanwhile, Holland’s cognitive 
system applied a genetic algorithm (GA) [47.1, 
8] as its second learning mechanism. The GA modified 
the taxa in Holland and Reitman’s cognitive system. 

In sum, the first actual LCS implementation, i.e., 
the cognitive system by Holland and Reitman [47.4], 
was ahead of its time. It implemented various reward- 
related ideas that were later established in the reinforce- 
ment learning community — and can now partially be 
regarded as standard RL techniques. However, the com- 
bination with GAs yielded a highly interactive and very 
complex system that was and still is hard to analyze. 
Thus, while proposing a highly innovative cognitive 
learning approach, the applicability of the system re- 
mained limited at the time. 


47.1.1 Early Applications 


Nonetheless, early applications of LCSs were published 
in the 1980s. Smith developed a poker decision-making 
system [47.9] based on De Jong’s approach 
to LCSs [47.10]. Booker worked on animal-like 
automation based on the cognitive systems architecture 
[47.11]. Wilson proposed and worked on the animat 
problem with LCS architectures derived from Hol- 
land and Reitman’s cognitive systems approach [47.12, 
13]. Goldberg solved a gas pipeline control task with 
a simplified version of the cognitive system archi- 
tecture [47.8, 14]. Despite these successful early ap- 
plications, a decade passed until a growing research 
community developed that worked on learning classi- 
fier systems. 


47.1.2 The Pitt and Michigan Approach 


Two fundamentally different LCS approaches were pur- 
sued from early on. The Pitt approach was fostered by 
the work of De Jong et al. [47.10, 15, 16]. On the other 
hand, the Michigan approach developed in the further 
years at Michigan under the supervision of John H. 
Holland [47.11, 14, 17, 18]. Diverse perspectives on the 
Michigan approach can be found in [47.19]. 

The essential difference between the two approaches 
is that in the Pitt approach rule sets are evolved 
where each particular rule set constitutes an individual 
for the GA. In contrast, in the Michigan approach one 
set of rules is evolved and each rule is an individual for 
the GA. As a consequence, the Pitt-style LCSs are much 
closer to general GAs because each individual constitutes 
an overall problem solution. In the Michigan-style 
LCSs, on the other hand, each individual only applies 
in a subspace of the overall problem and only the whole 
set of rules that evolves constitutes the overall problem 
solution. Figure 47.1 illustrates this fundamental contrast 
between the two approaches. 

[Figure panels: a) Pitt approach, b) Michigan approach] 
Fig. 47.1a,b While the Pitt approach to LCSs evolves 
a population of sets of rules, in the Michigan approach 
there is only one set of rules (i.e., the population) that is 
evolved 

As a consequence of this contrast, Pitt-style LCSs 
usually apply rather standard GA approaches. The 
whole population of rule sets is evolved. For fitness 
evaluation purposes, each set of rules needs to be 
evaluated in the problem environment addressed. On 
the other hand, Michigan-style LCSs need to con- 
tinuously interact with an environment to sufficiently 
evaluate all the rules in the rule set — essentially ex- 
ploring all the environmental subspaces to make sure all 
rules can develop a sufficiently useful fitness estimate. 
This continuous interaction and the typical interacting 
components of Michigan-style LCSs are illustrated in 
further detail in Fig. 47.2. Due to the continuously de- 
veloping fitness estimates, often a more steady-state, 
niched GA is applied online in Michigan-style LCSs. 
The undertaken updates then depend directly on the cur- 
rent interaction and thus on the current subset of rules 
relevant in the experienced interaction. The steady- 
state, niched GA optimizes the internal knowledge 
base iteratively depending on the incoming learning 
samples. 


Fig. 47.2 LCSs consist of a knowledge base (population of 
classifiers), a genetic algorithm for rule structure evolution, 
and a reinforcement learning component for rule evaluation, 
reward propagation, and decision making. The system 
interacts with its environment or problem iteratively, learning 
online 

In summary, Pitt-style LCSs evaluate and optimize 
their rule sets globally based on sets of problem 
instances. They usually learn offline. Michigan-style 
LCSs evaluate and optimize their set of rules online 
while interacting with the problem, iteratively perceiv- 
ing problem instances. The major qualities of Pitt-style 
LCSs are that they evolve competing global problem 
solutions in the form of sets of rules. Evolutionary 
rule structure optimization is used, typically evolving 
small sets of rules (tens of rules). Michigan-style LCSs, on the 
other hand, are designed to develop one distributed, locally 
optimized problem solution by combining local 
gradient-based approximation techniques with steady-state, 
niched GAs. In consequence, typically larger, 
more distributed sets of rules develop, yielding problem 
solutions with potentially thousands of rules. 


47.1.3 Basic Knowledge Representation 


Because an exemplary knowledge representation was 
already discussed for the early cognitive system imple- 
mentation of [47.4], we now provide a general sketch 
of the knowledge representation typically found in 
Michigan-style LCSs. 

The knowledge representation of an LCS consists 
of a finite population of classifiers (that is, a finite set 
of rules). This population of classifiers essentially rep- 
resents the current knowledge of the LCS about the 
problem the system is applied to. Each rule — or clas- 
sifier — usually consists of a condition and an action 
part, as well as a prediction and a fitness estimate. The 
condition part specifies the problem subspace in which 
the classifier is applicable. When the condition part is 
satisfied given a particular problem instance, a classi- 
fier is said to match that problem instance. The action 
part specifies an action that may be executed, or a clas- 
sification that may be tested. The prediction specifies 
the expected reward, or feedback value, given the spec- 
ified action was executed under the specified contextual 
conditions. The fitness estimates the value of this classi- 
fier relative to other, competing classifiers. In the early 
approaches, fitness was often simply equal to the pre- 
diction value. In the currently established LCSs, fitness 
typically estimates the accuracy of the prediction. 

Michigan-style LCSs usually learn online about 
a problem, iteratively perceiving or actively generating 
problem instances. Given a particular problem instance, 
first, the system forms a match set of those classifiers 
in the population whose conditions match. Next, the 
system decides on an action or classification and ex- 


ecutes it. Classifiers in the match set that specify the 
executed action constitute the current action set. After 
feedback is received, the predictions of the classifiers 
in the action set are adjusted. From the classifier pre- 
diction estimates, a fitness estimate is derived for each 
classifier. Finally, the steady-state GA is applied to the 
match set or the population as a whole. The GA mod- 
ifies classifier structures by reproducing, mutating, and 
recombining well-performing classifiers and by delet- 
ing ill-performing ones. In contrast to the Michigan 
approach, Pitt-style LCSs evaluate their sets of rules 
typically independently of each other in the provided 
problem. The GA exchanges rules and rule-structures 
within and across the sets of rules. 

A Michigan-style LCS consequently is an interac- 
tive, online learning system. It maintains a population 
of classifiers as its knowledge base. It applies a niched, 
steady-state genetic algorithm for gradual rule structure 
evolution; it applies a gradient-based learning com- 
ponent for rule evaluation — yielding prediction and 
fitness estimates. Michigan-style LCSs are often ap- 
plied in RL scenarios in which reward estimates need 
to be propagated and action decisions are made based 
on the learned reward prediction estimates. In this 
case, typically techniques similar to SARSA learning 
or Q-learning are applied. Figure 47.2 shows the basic 
components of a Michigan-style LCS as well as their 
interactions. 

The earliest Michigan-style LCS implementation 
is the introduced cognitive system CS1 [47.4]. After 
various early applications of LCSs, Wilson set a mile- 
stone in LCS research by introducing the zeroth level 
classifier system ZCS [47.20] and the now most promi- 
nent and well-known LCS: the XCS classifier sys- 
tem [47.21]. Both systems were explicitly compared 
to the very well-known Q-learning [47.22] technique 
from the RL community, offering with ZCS and XCS 
two learning classifier systems that can learn Q-value 
functions with a compact highly generalized rule-based 
representation. 

In the following, we first give a precise introduction 
to the XCS classifier system. We then 
introduce the real-valued version for solving regression 
problems with a Gaussian mixture of linear approximations, 
i.e., XCSF. After that, we spotlight 
various current application domains where various 
types of LCSs, including XCS(F), have produced 
highly competitive problem solutions when compared 
to other machine learning techniques and regression 
algorithms. 
47.2 XCS 


Wilson introduced the XCS classifier system in 
1995 [47.21]. The two main novel features of XCS 
in comparison to earlier Michigan-style LCSs are its 
accuracy-based fitness estimation and its niche-based 
application of the evolutionary component. The intro- 
duction of accuracy-based fitness essentially decoupled 
the classifier fitness estimate from the reward pre- 
diction, enforcing that XCS learned complete payoff 
landscapes rather than only estimates for those sub- 
spaces where high reward is encountered. In addition, 
Wilson related XCS directly to Q-Learning [47.21, 
22]. Much later, even a relation to Kalman filtering 
and general regression tasks was made mathemati- 
cally explicit [47.23,24]. The niche-based GA repro- 
duction combined with population-wide deletion en- 
abled a much more focused GA-based optimization 
of classifier structures as well as the generalization 
of classifier structures based on the sampling distri- 
bution [47.25]. In consequence, XCS is an LCS that 
is designed to evolve not only the best solution to 
a problem, but it evolves all alternative solutions with 
associated Q-value estimations and variance estima- 
tions of the respective Q-value estimates. Due to its 
GA design and fitness definition, XCS strives to ap- 
proximate the full Q-table of a problem with a maxi- 
mally accurate and maximally compact classifier-based 
representation. 

Despite its original strong relation to Q-learning and 
RL in general, XCS has also been applied successfully 
to classification problems and regression problems. In 
the former case, XCS identifies locally relevant features 
for the generation of maximally accurate classification 
estimates. In the latter case, XCS optimizes the distri- 
bution and structure of local, typically linear estimators 
for a maximally accurate approximation of the func- 
tion surface. Thus, despite its original strong relation 
to RL, XCS is a much more generally applicable learn- 
ing system that can solve single-step classification or 
regression problems as well as multi-step RL prob- 
lems, which are typically defined as Markov decision 
processes. 


47.2.1 System Overview 


XCS evolves one population of classifiers. Classifier 
structures are optimized by means of a steady-state GA. 
A classifier consists of a condition part C, an action 
part A, a reward prediction r, a reward prediction error $\varepsilon$, 
and a fitness estimate f. While the condition and action 


structures are iteratively optimized by the steady-state 
GA, the estimates are adjusted using the Widrow—Hoff 
delta rule [47.26] based on an approximation of the Q- 
value signal. 

While condition and action parts can be generally 
represented in any way desired [47.25], in this overview 
we focus on binary problems and a ternary representation 
of the condition part. Conventionally, the condition 
part C is coded by $C \in \{0, 1, \#\}^l$, where $l$ is the number 
of bits of a problem instance and the # symbol 
matches zero and one. Condition C essentially specifies 
a hypercube within which the classifier matches and 
can be said to cover a certain volume of the complete 
problem space. Action part $A \in \mathcal{A}$ defines an action or 
classification from a provided finite set of possible actions 
$\mathcal{A}$. Reward prediction $r \in \mathbb{R}$ estimates the moving 
average of the received reward in the recent activations 
of the classifier. Reward prediction error $\varepsilon$ estimates 
the moving average of the absolute error of the reward 
prediction. Finally, fitness $f \in [0, 1]$ estimates the moving 
average of the relative accuracy of the classifier 
compared to the competing classifiers in the activated 
match sets (or action sets). The larger the fitness estimate, 
the larger on average the accuracy of a classifier 
in comparison to all classifiers that encode the same 
action and whose condition parts define overlapping 
subspaces. 

Each classifier also maintains several additional pa- 
rameters. The action set size estimate as estimates the 
moving average of the action sets the classifier was 
part of. It is updated similarly to the reward predic- 
tion r. A time stamp ts specifies the last time the 
classifier was part of a GA competition. An expe- 
rience counter exp specifies the number of applied 
parameter updates. The numerosity num specifies the 
number of (micro-)classifiers that this macro-classifier 
actually represents, mainly to save computation 
time. 
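
The classifier structure described above can be sketched as a small record type. This is only an illustrative sketch for the binary, ternary-coded case: the field names, the default values, and the helper methods `matches` and `volume` are assumptions introduced here and are not part of the original system description.

```python
from dataclasses import dataclass


@dataclass
class Classifier:
    condition: str      # ternary string over {'0', '1', '#'}
    action: int         # one element of the finite action set
    r: float = 0.0      # reward prediction
    eps: float = 0.0    # reward prediction error
    f: float = 0.01     # fitness (initial value is an assumption)
    as_: float = 1.0    # action set size estimate ("as" is a Python keyword)
    ts: int = 0         # time stamp of the last GA competition
    exp: int = 0        # experience counter
    num: int = 1        # numerosity

    def matches(self, s: str) -> bool:
        """True if every non-# symbol of the condition equals the input bit."""
        return all(c == '#' or c == x for c, x in zip(self.condition, s))

    def volume(self) -> float:
        """Fraction of the binary problem space covered by the condition."""
        return 2.0 ** (self.condition.count('#') - len(self.condition))


cl = Classifier("1#0#", action=1)
print(cl.matches("1101"), cl.matches("1111"), cl.volume())
```

A condition with two # symbols out of four positions covers one quarter of the binary input space, which is what `volume` reports.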

Learning usually starts with an empty population. 
The problem faced is sampled iteratively, encountering 
particular problem instances $s \in S$. The set of all 
matching classifiers in the classifier population [P] is 
termed the match set [M]. If some action in $\mathcal{A}$ is not 
represented in [M], a covering mechanism is applied. 
Covering creates classifiers that match s (inserting #-symbols 
in the new C with a probability $P_\#$ at each 
position) and that specify the unrepresented actions. [M] 
essentially contains all the knowledge of XCS about the 
current problem instance. Given [M], XCS estimates the 
payoff for each possible action, forming a prediction array 

$$P(A) = \frac{\sum_{cl \in [M] \wedge cl.A = A} cl.r \cdot cl.f}{\sum_{cl \in [M] \wedge cl.A = A} cl.f} , \qquad (47.1)$$


where classifier parameters are addressed using the dot 
notation. P(A) computes the fitness-averaged Q-value 
estimates for each action in the current state s. Thus, 
P(A) can be used to decide on the currently most 
promising action. 
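
A minimal sketch of covering and of the fitness-weighted prediction array may make these steps concrete. Classifiers are represented here as plain dictionaries; the key names, the initial values used in `cover`, and the function names are assumptions chosen for illustration.

```python
import random


def matches(condition, s):
    """Ternary matching: '#' matches both '0' and '1'."""
    return all(c == '#' or c == x for c, x in zip(condition, s))


def cover(s, action, p_hash, rng):
    """Create a classifier that matches s, generalizing each bit with probability p_hash."""
    condition = "".join('#' if rng.random() < p_hash else bit for bit in s)
    # Initial r and f are assumed placeholder values.
    return {"C": condition, "A": action, "r": 0.0, "f": 0.01}


def prediction_array(population, s, actions):
    """Fitness-weighted average of the reward predictions, per action, over the match set."""
    match_set = [cl for cl in population if matches(cl["C"], s)]
    P = {}
    for a in actions:
        cls = [cl for cl in match_set if cl["A"] == a]
        total_f = sum(cl["f"] for cl in cls)
        if total_f > 0:
            P[a] = sum(cl["r"] * cl["f"] for cl in cls) / total_f
    return P


population = [
    {"C": "1#", "A": 0, "r": 1000.0, "f": 0.9},
    {"C": "##", "A": 0, "r": 500.0, "f": 0.1},
    {"C": "0#", "A": 1, "r": 0.0, "f": 1.0},
]
P = prediction_array(population, "10", actions=[0, 1])
print(P)
```

In this example action 1 has no matching classifier for input "10", so it is absent from `P`; in XCS this is the situation in which covering would be triggered for that action.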

Any action selection policy may be applied, such 
as choosing the action with the largest Q-value ex- 
pectation. Because XCS relies on exploring the com- 
plete problem spaces, however, it is important that 
all actions are applied sufficiently frequently. Alternatively, 
the prediction error estimates may also 
be considered for action selection, choosing, for 
example, the action with the highest fitness-averaged 
$\varepsilon$ value with the aim of maximizing information 
gain (see also more elaborate techniques surveyed 
recently in the computational intelligence literature 
[47.27]). 

After the choice of an action A, an action set 
[A] is formed, which contains all classifiers in [M] 
that specify the chosen action. Moreover, the chosen 
action is executed, feedback is received in the 
form of scalar reward $R \in \mathbb{R}$, and the next problem 
instance may be perceived. In conjunction with 
the maximum P(A) derived from the resulting match 
set, the [A] formed is updated according to the estimated 
Q-value signal, which is $R + \gamma \max_{A \in \mathcal{A}} P(A)$. 
Moreover, the steady-state GA may be applied, reproducing 
two classifiers in [A], but choosing classifiers 
from [P] for deletion. In classification problems, often 
also termed single-step problems, the Q-learning 
update only considers the immediate reward R. Figure 47.3 
illustrates the iterative learning process applied 
in XCS. 


Rule Evaluation 
To evaluate the classifiers, it is crucial to update 
their parameter estimates and derive a relative fit- 
ness estimate. Parameter updates are applied itera- 
tively in respective action sets. Usually, the predic- 
tion error is updated before the prediction and the 
fitness. Other parameters may be updated in any 
order. 

In particular, the reward prediction error $\varepsilon$ of each 
classifier in [A] is updated by 

$$\varepsilon \leftarrow \varepsilon + \beta \left( |\rho - r| - \varepsilon \right) , \qquad (47.2)$$

where $\rho = R$ in classification problems and 

$$\rho = R + \gamma \max_{A \in \mathcal{A}} P(A)$$

in multi-step reinforcement learning problems. Parameter 
$\beta \in [0, 1]$ specifies a learning rate, which is typically 
set to values between 0.05 and 0.2. The higher the value 
of $\beta$ is, the more the $\varepsilon$ value depends on the most recent 
problem interactions. Next, the reward prediction r of 
each classifier in [A] is updated by 

$$r \leftarrow r + \beta (\rho - r) . \qquad (47.3)$$


Note that XCS essentially applies Q-learning updates, 
where Q-values are not approximated by a tabular entry 
but by a collection of rules expressed in the prediction 
array P(A) [47.21]. 
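
The two delta-rule updates can be sketched as follows, assuming that the error is measured as the absolute difference between the target signal and the classifier's own reward prediction, and that the error is updated before the prediction. The function name, the dictionary keys, and the default discount factor are illustrative assumptions.

```python
def update_prediction(cl, R, beta=0.2, gamma=0.71, max_next_P=None):
    """Widrow-Hoff style update of prediction error and reward prediction.

    max_next_P is None in single-step (classification) problems; in
    multi-step problems it is the maximum prediction array value of the
    next match set. The default gamma is an assumed example value.
    """
    rho = R if max_next_P is None else R + gamma * max_next_P
    # The error is updated first, using the not yet updated prediction.
    cl["eps"] += beta * (abs(rho - cl["r"]) - cl["eps"])
    cl["r"] += beta * (rho - cl["r"])
    return cl


cl = update_prediction({"r": 0.0, "eps": 0.0}, R=1000.0)
print(cl)
```

With a learning rate of 0.2, a single reward of 1000 moves both the prediction and the error estimate one fifth of the way towards their targets.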



Fig. 47.3 The XCS classifier system 
learns iteratively online. With each 
iteration it forms a match set given 
the current problem instance. Next, 
it chooses an action or classification 
and applies it. After the perception of 
feedback, the classifiers in the corre- 
sponding action set [A] are updated 
and the steady-state GA is applied. 
After that, the next problem iteration 
proceeds 
To update the fitness estimate of each classifier 
in [A], a current scaled relative accuracy $\kappa'$ is determined: 

$$\kappa = \begin{cases} 1 & \text{if } \varepsilon < \varepsilon_0 \\ \alpha \left( \frac{\varepsilon}{\varepsilon_0} \right)^{-\nu} & \text{otherwise} \end{cases} \qquad (47.4)$$

$$\kappa' = \frac{\kappa \cdot num}{\sum_{cl \in [A]} cl.\kappa \cdot cl.num} . \qquad (47.5)$$

$\kappa$ essentially measures the current inverse error of 
a classifier. $\varepsilon_0$ specifies the targeted error below which 
a classifier is considered maximally accurate. $\kappa'$ then 
determines the current relative accuracy with respect 
to all other classifiers in the current action set [A]. 
Thus, each classifier in [A] competes for a limited fitness 
resource, which is distributed relative to the current 
accuracy estimates $\kappa$. Finally, the fitness estimate f is 
updated given the current $\kappa'$ by 

$$f \leftarrow f + \beta (\kappa' - f) . \qquad (47.6)$$

In effect, fitness reflects the moving-average, set-relative 
accuracy of a classifier. As before, $\beta$ controls 
the sensitivity of the fitness estimates to changes in the 
population. 
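
The fitness computation can be sketched in a few lines, assuming the power-law accuracy function commonly used for XCS (accuracy 1 below the error threshold, otherwise a scaled negative power of the relative error). The default values of `eps0`, `alpha`, and `nu`, as well as the dictionary keys, are assumptions for illustration only.

```python
def accuracy(eps, eps0=10.0, alpha=0.1, nu=5.0):
    """Absolute accuracy: 1 below the error threshold, power-law decay above it."""
    return 1.0 if eps < eps0 else alpha * (eps / eps0) ** (-nu)


def update_fitness(action_set, beta=0.2, eps0=10.0, alpha=0.1, nu=5.0):
    """Move each fitness towards the numerosity-weighted relative accuracy."""
    kappas = [accuracy(cl["eps"], eps0, alpha, nu) for cl in action_set]
    total = sum(k * cl["num"] for k, cl in zip(kappas, action_set))
    for k, cl in zip(kappas, action_set):
        cl["f"] += beta * (k * cl["num"] / total - cl["f"])


action_set = [
    {"eps": 5.0, "num": 1, "f": 0.0},   # accurate classifier
    {"eps": 20.0, "num": 1, "f": 0.0},  # inaccurate classifier
]
update_fitness(action_set)
print([cl["f"] for cl in action_set])
```

Because the relative accuracies in one action set sum to one, the total fitness gained per update is fixed, and the accurate classifier receives almost all of it.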

The action set size estimate as is updated similarly 
to the reward prediction r but with respect to the current 
action set size |[A]| 

$$as \leftarrow as + \beta \left( |[A]| - as \right) , \qquad (47.7)$$

resulting in an action set size adaptation to changes in |[A]| 
in an order similar to the fitness changes. Parameters r, 
$\varepsilon$, and as are updated using the moyenne adaptive modifiée 
technique [47.28]. This technique sets parameter 
values directly to the average of the cases encountered so far 
until the resulting update is smaller than $\beta$ (which 
is the case after $1/\beta$ updates). Finally, the experience 
counter exp is increased by one. If the GA is applied, 
the time stamps ts of all classifiers in [A] are set to the 
current iteration time t. 
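
The moyenne adaptive modifiée idea can be written as a single learning-rate rule: use the running-average rate 1/exp while it is larger than the learning rate, and the fixed learning rate afterwards. The sketch below assumes that `exp` counts the updates including the current one; the function name is hypothetical.

```python
def mam_update(value, target, exp, beta=0.2):
    """Running average for the first 1/beta updates, delta rule afterwards."""
    rate = max(beta, 1.0 / exp)
    return value + rate * (target - value)


r = 0.0
for exp, reward in enumerate([10.0, 20.0, 30.0], start=1):
    r = mam_update(r, reward, exp)
print(r)
```

After the three updates shown, `r` equals the plain average of the three rewards, since 1/exp is still larger than the learning rate of 0.2.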


Rule Evolution 
XCS applies a steady-state genetic algorithm (GA) for 
rule evolution. Given a current action set [A], the GA 
is invoked if the average time since the last GA application 
(stored in parameter ts) in [A] is larger than 
threshold $\theta_{GA}$. This mechanism is applied to ensure sufficient 
evaluation of classifiers, as well as to control 
unbalanced sampling. The higher the threshold $\theta_{GA}$ is, 
the slower evolution proceeds, but also the less prone 
XCS is to unbalanced problem sampling [47.29]. 

The steady-state GA first selects two parental classifiers 
for reproduction in [A]. While this selection 
process was done by proportionate selection based on 
fitness in the original XCS, more recently it was shown 
that tournament selection can significantly improve the 
robustness of the system [47.30]. Tournament selection 
in XCS chooses the classifier with the highest 
fitness from a tournament of randomly chosen classifiers 
from [A]. The tournament size is usually set 
relative to the current action set size |[A]| to $\tau \cdot |[A]|$. Two 
classifiers are selected in two independent tournaments. 
The selected classifiers are reproduced, generating the 
offspring. Crossover and mutation are applied to the 
offspring. The parents stay in the population. Mutation 
usually changes each condition and action symbol 
randomly with a certain probability $\mu$. Crossover exchanges 
condition and action symbols. Often, simple 
uniform crossover is applied (exchanging each symbol 
with a probability of 0.5). However, more sophisticated 
estimation of distribution algorithms (EDAs) have also 
been applied for more effective building block processing 
[47.31]. 
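
The three genetic operators can be sketched for ternary condition strings as follows. This is a simplified illustration: action symbols are left untouched, mutation always switches to a different symbol, and all function names, dictionary keys, and parameter values are assumptions.

```python
import random


def tournament_select(action_set, tau, rng):
    """Pick the fittest classifier among a random fraction tau of the action set."""
    size = min(len(action_set), max(1, round(tau * len(action_set))))
    contestants = rng.sample(action_set, size)
    return max(contestants, key=lambda cl: cl["f"])


def uniform_crossover(c1, c2, rng):
    """Swap the symbols at each position with probability 0.5."""
    a, b = list(c1), list(c2)
    for i in range(len(a)):
        if rng.random() < 0.5:
            a[i], b[i] = b[i], a[i]
    return "".join(a), "".join(b)


def mutate(condition, mu, rng):
    """Change each symbol to one of the two other symbols with probability mu."""
    out = []
    for c in condition:
        if rng.random() < mu:
            c = rng.choice([x for x in "01#" if x != c])
        out.append(c)
    return "".join(out)


rng = random.Random(42)
action_set = [{"C": "1#0#", "f": 0.2}, {"C": "1100", "f": 0.7}, {"C": "##0#", "f": 0.4}]
parent1 = tournament_select(action_set, tau=0.4, rng=rng)
parent2 = tournament_select(action_set, tau=0.4, rng=rng)
child1, child2 = uniform_crossover(parent1["C"], parent2["C"], rng)
child1 = mutate(child1, mu=0.25, rng=rng)
print(child1, child2)
```

In a full system the mutated offspring of a covering-based XCS would additionally be constrained to match the current problem instance; that restriction is omitted here for brevity.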

The offspring parameters are initialized by setting 
prediction r, $\varepsilon$, f, and as to the parental values. Fitness 
f is often decreased to 10% of the parental fitness. Experience 
counter exp and numerosity num are set to one. 

The resulting offspring classifiers are finally added 
to the population. In this case, GA subsumption may be 
applied [47.32] to stress generalization. GA subsump- 
tion searches for another classifier in [A] that may sub- 
sume an offspring classifier. This classifier must have 
a more general condition than the offspring classifier, 
its error estimate must be below $\varepsilon_0$, and its experience 
counter must be sufficiently high ($exp > \theta_{sub}$). If such 
a classifier is found, the offspring is subsumed, increas- 
ing the numerosity of the more general classifier by one 
and discarding the offspring. 
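
The subsumption test can be sketched as three checks: same action, sufficient accuracy and experience of the candidate subsumer, and a strictly more general condition. Function names, dictionary keys, and the default threshold values below are illustrative assumptions.

```python
def is_more_general(general, specific):
    """True if 'general' matches everything 'specific' matches, and strictly more."""
    if general.count('#') <= specific.count('#'):
        return False
    return all(g == '#' or g == s for g, s in zip(general, specific))


def ga_subsume(offspring, action_set, eps0=10.0, theta_sub=20):
    """Try to absorb the offspring into an accurate, experienced, more general classifier."""
    for cl in action_set:
        if (cl["A"] == offspring["A"]
                and cl["eps"] < eps0
                and cl["exp"] > theta_sub
                and is_more_general(cl["C"], offspring["C"])):
            cl["num"] += 1   # the subsumer gains one micro-classifier
            return True      # the offspring is discarded by the caller
    return False


action_set = [{"C": "1#0#", "A": 1, "eps": 2.0, "exp": 50, "num": 3}]
subsumed = ga_subsume({"C": "1#01", "A": 1}, action_set)
print(subsumed, action_set[0]["num"])
```

If `ga_subsume` returns False, the offspring would simply be inserted into the population as a new classifier.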

The population of classifiers [P] has a finite maximum 
size N. When this size is exceeded after offspring 
insertion, classifiers are deleted from [P]. Fitness proportionate 
selection is applied depending on the action 
set size estimates as. Note that tournament selection is 
not suitable in this case because a balance in the action 
set sizes is most desirable. The likelihood of deletion 
of a classifier is further increased by a factor $\bar{f}/f$ if this 
classifier is experienced ($exp > \theta_{del}$) and additionally if 
its fitness f is below a fraction $\delta$ of the average fitness $\bar{f}$ 
in the population. 
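
A possible reading of this deletion scheme is a roulette wheel over per-classifier deletion votes. The sketch below assumes that votes are scaled by numerosity and that fitness is compared per micro-classifier, as in common XCS implementations; these details, the function names, and the default values of the thresholds are assumptions not stated in the text above.

```python
import random


def deletion_vote(cl, mean_f, theta_del=20, delta=0.1):
    """Vote grows with the action set size estimate and, for experienced
    low-fitness classifiers, by the ratio of mean fitness to own fitness."""
    vote = cl["as"] * cl["num"]
    f = cl["f"] / cl["num"]          # fitness per micro-classifier (assumption)
    if cl["exp"] > theta_del and f < delta * mean_f:
        vote *= mean_f / max(f, 1e-12)   # guard against zero fitness
    return vote


def select_for_deletion(population, rng, theta_del=20, delta=0.1):
    """Roulette-wheel selection proportional to the deletion votes."""
    total_num = sum(cl["num"] for cl in population)
    mean_f = sum(cl["f"] for cl in population) / total_num
    votes = [deletion_vote(cl, mean_f, theta_del, delta) for cl in population]
    return rng.choices(population, weights=votes, k=1)[0]


weak = {"as": 10.0, "num": 1, "f": 0.01, "exp": 50}
strong = {"as": 10.0, "num": 1, "f": 0.99, "exp": 50}
victim = select_for_deletion([weak, strong], random.Random(0))
print(deletion_vote(weak, mean_f=0.5), deletion_vote(strong, mean_f=0.5))
```

With equal action set size estimates, the experienced low-fitness classifier receives a vote that is fifty times larger, so it is far more likely to be removed.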




47.2.2 When and How XCS Works 


From the description above it may seem hard to un- 
derstand why XCS learns successfully. This section 
provides intuition about when and how XCS works and 
points to relevant literature that quantifies the sketched- 
out intuition. 

The two interacting learning components, which are 
gradient-based rule evaluation and evolutionary-based 
rule evolution, are strongly interactive. From an evo- 
lutionary point of view, several evolutionary pressures 
yield particular learning biases. Since reproduction is 
designed to maximize fitness, XCS strives to develop 
maximally accurate classifiers applying a fitness pres- 
sure [47.33]. Meanwhile, rules are selected in [A] for 
reproduction but they are selected in [P] for deletion. 
Since the classifier conditions in [A] will on average 
cover a larger subspace, i.e., they have a larger vol- 
ume than the average condition volumes of classifiers 
in [P], more general classifiers will be reproduced on 
average (when ignoring the fitness pressure for the mo- 
ment), yielding a sampling-dependent generalization 
pressure [47.33]. In consequence, it has been put for- 
ward that XCS strives to evolve a complete problem 
solution that is represented by maximally general classifiers 
that are at the same time maximally accurate (error 
below the threshold $\varepsilon_0$). The resulting problem solution 
representation was previously termed the optimal solu- 
tion representation [O] [47.34]. 

While these evolutionary pressures generally de- 
scribe how the GA in XCS works, successful rule 
evolution still relies on sufficiently accurate fitness sig- 
nals. Thus, rule evaluation needs to have enough time to 
estimate rule fitness before expected rule deletion. This 
leads to a covering bound, which quantifies the need 
for a sufficiently large population size given a particu- 
lar initial condition volume. Moreover, each particular 
problem can be assumed to have a certain complexity 
in terms of subspace sizes that need to be separated for 
learning to take place, that is, for decreasing the error 
below the average deviation of the payoff signal to per- 
ceive an initial fitness signal towards higher accuracy. In 
consequence, the subspace size requires the generation 
of classifiers with condition volumes of maximally that 
size, consequently yielding a schema bound on the pop- 
ulation size to be able to cover the full problem space 
with such condition volumes. Finally, better classifiers 
with a certain condition volume need to be able to grow, 
that is, have reproductive opportunities before deletion 
can be expected, consequently yielding a reproductive 
opportunity bound. 


Together these bounds give estimates on the neces- 
sary initial condition volumes and the resulting max- 
imal population size necessary to cover a problem 
space. For example, given the need for an initial 
classifier volume of 0.01 of the encountered prob- 
lem space, the population size N should be set to 
about 10/0.01 = 1000 to assure proper rule evolu- 
tion. Given that these factors are satisfied, better 
classifiers are assured to be identified and to grow 
in the population with high probability. For binary 
and for real-valued problem domains, these consid- 
erations have been quantified, showing that XCS is 
an approximate polynomial-time learning algorithm in 
problem domains with bounded complexity [47.25, 
35]. 
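The worked example above can be written down as a one-line calculation. The following is only a minimal sketch: the function name `population_size_estimate` is hypothetical, and the factor of 10 is taken from the example (10/0.01 = 1000) rather than from the cited bound derivations.

```python
def population_size_estimate(initial_volume: float, factor: float = 10.0) -> int:
    """Rough population size N for a given initial condition volume.

    initial_volume is the fraction of the problem space (0 < v <= 1) that a
    newly created classifier condition is expected to cover. The factor of 10
    mirrors the worked example in the text; it is an assumption, not a
    derived constant.
    """
    if not 0.0 < initial_volume <= 1.0:
        raise ValueError("initial_volume must lie in (0, 1]")
    return round(factor / initial_volume)


print(population_size_estimate(0.01))
```

Smaller initial volumes call for proportionally larger populations, which is the qualitative message of the covering and reproductive opportunity bounds.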

The considerations above ensure the theoretic 
growth of better classifiers. However, the evolutionary 
component may still destroy relevant classifier struc- 
tures due to mutation and crossover. Thus, neither 
mutation nor crossover may be overly disruptive. In ex- 
treme cases, where highly unstructured subspaces may 
need to be identified and recombined, estimation of 
distribution algorithms can help to identify these sub- 
spaces [47.31, 36]. In most cases, though, a sufficiently 
low mutation rate and uniform crossover suffice to learn 
successfully. However, clearly mutation is mandatory 
to detect more accurate classifier structures over time. 
Thus, a good compromise is necessary to ensure that 
offspring is usually mutated but its structure is not 
fully destroyed. In the binary domain, for example,
the mutation probability is consequently often set to
1/l, where l is the number of bits of a problem instance.
This is a typical choice for the mutation strength
used in genetic algorithms — essentially setting the ex- 
pected number of attributes that will be mutated to 
one. 
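To illustrate the 1/l rule, the following sketch mutates a condition attribute by attribute. It assumes a ternary condition alphabet {0, 1, #} with # as the don't-care symbol and a "free" mutation that switches to one of the two other symbols; the function name and these details are illustrative assumptions, not a description of a specific XCS implementation.

```python
import random

ALPHABET = "01#"  # assumed ternary alphabet; '#' is the don't-care symbol


def mutate_condition(condition: str, rng: random.Random) -> str:
    """Mutate each attribute independently with probability 1/l."""
    l = len(condition)
    mu = 1.0 / l  # expected number of mutated attributes: l * mu = 1
    result = []
    for symbol in condition:
        if rng.random() < mu:
            # Replace the symbol by one of the two other symbols.
            choices = [s for s in ALPHABET if s != symbol]
            result.append(rng.choice(choices))
        else:
            result.append(symbol)
    return "".join(result)


rng = random.Random(0)
parent = "01#10#01"
offspring = [mutate_condition(parent, rng) for _ in range(10000)]
changed = sum(sum(a != b for a, b in zip(parent, child)) for child in offspring)
print(changed / len(offspring))  # average number of mutated attributes, close to 1
```

With this setting most offspring differ from the parent in only a few attributes, so the inherited structure is usually preserved while new structures are still explored.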


47.2.3 When and How to Apply XCS 


From the reflections above it becomes clear that XCS 
is designed to learn the target function of a problem by 
a population of locally accurate predictors, that is, clas- 
sifiers. This target function may be the Q-value function 
in RL problems, a correctness function in classification 
problems, or also any other type of function. XCS is 
best suited to be applied in problem domains that can 
be partitioned into subspaces within which simple pre- 
dictions yield accurate values. Moreover, XCS is even 
better suited to be applied to problems where regular- 
ities in the target function can be well-represented in 
classifier conditions, that is, subspaces in which the 
function values are approximately equal should be compactly
representable with few classifiers. Overall, XCS
thus strives to develop distributed problem solutions in 
the form of a set of locally partially overlapping classi- 
fier structures, which cover the whole sampled problem 
space in a generalized way. 

As long as a condition representation can be cho- 
sen that identifies expectable regularities in a data set 
or also in a reinforcement learning problem well, XCS 
is a good candidate to optimize these local condition 
structures iteratively online. However, XCS was also
applied successfully in offline, data mining-based
classification problems, and it was shown that the
generalization and accuracy performance that XCS yields is
comparable to that of other state-of-the-art machine learning
algorithms [47.25, 37], such as decision tree learners, 
instance-based classifiers, or support vector machines. 
Thus, XCS may be applied to multi-step Q-learning 
problems but also to single-step classification problems 
and general regression problems. Online generalization 
and optimal condition structuring for accurate predic- 
tions are the major features of XCS. From a regres- 
sion perspective, XCS is a non-parametric regression 
algorithm that strives to minimize the expected abso- 
lute function approximation error, or also the expected 
squared function approximation error as put forward 
elsewhere [47.24]. 

The two components, (a) gradient-based rule pre- 
diction approximation and evaluation and (b) evolution- 
ary rule structure evolution, are the key to successful 
XCS applications. With respect to rule structure evolu- 
tion, also the XCS system strongly depends on distance 
representations, which can be compared with general 
kernel representations as used in support vector ma- 
chines and elsewhere [47.38, 39]. As long as the repre- 
sented kernel-based condition structures can be mean- 
ingfully modified by genetic operators, evolution and 
thus also XCS can be applied. Meanwhile, also sensible 
value predictions need to be generated. Gradient-based
methods work best to approximate these predictions.
Whether the prediction is a single value, is computed
linearly or polynomially from input, or is structured
otherwise depends on the problem at hand and the
gradient-based approximation approach available. The
more the prediction structure fits with the regularities in 
the target function, the faster and more robust learning 
can be expected. While such structural considerations 
can improve system performance, the successful appli- 
cations of XCS to various problem domains show that 
successful learning is usually not precluded by subopti- 
mal structural choices. 


47.2.4 Parameter Tuning in XCS 


While XCS does, indeed, specify many parameters, 
only few parameters are really crucial. All other param- 
eter values can typically be set to standard values. Here 
we discuss some rules of thumb for tuning the critical 
parameter settings and also provide standard settings. 
While the following recommendations have not been 
published elsewhere so far, they can be derived from 
observations and other recommendations found in the 
literature [47.25, 35, 40]. 

The two most important parameters are the maximal 
population size N and the strived-for error threshold ε_0.
The larger the population size N is, the more capacity
XCS has for learning and thus the more complex problems
XCS can learn. On the other hand, the larger N
is, the slower XCS learns, because it reproduces and
deletes only two classifiers in a typical learning iteration.
Parameter ε_0 specifies the targeted approximation
error. In continuous function approximation problems,
smaller ε_0 values demand finer problem space partitionings
and thus larger population sizes to cover the whole
problem space and to enable reproductive opportunities
(see above). Moreover, ε_0 can partially determine the
fitness signal available to XCS: if ε_0 is chosen very
small, (47.4) will yield values very close to zero for
all highly inaccurate classifiers. Thus, overly small ε_0
values should be avoided. In noisy problems, ε_0 should
thus also not be chosen much smaller than the standard 
deviation of the noise expected in the function value 
signal. 

Without much knowledge of a problem, one may 
start with a rather small population size N, say 1000,
and evaluate learning progress in this setting with a
desired ε_0. If the generated approximation error over time
does not decrease, then ε_0 should be set to about 1/10
of the encountered error. Next, the population size N
should be progressively increased, for example, to N =
5000 or more. If still no error decrease is observed,
further analysis is necessary. If the population is filled with
classifiers but the match set sizes are very small (below 5),
better classifiers probably do not receive enough
reproductive opportunities. In this case, first the initial
condition volume should be increased; for example, in
the binary domain the probability P_# would need to be
increased (up to close to 1). If the match set still
decreases to sizes below 5, the problem is rather hard,
requiring a further population size increase. On the 
other hand, if the match set sizes are very large (above 
100), then over-generalization takes place and XCS ap- 
parently does not pick up the fitness signal. In this case, 
the initial condition volume should be decreased. If this
does not help, then the GA application rate should be
decreased to enforce a more accurate classifier evaluation
before evolution applies. This can be accomplished
by increasing the threshold θ_GA to say 100, 500, or
even higher. An increase in θ_GA can also be crucial in
problems where the problem domain is sampled highly
unevenly, as is studied in detail elsewhere [47.29].
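The diagnostic steps above can be condensed into a small decision helper. This is a sketch under stated assumptions: the function name `tuning_hint` and the returned messages are hypothetical, and only the match set size thresholds (below 5, above 100) come from the text.

```python
def tuning_hint(avg_match_set_size: float, error_decreasing: bool) -> str:
    """Return a coarse tuning suggestion following the rules of thumb above."""
    if error_decreasing:
        return "keep current settings"
    if avg_match_set_size < 5:
        # Too few matching classifiers: not enough reproductive opportunities.
        return "increase initial condition volume (e.g. P_#), then N"
    if avg_match_set_size > 100:
        # Over-generalization: the fitness signal is not picked up.
        return "decrease initial condition volume, then raise theta_GA"
    return "increase N or revisit epsilon_0"


print(tuning_hint(3.0, error_decreasing=False))
print(tuning_hint(250.0, error_decreasing=False))
```

Such a helper is only a reminder of the order in which the parameters are usually revisited; the actual decision still requires inspecting the learning curves.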
Several other parameter settings may be checked
as well; the mutation rate should not be set overly
high. As stated above, in the binary problem domain,
for example, a mutation rate of μ = 1/l, where l denotes
the condition size, is a good rule of thumb.
Crossover can mostly be applied without restrictions
(χ = 1.0), especially when tournament selection for
reproduction is chosen because in this case disruption
is often prevented by choosing two equal classifiers.
Other parameters can be safely set to somewhat standardized
values. A typical initial parameter setting for
XCS is: N = 1000, ε_0 = 0.1, μ = 1/l, χ = 1, α = 1,
β = 0.2, ν = 5, θ_GA = 25, γ = 0.9, θ_del = 20, δ = 0.1,
θ_sub = 20, P_# = 0.5, and τ = 0.4.


The XCS classifier system for real-valued inputs was
introduced by Wilson in 1999, introducing Michigan-style
LCSs to the real-valued problem domain [47.41,
42]. It was further enhanced to approximate continuous
real-valued function surfaces in 2001/2002 [47.43],
yielding an iterative online learning non-parametric
regression system. XCS for function approximation
(XCSF) essentially enhances and modifies XCS by
changing its classifier condition structure to accept
real-valued input. Moreover, the prediction part no longer
predicts single values, but it computes its prediction
from the input using linear approximation techniques,
such as recursive least squares (RLS) [47.44]. Finally,
the action part of the system is removed, applying the
parts of the algorithm that were previously applied to
[A] to the match set [M] in XCSF. Figure 47.4 illustrates
the iterative learning process in XCSF.

Fig. 47.4 The XCSF classifier system learns linear value predictions
and usually does not specify actions. The feedback is the
actual function value, which is used to update the linear approximators
of the matching classifiers. The consequent error and fitness
estimation updates are then considered in the evolutionary component
for further optimization of the condition structures
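As an illustration of the recursive least squares (RLS) technique mentioned above, the following sketch implements a textbook RLS update for a single linear predictor. It is not taken from an XCSF implementation: the class name `RLSPredictor`, the initial gain `delta`, and the absence of a forgetting factor are assumptions made for this example.

```python
class RLSPredictor:
    """Linear predictor y ~ w . [1, x] trained by recursive least squares."""

    def __init__(self, n_inputs: int, delta: float = 1000.0):
        n = n_inputs + 1  # one extra weight for the constant offset
        self.w = [0.0] * n
        # Gain matrix, initialised to delta * identity (delta is an assumed value).
        self.P = [[delta if i == j else 0.0 for j in range(n)] for i in range(n)]

    def predict(self, x):
        z = [1.0] + list(x)
        return sum(wi * zi for wi, zi in zip(self.w, z))

    def update(self, x, y):
        z = [1.0] + list(x)
        n = len(z)
        Pz = [sum(self.P[i][j] * z[j] for j in range(n)) for i in range(n)]
        denom = 1.0 + sum(z[i] * Pz[i] for i in range(n))
        k = [v / denom for v in Pz]      # gain vector
        err = y - self.predict(x)        # prediction error before the update
        self.w = [wi + ki * err for wi, ki in zip(self.w, k)]
        # P <- P - k (z^T P); P stays symmetric, so z^T P equals (P z)^T.
        self.P = [[self.P[i][j] - k[i] * Pz[j] for j in range(n)] for i in range(n)]


model = RLSPredictor(n_inputs=1)
for i in range(20):
    x = i / 19.0
    model.update([x], 2.0 + 3.0 * x)
print(model.w)  # weights approach [2.0, 3.0] for this noise-free line
```

In a system of the kind described here, each matching classifier would hold one such predictor and update it with the observed function value.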

XCSF is thus a regression system that solves func- 
tion approximation problems by developing partially 
overlapping locally weighted projections in the form of 
a population of classifiers. In this form, XCS develops 
problem solutions that are similar to those developed 
by the locally-weighted projection regression algorithm 
(LWPR), which is rather well-known in the robotics 
community [47.45]. A comparative study has shown 
that XCSF can outperform LWPR in various problem 
domains [47.46], often yielding better problem space 
partitionings, as well as more accurate function value 
approximations with a comparable number of individ- 
ual locally linear approximators (i. e., classifiers). In 
XCSF, each classifier specifies in its condition the sub- 
space within which it is applicable. Thus, the condition 
may be compared with a receptive field determining the 
neural activity of the classifier. Moreover, each classi- 
fier specifies a linear approximator weighted within its 
subspace. In effect, the function approximation prob- 
lem is approximated by locally-weighted, overlapping 
linear approximations. While typically the weighting is
fitness-dependent, a weighting based on the distance
to the center of the classifier condition can also be
applied.

Fig. 47.5a,b Screenshots of the XCSF program learning to approximate the crossed ridge function. Current performance
values are plotted on the top left. The current approximation surface is shown on the bottom left. On the right-hand
side the classifier condition structures are plotted. For visualization purposes, the receptive field sizes are plotted
smaller than their actual size. Darker classifier conditions have higher fitness values
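The fitness-dependent weighting of overlapping local linear approximations described above can be sketched as follows. This is an illustrative example only: it assumes one-dimensional interval conditions and represents classifiers as plain dictionaries with hypothetical keys (`lower`, `upper`, `offset`, `slope`, `fitness`).

```python
def mixed_prediction(x, classifiers):
    """Fitness-weighted average of the linear predictions of all matching classifiers."""
    matching = [cl for cl in classifiers if cl["lower"] <= x <= cl["upper"]]
    if not matching:
        return None  # no classifier covers x
    total_fitness = sum(cl["fitness"] for cl in matching)
    weighted = sum(cl["fitness"] * (cl["offset"] + cl["slope"] * x) for cl in matching)
    return weighted / total_fitness


population = [
    {"lower": 0.0, "upper": 0.6, "offset": 0.0, "slope": 1.0, "fitness": 3.0},
    {"lower": 0.4, "upper": 1.0, "offset": 1.0, "slope": 0.0, "fitness": 1.0},
]
print(mixed_prediction(0.2, population))  # only the first classifier matches
print(mixed_prediction(0.5, population))  # both match; the fitter one dominates
```

A distance-based weighting would replace the `fitness` factor by a function of the distance between `x` and the center of each condition.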

With this structure it has been shown that XCSF is 
very well suited for developing any type of kernel struc- 
ture [47.47]. In effect, various condition structures have 
been applied, including rectangular structures with and 
without rotation and with various forms of represen- 
tation [47.35, 48, 49]. Moreover, the linear approxima- 
tions may be enhanced to polynomial approximations 
and others [47.50]. Finally, it is also possible to cluster 
a contextual space with conditions, while approximat- 
ing linear (or other) predictions given totally different 
inputs. For example, the velocity kinematics of an arm 
can be predicted locally dependent on the angular arm 
constellation for redundancy resolution [47.51] (see fur- 
ther details below). Thus, XCSF is a highly flexible 
system with which other modifications in the condi- 
tion and prediction parts of the classifiers may still yield 
highly vital system applications. 

As an example, we applied XCSF to the crossed
ridge function, a function that has been used as
a benchmark in the neural computation and machine
learning community for many years [47.45, 52]. The
function contains a mix of linear and non-linear subspaces.
It is specified in two dimensions as follows

f(x_1, x_2) = max{ exp(-10 x_1^2), exp(-50 x_2^2), 1.25 exp(-5 (x_1^2 + x_2^2)) } .   (47.8)

We ran XCSF with a maximum population size N =
4000 and a target error ε_0 = 0.005 on this function,
applying a condensation mechanism late in the run.
Figure 47.5 shows that XCSF is able to yield a good
function approximation very early in the run. The
evolving classifier structures learn to suitably partition
the problem space into local subspaces. In consequence,
a smooth overall approximation surface is generated.
Note how the inverse exponential hill in the center is
approximated with nearly circular receptive fields, while
the fields are selectively elongated in the x_1 or x_2
dimension due to the non-linearities caused by the ridges
extending to the four sides. Towards the corners of the
input space, the function flattens out so that the receptive
fields become increasingly wider.


47.4 Data Mining


Data mining is a rather large field of research that generally
addresses the challenge of extracting knowledge
from data. In the LCS realm, the addressed data usually
consists of a set of data instances, where each instance
specifies a set of features and a corresponding class the
data instance belongs to. LCSs then typically learn to
mine the data by predicting the class likelihoods of unseen
data instances, as well as by identifying the most
relevant features and feature interactions for classification.
Particularly Pitt-style LCSs have proven to be
highly valuable in data mining applications. However,
the XCS classifier system was also successfully applied
in this domain.

The XCS system was also converted to an offline
learning system; the sUpervised classifier system
(UCS) algorithm [47.53] determines classifier predictions
and resulting fitness values in a supervised manner.
Meanwhile, the other learning aspects of UCS were
derived from XCS. Both XCS and UCS have shown
competitive, if not superior, prediction accuracies in
various data mining tasks, most of them taken from
the UCI machine learning database repository [47.54].
When always applying the same standard setting and
comparing with various other decision making algorithms,
such as support-vector machines, decision tree
learning, naive Bayes classifiers, and others imple- 
mented in the WEKA machine learning tool [47.55], 
XCS outperformed these competing techniques in
many cases — often depending on the problem at 
hand [47.25]. A similar performance was achieved with 
UCS, outperforming XCS in some cases due to its more 
accurate classifier prediction estimates. XCS was also 
further enhanced to be able to deal with highly unbal- 
anced datasets in data mining domains by automatically 
adjusting the threshold that controls the frequency of 
GA applications θ_GA [47.29].

Pitt-style LCSs have been evaluated and applied 
to data mining problems even more extensively. The 
typical offline-learning scenario faced in data min- 
ing particularly suits the Pitt approach. However, 
also the fact that often very compact rule sets are 
strived for is advantageous for the Pitt approach. More 
than 10 years ago, the GALE architecture [47.56, 
57] yielded very good performance results on a col- 
lection of datasets from the UCI repository. GALE 
distributes its evolutionary process adding additional 
niching biases due to a grid-based spatial distribution 
of individuals. A comparative study of GALE, XCS, 
and other machine learning algorithms can be found
in [47.58]. 

The GAassist architecture [47.59,60] develops 
a priority list of classification rules. The advantage of 
GAassist is its developing compactness. A compara- 
tive analysis with XCS is provided in [47.61]. Later, 
the architecture was enhanced with ensemble learn- 
ing techniques [47.62] and memetic algorithms [47.63], 
proving high scalability and fast learning of very com- 
pact rule sets. 

Recently, many efficiency enhancement techniques
from the GA literature (cf. [47.64]) and from other
fields, including bioinformatics and systems biology [47.65],
were applied to various LCSs. These techniques
can help tremendously to improve the learning
speed of LCSs, particularly in data mining realms. For
example, windowing techniques select subsets of data
instances to speed up the classifier evaluation process.
Fitness surrogates were used to make the fitness estimation
even cheaper [47.66]. Hybrid methods were
already mentioned above; they combine traditional GA
operators with informed ones, as is done when applying
memetic algorithms, which locally improve the
developing classifier structures when applied to LCSs.
In combination, such techniques can yield LCSs that
not only produce highly accurate classification performance
and good generalizations, but also offer
solution interpretability, allowing mining of the knowledge
developed in the LCS rules, and they generate
these results without requiring much computational
time, which is often comparable to the time needed
by much simpler machine learning techniques.


While the application of LCSs to data mining problems
will certainly still produce many further impressive results
and promises to yield novel, deep insights into data
structures, LCSs were originally designed as cognitive
systems. Thus, in the following we will focus on LCSs
as cognitive systems, their structures, and their potential
as neural cognitive models. As sketched
out above, the XCS classifier system in particular was
compared with Q-learning in RL. We start from this
perspective and detail various successful applications of
XCS in reinforcement learning problems. Next, ALCSs
are surveyed. ALCSs learn generalized cognitive maps
that are suitable to apply Sutton’s Dyna algorithm and
value iteration techniques in general. A strong relation
to factored RL techniques was pointed out recently in
this respect [47.67]. Finally, robotics applications of
LCSs are discussed and their potential is revealed.


47.5.1 Reward-Based Learning with LCSs


From the beginning [47.2] a big appeal of LCSs lay in
the fact that they are designed for reward-based learning.
Once the original bucket-brigade algorithm was
replaced by Q-learning techniques, a theory developed
in the RL community also applied to LCSs to a certain
extent.

In XCS, in particular, it was shown that the system
approximates the Q-value function by a collection
of classifiers. The prediction array (47.1) calculation
essentially approximates the current Q-value estimates
for the current state in the environment. The fitness
weighting based on the relative accuracies, which are
normalized to one, assures that these Q-value estimates
on average do not over- or underestimate the
expected Q-value. Moreover, since Q-learning is an
off-policy learning technique, XCS is well-suited to
be combined with it because XCS also benefits from
exploring all possible state-action combinations in
the long run, striving to develop an approximation
of the complete Q-value function in the problem
space.

As a result, XCS has been successfully applied to 
learning optimal paths in various maze environments. 
Starting from the Woods1 and Woods2 environments
proposed by Wilson [47.20,21], XCS’s performance 
and generalization capabilities have been investigated 
in various mazes [47.68]. For illustrative purposes such 
mazes are shown in Fig. 47.6. These maze environ- 
ments provide information about the surrounding grid 
cells, indicating whether they are either free or occu- 
pied by an obstacle or by food. Reaching the latter cell 
usually results in a reward trigger. Movements are typ- 
ically possible to the eight surrounding cells, yielding 
a rather large action space. The point of providing sensory
state information rather than cell IDs or coordinates
is that XCS is then able to exhibit its generalization 
capabilities. It essentially manages to generalize over 
the sensory state space ignoring irrelevant bits and gen- 
eralizing over the states with respect to state—action 
combinations that yield the same reward. 
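The sensory encoding and the resulting generalization can be made concrete with a short sketch. The text only states that each neighboring cell is encoded by two bits; the specific codes in `CELL_CODES`, the neighbor order, and the function names below are assumptions made for illustration.

```python
# Assumed two-bit codes; the source only states that two bits per cell are used.
CELL_CODES = {"free": "00", "blocked": "01", "food": "11"}


def encode_neighbors(cells):
    """Concatenate the codes of the eight neighboring cells into a 16-bit state."""
    if len(cells) != 8:
        raise ValueError("expected exactly eight neighboring cells")
    return "".join(CELL_CODES[c] for c in cells)


def matches(condition, state):
    """A ternary condition matches if every non-'#' symbol equals the state bit."""
    return len(condition) == len(state) and all(
        c == "#" or c == s for c, s in zip(condition, state)
    )


cells = ["free", "food", "blocked", "free", "free", "free", "blocked", "free"]
state = encode_neighbors(cells)
food_in_second_cell = "##11" + "#" * 12  # ignores all other neighbors
print(state, matches(food_in_second_cell, state))
```

A single condition such as `food_in_second_cell` covers every state in which the food lies in the second neighboring cell, which is the kind of generalization over irrelevant bits described above.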

Fig. 47.6a,b Two highly typical maze environments used
as benchmarks in the LCS literature for generalized
reinforcement learning. Woods1 is a toroidal maze. In Maze6
the food location is much harder to find. In both cases,
the LCS-controlled agent perceives information about the
eight neighboring cells encoding free, blocked, and food
cells by means of two bits. The agent can execute movements
to each of these cells. Movements to blocked cells
yield no reward. A movement to the food cell triggers
reward and a reset of the agent

In many of these environments, XCS has
shown extreme generalization capabilities. For example,
in the Maze6 environment (Fig. 47.6) up to 90
irrelevant bits were introduced, which changed randomly
while interacting with the environment. While
learning was slightly delayed and a larger population 
size was needed for successful learning, the optimal 
Q-value function was still extracted from iterative in- 
teractions [47.25]. Thus, XCS learned the optimal 
Q-value function in a problem space that contained 
more than 10% potential sensory state encodings. Also 
rather noisy action outcomes did not preclude learn- 
ing success. Later, it was shown that highly effective 
generalizations are even possible when each bit in the 
sensory encoding is relevant. In [47.36] the encoding 
for each bit was changed to a nested Boolean function, 
such as the parity function. XCS was still able to learn 
the optimal Q-value function, while Q-learning with- 
out generalization failed miserably due to the large state 
space. Thus, XCS is able to identify those aspects of the 
available sensory information that are relevant for accu- 
rate reward predictions. 

To successfully apply XCS in these scenarios, one
crucial modification was necessary to stabilize the Q-values
and thus the derived fitness values: the update of
the classifier predictions had to be further modified by
the error gradient factor, converting (47.3) to

r ← r + β(p - r)   (47.9)

The exact derivation of this equation can be found in
the literature [47.69]. The gradient term essentially
results in much more stable performance and successful
learning and generalization in problems that require 
the establishment of long reward chains. It stabilizes 
the reward learning by down-scaling updates of inaccu- 
rate and unreliable classifiers. Consequently, these rules 
do not tend to over-estimate reward, and thus learning 
progress is stabilized. As a further consequence, XCS 
with gradient-based reward prediction updates was
also successfully applied to blocks world problems, in 
which even more generalizations are possible [47.25]. 

The generalization capabilities of LCSs even extended
to the successful control of simple light-following
behavior on a real robot platform [47.70, 71]. In this
case, however, reward learning was maximized and no
complete Q-value function approximation developed.
Nonetheless, this work constituted one of the first
successful applications in the robotics domain.

Besides condition—action Michigan-style LCSs, 
such as the XCS, other Michigan-style LCS techniques 
have been applied for behavioral learning and also 
for learning cognitive maps. Such anticipatory learning 
classifier systems are surveyed in the following. 


47.5.2 Anticipatory Learning 
Classifier Systems 


Anticipatory learning classifier systems (ALCSs) are 
learning systems that learn a generalized predictive 
model or cognitive map [47.72] of the encountered en- 
vironment online. ALCSs are typical Michigan-style 
LCSs. However, in contrast to the usual classifier structure,
classifiers in ALCSs have a state prediction or anticipatory
part that predicts the changes in the environment
caused by executing the specified action
in the specified context. As in XCS, ALCSs
derive classifier fitness estimates from the accuracy of
their predictions. However, the accuracy of the anticipatory
state predictions is considered, rather than the
accuracy of the reward prediction. Figure 47.7 illus- 
trates the typical structures and learning processes that 
apply in an ALCS architecture. 

Rick Riolo originally proposed an ALCS that gen- 
erated its cognitive map mediated by a message list 
storage system, which was also used in Holland’s orig- 
inal classifier system architecture [47.73]. However, 
this approach appeared to not be sufficiently elegant 
to enable any serious learning. Starting with Stolz- 
mann’s anticipatory classifier system [47.74], various 
ALCS architectures were developed. Particularly in 
maze problems, optimal behavior was achieved with 
various ALCSs [47.75–79]. To prevent the development
of overgeneral models for concurrent reward learning, 
the reward learning process was often decoupled, yield- 
ing a system that learns a cognitive map based on 
LCS principles, and, additionally, a state value esti- 
mation system. In combination, DYNA-based learning 
techniques [47.80] were applied to improve the state 
value estimations also offline. These techniques al- 
lowed the simulation of animal-like behavioral patterns, 
such as reward adaptations based on knowledge about 
the behavioral consequences in rats in a T-maze envi- 
ronment [47.81], as well as in controlled devaluation 
or satiation experiments [47.82]. In these studies it 
was also pointed out that ALCSs do not only allow 
DYNA-based reward learning updates, but also en- 
able the application of search and planning techniques 
for improving behavioral performance of the system. 
Even curiosity mechanisms have been added [47.83] to 
speed up the learning progress. Most approaches, how- 
ever, never generalized the list of states with associated 
rewards. 

The combination of the ACS2 system with the XCS 
system for state-value estimations, terming the resulting 
system XACS (x-anticipatory classifier system), may 
be the one with the most current potential for future 
research [47.84]. XACS essentially applies two LCS 
learning mechanisms: one being an ALCS architecture 
in the form of ACS2, which learns a cognitive model of 
the encountered environment, and the other one being 
the XCS system, which learns state-value estimations 
in this case. Figure 47.8 illustrates the components in 
the XACS architecture and their interactions. 

XACS has been shown to develop optimal behav- 
ior in blocks world problems in which other approaches 
failed to yield proper generalizations and, as a result,
optimal behavior control. Moreover, the reward-based
generalization mechanism in XACS is directly based 
on the XCS classifier system, thus enabling the in- 
corporation of any tools and representations developed 
for XCS so far. The generalizations that were de- 
veloped confirmed the identification of task-relevant 
perceptual attributes. In the XCS components, reward- 
distinguishing attributes were identified. In the ACS2 
component, on the other hand, state prediction-relevant 
components were detected. In consequence, general- 
ized detectors for prediction with respect to reward 
and state could be distinguished. The implementa- 
tion of other anticipatory mechanisms in XACS, such 
as task-dependent attentional mechanisms, further in- 
teractions of the learning components, and multiple 
behavioral modules for the representation of multiple 
motivations (or needs) [47.84] are still open issues in 
the LCS realm. Further research with ALCSs is ex- 
pected to yield highly promising, cognitive learning 
architectures. 


47.5.3 Controlling a Robot Arm with an LCS 


We end this section of behavioral learning with the 
XCSF system. Over the last decade or so it has be- 
come increasingly clear that XCS is extremely well 
suited to partition a contextual space for the generation 
of accurate predictions. Predictions, however, do not 
necessarily need to be reward predictions. Behavioral 
consequences serve just as well as a target for predic- 
tions. The forward kinematics mapping in the robotics 
domain [47.85] offers even another potential target for 
learning. 

Consequently, XCSF was modified to learn the 
forward velocity kinematics of a robotic arm in simulation [47.86].
To do so, XCSF projects its condition
parts into the joint angle space of the robotic arm.
However, its locally linear predictions receive as input
small joint movements, that is, changes in joint space,
and predict the consequent change in task space, that
is, changes of the end-effector location. This mapping
has the great advantage that it is locally linear so that,
given a current joint angle constellation of the arm, not
only location changes of joint angle movements can
be predicted but also directional motion of the end-effector
can be invoked by inverting the locally linear
forward velocity mappings. Seeing that those are linear,
the inversion can be rather easily done using linear
algebra techniques. Given a redundant arm system,
one that has more degrees of freedom (i.e., joint angles
to manipulate) than actual locations to move to,
it is possible to add additional constraints to the arm
motion. For example, the arm can be driven to maintain
a relaxed arm posture while pursuing a certain goal
or it may be forced to prevent moving a certain joint
angle at all [47.87]. Recent advancements in the exploration
strategy, which can be self-induced by the XCSF
controller during learning, have shown that XCSF is
able to learn to control all seven degrees of freedom
of a humanoid arm highly effectively, flexibly adhering
to different constraints while pursuing motions
to certain goal locations. Moreover, mappings could
be learned in different reference frame representations. 
For example, end-effector locations were either repre- 
sented in a Cartesian coordinate system or in a distance 
plus angles encoding. XCSF learned different classi- 
fier structures due to the differences in the linearities 
encountered. Nonetheless, XCSF yielded equally good 
arm control in both cases [47.51]. Figure 47.9 illustrates 
the XCSF setup for arm control. 
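
The inversion of a locally linear forward velocity mapping can be illustrated with a minimal sketch. It does not implement XCSF: instead of a learned classifier prediction it uses the analytic Jacobian of a hypothetical planar two-link arm, and the names `forward`, `jacobian`, and `inverse_step` as well as the link lengths are assumptions made for illustration only. For a redundant arm the square inverse used here would have to be replaced by a pseudoinverse, which leaves room for the additional constraints mentioned above.

```python
import math

def forward(q, l1=1.0, l2=1.0):
    """End-effector position of a planar two-link arm (illustrative stand-in)."""
    x = l1 * math.cos(q[0]) + l2 * math.cos(q[0] + q[1])
    y = l1 * math.sin(q[0]) + l2 * math.sin(q[0] + q[1])
    return (x, y)

def jacobian(q, l1=1.0, l2=1.0):
    """Local linear map from small joint changes to end-effector changes."""
    s1, c1 = math.sin(q[0]), math.cos(q[0])
    s12, c12 = math.sin(q[0] + q[1]), math.cos(q[0] + q[1])
    return ((-l1 * s1 - l2 * s12, -l2 * s12),
            (l1 * c1 + l2 * c12, l2 * c12))

def inverse_step(q, dx):
    """Solve J(q) dq = dx for the 2x2 case; fails near singular postures."""
    (a, b), (c, d) = jacobian(q)
    det = a * d - b * c
    if abs(det) < 1e-9:
        raise ValueError("singular posture: local mapping cannot be inverted")
    return ((d * dx[0] - b * dx[1]) / det,
            (-c * dx[0] + a * dx[1]) / det)

q = (0.3, 1.2)        # current joint angles in radians
dx = (0.01, -0.02)    # desired small end-effector displacement
dq = inverse_step(q, dx)

# Apply the joint change and measure the displacement actually achieved.
start = forward(q)
moved = forward((q[0] + dq[0], q[1] + dq[1]))
achieved = (moved[0] - start[0], moved[1] - start[1])
```

Because the mapping is only locally linear, `achieved` matches `dx` up to a second-order error, which is why such control is applied in small steps.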

Fig. 47.9 XCSF for control. In the published robot arm
control applications, XCSF clusters
the contextual configuration state of
the arm and learns linear approximations
of the average Jacobian in
the respective subspaces. In consequence,
the system can generate
both forward predictions of movement
consequences as well as inverse
control commands when directional
movements of the arm are desired

These results confirmed that XCSF may very well
be further developed into a cognitive system architecture
for behavioral control. While this type of architecture
was probably not the one envisioned by Holland
originally, it may still prove highly valuable. Various
neuroscientific evidence points out that similar forward-inverse
predictive-control structures may be found in
the cerebellum [47.88, 89]. Only more detailed knowledge
on cortical and cerebellar structures may allow the
direct comparison of the shapes and orientations of the
receptive fields developed by the XCSF system and potential
cortical and neural structures found in the brain.
While the brain may not implement actual evolutionary
techniques literally, as XCSF does, it appears plausible
that local competitions take place [47.90]. Moreover,
it is known that neurons populate novel information
sources once available, as XCSF does. Further research
in neural computation with LCSs may prove
highly valuable.


47.6 Conclusions


While LCSs have been applied to a wide variety of
problems, there are still many potential developments
that have not been further evaluated. In the following,
potential future research directions are summarized.

At the moment nearly all LCSs are flat in that
they develop one population of classifiers (or competing
sets of classifiers in the Pitt-style system). All of the
classifiers, however, apply to the same problem granularity.
Ever since the introduction of LCSs by Holland,
the development of default hierarchies was envisioned.
However, so far it was never convincingly or rigorously
accomplished [47.19]. Default hierarchies refer
to classifier systems in which general rules predict one
thing but more specialized rules predict exceptions to
the general rule. The emergent development of default
hierarchies in LCSs remains an open challenge.

With the most recent understanding of LCSs and
the XCS system in particular, it seems that at least
the development of a hierarchically-structured LCS
architecture is within our grasp. We expect such a hierarchical
LCS to progressively refine its predictions
in a hierarchical way. Default rules may gain a certain
level of accuracy, but more specialized rules may
identify exceptions to the default prediction. Alternatively,
the more specialized rules may also simply add
further accuracy to the default predictions where and
when necessary. In the latter case, a hierarchical predictive
system may develop that allows the progressive
refinement of activated predictions until the finest prediction
granularity in the hierarchical representation is
reached.

When developing hierarchical LCSs, network
LCSs also seem to be of vital importance. For example,
when developing classifier structures in spatial domains
that are intricately structured, a network structure
may provide additional hints on the connectivity of
the space. Especially in the case where XCSF learns velocity
kinematics or, more generally, contextually-dependent
sensory-motor contingencies (as sketched out in the
section on controlling a robot arm with XCSF above),
a network structure can give additional hints on how the
sensorimotor space is structured and may be traversed.
Networks of LCS classifiers may allow the application
of lookahead planning and goal-oriented control, as
was pursued in early work in [47.91].

A network structure may also enable the speed- 
up of the XCS matching process. For example, when 
a problem space is sampled by means of a random 
walk process, overlapping classifiers may be directly 
identified within a classifier structure instead of apply- 
ing a global matching process in each iteration. Also, 
when XCSF is used for goal-directed control — as men- 
tioned above with respect to velocity kinematics — this 
may improve the efficiency of the system tremendously. 
Furthermore, given a hierarchically network-structured
LCS, matching may proceed from coarse-grained to
fine-grained levels. All these processes may speed up
the matching, which is often considered a bottleneck in
LCS research and has been improved by means of numerous
approaches in recent years [47.92, 93].

Besides these additions, also ALCSs may be pur- 
sued further, as sketched out above. From a cognitive 
modeling perspective, ALCSs essentially learn gener- 
alized schemata or production rules [47.94—96], which 
specify the expected state changes perceived after the 
execution of the specified action. Such rules may be ap- 
plied by the cognitive science community for learning, 
for example, ACT-R structures [47.97]. The lookahead 
planning capabilities, the sensorimotor generalization 
capabilities, as well as the abstraction capabilities of 
these systems still ask for further development. The recent
observation that ALCSs can be very effectively applied
to factored RL problems [47.98] should be further pursued.
Also, the combination of ALCS-based cognitive
map or concept learning and XCS-based reward learning
promises further research advancements.

Even without the addition of hierarchies, network
structures, or anticipations, however, LCSs can be successfully
applied to various domains including reinforcement
learning problems, classification and data
mining problems, and regression problems. XCS, in
particular, learns iteratively online, striving for the
development of a compact, maximally general, and
maximally accurate problem solution. Pitt-style systems
typically learn offline and are thus most promising
in large-scale data mining tasks in which rather small
compact sets of rules are searched for. Seeing that the
learning mechanisms of LCSs are highly flexible, it
is possible to substitute the condition of a classifier
with any other form or condition structure, as long
as this structure can be mutated and recombined in
a way that small structural changes also yield small
changes in the defined subspace within which the condition
matches. Similarly, the prediction structure can
be replaced with any other prediction structure that can
be quickly and accurately adapted by suitable learning
techniques. Thus, the available LCS techniques (such
as GALE and GAssist on the Pitt side and XCS, XCSF,
or XACS on the Michigan side) can be further exploited
and combined with novel structures and forms
of representations. Learning promises to be robust due
to the combination of a flexible evolutionary component,
which searches for optimal rule structures, and the
gradient-based fitness estimation, which quickly yields
useful prediction and fitness estimations. It seems only
a matter of time until LCSs gain even more recognition
and are successfully applied to even more diverse problem
domains and challenging research tasks.


47.7 Books and Source Code


Further information about learning classifier systems
can be found in the biannually published IWLCS
(International Workshop on Learning Classifier Systems)
workshop proceedings and yearly workshops
on the topic. A book on LCSs and the XCS classifier
system in particular covers XCS from a theoretical
and application-oriented point of view and
also provides a detailed algorithmic description of
the system [47.25]. A more theoretical coverage of
the approximation approach in XCS can be found
in [47.23]. Several books also give further details on
theoretical considerations [47.23,99] as well as on
successful applications of LCSs [47.100, 101]. The
source code can be found online, for example, for
XCS in C++ [47.102] as well as for XCSF in
Java [47.103].


48. Indicator-Based Selection


Lothar Thiele 


The goal of multiobjective evolutionary optimiza- 
tion is to determine a set of solutions that satisfies 
certain optimality properties. Recently, a growing number
of very competitive search algorithms have been
based on an explicit formulation
of the optimization goal as a set property, i.e.,
they build on the concept of set indicators. These
indicators are used to guide the selection process,
which is usually denoted as indicator-based
selection. This major breakthrough leads to several
advantages in terms of analysis and algorithm
design: Algorithms are conceptually simpler and 
more robust as they are largely based on a sin- 
gle indicator; certain convergence properties can 
be proven; the optimization criterion is made ex- 
plicit; by changing the set indicator, it is possible 
to explicitly consider preferences of a user. The 
chapter introduces step-by-step the concept of
set indicators and their use in indicator-based
selection.


48.1 Motivation


Variation and selection are the main ingredients of
evolutionary optimization algorithms. Despite the many
variations that have been developed in the past, their
basic iterative structure can simply be described by the
following three steps:


1. From the current set of solutions (parent set), a subset
is determined (mating pool) by mating selection.

2. From the solutions in the mating pool new solutions
are generated (offspring set) through variation operators
such as mutation and recombination.

3. Environmental selection determines the new parent
set as a subset of the joined parent set and offspring
set.


As can be seen in the above template, selection
denotes the process of forming a subset of a set of
solutions. Mating selection determines the set of candidate
solutions that will be further explored by constructing
new solutions, i.e., the offspring set. To this
end, promising solutions need to be selected whose 
offspring are expected to advance the optimization
process most. In contrast, environmental selection com- 
bines parent and offspring sets toward the new parent 
set and thereby, it reduces the number of solutions that 
are considered in the next iteration. Loosely speaking, 
mating selection is involved in the exploration phase 
of the evolutionary optimization whereas environmen- 
tal selection is central to the decision phase. 
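
The three-step template can be sketched as a short loop. This is a minimal, single-objective illustration under stated assumptions, not a multiobjective algorithm from the literature: binary tournaments serve as mating selection, Gaussian mutation as the only variation operator, and truncation as environmental selection. The function name `evolve` and all parameter names and defaults are hypothetical.

```python
import random

def evolve(fitness, init, mutate, mu=6, lam=6, generations=200, seed=0):
    """Generic loop: mating selection, variation, environmental selection."""
    rng = random.Random(seed)
    parents = [init(rng) for _ in range(mu)]
    for _ in range(generations):
        # 1. Mating selection: binary tournaments fill the mating pool.
        pool = [min(rng.sample(parents, 2), key=fitness) for _ in range(lam)]
        # 2. Variation: each pool member is mutated into one offspring.
        offspring = [mutate(p, rng) for p in pool]
        # 3. Environmental selection: keep the mu best of parents and offspring.
        parents = sorted(parents + offspring, key=fitness)[:mu]
    return parents

# Toy problem (assumed for illustration): minimize (x - 3)^2 over the reals.
best = evolve(fitness=lambda x: (x - 3.0) ** 2,
              init=lambda rng: rng.uniform(-10.0, 10.0),
              mutate=lambda x, rng: x + rng.gauss(0.0, 0.5))
```

Step 3 is the place where indicator-based selection enters in the multiobjective case: instead of sorting by a scalar fitness, a subset of the joined set is chosen according to a set indicator.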

Set indicators map sets of solutions to scalar values. 
They characterize to which degree the set satisfies some 
desirable property. Therefore, they can be used to guide 
the selection process, which is usually denoted as indicator-based
selection. This chapter concentrates
on environmental selection, as indicator-based methods
have mainly been applied in this context.

The goal of multiobjective evolutionary optimization
is to determine a set of solutions that satisfies
certain optimality properties. The corresponding notion
of optimality is partially defined by solution preference,
i.e., when we consider one single solution to be preferable
to another single solution. One common choice of
such a solution-oriented preference relation is Pareto
dominance. But there is still a large degree of freedom
left in defining what an optimal set of solutions is, as 
there may be many more Pareto-optimal solutions than 
can be reasonably processed, stored, or presented to the 
user. Therefore, we need additional information that de- 
scribes the preference of the user, i.e., what subset of 
Pareto-optimal solutions he/she is interested in. For ex- 
ample, the user may be interested in a diverse set of 
solutions or in solutions which cover a certain subspace 
of interest. Set indicators can now define such a prefer- 
ence relation and influence: 


a) The result of the population-based optimization and 

b) The characteristics of the sets of solutions during an 
optimization run and 

c) The search efficiency. 


Traditionally, multiobjective evolutionary optimization
algorithms such as NSGA-II (nondominated sorting
genetic algorithm II) [48.1] or SPEA2 (strength
Pareto evolutionary algorithm 2) [48.2] start from solution
preference, i.e., the Pareto-dominance relation
between the solutions, and then attempt to consider
set preferences such as diversity using heuristics. As
a downside of this approach, deterioration and cyclic
behavior have been reported [48.3], formal convergence
results cannot be obtained, and unsatisfactory
optimization results have been shown for high-dimensional
objective spaces [48.4].

On the other hand, indicator-based selection treats
multiobjective evolutionary optimization as a set optimization
with a single optimization criterion, namely
the set-preference relation or its defining set indicator.
In other words, instead of focusing on individual
solutions with multiple criteria, set-based methods consider
sets of solutions as the object of optimization
and a single set criterion, i.e., the set indicator. This
is a radical change from the traditional approach. The
set indicator directly represents the user preference
and the optimization algorithm determines a set of
solutions that optimizes this single set indicator. The
advantages of this approach are obvious: a formal and
unambiguous definition of the optimization goal, the
possibility to show strong convergence results, and a clear
approach to consider user preferences in the search
method.


48.2 Basic Concepts


Before discussing the role of indicators, selection, and
archiving in multiobjective evolutionary algorithms, we
will define the notation used in the forthcoming sections.
In particular, we will define the underlying class
of multiobjective optimization problems.


48.2.1 Notation


We will consider the minimization of a vector-valued
objective function $f = (f_1, \ldots, f_n) : X \to \mathbb{R}^n$, which
maps each point in the decision space $X$ to an $n$-dimensional
vector. The decision space $X$ denotes the
feasible set of alternatives for the optimization problem
and $n$ denotes the dimension of the minimization problem,
i.e., the number of objectives. For simplicity of
notation, we suppose in the following that $X$ is finite.
Often, we call an element of the decision space a solution
and the corresponding objective value $z = f(x)$
is denoted as objective vector. The image of the decision
space $X$ under the objective function $f$ is called
objective space $Z \subseteq \mathbb{R}^n$ with $Z = \{f(x) \mid x \in X\}$, i.e., it
contains all objective vectors corresponding to solutions
in $X$.

In the above formulation, it is not yet clear what 
we understand as the minimization of a vector-valued 
function. In this chapter, we follow the usual concept of 
Pareto dominance which defines an order relation be- 
tween all solutions based on a preference relation, i. e., 
it defines when we call a solution better than another 
one. 


Definition 48.1

A solution $a \in X$ weakly Pareto dominates a solution
$b \in X$, denoted as $a \preceq b$, if $f_i(a) \le f_i(b)$ for all
$1 \le i \le n$. Solution $a$ strongly Pareto dominates $b$, denoted
as $a \prec b$, if $(a \preceq b) \wedge (b \npreceq a)$.


We can rewrite the strong domination criterion as
$(a \prec b) \Leftrightarrow (a \preceq b) \wedge (f(a) \neq f(b))$. We also say that
solution $a$ is better than or weakly preferable to $b$ if
$a \prec b$ or $a \preceq b$, respectively. Note that the weak Pareto-dominance
relation is suitable for optimization as it
defines a preorder on the set of solutions $X$. A preorder
$\preceq$ on a given set $X$ is reflexive and transitive: $a \preceq a$ and
$(a \preceq b) \wedge (b \preceq c) \Rightarrow (a \preceq c)$ hold for all $a, b, c \in X$.

In terms of optimization, we say that a solution
$a \in X$ is Pareto-optimal if there is no better solution in
$X$, i.e., if $b \preceq a$ for some $b \in X$ then $a \preceq b$. The set of
all Pareto-optimal solutions is denoted as the Pareto-optimal
set and its image in the objective space as the
Pareto-optimal front.
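
The dominance relations and the notion of Pareto optimality can be written down directly. The following sketch operates on objective vectors $f(a)$ given as tuples (minimization); the function names are chosen here for illustration and do not stem from the chapter.

```python
def weakly_dominates(fa, fb):
    """a weakly dominates b: f_i(a) <= f_i(b) for every objective i."""
    return all(x <= y for x, y in zip(fa, fb))

def strictly_dominates(fa, fb):
    """a strongly dominates b: a weakly dominates b, but not vice versa."""
    return weakly_dominates(fa, fb) and not weakly_dominates(fb, fa)

def pareto_optimal(points):
    """Keep every objective vector that no other vector strictly dominates."""
    return [p for p in points
            if not any(strictly_dominates(q, p) for q in points)]

points = [(1, 4), (2, 2), (3, 3), (4, 1), (2, 2)]
front = pareto_optimal(points)
# (3, 3) is strictly dominated by (2, 2); the duplicate (2, 2) stays because
# equal objective vectors weakly, but not strictly, dominate each other.
```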

Ideally, a multiobjective optimizer determines the 
Pareto-optimal set for a given objective function f and 
the corresponding decision space X. Traditionally, evo- 
lutionary multiobjective algorithms attempt to solve 
this problem by generating a suitable approximation of 
the Pareto-optimal set. To this end, they maintain and 
improve sets of solutions, denoted as populations. In 
this context, the following questions arise: 


• If the set of Pareto-optimal solutions is too large to
be determined efficiently, how do we select those
which will be the result of the optimization process?

• How do we valuate a set of solutions, i.e., an approximation
of the set of Pareto-optimal solutions,
in terms of its degree of optimality in order to guide
the optimization process?


The chapter describes how indicator-based selection 
can be used to answer the above questions. There- 
fore, it touches two core issues for multiobjective 
evolutionary algorithms: (a) how to formalize the op- 
timization goal in the sense of specifying what type of 
set is sought; (b) how to efficiently determine a suit- 
able subset to achieve the formalized optimization 
goal. 

The following section introduces the concept of set 
indicators that can be used to valuate a set of solutions, 
i. e., associate a quality indicator which describes its de- 
gree of optimality. 


48.2.2 Set Indicators 


Preference relations between sets of solutions are the 
basis of set-based multiobjective optimization. They 
provide the information on the basis of which the search 
is carried out, i.e., for any two Pareto set approxima- 
tions, they say whether one set is considered to be equal, 
better, or worse. 

A set indicator can now be used to define such 
a preference relation and therefore to indicate whether 
one set of solutions is preferable to another one. In addition,
it also contains quantitative information about the
degree of preference. 

Depending on the particular definition of the prefer- 
ence relation, a set can be considered to be better than 
another one or even the other way round. With different 
definitions of such a preference relation, we can expect 
that the optimal result of a search process will be differ- 
ent as well. Therefore, the definition of an indicator is 
essential for formally defining the goal of the set-based 
optimization. In addition, it allows us to adjust the opti- 
mization goal according to the preferences of the user, 
i.e., to provide flexibility with respect to the subset of 
Pareto-optimal solutions searched for. 

But the set indicator and the resulting preference 
relation can not be chosen arbitrarily as they need to 
conform to the concept of Pareto dominance. Other- 
wise, the search process may end up with a set which 
is weakly preferable to all other sets but does not con- 
tain any Pareto-optimal solution. 

In order to derive the requirements for a well- 
behaved set indicator, let us first generalize the concept 
of Pareto dominance of solutions to Pareto dominance 
of sets. 


Definition 48.2

A set of solutions $A \subseteq X$ weakly Pareto dominates
a set of solutions $B \subseteq X$, denoted as $A \preceq B$, if
$\forall b \in B : (\exists a \in A : a \preceq b)$. Set $A$ strongly Pareto dominates
set $B$, denoted as $A \prec B$, if


$$(A \preceq B) \wedge (B \npreceq A).$$


In other words, a set of solutions A weakly dom- 
inates a set of solutions B if every solution in B is 
weakly dominated by at least one solution in A. More- 
over, it can be shown that the set-based dominance 
relation defines a preorder, i. e., it is suited for optimiza- 
tion purposes. 
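
Definition 48.2 translates into two nested quantifiers. The sketch below again works on objective vectors rather than on solutions, and its function names are illustrative assumptions.

```python
def weakly_dominates(fa, fb):
    """Solution-level weak Pareto dominance on objective vectors."""
    return all(x <= y for x, y in zip(fa, fb))

def set_weakly_dominates(A, B):
    """Every element of B is weakly dominated by at least one element of A."""
    return all(any(weakly_dominates(a, b) for a in A) for b in B)

def set_strictly_dominates(A, B):
    """A weakly dominates B, but B does not weakly dominate A."""
    return set_weakly_dominates(A, B) and not set_weakly_dominates(B, A)

A = [(1, 3), (3, 1)]
B = [(2, 3), (3, 2)]
C = [(2, 2)]
```

In this example `A` strongly dominates `B`, whereas `A` and `C` are incomparable: neither set weakly dominates the other, which shows that set dominance is only a preorder and not a total relation.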

Now, let us define the concept of a set indicator 
and its induced preference relation. In the first part of 
this section, we restrict ourselves to unary indicators. 
A more detailed discussion on the various aspects of 
indicators is provided in [48.5]. 


Definition 48.3

A unary indicator maps each set $A \subseteq X$ of the decision
space to a real number $I(A) \in \mathbb{R}$. Given an indicator,
we can determine the corresponding preference relation
$\preceq_I$ as


$$A \preceq_I B := (I(A) \ge I(B)).$$


In other words, the larger the set indicator of a set of
solutions the better we consider the set. It can be shown
that the preference relation induced by the indicator defines
a total preorder on the sets of solutions, i.e., on the
subsets of $X$.

As discussed above, not all preference relations can
be used inside search methods as they at least need
to comply with the definition of Pareto dominance in
Def. 48.2. To this end, the following definition describes
the notion of preference refinement:


Definition 48.4
A preference relation $\preceq_{\mathrm{ref}}$ is denoted as a refinement
of $\preceq$ if


$$A \prec B \Rightarrow A \prec_{\mathrm{ref}} B.$$


What we need to guarantee can be formulated as follows:
If a set of solutions $A \subseteq X$ is strictly better than a set of
solutions $B \subseteq X$ in the sense of Pareto dominance, i.e., $A \prec B$,
then the preference relation used for optimization
should say so as well, i.e., $A \prec_{\mathrm{ref}} B$, see also [48.5].
If we use this result for the unary indicator according
to Def. 48.3, then we directly get the following
condition for a compliant indicator, i.e., one whose corresponding
preference relation is a refinement of $\preceq$:


$$A \prec B \Rightarrow (I(A) > I(B)). \qquad (48.1)$$


In other words, if a set $A$ is strictly better than
a set $B$, i.e., $A \prec B$, then the indicator should
say so as well, i.e., $I(A) > I(B)$. It has been shown
in [48.5] that Pareto compliance guarantees that
a set with the maximal indicator value is minimal with
respect to the Pareto-dominance relation according to
Def. 48.2.

Indicators were first introduced to the area of multiobjective
evolutionary optimization as a means to
compare different optimization runs [48.6-9]. The use
of indicators to guide multiobjective search methods in 
general appeared in the year 2003, notably in [48.10] 
and later in [48.11—13]. In a more restricted setting, in- 
dicators have been used for archiving, i. e., maintaining 
a set of Pareto-approximate solutions [48.3, 14]. 

In several studies, the properties of set indicators
have been investigated in terms of their compliance with
Pareto dominance [48.7, 15]. Whereas many well-known
and widely applied indicators do not fall into
this class, there are various indicators that at least satisfy
a weak refinement ($A \preceq B \Rightarrow A \preceq_{\mathrm{ref}} B$), e.g., the
unary $R_2$ and $R_3$ indicators [48.16] and the
unary additive and multiplicative
epsilon indicators [48.3, 15]. The latter two indicators
are related to additive or multiplicative approximation
[48.17, 18].

Before discussing binary indicators, let us intro- 
duce an example of a set indicator that is compliant to 
Pareto dominance. The hypervolume indicator has been 
introduced to the field of multiobjective evolutionary 
optimization in [48.19] for the purpose of performance 
assessment. It can be defined as 


$$I_H(A, R) = \int_{z \in H(A,R)} dz , \qquad (48.2)$$


where $H(A, R)$ denotes the objective space dominated
by $A$ and dominating $R$


$$H(A,R) = \{ z \in \mathbb{R}^n \mid \exists a \in A : \exists r \in R : f(a) \le z \le r \} .$$


In other words, we determine the volume covered by
all points $z \in \mathbb{R}^n$ that are enclosed between the image
of the solutions in objective space $f(A)$ and the reference
set $R$, where enclosed is interpreted in terms
of weak Pareto dominance. Due to its compliance to 
Pareto dominance it has been used in most of the indi- 
cator-based selection schemes to date. 
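
For two objectives the integral in (48.2) reduces to a sum of rectangles. The following sketch assumes a reference set consisting of a single point $r$ and takes objective vectors $f(a)$ directly as input; the function name `hypervolume_2d` and the sweep formulation are illustrative choices, not the algorithms cited in this chapter.

```python
def hypervolume_2d(points, ref):
    """Area weakly dominated by `points` and weakly dominating `ref`.

    Minimization in both objectives; `ref` is a single reference point.
    """
    inside = sorted(p for p in points if p[0] <= ref[0] and p[1] <= ref[1])
    area, best_y = 0.0, ref[1]
    for x, y in inside:          # sweep by increasing first objective
        if y < best_y:           # only nondominated points add a new strip
            area += (ref[0] - x) * (best_y - y)
            best_y = y
    return area

ref = (4, 4)
A = [(1, 3), (3, 1)]
B = [(2, 3), (3, 2)]
hv_A = hypervolume_2d(A, ref)    # 3 + 2 = 5
hv_B = hypervolume_2d(B, ref)    # 2 + 1 = 3
```

Here `A` strongly dominates `B` in the sense of Def. 48.2, and accordingly `hv_A > hv_B`, as the compliance condition (48.1) requires.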

One of the major drawbacks of the hypervolume
indicator is the associated computational overhead.
Bringmann and Friedrich [48.20] have proven that
the problem of computing the hypervolume is #P-hard,
i.e., there exists no polynomial algorithm unless
NP = P. Several algorithms have been proposed in
the past to determine the hypervolume indicator, starting
from the hypervolume by slicing objectives approach
independently proposed by several authors (Knowles
and Zitzler) with complexity $O(N^{n-1})$, where $N$ is the
number of solutions in the population and $n$ is the number
of objectives. Later on, improved versions appeared
with complexity $O(N^{n-2} \log N)$ [48.21] and finally
$O(N \log N + N^{n/2} \log N)$ [48.22]. An approximation algorithm
with proven bounds is presented in [48.23],
which gives an $\varepsilon$-approximation of the hypervolume
with probability $(1 - \delta)$ in time $O(\log(1/\delta)\, nN/\varepsilon^2)$.
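
The idea of trading exactness for speed can be illustrated with plain Monte Carlo sampling. The sketch below is not the algorithm of [48.23] and comes without its guarantees; it assumes a single reference point that is weakly dominated by all given objective vectors, and the name `hypervolume_mc`, the sample count, and the seed are arbitrary choices.

```python
import random

def hypervolume_mc(points, ref, samples=20000, seed=1):
    """Estimate the hypervolume by uniform sampling in a bounding box."""
    rng = random.Random(seed)
    n = len(ref)
    # Box between the componentwise minimum of the points and the reference.
    lower = [min(p[i] for p in points) for i in range(n)]
    box = 1.0
    for lo, hi in zip(lower, ref):
        box *= hi - lo
    hits = 0
    for _ in range(samples):
        z = [rng.uniform(lo, hi) for lo, hi in zip(lower, ref)]
        # z counts if at least one point weakly dominates it.
        if any(all(p[i] <= z[i] for i in range(n)) for p in points):
            hits += 1
    return box * hits / samples

estimate = hypervolume_mc([(1, 3), (3, 1)], ref=(4, 4))
```

For this two-dimensional example the exact value is 5, so the estimate can be checked directly; the cost per sample grows only linearly in the number of points and objectives.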

Binary indicators $I(A,B)$ can be used to compare
two sets $A$ and $B$ as described in the following
Def. 48.5.


Definition 48.5
A binary indicator maps an ordered pair of sets $A, B \subseteq X$
of the decision space to a real number $I(A, B) \in \mathbb{R}$.
Given a binary indicator, we can determine the corresponding
preference relation $\preceq_I$ as


$$A \preceq_I B := (I(A,B) \ge I(B,A)).$$


In a similar way to (48.1), one can derive the condition
for an indicator whose corresponding preference relation
is a refinement of $\preceq$,


$$A \prec B \Rightarrow (I(A,B) > I(B,A)).$$


Two popular examples of binary indicators that
have been successfully used in indicator-based selection
[48.11,24,25], archiving [48.3] and approximation
schemes [48.18] are the additive and multiplicative epsilon
indicators. They can be defined as


$$I_\varepsilon^{+}(A,B) = \min_{b \in B} \max_{a \in A} F_\varepsilon^{+}(a,b) ,$$

$$F_\varepsilon^{+}(a,b) = \min_{1 \le i \le n} \left( f_i(b) - f_i(a) \right) \qquad (48.3)$$
for the additive version and


$$I_\varepsilon^{*}(A,B) = \min_{b \in B} \max_{a \in A} F_\varepsilon^{*}(a,b) ,$$

$$F_\varepsilon^{*}(a,b) = \min_{1 \le i \le n} \frac{f_i(b)}{f_i(a)} \qquad (48.4)$$


for the multiplicative one. Formally speaking,
$I_\varepsilon^{+}(A,B)$ (or $I_\varepsilon^{*}(A,B)$) denotes the maximum
amount one can add to (or multiply with) every
objective value $f_i(a)$ of every solution $a \in A$ such that
the resulting set still weakly dominates $B$.

Unfortunately, the above binary indicators do not
induce a preorder as the resulting preference relation
is not transitive in general [48.5]. This negative result
needs to be considered when deciding to use them (and similar
generalizations of unary indicators) in optimization
algorithms.

Next, indicator-based selection and its integration
into multiobjective search algorithms will be
discussed.


48.3 Selection Schemes
48.3.1 Basic Search Algorithm


Let us start the discussion of indicator-based selection
with a simple template of a multiobjective evolutionary
algorithm (SPAM - set preference algorithm
for multiobjective optimization [48.26]) as shown in
Alg. 48.1.


Algorithm 48.1 Simple SPAM
1: generate initial set of solutions $P$ of size $\mu$
2: while termination criterion not fulfilled do
3:     generate $\lambda$ offspring solutions $O \subseteq X$
4:     $S$ = select($P \cup O$, $\mu$)
5:     if $S \preceq_{\mathrm{ref}} P$ then $P \leftarrow S$
6: return $P$


Obviously, the template is still very simplistic but
it will help us to understand the integration of the
concept of indicators and selection in multiobjective
evolutionary algorithms. Line 3 in Alg. 48.1 refers to
the variation of solutions that are in population $P$, i.e.,
starting with mating selection and then applying variation
operators such as mutation and recombination. This
essential part of any evolutionary algorithm will not be
discussed further here. Line 4 is denoted as environmental
selection and reduces the union of parent set $P$
and offspring set $O$ from size $\mu + \lambda$ to $\mu$ again, i.e.,
$S \subseteq P \cup O$ and $|S| = \mu$. Finally, line 5 is responsible
for selecting either the old population $P$ or the new one
$S$ depending on the chosen preference relation $\preceq_{\mathrm{ref}}$.
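
The structure of Alg. 48.1 can be sketched as follows. Everything beyond the six lines of the template is an assumption made for illustration: the decision space is the interval [0, 1], the bi-objective function is a toy example, variation is Gaussian mutation of randomly chosen parents, `select` draws a uniformly random subset, and the preference relation is induced by a two-dimensional hypervolume with a single reference point.

```python
import random

def hypervolume_2d(points, ref):
    """2-D hypervolume for minimization with a single reference point."""
    inside = sorted(p for p in points if p[0] <= ref[0] and p[1] <= ref[1])
    area, best_y = 0.0, ref[1]
    for x, y in inside:
        if y < best_y:
            area += (ref[0] - x) * (best_y - y)
            best_y = y
    return area

def spam(objective, select, mu, lam, ref, iterations, seed=0):
    rng = random.Random(seed)

    def indicator(S):
        return hypervolume_2d([objective(x) for x in S], ref)

    P = [rng.random() for _ in range(mu)]                       # line 1
    history = [indicator(P)]
    for _ in range(iterations):                                 # line 2
        O = [min(1.0, max(0.0, rng.choice(P) + rng.gauss(0.0, 0.1)))
             for _ in range(lam)]                               # line 3
        S = select(P + O, mu, rng)                              # line 4
        if indicator(S) >= indicator(P):                        # line 5
            P = S
        history.append(indicator(P))
    return P, history                                           # line 6

def random_subset(pool, mu, rng):
    """Placeholder selection operator: a uniformly random subset of size mu."""
    return rng.sample(pool, mu)

P, history = spam(lambda x: (x, 1.0 - x ** 0.5), random_subset,
                  mu=5, lam=5, ref=(1.1, 1.1), iterations=100)
```

Even with this uninformed selection operator, the acceptance test in line 5 ensures that the indicator value of the population never decreases; the refinements discussed next replace `random_subset` by more purposeful operators.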

In the following, we will stepwise refine the selec- 
tion operator in line 4 and thereby, relate the above 
template to existing indicator-based selection schemes. 


48.3.2 Exhaustive Selection 


Let us first suppose that the selection operator in line
4 is exhaustive in the following sense: If there exists
a subset $S \subseteq P \cup O$ with $|S| = \mu$ that satisfies $S \preceq_{\mathrm{ref}} P$,
then it will generate it with nonzero probability. Under
this condition, one can prove an important convergence
property of the algorithm that ensures that there is no
deterioration behavior as reported for algorithms such
as NSGA-II and SPEA2 [48.3]. The line of arguments
is just sketched here as it closely follows the investigations
in [48.27].

In most general terms, the goal of the optimization
is to generate as large a subset as possible of the Pareto-optimal
solutions. Therefore, what we at least require is
that Alg. 48.1 generates such a set provided that it runs
long enough. Indeed, one can show that this is the case
if:


a) The offspring generation in line 3 is exhaustive, i.e.,
all solutions are generated with nonzero probability,

b) The selection operator is exhaustive and

c) $\preceq_{\mathrm{ref}}$ is a refinement of the Pareto dominance.


Let us suppose for simplicity of arguments that
there are more than $\mu$ Pareto-optimal solutions in $X$.
Moreover, suppose that the population at some point
in time (still) contains a dominated solution. Then
there is a nonzero probability that in the set $O$ of offspring
there is a Pareto-optimal solution not yet in
$P$. Replacing the dominated solution with the additional
Pareto-optimal solution leads to a preferred set
according to $\preceq_{\mathrm{ref}}$ as it refines the Pareto dominance.
Note, however, that the above convergence property
does not mean that Alg. 48.1 determines an optimal
set w.r.t. $\preceq_{\mathrm{ref}}$, i.e., that the resulting subset of
Pareto-optimal solutions actually is minimal in terms
of $\preceq_{\mathrm{ref}}$.


Exhaustive selection can usually not be implemented
efficiently, as all possible subsets must be
tested, i.e., $\binom{\mu+\lambda}{\mu}$ possible preference relations. The
following refinements of the basic algorithm lead to
more efficient schemes.
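
To give a feeling for how quickly the number of candidate subsets grows, the count can be evaluated for two population sizes that are assumed here purely for illustration, $\mu = \lambda = 10$ and $\mu = \lambda = 50$:

```latex
\binom{\mu+\lambda}{\mu} = \frac{(\mu+\lambda)!}{\mu!\,\lambda!},
\qquad
\binom{20}{10} = 184\,756,
\qquad
\binom{100}{50} \approx 1.01 \times 10^{29}.
```

Already for moderately sized populations, enumerating all subsets in every iteration is therefore out of reach.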


48.3.3 Steady State and Greedy Selection 


A first possibility has been proposed in the indicator-based
selection and archiving schemes described
in [48.12, 14]. The size of the offspring set is $\lambda = 1$ and
therefore, at most $\mu + 1$ preference relations need to be
constructed in each iteration. In particular, the hypervolume
indicator $I_H$ [48.9] (S measure) has been used
to define the preference relation, i.e., $\preceq_{\mathrm{ref}} := \preceq_{I_H}$. In this
case, the selection in line 4 just removes the solution
that leads to the least loss in $I_H$.

Still, the convergence to a Pareto-optimal subset can
be guaranteed if the offspring generation is exhaustive.
On the other hand, it cannot be guaranteed that the algorithm
determines an optimal set w.r.t. $\preceq_{\mathrm{ref}}$, i.e., a set
that is not strictly dominated by any other subset of size
$\mu$ w.r.t. $\preceq_{\mathrm{ref}}$. First counterexamples for various set indicators
that show this property for $\lambda < \mu$ and especially
for $\lambda = 1$ appeared in [48.5]. A more in-depth discussion
on this issue for the hypervolume indicator can be
found in [48.28].

A second approach allows for general sizes $\lambda$ of the
offspring population $O$ and employs a simple greedy
strategy, i.e., solutions are removed one-by-one from
the set $P \cup O$ until a set with size $\mu$ is obtained. The
following template in Alg. 48.2 sketches the approach.


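Algorithm 48.2 Greedy Selection
1: procedure select($P \cup O$, $\mu$)
2:   $S \leftarrow P \cup O$
3:   while $|S| > \mu$ do
4:     for all $a \in S$ do
5:       $\delta_a \leftarrow \mathrm{loss}(S, a)$
6:     choose $p \in S$ with $\delta_p = \min_{a \in S} \delta_a$
7:     $S \leftarrow S \setminus \{p\}$
8:   return $S$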


If $\lambda = 1$, then this template covers the steady-state
selection scheme in [48.12, 14]. The function loss(S, a) 
quantifies the loss in set quality, if solution a is removed 
from it. In line 6, the solution with the smallest loss is 
chosen and removed from the population in line 7. If 
the preference relation is based on a unary indicator as 
shown in Def. 48.3, then the loss function can simply 
be determined as 


$$\mathrm{loss}(S, a) = I(S) - I(S \setminus \{a\}).$$


For the more general case of preference relations that 
are not based on indicators, see also [48.5]. Note that 
convergence to the set with the maximal indicator value 
can now not be guaranteed anymore as the greedy selec- 
tion is a heuristic. For the hypervolume indicator this is 
shown in [48.29]. On the other hand, we can still guarantee
that SPAM with greedy selection generates as large a subset
of the Pareto-optimal solutions as possible
if the offspring generation in line 3 is exhaustive, i.e., all
solutions are generated with nonzero probability, and
$\preceq_{\mathrm{ref}}$ is a refinement of the Pareto dominance.

As shown in Alg. 48.2, we do not need to determine
the value of the set indicator but only the least
contributor, i.e., the solution that leads to the minimal
loss. Depending on the choice of the indicator, this information
may be easier to compute than evaluating the
indicator for $\lambda + \mu$ different sets and comparing the values.
In the context of the hypervolume indicator, a more
detailed discussion on this issue is provided in [48.30]. 
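The greedy template of Alg. 48.2 can be sketched in a few lines. The example below is a minimal illustration, not the implementation used in the cited works: it assumes two objectives that are both minimized, uses a plain two-dimensional hypervolume with a user-chosen reference point as the unary indicator, and computes each loss naively as the indicator difference. All function names are hypothetical.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by `ref` (both objectives minimized)."""
    # Only points that are strictly better than the reference point contribute.
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    area, best_f2 = 0.0, ref[1]
    for f1, f2 in pts:  # ascending in the first objective
        if f2 < best_f2:  # nondominated among the points seen so far
            area += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area


def greedy_select(candidates, mu, indicator):
    """Remove the least contributor one-by-one until mu solutions remain."""
    s = list(candidates)
    while len(s) > mu:
        total = indicator(s)
        # loss(S, a) = I(S) - I(S \ {a}) for every a in S
        losses = [total - indicator(s[:i] + s[i + 1:]) for i in range(len(s))]
        p = min(range(len(s)), key=losses.__getitem__)
        del s[p]
    return s


ref = (4.0, 4.0)  # assumed reference point
pool = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (2.5, 2.5)]
print(greedy_select(pool, 3, lambda s: hypervolume_2d(s, ref)))
```

In this small pool the point (2.5, 2.5) is dominated by (2, 2), so its removal costs no hypervolume and it is discarded first. Recomputing the indicator for every candidate is the naive route; as noted above, specialized least-contributor computations may be cheaper.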

As has been mentioned already, the indicator-based 
selection schemes described here have been applied 
to the problem of archiving as well. Archiving algo- 
rithms attempt to maintain a bounded set of solutions 
given a sequence of solutions [48.3, 14]. In analogy to 
the template in Alg. 48.1, the sequence of solutions 
would be generated by the offspring generation and 
the selection process would determine the new population
P. For archiving purposes, one is usually only
interested in maintaining a subset of all nondominated 
solutions received so far. Dominated solutions in P are 
usually not considered in the underlying set preference 
relations. 

The Pareto-dominance relation on sets is by definition
insensitive to dominated solutions. The same
holds for set preference relations based on popular 
indicators such as the hypervolume indicator which re- 
flects the volume dominated by a set of solutions. On 
the other hand, preferences among dominated solutions 
may be of importance to guide the search. In partic- 
ular, all solutions in P (dominated and nondominated 
ones) are candidates for the mating selection and there- 
fore, may be chosen for variation in the generation of 
offspring solutions. Therefore, useful set preference re- 
lations that are refinements of Pareto dominance need 
to be constructed that allow us to consider preferences 
on dominated solutions as well. 


48.3.4 Hierarchical Set Preferences 


Most of the hierarchical indicator-based selection 
schemes that have been described so far combine set 
indicators with constructing a sequence of subsets of 
solutions [48.5, 12,31]. For example, nondominated 
sorting [48.32, 33] starts with the whole set as the first 
element of the sequence, and then removes the nondom- 
inated solutions from the previous subset to construct 
the next subset in the sequence. The following Alg. 48.3 
provides a template for a hierarchical selection scheme 
involving nondominated sorting. Many variants of the 
above basic scheme could be thought of such as other 
subset-constructions like dominance ranking [48.34]. 


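Algorithm 48.3 Hierarchical Selection
1: procedure select($P \cup O$, $\mu$)
2:   $S \leftarrow P \cup O$
3:   $S' \leftarrow \emptyset$
4:   $S'' \leftarrow \emptyset$
5:   repeat
6:     $S' \leftarrow S' \cup S''$
7:     $S'' \leftarrow \{a \in S \mid \nexists\, b \in S : b \prec a\}$
8:     $S \leftarrow S \setminus S''$
9:   until $|S' \cup S''| \ge \mu$
10:  while $|S' \cup S''| > \mu$ do
11:    for all $a \in S''$ do
12:      $\delta_a \leftarrow \mathrm{loss}(S'', a)$
13:    choose $p \in S''$ with $\delta_p = \min_{a \in S''} \delta_a$
14:    $S'' \leftarrow S'' \setminus \{p\}$
15:  return $S' \cup S''$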


The set $S$ in Alg. 48.3 before the execution of the
iteration in lines 5–9 contains the last subset in the
sequence of dominating subsets. When leaving the iteration
with line 9, we have $P \cup O = S \cup S' \cup S''$, where $S''$
is the last dominating set that has been peeled off. The
iteration in lines 10–14 removes solutions from $S''$ one-by-one
as in Alg. 48.2 until the set $S' \cup S''$ contains $\mu$
solutions. In [48.5], a detailed analysis of the optimization
and convergence properties of such constructions
is provided. In particular, conditions are derived under
which the corresponding preference relation is a refinement
of Pareto dominance, i.e., can safely be used in an
indicator-based selection.
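A hierarchical scheme in the spirit of Alg. 48.3 can be sketched as follows. This is an illustration under stated assumptions rather than the procedure analyzed in [48.5]: both objectives are minimized, the indicator is a simple two-dimensional hypervolume with an assumed reference point, the stopping test of the peeling loop is taken to be "at least `mu` solutions collected", and an extra guard stops the loop when the pool runs empty. All names are hypothetical.

```python
def dominates(a, b):
    """True if a Pareto-dominates b (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by `ref` (two minimized objectives)."""
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    area, best_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f2 < best_f2:
            area += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area


def hierarchical_select(candidates, mu, indicator):
    s = list(candidates)   # remaining pool S
    kept, front = [], []   # S' (fronts already accepted) and S'' (last front peeled off)
    while True:
        kept = kept + front
        front = [a for a in s if not any(dominates(b, a) for b in s)]
        s = [a for a in s if a not in front]
        if len(kept) + len(front) >= mu or not s:  # `not s` is an added safety guard
            break
    while len(kept) + len(front) > mu:
        total = indicator(front)
        losses = [total - indicator(front[:i] + front[i + 1:]) for i in range(len(front))]
        del front[min(range(len(front)), key=losses.__getitem__)]
    return kept + front


ref = (5.0, 5.0)  # assumed reference point
pool = [(1.0, 4.0), (2.0, 2.0), (3.0, 1.0), (2.0, 4.0), (4.0, 4.0)]
print(hierarchical_select(pool, 4, lambda s: hypervolume_2d(s, ref)))
print(hierarchical_select(pool, 2, lambda s: hypervolume_2d(s, ref)))
```

With `mu = 4` the first front is accepted as a whole and one solution of the second front fills the remaining slot; with `mu = 2` the first front alone is too large and is truncated greedily.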


48.3.5 Using Binary Indicators 


The previous discussion concentrated on the use of 
unary indicators in multiobjective evolutionary algo- 
rithms. On the other hand, one of the first indicator- 
based selection schemes used for optimization was 
based on the concept of binary indicators following 
Def. 48.5 [48.11]. In particular, the use of a binary vari- 
ant of the hypervolume indicator, (48.2), as well as the 
use of the binary additive epsilon indicator, (48.3), have 
been described. 

The structure of IBEA (indicator-based evolution- 
ary algorithm) follows directly the greedy selection 
scheme as described in Alg. 48.2. In the basic scheme 
of IBEA, i.e., without parameter adaptation, the loss- 
function is computed as follows, 


$$\mathrm{loss}(S, a) = - \sum_{b \in S \setminus \{a\}} e^{I(\{b\},\{a\})/\kappa} \qquad (48.5)$$


Let us interpret the above loss function by means of
the binary additive epsilon indicator, i.e., $I(A,B) := I_{\epsilon+}(A,B)$.
The solution with the smallest loss function
will be selected for removal from the set.
This actually is the solution with the largest sum
$\sum_{b \in S \setminus \{a\}} e^{I(\{b\},\{a\})/\kappa}$. If considering a large scaling
factor $\kappa$, the sum of exponentials actually acts similar
to sorting the indicators and just considers the largest
one. If $\kappa$ is smaller, then not only the largest indicator
is taken into account but also smaller ones. As a result,
the sum is dominated by the solution $b$ which leads to
the largest value of $I(\{b\},\{a\})$.
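The removal rule can be made concrete with a small sketch. It assumes the sign convention $I(\{b\},\{a\}) = \min_i (f_i(a) - f_i(b))$ for minimized objectives and a loss of the form $-\sum_{b \ne a} e^{I(\{b\},\{a\})/\kappa}$; the value of `kappa`, the function names, and the sample points are illustrative assumptions, and no parameter adaptation is performed.

```python
import math


def eps_indicator(b, a):
    """I({b},{a}) = min_i (f_i(a) - f_i(b)) for minimized objectives (assumed convention)."""
    return min(fa - fb for fa, fb in zip(a, b))


def binary_loss(s, i, kappa):
    """Loss of removing s[i]: minus the sum of exponentials over all other members."""
    a = s[i]
    return -sum(math.exp(eps_indicator(b, a) / kappa)
                for j, b in enumerate(s) if j != i)


def greedy_select_binary(candidates, mu, kappa=0.05):
    """Greedy removal as in Alg. 48.2, driven by the binary-indicator loss."""
    s = list(candidates)
    while len(s) > mu:
        p = min(range(len(s)), key=lambda i: binary_loss(s, i, kappa))
        del s[p]
    return s


pool = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (2.5, 2.5)]
print(greedy_select_binary(pool, 3))
```

Here (2.5, 2.5) is dominated by (2, 2) with a margin of 0.5 in every objective, so its sum of exponentials is by far the largest and it is removed first. Objective values are assumed to be scaled so that the exponentials do not overflow.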

Remember that $I(\{b\},\{a\}) = \min_{1 \le i \le n} (f_i(a) - f_i(b))$
denotes the maximal amount one can add to
every objective value $f_i(b)$ of solution $b$ such that it
still weakly dominates $a$. As a result, for large values
of $\kappa$ we can summarize the selection as follows: For
each solution a, we determine the solution b to which 
we can add the largest amount such that it still weakly 
dominates a. The solution a is removed for which 
this amount is largest. In some sense, the first step 
determines the closest solution (or strongest dominator) 
to a in the epsilon indicator and the second step then 
removes the solution which has the closest neighbor (or 
strongest dominator). If $\kappa$ is smaller, also the closeness
(or domination) of other solutions $b$ is taken into
account.

Recently, an indicator-based algorithm has been
developed [48.25], which uses a similar principle. It
uses the binary epsilon indicator as defined in (48.3)
and (48.4), but instead of comparing successive populations
as in IBEA, it uses a possibly growing archive of
the best Pareto-approximations found so far as the reference
set. The solution to be removed according to the
template in Alg. 48.2 is determined by sorting.


48.4 Preference-Based Selection


Recently, there has been increasing interest in constructing
evolutionary optimization methods that allow
search preferences of the user to be considered. In other
words, the resulting set of solutions should not contain
an arbitrary subset of Pareto-optimal solutions but one
that satisfies secondary criteria. For example, it may
be desirable to preferably determine solutions that are
close to some reference point, that are along a direction
in the objective space, or have some other predefined
distribution.

The choice of the right subset of solutions as a result
of an evolutionary multiobjective optimization has been
of major concern since the beginning. In particular, it
was a major goal to design algorithms that lead to well-distributed
solutions that are close to the Pareto-optimal
front. Various heuristics have been implemented
in standard algorithms like SPEA2 [48.2] and NSGA-II [48.1]
to achieve such an implicitly defined objective.

The concept of indicator-based selection changed
this approach fundamentally. It not only allows us to formalize
the objective of population-based multiobjective
optimization in general but also to design algorithms
that optimize a set of solutions toward it. As a result of
this achievement, the focus of research moved toward
the following questions:


• What kind of user preference is useful in the context
of preference-based search?

• How can these preferences be mathematically formulated
and incorporated in set indicators?

• How can a preference-based algorithm be integrated
into an interactive optimization approach that involves
the decision maker?


Including preference information in multiobjective
evolutionary methods has been investigated since the
beginning; see [48.37] for a survey. In a very early attempt,
Fonseca and Fleming [48.34] suggested assigning ranks
to the members of a population. Much later it was proposed
in [48.38] to include preferences through the use
of reference points, guided dominance schemes, and
a biased crowding scheme. Preference-based multiobjective
evolutionary methods can be used within a hybrid
approach that combines ideas from both evolutionary
and interactive multiobjective optimization [48.24]:
In an iterative approach, several consecutive runs of the
evolutionary algorithm are performed. The user is asked
to give preference information in terms of a reference
point consisting of desirable aspiration levels for
objective functions. This information is used in a prefer- 
ence-based evolutionary algorithm that generates a new 
population by combining the fitness function and a so- 
called achievement scalarizing function containing the 
reference point. 

In the meantime, many other possibilities to for- 
malize user preferences have been investigated [48.35, 
39], for example weight functions in the objective space 
which change the desired (nonuniform) density of so- 
lutions, stressing objectives, guiding the search toward 
preference points, transforming objective functions, 
weighted Tchebycheff approaches using ideal points, 
epsilon-constraint methods, and desirability functions, 
just to name a few. 

In the following, two examples for considering 
preference information in selection schemes will be 
described in some more detail. In [48.24], the greedy 
selection scheme according to Alg. 48.2 with a binary 
indicator according to (48.5) has been used. In particu- 
lar, a normalization according to 


$$I^{p}(\{b\},\{a\}) = I(\{b\},\{a\}) / s^{*}(g, f(a))$$

is proposed. The normalization function $s(g, f(x))$ is
closely related to the concept of achievement scalarizing
functions, first proposed by Wierzbicki [48.40],

$$s(g, f(a)) = \max_{1 \le i \le n} (f_i(a) - g_i),$$

where $g$ denotes the reference point whose components
represent the desired values of the objective functions.
The function $s^{*}$ is obtained from $s$ by normalization
such that only positive values are obtained, i.e.,
$s^{*}(g, f(a)) > 0$ [48.24].

Fig. 48.1a,b Pareto front approximations (dots) found by HypE (after [48.31]) using different
weight distribution functions, shown as contour lines at intervals of 10% of the maximum weight value:
(a) stressing an extreme; (b) emphasizing a preference point. For both rows
one parameter of the sampled distribution was modified, i.e., on top the rate parameter of the exponential distribution,
on the bottom the spread of a multivariate normal distribution (after [48.35]). The test problem is ZDT1 (after [48.36]),
where the Pareto front is shown as a solid line. The graphics appeared in [48.35]
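As a small illustration of the scalarizing function $s(g, f(a)) = \max_i (f_i(a) - g_i)$, the sketch below evaluates it for a handful of objective vectors and a reference point. The data and the function name are made up for this example, objectives are assumed to be minimized, and the normalization that turns $s$ into $s^{*}$ is deliberately left out because its exact form is not given here.

```python
def achievement(g, fa):
    """s(g, f(a)) = max_i (f_i(a) - g_i): worst-case excess over the aspiration levels."""
    return max(f - gi for f, gi in zip(fa, g))


g = (2.0, 2.0)  # hypothetical reference point (aspiration levels)
points = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (2.5, 2.5)]
for p in sorted(points, key=lambda p: achievement(g, p)):
    print(p, achievement(g, p))
```

Solutions with a small value of $s$ lie close to the reference point in the worst objective, so dividing an indicator value by a positive normalization of $s$ shifts the selection pressure toward that region.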

A second approach uses the concept of
a weighted hypervolume indicator as proposed
in [48.41]. In extension to (48.2), we now determine
the weighted volume that is covered by all points
$z \in \mathbb{R}^n$ that are enclosed between the image of the
solutions in objective space $f(A)$ and the reference
set $R$, where enclosed is interpreted in terms of weak
Pareto dominance:


Definition 48.6 

Given a set of solutions $A \subseteq X$, a set of reference
points $R \subset \mathbb{R}^n$, and a positive weight function
$w: \mathbb{R}^n \to \mathbb{R}_{>0}$, the weighted hypervolume indicator
$I_H^w(A, R)$ of $A$ with respect to $R$ is defined as

$$I_H^w(A, R) = \int_{z \in H(A,R)} w(z)\, dz, \qquad (48.6)$$

where $H(A, R)$ denotes the objective space dominated
by $A$ and dominating $R$,

$$H(A, R) = \{ z \in \mathbb{R}^n \mid \exists a \in A : \exists r \in R : f(a) \le z \le r \}.$$

The weight function is supposed to be integrable on any
bounded set, i.e., $\int_{B(0,\gamma)} w(z)\, dz < \infty$ for any $\gamma > 0$,
where $B(0, \gamma)$ is the open ball centered in 0 and of radius $\gamma$.


In a similar way to (48.2), the weighted hypervol- 
ume indicator is compliant to Pareto dominance and 
can safely be used in the previously described indica- 
tor-based selection schemes. 
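The weighted hypervolume of Definition 48.6 can be approximated numerically. The sketch below is a deliberately simple stand-in: it assumes two minimized objectives, a single reference point, and a deterministic midpoint rule on a regular grid instead of any particular sampling scheme. The function and parameter names are hypothetical, and the set of objective vectors is assumed to be non-empty.

```python
import math


def weighted_hypervolume_2d(points, ref, weight, n=300):
    """Midpoint-rule estimate of the integral of `weight` over H(A, {ref})."""
    lo1 = min(p[0] for p in points)
    lo2 = min(p[1] for p in points)
    h1 = (ref[0] - lo1) / n
    h2 = (ref[1] - lo2) / n
    total = 0.0
    for i in range(n):
        z1 = lo1 + (i + 0.5) * h1
        for j in range(n):
            z2 = lo2 + (j + 0.5) * h2
            # z belongs to H(A, R) if some objective vector weakly dominates it;
            # z <= ref holds by construction of the grid.
            if any(p[0] <= z1 and p[1] <= z2 for p in points):
                total += weight(z1, z2) * h1 * h2
    return total


front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
ref = (4.0, 4.0)
print(weighted_hypervolume_2d(front, ref, lambda z1, z2: 1.0))            # plain hypervolume
print(weighted_hypervolume_2d(front, ref, lambda z1, z2: math.exp(-z1)))  # stresses small f1
```

With a constant weight of one the estimate reduces to the ordinary hypervolume; a weight that decays in the first objective gives more credit to solutions near that extreme, which is the kind of bias illustrated in Fig. 48.1a.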

In later work [48.35,39], the approach [48.41] 
has been extended toward more general weight functions,
their relation to typical user preference specifications,
and higher dimensions. Moreover, it is
well known that the exact computation of the hypervolume
is expensive in the number of objectives,
i.e., it is exponential unless P = NP. To this
end, efficient sampling methods [48.31] have been
combined with the general concept of weight functions.
Figure 48.1 shows some examples of the effect
of weighting the hypervolume indicator, taken
from [48.35].


48.5 Concluding Remarks


There has been a major shift in our understanding of
population-based multiobjective optimization. In a certain
sense, the focus of classical algorithms such as
NSGA-II or SPEA2 was the Pareto-dominance relation
between individual solutions. Properties such as
a large diversity of solutions in the final population were
achieved through (clever) heuristics and tuning of the
selection mechanisms.

The role of set indicators in multiobjective optimization
was first limited to performance assessment.
The possibility to assign a single measure to a set
was used in elaborate methods that allow us to compare
the results of optimization runs and to statistically
verify whether one algorithm is preferable to another.
In this context, indicators have been compared in
terms of their suitability for performance assessment,
e.g., whether they comply with the underlying preference
relation between solutions.

Recently, there has been a growing number of very competitive
search algorithms that are based on an explicit
formulation of the optimization goal as a set property,
i.e., they build on the concept of set indicators.
In simplified terms, they can be regarded as optimization
methods that deal with sets of solutions as their
optimization object. In contrast, single-objective optimization
traditionally works with single solutions.
One can simply draw the correspondence between traditional
single-objective optimization and population-based
multiobjective optimization as follows: single
solution versus set of solutions, and single objective
function versus single set indicator.

This major breakthrough leads to several advantages
in terms of analysis and algorithm design:


• Algorithms are conceptually simpler as they are
based on a single indicator and do not rely on
heuristics to a large extent. As a result, it can be expected
that they are more robust and less parameter
tuning is necessary.

• Certain convergence properties can be derived. As
a result, the new class of algorithms does not show
deterioration and/or cyclic behavior. It also appears
in some of the experiments that have been conducted
that they are more robust toward increasing
the number of objectives.

• The optimization criterion is made explicit, i.e., the
discussion about convergence versus diversity in the 
research community can now be based on quantita- 
tive measures. 

• By changing the set indicator, it is possible to explicitly
consider preferences of a user. It will be
seen whether this possibility will lead to interactive 
methods that involve the decision maker in the opti- 
mization process. 


The purpose of the chapter was to introduce the con- 
cept of set indicators and to discuss several ways to 
use them as essential parts of a set-based multiobjective 
optimization algorithm. Many other important aspects 
have not been discussed in detail. In particular, due to
its superior properties in terms of

a) compliance with Pareto dominance,
b) sensitivity to changes of single solutions, and
c) simple interpretation,

the hypervolume indicator has been very popular
as a component of set-based optimization methods. Unfortunately,
it has some detrimental properties, such
as its computational complexity with respect to a growing
number of objectives. In addition, the complexity of determining
a subset of solutions that has the least influence
on the hypervolume increases exponentially with the
size of the subset.

Finally, it has been shown that removing solutions 
one-by-one may lead to a major loss in optimiza- 
tion quality. Therefore, its use in indicator-based se- 
lection needs to be done with care. As described in 
this chapter, recent methods overcome some of these 
difficulties by using advanced methods such as sam- 
pling. A more detailed investigation and review of the 
hypervolume indicator can be found e.g., in [48.31, 


49. Multi-Objective Evolutionary Algorithms


Kalyanmoy Deb 


Evolutionary algorithms (EAs) have amply shown 
their promise in solving various search and opti- 
mization problems for the past three decades. One 
of the hallmarks and niches of EAs is their ability 
to handle multi-objective optimization problems 
in their totality, which their classical counterparts 
lack. Suggested in the beginning of the 1990s, 
evolutionary multi-objective optimization (EMO) 
algorithms are now routinely used in solving 
problems with multiple conflicting objectives in 
various branches of engineering, science, and 
commerce. In this chapter, we provide an overview 
of EMO methodologies by first presenting princi- 
ples of EMO through an illustration of one specific 
algorithm and its application to an interesting 
real-world bi-objective optimization problem. 
Thereafter, we provide a list of recent research 
and application developments of EMO to provide 
a picture of some salient advancements in EMO re- 
search. The development and application of EMO to 
multi-objective optimization problems and their 
continued extensions to solve other related prob- 
lems has elevated EMO research to a level which 
may now undoubtedly be termed as an active field 
of research with a wide range of theoretical and 
practical research and application opportunities. 

49.1 Preamble


Search and optimization problems, particularly involv- 
ing nonlinear, non-convex and non-differentiable objec- 
tive and constraint functions, provide a stiff challenge 
even today. No known mathematical algorithm exists 
to solve such problems to optimality. In such cases, 
the use of meta-heuristic optimization methods such 
as evolutionary algorithms [49.1–3], simulated annealing
[49.4], tabu search [49.5,6], and other methods
motivated by other natural or physical phenomena
has become popular.

EAs were traditionally used for solving problems 
having a single goal or objective. However, most real- 
world problems have multiple conflicting goals and 
theoretically they give rise to a set of trade-off solutions.
Classical approaches to solving multi-objective
optimization problems have been mostly indirect, mainly
because no search and optimization
method existed that could find multiple optimal
solutions in a single simulation. While the scientific
community was waiting for a suitable algorithm for 
handling such problems, evolutionary algorithms with 
their population approach caught the eyes of a number 
of researchers. This spurred the development of a series 
of first generation evolutionary multi-objective opti- 
mization (EMO) algorithms around 1993-1995. A set 
of three different algorithms (but all motivated by 
a single idea portrayed by legendary EA researcher, 
Prof. Goldberg [49.1]) showed the world that EMOs 
are viable candidates for multi-objective optimization, 
and that there are meta-heuristic-based approaches for 
finding multiple trade-off solutions in a single simula- 
tion. EMO researchers, and in that spirit the whole EA 
research community, realized the niche of EAs in such 
problem-solving tasks and promoted the developmen- 
tal and application studies using EMO. Subsequently, 
EMO methodologies were made to be better, faster, and 
more accessible. The algorithms were commercialized 
by various software companies and they made the field 


of EMO more popular and applicable to many different 
problems, which academic researchers alone could not 
have done. 

In this chapter, we provide a brief overview of the 
EMO principle, present one EMO algorithm in de- 
tail, and emphasize the importance of using EMO in 
practice. Besides this specific algorithm, there exist 
a number of other equally efficient EMO algorithms, 
which we do not describe here for brevity. Instead, 
in this chapter, we discuss a number of recent ad- 
vancements of EMO research and applications that 
are driving researchers and practitioners ahead. Fortu- 
nately, researchers have utilized the EMO principle of 
solving multi-objective optimization problems in han- 
dling various other problem-solving tasks. The diversity 
of EMO’s research is bringing together researchers 
and practitioners with different backgrounds, includ- 
ing computer scientists, mathematicians, economists, 
and engineers. The topics that we discuss here am- 
ply demonstrate why and how EMO researchers from 
different backgrounds must and should collaborate on 
complex problem-solving tasks, which have become the 
need of the hour in most branches of science, engineer- 
ing, and commerce today. 


49.2 Evolutionary Multi-Objective Optimization (EMO) 


Before we discuss an evolutionary algorithm for multi- 
objective optimization, we present a generic problem 
that involves multiple conflicting objectives. A multi- 
objective optimization problem involves a number of 
objective functions that are to be either minimized or 
maximized, subject to a number of constraints and vari- 
able bounds 


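$$
\begin{aligned}
\text{Minimize/Maximize} \quad & f_m(x), && m = 1, 2, \ldots, M; \\
\text{subject to} \quad & g_j(x) \ge 0, && j = 1, 2, \ldots, J; \\
& h_k(x) = 0, && k = 1, 2, \ldots, K; \\
& x_i^{(L)} \le x_i \le x_i^{(U)}, && i = 1, 2, \ldots, n.
\end{aligned}
\qquad (49.1)
$$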


A solution $x \in \mathbb{R}^n$ is a vector of $n$ decision variables:
$x = (x_1, x_2, \ldots, x_n)^T$. The solutions satisfying
the constraints and variable bounds constitute a feasible
set $S$ in the decision variable space $\mathbb{R}^n$. One
of the striking differences between single-objective
and multi-objective optimization is that in multi-objective
optimization the objective function vectors
belong to a multi-dimensional objective space
$\mathbb{R}^M$. The objective function vectors constitute a feasible
set $Z$ in the objective space. For each solution
$x$ in $S$, there exists a point $z \in Z$, denoted
by $f(x) = z = (z_1, z_2, \ldots, z_M)^T$. To make the descriptions
clear, we refer to a decision variable vector as
a solution and the corresponding objective vector as 
a point. 

The optimal solutions in multi-objective optimiza- 
tion can be defined from the mathematical concept 
of partial ordering [49.7]. In the parlance of multi- 
objective optimization, the term domination is used 
for this purpose. In this section, we restrict ourselves 
to discussing unconstrained (without any equality, in- 
equality or bound constraints) optimization problems. 
The domination between two solutions is defined as fol- 
lows [49.8, 9]: 

Definition 49.1
A solution $x^{(1)}$ is said to dominate another solution $x^{(2)}$
if both of the following conditions are true.


1. The solution $x^{(1)}$ is no worse than $x^{(2)}$ in all objectives.
Thus, the solutions are compared based on
their objective function values (or location of the
corresponding points ($z^{(1)}$ and $z^{(2)}$) in the objective
function set $Z$).

2. The solution $x^{(1)}$ is strictly better than $x^{(2)}$ in at least
one objective.


For a given set of solutions (or corresponding 
points in the objective function set Z, for example, 
those shown in Fig. 49.1a), a pair-wise comparison
can be made using the above definition and whether 
one point dominates another point can be established. 
All points that are not dominated by any other mem- 
ber of the set are called non-dominated points of class 
one, or simply non-dominated points. For the set of six 
points shown in the figure, they are points 3, 5, and 
6. One property of any two such points is that a gain 
in an objective from one point to the other happens 
only due to a sacrifice in at least one other objec- 
tive. This trade-off property between non-dominated 
points makes practitioners interested in finding a wide 
variety of them before making a final choice. These 
points make up a front when viewed together on 
the objective space; hence non-dominated points are 
often visualized to represent a non-dominated front. 
The theoretical computational effort needed to se- 
lect the points of the non-dominated front from a set 
of $N$ points is $O(N \log N)$ for 2 and 3 objectives,
and $O(N \log^{M-2} N)$ for $M > 3$ objectives [49.10], but
for a moderate number of objectives, the procedure 
need not be particularly computationally effective in 
practice. 
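The pair-wise comparison described above translates directly into code. The following sketch assumes that all objectives are minimized and uses a naive quadratic-time filter rather than the asymptotically faster procedure cited in the text. The sample points are invented for illustration and are not the six points of the figure.

```python
def dominates(z1, z2):
    """Definition 49.1 for minimized objectives: no worse everywhere, better somewhere."""
    no_worse = all(a <= b for a, b in zip(z1, z2))
    strictly_better = any(a < b for a, b in zip(z1, z2))
    return no_worse and strictly_better


def nondominated(points):
    """Points not dominated by any other member of the set (naive O(N^2 M) filter)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


sample = [(1.0, 5.0), (2.0, 3.0), (4.0, 4.0), (3.0, 2.0), (5.0, 1.0), (4.5, 2.5)]
print(nondominated(sample))
```

Moving between any two of the returned points improves one objective only at the expense of the other, which is the trade-off property mentioned above.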




With the above concept, now it is easier to define the 
Pareto-optimal solutions in a multi-objective optimiza- 
tion problem. If the given set of points for the above task 
contain all feasible points in the objective space, the 
points lying on the first non-domination front, by defini- 
tion, do not become dominated by any other point in the 
objective space; hence they are Pareto-optimal points 
(together they constitute the Pareto-optimal front), and 
the corresponding pre-images (decision variable vec- 
tors) are called Pareto-optimal solutions. However, 
more mathematically elegant definitions of Pareto- 
optimality (including the ones for continuous search 
space problems) exist in the multi-objective optimiza- 
tion literature [49.9, 11]. Some convergence analyses of 
EMO under certain assumptions can also be found elsewhere
[49.12–15].


49.2.1 EMO Principles 


In the context of multi-objective optimization, the ex- 
tremist principle of finding the optimum solution cannot 
be applied to one objective alone, when the rest of the 
objectives are also important. This clearly suggests two 
ideal goals of multi-objective optimization. 


• Convergence: find a (finite) set of solutions which
lies on the Pareto-optimal front.

• Diversity: find a set of solutions which is diverse
enough to represent the entire range of the Pareto-optimal
front.


Fig. 49.2 Schematic of a two-step multi-criteria optimization and decision-making procedure

EMO algorithms attempt to follow both the above
principles, similar to the a posteriori multiple criteria
decision-making (MCDM) method. Figure 49.2
schematically shows the principles followed in an EMO
procedure. Since EMO procedures are heuristic based,
they may not guarantee finding exact Pareto-optimal
points, as a theoretically provable optimization method
would do for tractable (for example, linear or convex)
problems. However, EMO procedures have
essential operators to constantly improve the evolving
non-dominated points (from the point of view of convergence
and diversity mentioned above) similar to
how most natural and artificial evolving systems continuously
improve their solutions. To this effect, a recent
study [49.16] demonstrated that a particular EMO
procedure, starting from random non-optimal solutions,
can progress towards theoretical Karush–Kuhn–Tucker
(KKT) points with iterations in real-valued
multi-objective optimization problems. The main dif- 
ference and advantage of using EMO compared to 
a posteriori MCDM procedures is that multiple trade- 
off solutions can be found in a single run of an 
EMO algorithm, whereas most a posteriori MCDM 
methodologies would require multiple independent 
runs. 

In Step 1 of the EMO-based multi-objective op- 
timization and decision-making procedure (the task 
shown vertically downwards in Fig. 49.2), multiple 
trade-off, non-dominated points are found. Thereafter, 
in Step 2 (the task shown horizontally, towards the 
right), higher-level information is used to choose one 
of the trade-off points obtained. 


49.2.2 A Posteriori MCDM Methods and EMO 


In the a posteriori MCDM approaches (also known 
as generating MCDM methods), the task of finding 
multiple Pareto-optimal solutions is achieved by executing
many independent single-objective optimizations,
each time finding a single Pareto-optimal solution
[49.9]. A parametric scalarizing approach (such
as the weighted-sum approach, $\epsilon$-constraint approach,
and others) can be used to convert multiple objectives
into a parametric single-objective function.
By simply varying the parameters (the weight vector
or the $\epsilon$-vector) and optimizing the scalarized function,
different Pareto-optimal solutions can be found. In contrast,
an EMO attempts to find multiple Pareto-optimal solutions
in a single run of the algorithm by
emphasizing multiple non-dominated and isolated solutions
in each iteration of the algorithm and without the
use of any scalarization of objectives.
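The generating approach can be mimicked on a toy problem. In the sketch below a finite list of objective vectors stands in for the search space and an exhaustive `min` stands in for the single-objective optimizer; both are simplifying assumptions, as are the weight vectors and the function name. Each weight vector triggers one independent run of a weighted-sum scalarization with both objectives minimized.

```python
def weighted_sum_front(candidates, weight_vectors):
    """One independent single-objective minimization per weight vector."""
    found = []
    for w in weight_vectors:
        # Scalarize the objectives and solve the resulting single-objective problem.
        best = min(candidates, key=lambda z: sum(wi * zi for wi, zi in zip(w, z)))
        if best not in found:
            found.append(best)
    return found


candidates = [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0), (3.0, 3.0)]
weights = [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9)]
print(weighted_sum_front(candidates, weights))
```

Every run starts from scratch and returns a single trade-off point, so the number of runs grows with the number of points wanted; this is the memory-less behavior that a population-based EMO run avoids.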

Fig. 49.3 A posteriori MCDM methodology employing
independent single-objective optimizations

Consider Fig. 49.3, in which we sketch how multiple
independent parametric single-objective optimizations
(through a posteriori MCDM method) may find
different Pareto-optimal solutions. It is worth highlighting
here that the Pareto-optimal front corresponds
to the global optimal solutions of several problems,
each formed with a different scalarization of objectives.
During the course of an optimization task, algorithms
must overcome a number of difficulties, such
as infeasible regions, local optimal solutions, flat or 
non-improving regions of objective landscapes, isola- 
tion of optimum, etc., to finally converge to the global 
optimal solution. Moreover, due to practical limita- 
tions, an optimization task must also be completed in 
a reasonable computational time. All these difficulties
in a problem require that an optimization algorithm
strikes a good balance between exploring new
search directions and exploiting the extent of search 
in currently-best search direction. When multiple runs 
of an algorithm need to be performed independently 
to find a set of Pareto-optimal solutions, the above 
balancing act must be performed in every single run. 
Since runs are performed independently from one an- 
other, no information about the success or failure of 
previous runs is utilized to speed up the overall pro- 
cess. In difficult multi-objective optimization problems, 
such memory-less, a posteriori methods may demand 
a large overall computational overhead to find a set 
of Pareto-optimal solutions [49.17]. Moreover, despite 
the issue of global convergence, independent runs may 
not guarantee achieving a good distribution among 
obtained points by an easy variation of scalarization 
parameters. 

EMO, as was mentioned earlier, constitutes an
inherently parallel search. When a particular population
member overcomes certain difficulties and makes
progress towards the Pareto-optimal front, its variable
values and their combination must reflect this fact.
When a recombination takes place between this solution
and another population member, such valuable
information of variable value combinations is shared
through variable exchanges and blending, thereby making
the overall task of finding multiple trade-off solutions
a task that is processed in parallel.


49.3 A Brief Timeline for the Development of EMO Methodologies 


During the seventies and eighties, EA researchers re- 
alized the need for solving multi-objective optimiza- 
tion problems in practice and mainly resorted to using 
weighted-sum approaches to convert multiple objec- 
tives into a single goal [49.18, 19]. 

However, the first implementation of a real multi- 
objective evolutionary algorithm (vector-evaluated GA 
(genetic algorithm) or VEGA) was suggested by Schaf- 
fer in 1984 [49.20]. Schaffer modified the simple 
three-operator genetic algorithm [49.2] (with selection, 
crossover, and mutation) by performing independent 
selection cycles according to each objective. The selec- 
tion method is repeated for each individual objective 
to fill up a portion of the mating pool. Then the en- 
tire population is thoroughly shuffled to apply crossover 
and mutation operators. This is performed to achieve 
the mating of individuals of different subpopulation 
groups. The algorithm worked efficiently for some generations
but in some cases suffered from its bias towards
some individuals or regions (mostly individual objective
champions). This does not fulfill the second goal of
EMO, discussed earlier.
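
The objective-wise selection described above can be sketched as follows. This is a loose illustration rather than Schaffer's implementation: a binary tournament is swapped in for the selection operator, all objectives are assumed to be minimized, and the function name vega_mating_pool is hypothetical.

```python
import random


def vega_mating_pool(population, objectives, rng):
    """Fill the mating pool in equal portions, one portion per objective."""
    n, m = len(population), len(objectives)
    share = n // m
    pool = []
    for f in objectives:
        for _ in range(share):
            a, b = rng.choice(population), rng.choice(population)
            # Binary tournament judged on this single objective only.
            pool.append(a if f(a) <= f(b) else b)
    # Top up with random members if n is not divisible by m.
    while len(pool) < n:
        pool.append(rng.choice(population))
    # Shuffle so that crossover mates members of different portions.
    rng.shuffle(pool)
    return pool


rng = random.Random(0)
population = [k / 10 for k in range(11)]
pool = vega_mating_pool(
    population, [lambda x: x * x, lambda x: (x - 1.0) ** 2], rng
)
```

Because every tournament looks at one objective only, compromise solutions that are mediocre in each objective taken alone rarely win, which is the bias towards objective champions noted in the text.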

Ironically, no significant study was performed for
almost a decade after the pioneering work of Schaffer,
until a revolutionary 10-line sketch of a new
non-dominated sorting procedure was suggested by Goldberg
in his seminal book on GAs [49.1]. Since an
EA needs a fitness function for reproduction, the trick
was to find a single metric from a number of objective
functions. Goldberg's suggestion was to use
the concept of domination to assign more copies to
non-dominated individuals in a population. Since diversity
is the other concern, he also suggested the
use of a niching strategy [49.21] among solutions of
a non-dominated class. Taking this cue, at least three
independent groups of researchers developed different
versions of multi-objective evolutionary algorithms
during 1993-1994 [49.22–24]. These algorithms differ
in the way a fitness assignment scheme is introduced
to each individual. Independently, Poloni [49.25] suggested
a domination-based EMO approach (he called it
multi-objective genetic algorithm (MOGA)) in which,
instead of niching, a toroidal grid-based local selection
method was used to find multiple trade-off solutions.
These early EMO methodologies gave a good head start
to the research and application of EMO, but
suffered from the fact that they did not use an
elite-preservation mechanism in their procedures. Inclusion
of elitism in an EMO provides a monotonically
non-degrading performance [49.26]. The second-generation
EMO algorithms implemented an elite-preserving operator
in different ways and gave birth to elitist EMO
procedures, such as the non-dominated sorting GA (NSGA-II)
[49.27], the strength Pareto EA (SPEA) [49.28], the
Pareto-archived ES (PAES) [49.29], and others. Since these
EMO algorithms are state-of-the-art and commonly
used procedures, we describe one of these algorithms
in detail.


49.4 Elitist EMO: NSGA-II


The NSGA-II procedure [49.27] is one of the popularly
used EMO procedures which attempt to find multiple
Pareto-optimal solutions in a multi-objective optimization
problem and has the following three features:


1. It uses an elitist principle.
2. It uses an explicit diversity preserving mechanism.
3. It emphasizes non-dominated solutions.


At any generation $t$, the offspring population (say,
$Q_t$) is first created by using the parent population (say,
$P_t$) and the usual genetic operators. Thereafter, the two
populations are combined to form a new population
(say, $R_t$) of size $2N$. Then, the population $R_t$ is classified
into different non-dominated classes. Thereafter,
the new population is filled by points of different
non-dominated fronts, one at a time. The filling starts with
the first non-dominated front (of class 1) and continues
with points of the second non-dominated front, and so
on. Since the overall population size of $R_t$ is $2N$, not
all fronts can be accommodated in the $N$ slots available
for the new population. All fronts that could not be
accommodated are deleted. When the last allowed front
is being considered, there may exist more points in the
front than the slots remaining in the new population.
This scenario is illustrated in Fig. 49.4. Instead of
arbitrarily discarding some members from the last front, the
points that will make the diversity of the selected points
the highest are chosen.

The crowded-sorting of the points of the last front
which could not be accommodated fully is achieved in
the descending order of their crowding distance values,
and points from the top of the ordered list are chosen.
The crowding distance $d_i$ of point $i$ is a measure of the
objective space around $i$ which is not occupied by any
other solution in the population. Here, we simply calculate
this quantity $d_i$ by estimating the perimeter of the
cuboid (Fig. 49.5) formed by using the nearest neighbors
in the objective space as the vertices (we call this
the crowding distance).
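
The survival step described above can be sketched as follows. This is a minimal illustration, not the reference NSGA-II code: all objectives are assumed to be minimized, the fronts are found by a simple repeated-peeling loop rather than the fast sorting procedure of [49.27], the crowding distance is taken as the sum of normalized neighbor gaps per objective (an estimate of the cuboid's half-perimeter), and all function names are hypothetical.

```python
def dominates(a, b):
    """True if objective vector a dominates b (minimization)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def non_dominated_fronts(objs):
    """Classify indices into fronts by peeling off the non-dominated set."""
    remaining = list(range(len(objs)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(objs[j], objs[i]) for j in remaining)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts


def crowding_distance(objs, front):
    """Per objective, add the normalized gap between each point's neighbors."""
    dist = {i: 0.0 for i in front}
    for m in range(len(objs[0])):
        ordered = sorted(front, key=lambda i: objs[i][m])
        lo, hi = objs[ordered[0]][m], objs[ordered[-1]][m]
        # Extreme points are always preferred.
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")
        if hi == lo:
            continue
        for k in range(1, len(ordered) - 1):
            gap = objs[ordered[k + 1]][m] - objs[ordered[k - 1]][m]
            dist[ordered[k]] += gap / (hi - lo)
    return dist


def survive(objs, n_slots):
    """Choose n_slots members of the combined population R_t."""
    chosen = []
    for front in non_dominated_fronts(objs):
        if len(chosen) + len(front) <= n_slots:
            chosen.extend(front)          # whole front fits
        else:
            dist = crowding_distance(objs, front)
            ranked = sorted(front, key=lambda i: dist[i], reverse=True)
            chosen.extend(ranked[:n_slots - len(chosen)])
            break                         # later fronts are deleted
    return chosen


combined = [(0, 4), (1, 3), (1.1, 2.9), (4, 0), (2, 2), (3, 3)]
survivors = survive(combined, 4)
```

In this small example the first front holds five points for four slots, so the point (1.1, 2.9), which sits closest to its neighbors, is the one discarded.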


49.4.1 Sample Results 


Here, we show results from several runs of the NSGA-II
algorithm on two test problems. The first problem
(ZDT2, Zitzler-Deb-Thiele) is a two-objective, 30-variable
problem with a concave Pareto-optimal front

\[
\text{ZDT2:} \quad
\begin{cases}
\text{minimize } f_1(x) = x_1, \\
\text{minimize } f_2(x) = s(x)\left[1 - \left(f_1(x)/s(x)\right)^2\right], \\
\text{where } s(x) = 1 + \frac{9}{29} \sum_{i=2}^{30} x_i^2, \\
0 \le x_1 \le 1, \\
-1 \le x_i \le 1, \quad i = 2, 3, \ldots, 30.
\end{cases}
\tag{49.2}
\]

Fig. 49.4 Schematic of the NSGA-II procedure
Fig. 49.5 The crowding distance calculation

The second problem (KUR, Kursawe), with three variables,
has a disconnected Pareto-optimal front

\[
\text{KUR:} \quad
\begin{cases}
\text{minimize } f_1(x) = \sum_{i=1}^{2} \left[-10 \exp\left(-0.2 \sqrt{x_i^2 + x_{i+1}^2}\right)\right], \\
\text{minimize } f_2(x) = \sum_{i=1}^{3} \left[|x_i|^{0.8} + 5 \sin\left(x_i^3\right)\right], \\
-5 \le x_i \le 5, \quad i = 1, 2, 3.
\end{cases}
\tag{49.3}
\]


NSGA-II is run with a population size of 100 and for
250 generations. The variables are treated as real numbers,
and a simulated binary crossover (SBX) recombination
operator [49.30] with $p_c = 0.9$ and a distribution index
of $\eta_c = 10$, and a polynomial mutation operator [49.8]
with $p_m = 1/n$ ($n$ is the number of variables) and a
distribution index of $\eta_m = 20$, are used. Figures 49.6
and 49.7 show that NSGA-II converges to the
Pareto-optimal front and maintains a good spread of solutions
in both test problems.
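
For readers who wish to reproduce such runs, the two test problems can be written as plain functions. The sketch below assumes the usual statement of these problems, in particular s(x) = 1 + (9/29) times the sum of x_i^2 over i = 2..30 for ZDT2; the function names zdt2 and kur are hypothetical, and no bound checking is done.

```python
import math


def zdt2(x):
    """ZDT2 objectives for a 30-variable vector x (both minimized)."""
    s = 1.0 + (9.0 / 29.0) * sum(xi * xi for xi in x[1:])
    f1 = x[0]
    f2 = s * (1.0 - (f1 / s) ** 2)
    return f1, f2


def kur(x):
    """KUR objectives for a 3-variable vector x (both minimized)."""
    f1 = sum(
        -10.0 * math.exp(-0.2 * math.sqrt(x[i] ** 2 + x[i + 1] ** 2))
        for i in range(len(x) - 1)
    )
    f2 = sum(abs(xi) ** 0.8 + 5.0 * math.sin(xi ** 3) for xi in x)
    return f1, f2


on_front = zdt2([0.5] + [0.0] * 29)   # s = 1, so the point is Pareto-optimal
origin = kur([0.0, 0.0, 0.0])
```

For ZDT2, setting x_2 = ... = x_30 = 0 gives s(x) = 1, so the Pareto-optimal front in objective space is f_2 = 1 - f_1^2, which is the concave curve mentioned above.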

There also exist other competent EMOs, such as
the strength Pareto evolutionary algorithm (SPEA)
and its improved version SPEA2 [49.31], the
Pareto-archived evolution strategy (PAES) and its improved
versions, the Pareto envelope-based selection algorithm
(PESA) and PESA2 [49.32], multi-objective
messy GA (MOMGA) [49.33], multi-objective micro-GA
[49.34], neighborhood constraint GA [49.35], adaptive
range MOGA (ARMOGA) [49.36], and others.
Moreover, there exist other EA-based methodologies,
such as particle swarm-based EMO [49.37, 38],
ant-based EMO [49.39, 40], and differential evolution-based
EMO [49.41]. The simulated annealing method has also
been used to find multiple Pareto-optimal solutions for
multi-objective optimization problems [49.42], as has
the tabu search method [49.43].


49.4.2 Constraint Handling in EMO 


The constraint handling method modifies the binary 
tournament selection, where two solutions are picked 
from the population, and the better solution is chosen. 
In the presence of constraints, each solution can be ei- 
ther feasible or infeasible. Thus, there may be at most 
three situations: 


i) Both solutions are feasible. 
ii) One is feasible and the other is not.
iii) Both are infeasible. 


Fig. 49.6 NSGA-II on ZDT2

We consider each case by simply redefining
the domination principle as follows (we call it the
constrained-domination condition for any two solutions
$x^{(i)}$ and $x^{(j)}$):


Definition 49.2

A solution $x^{(i)}$ is said to constrained-dominate
a solution $x^{(j)}$ (or $x^{(i)} \preceq_c x^{(j)}$), if any of the following
conditions are true:


1. Solution $x^{(i)}$ is feasible and solution $x^{(j)}$ is not.

2. Solutions $x^{(i)}$ and $x^{(j)}$ are both infeasible, but solution
$x^{(i)}$ has a smaller constraint violation, which
can be computed by adding the normalized violations
of all constraints

\[
\mathrm{CV}(x) = \sum_{j=1}^{J} \max\left(0, -\bar{g}_j(x)\right) + \sum_{k=1}^{K} \left|\bar{h}_k(x)\right| .
\]

The normalization is achieved with the population
minimum ($\langle g_j \rangle_{\min}$) and maximum ($\langle g_j \rangle_{\max}$)
constraint violations

\[
\bar{g}_j(x) = \frac{\langle g_j(x) \rangle - \langle g_j \rangle_{\min}}{\langle g_j \rangle_{\max} - \langle g_j \rangle_{\min}} .
\]

3. Solutions $x^{(i)}$ and $x^{(j)}$ are feasible and solution $x^{(i)}$
dominates solution $x^{(j)}$ in the usual sense
(Definition 49.1).
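
The three conditions of the definition translate directly into a comparison function. The sketch below makes several assumptions that are not spelled out in the text: inequality constraints are written as g_j(x) >= 0 and equality constraints as h_k(x) = 0, all objectives are minimized, the population-based normalization is omitted for brevity, and the function names are hypothetical.

```python
def dominates(a, b):
    """Usual domination between objective vectors (minimization)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def constraint_violation(g_values, h_values=()):
    """Aggregate violation; zero means feasible (no normalization here)."""
    return (sum(max(0.0, -g) for g in g_values)
            + sum(abs(h) for h in h_values))


def constrained_dominates(f_a, cv_a, f_b, cv_b):
    """True if solution a constrained-dominates solution b."""
    feasible_a, feasible_b = cv_a == 0.0, cv_b == 0.0
    if feasible_a and not feasible_b:
        return True                      # condition 1
    if not feasible_a and not feasible_b:
        return cv_a < cv_b               # condition 2
    if feasible_a and feasible_b:
        return dominates(f_a, f_b)       # condition 3
    return False                         # a infeasible, b feasible


cv = constraint_violation([1.0, -0.5], [0.25])   # one violated g, one h
```

Replacing the plain domination check by constrained_dominates in the tournament and in the front classification is the only change needed, which is why no penalty parameter has to be chosen.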


The above change in the definition requires a minimal
change in the NSGA-II procedure described earlier.

Figure 49.8 shows the non-dominated fronts on a six-member
population due to the introduction of two
constraints (the minimization problem is described as
CONSTR elsewhere [49.8]). In the absence of the constraints,
the non-dominated fronts (shown by dashed
lines) would have been ((1,3,5), (2,6), (4)),
but in their presence, the new fronts are ((4,5),
(6), (2), (1), (3)). The first non-dominated
front consists of the best (that is, non-dominated and
feasible) points from the population and any feasible
point lies on a better non-dominated front than an
infeasible point.

Fig. 49.8 Non-constrained-domination fronts


49.5 Applications of EMO


Since the early development of EMO algorithms in
1993, they have been applied to many challenging
real-world optimization problems. Descriptions of
some of these studies can be found in books [49.8,
44-47], dedicated conference proceedings [49.48–53],
and domain-specific books, journals, and proceedings.
A repository of most research and application
papers of EMO is available [49.54]. In this section,
we describe one case study that clearly demonstrates
the EMO philosophy which we described in
Sect. 49.2.1.


49.5.1 Spacecraft Trajectory Design


Coverstone-Carroll et al. [49.55] proposed a multi-objective
optimization technique using the original non-dominated
sorting algorithm (NSGA) [49.24] to find
multiple trade-off solutions in a spacecraft trajectory
optimization problem. To evaluate a solution (trajectory),
the SEPTOP (solar electric propulsion trajectory
optimization) software [49.56] is called, and the
delivered payload mass and the total time of flight are
calculated. The multi-objective optimization problem
has eight decision variables controlling the trajectory,
three objective functions:


i) Maximize the delivered payload at destination.

ii) Maximize the negative of the time of flight.

iii) Maximize the total number of heliocentric revolutions
in the trajectory,

and three constraints limiting the SEPTOP convergence
error and the minimum and maximum bounds on
heliocentric revolutions.
On the Earth–Mars rendezvous mission, the study
found interesting trade-off solutions [49.55]. Using
a population of size 150, the NSGA was run for 30
generations. The non-dominated solutions obtained are
shown in Fig. 49.9 for two of the three objectives, and
some selected solutions are shown in Fig. 49.10. It is
clear that there exist short-time flights with smaller
delivered payloads (solution marked 44 with 1.12 years
of flight and delivering 685.28 kg load) and long-time
flights with larger delivered payloads (solution marked
36 with close to 3.5 years of flight and delivering about
900 kg load). While solution 44 can deliver a mass
of 685.28 kg and requires about 1.12 years, solution
72 can deliver almost 862 kg with a travel time of
about 3 years. In these figures, each continuous part
of a trajectory represents a thrusting arc and each
dashed part of a trajectory represents a coasting arc.
It is interesting to note that only a small improvement
in delivered mass occurs in the solutions between
73 and 72 with a sacrifice in flight time of about
1 year.

The multiplicity in trade-off solutions, as depicted
in Fig. 49.10, is what we envisage discovering
in a multi-objective optimization problem by using
an a posteriori procedure, such as a generating method
or an EMO procedure, vis-a-vis an a priori approach
in which a single scalarized problem is solved
with a single preferred parameter setting to find a single
Pareto-optimal solution. This aspect is also shown
in Fig. 49.2. Once a set of solutions with a good trade-off
among objectives is obtained, one can analyze them
to choose a particular solution. For example, in this
problem context, it makes sense not to choose a solution
between points 73 and 72 due to the poor trade-off
between the objectives in this range, a matter which
is only revealed after a representative set of trade-off
solutions is found. On the other hand, choosing a solution
between points 44 and 73 is worthwhile, but which
particular solution to choose depends on other
mission-related issues. However, by first finding a wide range
of possible solutions, thereby revealing the shape of
the front in a computationally quicker manner, EMO can
help a decision-maker in narrowing down the choices
and in allowing a better decision to be made. Without
the knowledge of such a wide variety of trade-off
solutions, proper decision-making may be a difficult
task. With the use of an a priori approach to find
a single solution using, for example, the ε-constraint
method with a particular ε-vector, the decision-maker
will always wonder what solution would have been
derived if a different ε-vector had been chosen. For
example, if $\varepsilon_1 = 2.5$ years is chosen and the mass
delivered to the target is maximized, a solution in between
points 73 and 72 will be found. As discussed earlier,
this part of the Pareto-optimal front does not provide
the best trade-offs between the objectives that this
problem can offer. A lack of knowledge of good trade-off
regions before a decision is made may lead the
decision-maker to settle for a solution which, although
optimal, may not be a good compromise solution. The
EMO approach offers a flexible and pragmatic procedure
for finding a well-diversified set of solutions
simultaneously so as to enable picking a particular region
for further analysis or a particular solution for
implementation.

Fig. 49.9 Non-dominated solutions obtained using NSGA
49.6 Recent Developments in EMO


An interesting aspect regarding research and application
of EMO is that soon after a number of efficient
EMO methodologies had been suggested and applied
in various interesting problem areas, researchers
wasted no time in looking for opportunities to make
the field broader and more useful by diversifying EMO
applications to various other problem-solving tasks. In
this section, we describe a number of such salient recent
developments of EMO.


49.6.1 Hybrid EMO Algorithms 


The search operators used in EMO are heuristic based. 
Thus, these methodologies are not guaranteed to find 
Pareto-optimal solutions with a finite number of so- 
lution evaluations in an arbitrary problem. In single- 
objective EA research, hybridization of EAs is common 
for ensuring convergence to an optimal solution; it is not 
surprising that studies on developing hybrid EMOs are 
now being pursued to ensure that true Pareto-optimal 
solutions are found by hybridizing them with mathe- 
matically convergent ideas. 


EMO methodologies provide adequate emphasis 
on currently non-dominated and isolated solutions so 
that population members progress towards the Pareto- 
optimal front iteratively. To make the overall procedure 
faster and to perform the task with a more theo- 
retical emphasis, EMO methodologies are combined 
with mathematical optimization techniques having lo- 
cal convergence properties. A simple-minded approach 
would be to start the process with an EMO and the 
solutions obtained from EMO could be improved by 
optimizing a composite objective derived from multi- 
ple objectives to ensure a good spread by using a local 
search technique [49.57]. Another approach would be 
to use a local search technique as a mutation-like op- 
erator in an EMO, so that all population members are 
at least guaranteed to be locally optimal solutions [49.57,
58]. To save computational time, instead of performing
the local search for every solution in every generation, the
local-search-based mutation can be performed only once
every few generations.
Some recent studies [49.58–60] have demonstrated the
usefulness of such hybrid EMOs for a guaranteed
convergence.
Although these studies concentrated on ensuring
convergence to the Pareto-optimal front, some emphasis
should now be placed on providing adequate diversity
among the solutions obtained, particularly when a
continuous Pareto-optimal front is represented by a finite
set of points. Some ideas, such as maximizing the
hypervolume measure [49.61] or maintaining a uniform
distance between points, have been proposed for this purpose,
but how such diversity-maintenance techniques can
be integrated with convergence-ensuring principles in
a synergistic way is an interesting and useful topic for future
research. Some relevant studies in this direction already
exist [49.59, 62–65].


49.6.2 Multi-Objectivization 


Interestingly, the act of finding multiple trade-off so- 
lutions using an EMO procedure has found its ap- 
plication outside the realm of solving multi-objective 
optimization problems. The concept of finding near- 
optimal trade-off solutions is applied to solve other 
kinds of optimization problems as well. For example, 
the EMO concept is used to solve constrained single- 
objective optimization problems by converting the task 
into a two-objective optimization task of additionally 
minimizing an aggregate constraint violation [49.66]. 
This eliminates the need to specify a penalty param- 
eter while using a penalty-based constraint handling 
procedure. If viewed this way, the usual penalty func- 
tion approach used in classical optimization studies is 
a special weighted-sum approach to the bi-objective 
optimization problem of minimizing the objective func- 
tion and minimizing the constraint violation, for which 
the weight vector is a function of the penalty parameter. 
A well-known difficulty in genetic programming stud- 
ies, called bloating, arises due to the continual increase 
in the size of genetic programs evolved with iteration. 
The reduction of bloating by minimizing the size of 
a program as an additional objective has helped find 
high-performing solutions with a smaller size of the 
code [49.67, 68]. In clustering algorithms, minimizing 
the intra-cluster distance and maximizing inter-cluster 
distance simultaneously in a bi-objective formulation 
of a clustering problem is found to yield better solu- 
tions than the usual single-objective minimization of the 
ratio of the intra-cluster distance to the inter-cluster dis- 
tance [49.69]. An EMO is found to solve a minimum 
spanning tree problem better than a single-objective 
EA [49.70]. A recent edited book [49.71] describes
many interesting applications in which EMO methodologies
have helped to solve problems that are otherwise
(or traditionally) not treated as multi-objective
optimization problems.
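
The constrained single-objective case mentioned above can be illustrated on a toy problem. The sketch below assumes a hypothetical problem (minimize x^2 subject to x >= 1), treats the objective value and the constraint violation as two objectives to be minimized, and filters a grid of candidates for non-dominated points; it is an illustration of the idea, not the method of [49.66].

```python
def objective(x):
    return x * x


def violation(x):
    """Violation of the constraint x >= 1; zero when feasible."""
    return max(0.0, 1.0 - x)


def bi_objective_front(xs):
    """Non-dominated (objective, violation, x) triples among candidates."""
    evaluated = [(objective(x), violation(x), x) for x in xs]
    front = []
    for f, cv, x in evaluated:
        dominated = any(
            g <= f and cw <= cv and (g < f or cw < cv)
            for g, cw, _ in evaluated
        )
        if not dominated:
            front.append((f, cv, x))
    return front


candidates = [k / 10 for k in range(21)]        # x in [0, 2]
front = bi_objective_front(candidates)
feasible_end = [x for f, cv, x in front if cv == 0.0]
```

The zero-violation end of the resulting trade-off front is the constrained optimum of the original problem, and no penalty parameter had to be chosen to find it.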


49.6.3 Uncertainty-Based EMO 


A major surge in EMO research has taken place in 
handling uncertainties among decision variables and 
problem parameters in multi-objective optimization. 
Practice is full of uncertainties and almost no parameter, 
dimension, or property can be guaranteed to be fixed at 
the value it is aimed at. In such scenarios, evaluation of 
a solution is not precise, and the resulting objective and 
constraint function values become probabilistic quantities.
Optimization algorithms are usually designed to
handle such stochasticities by using crude methods,
such as Monte Carlo simulation of stochasticities in
uncertain variables and parameters, and by sophisticated
stochastic programming methods involving nested
optimization techniques [49.72]. When these effects are
taken care of during the optimization process, the re- 
sulting solution is usually different from the optimum 
solution of the problem and is known as a robust so- 
lution. Such an optimization procedure will then find 
a solution which may not be the true global optimum 
solution, but one which is less sensitive to uncertain- 
ties in decision variables and problem parameters. In 
the context of multi-objective optimization, a consider- 
ation of uncertainties for multiple objective functions 
will result in a robust frontier which may be different 
from the globally Pareto-optimal front. Each and every 
point on the robust frontier is then guaranteed to be less 
sensitive to uncertainties in decision variables and prob- 
lem parameters. Some such studies in EMO are [49.73, 
74]. 

When the evaluation of constraints under uncertain- 
ties in decision variables and problem parameters is 
considered, deterministic constraints become stochastic 
(they are also known as chance constraints) and involve 
a reliability index ($R$) to handle the constraints. A constraint
$g(x) \ge 0$ then becomes $\mathrm{Prob}(g(x) \ge 0) \ge R$. In
order to find the left-hand side of the above chance constraint,
a separate optimization methodology [49.75]
is needed, thereby making the overall algorithm a bi-level
optimization procedure. Approximate single-loop
algorithms exist [49.76] and recently one such methodology
was integrated with an EMO [49.72] and shown
to find a reliable frontier corresponding to a specified
reliability index, instead of the Pareto-optimal frontier,
in problems having uncertainty in decision variables
and problem parameters. More such methodologies are
needed, as uncertainties are an integral part of practical
problem-solving, and multi-objective optimization
researchers must look for better and faster algorithms to
handle them.
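
To make the chance constraint concrete, the probability on its left-hand side can be estimated by sampling. The sketch below swaps in the crude Monte Carlo approach for the optimization-based reliability analysis of [49.75]; the Gaussian perturbation of the decision variables, the sample size, the seed, and all function names are assumptions made for illustration.

```python
import random


def reliability(g, x, sigma, samples, rng):
    """Monte Carlo estimate of Prob(g(x + noise) >= 0)."""
    hits = 0
    for _ in range(samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        if g(noisy) >= 0.0:
            hits += 1
    return hits / samples


def satisfies_chance_constraint(g, x, sigma, R, samples=5000, seed=1):
    """Check Prob(g(x) >= 0) >= R under the sampled uncertainty."""
    return reliability(g, x, sigma, samples, random.Random(seed)) >= R


def g(x):
    return x[0] - 1.0          # deterministic constraint x_1 >= 1


on_boundary = satisfies_chance_constraint(g, [1.0], 0.1, R=0.9)
backed_off = satisfies_chance_constraint(g, [1.5], 0.1, R=0.9)
```

A point sitting exactly on the deterministic constraint boundary is feasible only about half of the time under this noise model, so a reliable solution has to back away from the boundary, which is why the reliable frontier generally differs from the Pareto-optimal frontier.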


49.6.4 EMO and Decision-Making 


Searching for a set of Pareto-optimal solutions by us- 
ing an EMO fulfills only one aspect of multi-objective 
optimization, as choosing a particular solution for an 
implementation is the remaining decision-making task, 
which is equally important. For many years, EMO
researchers have postponed the decision-making aspect
and concentrated on developing efficient algorithms for
finding multiple trade-off solutions. Having pursued
that part to some extent, EMO researchers have in recent
years been putting effort into designing combined
algorithms for optimization and decision-making.
In the view of the author, the decision-making task can
be viewed from two main considerations in an EMO
framework:


1. Generic consideration: there are some aspects that
most practical users would like to use in narrowing
down their choice. Above we discussed the importance
of finding robust and reliable solutions in
the presence of uncertainties in decision variables
and/or problem parameters. In such scenarios, an
EMO methodology can straightaway find a robust
or a reliable frontier [49.72, 73] and no subjective
preference from any decision-maker may be necessary.
Similarly, if a problem has a Pareto-optimal
front with knee points, such points are
often the choice of decision-makers. Knee points
demand a large sacrifice in at least one objective
to achieve a small gain in another, thereby making it
discouraging to move away from a knee point [49.77].
Other such generic choices are related to
Pareto-optimal points depicting a certain pre-specified
relationship between objectives, Pareto-optimal points
having multiplicity (say, two or more solutions
in the decision variable space mapping to
identical objective values), Pareto-optimal solutions
which do not lie close to variable boundaries,
Pareto-optimal points having certain mathematical
properties, such as all Lagrange multipliers having
more or less identical magnitudes (a condition
often desired in order to give equal importance to all
constraints), and others. These considerations are
motivated by the fundamental and practical aspects
of optimization and may be applied to most
multi-objective problem-solving tasks, without any
consent of a decision-maker. These considerations
may narrow down the set of non-dominated points.
A further subjective consideration (which is discussed
below) may then be used to pick a preferred
solution.

2. Subjective consideration: in this category, any 
problem-specific information can be used to nar- 
row down the choices, and the process may 
even lead to a single preferred solution at the 
end. Most decision-making procedures use some 
preference information (utility functions, refer- 
ence points [49.78], reference directions [49.79], 
marginal rate of return, and a host of other consid- 
erations [49.9]) to select a subset of Pareto-optimal 
solutions. A recent book [49.80] is dedicated to 
the discussion of many such multi-criteria decision 
analysis (MCDA) tools and collaborative sugges- 
tions of using EMO with such MCDA tools. Some 
hybrid EMO and MCDA algorithms have been suggested
in the recent past [49.81–85].


Many other generic and subjective considerations 
are needed, and it is interesting that EMO and 
MCDM researchers are collaborating on developing 
such complete algorithms for multi-objective optimiza- 
tion [49.80]. 


49.6.5 EMO for Handling a Large Number
of Objectives: Many-Objective EMO


Initial studies of EMO amply showed that EMO algo- 
rithms can be used to find a wide spread of trade-off 
solutions on two and three-objective optimization problems.
However, their performance on problems with four or more
objectives has not been studied enough. Recently,
such studies have become important and are
known as many-objective optimization studies in the
EMO literature.

A detailed study [49.86] made on eight-objective
problems revealed somewhat negative results about
the existing EMO methodologies. However, in his
book [49.8] and other recent studies [49.87–90] the
author has clearly explained the reasons for this behavior
of EMO algorithms. EMO methodologies work by
emphasizing non-dominated solutions in a population.
Unfortunately, as the number of objectives increases,
most population members in a randomly created population
tend to become non-dominated with respect to each other.
For example, in a three-objective scenario, about 10%
of the members in a population of size 200 are
non-dominated, whereas in a 10-objective problem scenario,
as many as 90% of the members in a population
of size 200 are non-dominated. Thus, in a
large-objective problem, an EMO algorithm runs out of room
to introduce new population members into a generation,
thereby causing a stagnation in the performance
of an EMO algorithm. It has been argued that to make
EMO procedures efficient, an exponentially large population
size (with respect to the number of objectives) is
needed. This makes the EMO procedure slow and
computationally less attractive.
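
The growth in the share of non-dominated members can be checked with a short experiment. The sketch below assumes objective vectors drawn uniformly at random from the unit hypercube and minimization of every objective; the exact fractions depend on this sampling model and on the random seed, so no particular percentages are claimed, and the function names are hypothetical.

```python
import random


def dominates(a, b):
    """True if a dominates b (minimization)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def non_dominated_fraction(n_points, n_objectives, rng):
    """Fraction of a random population that no other member dominates."""
    pop = [[rng.random() for _ in range(n_objectives)]
           for _ in range(n_points)]
    count = sum(
        1 for p in pop if not any(dominates(q, p) for q in pop)
    )
    return count / n_points


rng = random.Random(0)
fractions = {m: non_dominated_fraction(200, m, rng) for m in (2, 3, 5, 10)}
```

With most of the population already non-dominated, domination alone provides almost no selection pressure, which is the stagnation described above.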

However, recent techniques use a fixed set of reference
points [49.91–93] or reference directions [49.94]
and are promising, as they are shown to find a widely
distributed set of solutions in 3 to 15-objective test and
real-world problems.

However, practically speaking, even if an algorithm
can find tens of thousands of Pareto-optimal
solutions for a many-objective optimization problem,
besides simply giving an idea of the nature and shape
of the front, they are simply too many to be comprehended
for any decision-making purposes. Keeping
these views in mind, EMO researchers have taken two
different approaches in dealing with many-objective
problems.


Finding a Partial Set 
Instead of finding the complete Pareto-optimal front 
in a problem having many objectives, EMO proce- 
dures can be used to find only a preferred part of the 
Pareto-optimal front. This can be achieved by indicating 
preference information by various means. Ideas such 
as reference point-based EMO [49.81, 85], light beam 
search [49.82], biased sharing approaches [49.95], 
cone-dominance [49.96], etc. have been suggested for 
this purpose. Each of these studies has shown that for up 
to 10 and 20-objective problems, although finding the 
complete frontier is a difficulty, finding a partial fron- 
tier corresponding to certain preference information is 
not that difficult a proposition. 

A parallel or a distributed computing platform
can be used with the above idea, and the complete
Pareto-optimal front can be obtained by a distributed 
computing procedure [49.96]. In the study, each pro- 
cessor in a distributed computing environment receives 
a unique cone defining domination. The cones are de- 
signed carefully so that at the end of such a distributed 
computing EMO procedure, solutions are found to exist 
in various parts of the complete Pareto-optimal front. 
A collection of these solutions is then able to provide 
a good representation of the entire original Pareto- 
optimal front. 


Identifying and Eliminating Redundant Objectives
Many practical optimization problems can easily list
a large number of objectives (often more than 10),
as many different criteria or goals are often of interest
to practitioners. In most instances, it is not entirely
clear whether or not the chosen objectives are all in conflict
with each other. For example, the minimization of
weight and the minimization of cost of a component 
or a system are often mistaken to have an identical 
optimal solution, but may lead to a range of trade-off 
optimal solutions. Practitioners do not take any chances 
and tend to include all (or as many as possible) objec- 
tives into the optimization problem formulation. There 
is another fact which is more worrisome. Two appar- 
ently conflicting objectives may show a good trade-off 
when evaluated with respect to some randomly created 
solutions. However, if these two objectives are evalu- 
ated for solutions close to their optima, they tend to 
show a good correlation. That is, although objectives 
can exhibit conflicting behavior for random solutions, 
near their Pareto-optimal front, the conflict vanishes 
and the optimum of one becomes close to the optimum 
of the other. 

Recognizing that such problems exist in
practice, certain researchers [49.90, 97, 98] applied
linear and non-linear principal component analysis
(PCA) to a set of EMO-produced solutions. Objectives
that were positively correlated with each other over
the obtained NSGA-II solutions were identified
and declared redundant. The EMO procedure
is then restarted with the non-redundant objectives.
This combined EMO-PCA procedure is continued
until no further reduction in the number of objectives
is possible. The procedure has handled practical
problems involving five or more objectives and
has been shown to reduce the set of truly conflicting
objectives to a few. On test problems, the proposed
approach has been shown to reduce an initial 50- 
objective problem to the correct three-objective Pareto- 
optimal front by eliminating 47 redundant objectives. 
Another study [49.99] used an exact and a heuristic- 
based conflict identification approach on a given set 
of Pareto-optimal solutions. For a given error mea- 
sure, an effort is made to identify a minimal subset of 
objectives that does not alter the original dominance 
structure on a set of Pareto-optimal solutions. This 
idea was recently introduced within an EMO [49.100], 
but a continual reduction of objectives through a suc- 
cessive application of the above procedure would be 
interesting. 
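
The notion of a redundant objective used in the EMO-PCA procedure above can be illustrated with a much simplified sketch. It is not the PCA-based method of the cited studies: it only checks pairwise Pearson correlation over a set of objective vectors, and the function names, the greedy rule, and the 0.95 threshold are assumptions made purely for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant objective carries no trend information
    return cov / math.sqrt(vx * vy)


def non_redundant_objectives(points, threshold=0.95):
    """Greedily keep objectives not strongly positively correlated with a kept one."""
    m = len(points[0])
    columns = [[p[j] for p in points] for j in range(m)]
    kept = []
    for j in range(m):
        if all(pearson(columns[j], columns[k]) < threshold for k in kept):
            kept.append(j)
    return kept


# Three objectives: f3 = 2*f1 + 1 moves together with f1, while f2 conflicts with f1.
solutions = [(x, 1.0 - x, 2.0 * x + 1.0) for x in (0.0, 0.25, 0.5, 0.75, 1.0)]
print(non_redundant_objectives(solutions))
```

In this toy data the third objective is dropped and the two conflicting ones are kept. In the procedure described in the text, the optimizer would then be restarted on the reduced objective set and the analysis repeated.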
This is a promising area of EMO research and
computationally faster objective-reduction techniques
are definitely needed for the purpose. A recent
approach uses previously-fixed multiple directional 
searches to find a widely distributed set of Pareto- 
optimal points [49.94]. In this direction, the use of 
alternative definitions of domination may be beneficial. 
One such idea redefined the definition of domination: 
a solution is said to dominate another solution, if the 
former solution is better than the latter one in more 
objectives. This certainly excludes finding the entire 
Pareto-optimal front and helps an EMO to converge 
near the intermediate and central part of the Pareto- 
optimal front. Another EMO study used a fuzzy domi- 
nance [49.101] relation (instead of Pareto-dominance), 
in which superiority of one solution over another in any 
objective is defined in a fuzzy manner. Many other such 
definitions are possible and can be implemented based 
on the problem context. 
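
As a rough illustration of how such a redefinition changes the comparison of two solutions, the sketch below contrasts the usual Pareto dominance with one possible reading of the "better in more objectives" rule. Both function names are hypothetical, minimization is assumed, and the counting rule is an interpretation of the sentence above rather than the cited authors' exact definition.

```python
def pareto_dominates(a, b):
    """Usual Pareto dominance for minimization: no worse everywhere, better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def majority_dominates(a, b):
    """Modified relation: a beats b in more objectives than b beats a."""
    a_better = sum(x < y for x, y in zip(a, b))
    b_better = sum(y < x for x, y in zip(a, b))
    return a_better > b_better


a = (1.0, 2.0, 9.0)
b = (2.0, 3.0, 1.0)
print(pareto_dominates(a, b), majority_dominates(a, b))
```

Here `a` and `b` are mutually non-dominated in the Pareto sense, yet `a` wins under the counting rule because it is better in two of the three objectives. This is why such a relation discards extreme trade-offs and focuses the search on a part of the front.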


49.6.6 Knowledge Extraction Through EMO 


One striking difference between single-objective opti- 
mization and multi-objective optimization is the car- 
dinality of the solution set. In the latter, multiple 
solutions are the outcome and each solution is theoret- 
ically an optimal solution corresponding to a particu- 
lar trade-off among the objectives. Thus, if an EMO 
procedure can find solutions close to the true Pareto- 
optimal set, what we have in our hands is a number 
of high-performing solutions trading-off the conflicting 
objectives considered in the study. Since these solu- 
tions are all near optimal, they can be analyzed for 
finding properties which are common to them. Such 
a procedure can then become a systematic approach in 
deciphering the important and hidden properties that 
optimal and high-performing solutions must have for 
that problem. In a number of practical problem-solving 
tasks, the so-called innovization procedure is shown to
find important knowledge about high-performing so- 
lutions [49.102]. Such useful properties are expected 
to exist in practical problems, as they follow certain 
scientific and engineering principles at the core, but 
in the past not much attention had been paid to find- 
ing them through a systematic scientific procedure. The 
principle of first searching for multiple trade-off and 
high-performing solutions using a multi-objective opti- 
mization procedure and then analyzing them to discover 
useful knowledge certainly remains a viable way for- 
ward. The current efforts [49.103, 104] to automate the 
knowledge extraction procedure through a sophisticated
data-mining task should make the overall approach
more appealing and useful in practice. 


49.6.7 Dynamic EMO 


Dynamic optimization involves objectives, constraints, 
or problem parameters that change over time. This 
means that as an algorithm approaches the optimum 
of the current problem, the problem definition changes 
and now the algorithm must solve a new problem. 
This is not equivalent to another optimization task in 
which a new and different optimization problem must 
be solved afresh. In such dynamic optimization
problems, an algorithm is usually not expected to find
the optimum; instead, it is expected to track the
optimum as it changes with the iteration. The performance of
a dynamic optimizer then depends on how close it is 
able to track the true optimum (which changes with it- 
eration or time). Thus, practically speaking, it may be 
hoped that optimization algorithms can handle prob- 
lems that do not change significantly with time. With 
respect to the algorithm, since here the problem is not 
expected to change too much from one time instance to 
another and some good solutions to the current problem 
are already at hand in a population, researchers fancied 
solving such dynamic optimization problems using evo- 
lutionary algorithms [49.105]. 

A recent study [49.106] proposed the following procedure
for dynamic optimization involving single or
multiple objectives. Let $P(t)$ be a problem that changes
with time $t$ (from $t = 0$ to $t = T$). Despite the continual
change in the problem, we assume that the problem
is fixed for a time period $\tau$, which is not known a priori,
and the aim of the (offline) dynamic optimization
study is to identify a suitable value of $\tau$ for an accurate
as well as computationally fast approach. For this
purpose, an optimization algorithm with $\tau$ as a fixed
time period is run from $t = 0$ to $t = T$ with the problem
assumed fixed for every time period of length $\tau$. A measure
$\Gamma(\tau)$ determines the performance of the algorithm and
is compared with a pre-specified and expected value
$\Gamma_L$. If $\Gamma(\tau) \geq \Gamma_L$ for the entire time domain of the
execution of the procedure, we declare $\tau$ to be a permissible
length of stasis. Then, we try a reduced
value of $\tau$ and check if a smaller length of stasis is
also acceptable. If not, we increase $\tau$ to allow the optimization
problem to remain static for a longer time
so that the chosen algorithm can now have more iterations
(time) to perform better. Such a procedure will
eventually come up with a time period $\tau^*$, which would
be the smallest time of stasis allowed for the optimization
algorithm to work based on the chosen performance
requirement. Based on this study, a number of test problems
and a hydro-thermal power dispatch problem were
tackled recently [49.106]. 
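
The search for the smallest permissible length of stasis can be sketched as follows. This is only an illustration under stated assumptions: the performance measure is taken to be non-decreasing in the length of stasis (so that bisection applies), candidate lengths are integers, and `toy_performance` is a made-up stand-in for a full run of the optimizer over the whole time domain. None of these names come from the cited study.

```python
def smallest_stasis(performance, gamma_l, tau_min, tau_max):
    """Bisection for the smallest integer tau with performance(tau) >= gamma_l.

    Assumes (hypothetically) that performance is non-decreasing in tau.
    Returns None if even tau_max is not permissible.
    """
    if performance(tau_max) < gamma_l:
        return None
    lo, hi = tau_min, tau_max
    while lo < hi:
        mid = (lo + hi) // 2
        if performance(mid) >= gamma_l:
            hi = mid        # mid is permissible: try a shorter stasis
        else:
            lo = mid + 1    # too few iterations: allow a longer stasis
    return lo


def toy_performance(tau):
    """Stand-in for the performance measure; a real study would run the optimizer."""
    return 1.0 - 1.0 / tau


print(smallest_stasis(toy_performance, gamma_l=0.85, tau_min=1, tau_max=100))
```

With the toy measure, a stasis of 6 gives about 0.83 and a stasis of 7 gives about 0.86, so 7 is the smallest permissible value for a required level of 0.85. If the real measure is noisy or not monotone, a plain scan over candidate values would be the safer choice.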

In the case of dynamic multi-objective problem- 
solving tasks, there is an additional difficulty which is 
worth mentioning here. Not only does an EMO algo- 
rithm need to find or track the changing Pareto-optimal 
fronts, in a real-world implementation, it must also 
make an immediate decision about which solution to 
implement from the current front before the problem 
changes to a new one. Decision-making analysis is con- 
sidered to be time-consuming, involving execution of 
analysis tools, higher-level considerations, and some- 
times group discussions. If dynamic EMO is to be 
applied in practice, automated procedures for making 
decisions must be developed. Although it is not clear 
how to generalize such an automated decision-making 
procedure in different problems, problem-specific tools 
are certainly possible and a worthwhile and fertile area 
for research. 


49.6.8 Quality Estimates for EMO 


When algorithms are developed and test problems with
known Pareto-optimal fronts are available [49.107–110],
an important task is to have performance measures
with which the EMO algorithms can be evaluated.
Thus, a major focus of EMO research has been to
develop different performance measures. Since the focus
in an EMO task is multi-faceted (convergence to
the Pareto-optimal front and diversity of solutions along
the entire front), a single performance
measure is expected to be unsatisfactory for evaluating
EMO algorithms. In the early years of EMO research, three different
sets of performance measures were used:


1. Metrics evaluating convergence to the known 
Pareto-optimal front (such as error ratio, distance 
from reference set, etc.).

2. Metrics evaluating spread of solutions on the known 
Pareto-optimal front (such as spread, spacing, etc.). 

3. Metrics evaluating certain combinations of conver- 
gence and spread of solutions (such as hypervol- 
ume, coverage, R-metric, etc.). 


Some of these metrics are described in texts [49.8, 
44]. A detailed study [49.111] comparing most ex- 
isting performance metrics based on out-performance 
relations recommended the use of the S-metric (or 
the hypervolume metric) and the R-metric suggested
by [49.112]. A recent study argued that a single unary
performance measure or any finite combination of them 
(for example, any of the first two metrics described 
above in the enumerated list or both together) can- 
not adequately determine whether one set is better 
than another [49.113]. That study also concluded that 
binary performance metrics (indicating usually two dif- 
ferent values when a set of solutions A is compared 
against B and B is compared against A), such as an 
epsilon-indicator, a binary hypervolume indicator, util- 
ity indicators R1 to R3, etc., are better measures for 
multi-objective optimization. The flip side is that the 
chosen binary metric must be computed $K(K-1)$ times
when comparing K different sets to make a fair com- 
parison, thereby making the use of binary metrics com- 
putationally expensive in practice. Importantly, these 
performance measures have allowed researchers to use 
them directly as fitness measures within indicator-based 
EAs (IBEAs) [49.114]. In addition, the attainment indicators
of [49.115, 116] provide further information
about location and inter-dependencies among the solu- 
tions obtained. 
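
To make the cost of binary metrics concrete, the following sketch computes the additive epsilon indicator for every ordered pair of K = 3 small fronts. It assumes minimization and uses the common textbook form of the indicator; the three fronts and the variable names are invented for illustration.

```python
from itertools import permutations


def additive_epsilon(A, B):
    """Smallest eps such that every point of B is weakly dominated by
    some point of A after subtracting eps from it (minimization)."""
    return max(
        min(max(a_i - b_i for a_i, b_i in zip(a, b)) for a in A)
        for b in B
    )


fronts = {
    "A": [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0)],
    "B": [(2.0, 5.0), (3.0, 3.0), (5.0, 2.0)],
    "C": [(1.5, 4.5), (2.5, 2.5)],
}

# A binary indicator is generally asymmetric, so all ordered pairs are needed.
table = {(p, q): additive_epsilon(fronts[p], fronts[q])
         for p, q in permutations(fronts, 2)}
print(len(table), table[("A", "B")], table[("B", "A")])
```

The table has K(K-1) = 6 entries. A negative value for the pair (A, B) together with a positive value for (B, A) indicates that front A is strictly ahead of front B in this toy example.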

The hypervolume metric is a popular metric used in 
EMO studies. However, the computation of the hyper- 
volume metric for more than three-objective problems 
becomes a computationally challenging task. Recent 
studies on computationally fast estimation methods of 
the hypervolume metric have gained popularity among 
theoretical minds [49.62, 63, 117, 118]. These methods 
compute the proportion of randomly generated objec- 
tive points that are dominated by the current set of 
non-dominated points to estimate the hypervolume metric.
A reliable estimation method emerging from these studies will
facilitate the use of the hypervolume metric in design- 
ing efficient EMO algorithms. 
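
The sampling idea just described can be sketched in a few lines. This is a minimal illustration, not one of the cited estimators: it assumes minimization, a user-supplied reference point and lower corner of the sampling box, and plain uniform sampling. The function and parameter names are hypothetical.

```python
import random


def estimate_hypervolume(front, reference, ideal, samples=20000, seed=0):
    """Monte Carlo hypervolume estimate for a minimization front.

    Points are sampled uniformly in the box [ideal, reference]; the fraction
    dominated by the front, times the box volume, estimates the hypervolume.
    """
    rng = random.Random(seed)
    box_volume = 1.0
    for lo, hi in zip(ideal, reference):
        box_volume *= hi - lo
    hits = 0
    for _ in range(samples):
        z = [rng.uniform(lo, hi) for lo, hi in zip(ideal, reference)]
        if any(all(f_i <= z_i for f_i, z_i in zip(f, z)) for f in front):
            hits += 1
    return box_volume * hits / samples


front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
print(estimate_hypervolume(front, reference=(4.0, 4.0), ideal=(0.0, 0.0)))
```

For this two-objective front the exact hypervolume with respect to the reference point (4, 4) is 6, which can be checked by hand, so the estimate should land close to that value. The cost per sample grows only linearly with the number of objectives, which is what makes sampling attractive when exact computation becomes expensive.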


49.6.9 Exact EMO with Run-Time Analysis 


Since they were first suggested, efficient EMO algo- 
rithms have been increasingly applied in a wide vari- 
ety of problem domains to obtain trade-off frontiers. 
Simultaneously, some researchers have also devoted 
their efforts to developing exact EMO algorithms with 
a theoretical complexity estimate in solving certain 
discrete multi-objective optimization problems. The 
first such study [49.119] suggested a pseudo-Boolean
multi-objective optimization problem, the two-objective
LOTZ (leading ones trailing zeroes) problem, and a couple
of EMO methodologies: a simple evolutionary multi-objective
optimizer (SEMO) and an improved version,
the fair evolutionary multi-objective optimizer (FEMO).
The study then estimated the worst-case computational
effort needed to find all Pareto-optimal solutions of the 
LOTZ problem. This study spurred a number of im- 
proved EMO algorithms with run-time estimates and re- 
sulted in many other interesting test problems [49.120- 
123]. Although these test problems may not resem- 
ble common practical problems, the working principles 
of suggested EMO algorithms to handle specific prob- 
lem structures bring in a plethora of insights about the 
working of multi-objective optimization, particularly 
in comprehensively finding all (not just one or a few) 
Pareto-optimal solutions. 
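
For readers unfamiliar with these constructions, the sketch below implements the LOTZ objectives and a minimal SEMO-style loop. It is an illustration only: one-bit mutation, uniform parent selection, the iteration cap, and the early stopping test are assumptions of this sketch and should not be read as the exact algorithm analyzed in the cited study.

```python
import random


def lotz(bits):
    """Leading ones, trailing zeroes: both objectives are maximized."""
    n = len(bits)
    leading_ones = next((i for i, b in enumerate(bits) if b == 0), n)
    trailing_zeroes = next((i for i, b in enumerate(reversed(bits)) if b == 1), n)
    return leading_ones, trailing_zeroes


def dominates(u, v):
    """Pareto dominance for maximization."""
    return all(a >= b for a, b in zip(u, v)) and u != v


def semo(n, max_iters=20000, seed=1):
    """Minimal SEMO-style loop with one-bit mutation and an unbounded population."""
    rng = random.Random(seed)
    start = tuple(rng.randint(0, 1) for _ in range(n))
    population = {start: lotz(start)}
    for _ in range(max_iters):
        if len(population) == n + 1:  # LOTZ has exactly n + 1 Pareto-optimal vectors
            break
        parent = rng.choice(list(population))
        i = rng.randrange(n)
        child = parent[:i] + (1 - parent[i],) + parent[i + 1:]
        fc = lotz(child)
        # Reject the child if it is dominated or duplicates a stored objective vector.
        if any(dominates(fp, fc) or fp == fc for fp in population.values()):
            continue
        population = {x: fx for x, fx in population.items() if not dominates(fc, fx)}
        population[child] = fc
    return population


front = sorted(semo(6).values())
print(front)
```

The Pareto-optimal strings of LOTZ are those consisting of a block of ones followed by a block of zeroes, so for n = 6 the loop is expected to end with the seven objective vectors (0, 6), (1, 5), up to (6, 0).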


49.6.10 EMO with Meta-Models 


The practice of optimization algorithms is often lim- 
ited by the computational overheads associated with 
evaluating solutions. Certain problems involve expen- 
sive computations, such as numerical solution of par- 
tial differential equations describing the physics of 
the problem, finite difference computations involving 
an analysis of a solution, computational fluid dynam- 
ics simulation to study the performance of a solu- 
tion over a changing environment, etc. In some such 
problems, evaluation of each solution to compute con- 
straints and objective functions may take a few hours 
to a complete day or two. In such scenarios, even 
if an optimization algorithm needs one hundred so- 
lutions to get anywhere close to a good and feasible 
solution, the application needs an easy 3 to 6 months 
of continuous computational time. For most practical
purposes, this is considered a luxury in an industrial
set-up. Optimization researchers are constantly on
their toes in coming up with approximate, yet faster
algorithms.

A little thought brings out an interesting fact about
how optimization algorithms work. The initial iterations
deal with solutions which may not be close to optimal
solutions. Therefore, these solutions need not be
evaluated with high precision. Meta-models for objective
functions and constraints have been developed for
this purpose. Mostly two different approaches are followed.
In one approach, a sample of solutions is used
to generate a meta-model (an approximate model of
the original objectives and constraints), and then efforts
are made to find the optimum of the meta-model,
assuming that the optimal solutions of both the meta-model
and the original problem are similar to each
other [49.124, 125]. In another method, a successive
meta-modeling approach is used in which the algorithm
starts to solve the first meta-model obtained from a sample
of the entire search space [49.126-128]. As the
solutions start to focus near the optimum region of the
meta-model, a new and more accurate meta-model is
generated in the region dictated by the solutions of the
previous optimization. A coarse-to-fine-grained meta-modeling
technique based on artificial neural networks
is shown to reduce the computational effort by about 30
to 80% on different problems [49.126]. Other successful
meta-modeling implementations for multi-objective
optimization, based on Kriging and response surface
methodologies, also exist [49.128, 129].


49.7 Conclusions


The research and application in evolutionary multi-objective
optimization (EMO) over the past 15 years
have resulted in a number of efficient algorithms for
finding a set of well-diversified, near Pareto-optimal
solutions. EMO algorithms are now regularly being applied
to different problems in most areas of science,
engineering, and commerce. This chapter has discussed
the principles of EMO and illustrated the principle
by depicting one efficient and popularly used EMO
algorithm. Results from an inter-planetary spacecraft
trajectory optimization problem reveal the importance
of the principles followed in EMO algorithms. Thereafter,
a specific constraint handling procedure used in
EMO studies was briefly described.

The main highlight of this chapter has been the
description of some of the current research and application
activities in EMO. One critical area of current
research lies in collaborative EMO-MCDM algorithms 
for achieving a complete multi-objective optimization 
task of finding a set of trade-off solutions and finally 
arriving at a single preferred solution. Another di- 
rection taken by researchers is to address guaranteed 
convergence and diversity of EMO algorithms through 
hybridizing them with mathematical and numerical op- 
timization techniques as local search algorithms. Inter- 
estingly, EMO researchers have discovered its potential 
in solving traditionally hard optimization problems, 
but not necessarily multi-objective ones in nature, in 
a convenient manner using EMO algorithms. So-called 
multi-objectivization studies are attracting researchers 
from various fields to develop and apply EMO algo- 
rithms in many innovative ways. Considerable interest 
in research and application has also been shown in addressing
practical aspects in existing EMO algorithms.
In this direction, handling uncertainty in decision vari- 
ables and parameters, meeting an overall desired system 
reliability in obtained solutions, handling dynamically 
changing problems (on-line optimization), and han- 
dling a large number of objectives have been discussed 
in this chapter. Besides the practical aspects, EMO has 
also attracted mathematically-oriented theoreticians to 
develop EMO algorithms and design suitable problems 
for coming up with a computational complexity analysis.
There are many other research directions which
we could not even mention due to space restrictions.

In the short span of about 15 years, it has become 
clear that the field of EMO research and application 
now has efficient algorithms and numerous interesting 
50. Parallel Multiobjective Evolutionary Algorithms


Francisco Luna, Enrique Alba 


The use of evolutionary algorithms (EAs) for solving 
multiobjective optimization problems has been 
very active in the last few years. The main rea- 
sons for this popularity are their ease of use with 
respect to classical mathematical programming 
techniques, their scalability, and their suitabil- 
ity for finding trade-off solutions in a single run. 
However, these algorithms may be computationally 
expensive because (1) many real-world optimiza- 
tion problems typically involve tasks demanding 
high computational resources and (2) they are 
aimed at finding a whole front of optimal so- 
lutions instead of searching for a single optimum. 
Parallelizing EAs emerges as a possible way of reducing
the CPU time down to affordable values,
but it also allows researchers to use an advanced
search engine (the parallel model) that provides
the algorithms with improved population diversity
and enables them to cooperate with other
(possibly nonevolutionary) techniques. The goal
of this chapter is to provide the reader with an up-to-date
review of the recent literature on parallel
EAs for multiobjective optimization.


50.1 Multiobjective Optimization and Parallelism 


Multiobjective optimization arises in many real-world 
applications, especially in engineering, in which several
performance criteria conflict with each other. Because
of these conflicting objectives, no single solution
can usually optimize them all
simultaneously. Indeed, the aim of multiobjective optimization
is to find a set of compromise solutions with
different tradeoffs among criteria, also known as the 
Pareto optimal set. When this set is plotted in the ob- 
jective space it is called the Pareto front [50.1, 2]. 
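
As a small illustration of these definitions, the sketch below filters a list of objective vectors down to its nondominated subset. It assumes that all objectives are minimized; the data and the function names are made up for the example.

```python
def dominates(u, v):
    """True if u Pareto-dominates v (all objectives minimized)."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))


def pareto_front(points):
    """Return the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (cost, weight) pairs for five candidate designs.
designs = [(1.0, 9.0), (2.0, 7.0), (3.0, 8.0), (4.0, 4.0), (6.0, 5.0)]
print(pareto_front(designs))
```

The designs (3, 8) and (6, 5) are removed because (2, 7) and (4, 4), respectively, are better in both criteria. The three remaining designs represent different tradeoffs, and none of them can be preferred without additional preference information.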
Many different techniques have been proposed 
in the multiobjective research community to address 
multiobjective optimization problems (MOPs). Unlike 
classical mathematical programming approaches, metaheuristics
in general, and EAs (multiobjective evolutionary
algorithms or MOEAs) in particular, have
attracted growing attention over the last decade because
of two main facts. On the one hand, EAs have the ability
to generate several members of the Pareto optimal set
in one single run, as opposed to classical multicriteria
decision-making techniques. They are also less sensitive
to the shape of the Pareto front and can therefore
deal with a large variety of MOPs. On the other hand, as
randomized black-box algorithms, EAs can address op- 
timization problems with nonlinear, nondifferentiable, 
or noisy objective functions. 

In spite of these advantages, these algorithms might 
be computationally expensive because, on the one hand, 
they need to explore larger portions of the search
space since they seek the entire Pareto front, which 
usually results in more function evaluations being per- 
formed; on the other hand, and even more importantly, 
many real-world multiobjective problems typically use 
computationally expensive methods for computing the 
objective functions and constraints. 

These issues are usually addressed in two differ- 
ent ways. First, one can use surrogate models of the 
fitness functions instead of true fitness function evaluations
[50.3–5]. The second, more important line lies in
using parallel computing platforms to speed up the EA
search [50.6]. This is the main focus of this chapter.

Due to their population-based approach, EAs are 
very suitable for parallelization because their main op- 
erations (i.e., crossover, mutation, and in particular 
function evaluation) can be carried out independently 
on different individuals. There is a vast amount of liter- 
ature on how to parallelize EAs; the reader is referred 
to [50.7-10] for surveys on this topic. However, par- 
allelism here is not only a way for solving problems 
more rapidly, but also for developing new and more 
efficient search models: a parallel EA can be more ef- 
fective than a sequential one, even when executed on 
a single processor. The advantages that parallelism offers
to single-objective optimization also hold in multiobjective
optimization. Of particular interest for parallel
MOEAs is the improvement in population diversity, which
helps to fully approximate the entire Pareto front of
the given optimization problems.

The contribution of this chapter is to provide the
reader with a recent review of publications related to
parallel MOEAs, showing the latest advances in the
field. Given the sheer volume of papers, we have been
forced to restrict ourselves to only those works which
have been published since 2008/09, the years to which
the two best known surveys date back [50.11, 12]. The
structure of this chapter distinguishes between the numerical
model of the parallel MOEA and its physical
parallelization. In seminal papers in the field [50.13],
it was assumed that the model maps directly onto the
parallel computing platform. This is no longer true:
any (MO)EA can be deployed in parallel, although not always
with high performance. The next section
is therefore devoted to presenting the classical models
for parallel EAs and a recent proposal that considers not only
EAs, but also metaheuristics and exact algorithms
in general. Section 50.3 dives into the details of more
than 80 publications, analyzing particular features of
the MOEAs (fitness assignment, diversity preservation) as
well as of their parallelization (model, topology, parallel
platform). Finally, the last section presents the main
conclusions and the trends for future research on parallel
MOEAs.


50.2 Parallel Models for Evolutionary Multi-Objective Algorithms 


Parallelism arises naturally when dealing with pop- 
ulations of individuals, since each individual is an 
independent unit. As a consequence, the performance 
of population-based algorithms is particularly improved 
when run in parallel. The main models for parallel
MOEAs have been proposed within two clear scopes:
models specifically targeted at EAs, coming from the EA
community [50.7, 14, 15], and those proposed for parallel
metaheuristics in general (of which EAs are a subclass)
[50.12, 16]. They are briefly presented in the
following subsections.


50.2.1 Specialized Models for Parallel EAs 


The most well-known models for parallel MOEAs have 
been inherited directly from the single-objective paral- 
lel EA community, in which two parallelizing strategies 
are defined for population-based algorithms: (1) parallelization
of computation, in which the operations
commonly applied to each individual are performed
in parallel, and (2) parallelization of population, in 
which the population is split into different parts, each 
one evolving in semi-isolation (individuals can be ex- 
changed between subpopulations). 

The simplest parallelization scheme of EAs is
the well-known master-slave or global parallelization
method (Fig. 50.1a). In this scheme, a central processor
performs the selection operations while the associated
slave processors perform the recombination, mutation,
and/or the evaluation of the fitness function. This algorithm
behaves the same as the traditional sequential one (one population,
panmictic), although it is faster, especially for time-consuming
objective functions. Its simplicity has made
it the most popular among practitioners. 
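
A minimal sketch of the master-slave idea, restricted to parallel fitness evaluation, is given below. It uses a thread pool from the Python standard library only to keep the example self-contained; for CPU-bound objective functions a process pool or a cluster scheduler would be the more realistic choice. The toy objective function and all names are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate(x):
    """Toy bi-objective function; stands in for an expensive simulation."""
    return (x * x, (x - 2.0) ** 2)


def master_slave_evaluate(population, workers=4):
    """The master farms out fitness evaluations and collects them in population order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(evaluate, population))


population = [0.0, 0.5, 1.0, 1.5, 2.0]
fitness = master_slave_evaluate(population)
print(fitness)
```

Because the results come back in the same order as the individuals, selection on the master can proceed exactly as in the sequential algorithm, which is why the search behavior is unchanged and only the wall-clock time differs.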

However, other models for parallel EAs utilize some
kind of spatial disposition of the individuals (the
population is then said to be structured), and afterward
parallelize the resulting chunks in a pool of processors.
Among the most widely known types of structured
EAs, the distributed (dEA, or coarse-grain) and cellular
(cEA, fine-grain or diffusion) algorithms are
very popular optimization procedures [50.7]. In the case
of distributed EAs (Fig. 50.1b), the population is partitioned
into a set of islands in which isolated EAs
run in parallel. Sparse individual exchanges are performed
among these islands, with the goal of inserting
some diversity into the subpopulations, thus preventing
them from getting stuck in local optima. Islands may apply
the same (homogeneous) or different (heterogeneous)
EAs [50.17]. In the case of a cellular EA (Fig. 50.1c),
subpopulations are typically composed of one individual,
which may only interact with its nearest neighbors
in the breeding loop, i.e., the concept of neighborhood
is introduced. These neighborhoods overlap,
which implicitly defines a migration mechanism and
allows a smooth diffusion of the best solutions throughout
the population. This parallel scheme was targeted
at massively parallel computers, but nowadays it can
be used sequentially on a regular computer or in parallel
on graphics processing units (GPUs). Also, hybrid
models have been proposed (Fig. 50.1d–f) in which
a two-level approach of parallelization is undertaken.
In these models, the higher level of parallelization is
usually a coarse-grain implementation and the basic island
runs a cEA, a master-slave method, or even
another distributed one.

Fig. 50.1a-f Different models of parallel EAs: (a) global parallelization, (b) coarse grain, and (c) fine grain. Many hybrids have been defined by combining parallel EAs at two levels: (d) coarse and fine grain, (e) coarse grain and global parallelization, and (f) coarse grain at the two levels

This taxonomy holds as well for parallel
MOEAs [50.15], so we can consider master-slave
MOEAs (msMOEAs), distributed MOEAs (dMOEAs),
and cellular MOEAs (cMOEAs). Nevertheless, these
two decentralized population approaches need a further
particularization for MOPs [50.14]. As we stated before,
the main goal of any multiobjective optimization
algorithm is to find the optimal Pareto front for a given
MOP. It is clear that in msMOEAs the management of
this Pareto front is carried out by the master processor.
But, when the search process is distributed among
different subalgorithms, as happens in dMOEAs and
cMOEAs, the management of the nondominated set of
solutions during the optimization procedure becomes
a key issue. Hence, one can distinguish whether the
Pareto front is a centralized element of the algorithm
or is distributed and locally managed by each
sub-EA during the computation. These approaches have been called
centralized Pareto front (CPF) structured MOEAs and
distributed Pareto front (DPF) structured MOEAs,
respectively [50.16]. 
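
The difference between the two strategies can be sketched on a toy example in which three islands have each found a few objective vectors. The data are invented, minimization is assumed, and the search itself is left out: only the bookkeeping of the nondominated set is shown.

```python
def dominates(u, v):
    """True if u Pareto-dominates v (all objectives minimized)."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))


def nondominated(points):
    """Remove duplicates and every point dominated by another one."""
    unique = list(dict.fromkeys(points))
    return [p for p in unique if not any(dominates(q, p) for q in unique)]


# Objective vectors found by three islands.
island_results = [
    [(1.0, 9.0), (3.0, 6.0), (5.0, 5.0)],
    [(2.0, 7.0), (4.0, 4.0), (6.0, 6.0)],
    [(3.0, 5.0), (7.0, 1.0), (8.0, 2.0)],
]

# DPF: every island keeps and filters its own local front ...
local_fronts = [nondominated(found) for found in island_results]
# ... and the local fronts are merged and filtered once more at the end.
dpf_front = nondominated([p for front in local_fronts for p in front])

# CPF: every solution goes to one central archive that is filtered globally.
cpf_front = nondominated([p for found in island_results for p in found])

print(sorted(dpf_front) == sorted(cpf_front))
```

Both strategies end with the same final front here, but the first island's local front still contains (3, 6) and (5, 5), which are dominated by solutions held on other islands. With a distributed front, such globally dominated solutions can persist until fronts are exchanged or merged, whereas a centralized archive removes them immediately at the price of more communication.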

For distributed MOEAs, very specialized models 
have been proposed in the literature which are aimed at 
capturing the different approaches for partitioning the 
search of each island so as to avoid them overlapping 
their exploration [50.18]. On the one hand, each island 
may consider a different subset of the objectives and 
then either aggregate them into a single-objective prob- 
lem [50.19] or use a coevolutionary approach [50.20]. 
On the other hand, the search space (either the de- 
cision space or the objective space) can be explicitly 
partitioned and assigned to different islands. As stated 
in [50.11], in a general multiobjective problem it is dif- 
ficult to design an a priori distribution so that it: 
1. Covers the entire search space,

2. Assigns regions of equal size, and

3. Adds minimal complexity in order to constrain
demes to their assigned region.


50.2.2 General Models 
for Parallel Metaheuristics 


Several models have been proposed for parallelizing
metaheuristics [50.21, 22], into which EAs, as a type of
metaheuristic, fit perfectly. For parallel MOEAs, two
main approaches have been proposed in the literature. 
In [50.16], the authors distinguish between single-walk 
and multiple-walk parallelizations. The former is aimed 
at speeding up the computations by parallelizing the 
evaluation of the objective functions or the search op- 
erators. In the latter, several search threads (EAs or 
any other search method) cooperate to better explore 
the search space (not only accelerating the execution). 
The same issue with the Pareto front as in the parallel 
MOEA models emerges here, so the authors also subdi- 
vide into centralized and distributed Pareto front models 
(CPF and DPF, respectively). 


On the other hand, Talbi et al. [50.12] categorize
parallel metaheuristics in three major hierarchical
models. The self-contained parallel cooperation is tar- 
geted to parallel computing platforms with limited 
communication. The search is performed by several 
subalgorithms in parallel, which might cooperate by 
exchanging some kind of information. It embraces the 
island model or dMOEAs explained before. Two main 
groups are distinguished: cooperating subpopulations, 
which are based on partitioning the objective/search 
space; and the multistart approach, in which several op- 
timization algorithms run separately in parallel. In the 
former, subpopulations can be homogeneous or hetero- 
geneous, explore separate regions of the search space, 
etc. The latter consists of running several local search algorithms
in parallel. At the second and third levels of
the hierarchy, the authors consider those models aimed 
merely at speeding up the computations: problem in- 
dependent parallelization, which mainly comprises the 
master-slave approach of parallel fitness evaluation, 
and problem dependent parallelization, which focuses 
on subdividing single evaluations into parallel tasks that 
speed up the evaluation step. 


50.3 An Updated Review of the Literature 


This section is devoted to presenting and analyzing the 
most recent contributions in the literature to the parallel 
evolutionary multiobjective optimization field. We have 
structured the published material according to the clas- 
sical parallel EA models, i. e., master/slave, distributed, 
cellular, and hybrid models (Sect. 50.2.1) because this 
chapter is targeted precisely to EAs and, as a conse- 
quence, this classification better captures the design 
principles of the different contributions. Table 50.1 in- 
cludes, ordered by the year of publication, an updated 
review of the field. Also, in order to help the reader with 
the terminology of this table, Table 50.2 displays the 
symbols used and their definitions. Then, for each row 
of Table 50.1, the following information is shown: 


• FA-DP (Fitness assignment and diversity preservation):
  As two of the most important design issues in EMO
  algorithms, the fitness assignment and diversity
  preservation mechanisms serve, respectively, to better
  guide the search toward Pareto optimal solutions and to
  spread out these Pareto optimal solutions along the
  entire Pareto front. They are frequently merged into
  one single measure that translates the vector of
  objective function values of a multiobjective problem
  into one single scalar value which is used to rank
  solutions properly (from a Pareto optimality point of
  view, nondominated solutions are noncomparable).

• PM (Parallel model): It can take the values MS
  (master/slave), Dis (distributed model), Cell (cellular
  model) or Hyb (hybrid), according to the classical
  parallel EA categorization.

• PFC (Pareto front computation): This column
  distinguishes between the CPF and DPF strategies
  defined before.

• PP (Parallel platform): When applicable, this column
  indicates the kind of parallel computing platform in
  which the given algorithm is executed (GPUs, multicore,
  cluster, grid, etc.).

• Topology: Communication topology of the parallel
  MOEA (Star, Hybrid, all-to-all [A2A], etc.).

• Programming: When publicly reported, the programming
  language used to implement the parallel MOEA is
  included in this column.

• Description: The main features of the parallel MOEA
  in a few words.

• Application domain: The area in which the parallel
  MOEA has been applied.


50.3.1 Analysis by Year 


The first point of analysis of the published material is 
done with respect to the number of publications over 
the years considered in this chapter, i.e., the period 
between 2008 and 2011. Figure 50.2 displays this infor- 
mation not only for the period analyzed in this chapter, 


Table 50.2 List of symbols used in Table 50.1 


Column       Symbol     Definition
FA-DP        R          Ranking
             RS         Ranking and sharing
             RC         Ranking and crowding
             SRF        Strength raw fitness
             WS         Weighted sum (aggregation)
             Ib         Indicator-based
             PT         Pareto tournaments
             ε          Epsilon dominance
             Tcheb      Tchebycheff aggregation
PM           MS         Master/slave model
             Dis        Distributed model
             Cell       Cellular model
             Hyb        Hybrid model
PFC          CPF        Centralized Pareto front
             DPF        Distributed Pareto front
PP           Seq        Sequential algorithm
             GPU        Graphics processing unit
Topology     A2A        All-to-All
             Rand       Random
             Isol       Isolated
             Eucl       Euclidean
             Hier       Hierarchical
             Hyb        Hybrid topology
Programming  PVM        Parallel virtual machine
             MPI        Message Passing Interface
             MPICH-G2   An MPI implementation for grid computing
             DEVS       Discrete event system
             SOA        Service-oriented architecture
             OpenMP     Open multiprocessing
             CUDA       Compute unified device architecture
             OpenMOLE   Open MOdeL Experiment
Description  MS         Master/slave
             MO         Multiobjective
             SO         Single-objective
             DE         Differential evolution
             dNSGA-II   Distributed NSGA-II
             Tech.      Techniques
             Async.     Asynchronous
             Het.       Heterogeneous
-            N/A        Not available


but also for the period 1993–2005 presented in [50.14].
The trend is fairly clear: this research topic has
remained active during the last few years. Indeed, if one
compares this evolution with that presented in [50.14] up
to 2006, where the highest number of works per year was
10, it can be seen that the published material has
doubled (more than 20 publications/year in 2009, 2010,
and 2011). Despite the relative lack of novel, attractive
approaches in the field, parallelism remains a powerful
tool in the EMO community because of one major factor:
the optimization problems addressed require execution
times to be reduced to affordable values. This is
reinforced by the current availability of cheap parallel
computing platforms such as multicore processors and,
lately, GPUs. Indeed, the keyword multicore is the second
most frequent entry in column PP of Table 50.1.


50.3.2 Analysis of the Parallel Models 


In this section, the different contributions are analyzed 
from the point of view of the characteristics of the 
parallel model. We will pay particular attention to the 
columns FA-DP, PM, PFC, and Topology. 

The fitness assignment and diversity preservation 
is a major issue in parallel MOEAs because, in many 
cases, the Pareto front is spread between different sub- 
algorithms (especially in the distributed models). The 
management of optimal solutions (via fitness assign- 
ment) and how they are distributed along the Pareto 
front (diversity) deserves a brief review. The FA-DP 
column shows that the Ranking and Crowding mech- 
anisms inherited from the most widely used algorithm 
in the area, namely NSGA-II [50.95], are also the most
present in the literature, since NSGA-II is the
base algorithm for many of these parallel MOEAs.
In the case of the distributed and cellular models, it
is worth mentioning that Crowding is applied locally,
i.e., diversity is kept within the same subalgorithm.
If no advanced mechanism is devised to partition the 
search space (such as in [50.96,97]), the algorithm 
will be accepting/discarding solutions that should prob- 
ably be in the same region as those computed by 
the other subalgorithm components. The same happens
with classical FA-DP methods such as the strength raw
fitness or the indicator-based approaches (SRF and Ib in
column FA-DP, respectively). As a final note, we strongly
believe that algorithms based on decomposition such as
MOEA/D [50.98] are especially well suited to profit from
parallel platforms. Indeed, they are based on decomposing
the multiobjective problem into a number of scalar
subproblems that are distributed along the Pareto front.
Therefore, the partition of the search space is
implicitly done and they can take full advantage of
multicore processors. The works [50.80, 81] follow this
line of research.

Fig. 50.2 Number of publications on parallel MOEAs
grouped by the year of publication in the periods
1993–2005 (after [50.14]) and 2008–2011 (this chapter)
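
To make the decomposition argument more concrete, the sketch below spreads a set of weight vectors (one per scalar subproblem) evenly for a bi-objective problem and splits them into contiguous groups, one per worker. It is only an illustration under stated assumptions: the helper names are hypothetical, it is not the MOEA/D implementation of the cited works, and the optimization of each subproblem is omitted.

```python
def uniform_weights(n_subproblems):
    """Evenly spaced weight vectors (w, 1 - w) for a bi-objective problem."""
    last = n_subproblems - 1
    return [(i / last, 1.0 - i / last) for i in range(n_subproblems)]


def partition(weights, n_workers):
    """Split the weight vectors into contiguous groups, one per worker."""
    size, extra = divmod(len(weights), n_workers)
    groups, start = [], 0
    for worker in range(n_workers):
        # The first `extra` workers receive one additional subproblem.
        end = start + size + (1 if worker < extra else 0)
        groups.append(weights[start:end])
        start = end
    return groups


weights = uniform_weights(11)
groups = partition(weights, 4)
for worker, group in enumerate(groups):
    print(worker, group)
```

Because neighboring weight vectors target neighboring regions of the front, contiguous groups give each worker its own region without any explicit partitioning of the search space.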

If one analyzes the usage of the different paral- 
lel models in the revised publications (Fig. 50.3), i. e., 
MS, distributed (Dis), and cellular (Cell), several clear 
conclusions can be drawn. First, despite the simplic- 
ity of the MS model, it appears in almost half of the 
related literature (45%). Multiobjective optimization 
problems are becoming more and more complex and 
demand high-end computational resources, which makes
this approach very suitable in this context. Indeed, the
underlying search model remains unchanged (entirely
located at the master process), because authors are
usually only interested in speeding up the computations.
Second, the distributed models are still receiving much
attention from the multiobjective community (47% of
the analyzed publications use this model) in the quest
for engineering an improved algorithm that reaches the
Pareto fronts in a more effective way (not only reducing
execution times). The promising results published with
their single-objective counterparts [50.7] have pushed
forward the research in this area. The major issue with
distributed models arises with the difficult management
of the Pareto front, as stated in Sect. 50.2.1 and as will
be discussed below. Finally, a special comment about
the cellular model: even though it is yet to be fully
exploited in the literature, its percentage of use has
doubled since the previous literature review in 2006
[50.14]. The point is that these algorithms are usually
executed sequentially, with no parallelism at all, because
they were originally targeted to massively parallel
machines, and this kind of machine fell into disuse.
There are, of course, some exceptions such as [50.46],
where a cellular-like MOEA for aerodynamic optimization
is deployed on a cluster of computers.

Fig. 50.3 Percentage of use of the different parallel
MOEA models in the reviewed literature

The PFC column is a hot topic in parallel MOEAs. 
Handling the nondominated solutions found during the 
search, when the search is distributed in probably sep- 
arate processors, has promoted and is still promoting 
fundamental research within the community. The most 
widely used strategy, however, is to keep a central 
pool of nondominated solutions (CPF), that is, there is 
one single front. This approach appears in 55% of the 
analyzed publications and totally matches both the mas- 
ter/slave and the cellular parallel models. This design 
option is straightforward and makes common sense. 

Almost the same happens with the DPF strategy 
(45% of the papers), which is mostly used with the dis- 
tributed model of parallel MOEAs: the Pareto front is 
approximated separately for each of the subalgorith- 
mic components during the search, only merged into 
one single front at the end of the exploration. In gen- 
eral, all DPF strategies are complemented with eventual 
CPF phases, which allows the search of the different 
subalgorithms to overlap [50.11]. A couple of
exceptions are to be found among the reviewed
literature, in which a distributed computation of the
Pareto front is endowed with a fully centralized Pareto
front. In [50.50], a distributed MOEA has one single
Pareto front that is computed with several isolated
islands.
Each island uses a weighted sum approach (with differ- 
ent weights) and is targeted to one region of the search 
space. The second exception appears in [50.80, 81], in 
which a multithreaded parallelization of MOEA/D is 
presented. The Pareto front is stored in global mem- 
ory and the different threads are in charge of separate 
groups of weight combinations. 
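
The final merging step of a DPF strategy, in which the fronts approximated separately by each island are combined into one single front, can be sketched as follows. This is a minimal illustration assuming minimization and objective vectors stored as tuples; the function names are hypothetical and no archive size limit or diversity mechanism is applied.

```python
def dominates(z1, z2):
    """True if z1 Pareto-dominates z2 (minimization)."""
    no_worse = all(a <= b for a, b in zip(z1, z2))
    strictly_better = any(a < b for a, b in zip(z1, z2))
    return no_worse and strictly_better


def nondominated(points):
    """Keep only the points not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def merge_fronts(island_fronts):
    """Gather the local fronts of all islands into one single front."""
    merged = []
    for front in island_fronts:
        for point in front:
            if point not in merged:  # drop exact duplicates
                merged.append(point)
    return nondominated(merged)


island_fronts = [
    [(1.0, 5.0), (2.0, 3.0)],
    [(2.0, 3.0), (3.0, 3.5), (4.0, 1.0)],
]
front = merge_fronts(island_fronts)
print(front)
```

In the example, the point (3.0, 3.5) is nondominated within its own island but is discarded after merging, which illustrates the wasted effort caused by overlapping islands.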

The final aspect under analysis in this section is the 
interconnection topology of the different components of 
the parallel algorithms. The Star topologies are widely 
used in two scenarios: (i) master/slave models and (ii) 
in distributed models with periodic gathering opera- 
tions required to generate a single Pareto front. A star 
topology is able to capture the idea of topologies with 
a central master that delivers tasks to a set of worker 
nodes and this is why it is so popular in these two
cases. The column Topology in Table 50.1 also
reveals that All-to-All (A2A) and Random topologies
exist in the literature. The former enables the
different components of the parallel algorithm to be
tightly coupled, thus quickly spreading the nondominated
solutions found for a faster convergence toward the
optimal Pareto front. The latter implies that the genetic
material may take longer to reach all the algorithmic
components, thus promoting diversity.
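
The three topologies discussed above can be represented as neighbor lists, as in the following sketch. The representation (a dictionary mapping each node to the nodes it sends solutions to), the function names, and the fixed out-degree of the random topology are assumptions made for illustration only.

```python
import random


def star(n):
    """Node 0 is the master/collector; every other node talks only to it."""
    topology = {0: list(range(1, n))}
    for i in range(1, n):
        topology[i] = [0]
    return topology


def all_to_all(n):
    """Every node sends its solutions to every other node."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


def random_topology(n, degree, seed=0):
    """Every node sends to `degree` other nodes chosen at random."""
    rng = random.Random(seed)
    return {
        i: rng.sample([j for j in range(n) if j != i], degree)
        for i in range(n)
    }


print(star(4))
print(all_to_all(3))
print(random_topology(5, 2))
```

With all-to-all links, a new nondominated solution reaches every component in one migration step, whereas with few random links it needs several steps, which is the source of the additional diversity.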


50.3.3 Review of the Software 
Implementations 


This section mainly summarizes the contents of the
column Programming in Table 50.1, in which a note on
the implementation of the algorithms is given. A quick
look at the items of the column clearly shows that the
combination of C/C++ as the programming language and
MPI (Message Passing Interface) [50.99] as the
technology for enabling the parallel communication
between the different components of the parallel
algorithms is the preferred option. This can be explained
by the strong engineering background of most of the MOPs
addressed (and of the researchers), a field in which
C/C++ has had a dominant position for many years. Indeed,
C/C++ allows researchers to include very low-level
routines (even assembler code) that enable full control
of all parts of their applications. MPI, in turn, is
a standard (not just a library) for which many
implementations exist (MPICH, LAM, etc.), so its use
helps to ensure correctness and efficiency.


Despite this clear fact, only two novel, relevant
trends on this topic that are worth mentioning in detail
can be found. On the one hand, even though clusters of
computers are able to provide researchers with large
computational power, there are MOPs that require still
more resources. These resources can only be supplied by
grid computing platforms [50.100]. This has promoted the
parallelization of EMO algorithms with grid-enabled
technologies such as MPICH-G2 [50.40] or MatGrid [50.34].
On the other hand, there already are several seminal
works on the parallelization of multiobjective optimizers
on GPUs, as stated in Sect. 50.3.1. To the best of our
knowledge, only implementations with C and CUDA
(compute unified device architecture) [50.101] have
been proposed [50.66, 93], but nowadays other
opportunities have also emerged such as, for example,
OpenCL [50.102].


50.3.4 Main Application Domains 


One of the main reasons, if not the main one, for
the popularity of MOEAs is their success in solv-
ing real-world problems. Parallel EMO algorithms are 
no exception. As a consequence, the variability in the 
application domains is very large, which makes the 
task of classification rather difficult. By partially fol- 
lowing the categorization proposed in [50.18], three 
main areas of application are distinguished: engineer- 
ing, industrial, and scientific. Figure 50.4 summarizes 
the percentage of revised publications that fall into 
each area. Besides these three categories, we have also
displayed in this figure a fourth item devoted to
benchmarking. The latter is not an application, but it
appears frequently in the reviewed literature. There are
well-established testbeds such as Zitzler-Deb-Thiele
(ZDT) [50.103], Deb-Thiele-Laumanns-Zitzler (DTLZ)
[50.104], or the Walking Fish Group (WFG) [50.105], that
have been widely used as a comparison basis for
introducing new algorithmic proposals.

Fig. 50.4 Application domains

Among the real-world applications, engineering ap- 
plications are, by far, the most popular domain within 
the parallel EMO literature (also in the entire EMO 
field), principally because they usually have suitable 
mathematical models for this kind of algorithms. In- 
deed, Fig. 50.4 shows that almost 35% of the papers 
analyzed address an optimization problem from the 
engineering domain. Several relevant works among 
those analyzed are devoted to aerodynamic shape opti- 
mization [50.27, 30, 46], reconfiguration of operational 
satellite constellations [50.41], or the combustion con- 
trol for different types of vehicles [50.85, 91]. 

The second place in terms of popularity is occupied
by the industrial applications, which appear in 21% of
the papers reviewed. These applications are related to
the fields of manufacturing, scheduling, and management.
Very interesting problems have been addressed in this
industrial domain, such as the optimization of sonic
crystal attenuation properties [50.52], the Camembert
cheese ripening process [50.78], and evaluation of the
input tax to regulate pollution from agricultural
production [50.65].

Scientific applications, the third category of
real-world applications analyzed (16% of the papers),
group optimization problems in bioinformatics, chemistry,
and computer science. The most successful applications
here are devoted to the bioinformatics and chemistry
fields, in problems on molecular docking [50.33], protein
design [50.40], drug design [50.56], and phylogenetic
inference [50.69, 70]. In our view, the reason these
applications (either engineering, industrial or
scientific) have been tackled with parallel MOEAs is
precisely the cost of the tasks involved in computing
their objective functions. When a new enhanced parallel
search model is sought, authors usually rely on
benchmarking functions instead. Indeed, as stated above,
30% of the reviewed papers (Fig. 50.4) use these testbeds
either exclusively [50.61, 68, 82, 94] or prior to
solving a real-world application [50.5, 20, 51].


50.4 Conclusions and Future Works


50.4.1 Summary


In this chapter, we have carried out a comprehensive
survey of the literature concerning parallel MOEAs
since 2008/2009, the years when two of the most
well-known comprehensive surveys were published.
We have first described the existing parallel models
for MOEAs, distinguishing between those specifically
targeted at EAs and those aimed at capturing the
essence of parallel metaheuristics in general. Based
on the former models (since we are interested in
surveying parallel MOEAs), more than 80 relevant
papers have been carefully analyzed (many dozens
more were studied but left out because of their little
relevance to this survey). Fundamental aspects such as
the fitness assignment and the diversity preservation,
the parallel model used, the management of the
approximated Pareto front, the underlying parallel
platform used (if any), and the communication topology
of the algorithms have been reviewed. Their main
application domains have been gathered and structured
into engineering, industrial, and scientific real-world
multiobjective problems.


Despite the lack of new, attractive ideas in the area,
this survey has revealed that the research on parallel
MOEAs is still moving forward for two main reasons.
On the one hand, researchers think of parallelism not
only as a way to speed up computations, but also as
a strategy to enhance the search engines of the
algorithms. On the other hand, the computational demands
of many multiobjective problems (dimensionality,
uncertainty, simulations, etc.) mean that parallelism
is the only suitable option to address them.

In the following section, some topics for future re- 
search for engineers interested in parallel MOEAs are 
outlined. In the authors’ opinion, these topics merit par- 
ticular attention so that the current state-of-the-art can 
be improved. 


50.4.2 Future Trends 


We have structured the future trends section as 
a bottom-up approach, proposing research lines for par- 
allel MOEAs that range from low-level algorithmic 
details to more complex enhanced strategies at a higher 
level. We will also point out different studies that are
missing in the literature and that, in our opinion, once
completed, might have a great impact within the
community.

Let us start with one of the main issues of parallel
MOEAs: in the end, the algorithms have to compute one
single approximation to the Pareto front, that is, they
have an inherently centralized structure. When several
subalgorithms cooperate to approximate such a Pareto
front (i.e., the distributed models), they should
coordinate themselves somehow in order to effectively
sample different regions of the front and avoid overlaps.
These overlaps are the main reason for the distributed
algorithms being outperformed by centralized approaches
(such as the master/slave).
This issue has been partially addressed in the litera- 
ture [50.96, 97] with some degree of success. However, 
these cited results have not been widely used yet and 
new advances are needed. 

We propose here two lines of research that might 
help the distributed models of parallel MOEAs to over- 
come this issue. The first one is based on designing 
a fully distributed diversity preservation method (e.g., 
a distributed crowding). The research question here is 
whether it is possible to devise a density estimator that 
considers both local and global information from the 
other components of the distributed MOEA. We think 
so. Instead of trying to allocate a given portion of the 
Pareto front to each island, these islands should period- 
ically broadcast a list with the objective values of their 
local solutions (but not the decision variables). Then, 
when checking whether a solution is to be stored in its 
local Pareto front, the island has to consider both the 
local information and the global information received 
from the other islands. To the best of our knowledge, 
such a mechanism does not exist in the literature. The 
second proposal is to use rough set theory [50.106] to 
effectively partition the search space and allocate dif- 
ferent portions to different components of the parallel 
MOEA. There does exist a preliminary work [50.107] 
that uses a multiobjective simulated annealing, and


51. Many-Objective Problems: Challenges and Methods


Antonio López Jaimes, Carlos A. Coello Coello 


This chapter presents a short review of the state- 
of-the-art efforts for understanding and solving 
problems with a large number of objectives
(usually known as many-objective optimization
problems). The first part of the chapter
presents the current studies aimed at discovering
the sources that make a multiobjective optimiza- 
tion problem (MOP) harder when more objectives 
are added, degrading in this way, the performance 
of a multiobjective evolutionary algorithm (MOEA). 
Next, some of the most relevant techniques de- 
signed to deal with MOPs are presented and 
categorized.


51.1 Background


Since the first implementation of an MOEA in the mid- 
1980s [51.1], a wide variety of new MOEAs have been 
proposed, gradually improving in both their effective- 
ness and efficiency to solve MOPs [51.2]. However, 
most of these algorithms have been evaluated and ap- 
plied to problems with only two or three objectives, in 
spite of the fact that many real-world problems have 
more than three objectives [51.3–6].

Recent experimental [51.7–9] and analytical [51.10,
11] studies have shown that MOEAs based on Pareto
optimality [51.12] scale poorly in MOPs with a high
number of objectives (4 or more). These MOPs are
usually known in the community as many-objective
problems. Although those scalability issues seem mainly
to affect Pareto-based MOEAs, as we will see later in
this chapter, optimization problems with a large number
of objectives introduce some difficulties common to any
other multiobjective optimizer.

The goal of this chapter is to present a general 
view of the difficulties posed by many-objective prob- 
lems for Pareto-based MOEAs. Specifically, we present 
a review of the potential sources of difficulty currently 
found in the specialized literature. Likewise, we present 
a brief review of the current proposals to deal with these 
sources of difficulty. These proposals are classified into 
five classes. Among the most common approaches to 
deal with MOPs, we can find the use of preference 
relations to further rank nondominated solutions, the 
removal of redundant objectives during or after the 
search, and the incorporation of preference information. 
Finally, at the end of the chapter some future research 
paths are outlined.


51.2 Basic Concepts and Notation


In this section, we will introduce the concepts and
notation that will be used throughout the rest of the chapter.
Since some of these proposals are based on conflict 
information among the objectives, some definitions of 
conflict are also provided. 


51.2.1 Multiobjective Optimization Problems 


Definition 51.1 Multiobjective optimization problem
An MOP is defined as


$$\text{Minimize } \mathbf{f}(\mathbf{x}) = [f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_k(\mathbf{x})]^T, \quad \text{subject to } \mathbf{x} \in X. \tag{51.1}$$


The vector $\mathbf{x} \in \mathbb{R}^n$ is formed by $n$ decision
variables representing the quantities for which values
are to be chosen in the optimization problem. The
feasible set $X \subseteq \mathbb{R}^n$ is implicitly determined by
a set of equality and inequality constraints. The vector
function $\mathbf{f}: X \to \mathbb{R}^k$ is composed of $k \ge 2$ scalar
objective functions $f_i: X \to \mathbb{R}$ ($i = 1, \ldots, k$). In
multiobjective optimization, the sets $\mathbb{R}^n$ and $\mathbb{R}^k$
are known as the decision variable space and objective
function space, respectively. The image of $X$ under the
function $\mathbf{f}$ is a subset of the objective function
space denoted by $Z = \mathbf{f}(X)$ and referred to as the
feasible set in the objective function space.

In order to define precisely the multiobjective
optimization problem stated in Definition 51.1, we have
to establish the meaning of minimization in $\mathbb{R}^k$. That
is to say, we need to define how vectors
$\mathbf{z} = \mathbf{f}(\mathbf{x}) \in \mathbb{R}^k$ have to be compared for different
solutions $\mathbf{x} \in \mathbb{R}^n$. In single-objective optimization
the relation less than or equal ($\le$) is used to compare
the scalar objective values. By using this relation there
may be many different optimal solutions $\mathbf{x} \in X$, but
only one optimal value $f_{\min} = \min\{f(\mathbf{x}) \mid \mathbf{x} \in X\}$
since the relation $\le$ induces a total order in $\mathbb{R}$
(i.e., every pair of solutions is comparable, and thus,
we can sort solutions from the best to the worst one). In
contrast, in multiobjective optimization problems, there
is no canonical order of $\mathbb{R}^k$, and thus, we need weaker
definitions of order to compare vectors in $\mathbb{R}^k$.

In multiobjective optimization, the Pareto domi- 
nance relation is usually adopted. This relation was 
originally proposed by Edgeworth in 1881 [51.13], but 
generalized by the French-Italian economist Pareto in 
1896 [51.12]. 


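Definition 51.2 Pareto dominance relation
We say that a vector $\mathbf{z}^1$ dominates vector $\mathbf{z}^2$,
denoted by $\mathbf{z}^1 \prec \mathbf{z}^2$, if and only if


$$\forall i \in \{1, \ldots, k\}: z_i^1 \le z_i^2 \tag{51.2}$$

and

$$\exists i \in \{1, \ldots, k\}: z_i^1 < z_i^2. \tag{51.3}$$


If $\mathbf{z}^1 = \mathbf{z}^2$ or $z_i^1 > z_i^2$ for some $i$, then we say
that $\mathbf{z}^1$ does not dominate $\mathbf{z}^2$ (denoted by
$\mathbf{z}^1 \nprec \mathbf{z}^2$). Thus, to solve an MOP, we have to find
those solutions $\mathbf{x} \in X$ whose images,
$\mathbf{z} = \mathbf{f}(\mathbf{x})$, are not dominated by any other vector in
the feasible space. It is said that two vectors, $\mathbf{z}^1$
and $\mathbf{z}^2$, are mutually nondominated vectors if
$\mathbf{z}^1 \nprec \mathbf{z}^2$ and $\mathbf{z}^2 \nprec \mathbf{z}^1$.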
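
The dominance test of Definition 51.2 translates directly into code. The following minimal sketch assumes minimization and objective vectors of equal length; the function names are chosen for this example only.

```python
def dominates(z1, z2):
    """True if z1 dominates z2: no worse everywhere, strictly better somewhere."""
    no_worse_everywhere = all(a <= b for a, b in zip(z1, z2))
    strictly_better_somewhere = any(a < b for a, b in zip(z1, z2))
    return no_worse_everywhere and strictly_better_somewhere


def mutually_nondominated(z1, z2):
    """True if neither vector dominates the other."""
    return not dominates(z1, z2) and not dominates(z2, z1)


print(dominates((1.0, 2.0), (1.0, 3.0)))              # first vector is better in f2
print(dominates((1.0, 2.0), (1.0, 2.0)))              # equal vectors do not dominate
print(mutually_nondominated((1.0, 3.0), (2.0, 1.0)))  # a trade-off pair
```

Note that equal vectors are mutually nondominated under this definition, since neither is strictly better in any objective.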


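Definition 51.3 Pareto optimality
A solution $\mathbf{x}^* \in X$ is Pareto optimal if there does
not exist another solution $\mathbf{x} \in X$ such that
$\mathbf{f}(\mathbf{x}) \prec \mathbf{f}(\mathbf{x}^*)$.


Definition 51.4 Pareto optimal set
The Pareto optimal set, $P_{\mathrm{opt}}$, is defined as


$$P_{\mathrm{opt}} = \{\mathbf{x} \in X \mid \nexists\, \mathbf{y} \in X : \mathbf{f}(\mathbf{y}) \prec \mathbf{f}(\mathbf{x})\}. \tag{51.4}$$


Definition 51.5 Pareto front
For a Pareto optimal set $P_{\mathrm{opt}}$, the Pareto front
$PF_{\mathrm{opt}}$ is defined as


$$PF_{\mathrm{opt}} = \{\mathbf{z} = [f_1(\mathbf{x}), \ldots, f_k(\mathbf{x})]^T \mid \mathbf{x} \in P_{\mathrm{opt}}\}. \tag{51.5}$$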
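
For a finite set of candidate solutions, the Pareto optimal set and the Pareto front of Definitions 51.4 and 51.5 can be computed by brute force, as in the sketch below. The toy bi-objective function and the candidate grid are assumptions made for illustration; on a continuous feasible set the result is only an approximation of the true sets.

```python
def f(x):
    """Hypothetical bi-objective function of one variable: (x^2, (x - 2)^2)."""
    return (x ** 2, (x - 2) ** 2)


def dominates(z1, z2):
    """Pareto dominance for minimization."""
    return all(a <= b for a, b in zip(z1, z2)) and any(
        a < b for a, b in zip(z1, z2)
    )


def pareto_optimal_set(candidates, objective):
    """Candidates whose image is not dominated by the image of any other."""
    images = {x: objective(x) for x in candidates}
    return [
        x for x in candidates
        if not any(dominates(images[y], images[x]) for y in candidates)
    ]


candidates = [-1.0, 0.0, 1.0, 2.0, 3.0]
p_opt = pareto_optimal_set(candidates, f)
pf_opt = [f(x) for x in p_opt]
print(p_opt)
print(pf_opt)
```

Here the candidates -1.0 and 3.0 are discarded because moving toward the interval between 0 and 2 improves both objectives at once.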


In decision variable space, these vectors are referred 
to as decision vectors of the Pareto optimal set, while in 
objective space, they are called objective vectors of the 
Pareto optimal set. In practice, the goal of MOEAs is 
to find the best approximation set of the Pareto opti- 
mal front. An approximation set is a finite subset of Z 
composed of mutually nondominated vectors and is
denoted by $PF_{\mathrm{approx}}$. Currently, it is well accepted
that the best approximation set is determined by the
closeness to the Pareto optimal front, and the spread
over the entire Pareto optimal front [51.2, 14, 15].

A common approach to deal with multiobjective
optimization problems is to formulate them as
single-objective optimization problems by means of a kind
of function called a scalarizing function.


Definition 51.6 Scalarizing function

A scalarizing function is a parameterized function
$s: \mathbb{R}^k \to \mathbb{R}$. Thus, the multiobjective problem is
transformed into the following scalar problem:


$$\text{Minimize } s(\mathbf{z}), \quad \text{subject to } \mathbf{z} \in Z. \tag{51.6}$$


It is worth noting, however, that scalarizing func- 
tions generate one point at a time (instead of several, as 
happens when using the definition of Pareto optimality). 
A common scalarizing function is based on the Chebyshev
distance ($L_\infty$ metric) [51.16, 17].


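Definition 51.7 Weighted Chebyshev scalarizing
function
The weighted Chebyshev scalarizing function (or
Chebyshev function for short) is defined by

$$s_\infty(\mathbf{z}, \mathbf{z}^{\mathrm{ref}}) = \max_{i=1,\ldots,k} \lambda_i \, |z_i - z_i^{\mathrm{ref}}|, \tag{51.7}$$

where $\mathbf{z}^{\mathrm{ref}}$ is a reference point,
$\boldsymbol{\lambda} = [\lambda_1, \ldots, \lambda_k]$ is a vector of weights such that
$\forall i\ \lambda_i \ge 0$ and, for at least one $i$, $\lambda_i > 0$.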
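
The weighted Chebyshev function can be sketched in a few lines. The example below evaluates it on a small, made-up set of objective vectors with the origin as reference point, and shows that each weight vector selects one point at a time, as noted above. All names and values are assumptions for illustration.

```python
def chebyshev(z, z_ref, weights):
    """Weighted Chebyshev distance from z to the reference point z_ref."""
    return max(w * abs(zi - ri) for zi, ri, w in zip(z, z_ref, weights))


def best_point(points, z_ref, weights):
    """Objective vector that minimizes the weighted Chebyshev function."""
    return min(points, key=lambda z: chebyshev(z, z_ref, weights))


points = [(3.0, 1.0), (2.0, 2.0), (1.0, 4.0)]  # mutually nondominated vectors
z_ref = (0.0, 0.0)

print(best_point(points, z_ref, (0.5, 0.5)))  # balanced weights
print(best_point(points, z_ref, (0.8, 0.2)))  # emphasis on the first objective
```

Changing the weights moves the minimizer along the set of nondominated vectors, which is how decomposition-based methods cover the whole front with many scalar subproblems.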


51.2.2 Notions of Conflict Among Objectives 


One important condition of a multiobjective problem
is the conflict among its objectives. If the objectives
have no conflict among them, then we could solve the
problem by optimizing each objective function
independently. Nonetheless, it has been found that in
some problems, although a conflict exists elsewhere, some
objectives behave in a nonconflicting manner. Although
different authors have proposed definitions for conflict
(nonconflict) among objectives [51.18–21], in this
chapter we only present the definitions of conflict
(nonconflict) that are relevant to the discussion.


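Definition 51.8

Let $S_X$ be a subset of $X$; then, according to Carlsson
and Fullér, two objectives can be related in the
following ways (assuming minimization):


1. $f_i$ is in conflict with $f_j$ on $S_X$ if
   $f_i(\mathbf{x}^1) \le f_i(\mathbf{x}^2)$ implies $f_j(\mathbf{x}^1) \ge f_j(\mathbf{x}^2)$
   for all $\mathbf{x}^1, \mathbf{x}^2 \in S_X$.
2. $f_i$ supports $f_j$ on $S_X$ if
   $f_i(\mathbf{x}^1) \ge f_i(\mathbf{x}^2)$ implies $f_j(\mathbf{x}^1) \ge f_j(\mathbf{x}^2)$
   for all $\mathbf{x}^1, \mathbf{x}^2 \in S_X$.
3. $f_i$ and $f_j$ are independent on $S_X$, otherwise.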


In cases 2 and 3, those objectives are also called
nonconflicting objectives. When $S_X = X$, it is said that
$f_i$ is in conflict with (or supports) $f_j$ globally. However, in
many MOPs the relation among the objectives changes 
when comparing different subsets of X. Figure 51.1 
shows an example in which two functions are in con- 
flict in some subsets of X, while in others, they support 
each other. 
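
The three relations of Definition 51.8 can be checked on a finite sample of a subset of the feasible space, as sketched below. This is an illustration under stated assumptions: the two objective functions are made up, the function name is hypothetical, and a check on a sample can only refute, not prove, a relation on the whole subset.

```python
from itertools import product


def relation(f_i, f_j, sample):
    """Classify f_i versus f_j on a finite sample (minimization)."""
    pairs = list(product(sample, repeat=2))
    conflict = all(f_j(a) >= f_j(b) for a, b in pairs if f_i(a) <= f_i(b))
    support = all(f_j(a) >= f_j(b) for a, b in pairs if f_i(a) >= f_i(b))
    if conflict:
        return "conflict"
    if support:
        return "support"
    return "independent"


def f1(x):
    return x ** 2


def f2(x):
    return (x - 2) ** 2


print(relation(f1, f2, [0.0, 0.5, 1.0, 1.5, 2.0]))  # f1 grows while f2 shrinks
print(relation(f1, f2, [2.0, 2.5, 3.0, 3.5, 4.0]))  # both grow together
print(relation(f1, f2, [0.0, 1.0, 2.0, 3.0, 4.0]))  # mixed behavior
```

The same pair of objectives is in conflict on one subset and supportive on another, which is the situation depicted in Fig. 51.1.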

Nonconflicting objectives are also known as 
nonessential or redundant objectives because, as 
pointed out by Gal and Hanne [51.22], when a non- 
conflicting objective is removed from the original set of 
objectives, the resulting Pareto front does not change. 
Based on the notion of nonessential objectives, Brock- 
hoff and Zitzler [51.21] proposed a conflict definition 
that verifies whether the Pareto dominance relation 
changes when some objectives are removed, or not. The 
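Pareto dominance relation induced by a given set of
objectives, $F \subseteq \{f_1, f_2, \ldots, f_k\}$, is defined as


$$\preceq_F = \{(\mathbf{x}, \mathbf{y}) \mid \mathbf{x}, \mathbf{y} \in X \text{ and } \forall f_i \in F : f_i(\mathbf{x}) \le f_i(\mathbf{y})\}.$$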


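Definition 51.9

Let $F_1, F_2 \subseteq \mathcal{F}$ be two subsets of objectives, where
$\mathcal{F} = \{f_1, f_2, \ldots, f_k\}$ is the entire set of objectives.
Then, we call $F_1$ nonconflicting with $F_2$ iff
$(\preceq_{F_1} \subseteq \preceq_{F_2}) \wedge (\preceq_{F_2} \subseteq \preceq_{F_1})$.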


In other words, $F_1$ and $F_2$ are called nonconflicting
if and only if the corresponding relations $\preceq_{F_1}$ and
$\preceq_{F_2}$ are identical, but not necessarily $F_1 = F_2$.
The nonconflicting definition is useful since if $\mathcal{F}$
and $F' \subseteq \mathcal{F}$ are nonconflicting, then we can replace
$\mathcal{F}$ with $F'$ and obtain the same Pareto optimal front.
The objectives in $F'$ are then called essential
objectives, whereas the objectives in $\mathcal{F} \setminus F'$ are
known as nonessential or redundant objectives.

Fig. 51.1 Two objective functions can be in conflict in
some subsets of the feasible space, and can be supportive
in other subsets (regions labeled Conflict and Support)

In practice, however, it is useful to allow a certain 
extent of change on the Pareto front when an objec- 
tive is omitted in order to define degrees of nonconflict 
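among objectives. In this direction, Brockhoff and
Zitzler proposed to use the additive $\varepsilon$-dominance
indicator to measure the change between two dominance
relations. The $\varepsilon$-dominance relation induced by
a set $F$ is defined by


$$\preceq_F^{\varepsilon} = \{(\mathbf{x}, \mathbf{y}) \mid \mathbf{x}, \mathbf{y} \in X \text{ and } \forall f_i \in F : f_i(\mathbf{x}) - \varepsilon \le f_i(\mathbf{y})\}.$$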


Definition 51.10 

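Let $F_1, F_2 \subseteq F$ be two subsets of objectives, where $F$
is the entire set of objectives. Then, we call $F_1$ $\delta$-nonconflicting
with $F_2$ iff $(\preceq_{F_1} \subseteq \preceq_{F_2}^{\delta}) \wedge (\preceq_{F_2} \subseteq \preceq_{F_1}^{\delta})$,
where $\preceq^{\delta}$ denotes the $\varepsilon$-dominance relation with $\varepsilon = \delta$.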


In this case, if an objective subset $F' \subseteq F$ is $\delta$-nonconflicting
with $F$, then we can omit all objectives
in $F \setminus F'$ without causing a larger error than $\delta$ in the
omitted objectives.
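
As a minimal sketch of how this error could be measured on a finite set of objective vectors (minimization assumed), the following Python code computes the smallest $\delta$ for which a subset of objective indices is $\delta$-nonconflicting with the full set. The function names `weakly_dominates` and `delta_error` are hypothetical and are not taken from the cited work; the sketch relies on the observation that, for a subset of the full objective set, only the inclusion of the subset relation in the $\delta$-relaxed full relation needs to be checked.

```python
from itertools import product


def weakly_dominates(x, y, objectives):
    """True if x is no worse than y on every objective index given (minimization)."""
    return all(x[i] <= y[i] for i in objectives)


def delta_error(points, subset):
    """Smallest delta such that `subset` is delta-nonconflicting with all objectives.

    For every ordered pair (x, y) related under the reduced objective set, the
    full set must relate them up to delta, so delta must cover the largest
    amount by which x exceeds y on any objective.
    """
    k = len(points[0])
    delta = 0.0
    for x, y in product(points, repeat=2):
        if weakly_dominates(x, y, subset):
            delta = max(delta, max(x[i] - y[i] for i in range(k)))
    return delta


# Objective 2 is a copy of objective 0, so dropping it costs nothing;
# keeping only objective 0 hides the conflict with objective 1.
points = [(1.0, 2.0, 1.0), (2.0, 1.0, 2.0)]
print(delta_error(points, [0, 1]))  # error of omitting objective 2
print(delta_error(points, [0]))     # error of omitting objectives 1 and 2
```

This brute-force check is quadratic in the number of points and is meant only to make the definition concrete.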


51.3 Sources of Difficulty to Solve Many-Objective Optimization Problems 


51.3.1 Deterioration of the Search Ability 


A widespread explanation for this problem is based 
on the fact that the proportion of nondominated solu- 
tions (i. e., equally good solutions according to Pareto 
dominance) in a population increases rapidly with the 
number of objectives [51.23, 24]. In order to illustrate 
this condition, Fig. 51.2 shows the nondominated re- 
gions with respect to a given solution z. 

In general, as presented by Farina and Amato [51.23],
the proportion, e, of the search space occupied by mutually
nondominated regions is given by $e = (2^k - 2)/2^k$,
where k is the number of objectives. This proportion
approaches 1 as the number of objectives approaches
infinity.
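
A short numerical check of this expression can help build intuition. The sketch below (the function name `nondominated_fraction` is an illustrative choice) evaluates $e = (2^k - 2)/2^k$ for a few values of k and shows how quickly the value approaches 1.

```python
def nondominated_fraction(k):
    """Fraction of the 2**k orthants around a solution that are mutually nondominated.

    Only two orthants are comparable with the solution: the one that is better
    in every objective and the one that is worse in every objective.
    """
    return (2**k - 2) / 2**k


for k in (2, 3, 5, 10, 20):
    print(k, nondominated_fraction(k))
```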


Fig. 51.2 Example of the increasing proportion of nondominated
solutions: for 2 objectives 1/2 of the search space is composed of
nondominated regions, whereas for 3 objectives 3/4 of the search
space consists of nondominated regions. In general, for k objectives,
$(2^k - 2)/2^k$ of the objective space comprises nondominated regions
(axes: $f_1$, $f_2$; legend: Worst, Better, Equal)


Therefore, since in MOPs with a high number of 
objectives almost all solutions are equivalent, many re- 
searchers have suggested [51.11, 23, 25-28] that in such 
problems, the selection of the appropriate individuals 
for steering the population toward the Pareto optimal 
set gets more difficult. As a result, an MOP gets harder 
to solve as more objectives are added. 

However, as pointed out by Schütze et al. [51.29],
the increase of the number of nondominated individ- 
uals is not a sufficient condition for an increase of 
the hardness of a problem. Specifically, they conclude 
that in a class of uni-modal problems, their diffi- 
culty is marginally increased when more objectives are 
added despite the exponential growth of the propor- 
tion of nondominated solutions with k. Nonetheless, 
they suggest that the hardness increase observed in ex- 
perimental studies might be the result of the addition 
of local optima to the problem as more objectives are 
aggregated. 

Therefore, although the rise of the proportion of 
incomparable solutions does not significantly deter- 
mine the difficulty of an MOP per se, it seems that 
the addition of objectives aggravates some particular 
difficulties observed in the context of 2 or 3 objec- 
tives. This is the case of the so-called dominance 
resistant solutions (DRSs) or outliers [51.14, 30-32]. 
DRSs are solutions with a poor value in at least 
one of the objectives, but with near optimal values 
in the others. In other words, those are nondomi- 
nated solutions, but far from the Pareto optimal front. 
Figure 51.3 shows an example of DRSs in the well- 
known test problem DTLZ2 [51.14]. These kinds of 
solutions represent potential difficulty since, as many 
researchers have pointed out [51.14, 30-32], the number
of DRSs grows as the number of objectives is
increased.


51.3.2 Effectiveness of Crossover Operators 


In a combinatorial class of MOPs, Sato et al. [51.33] 
performed a series of experiments that revealed that solutions
in the variable space become more distant (in
terms of the Hamming distance between binary-encoded
solutions) from each other as more objectives
are added to the problem. In this scenario, the recombination
of two parents close to the Pareto front
might generate an offspring far from the Pareto front 
since a conventional crossover operator might be too 
disruptive. 


51.3.3 Dimensionality of the Pareto Front 


Due to the curse of dimensionality, the number of points 
required to represent accurately a Pareto front increases 
exponentially with the number of objectives. Formally, 
the number of points necessary to represent a Pareto 
front with k objectives and resolution r is bounded by
$O(k r^{k-1})$ [51.34]. This expression is derived assuming
that each solution is contained inside a hypercube
to preserve an even distribution. As can be seen in 
Fig. 51.4, the number of hypercubes determines the 
resolution of the Pareto front, i.e., r is the number of 
hypercubes per dimension. An example of the shortest 
connected and nondegenerated 2-objective Pareto front 
(a straight line) is shown on the left side of Fig. 51.4. 
The figure also shows a bound for the largest Pareto 
front for 2 and 3 objectives. In general, the bounding 
Pareto front is formed by k hyperplanes containing $r^{k-1}$
hypercubes each (see, for example, the 3-objective case
shown on the right side of Fig. 51.4). This way, the maximum
number of points of a 2-objective Pareto front
with resolution r = 6 is $2 \times 6^{2-1} = 12$, whereas for 3
objectives and r = 5 it is $3 \times 5^{3-1} = 75$. Table 51.1 shows
the maximum number of points required to represent 
a Pareto front for different numbers of objectives using 
a resolution of r= 25, which is a conservative num- 
ber considering that a resolution of r = 50 is usually 
used in several studies to obtain 100 solutions in 2- 
objective problems. Notwithstanding, for 5 objectives, 
we would require approximately 2 million points to 
represent a Pareto front with resolution r = 25. There 
are other formulations leading to a similar exponential
expression with respect to k. For example, using
the concept of $\varepsilon$-dominance, Laumanns et al. [51.35]
and Schütze et al. [51.36] give a similar exponential
bound for the size of an approximation of a Pareto
front.


Fig. 51.3 Illustration of some DRSs in problem DTLZ2: although
solutions marked as DRSs seem to be dominated by some solution
in the lower part of the circled solutions, they achieve marginal
improvements in objectives $f_1$ or $f_2$, and therefore, they are nondominated
solutions, despite having poor values in objective $f_3$


Fig. 51.4 Number of points required to represent a Pareto
front with a resolution r, i.e., the number of hypercubes
per dimension


Table 51.1 Bound for the number of points required to represent
a Pareto front with resolution r = 25


k    Points
2    50
4    62 500
5    1 953 125
7    1 708 984 375
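
The bound $k \, r^{k-1}$ and the values quoted for Table 51.1 can be reproduced with a few lines of Python. This is only an arithmetic check under the hypercube assumption described in the text; the function name `pareto_front_points_bound` is an illustrative choice.

```python
def pareto_front_points_bound(k, r):
    """Upper bound k * r**(k - 1) on the points needed for k objectives at resolution r."""
    return k * r ** (k - 1)


# Worked examples from the text and the rows of Table 51.1 (r = 25).
print(pareto_front_points_bound(2, 6))
print(pareto_front_points_bound(3, 5))
for k in (2, 4, 5, 7):
    print(k, pareto_front_points_bound(k, 25))
```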

This poses some difficulties to solve MOPs. The 
most important one is the number of function evaluations
required to deal with a large number of solutions.
This is a serious issue since plenty of real-world
problems (e.g., [51.37-43]), due to time constraints,
have a small budget of function evaluations.
In fact, there is an important research effort
toward designing MOEAs that generate good approx- 
imations of the Pareto front using less than 1000 
function evaluations (e.g., [51.44-47]). Other chal- 
lenges are related to the design of both data struc- 
tures to efficiently manage that number of points, 
and density estimators to achieve an even distribu- 
tion of the solutions along the Pareto front. Unfortu- 
nately, even if we could efficiently obtain an accurate 
approximation of the Pareto front, the selection of 
one solution among such a huge number of solutions
would be a very difficult task for a decision maker
(DM).


51.3.4 Visualization of the Pareto Front 


Clearly, with more than three objectives it is not pos- 
sible to plot the Pareto front as usual. This is a serious 
problem since visualization plays a key role for a proper 
decision-making process. Parallel coordinates [51.48] 
and self-organizing maps [51.49] are some of the meth- 
ods proposed to ease decision making in high dimen- 
sional problems. The reader is referred to Chapters 8 
and 9 of [51.50] for a good review of various visual- 
ization techniques. Nevertheless, more research in the 
many-objective optimization context is still required. 


51.4 Current Approaches to Deal with Many-Objective Problems 


Besides studies about the scalability of Pareto-based
MOEAs, in the current literature we can find several
proposals to overcome those scalability issues.
The most common approaches can be categorized as 
follows: 


1. Adopt or propose a preference relation that yields 
a finer solution ordering than the one yielded by 
Pareto optimality. In other words, these relations 
are able to further rank nondominated solutions. In 
addition, most of these preference relations share 
the property that their optimal set of solutions 
is a subset of the Pareto optimal set. Therefore, 
these techniques can also be used as a remedy to 
cope with the dimensionality of Pareto fronts in 
MOPs. 

2. Reduce the number of objectives of the problem 
during the search process or, a posteriori, once 
an approximation of the Pareto front has been 
found [51.21, 26, 51]. The main goal of these kinds
of reduction techniques is to identify the noncon- 
flicting objectives (at least to a certain extent) in 
order to discard them. 

3. Scalarizing decomposition of an MOP. As de- 
scribed in the previous section, the degradation 
observed on MOEAs when dealing with many- 
objective problems is mainly attributed to the inef- 
ficiency of the Pareto relation in high-dimensional 
spaces. Therefore, methods that do not rely on 
Pareto dominance, like scalarizing decomposition 
methods, have been suggested as an alternative to
deal with many-objective problems. The underlying
idea of these types of methods is to perform a num- 
ber of single-objective searches along different 
search vectors evenly distributed over the objective 
space. Each single-objective search is formulated by 
means of a scalarizing function. This way, the ap- 
proximation of the Pareto front is composed of the 
optima found by every single-objective search. 

4. Incorporation of preference information interac- 
tively throughout the course of the optimization 
process. By incorporating preferences we can cope
with MOPs in two aspects. First, the search can be
focused on the decision maker’s region of interest,
thereby avoiding the evaluation of a huge number
of solutions. Second, the preference relations
usually used in interactive methods help to deal 
with a large number of objectives since they are 
able to rank incomparable nondominated solutions. 

5. Use of specialized recombination operators or 
strategies to control the mating among parents. 
The first approach tries to diminish the disruptive 
effect of recombination operators by regulating the 
proportion in which the traits of each parent con- 
tribute to create the offspring. The second approach 
restricts which individuals can be paired for recom- 
bination, for instance, using the similarity as mating 
criteria or the location in the objective space. 


In the remainder of this section, some of the most 
relevant approaches to deal with many-objective prob- 
lems are presented. 


51.4.1 Preference Relations to Deal with Many-Objective Problems


Bentley and Wakefield [51.52] proposed the average 
ranking (AR) and the maximum ranking (MR) pref- 
erence relations. The AR relation computes, for each 
solution, a different rank considering each objective in- 
dependently. The final rank is obtained by summing 
up the ranks on each objective. In turn, the MR rela- 
tion takes the best rank as the global rank. Clearly, this 
method favors extreme solutions, i.e., solutions with 
high performance in some of the objectives, although 
with poor overall performance. Although it is less ev- 
ident, the average ranking relation also favors extreme 
solutions. 
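
The two rankings can be sketched as follows. This is a minimal illustration that assumes minimization and assigns each solution, per objective, a rank equal to one plus the number of solutions that are strictly better on that objective; this tie-handling rule and the function names are assumptions made here, not details taken from [51.52].

```python
def objective_ranks(points):
    """ranks[s][i] = 1 + number of solutions strictly better than s on objective i."""
    k = len(points[0])
    return [[1 + sum(q[i] < p[i] for q in points) for i in range(k)]
            for p in points]


def average_ranking(points):
    """AR: sum of the per-objective ranks of each solution (lower is better)."""
    return [sum(ranks) for ranks in objective_ranks(points)]


def maximum_ranking(points):
    """MR: best (smallest) per-objective rank of each solution (lower is better)."""
    return [min(ranks) for ranks in objective_ranks(points)]


# Two extreme solutions and one balanced compromise.
points = [(1, 5), (2, 2), (5, 1)]
print(average_ranking(points))
print(maximum_ranking(points))
```

In this small example MR places both extreme solutions ahead of the balanced one, while AR cannot distinguish among the three.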

In the favor relation, proposed by Drechsler
et al. [51.53], a vector $z^1$ is preferred to vector $z^2$ with
respect to the favor relation ($z^1 \prec_{favor} z^2$), if and only if


$|\{i : z_i^1 < z_i^2,\ 1 \leq i \leq k\}| > |\{j : z_j^1 > z_j^2,\ 1 \leq j \leq k\}|$.


In other words, the favored vector is that which outper- 
forms the other one in more objectives. Unfortunately, 
this relation emphasizes extreme solutions. 
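
A direct transcription of this comparison into Python might look as follows (minimization assumed; the function name `favors` is an illustrative choice).

```python
def favors(z1, z2):
    """True if z1 beats z2 in more objectives than z2 beats z1 (minimization)."""
    better = sum(a < b for a, b in zip(z1, z2))
    worse = sum(a > b for a, b in zip(z1, z2))
    return better > worse


# z1 wins two objectives but is much worse in the third one.
print(favors((1, 1, 5), (2, 2, 1)))

# Three vectors that are favored in a cycle: a over b, b over c, c over a.
a, b, c = (1, 2, 3), (2, 3, 1), (3, 1, 2)
print(favors(a, b), favors(b, c), favors(c, a))
```

The second example shows that a pairwise count of this kind can produce cycles, so a ranking built on it needs some additional rule to resolve them.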

The preference order relation (POR), developed by
di Pierro [51.54], is based on the concept of efficiency
of order proposed by Das [51.55], which states that:
A vector $z^*$ is efficient of the order q if it is not dominated
by any other vector in all the $\binom{k}{q}$ objective subsets
of size q.

Based on that definition, it is said that the vector
$z^1$ is preferred to the vector $z^2$ ($z^1 \prec_{por} z^2$), if and only
if, for some integer q and $\forall I \subseteq \{1, 2, \ldots, k\}$ such that
$|I| = q$,


$z_i^1 \leq z_i^2 \ \forall i \in I$, and $\exists j \in I : z_j^1 < z_j^2$.

In other words, if $z^1$ and $z^2$ do not dominate each other,
then the solutions are compared in a lower dimensional
space in order to break the tie.

Sato et al. [51.56] proposed a preference relation to 
control the dominance area of solutions. This method 
controls the degree of expansion or contraction of the 
dominance area by modifying each objective vector z
with the expression


$z'_i = \dfrac{r \cdot \sin(\omega_i + s_i \cdot \pi)}{\sin(s_i \cdot \pi)}, \quad \forall i = 1, 2, \ldots, k$,


where $s \in \mathbb{R}^k$ is a user-defined vector, $r = \|z\|$, and $\omega_i$
is the declination angle between z and the axis of $f_i$.
If the user adopts values $s_i < 0.5$ ($\forall i = 1, 2, \ldots, k$),
the dominance area is expanded and produces
a more fine-grained ranking of solutions which would
strengthen the selection process. Thus, we can say that
vector z is preferred to vector y with respect to the expansion
relation ($z \prec_{expansion} y$), if and only if $z' \prec y'$.
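
The transformation of the objective vector can be sketched in Python as shown below. The sketch assumes non-negative objective values and $0 < s_i < 1$, and computes the declination angle as the arccosine of $z_i / r$; the function name `modify_objectives` is hypothetical. With $s_i = 0.5$ the vector is left unchanged, which is a convenient sanity check.

```python
import math


def modify_objectives(z, s):
    """Return z' with z'_i = r * sin(omega_i + s_i * pi) / sin(s_i * pi)."""
    r = math.sqrt(sum(v * v for v in z))
    if r == 0.0:
        return list(z)
    modified = []
    for z_i, s_i in zip(z, s):
        # Declination angle between z and the axis of objective i
        # (clamped to guard against rounding slightly outside [-1, 1]).
        omega = math.acos(max(-1.0, min(1.0, z_i / r)))
        modified.append(r * math.sin(omega + s_i * math.pi) / math.sin(s_i * math.pi))
    return modified


z = (3.0, 4.0)
print(modify_objectives(z, (0.5, 0.5)))    # s_i = 0.5 reproduces z
print(modify_objectives(z, (0.25, 0.25)))  # s_i < 0.5 changes the vector
```

The preference check of the text is then an ordinary Pareto dominance test applied to the modified vectors.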

Farina and Amato [51.57] proposed an alterna- 
tive relation which takes into account the number of 
improved objectives between two solutions. This relation
employs three quantities, $n_b(x_1, x_2)$, $n_e(x_1, x_2)$ and
$n_w(x_1, x_2)$, which denote the number of objectives in which $x_1$ is better,
equal or worse than $x_2$, respectively. Using these
quantities, the concepts of $(1-k)$-dominance and k-optimality
are defined. A solution $x_1$ $(1-k)$-dominates
$x_2$ if and only if


$n_e(x_1, x_2) < M$ and $n_b(x_1, x_2) \geq \dfrac{M - n_e}{k + 1}$,


where M is the number of objectives and k here denotes a parameter in [0, 1] rather than the number of objectives.


In a similar way to Pareto optimality, a solution x* 
is a k-optimum if and only if there is no x in the decision 
variable space such that x k-dominates x*. 
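
The following sketch counts the better, equal, and worse objectives and applies the two conditions. It assumes minimization, works directly on objective vectors, and treats k as a parameter in [0, 1] with M the number of objectives, so that k = 0 reduces to ordinary Pareto dominance. The function names are illustrative.

```python
def count_better_equal_worse(f1, f2):
    """Return (n_b, n_e, n_w): objectives where f1 is better, equal, or worse than f2."""
    n_b = sum(a < b for a, b in zip(f1, f2))
    n_e = sum(a == b for a, b in zip(f1, f2))
    n_w = len(f1) - n_b - n_e
    return n_b, n_e, n_w


def k_dominates(f1, f2, k):
    """True if n_e < M and n_b >= (M - n_e) / (k + 1), with M = number of objectives."""
    m = len(f1)
    n_b, n_e, _ = count_better_equal_worse(f1, f2)
    return n_e < m and n_b >= (m - n_e) / (k + 1)


print(k_dominates((1, 1, 1), (2, 2, 2), 0))    # plain Pareto dominance
print(k_dominates((1, 1, 3), (2, 2, 2), 0))    # incomparable when k = 0
print(k_dominates((1, 1, 3), (2, 2, 2), 0.5))  # two of three objectives suffice
```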

An important remark that we have to keep in mind 
with respect to a new preference relation is that in 
spite of the fact that some preference relations con- 
tribute to converge faster to the Pareto front than the 
Pareto dominance relation, they also stress the gener- 
ation of solutions far from the knee region (usually 
the middle region of the Pareto front). This condi- 
tion limits the applicability of these relations since, 
in the general case, it is commonly assumed that the 
DM prefers solutions from the knee region [51.58-61].


51.4.2 Objective Reduction Approaches 


Deb and Saxena [51.26] proposed a method for re- 
ducing the number of objectives based on principal 
component analysis. The main assumption is that if two 
objectives are negatively correlated (taking the gener- 
ated Pareto front as the data set), then these objectives 
are in conflict with each other. To determine the most 
conflicting objectives (i. e., the most essential), the au- 
thors analyze in turn the eigenvectors (i. e., the principal 
components) of the correlation matrix. That is, by pick- 
ing the most negative and the most positive elements 
from the first eigenvector, we can identify the two most 
important conflicting objectives. To aggregate more 
objectives to the set of essential objectives, the remain- 
der of the eigenvectors are analyzed in a similar way 
until the cumulative contribution of the eigenvalues exceeds
a threshold cut (TC). This method is incorporated
into an iterative scheme which uses a multiobjective 
optimizer (the actual implementation uses the nondom- 
inated sorting genetic algorithm II (NSGA-II) [51.62]) 
to obtain a reduced objective set containing only the 
nonredundant objectives according to the analysis of the 
eigenvectors. In this scheme, the evolutionary multiob- 
jective optimizer is first run and then, the correlation 
analysis is carried out to obtain a reduced set of objec- 
tives. This process is repeated using the new reduced 
set of objectives. The process stops when the current 
subset is equal to the subset generated in the previous 
iteration. 

Brockhoff and Zitzler [51.21] defined two kinds of 
objective reduction problems and two corresponding al- 
gorithms to solve them. The problems proposed are the 
following: 


1. The $\delta$-MOSS problem. Given an MOP, the $\delta$-minimum
objective subset problem is defined as follows.
• Input: A Pareto front approximation of the MOP
and a $\delta \in \mathbb{R}$.
• Task: Compute the minimum objective subset
$F' \subseteq F$ such that $F'$ is $\delta$-nonconflicting with $F$.

2. The K-EMOSS problem. Given an MOP, the problem
of finding the minimum objective subset of size
K with minimum error is defined as follows.
• Input: A Pareto front approximation of the MOP
and a $K \in \mathbb{N}$.
• Task: Compute an objective subset $F' \subseteq F$ with
size $|F'| \leq K$, such that $F'$ is $\delta$-nonconflicting
with $F$ with the minimum possible $\delta$.


Since both problems are NP-hard, the authors
proposed both an exact and a greedy algorithm for
each of them. The exact algorithms for both problems
have time complexity $O(m^2 k \cdot 2^k)$, where m is the size
of the given nondominated set and k is the number
of objectives. On the other hand, the greedy algorithm
for the $\delta$-MOSS problem has time complexity
$O(\min\{m^2 k^3, m^4 k^2\})$, while the greedy algorithm for
the K-EMOSS problem has time complexity $O(m^2 k^3)$.

A similar approach was proposed by López Jaimes 
et al. [51.51]. They proposed two different objective re- 
duction algorithms: 


1. An algorithm that finds a minimum subset of 
nonredundant objectives with the minimum error 
possible. 


2. An algorithm that finds a K-size subset of nonre- 
dundant objectives, yielding the minimum error 
possible. 


Both algorithms are based on an unsupervised 
feature selection technique proposed by Mitra 
et al. [51.63], in which the correlation coefficient 
is used to estimate the conflict among objectives. 
Specifically, a negative correlation between a pair of 
objectives means that one objective increases, while 
the other decreases and vice versa (see, for example, 
the functions in Fig. 51.1). On the other hand, if the 
correlation is positive, then both objectives increase or 
decrease at the same time. This way, we could interpret 
that the more negative the correlation between two 
objectives, the more the conflict between them. 
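
As a rough illustration of this interpretation, the sketch below computes the Pearson correlation between every pair of objectives over a set of nondominated objective vectors. It is not the feature selection procedure of [51.63]; it only shows the correlation estimate on which the conflict measure is based, and the function names are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equally long, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_matrix(points):
    """Correlation between every pair of objectives over the given objective vectors."""
    k = len(points[0])
    columns = [[p[i] for p in points] for i in range(k)]
    return [[pearson(columns[i], columns[j]) for j in range(k)] for i in range(k)]


# f0 and f1 move in opposite directions (conflict); f2 is a scaled copy of f0.
points = [(1, 4, 2), (2, 3, 4), (3, 2, 6), (4, 1, 8)]
for row in correlation_matrix(points):
    print(row)
```

In this toy data set the strongly negative entry for (f0, f1) signals a conflict, whereas the strongly positive entry for (f0, f2) suggests that one of those two objectives is a candidate for removal.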

These two algorithms were designed to be used af- 
ter an approximation of the Pareto front has been found. 
From a general point of view, the removal of the nonconflicting
objectives can help the problem designer
or the decision maker to gain knowledge about the relation
and importance of the objectives according to the
conflict. With regard to the decision-making process, 
the removal of the nonconflicting objectives eases the 
visualization of the approximation of the Pareto front. 
In cases with a moderate number of objectives (i.e., 4-7),
the reduced objective set might be visualized using
the traditional 3D plots. 

However, an objective reduction technique can also 
be used in the course of the search. In [51.64], for 
instance, the authors proposed the incorporation of 
an objective reduction technique into a Pareto-based 
MOEA in order to cope with many-objective problems 
during the search. One possible approach is gradually 
reducing the number of objectives throughout different 
stages of the search until a target objective subset size 
has been reached. In each reduction stage, an objec- 
tive reduction method is applied on the current Pareto 
front approximation. Toward the end of the search, the 
original objective set is used again to approximate the 
entire Pareto front. This kind of approach can be advan- 
tageous for solving real-world problems with expensive 
objective functions since only a small subset of the 
objective functions is evaluated. Additionally, the use 
of a small set of objectives throughout the course of 
the search makes possible the adoption of expensive 
ranking schemes (e.g., those based on the hypervolume 
indicator) in problems with a high number of objec- 
tives [51.65]. 

A further approach, presented in [51.66], consists 
in partitioning the objective set into several subsets so 
that a different portion of the population focuses the
search on a different subspace. The partitioning of the 
set of objectives is based on the analysis of the conflict 
information obtained from the current Pareto front ap- 
proximation. 


51.4.3 Preference Incorporation Approaches 


Like the alternative preference relations reviewed in 
Sect. 51.4.1, the integration of DM’s preferences pro- 
vides a finer rank of the solutions. However, unlike 
preference relation approaches, in an interactive ap- 
proach, the region of interest can be changed during 
the search according to the requirements of the decision 
maker. 

Among the earliest attempts to incorporate prefer- 
ences in an MOEA, we can find Fonseca and Flem- 
ing’s proposal [51.67, 68]. This proposal consisted of 
extending the ranking mechanism of multiobjective ge- 
netic algorithm (MOGA) [51.69] using the so-called 
preferability relation. This relation accommodates goal 
information (equivalent to a reference point in other 
methods) and priorities in a single preference relation. 
The DM should define goal values and group objectives
according to their priority. Using the preferability
relation, two solutions are first compared in terms of 
the group of objectives with the highest priority. If the 
objectives of both solutions meet all their goal values 
or, contrarily, violate some or all of their goal values 
in a similar way, the next priority objective group is 
considered. This process continues until reaching the 
lowest priority group, where solutions are compared us- 
ing the Pareto dominance relation. By setting particular 
goals and priorities the authors derived the following 
special cases: the usual Pareto relation, lexicographic 
relation, constrained optimization, and goal program- 
ming. One disadvantage of this relation is that it is 
affected by the feasibility of the goal provided by the 
decision maker. If the given goal is far away from the 
feasible region, then the solutions will be mainly com- 
pared in terms of the objective priorities, reducing the 
relation to the lexicographic relation. In addition, if two 
solutions either do or do not meet their goals, the rela- 
tion does not take into account the degree of under- or 
over-attainment. 

Deb [51.70] proposed a technique to transform goal 
programming problems into multiobjective optimiza- 
tion problems which are then solved using an MOEA. 
In goal programming, the DM has to assign the goals that
he or she wishes to achieve for each objective, and these values
are incorporated into the problem as additional
constraints. The objective function then attempts to
minimize the absolute deviations from the goals to the
objectives. Unfortunately, like the previous method, this
approach is sensitive to the feasibility of the goal val- 
ues. If the goal is contained in the feasible space, it 
could prevent the generation of a better solution. On 
the other hand, if the goal is located far away from the 
feasible space, the effect of the method is practically 
nonexistent. 

More recently, Deb and Sundar [51.71] incor- 
porated a reference point approach into the NSGA- 
II [51.72]. They introduced a modification in the crowding
distance operator in order to select from the last
nondominated front the solutions that would be part
of the new population. They used the Euclidean distance
to sort and rank the population accordingly (the
solution closest to the reference point receives the best 
rank). This method was designed to take into account 
a set of reference points. The drawback of this scheme 
is that it only guarantees weak Pareto optimality. That 
is to say, besides Pareto optimal solutions, the method 
might generate some weakly Pareto optimal solutions, 
particularly in MOPs with disconnected Pareto fronts. 
A similar approach was also proposed by Deb and 
Kumar [51.73], in which the light beam search pro- 
cedure [51.74] was incorporated into the NSGA-II. 
Similar to the previous approach, they modified the 
crowding operator to incorporate DM’s preferences. 
They used a weighted achievement function to assign 
a crowding distance to each solution in each front. 
Thus, the solution with the least distance will have 
the best crowding rank. Like in the previous approach, 
this algorithm finds a subset of solutions around the 
optimum of the achievement function adopting the 
usual outranking relation. A vector $z^1$ outranks vector
$z^2$ if $z^1$ is considered to be at least as good as $z^2$.
In [51.74], three kinds of thresholds are defined to de- 
termine if one solution outranks another one, namely, 
indifference, preference, and veto threshold. However, 
in [51.73] the veto threshold is the only one used. This 
relation depends on the crowding comparison opera- 
tor. In contrast, the new preference relation presented 
in this work does not depend on external methods, 
and, therefore, it can be used in every Pareto-based 
MOEA. 

Recently, Thiele et al. [51.75] proposed a variant
of the indicator-based evolutionary algorithm
(IBEA) [51.76], in which preference information is
incorporated by means of an achievement scalarization
function. The basic idea is to divide the original
indicator value (which is to be maximized) by the
achievement value (which is to be minimized). Thus,
solutions with a smaller achievement value will be preferred
since the modified indicator value is larger. In
a further paper, the new IBEA of Thiele et al. was used
in [51.77] in order to approximate the entire Pareto front
by defining several reference points.


Fig. 51.5 Nondominated solutions with respect to the
Chebyshev relation

A recent interactive optimization method was proposed
by López Jaimes et al. [51.78] to deal with
MOPs. This method is based on a Chebyshev achievement
function. The basic idea of the Chebyshev preference
relation is to combine the Pareto dominance
relation and the achievement function to compare so- 
lutions in objective function space. The Chebyshev 
preference relation is defined as follows. 


Definition 51.11 
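A solution $z^1$ is preferred to solution $z^2$ with respect to
the Chebyshev relation ($z^1 \prec_{cheby} z^2$), if and only if:


1. $s_\infty(z^1, z^{ref}) < s_\infty(z^2, z^{ref}) \wedge \{z^1 \notin R(z^{ref}, \delta) \vee z^2 \notin R(z^{ref}, \delta)\}$, or
2. $z^1 \prec z^2 \wedge \{z^1, z^2 \in R(z^{ref}, \delta)\}$,


where
$R(z^{ref}, \delta) = \{z \mid s_\infty(z, z^{ref}) < s_\infty^{\min} + \delta\}$


is the region of interest (ROI) with respect to the vector
of aspiration levels $z^{ref}$.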


As an illustration of the preference relation, consider
solutions $z^1$ and $z^2$ presented in Fig. 51.5.
Since $z^2 \notin R(z^{ref}, \delta)$ and $s_\infty(z^1, z^{ref}) < s_\infty(z^2, z^{ref})$,
then $z^1 \prec_{cheby} z^2$.
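
The two cases of Definition 51.11 can be sketched in Python as follows. Several details are assumptions made only for illustration: the achievement function is taken as the unweighted largest deviation from the aspiration levels, the smallest achievement value in the population (`s_min`) is supplied by the caller, and all function names are hypothetical. The method in [51.78] may differ in these details.

```python
def achievement(z, z_ref):
    """Assumed Chebyshev achievement value: largest deviation from the aspiration levels."""
    return max(a - b for a, b in zip(z, z_ref))


def pareto_dominates(z1, z2):
    """Usual Pareto dominance for minimization."""
    return (all(a <= b for a, b in zip(z1, z2))
            and any(a < b for a, b in zip(z1, z2)))


def cheby_preferred(z1, z2, z_ref, s_min, delta):
    """True if z1 is preferred to z2 under the two cases of the Chebyshev relation."""
    s1, s2 = achievement(z1, z_ref), achievement(z2, z_ref)
    in_roi_1 = s1 < s_min + delta
    in_roi_2 = s2 < s_min + delta
    if in_roi_1 and in_roi_2:
        # Case 2: inside the region of interest, fall back to Pareto dominance.
        return pareto_dominates(z1, z2)
    # Case 1: at least one solution is outside the ROI, compare achievement values.
    return s1 < s2


z_ref, s_min, delta = (0.0, 0.0), 1.0, 0.5
print(cheby_preferred((1.0, 1.0), (3.0, 0.5), z_ref, s_min, delta))  # case 1
print(cheby_preferred((1.0, 1.0), (1.2, 1.1), z_ref, s_min, delta))  # case 2
print(cheby_preferred((1.0, 1.0), (0.9, 1.4), z_ref, s_min, delta))  # incomparable in ROI
```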


51.5 Recombination Operators and Mating Restrictions 


The idea of restricted mating is not new in the field of 
evolutionary optimization. For instance, in 1989 Deb 
and Goldberg [51.79] suggested the use of restrictive 
mating with respect to the phenotypic (i. e., using the 
decoded values of the variables) distance using some 
metric. A different approach consisted in distributing 
solutions on a logical topology. For example, Baita 
et al. [51.80] placed solutions on a grid and restricted 
the area within which each solution could mate. For 
more examples of restricted mating, the reader is re- 
ferred to [51.2]. 

Recently, specific mating techniques to deal with
many-objective problems have been proposed. Sato
et al. [51.81] described a local recombination scheme
that recombines individuals if they have similar search
directions in the objective space. The search direction is
defined by the polar coordinates of each solution, i.e.,
its norm and declination angles to the axis associated
with the first $k - 1$ objectives.


In order to control the disruptive effect of recombination,
in [51.33], a crossover operator for binary
representation was proposed, namely the controlling
crossed genes (CCG) operator. This technique
was applied to the two-point and uniform
crossover operators. In two-point crossover, from the
three binary segments in which two parents are di- 
vided, the middle segment is exchanged between 
the parents to produce two children. Thus, in the 
CCG operator for two-point crossover, the length of 
the middle segment is regulated by a user param- 
eter. This way, as the middle segment gets shorter, 
the generated children become more similar to each 
parent. 

Regarding uniform crossover, the number of ex- 
changed bits between parents is regulated with the 
probability of writing a 1 or a 0 in the bit mask string 
that determines which parent bit will be copied into the 
produced offspring. 
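
The idea of limiting the number of exchanged genes can be sketched as follows. This is only an illustration of the principle: the parameter names `max_segment` and `swap_prob`, the way the segment position is sampled, and the function names are assumptions, and the operator in [51.33] may be parameterized differently.

```python
import random


def two_point_ccg(p1, p2, max_segment, rng):
    """Two-point crossover whose exchanged middle segment has at most max_segment genes."""
    n = len(p1)
    length = rng.randint(1, min(max_segment, n))  # inclusive bounds
    start = rng.randint(0, n - length)
    end = start + length
    child1 = p1[:start] + p2[start:end] + p1[end:]
    child2 = p2[:start] + p1[start:end] + p2[end:]
    return child1, child2


def uniform_ccg(p1, p2, swap_prob, rng):
    """Uniform crossover where each mask bit triggers an exchange with probability swap_prob."""
    mask = [rng.random() < swap_prob for _ in p1]
    child1 = [b if m else a for a, b, m in zip(p1, p2, mask)]
    child2 = [a if m else b for a, b, m in zip(p1, p2, mask)]
    return child1, child2


rng = random.Random(0)
p1, p2 = [0] * 10, [1] * 10
c1, c2 = two_point_ccg(p1, p2, 3, rng)
print(c1, c2)
u1, u2 = uniform_ccg(p1, p2, 0.1, rng)
print(u1, u2)
```

A small `max_segment` or `swap_prob` keeps each child close to one of its parents, which is the effect the text describes.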
51.6 Scalarization Methods


Most of the scalarization methods have in common the 
following mechanisms (although they differ in the way 
in which they are implemented): 


• A class of scalarizing functions to evaluate solutions.

• A mechanism to generate a uniform distribution of
search direction vectors.

• A mechanism to obtain an overall ranking of the
solutions derived from the evaluation of each scalarizing
function.
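
To make the first two mechanisms concrete, the sketch below evaluates a small set of objective vectors with two widely used scalarizing functions, the weighted sum and the weighted Chebyshev function, along evenly spaced weight vectors for two objectives. The data, the ideal point, and all function names are illustrative assumptions. The example set contains a point in a nonconvex part of the front, which the weighted sum never selects for any weight vector, whereas the Chebyshev function does.

```python
def weighted_sum(z, w):
    """Weighted sum scalarization (minimization)."""
    return sum(w_i * z_i for w_i, z_i in zip(w, z))


def weighted_chebyshev(z, w, z_ideal):
    """Weighted Chebyshev scalarization with respect to an ideal point (minimization)."""
    return max(w_i * abs(z_i - r_i) for w_i, z_i, r_i in zip(w, z, z_ideal))


def uniform_weights_2d(n):
    """n evenly spaced weight vectors for two objectives (n >= 2)."""
    return [(i / (n - 1), 1 - i / (n - 1)) for i in range(n)]


def best_per_direction(points, weights, scalarize):
    """For each weight vector, the point that minimizes the scalarizing function."""
    return [min(points, key=lambda z: scalarize(z, w)) for w in weights]


# (0.6, 0.6) lies in a nonconvex region between the two extreme points.
points = [(0.0, 1.0), (0.6, 0.6), (1.0, 0.0)]
weights = uniform_weights_2d(5)
ideal = (0.0, 0.0)

ws_best = best_per_direction(points, weights, weighted_sum)
cheb_best = best_per_direction(
    points, weights, lambda z, w: weighted_chebyshev(z, w, ideal))
print(ws_best)
print(cheb_best)
```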


Hughes [51.82] proposed a method in which the 
weighted Chebyshev function and the vector angle dis- 
tance scaling are used as scalarizing functions. The 
method to generate the search direction is formulated 
as the problem of maximizing the angle between each
pair of neighboring search vectors. The fitness of each
solution in the current population is based on the best
result obtained over all the scalarizing functions, i.e., the
search direction in which the solution performs better.

Another algorithm that has been recently tested
in many-objective problems is the multiobjective
evolutionary algorithm based on decomposition
(MOEA/D) [51.83]. In [51.84], the performance of
MOEA/D using either a weighted sum function or
a Chebyshev function was studied using several instances
of a knapsack problem. The results showed that
the weighted sum function provided better results than
the Chebyshev function, while in nonconvex problems,
the Chebyshev function helped to achieve a better
performance of MOEA/D.


51.7 Conclusions and Research Paths


This chapter presented a short review of the current advances
to cope with optimization problems with a high
number of objectives (MOPs) using MOEAs. We covered
results aimed at discovering and studying the causes
that make an MOP more difficult as more objectives are
added. We also described and classified some of
the current techniques to deal with MOPs.

Regarding the sources of difficulty of many-objective
optimization problems, we can observe that
most of the initial works are based on experimental
analysis, and only a few studies are focused on investigating
the nature of the problem using theoretical
considerations. When the interest in many-objective
optimization problems began, some hypotheses about
the causes of the poor performance of MOEAs on MOPs
were suggested. Although some of them were considered
highly probable and may turn out to be true,
further investigation is still needed to confirm or refute
these hypotheses. This was the case of the proportion
of nondominated solutions, which was often taken
as a sufficient condition to increase the difficulty of
an MOP. However, recent studies have shown that there
exist some problems in which this proportion rises
exponentially, while the hardness of the problem only
increases marginally. In this sense, future research paths
must be channeled to investigate other sources of diffi- 
culty. Some promising areas of future research are, for 
example, the following: 


• Since DRSs are not present in every MOP, a
characterization of the problems that promote the creation
of DRSs is required.

• Investigate if recombination operators in continuous
spaces also represent an issue as observed in
discrete spaces.


Regarding the methods to solve MOPs, many
proposals have been designed to improve the search ability
of MOEAs in high-dimensional scenarios. However,
few efforts have been devoted to developing visualization
methods specialized for MOPs. Similarly, more
proposals for coping with the dimensionality of the Pareto
front are needed, for instance, diversity mechanisms
that are effective in large spaces or data structures to
efficiently manage a large number of solutions. With
respect to the assessment of a new MOEA in
many-objective scenarios, our recommendation is to adopt
a diverse set of MOPs, taking instances from different
families of test suites.

52. Memetic and Hybrid Evolutionary Algorithms


Jhon Edgar Amaya, Carlos Cotta Porras, Antonio J. Fernandez Leiva 


This chapter presents an overview of hybridization
mechanisms in evolutionary algorithms. Such
mechanisms are aimed at introducing problem
knowledge in the optimization technique
by means of the synergistic combination of
general-purpose methods and problem-specific
add-ons. This combination is presented in this
work from two wide perspectives: memetic
algorithms and cooperative optimization models.
Memetic algorithms are based on the smart
orchestration of global (population-based) and local
(trajectory-based) techniques, using an algorithmic
scheme in which the latter are often subordinated
to the former. As to cooperative models, they are
based on the collaboration of different optimization
techniques that exchange information in order
to boost their respective performances. Both
approaches, memetic algorithms and cooperative
models, provide a framework to achieve synergistic
algorithmic combinations for the resolution of
large-scale combinatorial problems.


52.1 Overview


Heuristic methods aim to efficiently produce
near-optimal solutions for hard problems (optimization
problems in particular). We are here specifically
concerned with those methods used to solve an
optimization problem by means of an intelligent
exploration of the search space and the fruitful exploitation
of knowledge about the problem structure. This is
admittedly a very broad class of methods that comprises,
among others, classical artificial intelligence tools
such as the A* algorithm as well as modern optimization
techniques such as metaheuristics [52.1]. The latter
are general-purpose techniques for optimization that
guide some underlying basic heuristics for intelligently
exploring the search space of the problem under
consideration.

There exists a plethora of metaheuristic methods,
each of them with its own distinctive features and
governing parameters, typically (yet not always) based
on some analogy of a real-world phenomenon (be it
in the area of biology, zoology, physics, etc.). Indeed,
there have been several attempts in the literature to 
classify these techniques according to different crite- 
ria, e.g. whether they are inspired by nature or not, 
use of memory, neighborhood structure, use of sin- 
gle solutions or populations thereof, etc. Blum and 
Roli [52.1] proposed a classification in which a dis- 
tinction was firstly made between trajectory-based (or 
single-point search) and population-based techniques 
(see also Fig. 52.1). The former can be depicted as fol- 
lowing a particular trajectory (sequence of points) in 
the search space by the smart exploration of the neigh- 
borhood of a single solution (this is to some extent 
an oversimplification, since trajectory-based techniques 
are often endowed with intensification/diversification 
mechanisms that may turn this trajectory into complex 
branching paths; nevertheless, it serves as an initial
analogy). The latter are, however, better imagined as
a cloud of points moving through the search space, 
expanding and contracting according to some internal 
dynamics. 

Despite what the above depiction may suggest in 
terms of the superiority or adequateness of methods 
falling within some particular class, it is not possi- 
ble to state that any method is better than any other 
one, at least not in a general sense. This is a some- 
what counterintuitive result that was formally derived 
by Wolpert and Macready [52.2] in the so-called “no
free lunch theorem” (NFL). This theorem can be
formulated as

    \sum_f P(x_m \mid f, A, e) = \sum_f P(x_m \mid f, B, e),    (52.1)

where P(x_m | f, A, e) is the probability that algorithm A
detects the optimal solution for a generic objective
function f using computational effort e (i.e.,
generating e different solutions) and P(x_m | f, B, e) is the
analogous probability for algorithm B. In other words,
the average performance of any pair of algorithms 
across all possible problems defined on particular do- 
mains and co-domains is identical. Hence, whenever an 
algorithm performs well on a certain problem or class 
of problems, it follows that it will exhibit degraded per- 
formance on the set of all remaining problems. While 
the initial assumptions from which the NFL theorem is
derived are questionable (most importantly, the
consideration of all possible problems includes many functions
that are random or incompressible in an algorithmic
information-complexity sense, and hence cannot be
efficiently calculated, thus rendering them irrelevant from
an optimization point of view), the concept that there
is no universal optimizer had a significant impact on
the scientific community and provides a safe ground
onto which particular optimization procedures can be
built. To be precise, the NFL theorem highlights the
limitations of black-box optimization procedures, i.e.,
techniques whose search strategy is independent or
unaware of the internal working of the objective function
that is being optimized, and emphasizes the need for
trying to exploit domain knowledge within the search
algorithm in order to tailor the optimization process to
the problem under consideration.

Fig. 52.1 Classification of metaheuristics according to Blum and
Roli (after [52.1])
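
The NFL statement can be checked by brute force on a toy domain. The following is only an illustrative sketch under stated assumptions: a three-point search space, binary objective values, and three hypothetical deterministic, non-revisiting search strategies (`scan_left`, `scan_right`, `adaptive`). None of these names come from [52.2]. For every effort level e, the code counts over all 2^3 objective functions how often each strategy has sampled a maximizer of f.

```python
from itertools import product

X = [0, 1, 2]   # toy search space (assumption)
Y = [0, 1]      # toy set of objective values (assumption)


def scan_left(history):
    """Visit points in increasing order, ignoring observed values."""
    return len(history)


def scan_right(history):
    """Visit points in decreasing order, ignoring observed values."""
    return len(X) - 1 - len(history)


def adaptive(history):
    """Start in the middle, then let the observed value pick the next point."""
    if not history:
        return 1
    if len(history) == 1:
        return 0 if history[0][1] == 1 else 2
    visited = {x for x, _ in history}
    return next(x for x in X if x not in visited)


def finds_optimum(algorithm, f, effort):
    """True if the algorithm samples a maximizer of f within `effort` evaluations."""
    history = []
    for _ in range(effort):
        x = algorithm(history)
        history.append((x, f[x]))
    return max(y for _, y in history) == max(f)


def successes(algorithm, effort):
    """Number of functions f: X -> Y on which the optimum is found."""
    return sum(finds_optimum(algorithm, f, effort)
               for f in product(Y, repeat=len(X)))


def nfl_table():
    """Success counts for efforts 1..|X|, one row per algorithm."""
    return {alg.__name__: [successes(alg, e) for e in range(1, len(X) + 1)]
            for alg in (scan_left, scan_right, adaptive)}
```

Calling `nfl_table()` yields the same row, `[5, 7, 8]`, for the three strategies: summed over all functions, none of them is better than the others, even though they differ on individual functions.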

The argument above is commonly used to support 
the development and utilization of hybrid metaheuris- 
tics, where the term hybrid is used to denote in a broad 
sense the exploitation of problem-dependent knowl- 
edge (typically attained via the sensible combination of 
general-purpose and problem-specific mechanisms).
Indeed, these hybrid methods can be shown to provide
an efficient behavior and notable flexibility for
dealing with real-world problems. The general idea here
is to achieve a synergistic combination of
complementary techniques in order to enhance their strengths and
alleviate their weaknesses. Roughly speaking,
such hybrid approaches can be attained via two dif- 
ferent (and complementary) approaches: cooperation 
(the techniques involved exchange information in order 
to boost their respective performances) and integration 
(one of the techniques is subordinated to the other one, 
which uses the former as a tool to achieve some internal 
goal) [52.3]. 

Arguably, one of the advantages (if not from 
the performance point of view at least from the de- 
sign point of view) of population-based methods over 
trajectory-based methods is their greater flexibility 
when it comes to integrating different metaheuristics. 
For example, cooperative methods can often be
defined as a population of (possibly heterogeneous) search
agents exchanging information according to some un- 
derlying connection topology. Different architectures 
for such cooperative methods have been defined, e.g. 
MAGMA [52.4] or COSEARCH [52.5], depending 
on the communication strategy and the intervening 
methods. We can also cite hyper-heuristics [52.6, 7] 
in this regard, i.e., the use of a high-level heuristic 
to control the application of a population of low-level 
heuristics. 

The above ideas fit nicely with the notion of 
memetic algorithm (MA). MAs are a family of meta- 
heuristics that try to blend several concepts from 
population-based and trajectory-based techniques. The 
term memetic comes from meme, a word coined by 
Dawkins [52.8] as an analogy to the gene in the
context of cultural evolution. Indeed, there is
a connection between cultural evolution and memetic
algorithms, in the sense that memes are much more
plastic and flexible than genes — and hence evolve 
faster — and can be subject to lifetime learning, thus 
leading to the transmission of acquired traits (much un- 
like biological evolution). Due to the way in which this 
can be implemented, MAs are often termed hybrid EAs 
or Lamarckian EAs, among other fancy terms. From 
a general perspective, we can say that an MA is a search 
strategy in which a population of optimizing agents — 
explicitly concerned with using knowledge from the 
problem being solved — synergistically cooperate and 
compete [52.9]. 

Focusing on combinatorial optimization problems, 
that is, problems whose solution space is composed 
of combinatorial structures such as graphs, trees, sets, 
lists, permutations, etc., built on a discrete collec- 
tion of variables, MAs and hybrid metaheuristics are 
very well suited to their resolution. On the one hand, 
the solutions to these problems are information-rich 
structures that the algorithmic designer can analyze 
in order to extract problem information to be later 
used in the optimizer (some attempts have been made
to automatically extract this information in
combinatorial contexts as well [52.10]). This contrasts with
most continuous optimization problems in which the 
high-dimensionality and highly non-linear coupling of 
variables makes them much more opaque in general 
(not to mention that black-box scenarios are more fre- 
quent in this continuous domain, e.g., optimization of 
physical or industrial processes via simulations of the 
system). On the other hand, it is very often the case 
that the objective function for combinatorial problems 
is decomposable or at least incrementally computable, 
meaning that after a small perturbation has been intro- 
duced in a solution the latter does not need to be fully 
evaluated from scratch (only an incremental term de- 
pendent on the modification done must be computed). 
This makes the use of local search strategies much 
more computationally amenable than in most contin- 
uous domains. Before describing MAs in more detail, 
let us first overview some generic ideas about evo- 
lutionary algorithms (EAs) and hybrid metaheuristics. 
Throughout the discussion we will focus on combi- 
natorial problems and provide illustrative examples 
on the instantiation of these techniques for discrete 
optimization. 
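
As an illustration of incremental evaluation, consider a 2-opt move on a symmetric TSP tour: reversing a segment replaces only two edges, so the change in tour length can be computed in constant time instead of re-evaluating the whole tour. The sketch below assumes a symmetric distance matrix `dist`; the helper names are hypothetical and chosen for this example only.

```python
def tour_length(tour, dist):
    """Full evaluation: sum of the n edges of the closed tour, O(n)."""
    n = len(tour)
    return sum(dist[tour[k]][tour[(k + 1) % n]] for k in range(n))


def two_opt_delta(tour, dist, i, j):
    """Incremental evaluation, O(1): length change caused by reversing
    tour[i+1..j] (0 <= i < j < n), assuming dist is symmetric."""
    n = len(tour)
    a, b = tour[i], tour[i + 1]          # edge (a, b) is removed
    c, d = tour[j], tour[(j + 1) % n]    # edge (c, d) is removed
    return dist[a][c] + dist[b][d] - dist[a][b] - dist[c][d]


def apply_two_opt(tour, i, j):
    """Return a new tour with the segment tour[i+1..j] reversed."""
    return tour[:i + 1] + tour[i + 1:j + 1][::-1] + tour[j + 1:]
```

A local search would call `two_opt_delta` for each candidate move and apply only those moves whose delta is negative, falling back to `tour_length` just once at the start.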


52.2 A Bird's View of Evolutionary Algorithms 


An EA is a stochastic iterative procedure for gener- 
ating candidate solutions for a certain problem. The 
algorithm manipulates a pool pop of individuals (the so- 
called population), each of them carrying one or more 
chromosomes. Chromosomes are, in turn, composed 
of smaller pieces called genes, each of them taking 
a value from a certain domain (the allele set). Chro- 
mosomes represent a solution for the problem at hand 
via an encoding/decoding process. More precisely, EAs 
assume the existence of a phenotype space compris- 
ing the solutions for the problem under consideration 
and a genotype space, comprising all possible chro- 
mosomes. It is between these two sets that the growth 
(or expression) function is defined so as to have the 
mapping between chromosomes and solutions. While 
in some cases these two spaces may be identical, this 
does not generally happen. In this general situation, the 
growth function is merely required to be surjective. 
The pool of solutions is initialized either at ran- 
dom or by means of some heuristic seeding procedure. 
Each individual then receives a fitness value quantify- 
ing how good the solution it carries is. This value will 
be used by the EA for guiding the search. The initial
population is actually the playground on which the
EA will subsequently work, iteratively applying some 
evolutionary operators to modify its contents. More pre- 
cisely, the process comprises two major stages: selec- 
tion (promising solutions are selected for breeding and 
survival), and reproduction (new solutions are created 
by modifying selected solutions using some reproduc- 
tive operators). Selection is further decomposed into 
two sub-stages: the first one is selection for reproduc- 
tion (often simply called selection) in which solutions 
from the population are picked and fed to the repro- 
duction stage; the second one is selection for survival 
(commonly called replacement) in which the new solu- 
tions obtained in the reproduction stage are inserted in 
the population at the expense of removing some older 
solutions. Both selection sub-stages are present in EAs 
(although in some cases one of these sub-stages may 
take a very simplistic form; e.g., random selection for 
reproduction is sometimes used in evolution strategies). 
This selection-reproduction cycle is repeated until a
certain termination criterion (usually reaching a maximum
number of fitness computations; some more complex
criteria based on stagnation detection are also
possible [52.11]) is fulfilled. Each iteration of this process
is commonly termed a generation. The whole process is 
illustrated in Algorithm 52.1. Every possible instantia- 
tion of this algorithmic template leads to a different EA. 


Algorithm 52.1 A Basic Evolutionary Algorithm
function BasicEA (in P: Problem, in par: Parameters): Solution;
begin
    pop ← INITIALIZE(par, P);
    repeat
        newpop1 ← SELECT(pop, par, P);
        newpop2 ← REPRODUCE(newpop1, par, P);
        pop ← REPLACE(pop, newpop2);
    until TERMINATIONCRITERION(par);
    return GETBEST(pop);
end
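
One possible instantiation of this template is sketched below. It is not the chapter's implementation: the binary encoding, binary tournament selection, one-point crossover, bit-flip mutation, generational replacement with one elite, and the budget of fitness evaluations are all assumptions made for the sake of a short runnable example, and the function name `basic_ea` is hypothetical.

```python
import random


def basic_ea(fitness, n_bits, pop_size=20, max_evals=2000, seed=0):
    """Maximize `fitness` over bit strings of length n_bits (n_bits >= 2)."""
    rng = random.Random(seed)
    p_mut = 1.0 / n_bits

    # INITIALIZE: random population plus fitness values
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    fit = [fitness(ind) for ind in pop]
    evals = pop_size

    def tournament():
        # SELECT (for reproduction): binary tournament
        a, b = rng.randrange(pop_size), rng.randrange(pop_size)
        return pop[a] if fit[a] >= fit[b] else pop[b]

    while evals < max_evals:                      # TERMINATIONCRITERION
        # REPRODUCE: one-point crossover followed by bit-flip mutation
        children = []
        for _ in range(pop_size):
            p1, p2 = tournament(), tournament()
            cut = rng.randrange(1, n_bits)
            child = p1[:cut] + p2[cut:]
            children.append([1 - g if rng.random() < p_mut else g
                             for g in child])
        child_fit = [fitness(c) for c in children]
        evals += pop_size

        # REPLACE: generational replacement that keeps the best old solution
        best = max(range(pop_size), key=fit.__getitem__)
        worst = min(range(pop_size), key=child_fit.__getitem__)
        if fit[best] > child_fit[worst]:
            children[worst], child_fit[worst] = pop[best], fit[best]
        pop, fit = children, child_fit

    best = max(range(pop_size), key=fit.__getitem__)  # GETBEST
    return pop[best], fit[best]
```

For instance, `basic_ea(sum, 20)` runs the sketch on the OneMax toy problem (maximize the number of ones). Swapping any of the four commented stages for another operator gives a different EA, which is the point made in the text.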


52.3 From Hybrid Metaheuristics to Hybrid EAs 


As was mentioned before, hybrid metaheuristics (and 
in particular hybrid EAs) are developed aiming to at- 
tain a synergistic combination of several techniques, 
exploiting their strengths and mitigating their weak- 
nesses [52.12]. Besides the theoretical justifications for 
hybrid metaheuristics (arising, for example, from the 
NFL theorem sketched before), these techniques have 
been repeatedly vindicated by their practical success. 
Before getting to hybrid EAs, let us first focus on how 
hybridization can be approached. 


52.3.1 Hybridization Mechanisms 


As was already mentioned in Sect. 52.1, attempts to 
classify hybrid metaheuristics are manifold [52.13-17].
We will focus in the following on two of these,
namely the classification of Talbi [52.13] and that of 
Raidl [52.16]. 


Fig. 52.2 Classification of hybrid metaheuristics by Talbi (after [52.13])


Talbi proposed a hierarchical taxonomy based on 
two design issues: functionality and algorithmic archi- 
tecture. According to this, we can distinguish between 
high/low-level hybrids and relay/teamwork hybrids.
Low-level hybridization addresses the functional com- 
position of a single optimization method in which 
a certain function of a metaheuristic is replaced by 
another metaheuristic. On the contrary, in high-level hy- 
brids, the internals of different metaheuristics are non- 
intersecting. As for relay hybridization, it comprises 
models in which a set of metaheuristics are sequen- 
tially applied, each using the output of the previous as 
its input. On the other hand, teamwork hybridization 
represents cooperative optimization models. These two 
distinctions (low versus high, relay versus teamwork) 
are orthogonal, and hence lead to four different com- 
binations. These four classes can, in turn, be refined 
using three additional dichotomies, namely homoge- 
neous versus heterogeneous (referring to the type of 
metaheuristics involved in the hybrid), global versus 
partial (referring to whether or not each technique ex- 
plores the whole search space) and specialist versus 
general (referring to whether or not all algorithms solve 
the same optimization problem). Figure 52.2 shows this 
taxonomy. 

Raidl [52.16], in turn, proposed a hybrid meta- 
heuristic classification centered around four elements: 
type of hybridization, level/strength of hybridiza- 
tion, control strategy, and execution order. Regard- 
ing the type of hybridization, we can distinguish 
between: 


1) Combinations of different metaheuristics 

2) Combinations of metaheuristics and problem-specific
algorithms

3) Combinations of metaheuristics with general oper- 
ational research (OR), artificial intelligence (AI), or 
constraint programming (CP) techniques.

Regarding the hybridization strength, we can
distinguish high-level/weakly-coupled hybrids and 
low-level/strongly-coupled hybrids. As to the con- 
trol strategy, there are two possibilities: integrative 
(a technique takes a subordinate role) and collaborative 
(exchange of information without subordination). 
Finally, the order of execution captures the temporal 
aspect of the interaction among techniques. Thus, we 
can have sequential execution (a technique takes as 
input the output of another technique), intertwined 
execution (both techniques alternate parts of their 
execution at a computational or algorithmic level), 
and parallel execution (the techniques run in parallel). 
Figure 52.3 shows this classification. 


52.3.2 Hybrid EAs 


One of the most classical hybridization approaches 
for EAs is defined in the context of knowledge- 
augmented representations, particularly in the case 
that the solutions sought have an extremely complex 
structure for which a direct search does not seem 
adequate, or with problems that exhibit constraints. 
In the latter case, these can be handled in three 
ways: 


i) By using penalty functions that guide the search to 
feasible solutions 

ii) By using repairing mechanisms that turn infeasible 
solutions into feasible ones 


iii) By defining reproductive operators that always re- 
main in the feasible region. 


While the complexity of the representation and the 
operators can be kept low in the first two cases (i. e., 
the complexity is moved to the fitness function and the 
repairing function, respectively), the third case requires 
either a careful representation safeguarding feasibility, 
or complex operators intelligently handling the con- 
straints of the problem. Focusing on representations, 
decoders [52.18] are commonly used. These provide 
a complex genotype-to-phenotype mapping that may 
not just produce feasible solutions, but can also pro- 
vide better quality solutions. Consider, for example, 
the knapsack problem: solutions are sets of objects in 
this case, but clearly a random set may be infeasible 
due to the knapsack capacity constraint. This could 
be handled with a penalty term to account for this 
capacity violation or by adding/removing some ob- 
jects to turn the solution into a feasible one [52.19]. 
A decoder approach could, however, encode solutions
as permutations, indicating the order in which
objects are to be considered for inclusion in the
knapsack. Since any object violating the capacity constraint
would be skipped, a feasible solution would always
be obtained. Problem-space search [52.20] (the
use of a construction heuristic that is guided through
problem-space) also falls within this class of
low-level/strong hybrids. Continuing with the knapsack
problem, solutions could in this case be represented as
perturbations of the value of objects. Each of these
solutions would be evaluated by constructing the
so-modified problem instance, solving it with a greedy
heuristic and using the original instance to evaluate
the quality of the solution obtained. This strategy
is very competitive for this problem, as is shown
in [52.21].

Fig. 52.3 Classification of hybrid metaheuristics by Raidl (after [52.16])
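
Both ideas can be sketched in a few lines for the knapsack problem. The code is an illustration under assumptions, not the method of [52.18], [52.20], or [52.21]: the function names are hypothetical, weights are assumed positive, and the greedy heuristic used for problem-space search is the usual value/weight ordering.

```python
def decode_knapsack(permutation, weights, values, capacity):
    """Decoder: scan objects in the given order and skip any object that
    would violate the capacity, so the result is always feasible."""
    chosen, load, total = [], 0, 0
    for item in permutation:
        if load + weights[item] <= capacity:
            chosen.append(item)
            load += weights[item]
            total += values[item]
    return chosen, total


def problem_space_evaluate(perturbation, weights, values, capacity):
    """Problem-space search: the genotype perturbs the object values, a
    greedy heuristic solves the modified instance, and the solution is
    scored with the original (unperturbed) values."""
    perturbed = [v + d for v, d in zip(values, perturbation)]
    order = sorted(range(len(weights)),
                   key=lambda i: perturbed[i] / weights[i],
                   reverse=True)
    return decode_knapsack(order, weights, values, capacity)
```

With `weights = [4, 3, 2]`, `values = [10, 7, 4]`, and `capacity = 5`, the permutation `[0, 1, 2]` decodes to `([0], 10)` whereas `[1, 2, 0]` decodes to `([1, 2], 11)`. The perturbation `[-5, 0, 0]` leads the greedy heuristic to the latter, better solution, while the zero perturbation leads to the former.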

On the other hand, high-level/weak hybrid evolu- 
tionary algorithms are most typically obtained either by 
integrating within the EA a local-search (single-point 
or trajectory-based) method, or other techniques from 
the realm of OR/AI/CP/..., etc. Regarding the former 
approach, the underlying idea is to boost the intensi- 
fication capabilities of the algorithm by improving the 
solutions generated by the population-based search en- 
gine. This kind of combination dates back to the late 
1980s, when it used to take the form of a genetic
algorithm hybridized with simulated annealing (SA) [52.22].
A particularly interesting hybrid EA along this
line is the parallel recombinative simulated annealing 
algorithm [52.23], in which a pool of SA algorithms 
cooperate/compete in a genetic algorithm framework. 
Tabu search (TS) is another popular local search
metaheuristic to be hybridized with EAs (see [52.24-29], to
mention just a few).

Other EA techniques such as estimation of distribu- 
tion algorithms (EDAs) have also been hybridized with 
local search approaches, e.g., Campelo et al. [52.30] 
for the design of electromagnetic devices and Laguna 
et al. [52.31] for maximum cut. It is also worth men- 
tioning the work by Santana et al. [52.32] on the combi- 
nation of variable neighborhood search [52.33] (VNS) 
with EDAs for protein structure prediction. This is 
done in different ways, most notably either integrating 
VNS within the EDA or alternating the two algorithms. 
Zhang et al. [52.34] propose an analogous approach
for quadratic programming based on the hybridization
of EDAs and 2-opt hill climbing. A very interesting
approach was also proposed by Peña et al. [52.35],
who hybridized a steady-state genetic algorithm and an
EDA; each of these algorithms is responsible for
generating a part of the population. On the other hand,
Zhou et al. [52.36] and Ahn et al. [52.37] propose the
hybridization of an EDA with particle swarm
optimization (PSO) where the latter is used for intensification
purposes.

As for hybridization with techniques from the
realms of AI/OR or constraint programming,
examples date back to the mid 1990s. Particularly interesting
is the combination of EAs with exact techniques and
derivatives thereof. For example, branch and bound
(BnB) can be integrated within an EA as a
recombination operator [52.38, 39] or in the decoding
process [52.40]. Conversely, an EA can be used for the
strategic guidance of BnB [52.41]. As for collaborative
combinations, intertwined approaches were considered
in [52.42] by combining EAs and BnB within a
parallel multiagent system, and in [52.38, 43] by defining
a model in which the exact technique provided
partial promising solutions, and the EA returned improved
bounds. A related multilevel approach involving beam
search and an EA hybridized with a local-search
algorithm can be found in [52.44-46]. For further details on
this kind of exact/metaheuristic hybridization we refer
the reader to [52.3].

Most of the above hybrid EAs can be safely
described as memetic algorithms, if only under the broad
interpretation of MAs emanating from seminal and
early works on the topic [52.9, 47]. Indeed, the
algorithmic hybridizations mentioned above can be seen
as combinations of global and local search,
probably the most widely recognized feature of MAs.
The next section provides a more detailed overview
of MAs.


52.4 Memetic Algorithms


MAs are population-based metaheuristics and as such
they keep a population of candidate solutions for the
problem under consideration. While these solutions
were called individuals in EA jargon, in the context
of MAs it is sometimes more appropriate to think of
them as agents, thus highlighting their more active
nature (i.e., behavior purposefully directed at optimizing
some problem) in contrast to the passive nature of EA
individuals (which are mere information placeholders
subject to evolutionary operations). The particular way
in which this active behavior can be captured will be
discussed later on. Algorithm 52.2 shows the general
pseudocode of a simple MA.


Algorithm 52.2 Pseudocode of a basic MA based on
a local search LS
function BasicMA (in P: Problem, in par: Parameters): Solution;
    for each index i of pop do
        pop[i] ← GENERATE-SOLUTION(P);
        pop[i] ← LOCAL-IMPROVEMENT(pop[i], P, par);
    end for
    i ← 0;
    while i < MaxEvals do
        auxpop[0] ← SELECT(pop);
        for j ← 1 to #op do
            auxpop[j] ← APPLY(op[j], auxpop[j−1], P, par);
        end for
        newpop ← LOCAL-IMPROVEMENT(auxpop[#op], P, par);
        pop ← REPLACE(pop, newpop);
        if DEGENERATED(pop) then
            RESTART(pop, P);
        end if
    end while
    return GETBEST(pop);
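
A compact, runnable reading of this scheme is sketched below on a toy problem. Everything specific in it is an assumption made for illustration: the deceptive block ("trap") fitness function, the bit-flip hill climber used as local improver, the two-operator pipeline (one-point crossover, then mutation), the plus-style replacement, the restart rule, and a budget counted in generations rather than evaluations. All function names are hypothetical.

```python
import random

K = 4  # block size of the toy trap function (assumption)


def trap_fitness(s):
    """Each block of K bits scores K if all ones, else K - 1 - (number of ones)."""
    total = 0
    for b in range(0, len(s), K):
        ones = sum(s[b:b + K])
        total += K if ones == K else K - 1 - ones
    return total


def hill_climb(s, rng):
    """LOCAL-IMPROVEMENT: accept improving single-bit flips until none is left."""
    s = list(s)
    improved = True
    while improved:
        improved = False
        for i in rng.sample(range(len(s)), len(s)):
            before = trap_fitness(s)
            s[i] ^= 1
            if trap_fitness(s) > before:
                improved = True
            else:
                s[i] ^= 1  # undo a non-improving flip
    return s


def one_point_crossover(solutions, rng):
    p1, p2 = solutions
    cut = rng.randrange(1, len(p1))
    return [p1[:cut] + p2[cut:]]


def bit_flip_mutation(solutions, rng):
    return [[g ^ 1 if rng.random() < 1.0 / len(s) else g for g in s]
            for s in solutions]


def basic_ma(n_bits=16, pop_size=10, max_gens=50, seed=0):
    rng = random.Random(seed)

    def fresh():
        # GENERATE-SOLUTION followed by LOCAL-IMPROVEMENT
        return hill_climb([rng.randint(0, 1) for _ in range(n_bits)], rng)

    operators = [one_point_crossover, bit_flip_mutation]  # the array op
    pop = [fresh() for _ in range(pop_size)]
    for _ in range(max_gens):
        # SELECT: two parents, each the winner of a binary tournament
        aux = [max(rng.sample(pop, 2), key=trap_fitness) for _ in range(2)]
        for op in operators:              # pipeline of reproductive operators
            aux = op(aux, rng)
        newpop = [hill_climb(s, rng) for s in aux]
        # REPLACE: keep the pop_size best of parents plus offspring
        pop = sorted(pop + newpop, key=trap_fitness, reverse=True)[:pop_size]
        # DEGENERATED / RESTART: keep one copy, regenerate the rest
        if len({tuple(s) for s in pop}) == 1:
            pop = pop[:1] + [fresh() for _ in range(pop_size - 1)]
    return max(pop, key=trap_fitness)
```

In this toy setting the hill climber alone stalls on blocks of all zeros, which are local optima, whereas recombination can assemble all-ones blocks found by different agents. This is the kind of global/local interplay the pseudocode is meant to capture.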


First of all, the population must be initialized. Prob- 
lem knowledge can be introduced in this stage by using 
constructive heuristics. For example, greedy strategies
based on the nearest neighbor heuristic [52.48]
could be used to generate solutions for the travel- 
ing salesman problem (TSP) — see also [52.49-51] 
for other examples in the context of scheduling and 
timetabling. Then, the population of agents is subject 
to processes of competition and mutual cooperation 
much like in EAs. Competition (i.e., selection and 
replacement) can be done in general using any of 
the well-known strategies used in EAs, e.g., tourna- 
ment, ranking, or fitness-proportionate selection, and/or 
comma replacement, etc. As for cooperation, it is ac- 
complished by using a number of reproductive opera- 
tors. Many different such operators can be used in an 
MA, as illustrated in the general pseudocode shown in 
Algorithm 52.2: an array op of operators is sequen- 
tially applied to the population in a pipeline fashion. 
Note also how these operators receive as input not 
just the solutions they act on but also problem data, 
thus emphasizing the usage of problem knowledge. 
While it is possible to consider local improvement as 
one of these operators, it plays such a distinctive role 
in most MAs that it is independently depicted in the 
pseudocode. 
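
As a concrete case of heuristic initialization, the nearest neighbor construction for the TSP can be sketched as follows. The function name, the distance-matrix representation, and the tie-breaking rule (lowest city index) are assumptions of this example.

```python
def nearest_neighbor_tour(dist, start=0):
    """Greedy TSP construction: from the current city, always move to the
    closest city not yet visited (ties broken by lowest index)."""
    n = len(dist)
    tour = [start]
    unvisited = set(range(n)) - {start}
    while unvisited:
        last = tour[-1]
        nxt = min(unvisited, key=lambda c: (dist[last][c], c))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour
```

Since the result depends on the start city, calling the function with different values of `start` is a simple way of seeding the population with several distinct, reasonably good tours.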

Recombination is the algorithmic component that 
best captures cooperation among two (or more [52.52]) 
agents in MAs. By using this operation, the relevant 
information contained in the parents is combined to 


produce new solutions. Relevance here amounts to
being significant when it comes to evaluating the
quality of solutions. As an example, consider again the
TSP. While solutions can be encoded as permuta- 
tions, a standard permutational recombination opera- 
tor will not perform adequately in general. The rea- 
son is that permutations are information-rich struc- 
tures carrying positional, precedence, and adjacency 
information [52.53]. Clearly, the latter is the really 
relevant piece of information when the TSP is in- 
volved. Hence, an edge-manipulation operator such as 
edge recombination [52.54] (ER) will perform better 
than position-based operators such as partially-mapped 
crossover [52.55] (PMX) or uniform cycle crossover 
[52.56] (UCX). There are several principled approaches 
to define measures capturing the goodness of different 
representations (that is, the way a particular encoding 
is interpreted) among which we can cite epistasis (non- 
additive influence on the fitness function of combining 
several information units) [52.57,58], fitness variance 
of formae (variance of the fitness values of a representa- 
tive subset of solutions carrying a particular information 
unit) [52.59], and fitness correlation (correlation in the 
fitness values of parents and offspring) [52.60, 61]. 
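
The difference between positional and adjacency information can be made tangible with a small sketch (hypothetical helper names, symmetric TSP assumed): two permutations may disagree at every position and still describe exactly the same tour, i.e., the same set of edges.

```python
def edge_set(tour):
    """Undirected edges of a closed tour (adjacency information)."""
    n = len(tour)
    return {frozenset((tour[k], tour[(k + 1) % n])) for k in range(n)}


def shared_edges(tour_a, tour_b):
    """Edges present in both parents."""
    return edge_set(tour_a) & edge_set(tour_b)


def shared_positions(tour_a, tour_b):
    """Number of positions holding the same city (positional information)."""
    return sum(x == y for x, y in zip(tour_a, tour_b))
```

For the tours `[0, 1, 2, 3, 4]` and `[1, 2, 3, 4, 0]`, `shared_positions` returns 0 while `shared_edges` contains all five edges. A position-based operator sees two unrelated parents, whereas an edge-based one sees identical parents.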
Mutation is the other classical reproductive oper- 
ator. Its role is that of injecting new material in the 
population (at a low rate to prevent the search
degrading to a random walk in the solution space). This view
of mutation as an important operator that is nevertheless
secondary to recombination departs from
the interpretation of the search process made in, e.g.,
evolutionary programming [52.62]. In either case,
mutation plays an important role in EAs since it favors
the effectiveness of recombination (particularly in some 
unstructured landscapes). Furthermore, if the problem 
exhibits constraints it is commonly much easier to han- 
dle these in a local way and maintain/achieve feasibility 
by introducing small perturbations in a solution than via 
recombination (e.g., consider a university timetabling 
problem [52.63]: given a feasible solution, it is easier to 
exchange a couple of slots, and keep them feasible, than 
to produce a new feasible solution that comes from the 
combination of two feasible assignments). However, it 
must be noted that unlike classical EAs, in which re- 
combination is a mere random shuffler of information 
(and hence can be arguably cast as a macro-mutational 
process), MAs usually utilize intelligent problem-aware
mechanisms for recombination, which thus plays a
crucial role in the search. Broadly speaking, this inclusion
of problem knowledge during recombination can be 
projected on two aspects of the process, namely the
selection of the pieces of information from the parents
that will be transmitted to the offspring, and the selec- 
tion of the external information that will be added to 
it. Regarding the former issue, it is commonly assumed 
that transmission of common features is beneficial for 
some problems [52.54, 64]. Further completion of the 
descendant can be done in several ways. Radcliffe and 
Surry [52.59] proposed the use of local improvers or 
implicit enumeration schemes. Cotta and Troya sug- 
gested the use of exact techniques to find the best 
way of combining the information present in the par- 
ents [52.39]. Ibaraki [52.65] and Gallardo et al. [52.66] 
used dynamic programming for this purpose. 

The use of local search (LS) is one of the most dis- 
tinctive components of MAs, to the extent that MAs 
are often equated to EAs endowed with LS. While this 
is certainly a very popular implementation of MAs, 
several authors [52.47,67,68] advocate a broader in- 
terpretation of the paradigm in which an explicit local 
search algorithm need not be present (e.g., local im- 
provement can take place during recombination as in 
the edge assembly crossover defined in [52.69] for the 
TSP). In their simplest incarnation, these local improvers
can be hill climbers, exploring the neighborhood of the 
current solution and performing uphill moves in the 
corresponding fitness landscape [52.70] until a local op- 
timum is found or the computational budget assigned 
to this operator is exhausted. Much more complex
mechanisms can be defined for this purpose, such as
full-fledged metaheuristics (for example, TS, SA, or VNS).
It must also be noted that it is mainly because of the
use of this mechanism for improving solutions on a lo- 
cal (and even autonomous) basis that the term agent 
is deserved. Under this interpretation, the MA can be 
viewed as a collection of agents that autonomously ex- 
plore the search space, cooperate via recombination, 
and compete for computational resources via selection 
and replacement. This also provides an interesting link 
to cooperative models for optimization and to memetic 
computing in general [52.71, 72]. 
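
The simplest local improver described above can be sketched as follows. This is only an illustrative first-improvement hill climber for maximization; the names `hill_climb`, `bit_flip_neighbors`, and the `budget` parameter are our own assumptions and are not taken from the cited works.

```python
def hill_climb(solution, fitness, neighbors, budget=1000):
    """First-improvement hill climber (maximization).

    Stops at a local optimum or when `budget` fitness evaluations are spent.
    """
    current, current_fit = solution, fitness(solution)
    evaluations = 1
    improved = True
    while improved and evaluations < budget:
        improved = False
        for candidate in neighbors(current):
            if evaluations >= budget:
                break
            candidate_fit = fitness(candidate)
            evaluations += 1
            if candidate_fit > current_fit:  # uphill move
                current, current_fit = candidate, candidate_fit
                improved = True
                break
    return current, current_fit


def bit_flip_neighbors(bits):
    """Yield all bit tuples that differ from `bits` in exactly one position."""
    for i in range(len(bits)):
        yield bits[:i] + (1 - bits[i],) + bits[i + 1:]


# Toy usage: maximize the number of ones in a bit tuple.
start = (0, 1, 0, 0, 1)
best, best_fit = hill_climb(start, sum, bit_flip_neighbors)
```

Within an MA, such a routine would typically be applied to some or all offspring, with `budget` controlling the intensity of the local search.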

One of the crucial elements governing the suc- 
cessful application of local search within an MA is 
achieving a good balance between global and local 
search. This amounts to determining when to apply lo- 
cal search (how often and on which solutions) and how 
intense this local search has to be. This parameteriza- 
tion problem is very hard and constitutes an active area 
of research [52.73]. An additional issue is the selection 
of a particular local search scheme within the MA. This 
has actually led to a very fruitful line of research in
so-called multimemetic algorithms (MMAs). Therein,
a meme is interpreted as a lifetime learning procedure 
capable of improving individual solutions [52.74-79]. 
Each solution in an MMA carries a gene indicating
the particular LS operator that has to be applied on 
it (a pointer to an existing operator, or the parame- 
terization of a generic local search template). Thus, 
they constitute a generalization of meta-lamarckian 
EAs [52.80] (in which the selection of the LS opera- 
tor — from a pre-fixed set — is made using some rules 
that are hard-wired into the MA) and an intermediate 
step in the direction of co-evolving MAs [52.75] (in 
which a population of LS operators co-evolve along 
with a population of solutions). Finally, it is essential 
from a purely computational perspective to be able to 
apply LS in an efficient way. As was mentioned in 
Sect. 52.1 this is normally attained in combinatorial 
problems by incrementally evaluating solutions belong- 
ing to the neighborhood area. For example, consider the 
2-opt neighborhood in the TSP [52.48]: each neighbor 
of a given solution is obtained by a 2-opt move that re- 
moves two edges and adds two new edges; the fitness of 
such a neighbor can thus be computed by taking the fit- 
ness of the initial solution and adding a term accounting 
for the difference between added edges and removed 
edges. 
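
The incremental evaluation of a 2-opt move can be illustrated with a small sketch. It assumes a symmetric distance matrix and a tour stored as a list of city indices; the function names are hypothetical and chosen only for this example.

```python
def tour_length(tour, dist):
    """Full O(n) evaluation of a closed tour."""
    n = len(tour)
    return sum(dist[tour[k]][tour[(k + 1) % n]] for k in range(n))


def two_opt_delta(tour, dist, i, j):
    """Length change when edges (tour[i], tour[i+1]) and (tour[j], tour[j+1])
    are replaced by (tour[i], tour[j]) and (tour[i+1], tour[j+1]).

    Assumes a symmetric distance matrix and 0 <= i < j < len(tour).
    """
    n = len(tour)
    a, b = tour[i], tour[i + 1]
    c, d = tour[j], tour[(j + 1) % n]
    removed = dist[a][b] + dist[c][d]
    added = dist[a][c] + dist[b][d]
    return added - removed


def apply_two_opt(tour, i, j):
    """Return the neighbor obtained by reversing the segment tour[i+1..j]."""
    return tour[: i + 1] + tour[i + 1 : j + 1][::-1] + tour[j + 1 :]


# Toy instance with four cities.
dist = [[0, 2, 9, 10], [2, 0, 6, 4], [9, 6, 0, 3], [10, 4, 3, 0]]
tour = [0, 2, 1, 3]
i, j = 0, 2
incremental = tour_length(tour, dist) + two_opt_delta(tour, dist, i, j)
```

The delta requires four matrix lookups regardless of the tour size, whereas a full re-evaluation of the neighbor is linear in the number of cities.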

Another interesting element of MAs is the restart- 
ing process invoked whenever the population is deemed 
degenerate due to a lack of diversity or any other factor 
impairing the subsequent performance of the algorithm. 
This restarting process can be done in numerous ways 
(for example, triggering hypermutation [52.81] or in- 
troducing random solutions in the population [52.82]) 
and can be often found in plain EAs as well (indeed 
the use of restarting procedures in EAs can be traced 
back to the CHC algorithm [52.83] in the early 1990s). 
This said, it constitutes a generic element to be rou- 
tinely included in MAs. Indeed, scatter search [52.84] 
(a technique that can reasonably be termed memetic, 
despite having an independent origin from MAs and 
its fair share of distinctive features such as the em- 
phasis on using deterministic strategies) has a restart 
as a crucial element in its algorithmic cycle. Note 
also that it is not unusual to have MAs without mu- 
tation, given the fact that new information can also 
be injected in the population via local search, and the 
availability of restarting mechanisms. Indeed, in some 
applications, it may be better to converge quickly and 
then restart, rather than continuously diversifying the 
search. This is not the general norm at any rate. As 
a matter of fact, one can easily find MAs that use several
mutation operators, either by considering different
basic neighborhoods [52.27,85] or by defining light 
and heavy mutations that introduce different amounts 
of new information [52.86,87] — cf. hypermutation. 
Needless to say, the use of restarting strategies is a cor- 
rective measure that is taken once a diversity problem 
is encountered and can be complemented with preventive
measures aimed at hindering (or even avoiding) this
problem in the first place. For example,
structured populations [52.88] could be used to cause
a slowdown in the propagation of information across
the population, hence hindering the appearance of super
agents that might quickly take the population over
and destroy diversity. Also, population management 
strategies based on the use of distance measures have 
been utilized with notable success in combinatorial 
problems [52.89]. More traditional strategies for main- 
taining diversity during selection and replacement, such 
as crowding [52.90] or sharing [52.91], can be used as 
well. 
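
A restart triggered by loss of diversity might be sketched as follows. This is a minimal illustration under our own assumptions: solutions are bit tuples, diversity is the normalized mean pairwise Hamming distance, and the `threshold` and `keep` parameters are hypothetical. It corresponds to the option of introducing random solutions in the population, not to any specific published procedure.

```python
import random


def mean_hamming_diversity(population):
    """Average pairwise Hamming distance, normalized to [0, 1].

    Assumes at least two individuals of equal, non-zero length.
    """
    n, length = len(population), len(population[0])
    total = sum(
        sum(a != b for a, b in zip(population[p], population[q]))
        for p in range(n)
        for q in range(p + 1, n)
    )
    pairs = n * (n - 1) // 2
    return total / (pairs * length)


def restart_if_degenerate(population, fitness, threshold=0.1, keep=1, rng=random):
    """Keep the `keep` best individuals and replace the rest with random
    bit tuples when diversity falls below `threshold`."""
    if mean_hamming_diversity(population) >= threshold:
        return population
    length = len(population[0])
    survivors = sorted(population, key=fitness, reverse=True)[:keep]
    immigrants = [
        tuple(rng.randint(0, 1) for _ in range(length))
        for _ in range(len(population) - keep)
    ]
    return survivors + immigrants
```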


52.5 Cooperative Optimization Models 


As stated in the previous section, the interpretation of MAs
as a collection of interacting agents that autonomously 
explore the search space while cooperating/competing 
with each other seamlessly integrates with the more 
general notion of memetic computing and coopera- 
tive optimization models. According to the definition 
in [52.92], memetic computing is: 


a paradigm that uses the notion of meme(s) as units 
of information encoded in computational represen- 
tations for the purpose of problem solving, 


where meme should be interpreted as local-search op- 
erator as mentioned before. This orchestration of dif- 
ferent LS operators naturally links with cooperative 
models dating back to the late 1990s [52.93]. These 
attempts to attain an effective mechanism for explor- 
ing the search space try to escape from local optima 
by combining search agents that have diverse inten- 
sification/diversification characteristics and that start 
from different points in the search space [52.94]. Ac- 
cording to [52.95], the distinctive features of this kind 
of models are (1) a collection of autonomous algo- 
rithms (agents), each of them supporting a different 
optimization method, and (2) a cooperative scheme for 
combining these autonomous elements into a unified
problem-solving strategy. 

Early cooperative models involve an algorithmically
homogeneous collection of algorithms exchanging
information. For example, Toulouse et al. [52.93]
considered a collection of TS algorithms exchanging 
tabu attributes (notice the relation of this model with 
the parallel recombinative SA algorithm mentioned in 
Sect. 52.3.2) and later proposed a hierarchical decom- 
position approach [52.96]. A related model was also 
proposed by Crainic and Gendreau [52.97]. Crainic
et al. [52.98] put forward an asynchronous cooperative
search procedure on the basis of VNS. A different approach
based on the use of a central manager was
proposed by Pelta et al. [52.99]. This central manager 
gathers information about the performance of the differ- 
ent agents and acts on them, altering their behavior — see 
also [52.100]. Other centralized approaches were de- 
fined by LeBouthillier and Crainic [52.101] by means 
of maintaining a solution warehouse upon which indi- 
vidual heuristics act. More recently, Barbucha [52.102] 
explored synchronous and asynchronous versions of 
an analogous memory-centralized approach in the con- 
text of vehicle routing problems. Leung et al. [52.103] 
proposed, in turn, a cooperative/competitive scheme in 
which the problem space is partitioned and a pool of 
agents is structured into several subgroups which repel 
each other, thus contributing to keeping diversity. 

Multi-level models have also received a lot of attention
in recent years. These models consist of
layered algorithmic approaches and are not to be con- 
fused with multilevel partitioning strategies proposed 
for combinatorial optimization [52.104, 105], in which 
the resolution of the problem is attained via its incre- 
mental reduction and further reconstruction, using some 
solver at each level and the solution obtained therein 
as seeds for solving the next higher level. Hulianytskyi 
and Sirenko [52.106] presented a two-level coopera- 
tive approach: the lower level corresponds to basic 
algorithms, whereas the upper level combines the infor- 
mation found by these and broadcasts a refined version 
back to the basic algorithms. Milano and Roli [52.4] 
developed a multiagent system called MAGMA (mul- 
tiagent metaheuristic architecture) allowing the use of 
metaheuristics at different levels (generating solutions, 
improving them, defining search strategies, and coordinating
lower-level agents). Each level (or layer)
provides a different abstraction level and can contain
several agents loaded with a particular search algorithm.
The lowest layer (level 0) generates solutions
to be fed to level 1. The latter provides local improvement
of these solutions. Level 2 has a global view of the
search space and provides the means for escaping from
local optima. The topmost level (level 3) coordinates
the functioning of the underlying agents, rewarding
those which perform well or adapting and improving
their functioning. They specifically adapt this framework
for deployment of MAs within it. Finally, Amaya
et al. [52.107] defined a multilevel model in which
heterogeneous simple MAs (i.e., MAs obtained from the
hybridization of an EA and a local search method) are
combined in a cooperative model, and exchange information
following an underlying arbitrary topology.


52.6 Conclusions


Memetic algorithms in particular, and memetic computing
in general, constitute a flexible and powerful
optimization approach. Rather than being competitors
for existing methods and/or paradigms, they are
a very suitable framework for integrating such existing
techniques in order to attain synergistic combinations
or to deal with the curse of dimensionality
in large-scale optimization settings. They
are also a very active research area in which, in addition
to a steadily growing number of application
works, new fundamental issues are attracting the interest
of the research community. Among these we
can cite the theoretical study of their self-adaptation
capabilities and their deployment on the emerging computational
platforms that are available nowadays. We
refer to [52.72, 108, 109] for recent reviews of the field.
For an overview of the literature dealing with the
application of these techniques to combinatorial optimization
problems we refer the reader to [52.67, 108]
for a general perspective and to [52.63] for a review of
scheduling and timetabling applications. Finally, we refer
the reader to [52.110, 111] for further information
on the deployment of MAs on combinatorial optimization
problems.


53. Design of Representations and Search Operators


Franz Rothlauf 


Successful and efficient use of evolutionary algo- 
rithms depends on the choice of genotypes and 
the representation — that is, the mapping from 
genotype to phenotype — and on the choice of 
search operators that are applied to the genotypes. 
These choices cannot be made independently of 
each other. This chapter gives recommendations 
on the design of representations and correspond- 
ing search operators and discusses how to consider 
problem-specific knowledge. For most problems in 
the real world, similar solutions have similar fitness 
values. This fact can be exploited by evolutionary 
algorithms if they ensure that the representa- 
tions and search operators used are defined in 
such a way that similarities between phenotypes 
correspond to similarities between genotypes. 
Furthermore, the performance of evolutionary al- 
gorithms can be increased by problem-specific 
knowledge. We discuss how properties of high- 
quality solutions can be exploited by biasing 
representations and search operators. 


53.1 Representations 


Successful and efficient use of evolutionary algorithms 
(EA) and other types of modern heuristics [53.1, 2] 
depends on the choice of genotypes and the repre- 
sentation — that is, the mapping from genotype to 
phenotype — and on the choice of search operators that 
are applied to the genotypes. These choices cannot be 
made independently of each other [53.2]. The ques- 
tion whether a certain representation leads to a better 
performing EA than an alternative representation can 
only be answered when the operators applied are taken 
into account. The reverse is also true: deciding between 
alternative operators is only meaningful for a given 
representation. 

In practice, one can distinguish two complementary
approaches to the design of representations and
search operators [53.3]. The first approach defines
representations (also known as decoders or indirect 
representations) where a solution is encoded in a stan- 
dard data structure, such as strings or vectors, and 
applies standard off-the-shelf search operators to these 
genotypes. To evaluate a solution, the genotype needs 
to be mapped to the phenotype space. The proper 
choice of this genotype-phenotype mapping is impor- 
tant for the performance of the search process. The 
second approach encodes solutions to the problem in 
its most natural problem space and designs search op- 
erators to operate on this search space. In this case, 
often no additional mapping between genotypes and 
phenotypes is necessary, but domain-specific search operators
need to be defined. The resulting combination
of representation and operator is often called direct
representation. 

This section focuses on representations. It intro- 
duces genotypes and phenotypes (Sect. 53.1.1) and 
discusses properties of the resulting genotype and phe- 
notype space (Sect. 53.1.2). Section 53.1.3 lists the 
benefits of using (indirect) representations. Finally, 
Sect. 53.1.4 gives an overview of standard genotypes. 


53.1.1 Genotypes and Phenotypes 


In 1866, Mendel recognized that nature stores the com- 
plete genetic information for an individual in pairwise 
alleles [53.4]. The genetic information that determines 
the properties, appearance, and shape of an individual 
is stored by a number of strings. Later, it was discov- 
ered that the genetic information is formed by a double 
string of four nucleotides, called DNA (deoxyribonu- 
cleic acid). 

Mendel found that nature distinguishes between the 
genetic code of an individual and its outward appear- 
ance. The genotype represents all the information stored 
in the chromosomes and allows us to describe an indi- 
vidual on the level of genes. The phenotype describes 
the outward appearance of an individual. A transfor- 
mation exists — a genotype-phenotype mapping or 
a representation — that uses the genotype information 
to construct the phenotype. To represent the large num- 
ber of possible phenotypes with only four nucleotides, 
the genotype information is not stored in the allele it- 
self, but in the sequence of alleles. By interpreting the 
sequence of alleles, nature can encode a large number 
of different phenotypes using only a few different types 
of alleles. 

Figure 53.1 illustrates the differences between chro- 
mosome, gene, and allele. A chromosome is a string of 
some length $l$ where all the genetic information of an
individual is stored. Although nature often uses more 
than one chromosome, many EAs use only one chro- 
mosome for encoding all phenotype information. Each 
chromosome consists of many alleles. Alleles are the 
smallest information units in a chromosome. In nature, 
alleles exist pairwise, whereas in most EA implementations
an allele is represented by only one symbol. For
example, binary genotypes only have alleles with value
zero or one. If a phenotypic property of an individual
(solution), like its hair color or eye size, is determined
by one or more alleles, then these alleles together are
called a gene. A gene is a region on a chromosome that
must be interpreted together and which is responsible
for a specific property of a phenotype.

Fig. 53.1 Alleles, genes, and chromosomes

We must carefully distinguish between genotypes 
and phenotypes. The phenotypic appearance of a so- 
lution determines its objective value. Therefore, when 
comparing the quality of different solutions, we must 
judge them on the phenotype level. However, when it 
comes to the application of variation operators we must 
view solutions on the genotype level. New solutions that
are created using variation operators do not inherit the
phenotypic properties of their parents, but only the genotype
information regarding the phenotypic properties.
Therefore, search operators work on the genotype level, 
whereas the evaluation of the solutions is performed on 
the phenotype level. 

Formally, we define $\Phi_g$ as the genotype space where
the variation operators are applied. An optimization
problem on $\Phi_g$ could be formulated as $f(x): \Phi_g \to \mathbb{R}$,
where $f$ assigns an element (fitness value) in $\mathbb{R}$
to every element in the genotype space $\Phi_g$. A maximization
problem is defined as finding the optimal
solution $x^* = \{x \in \Phi_g \mid \forall y \in \Phi_g : f(y) \le f(x)\}$, where $x$
is usually a vector or string of decision variables (alleles)
and $f(x)$ is the objective or fitness function. $x^*$
is the global maximum. To be able to apply EAs to
a problem, the inverse function $f^{-1}$ does not need to
exist.


53.1.2 Genotype 
and Phenotype Search Spaces 


When using a representation, we have to define,
in analogy to nature, genotypes and a genotype-phenotype
mapping [53.5, 6]. Therefore, the fitness
function $f$ can be decomposed into two parts. $f_g$ maps
the genotype space $\Phi_g$ to the phenotype space $\Phi_p$, and
$f_p$ maps $\Phi_p$ to the fitness space $\mathbb{R}$

$$
\begin{aligned}
f_g(x^g) &: \Phi_g \to \Phi_p , \\
f_p(x^p) &: \Phi_p \to \mathbb{R} ,
\end{aligned}
\tag{53.1}
$$

where $f = f_p \circ f_g = f_p(f_g(x^g))$. The genotype-phenotype
mapping $f_g$ is determined by the type of genotype used.
$f_p$ represents the fitness function and assigns a fitness
value $f_p(x^p)$ to each solution $x^p \in \Phi_p$. The search operators
are applied to the genotypes [53.7, 8].
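
The decomposition of the fitness function can be made concrete with a small sketch. The binary-to-integer mapping and the toy objective below are our own illustrative assumptions; only the structure $f = f_p \circ f_g$ is taken from the text.

```python
def f_g(genotype):
    """Genotype-phenotype mapping: tuple of bits -> non-negative integer."""
    return int("".join(str(bit) for bit in genotype), 2)


def f_p(phenotype):
    """Fitness function on phenotypes (toy objective with its maximum at 5)."""
    return -(phenotype - 5) ** 2


def f(genotype):
    """Composite fitness f = f_p(f_g(x^g)), evaluated on a genotype."""
    return f_p(f_g(genotype))
```

Search operators would manipulate the bit tuples only, while the quality of a solution is judged after decoding it with `f_g`.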

The search space describes the set of feasible solutions
of an optimization problem and defines relationships
(for example, distances) between solutions.
A metric defined on a search space can be used for
measuring similarities between solutions [53.2]. Usually,
defining a search space $\Phi$ also defines a metric.
Using a metric, the distance $d(x, y)$ between two solutions
$x, y \in \Phi$ measures how different the two solutions
are. The larger the distance, the more different two
individuals are with respect to the metric used. In
principle, different metrics can be used for the same
search space. Different metrics result in different distances
and different measurements for the similarity of
solutions.

In metric search spaces, the similarities between solutions
are measured by a distance. Therefore, we have
a set $X$ of solutions and a real-valued distance function
(also called a metric) $d: X \times X \to \mathbb{R}$ that assigns
a real-valued distance to any combination of two elements
$x, y \in X$.

An example of a metric space is the set of real numbers
$\mathbb{R}$. Here, a metric can be defined by $d(x, y) := |x - y|$.
Therefore, the distance between any solutions
$x, y \in \mathbb{R}$ is the absolute value of their difference.
Extending this definition to two-dimensional spaces $\mathbb{R}^2$,
we obtain the city-block metric (also known as the taxicab
metric or the Manhattan distance). It is defined for
two-dimensional spaces as $d(x, y) := |x_1 - y_1| + |x_2 - y_2|$,
where $x = (x_1, x_2)$ and $y = (y_1, y_2)$. This metric is
named the city-block metric as it describes the distance
between two points on a two-dimensional plane
in a city like Manhattan or Mannheim with a rectangular
ground plan. On $n$-dimensional search spaces $\mathbb{R}^n$,
the city-block metric becomes $d(x, y) := \sum_{i=1}^{n} |x_i - y_i|$,
where $x, y \in \mathbb{R}^n$.

Another example of a metric that can be defined on
$\mathbb{R}^n$ is the Euclidean metric. In Euclidean spaces, a solution
$x = (x_1, \ldots, x_n)$ is a vector of continuous values
($x_i \in \mathbb{R}$). The Euclidean distance between two solutions
$x$ and $y$ is defined as $d(x, y) := \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$.
For $n = 1$, the Euclidean metric coincides with the
city-block metric. For $n = 2$, we have a standard two-dimensional
search space and the distance between two
elements $x, y \in \mathbb{R}^2$ is just the length of the direct line
between the two points on a two-dimensional plane.
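
Both metrics are straightforward to compute. The following sketch uses hypothetical function names and assumes that solutions are given as equal-length sequences of numbers.

```python
import math


def city_block(x, y):
    """City-block (Manhattan) distance: sum of absolute coordinate differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))


def euclidean(x, y):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))
```

For the points (0, 0) and (3, 4), the city-block distance is 7 while the Euclidean distance is 5; for one-dimensional solutions both functions return the same value.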

If we assume that we have a binary space ($x \in \{0, 1\}^n$),
a commonly used metric is the binary Hamming
metric [53.9] $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$, where
$d(x, y) \in \{0, \ldots, n\}$. The binary Hamming distance between
two binary vectors $x$ and $y$ of length $n$ is just the
number of binary decision variables on which $x$ and $y$
differ. For continuous and discrete decision variables, it
becomes $d(x, y) = \sum_{i=1}^{n} z_i$, where

$$
z_i =
\begin{cases}
0 , & \text{for } x_i = y_i , \\
1 , & \text{for } x_i \ne y_i .
\end{cases}
\tag{53.2}
$$

In general, the Hamming distance measures the number 
of decision variables on which x and y differ. Two in- 
dividuals are neighbors if the distance between them is 
minimal. For the binary Hamming metric, the minimal 
distance between two individuals is $d_{\min} = 1$. Therefore,
two individuals $x$ and $y$ are neighbors if their
distance $d(x, y) = 1$.
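
The general Hamming distance and the resulting neighborhood test can be sketched as follows. The function names are our own; the code works for any equal-length sequences whose elements can be compared for equality.

```python
def hamming(x, y):
    """Number of positions at which the sequences x and y differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance requires equal-length solutions")
    return sum(xi != yi for xi, yi in zip(x, y))


def are_neighbors(x, y):
    """Two solutions are neighbors if their Hamming distance is minimal (1)."""
    return hamming(x, y) == 1
```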

Using the Euclidean or the Hamming metric only 
makes sense for measuring distances between solutions 
of the same length n. If solutions have different lengths, 
the Levenshtein distance (or edit distance) [53.10] can 
be used. This distance counts the minimum number 
of insertion, deletion, or substitution operations that 
transform one solution into the other. The Levenshtein 
distance between two solutions can be calculated with 
polynomial effort using dynamic programming [53.11]. 
For solutions of equal length, the Levenshtein distance is
bounded from above by the Hamming distance, and the two
coincide when only substitutions are allowed.
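
The dynamic programming computation of the Levenshtein distance can be sketched as follows. This is the standard two-row formulation with unit costs for insertion, deletion, and substitution; the function name is our own.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions that
    transform sequence a into sequence b (O(len(a) * len(b)) time)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        current = [i]  # transforming a[:i] into the empty sequence
        for j, item_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (item_a != item_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]
```

Only two rows of the dynamic programming table are kept in memory, so the space requirement is linear in the length of `b`.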

Using a representation $f_g$, we obtain two different
search spaces, $\Phi_g$ and $\Phi_p$. Therefore, different
metrics can be defined for the phenotype and the
genotype space. The metric used on the phenotype
search space $\Phi_p$ is usually determined by the specific
problem to be solved and describes which problem
solutions are similar to each other. Examples of common
phenotypes and corresponding metrics are given
in Sect. 53.2.5. In contrast, the metric defined on $\Phi_g$
is not defined by the specific problem but can be
defined by the search operators selected for the optimization
method. As we can define different types of
genotypes to represent the phenotypes, we are able to
define different metrics on $\Phi_g$. However, if the metrics
on $\Phi_g$ and $\Phi_p$ are different, different neighborhoods
can exist on $\Phi_g$ and $\Phi_p$. For example, when encoding
phenotype integers using genotype bitstrings, the phenotype
$x^p = 5$ has two neighbors, $y^p = 6$ and $z^p = 4$.
When using the Hamming metric and binary genotypes,
the corresponding binary string $x^g = 101$ has
three different neighbors, $y^g = 001$, $z^g = 111$, and $w^g = 100$ [53.12].
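
The mismatch between the two neighborhoods in this example can be reproduced with a few lines of code. The sketch assumes the standard binary encoding of non-negative integers and uses hypothetical function names.

```python
def phenotype_neighbors(x_p):
    """Integers at distance 1 from x_p under d(x, y) = |x - y|."""
    return {x_p - 1, x_p + 1}


def genotype_neighbors(x_p, length):
    """Integers whose `length`-bit binary encoding differs from that of
    x_p in exactly one bit (Hamming distance 1)."""
    return {x_p ^ (1 << k) for k in range(length)}
```

For the phenotype 5 encoded as 101, the phenotype neighbors are 4 and 6, whereas flipping one bit yields 001, 111, and 100, that is, the integers 1, 7, and 4. Only one of the two phenotypic neighbors is also a genotypic neighbor.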

Therefore, the metric on the genotype space should 
be chosen such that it fits the metric on the phenotype 
space well. A representation introduces an additional 
genotype-phenotype mapping and thus modifies the fit.
When designing optimization methods, we have to ensure
that the metric on the genotype search space fits the
original problem metric. We should choose the geno- 
type metric in such a way that phenotypic neighbors 
remain neighbors in the genotype search space. Repre- 
sentations that ensure that neighboring phenotypes are 
also neighboring genotypes are called high-locality rep- 
resentations (Sect. 53.3.1). 


53.1.3 Benefits of Representations 


In principle, a representation is not necessary for the 
application of EAs as search operators may also be 
directly applied to phenotypes. However, the use of 
an additional genotype-phenotype mapping has some 
benefits: 


• The use of representations is necessary for problems
where a phenotype cannot be depicted as
a string or in another way that is accessible to
variation operators. A representative example is the
shape of an object, for example the wing of an airplane.
EAs that are used to find the optimal shape
usually require a representation as the direct application
of search operators to the shape of a wing is
difficult. Therefore, additional genotype-phenotype
mappings are used and variation operators are applied
to genotypes that indirectly determine the
shape.

• The introduction of a representation can be useful
if there are constraints or restrictions on the phenotype
space that can be advantageously modeled by
a specific encoding. An example is a tree problem
where the optimal solution is a star. Instead of
applying search operators directly to trees, we can
introduce genotypes that only encode stars resulting
in a much smaller search space.

• The use of the same genotypes for different types
of problems, and only interpreting them differently
by using a different genotype-phenotype mapping,
allows us to use standard search operators
(Sect. 53.2.5) with known properties. In this case,
we do not need to develop new operators with
unknown properties and behavior.

• Finally, using an additional genotype-phenotype
mapping can change the difficulty of a problem.
A representation can reduce the difficulty of the
problem and make it easier to solve for a particular
optimization method. However, usually the definition
of a proper representation is difficult and
problem specific.


53.1.4 Standard Genotypes 


We characterize some of the most important and widely 
used genotypes. For a more detailed overview of differ- 
ent types of genotypes, we refer to [53.13, Sect. C1]. 


Binary Genotypes 
Binary genotypes are commonly used in genetic al- 
gorithms [53.14, 15]. Such EA types use recombina- 
tion as the main search operator and mutation only 
serves as background noise. A typical search space is
$\Phi_g = \{0, 1\}^l$, where $l$ is the length of a binary vector
$x^g = (x^g_1, \ldots, x^g_l)$. The genotype-phenotype mapping $f_g$
depends on the specific optimization problem to be
solved. For many combinatorial optimization problems 
using binary genotypes allows a direct and very natural 
encoding. 

When using binary genotypes for encoding integer
phenotypes, specific genotype-phenotype mappings are
necessary. Different types of binary representations for
integers assign the integers $x^p \in \Phi_p$ (phenotypes) in different
ways to the binary vectors $x^g \in \Phi_g$ (genotypes).
The most common binary genotype-phenotype mappings
are binary, Gray, and unary encoding [53.3, 16,
Chap. 5].
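
The three mappings can be sketched as decoding functions from bit tuples to integers. The sketch assumes the most significant bit first, the reflected Gray code, and a unary encoding in which the phenotype is the number of ones in the string; these conventions and the function names are our own assumptions for illustration.

```python
def decode_binary(bits):
    """Standard binary encoding, most significant bit first."""
    value = 0
    for bit in bits:
        value = 2 * value + bit
    return value


def decode_gray(bits):
    """Reflected Gray code: binary bit k is the XOR of Gray bits 0..k."""
    value, running = 0, 0
    for bit in bits:
        running ^= bit
        value = 2 * value + running
    return value


def decode_unary(bits):
    """Unary encoding: the phenotype is the number of ones (redundant mapping)."""
    return sum(bits)
```

Under the Gray code, the integers 4, 5, and 6 are encoded as 110, 111, and 101, so that neighboring integers differ in exactly one bit.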

When using binary genotypes to encode continu- 
ous phenotypes, the accuracy (precision) depends on 
the number of bits that represent one phenotype vari- 
able. By increasing the number of bits that are used to 
represent one continuous variable the accuracy of the 
representation can be increased. 
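
The dependence of the precision on the number of bits can be illustrated with a simple linear decoding. The interval bounds `lower` and `upper` and the function names are assumptions made for this sketch; other scalings are possible.

```python
def decode_real(bits, lower, upper):
    """Map a bit tuple linearly onto the interval [lower, upper]."""
    steps = 2 ** len(bits) - 1
    index = int("".join(map(str, bits)), 2)
    return lower + (upper - lower) * index / steps


def resolution(n_bits, lower, upper):
    """Distance between two adjacent representable values."""
    return (upper - lower) / (2 ** n_bits - 1)
```

Each additional bit roughly halves the distance between adjacent representable values, at the price of a search space that doubles in size.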


Integer Genotypes 
Instead of using binary strings with cardinality $\chi = 2$,
higher $\chi$-ary alphabets, where $\{\chi \in \mathbb{N} \mid \chi > 2\}$, can
also be used for the genotypes. Then, instead of a binary
alphabet a $\chi$-ary alphabet is used for a string of
length $l$. Instead of encoding $2^l$ different individuals
with a binary alphabet, we are able to encode $\chi^l$ different
possibilities. The size of the search space increases
from $|\Phi_g| = 2^l$ to $|\Phi_g| = \chi^l$.

For integer problems, users sometimes prefer to use 
binary instead of integer genotypes because schema 
processing is maximally efficient with binary alphabets 
when using standard recombination operators in genetic 
algorithms [53.17]. Goldberg [53.17] qualified this rec- 
ommendation and emphasized that the alphabet used 
in the encoding should be as small as possible while 
still allowing a natural representation of solutions. To 
give general recommendations is difficult, as users often 
do not know a priori whether binary genotypes allow
a natural encoding of integer phenotypes [53.18, 19].
We recommend that users use binary genotypes for en- 
coding binary decision variables and integer genotypes 
for integer decision variables. 


Continuous Genotypes 
When using continuous genotypes, the search space 
is $\Phi_g = \mathbb{R}^l$, where $l$ is the size of a real-valued
string or vector. Continuous genotypes are often 
used in local search methods like evolution strate- 
gies or evolutionary programming. These types of 
optimization methods mainly rely on local search 
and search through the search space by adding 
a multivariate zero-mean Gaussian random variable 
to each continuous variable. In contrast, when us- 
ing recombination-based genetic algorithms, continu- 
ous decision variables are often represented by using 
binary genotypes. 
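
The Gaussian perturbation used with continuous genotypes can be sketched as follows. For simplicity, the sketch assumes independent components with a common, fixed standard deviation `sigma`; evolution strategies usually adapt the step sizes during the run, which is not shown here. The function name is hypothetical.

```python
import random


def gaussian_mutation(x, sigma=0.1, rng=random):
    """Add zero-mean Gaussian noise with standard deviation `sigma`
    independently to each continuous decision variable."""
    return [xi + rng.gauss(0.0, sigma) for xi in x]
```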

Continuous genotypes can be used not only for encoding
continuous problems, but also for permutation
and combinatorial problems. Trees, schedules, tours, or
other combinatorial problems can easily be represented
by using continuous genotypes and special genotype-phenotype
mappings (for an example, see weighted
encodings for trees [53.20, 21]).


Messy Representations
In all previously discussed genotypes, the position of
each allele is fixed along the chromosome and only
the corresponding value is specified. A first gene-independent
genotype was proposed by [53.22], where
an inversion operator changes the relative order of the
alleles in the string. The position of an allele and the
corresponding value are coded together as a tuple in
a string. This concept can be used for all types of
genotypes such as binary, integer, and real-valued alleles,
and allows an encoding which is independent of
the position of the alleles in the chromosome. Later,
Goldberg et al. [53.23] used this position-independent
representation for the messy genetic algorithm.


53.2 Search Operators


This section distinguishes between standard search operators,
which are applied to genotypes, and problem-specific
search operators that can also be applied
to phenotypes (often called direct representations).
We start with an overview of general design guidelines.
Sections 53.2.2 and 53.2.3 discuss local and
recombination-based search operators. In Sect. 53.2.4,
we focus on direct representations, where search operators
are directly applied to phenotypes and no
explicit genotype-phenotype mapping exists. Finally,
Sect. 53.2.5 gives an overview of standard search
operators.


53.2.1 General Design Guidelines


During the 1990s, Radcliffe developed guidelines for
the design of search operators. It is important for search
operators that the representation used is taken into account
as search operators are based on the metric that
is defined on the genotype space. Radcliffe introduced
the principle of formae, which are subsets of the search
space [53.24-29]. Formae are defined as equivalence
classes that are induced by a set of equivalence relations.
Any possible solution of an optimization problem
can be identified by specifying the equivalence class
to which it belongs for each of the equivalence relations.
For example, if we have a search space of
faces [53.30], basic equivalence relations might be
same hair color or same eye color, which would induce
the formae red hair, dark hair, green eyes, etc.
Formae of higher order like red hair and green eyes
are then constructed by composing simple formae. The
search space, which includes all possible faces, can be
constructed with strings of alleles that represent the different
formae. For the definition of formae, the structure
of the phenotypes is relevant. For example, for binary
problems, possible formae would be bit i is equal to
one/zero.

It is an unsolved problem to find appropriate
equivalences for a particular problem. From the
equivalences, the genotype search space $\Phi_g$ and the
genotype-phenotype mapping $f_g$ can be constructed. Usually,
a solution is encoded as a string of alleles. The 
value of an allele indicates whether the solution sat- 
isfies a particular equivalence. Radcliffe [53.25] pro- 
posed several design guidelines for creating appro- 
priate equivalences for a given problem. The most 
important one is that the generated formae should 
group together solutions of related fitness [53.28], 
in order to create a fitness landscape or structure 
of the search space that can be exploited by search 
operators.

Radcliffe recognized that the genotype search space,
the genotype-phenotype mapping, and the search
operators belong together, and their design cannot be
separated from each other [53.26]. He assumed that 
search operators create offspring solutions from a set 
of parent solutions. For the development of appropriate 
search operators that are based on predefined formae, he 
formulated the following four design principles [53.25, 
29]: 


• Respect: offspring produced by recombination
should be members of all formae to which both their
parents belong. For the face example this means that
offspring should have red hair and green eyes if both
parents have red hair and green eyes.

• Transmission: an offspring should be equivalent to
at least one of its parents under each of the basic
equivalence relations. This means that every gene
should be set to an allele which is taken from one
of the parents. If one parent has dark hair and the
other red hair, then the offspring has either dark or
red hair.

• Assortment: an offspring can be formed with any
compatible characteristics taken from the parents.
Assortment is necessary as some combinations
of equivalence relations may be infeasible. This
means, for example, that the offspring inherits dark
hair from the first parent and blue eyes from the
second parent only if dark hair and blue eyes are
compatible. Otherwise, the alleles are set to
feasible values taken from a random parent.

• Ergodicity: an iterative use of search operators
allows us to reach any point in the search space from
all possible starting solutions.
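
As a rough illustration of the respect and transmission principles, the following sketch checks both properties for a uniform crossover on string-like genotypes. The two-gene face genotype (hair color, eye color) and all function names are assumptions made for this example; they are not part of Radcliffe's formal framework.

```python
import random

def uniform_crossover(p1, p2, rng):
    """Pick each gene from one of the two parents with probability 1/2."""
    return [rng.choice(pair) for pair in zip(p1, p2)]

def respects(p1, p2, child):
    """Respect: wherever both parents agree, the child carries that allele."""
    return all(c == a for a, b, c in zip(p1, p2, child) if a == b)

def transmits(p1, p2, child):
    """Transmission: every allele of the child is taken from a parent."""
    return all(c in (a, b) for a, b, c in zip(p1, p2, child))

# Hypothetical face genotype: (hair color, eye color).
parent1 = ["red", "green"]
parent2 = ["dark", "green"]
child = uniform_crossover(parent1, parent2, random.Random(0))
print(child, respects(parent1, parent2, child), transmits(parent1, parent2, child))
```

Because uniform crossover only copies alleles from the parents, both checks hold for every random seed; an operator that assigns fresh random values to genes would violate them.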


Radcliffe developed a consistent concept of how 
to design efficient EAs once appropriate equivalence 
classes (formae) are defined. However, finding
appropriate equivalence classes, which is equivalent
to either defining the genotype search space and
the genotype-phenotype mapping or appropriate direct
search operators on the phenotypes, is often difficult
and remains an unsolved problem.

As long as the genotypes are either binary, inte- 
ger, or real-valued strings, standard recombination and 
mutation operators can be used. The situation is differ- 
ent if direct representations (Sect. 53.2.4) are used for 
problems whose phenotypes are not binary, integer, or 
real-valued. Specialized operators are necessary that al- 
low offspring to inherit important properties from their 
parents [53.24, 25, 27, 31]. In general, these operators
are problem-specific and must be developed separately
for every optimization problem.


53.2.2 Local Search Operators 


Local search and the use of local search operators 
are at the core of EAs. The goal of local search is 
to find fitter individuals by performing neighborhood 
search [53.32]. Usually, a local search operator creates 
offspring that have a small or sometimes even mini- 
mal distance to their parents. Therefore, local search 
operators and the metric on the corresponding search 
space cannot be decided independently of each other 
but determine each other. A metric defines possible 
local search operators and a local search operator
determines the metric. As search operators are applied to
the genotypes, the metric on $\Phi_g$ is relevant for the
definition of local search operators.

The basic idea behind using local search operators 
is that the structure of a fitness landscape should guide 
a search heuristic to high-quality solutions [53.33], and 
that good solutions can be found by performing small 
iterated changes. We assume that in most real-world 
problems high-quality solutions are not isolated in the 
search space but grouped together [53.34, 35]. There- 
fore, better solutions can be found by searching in the 
neighborhood of already found good solutions. The 
search steps must be small because too large search 
steps would result in randomization of the search, and 
guided search around good solutions would become 
impossible. In contrast, when using search operators 
that perform large steps in the search space it would 
not be possible to find better solutions by searching 
around already found good solutions but the search al- 
gorithm would jump randomly around the search space 
(Sect. 53.3.1). 

The following paragraphs review some common lo- 
cal search operators for binary, integer, and continuous 
genotypes and illustrate how they are designed based on 
the underlying metric. The local search operators (and 
underlying metrics) are commonly used and are usu- 
ally a good choice. However, in principle, we are free to 
choose other metrics and to define corresponding search 
operators. Then, the metric should be chosen such that 
high-quality solutions are neighboring solutions and the 
resulting fitness landscape leads guided search methods 
to an optimal solution. The choice of a proper metric
and corresponding search operators is always problem
specific and the ultimate goal is to choose a metric such
that the problem becomes easy for EAs. However, we
want to emphasize that for most practical applications
the illustrated search operators are a good choice and
allow us to design efficient and effective EAs.


Binary Genotypes 
When using binary genotypes, the distance between
two solutions $x, y \in \{0,1\}^l$ is often measured using the
Hamming distance. Local search operators based on
this metric generate new solutions with Hamming
distance $d(x,y) = 1$. This type of search operator is
also known as a standard mutation operator for binary
strings or a bit-flipping operator. As each binary
solution of length $l$ has $l$ neighbors, this search operator can
create $l$ different offspring. For example, applying the
bit-flipping operator to (0,0,0,0) can result in four
different offspring: (1,0,0,0), (0,1,0,0), (0,0,1,0), and
(0,0,0,1).
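
A minimal sketch of the bit-flipping neighborhood, assuming genotypes are stored as tuples of 0/1 values (the function names are illustrative only):

```python
def hamming(x, y):
    """Number of positions in which two equal-length strings differ."""
    return sum(a != b for a, b in zip(x, y))

def bit_flip_neighbors(x):
    """All solutions at Hamming distance 1 from x (one flipped bit each)."""
    return [x[:i] + (1 - x[i],) + x[i + 1:] for i in range(len(x))]

neighbors = bit_flip_neighbors((0, 0, 0, 0))
print(neighbors)
```

A mutation operator would pick one of these neighbors at random; enumerating them all makes the neighborhood size l explicit.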

Reeves [53.36] proposed another local search
operator for binary strings based on a different
neighborhood definition: for a randomly chosen $k \in \{0,\ldots,l\}$,
it complements the bits $x_k,\ldots,x_l$. Again, each
solution has $l$ neighbors. For example, applying this
search operator to (0,0,0,0) can result in (1,1,1,1),
(0,1,1,1), (0,0,1,1), or (0,0,0,1). Although the
operator is of minor practical importance, it has some
interesting theoretical properties. First, it is closely
related to one-point crossover (see
below) as it chooses a random point and inverts all
$x_i$ with $i \ge k$. Therefore, it has also been called the
complementary crossover operator. Second, if all
genotypes are encoded using Gray code [53.37, 38], the
neighbors of a solution in the Gray-coded search space 
using Hamming distance are identical to the neigh- 
bors in the original binary-coded search space using the 
complementary crossover operator. Therefore, Ham- 
ming distances between Gray encoded solutions are 
equivalent to the distances between the original binary 
encoded solutions using the metric induced by the com- 
plementary crossover operator (neighboring solutions 
have distance one). For more information regarding the 
equivalence of different neighborhood definitions and 
search operators we refer to the literature [53.36, 39, 
40]. 
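
The stated equivalence can be checked by brute force for short strings. The sketch below assumes the binary-reflected Gray code with the most significant bit first and lets the complementing position run over the l positions of the string; both choices, and all function names, are assumptions made for this illustration.

```python
from itertools import product

def complement_suffix_neighbors(b):
    """Complement all bits from position k to the end, for every position k."""
    return {b[:k] + tuple(1 - v for v in b[k:]) for k in range(len(b))}

def gray_encode(b):
    """Binary-reflected Gray code, most significant bit first."""
    return tuple(b[i] if i == 0 else b[i - 1] ^ b[i] for i in range(len(b)))

def gray_decode(g):
    """Inverse of gray_encode: each binary bit is the XOR of a Gray prefix."""
    out, acc = [], 0
    for bit in g:
        acc ^= bit
        out.append(acc)
    return tuple(out)

def gray_hamming_neighbors(b):
    """Flip one bit of the Gray-coded string and decode back to binary."""
    g = gray_encode(b)
    return {gray_decode(g[:i] + (1 - g[i],) + g[i + 1:]) for i in range(len(g))}

same = all(complement_suffix_neighbors(b) == gray_hamming_neighbors(b)
           for b in product((0, 1), repeat=4))
print(same)
```

Flipping a single Gray bit changes every decoded binary bit from that position onward, which is exactly the move of the complementary crossover operator.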


Integer Genotypes 
For integer genotypes, different metrics are common,
leading to different local search operators. When
using the binary Hamming metric, two individuals are
neighbors if they differ in one decision variable. Search
operators based on this metric can assign a random
value to a randomly chosen allele. Therefore, each
solution $x \in \{0,\ldots,k\}^l$ has $lk$ neighbors. For example,
$x = (0,0)$ with $x_i \in \{0,1,2\}$ has four different
neighbors: (1,0), (2,0), (0,1), and (0,2).

The situation changes when defining local search
operators based on the city-block metric. Then, a
local search operator can create new solutions by slightly
increasing or decreasing one randomly chosen
decision variable. For example, new solutions are
generated by adding $\pm 1$ to a randomly chosen
variable $x_i$. Each solution of length $l$ has a maximum
of $2l$ different neighbors. For example, $x = (0,0)$ with
$x_i \in \{0,1,2\}$ has only two different neighbors, (0,1)
and (1,0).

Finally, we can define search operators such that
they do not modify values of decision variables but
exchange values of two decision variables $x_i$ and $x_j$.
Therefore, using the Hamming distance, two neighbors
have distance $d = 2$ and each solution has a maximum
of $\binom{l}{2}$ different neighbors. For example, $x = (3,5,2)$
has three different neighbors: (5,3,2), (2,5,3), and
(3,2,5).
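
The three integer neighborhoods can be sketched as follows. The tuple representation and the function names are assumptions of this example, and each variable is taken to range over {0, ..., k} as in the text.

```python
def hamming_neighbors(x, k):
    """Assign any other value from {0..k} to one variable: l*k neighbors."""
    return [x[:i] + (v,) + x[i + 1:]
            for i in range(len(x)) for v in range(k + 1) if v != x[i]]

def city_block_neighbors(x, k):
    """Add +1 or -1 to one variable, staying inside {0..k}: at most 2l neighbors."""
    return [x[:i] + (x[i] + s,) + x[i + 1:]
            for i in range(len(x)) for s in (1, -1) if 0 <= x[i] + s <= k]

def swap_neighbors(x):
    """Exchange the values of two variables: at most l(l-1)/2 neighbors."""
    out = []
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            y = list(x)
            y[i], y[j] = y[j], y[i]
            out.append(tuple(y))
    return out

print(hamming_neighbors((0, 0), 2))
print(city_block_neighbors((0, 0), 2))
print(swap_neighbors((3, 5, 2)))
```

The three calls reproduce the examples given above: four, two, and three neighbors, respectively.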


Continuous Genotypes 
For continuous genotypes, we can define local search
operators analogously to integer genotypes. Based on
the binary Hamming metric, the application of a
local search operator can assign a random value
$x_i \in [x_{i,\min}, x_{i,\max}]$ to the $i$-th decision variable. Furthermore,
we can define a local search operator such that it
exchanges the values of two decision variables $x_i$ and $x_j$.
The binary Hamming distance between old and new
solutions is $d = 2$.

The situation is a little more complex in compari- 
son to integer genotypes when designing a local search 
operator based on the city-block metric. We must de- 
fine a search operator such that its iterative application 
allows us to reach all solutions in reasonable time. 
Therefore, a search step should be not too small (we 
want to have some progress in search) and not too large 
(the offspring should be similar to the parent solution). 
A commonly used concept for such search operators is
to add a random variable with zero mean to the
decision variables. This results in $x'_i = x_i + m$, where $m$
is a random variable and $x'$ is the offspring generated
from $x$. Sometimes $m$ is uniformly distributed in $[-a, a]$,
where $a < (x_{i,\max} - x_{i,\min})$. More common is the use of
a normal distribution $N(0, \sigma)$ with zero mean and
standard deviation $\sigma$. The addition of zero-mean Gaussian
random variables generates offspring that have, on aver- 
age, the same statistical properties as their parents. For 
more information on local search operators for contin- 
uous variables, we refer to [53.41]. 
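
A minimal sketch of such a Gaussian local search step is given below. Clipping the result to the variable bounds is an assumption of this example (the text does not say how bound violations are handled), and the function and parameter names are illustrative.

```python
import random

def gaussian_mutation(x, sigma, bounds, rng):
    """Add N(0, sigma) noise to every variable and clip to [lo, hi]."""
    return [min(hi, max(lo, xi + rng.gauss(0.0, sigma)))
            for xi, (lo, hi) in zip(x, bounds)]

rng = random.Random(1)
parent = [0.5, 9.9]
bounds = [(0.0, 1.0), (0.0, 10.0)]
child = gaussian_mutation(parent, 0.3, bounds, rng)
print(child)
```

The standard deviation sigma plays the role of the step size: a small value keeps offspring close to the parent, while a large value approaches random sampling within the bounds.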

53.2.3 Recombination Operators


To be able to use recombination operators, a set of so- 
lutions (population) must exist as the goal of recombi- 
nation is to recombine meaningful properties of parent 
solutions. Thus, for the application of recombination 
operators at least two parent solutions are necessary; 
otherwise local search operators are the only option. 
Recombination operators should be designed according 
to Radcliffe’s recommendations (Sect. 53.2.1). 

Analogously to local search operators,
recombination operators should be designed based on the metric
used [53.6, 30]. Given two parent solutions $x^{p1}$ and $x^{p2}$
and one offspring solution $x^o$, recombination operators
should be designed such that


$$\max\bigl(d(x^{p1}, x^o),\, d(x^{p2}, x^o)\bigr) \le d(x^{p1}, x^{p2}) . \tag{53.3}$$


Therefore, the application of recombination operators
should result in offspring where the distances between
offspring and its parents are equal to or smaller than
the distance between the parents. When viewing the
distance between two solutions as a measurement of
dissimilarity, this design principle ensures that offspring
solutions are similar to parents. Consequently, applying
a recombination operator to two identical parent solutions
$x^{p1} = x^{p2}$ should also result in the same offspring
($x^o = x^{p1} = x^{p2}$).

In the last few years, this basic concept of the de- 
sign of recombination operators has been interpreted as 
geometric crossover [53.42-44]. This work builds upon
previous work [53.6, 30, 45] and defines crossover and
mutation representation-independently using the notion 
of distance associated with the search space. 

Why should we use recombination operators in 
EAs? The motivation is that we assume that many real- 
world problems are decomposable. Therefore, prob- 
lems can be solved by decomposing them into smaller 
subproblems, solving these smaller subproblems, and 
combining the optimal solutions of the subproblems to 
obtain overall problem solutions. The purpose of re- 
combination operators is to form new overall solutions 
by recombining solutions of smaller subproblems that 
exist in different parent solutions. If this juxtaposition 
of smaller, highly fit, partial solutions (often denoted 
as building blocks) does not result in good solutions, 
search strategies that are based on recombination op- 
erators will show low performance. However, as many 
problems of practical relevance can be decomposed into 
smaller problems (they are decomposable), the use of 
recombination operators often results in good perfor- 
mance of EAs. 


Common recombination operators for standard
genotypes are one-point crossover [53.22] and uniform
crossover [53.46-48]. We assume a vector or string $x$
of decision variables of length $l$. When using one-point
crossover, a crossover point $c \in \{1,\ldots,l-1\}$ is
initially chosen randomly. Usually, two offspring
solutions are created from two parent solutions by swapping
the partial strings. As a result, we obtain for the parents
$x^{p1} = [x_1^{p1}, x_2^{p1}, \ldots, x_l^{p1}]$ and
$x^{p2} = [x_1^{p2}, x_2^{p2}, \ldots, x_l^{p2}]$ the offspring
$x^{o1} = [x_1^{p1}, \ldots, x_c^{p1}, x_{c+1}^{p2}, \ldots, x_l^{p2}]$ and
$x^{o2} = [x_1^{p2}, \ldots, x_c^{p2}, x_{c+1}^{p1}, \ldots, x_l^{p1}]$. A generalized
version of one-point crossover is n-point crossover. For
this type of crossover operator, we choose n different 
crossover points and create an offspring by alternately 
selecting alleles from parent solutions. For uniform 
crossover, we decide independently for every single 
allele of the offspring from which parent solution it 
inherits the value of the allele. In most implementa- 
tions, no parent is preferred and the probability of an 
offspring inheriting the value of an allele from a
specific parent is $p = 1/m$, where $m$ denotes the number
of parents that are considered for recombination. For
example, when two parents are considered
with the same probability ($p = 1/2$), we could obtain
as offspring $x^{o1} = [x_1^{p1}, x_2^{p1}, x_3^{p2}, \ldots, x_{l-1}^{p1}, x_l^{p2}]$.

Figure 53.2 presents examples for the three
crossover variants. All three recombination operators
are based on the binary Hamming distance and
follow (53.3) as $d(x^{p1}, x^{p2}) \ge \max(d(x^{p1}, x^o), d(x^{p2}, x^o))$.
Therefore, the similarity between offspring and parent
is not lower than between the parents.
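
The three crossover variants, together with a check of condition (53.3) under the Hamming distance, can be sketched as follows. Genotypes are assumed to be Python lists, and all function names are choices made for this illustration.

```python
import random

def hamming(x, y):
    """Number of positions in which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(x, y))

def one_point(p1, p2, rng):
    """Swap the partial strings behind a crossover point c in {1, ..., l-1}."""
    c = rng.randint(1, len(p1) - 1)
    return p1[:c] + p2[c:], p2[:c] + p1[c:]

def n_point(p1, p2, n, rng):
    """Choose n distinct crossover points and alternate between the parents."""
    cuts = sorted(rng.sample(range(1, len(p1)), n))
    child, src, prev = [], 0, 0
    for c in cuts + [len(p1)]:
        child += (p1, p2)[src][prev:c]
        src, prev = 1 - src, c
    return child

def uniform(p1, p2, rng):
    """Decide independently for every allele which parent it is taken from."""
    return [rng.choice(pair) for pair in zip(p1, p2)]

rng = random.Random(42)
p1 = [rng.randint(0, 1) for _ in range(10)]
p2 = [rng.randint(0, 1) for _ in range(10)]
o1, o2 = one_point(p1, p2, rng)
children = [o1, o2, n_point(p1, p2, 3, rng), uniform(p1, p2, rng)]
ok = all(max(hamming(p1, c), hamming(p2, c)) <= hamming(p1, p2) for c in children)
print(ok)
```

Since every allele of an offspring is copied from one of the parents, an offspring can differ from a parent only at positions where the parents differ from each other, which is why (53.3) holds for all three operators.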

Fig. 53.2a-c Different crossover

Uniform and n-point crossover can be used
independently of the type of decision variables (binary,
integer, continuous, etc.), since these operators only
exchange alleles between parents. In contrast,
intermediate recombination operators attempt to average
or blend components across multiple parents and are
designed for continuous and integer problems. Given
two parents $x^{p1}$ and $x^{p2}$, a crossover operator known
as arithmetic crossover [53.49] creates an offspring
$x^o$ as $x^o = \alpha x^{p1} + (1-\alpha) x^{p2}$, where $\alpha \in [0,1]$. If
$\alpha = 0.5$, the crossover just takes the average of both
parent solutions. In general, for $m$ parents, this operator
becomes $x^o = \sum_{i=1}^{m} \alpha_i x^{pi}$, where $\sum_{i=1}^{m} \alpha_i = 1$.
Arithmetic crossover is based on the city-block metric. With
respect to this metric, the distance between offspring
and parent is smaller than the distance between the
parents. Another type of crossover operator that is based on
the Euclidean metric is geometrical crossover [53.49].
Given two parents, an offspring is created as
$x^o = \sqrt{x^{p1} x^{p2}}$. For further information on crossover
operators we refer to [53.50] (binary crossover) and [53.41]
(continuous crossover).
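
A small sketch of the two intermediate recombination operators for two parents follows. Applying the geometric mean component-wise and restricting it to non-negative variables are assumptions of this example, as are the function names.

```python
import math

def arithmetic_crossover(p1, p2, alpha=0.5):
    """Weighted average of the parents; alpha = 0.5 gives the plain mean."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(p1, p2)]

def geometrical_crossover(p1, p2):
    """Component-wise geometric mean; assumes non-negative variables."""
    return [math.sqrt(a * b) for a, b in zip(p1, p2)]

def city_block(x, y):
    """City-block (Manhattan) distance between two vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

p1, p2 = [0.0, 4.0], [2.0, 0.0]
child = arithmetic_crossover(p1, p2)
print(child, city_block(p1, child), city_block(p1, p2))
print(geometrical_crossover([1.0, 4.0], [4.0, 9.0]))
```

For the averaged offspring, the city-block distance to either parent is half the distance between the parents, so condition (53.3) is met.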


53.2.4 Direct Representations 


If we apply search operators directly to phenotypes, it is
not necessary to specify a representation and a genotype
space. Then, phenotypes are the same as genotypes:


$$f(x): \Phi_g \to \mathbb{R} . \tag{53.4}$$


The mapping $f_g$ does not exist and we have a direct
representation. Because there is no longer an additional mapping
between $\Phi_g$ and $\Phi_p$, a direct representation does not
change any aspect of the phenotype problem such as
difficulty or metric. However, when using direct rep- 
resentations, we often cannot use standard search op- 
erators, but have to define problem-specific operators. 
Therefore, important for the success of EAs using a di- 
rect representation is not finding a good representation 
for a specific problem, but developing proper search op- 
erators defined on phenotypes. 

The definition of the variation operators is relevant
for different implementations of direct representations.
Since we assume that local search operators always 
generate neighboring solutions, the definition of a lo- 
cal search operator induces a metric on the genotypes. 
Therefore, the metric that we use on the genotype 
space should be chosen in such a way that new solu- 
tions that are generated by local search operators have 
a small distance to the old solutions and the solutions 
are neighbors with respect to the metric used.
Furthermore, the distance between two solutions $x \in \Phi_g$ and
$y \in \Phi_g$ should be proportional to the minimal number
of local search steps that are necessary to move from $x$
to $y$. Analogously, the definition of a recombination
operator also induces a metric on the search space. The
metric used should guarantee that the application of
a recombination operator to two solutions $x^p \in \Phi_g$ and
$y^p \in \Phi_g$ creates a new solution $x^o \in \Phi_g$ whose distances
to the parents are not larger than the distance between
the parents (53.3).
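
The metric induced by a local search operator can be made concrete as the length of the shortest sequence of local search steps between two solutions. The breadth-first search below is only a sketch for small search spaces; the bit-flipping operator is used as an example neighborhood, and the function names are illustrative.

```python
from collections import deque

def induced_distance(x, y, neighbors):
    """Minimal number of local search steps from x to y (breadth-first search)."""
    seen, queue = {x}, deque([(x, 0)])
    while queue:
        cur, d = queue.popleft()
        if cur == y:
            return d
        for nxt in neighbors(cur):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None  # y cannot be reached from x

def bit_flip(x):
    """Example local search operator: flip exactly one bit."""
    return [x[:i] + (1 - x[i],) + x[i + 1:] for i in range(len(x))]

print(induced_distance((0, 0, 0, 0), (1, 0, 1, 1), bit_flip))
```

For the bit-flipping operator, the induced distance coincides with the Hamming distance; for a problem-specific operator on phenotypes, the same procedure yields the metric that the operator implicitly defines.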

For the definition of variation operators, we should 
also consider that for many problems we have a natural 
notion of similarity between phenotypes. When we cre- 
ate a problem model, we often know which solutions are 
similar to each other. Such a notion of similarity should 
be considered for the definition of variation operators. 
We should design local search operators in such a way 
that their application creates solutions which we view 
as similar. Such a definition of local search operators 
ensures that neighboring phenotypes are also neighbors 
with respect to the metric that is induced by the search 
operators. 

At first glance, it seems that the use of direct
representations makes life easier as direct representations
release us from the challenge of designing efficient
representations. However, we are confronted with some
problems:


• There are many phenotypes to which no standard
variation operators can be applied.

• The design of high-quality problem-specific search
operators is difficult.

• We cannot use EAs that only work on standard
genotypes.


For indirect representations with standard geno- 
types, the definition of search operators is straightfor- 
ward as these are usually based on the metric of the 
genotype space (Sects. 53.2.2 and 53.2.3). The
behavior of EAs using standard search operators is usually
well studied and well understood. However, when us- 
ing direct representations, standard operators can often 
no longer be used. Instead, problem-specific operators 
must be developed for each phenotype. This is difficult, 
as we cannot use most of our knowledge about the be- 
havior of EAs using standard genotypes and standard 
operators. 

The design of proper search operators is often de- 
manding as phenotypes are usually not string-like but 
are more complicated structures like trees, schedules, 
time tables, or other structures (Sect. 53.2.5). In this 
case, phenotypes cannot be depicted as a string or in an- 
other way that is accessible to variation operators. Other 
representative examples are the form or shape of an ob- 
ject. Search operators that can be directly applied to the 
shape of an object are often difficult to design. 

Finally, using specific variants of EAs like estima- 
tion of distribution algorithms (EDAs) becomes very 
difficult. These types of EAs do not use standard search 
operators that are applied to genotypes but build new so- 
lutions according to a probabilistic model of previously 
generated solutions [53.51-56]. These search methods
were developed for a few standard genotypes (usually 
binary and continuous) and result in better performance 
than, for example, traditional simple genetic algorithms 
for decomposable problems [53.57, 58]. However, be- 
cause direct representations with non-standard pheno- 
types and problem-specific search operators can hardly 
be implemented in EDAs, direct representations cannot 
benefit from these optimization methods. 


53.2.5 Standard Search Operators 


We provide an overview of standard search spaces and 
the corresponding search operators. The search spaces 
can either represent genotypes (indirect representation) 
or phenotypes (direct representation). We order the 
search spaces by increasing complexity. With increas- 
ing complexity, the design of search operators becomes 
more demanding. An alternative to designing complex 
search operators for complex search spaces is to in- 
troduce additional mappings that map complex search 
spaces to simpler ones. Then, the design of the
corresponding search operators becomes easier; however,
a proper design of the additional mapping
(representation) becomes more important.


Strings and Vectors 
Strings and vectors of either fixed or variable length are 
the most elementary search spaces. They are the most
frequently used genotype structures. Vectors allow us
to represent an ordered list of decision variables and are 
the standard genotypes for the majority of optimization 
problems. Strings are appropriate for sequences of char- 
acters or patterns. Consequently, strings are suited for 
problems where the objects modeled are text, charac- 
ters, or patterns. 

For strings and vectors with fixed length, we can 
use standard local search and recombination operators 
(Sects. 53.2.2 and 53.2.3) that are based on the Ham- 
ming metric or the binary Hamming metric. Search 
operators for strings and vectors with variable length 
are often based on the Levenshtein distance. 
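
For variable-length strings, the Levenshtein distance counts the minimal number of insertions, deletions, and substitutions needed to turn one string into the other. A compact dynamic-programming sketch (the function name is illustrative):

```python
def levenshtein(a, b):
    """Edit distance with unit costs for insertion, deletion, substitution."""
    prev = list(range(len(b) + 1))       # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]                        # deleting the first i symbols of a
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))
```

A local search operator that is consistent with this metric would insert, delete, or replace a single symbol per step.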


Coordinates/Points 
To represent locations in a geometric space, coordinates 
can be used. Coordinates can be either integer or con- 
tinuous. Common examples are locations of cities or 
other spots on two-dimensional grids. Coordinates are 
appropriate for problems that work on sites, positions, 
or locations. 

We can use standard local and recombination opera- 
tors for continuous variables and integers, respectively. 
For coordinates, the Euclidean metric is often used to 
measure the similarity of solutions. 


Graphs 
Graphs allow us to represent relationships between ar- 
bitrary objects. Usually, the structure of a graph is 
described by listing its edges. An edge represents a
relationship between a pair of objects. Given a graph
with $n$ nodes (objects), there are $n(n-1)/2$ possible
edges. Using graphs is appropriate for problems that 
seek a network, circuit, or relationship. 

Common genotypes for graphs are lists of edges in- 
dicating which edges are used. For example, the char- 
acteristic vector representation encodes graphs of fixed 
size using a binary vector of length $n(n-1)/2$ [53.3,
Sect. 6.3]. Standard search operators for the characteris- 
tic vector representation are based on the Hamming met- 
ric as the distance between two graphs can be calculated 
as the number of different edges. Standard search oper- 
ators can be used if there are no additional constraints. 
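
As a sketch of the characteristic vector representation, the following code encodes an undirected graph on n nodes as a binary vector with one position per possible edge. The node numbering 0, ..., n-1 and the lexicographic edge order are assumptions of this example.

```python
from itertools import combinations

def characteristic_vector(n, edges):
    """Binary vector of length n(n-1)/2, one bit per possible undirected edge."""
    present = {frozenset(e) for e in edges}
    return [int(frozenset(pair) in present) for pair in combinations(range(n), 2)]

def hamming(x, y):
    """Number of differing positions, here the number of differing edges."""
    return sum(a != b for a, b in zip(x, y))

g1 = characteristic_vector(4, [(0, 1), (1, 2), (2, 3)])
g2 = characteristic_vector(4, [(0, 1), (1, 2), (1, 3)])
print(g1, g2, hamming(g1, g2))
```

The two example graphs differ in two edges, (1,3) and (2,3), so their Hamming distance is 2.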


Subsets 
Subsets represent selections from a set of objects. Given
$n$ different objects, the number of subsets having
exactly $k$ elements is equal to $\binom{n}{k}$. Thus, the number of
possible subsets is $\sum_{k=0}^{n} \binom{n}{k} = 2^n$. For subsets, the
order of the objects does not matter. The two example
subsets {1,3,5} and {3,5,1} represent the same
phenotype solution. Local search operators that can be applied
directly to subsets often either modify the objects in 
the subset, or increase/reduce the number of objects in 
one subset. Recombination operators that are directly 
applied to subsets are more sophisticated as no stan- 
dard operators can be used. We refer to [53.59] for 
detailed information on their design. Subsets are often 
used for problems that seek a cluster, collection, parti- 
tion, group, packaging, or selection. 

Given $n$ different objects, a subset of fixed size $k$
can be represented using an integer vector $x$ of length
$k$, where the $x_i$ indicate the selected objects and $x_i \ne x_j$
for $i \ne j$ and $i,j \in [1,k]$. Then, standard local search
operators can be applied if we assume that each of the $k$
selected objects is unique. The application of recombi- 
nation operators is more demanding as each subset is 
represented by k! different genotypes (integer vectors) 
and the distances between the k! different genotypes 
that represent the same subset are large [53.60]. Re- 
combination operators must be designed such that the 
distances between offspring and parents are smaller 
than the distances between parents (53.3) and the re- 
combination of two genotypes that represent the same 
subset always results in the same subset. For guidelines 
on the design of appropriate recombination operators 
and examples, we refer to [53.61] and [53.60]. 
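
The redundancy of the integer-vector encoding can be illustrated by enumerating all genotypes that decode to the same subset. The decoding function and the exchange move below are illustrative assumptions, not operators taken from the cited literature.

```python
from itertools import permutations
from math import factorial

def decode(genotype):
    """Map an integer vector of distinct objects to the subset it represents."""
    return frozenset(genotype)

def exchange_move(genotype, position, new_object):
    """Local search step: replace one selected object by an unselected one."""
    if new_object in genotype:
        raise ValueError("object already selected")
    return genotype[:position] + (new_object,) + genotype[position + 1:]

genotypes = set(permutations((1, 3, 5)))
print(len(genotypes), factorial(3), {decode(g) for g in genotypes})
print(exchange_move((1, 3, 5), 0, 2))
```

All k! = 6 genotypes decode to the subset {1, 3, 5}, which is why a position-wise crossover of two such genotypes may produce an offspring that represents a different subset than its (identical) parents.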


Permutations 
A large variety of EAs have been developed for per- 
mutation problems as many such problems are of 
practical relevance but NP-hard (NP: non-deterministic 
polynomial-time). Permutations are orderings of items. 
The order of the objects is relevant for permutations. 
The number of permutations on a set of n elements is 
n!. For example, 1-2-3 and 1-3-2 are two permutations
of the three integer numbers $x \in \{1,2,3\}$. The traveling
salesperson problem (TSP) is a prominent example of 
a permutation problem. Permutations are commonly 
used for problems that seek an arrangement, tour, or- 
dering, or sequence. 

The design of appropriate search operators for
permutations is demanding. In many approaches,
permutations are encoded using an integer genotype vector of
length $n$, where each decision variable $x_i$ indicates an
object and has a unique value ($x_i \ne x_j$ for $i \ne j$ and
$i,j \in \{1,\ldots,n\}$). Standard recombination and mutation
operators applied to such genotypes fail since the resulting
solutions usually do not represent permutations. Therefore, in
the literature a variety of different permutation-specific 
variation operators have been developed. They are either 
based on the absolute or relative ordering of the objects 
in a permutation. When using the absolute ordering of
objects in a permutation as the distance metric, two
solutions are similar to each other if the objects have the same
position in the two solutions ($x_i^1 = x_i^2$). For example,
1-2-3-4 and 2-3-4-1 have a maximum absolute distance of
$d = 4$, as the two solutions have no common absolute
positions. In contrast, when using relative ordering, two
solutions are similar if the sequence of objects is similar for
the two solutions. For example, 1-2-3-4 and 2-3-4-1 have 
distance d = 1 as the two permutations are shifted by one 
position. Based on the metric used (relative versus abso- 
lute ordering), a large variety of different recombination 
and local search operators have been developed. Exam- 
ples are the order crossover [53.62], partially mapped 
crossover [53.63], cycle crossover [53.64], generalized 
order crossover [53.65], or precedence preservative 
crossover [53.66]. For more information on the design 
of such permutation-specific variation operators, we re- 
fer to [53.67, 68], and [53.60]. 
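
The two notions of distance can be sketched as follows. The text does not fix a formula for the relative-ordering distance; counting the adjacent pairs of one permutation that are not adjacent pairs of the other is an assumption of this example that reproduces the value d = 1 from the text. The last lines show why a standard one-point crossover fails on permutation genotypes.

```python
def absolute_distance(p, q):
    """Number of positions at which the two permutations differ."""
    return sum(a != b for a, b in zip(p, q))

def relative_distance(p, q):
    """Number of adjacent pairs of p that are not adjacent pairs of q."""
    pairs_q = set(zip(q, q[1:]))
    return sum(pair not in pairs_q for pair in zip(p, p[1:]))

def is_permutation(x, n):
    """True if x contains every object 1..n exactly once."""
    return sorted(x) == list(range(1, n + 1))

a, b = (1, 2, 3, 4), (2, 3, 4, 1)
print(absolute_distance(a, b), relative_distance(a, b))

p, q = (1, 2, 3, 4), (4, 3, 2, 1)
child = p[:2] + q[2:]  # standard one-point crossover at c = 2
print(child, is_permutation(child, 4))
```

The offspring (1, 2, 2, 1) contains duplicates and misses objects 3 and 4, which is the failure that permutation-specific operators such as order crossover or partially mapped crossover are designed to avoid.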


Trees 
Trees are used to describe hierarchical relationships be- 
tween objects. Trees are a specialized variant of graphs 
where only one path exists between each pair of nodes. 
As standard search operators cannot be applied to tree 
structures, we either need to define problem-specific 
search operators that are directly applied to trees or
additional genotype-phenotype mappings that map each
tree to simpler genotypes where standard variation op- 
erators can be applied. 

We can distinguish between trees of fixed and vari- 
able size. For trees of fixed size, we refer to [53.2]. 
Search operators for tree structures of variable size are 
at the core of genetic programming. They are often 
based on the Levenshtein distance. Further information 
about appropriate search operators for trees of variable 
size can be found in [53.69] and [53.70].


Jones and Forrest [53.71] assumed that the difficulty of
an optimization problem is determined by how the
objective values are assigned to the solutions $x \in X$ and
what metric is defined on $X$. They classified fitness
landscapes into three classes: straightforward, difficult,
and misleading.


1. Straightforward. For such problems, the fitness of 
a solution is correlated with the distance to the 
optimal solution. With lower distance, the fitness 
difference to the optimal solution decreases. As the 
structure of the search space guides search methods 
towards the optimal solution such problems are usu- 
ally easy for guided search methods. 

2. Difficult. There is no correlation between the fitness 
difference and the distance to the optimal solu- 
tion. The fitness values of neighboring solutions 
are uncorrelated and the structure of the search 
space provides no information about which solu- 
tions should be sampled next by the search method. 

3. Misleading. The fitness difference is negatively cor- 
related to the distance to the optimal solution. 
Therefore, the structure of the search space misleads 
a guided search method to sub-optimal solutions. 


The general idea is to measure how well the metric 
defined on the search space fits the structure of the ob- 
jective function. A high fit between metric and structure 
of the fitness function makes a problem easy for guided 
search methods. 
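
One way to quantify this fit is to correlate, over a sample of solutions, the fitness difference to the optimum with the distance to the optimum. The sketch below enumerates all binary strings of length 4 and uses two textbook functions, onemax and a fully deceptive trap; both test functions and all names are assumptions made for this illustration, and maximization is assumed.

```python
from itertools import product
from statistics import mean

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def fitness_distance_correlation(solutions, fitness, optimum):
    """Pearson correlation between fitness difference and distance to optimum."""
    f_opt = fitness(optimum)
    diffs = [f_opt - fitness(x) for x in solutions]
    dists = [hamming(x, optimum) for x in solutions]
    md, mf = mean(dists), mean(diffs)
    cov = sum((d - md) * (f - mf) for d, f in zip(dists, diffs))
    var_d = sum((d - md) ** 2 for d in dists)
    var_f = sum((f - mf) ** 2 for f in diffs)
    return cov / (var_d * var_f) ** 0.5

def onemax(x):
    """Straightforward: fitness grows with the number of ones."""
    return sum(x)

def trap(x):
    """Misleading: all ones is optimal, otherwise fewer ones is better."""
    u = sum(x)
    return len(x) if u == len(x) else len(x) - 1 - u

space = list(product((0, 1), repeat=4))
opt = (1, 1, 1, 1)
r_onemax = fitness_distance_correlation(space, onemax, opt)
r_trap = fitness_distance_correlation(space, trap, opt)
print(r_onemax, r_trap)
```

A correlation close to one indicates a straightforward problem, a value near zero a difficult one, and a negative value a misleading one; here onemax gives a perfect positive correlation and the trap function a negative one.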

A fundamental assumption about the application of
EAs is that the vast majority of optimization problems
that we can observe in the real world:


• are neither misleading nor difficult, and
• have high locality (distance between solutions is
correlated with their fitness difference).


We assume that misleading problems have no im- 
portance in the real world as usually optimal solutions 
are not isolated in the search space surrounded by only 
low-quality solutions. Furthermore, we assume that the 
metric of a search space is meaningful and, on average, 
the fitness differences between neighboring solutions 
are smaller than between randomly chosen solutions. 
It is only because most real-world problems are nei- 
ther difficult nor misleading that guided search methods 
which use information about previously sampled so- 
lutions can outperform random search for real-world 
problems [53.2, 34]. 

Since we assume that high locality is a general prop- 
erty of real-world problems, EAs must ensure that their 
design does not destroy the high locality of a problem. 
If the high locality of a problem is destroyed, straight- 
forward problems turn into difficult problems and can- 
not be solved better than by random search [53.2]. 
Therefore, EAs must ensure that the search operators
used fit the metric on the search space and that
representations have high locality; this means phenotype
distances must correspond to genotype distances.

The second aspect of this section is how we can con- 
sider knowledge about problem-specific properties for 
the design of EAs. For example, we have knowledge 
about the character and properties of high-quality (or 
low-quality) solutions. Such problem-specific knowl- 
edge can be exploited by introducing a bias into EAs. 
The bias should consider this knowledge and, for exam- 
ple, concentrate search on solutions that are expected 
to be of high quality or avoid solutions expected to be 
of low quality. A bias can be considered in the repre- 
sentation as well as the search operator. However, EAs 
should only be biased if we have obtained some par- 
ticular knowledge about an optimization problem or 
problem instance. If we have no knowledge about prop- 
erties of a problem, we should not bias EAs as this will 
mislead the search heuristics. 

Section 53.3.1 discusses how the design of EAs can 
modify the locality of a problem. To ensure guided 
search, the search operators must fit the problem met- 
ric. Local search operators must generate neighboring 
solutions and recombination operators must generate 
offspring where the distances between offspring and 
parents do not exceed the distances between parents. 
Section 53.3.2 focuses on the possibility of biasing 
EAs. We discuss how problem-specific construction 
heuristics can be used as genotype—phenotype map- 
pings (Sect. 53.3.2) and how redundant representations 
affect heuristic search (Sect. 53.3.2). 


53.3.1 High Locality 


The locality of a problem measures how well the
distances d(x, y) between any two solutions x, y ∈ X
correspond to the difference in their fitness values
|f(x) − f(y)|. Locality is high if neighboring solutions have
similar fitness values and fitness differences correlate 
positively with distances. In contrast, the locality of 
a problem is low if low distances do not correspond to 
low differences in the fitness values. Important for the 
locality of a problem is the metric defined on the search 
space. 
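
As a rough illustration, the locality of a small problem can be estimated by correlating pairwise distances with pairwise fitness differences. The sketch below uses bitstrings with the Hamming metric and a OneMax fitness function; both the helper names and the choice of a Pearson correlation as the locality measure are assumptions for this example.

```python
import math
from itertools import combinations, product

def pearson(xs, ys):
    """Plain Pearson correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def problem_locality(solutions, fitness, distance):
    """Correlation between d(x, y) and |f(x) - f(y)| over all pairs."""
    pairs = list(combinations(solutions, 2))
    dists = [distance(x, y) for x, y in pairs]
    fdiffs = [abs(fitness(x) - fitness(y)) for x, y in pairs]
    return pearson(dists, fdiffs)

def mean_fitness_difference(solutions, fitness, distance, only_neighbors):
    """Mean |f(x) - f(y)| over neighboring pairs (distance 1) or over all pairs."""
    diffs = [abs(fitness(x) - fitness(y))
             for x, y in combinations(solutions, 2)
             if not only_neighbors or distance(x, y) == 1]
    return sum(diffs) / len(diffs)

space = list(product((0, 1), repeat=4))
locality = problem_locality(space, sum, hamming)
neighbor_diff = mean_fitness_difference(space, sum, hamming, only_neighbors=True)
overall_diff = mean_fitness_difference(space, sum, hamming, only_neighbors=False)
print(locality > 0, neighbor_diff, overall_diff)
```

For this toy problem, neighbors differ in fitness by exactly 1, which is less than the average over all pairs, in line with the assumption that neighboring solutions are, on average, more similar in fitness than randomly chosen ones.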

The performance of guided search methods is high 
if the locality of a problem is relatively high; this means 
that the structure of the fitness landscape leads search 
algorithms to high quality solutions [53.2]. Local search 
methods show especially good performance if either 
high-quality or low-quality solutions are grouped
together in the solution space. Optimization problems
with high locality can usually be solved well using EAs,
as all EAs have some kind of local search elements.

The following paragraphs provide design guidelines 
for search operators and representations. Search oper- 
ators must fit the metric of a search space, because 
otherwise EAs show low performance as they behave 
like random search. A representation introduces an ad- 
ditional genotype—phenotype mapping. The locality of 
a representation describes how well the metric on the 
phenotype space fits to the metric on the genotype 
space. Low locality, which means there is a poor fit, ran- 
domizes the search and also leads to low performance 
of EAs. 


Search Operator 
EAs rely on the concept of local search. Local search 
iteratively generates new solutions similar to existing 
ones. Local search is a reasonable and successful search 
approach for real-world problems, as most real-world 
problems have high locality and are neither mislead- 
ing nor difficult. In addition, to avoid being trapped in 
local optima, EAs use diversification steps. Diversifi- 
cation steps randomize search and allow EAs to jump 
through the search space. 

Different types of EAs use different concepts for 
controlling intensification and diversification [53.2, 
Chap. 5]. Local search intensifies the search as it allows 
incremental improvements of already found solutions. 
Diversification steps must be relatively rare as they usu- 
ally lead to inferior solutions. When designing search 
operators, we must have in mind that EAs use local 
search operators as well as recombination operators 
for intensifying the search. Solutions that are gener- 
ated should be similar to the existing ones. Therefore, 
we must ensure that search operators (local search op- 
erators as well as recombination operators) generate 
similar solutions and do not jump around in the search 
space. This can be done by ensuring that local search
operators generate neighboring solutions and recombination
operators generate solutions where the distances
between parent and offspring are smaller than or equal to
the distances between parents (Sects. 53.2.3 and 53.3).
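
The two requirements can be checked directly for a given operator. The following sketch does so for bitstrings with the Hamming metric, using a single bit flip as local search step and uniform crossover as recombination; these operator choices and all function names are assumptions for illustration.

```python
import random

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def bit_flip(x, rng):
    """Local search step: flip exactly one randomly chosen bit."""
    i = rng.randrange(len(x))
    return x[:i] + (1 - x[i],) + x[i + 1:]

def uniform_crossover(p1, p2, rng):
    """Each position is inherited from one of the two parents."""
    return tuple(rng.choice((a, b)) for a, b in zip(p1, p2))

def is_neighbor(x, y, distance=hamming, d_min=1):
    return distance(x, y) == d_min

def respects_parent_distance(p1, p2, child, distance=hamming):
    """Offspring must not be farther from either parent than the parents
    are from each other."""
    d_parents = distance(p1, p2)
    return max(distance(child, p1), distance(child, p2)) <= d_parents

rng = random.Random(1)
p1 = tuple(rng.randint(0, 1) for _ in range(12))
p2 = tuple(rng.randint(0, 1) for _ in range(12))

mutants_ok = all(is_neighbor(p1, bit_flip(p1, rng)) for _ in range(100))
children_ok = all(respects_parent_distance(p1, p2, uniform_crossover(p1, p2, rng))
                  for _ in range(100))
print(mutants_ok, children_ok)
```

Uniform crossover satisfies the condition by construction, because an offspring can differ from a parent only at positions where the two parents differ.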

Applying search operators to solutions defines 
a metric on the corresponding search space. With re- 
spect to the search operators, solutions are similar to 
each other if only a few local search steps suffice to 
transform one solution into another. Therefore, when 
designing search operators, it is important that the met- 
ric induced by the search operators fits the problem 
metric. If both metrics are similar (this means a local
search operator creates neighboring solutions with
respect to the problem metric), guided search will
perform well as it can systematically explore promising
areas of the search space.

Therefore, we should make sure that local search 
operators generate neighboring solutions. The fit be- 
tween the problem metric and the metric induced by the 
search operators should be high. Then, most real-world 
problems, where neighboring solutions have, on aver- 
age, similar fitness values, are easy to solve for EAs. 
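
The metric induced by a search operator can be made explicit as the minimum number of operator applications needed to transform one solution into another. The breadth-first search below is a minimal sketch of this idea; the two neighborhood functions (bit flip and adjacent swap) are assumed examples.

```python
from collections import deque

def operator_distance(start, goal, neighbors):
    """Minimum number of operator applications from start to goal (BFS).
    Returns None if goal cannot be reached."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        x, d = queue.popleft()
        for y in neighbors(x):
            if y == goal:
                return d + 1
            if y not in seen:
                seen.add(y)
                queue.append((y, d + 1))
    return None

def bit_flip_neighbors(x):
    """All bitstrings that differ from x in exactly one position."""
    return [x[:i] + (1 - x[i],) + x[i + 1:] for i in range(len(x))]

def adjacent_swap_neighbors(perm):
    """All permutations obtained by swapping two adjacent elements."""
    result = []
    for i in range(len(perm) - 1):
        p = list(perm)
        p[i], p[i + 1] = p[i + 1], p[i]
        result.append(tuple(p))
    return result

d_bits = operator_distance((0, 0, 0), (0, 1, 1), bit_flip_neighbors)
d_perm = operator_distance((1, 2, 3), (3, 2, 1), adjacent_swap_neighbors)
print(d_bits, d_perm)
```

The bit-flip operator induces the Hamming metric, while adjacent swaps induce a metric that counts pairwise order inversions; whether either one fits depends on the problem metric.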

For real-world problems, the design or choice of 
proper search operators can be difficult if it is unclear 
what a natural problem metric is. We want to illustrate 
this issue for a scheduling problem. Given a number of 
tasks, we want to find an optimal schedule. There are 
different metrics that can be relevant for such a permu- 
tation problem. We have the choice between metrics 
based either on the relative or absolute ordering of 
the tasks (Sect. 53.2.5). The choice of the right prob- 
lem metric depends on the properties of the scheduling 
problem. For example, if we want to find an optimal 
class schedule, usually it is more natural to use a met- 
ric based on the absolute ordering of the tasks (classes). 
The relative ordering of the tasks is much less important 
as we have fixed time slots. The situation is reversed 
if we want to find an optimal schedule for a paint 
shop. For example, when painting different cars con- 
secutively, color changes are time-consuming as paint 
tools have to be cleaned before a new color can be used. 
Therefore, the relative ordering of the tasks (paint jobs) 
is important, as the tasks should be grouped together 
such that tasks that require the same color are painted 
consecutively and ordered such that the paint shop starts 
with the brightest colors and ends with the darkest ones. 

This example makes it clear that the most natural 
problem metric does not depend on the set of possible 
solutions but on the character of the underlying opti- 
mization problem and fitness function. The goal is to 
choose a metric such that the locality of the problem 
is high. The same is true for the design of operators. 
A high-quality local search operator should generate 
solutions with similar fitness. Then, problems become 
easy to solve for EAs. 
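
The difference between the two kinds of metric can be seen on small permutations. In the sketch below, the absolute-ordering distance counts positions holding different tasks, and the relative-ordering distance counts direct successor pairs that are lost; both definitions are one possible choice made for this illustration, not the definitions of Sect. 53.2.5.

```python
def absolute_position_distance(a, b):
    """Number of positions at which the two schedules hold different tasks."""
    return sum(x != y for x, y in zip(a, b))

def adjacency_distance(a, b):
    """Number of direct successor pairs of schedule a that do not occur in b."""
    successors_b = set(zip(b, b[1:]))
    return sum(pair not in successors_b for pair in zip(a, a[1:]))

base = ("A", "B", "C", "D", "E")
shifted = ("E", "A", "B", "C", "D")   # every task moves by one slot
swapped = ("A", "D", "C", "B", "E")   # tasks B and D exchange their slots

print(absolute_position_distance(base, shifted), adjacency_distance(base, shifted))
print(absolute_position_distance(base, swapped), adjacency_distance(base, swapped))
```

The shifted schedule is far from the base schedule under the absolute metric but close under the relative one, and the swapped schedule behaves the other way round. A class schedule with fixed time slots would favor the first metric, a paint shop the second.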


Representation 
Representations introduce an additional genotype— 
phenotype mapping and thus modify the fit between 
the metric on the genotype space (which is induced 
by the search operators used), and the original problem 
metric on the phenotype space. High-quality represen- 
tations ensure that the metric on the genotype space 
fits the original problem metric. The locality of a
representation describes how well neighboring genotypes
correspond to neighboring phenotypes [53.72–76]. In
contrast to the locality of a problem, which measures 
the fit between fitness differences and phenotype dis- 
tances, the locality of a representation measures the fit 
between phenotype distances and genotype distances. 

The use of a representation can change the dif- 
ficulty of problems (Sect. 53.1.2) [53.6]. The ability 
of representations to change the difficulty of a prob- 
lem is closely related to their locality. The locality of 
a representation is high if all neighboring genotypes 
correspond to neighboring phenotypes. In contrast, it 
is low if neighboring genotypes do not correspond to 
neighboring phenotypes. Therefore, the locality $d_m$ of
a representation can be defined as [53.3, Sect. 3.3]

$$d_m = \sum_{d^g_{x,y} = d^g_{\min}} \left| d^p_{x,y} - d^p_{\min} \right| , \qquad (53.5)$$

where $d^p_{x,y}$ is the phenotype distance between the
phenotypes $x^p$ and $y^p$, $d^g_{x,y}$ is the genotypic distance
between the corresponding genotypes, and $d^p_{\min}$ and $d^g_{\min}$
are the minimum distances between two (neighboring)
phenotypes and genotypes, respectively. Without loss of
generality, we assume that $d^g_{\min} = d^p_{\min}$. For $d_m = 0$, all
genotypic neighbors correspond to phenotypic neighbors
and the encoding has perfect (high) locality.
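
The locality measure of (53.5) can be computed by brute force for small search spaces. The sketch below sums over unordered pairs of neighboring genotypes and assumes a one-to-one representation; the function names and the two example mappings (identity on bitstrings, and the standard binary encoding of integers) are illustrative assumptions.

```python
from itertools import combinations, product

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def locality_dm(genotypes, f_g, d_g, d_p):
    """Sum of |d_p - d_p_min| over all unordered genotype pairs whose
    genotypic distance equals the minimum genotypic distance."""
    pairs = list(combinations(genotypes, 2))
    d_g_min = min(d_g(x, y) for x, y in pairs)
    d_p_min = min(d_p(f_g(x), f_g(y)) for x, y in pairs)
    return sum(abs(d_p(f_g(x), f_g(y)) - d_p_min)
               for x, y in pairs if d_g(x, y) == d_g_min)

genotypes = list(product((0, 1), repeat=3))

def identity(g):
    return g

def to_int(g):
    """Standard binary encoding: most significant bit first."""
    return g[0] * 4 + g[1] * 2 + g[2]

def abs_diff(a, b):
    return abs(a - b)

dm_identity = locality_dm(genotypes, identity, hamming, hamming)
dm_binary = locality_dm(genotypes, to_int, hamming, abs_diff)
print(dm_identity, dm_binary)
```

The identity mapping has perfect locality, whereas flipping a high-order bit of a binary-encoded integer moves the phenotype by more than the minimum distance and therefore contributes to the sum.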

We want to emphasize that the locality of a
representation depends on the representation $f_g$ and the
metrics that are defined on $\Phi_g$ and $\Phi_p$. $f_g$ alone only
determines which phenotypes are represented by which
genotypes and cannot be used for measuring similarities
between solutions. To describe or measure the locality
of a representation, a metric must be defined on $\Phi_g$
and $\Phi_p$.

Figure 53.3 illustrates the difference between high- 
locality and low-locality representations. We assume 
12 different phenotypes (a-l) and measure distances 
between solutions using the Euclidean metric. Each 
phenotype (lower case symbol) corresponds to one 
genotype (upper case symbol). The representation $f_g$
has perfect (high) locality if neighboring genotypes cor- 
respond to neighboring phenotypes. Then, local search 
steps have the same effect in the phenotype and geno- 
type search space. 

If we assume that $f_g$ is a one-to-one mapping, every
phenotype is represented by exactly one genotype
and there are $|\Phi_g|! = |\Phi_p|!$ different representations.
Each of these many different representations assigns the
genotypes to the phenotypes in a different way.


We want to ask how the locality of a representa- 
tion influences the performance of EAs. Often, there is 
a natural problem metric describing which phenotypes 
are similar to each other. A representation introduces 
a new genotype metric based on the genotypes and 
search operators used. This metric can be different from 
the problem (phenotype) metric. Therefore, the charac- 
ter of search operators can be different for genotypes 
versus phenotypes. If the locality of a representation 
is high, then a search operator has the same effect 
on the phenotypes as on the genotypes. As a result, 
the original problem difficulty remains unchanged by
a representation $f_g$. Easy (straightforward) problems
remain easy and misleading problems remain mislead- 
ing. Figure 53.4 (left) illustrates the effect of local 
search operators for high-locality representations. A lo- 
cal search step has the same effect on the phenotypes as 
on the genotypes. 

Fig. 53.3 High versus low-locality representations
(panels: high locality, low locality; each panel shows the
phenotype search space and the genotype search space)

Fig. 53.4 The effect of local search operators for high versus
low-locality representations (panels: high locality,
low locality; each panel shows the phenotype search space
and the genotype search space)

For low-locality representations, the situation is different
and the influence of a representation on the
difficulty of a problem depends on its character. If
a problem $f_p$ is straightforward, a low-locality
representation $f_g$ randomizes the problem by destroying the
correlation between distance and fitness and making the
problem $f = f_p(f_g(x^g))$ more difficult. When using
low-locality representations, a small change in a genotype
does not correspond to a small change in the phenotype,
but larger changes in the phenotype are possible
(Fig. 53.4, right). Therefore, when using low-locality
representations, straightforward problems become, on
average, difficult as low-locality representations lead
to a more uncorrelated fitness landscape and heuristics
can no longer extract meaningful information about the
structure of the problem. Guided search becomes more
difficult as many genotypic search steps do not result in
a similar solution but in a random one.

Summarizing the results, low-locality representa- 
tions have the same effect as using random search. 
Therefore, on average, straightforward problems be- 
come more difficult for guided search methods. As 
most real-world problems are straightforward, the use 
of low-locality representations makes these problems 
more difficult. Therefore, we strongly encourage users 
of EAs to use high-locality representations for problems 
of practical relevance. Of course, low-locality repre- 
sentations make misleading problems easier for guided 
search [53.3]; however, these are problems which we do 
not expect to meet in reality and we do not really want 
to solve. 
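
The randomizing effect of a low-locality representation can be observed on a straightforward toy problem. The sketch below compares the average fitness change caused by a single bit flip in the genotype under an identity mapping and under an arbitrary shuffled one-to-one mapping; the OneMax fitness function, the shuffled mapping, and the seed are assumptions for this illustration.

```python
import random
from itertools import product

def mean_neighbor_fitness_change(genotypes, f_g, fitness):
    """Average |f(f_g(g)) - f(f_g(h))| over all genotypes g and all
    genotypes h that differ from g in exactly one bit."""
    total, count = 0, 0
    for g in genotypes:
        for i in range(len(g)):
            h = g[:i] + (1 - g[i],) + g[i + 1:]
            total += abs(fitness(f_g(g)) - fitness(f_g(h)))
            count += 1
    return total / count

genotypes = list(product((0, 1), repeat=8))

# Low-locality representation: an arbitrary one-to-one assignment of
# genotypes to phenotypes (seed fixed only for reproducibility).
shuffled = genotypes[:]
random.Random(0).shuffle(shuffled)
scramble = dict(zip(genotypes, shuffled))

high_locality_change = mean_neighbor_fitness_change(genotypes, lambda g: g, sum)
low_locality_change = mean_neighbor_fitness_change(genotypes, scramble.__getitem__, sum)
print(high_locality_change, low_locality_change)
```

Under the identity mapping a genotypic search step always changes the fitness by exactly one unit. Under the shuffled mapping the step lands on an essentially random phenotype, so the fitness change is larger on average and carries no information for a guided search.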

For more information on the influence of the local- 
ity of representations on the performance of EAs, we 
refer the interested reader to [53.3, Sect. 3.3] and [53.2]. 


53.3.2 Biasing Search 


This section discusses how to bias EAs. If we have 
some knowledge about the properties of either high- 
quality or low-quality solutions, we can make use of 
this knowledge for the design of EAs. For representa- 
tions, we can incorporate heuristics or introduce redun- 
dant encodings and assign a larger number of genotypes 
to high-quality phenotypes. Search operators can be 
designed in such a way that they distinguish between 
high-quality and low-quality solution features (building 
blocks) and prefer the high-quality ones [53.2]. 

A representation or search operator is biased if 
the application of a variation operator generates some 
solutions in the search space with higher probabil- 
ity [53.12]. We can bias representations by incorporat- 
ing heuristics into the genotype—phenotype mapping. 
Furthermore, representations can be biased if the
number of genotypes exceeds the number of phenotypes.
Then, representations are called redundant [53.77–80].
Redundant representations are biased if some pheno- 
types are represented by a larger number of genotypes. 
Analogously, search operators are biased if some solu- 
tions are generated with higher probability. 

When biasing EAs, we must make sure that we 
have a priori knowledge about the problem and the bias 
exploits this knowledge in an appropriate way. Intro- 
ducing an inappropriate or wrong bias into EAs would 
mislead search and result in low solution quality. Fur- 
thermore, we must make sure that a bias is not too 
strong. Using a bias can focus the search on specific 
areas of the search space and exclude solutions from 
consideration. If the bias is too strong, EAs can easily 
fail. 

The following paragraphs discuss biasing represen- 
tations and search operators. The next one gives an 
overview of how problem-specific construction heuris- 
tics can be used as genotype—phenotype mappings. 
Then, heuristic search varies either the input (problem 
space search) or the parameters (heuristic space search) 
of the construction heuristic. The following paragraph 
discusses redundant representations. Redundant repre- 
sentations with low locality randomize guided search 
and thus should not be used. Redundant representations 
with high locality can be biased by overrepresenting so- 
lutions similar to optimal solutions. 


Incorporating Construction Heuristics in Representations
We focus on combining problem-specific construction 
heuristics with genotype—phenotype mappings. The 
possibility to design problem-specific representations 
and to incorporate relevant knowledge about the prob- 
lem into the genotype—phenotype mapping by using 
construction heuristics is a promising line of research 
and is continuously discussed in the operations research 
and evolutionary computation communities [53.6, 22, 
81-84]. 

Genotype—phenotype mappings map genotypes to 
phenotypes and can incorporate problem-specific con- 
struction heuristics. An early example of a problem- 
specific representation is the ordinal representation 
of [53.85], who studied the performance of genetic 
algorithms for the TSP. The ordinal representation
encodes a tour (permutation of n integers) by a genotype
$x^g$ of length n, where $x^g_i \in \{1, \ldots, n-i\}$ and
$i \in \{0, \ldots, n-1\}$. For constructing a phenotype, a
predefined permutation $x^s$ of n integers representing the
n different cities is used. $x^s$ can be problem-specific and,
for example, consider edge weights. A phenotype (tour)
is constructed from $x^g$ by subsequently adding (starting
with i = 0) the $x^g_i$-th element of $x^s$ to the phenotype
(which initially contains no elements) and removing the
$x^g_i$-th element of $x^s$. Problem-specific knowledge can be
considered by choosing an appropriate $x^s$, as genotypes
define perturbations of $x^s$ and using small integers for
the $x^g_i$ results in a bias of the resulting phenotypes
towards $x^s$. For example, for $x^g_i = 1$ ($i \in \{0, \ldots, n-1\}$),
the resulting phenotype is $x^s$.
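
The decoding procedure of the ordinal representation can be sketched in a few lines. The function name `decode_ordinal` and the argument names are assumptions for this example; `reference` plays the role of the predefined permutation of cities.

```python
def decode_ordinal(genotype, reference):
    """Build a tour from an ordinal genotype.

    genotype[i] must lie in {1, ..., n - i}; it is the 1-based index of the
    city taken next from the shrinking copy of the reference permutation.
    """
    remaining = list(reference)
    n = len(remaining)
    if len(genotype) != n:
        raise ValueError("genotype and reference must have the same length")
    tour = []
    for i, index in enumerate(genotype):
        if not 1 <= index <= n - i:
            raise ValueError(f"allele {index} at position {i} is out of range")
        # Add the chosen city to the tour and remove it from the reference.
        tour.append(remaining.pop(index - 1))
    return tour

reference = [1, 2, 3, 4]
print(decode_ordinal([1, 1, 1, 1], reference))
print(decode_ordinal([2, 1, 2, 1], reference))
```

A genotype consisting only of ones reproduces the reference permutation, and small alleles in general keep the tour close to it. This is how a good reference tour, for example one built from edge weights, biases the search.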

Other early examples of problem-specific representations
can be found in [53.86], where a problem-specific
schedule builder was incorporated into representations
for job shop scheduling problems, in [53.87],
where representations use a greedy adding heuristic
for partitioning problems, and in [53.88, 89], where the
more general adaptive representation genetic optimization
technique (ARGOT) strategy was studied. ARGOT
dynamically changes either the structure of the genotypes
or the genotype-phenotype mapping according to
the progress made during search.

In parallel to, and independently from represen- 
tations, Storer et al. [53.90] proposed problem space 
search (PSS) and heuristic space search (HSS) as ap- 
proaches that also combine problem-specific heuristics 
with problem-independent EAs. PSS and HSS apply in 
each search iteration of a modern heuristic a problem- 
specific base heuristic H that exploits known properties 
of the problem. The base heuristic H should be fast 
and creates a phenotype from a genotype. Results pre- 
sented for different applications show that this approach 
can lead to improved performance of EAs [53.82, 91, 
92]. For PSS, H is applied to perturbed versions of the 
genotype. The perturbations of the genotypes are usu- 
ally small and based on a definition of neighborhood in 
the genotype space. For HSS, in each search step of the 
modern heuristic, the genotypes remain unchanged, but 
(slightly) different variants of the base heuristic H are 
used. For scheduling problems, linear combinations of 
priority dispatching rules with different weights, or the 
application of different base heuristics to different parts 
of the genotype, have been proposed [53.91]. 
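
The difference between the two approaches can be sketched on a toy single-machine scheduling problem with total tardiness as objective. Everything in the sketch is an assumption made for illustration: the problem, the base heuristic (a weighted sum of a shortest-processing-time and an earliest-due-date priority), the simple accept-if-not-worse search, and all parameter values.

```python
import random

def total_tardiness(seq, proc, due):
    """Objective: sum of job tardiness for a given job sequence."""
    t = total = 0
    for j in seq:
        t += proc[j]
        total += max(0, t - due[j])
    return total

def base_heuristic(proc, due, weights=(0.0, 1.0)):
    """H: sequence jobs by a weighted sum of two dispatching priorities."""
    w_spt, w_edd = weights
    return sorted(range(len(proc)), key=lambda j: w_spt * proc[j] + w_edd * due[j])

def problem_space_search(proc, due, steps=300, sigma=0.2, seed=0):
    """PSS: perturb the problem data fed to H; H itself stays fixed."""
    rng = random.Random(seed)
    geno = (list(proc), list(due))
    best_seq = base_heuristic(*geno)
    best_cost = total_tardiness(best_seq, proc, due)
    for _ in range(steps):
        cand = ([p * (1 + rng.uniform(-sigma, sigma)) for p in geno[0]],
                [d * (1 + rng.uniform(-sigma, sigma)) for d in geno[1]])
        seq = base_heuristic(*cand)
        cost = total_tardiness(seq, proc, due)  # evaluated on the original data
        if cost <= best_cost:
            geno, best_seq, best_cost = cand, seq, cost
    return best_seq, best_cost

def heuristic_space_search(proc, due, steps=300, sigma=0.2, seed=0):
    """HSS: keep the problem data fixed and perturb the weights of H."""
    rng = random.Random(seed)
    weights = (0.0, 1.0)
    best_seq = base_heuristic(proc, due, weights)
    best_cost = total_tardiness(best_seq, proc, due)
    for _ in range(steps):
        cand = tuple(w + rng.gauss(0, sigma) for w in weights)
        seq = base_heuristic(proc, due, cand)
        cost = total_tardiness(seq, proc, due)
        if cost <= best_cost:
            weights, best_seq, best_cost = cand, seq, cost
    return best_seq, best_cost

proc = [4, 2, 6, 3, 5]
due = [5, 6, 8, 9, 12]
base_cost = total_tardiness(base_heuristic(proc, due), proc, due)
pss_seq, pss_cost = problem_space_search(proc, due)
hss_seq, hss_cost = heuristic_space_search(proc, due)
print(base_cost, pss_cost, hss_cost)
```

In both variants the search never handles a schedule directly. It only varies the input or the parameters of the base heuristic, which always returns a feasible sequence.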

PSS and HSS use the same underlying concepts 
as problem-specific representations. The base heuris- 
tic H is equivalent to a (usually problem-specific) 
genotype—phenotype mapping and assigns phenotypes 
to genotypes. PSS performs heuristic search by mod- 
ifying (perturbing) the genotypes, which is equivalent 
to the concept of representations originally proposed 
by [53.22] (for an early example, see [53.85]). HSS
perturbs the base heuristic (genotype-phenotype mapping),
which is similar to the concept of adaptive
representations (for early examples, see [53.88] or [53.93]).


Redundant Representation 
We assume a combinatorial optimization problem with
a finite number of phenotypes. If the size of the genotype
search space is equal to the size of the phenotype
search space ($|\Phi_g| = |\Phi_p|$) and the representation maps
each genotype to a different phenotype (bijection), a
representation cannot be biased. All solutions are represented
with the same probability and a bias can only be a result of
the search operator used.

The situation is different for representations where 
the number of genotypes exceeds the number of pheno- 
types. We still assume that all phenotypes are encoded 
by at least one genotype. Such representations are
usually called redundant (e.g., in [53.78, 94, 95], or [53.80]).
Radcliffe and Surry [53.28] introduced a different notion
of redundancy and distinguished between degenerated 
representations, where more than one genotype en- 
codes one phenotype, and redundant representations 
where parts of the genotypes are not used for the 
construction of a phenotype. However, this distinction 
has not generally been accepted in the EA commu- 
nity. Therefore, we follow the majority of the lit- 
erature and define encodings to be redundant if the 
number of genotypes exceeds the number of pheno- 
types (which is equivalent to the notion of degeneracy 
of [53.28]). 

Rothlauf and Goldberg [53.80] distinguished be- 
tween different types of redundant representations 
based on the similarity of the genotypes that are as- 
signed to the same phenotype. A representation is 
defined to be synonymously redundant if the geno- 
types that are assigned to the same phenotype are 
similar to each other. Consequently, representations are 
non-synonymously redundant if the genotypes that are 
assigned to the same phenotype are not similar to each 
other. Therefore, the synonymity of a representation 
depends on the genotype and phenotype metric. Fig- 
ure 53.5 illustrates the differences between synonymous 
and non-synonymous redundant encodings. Distances 
between solutions are measured using a Euclidean met- 
ric. The symbols indicate different genotypes and their 
corresponding phenotypes. When using synonymously 
redundant representations (left), genotypes that repre- 
sent the same phenotype are similar to each other. 
When using non-synonymously redundant representa- 
tions (right), genotypes that represent the same pheno- 
type are not similar to each other but distributed over 
the whole search space.

Formally, a redundant representation $f_g$ assigns
a phenotype $x^p$ to a set of different genotypes
$x^g \in \Phi_g^{x^p}$, where $\forall x^g \in \Phi_g^{x^p}: f_g(x^g) = x^p$.
All genotypes $x^g$ in the genotype set $\Phi_g^{x^p}$ represent
the same phenotype $x^p$. A representation is synonymously
redundant if the genotype distances between all
$x^g \in \Phi_g^{x^p}$ are small for all different $x^p$. Therefore,
if for all phenotypes the sum over the distances between
all genotypes that correspond to the same phenotype

$$\frac{1}{2} \sum_{x^p} \left( \sum_{x^g \in \Phi_g^{x^p}} \sum_{y^g \in \Phi_g^{x^p}} d(x^g, y^g) \right), \qquad (53.6)$$

where $x^g \neq y^g$, is reasonably small, a representation is
called synonymously redundant. $d(x^g, y^g)$ depends on
the metric used and measures the distance between two
genotypes $x^g \in \Phi_g^{x^p}$ and $y^g \in \Phi_g^{x^p}$, which both
represent the same phenotype $x^p$.
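
The sum in (53.6) can be evaluated by brute force for small redundant encodings. The sketch below counts each unordered pair of genotypes once (counting ordered pairs would only double the value) and compares two assumed example mappings from 3-bit genotypes to four phenotypes.

```python
from collections import defaultdict
from itertools import combinations, product

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def within_phenotype_distance(genotypes, f_g, d_g):
    """Sum of genotype distances over all unordered pairs of different
    genotypes that represent the same phenotype."""
    groups = defaultdict(list)
    for g in genotypes:
        groups[f_g(g)].append(g)
    return sum(d_g(x, y)
               for group in groups.values()
               for x, y in combinations(group, 2))

genotypes = list(product((0, 1), repeat=3))

def drop_last_bit(g):
    """Synonymous example: the last bit is ignored."""
    return g[:2]

def fold_complement(g):
    """Non-synonymous example: a bitstring and its complement share a phenotype."""
    complement = tuple(1 - b for b in g)
    return min(g, complement)

synonymous_sum = within_phenotype_distance(genotypes, drop_last_bit, hamming)
non_synonymous_sum = within_phenotype_distance(genotypes, fold_complement, hamming)
print(synonymous_sum, non_synonymous_sum)
```

Both encodings assign exactly two genotypes to each phenotype, yet the second one places them at maximum Hamming distance, which is what makes it non-synonymous.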


Non-Synonymously Redundant Representations. 
The synonymity of a representation can have a large 
influence on the performance of EAs. When using non- 
synonymously redundant representations, a local search 
operator can result in solutions that are phenotypically 
completely different from their parents. For recombi- 
nation operators, the distances between offspring and 
parents are not necessarily smaller than the distances 
between parents. 

Fig. 53.5 Synonymous versus non-synonymous redundancy

Local search methods outperform random search if
solutions with similar fitness are grouped together in
the search space and are not scattered over the whole
search space [53.25, 33-35, 96, 97]. Furthermore,
problems are easy for guided search methods if distances
between solutions are related to corresponding fitness
differences. However, non-synonymously redundant
representations destroy existing correlations between 
solutions and their corresponding fitness values. Thus, 
search heuristics cannot use any information learned 
during the search for determining future search steps. 
As a result, it makes no sense for guided search ap- 
proaches to search around already found high-quality 
genotypes and guided search algorithms become ran- 
dom search. A local search step does not result in 
a solution with similar properties but in a random 
solution. Analogously, recombination is not able to 
create new solutions with similar properties to their 
parents, but creates new, random solutions. Therefore, 
non-synonymously redundant representations have the 
same effect on EAs as low-locality representations 
(Sect. 53.3.1). 

The use of non-synonymously redundant represen- 
tations allows us to reach many different phenotypes 
in a single local search step (Fig. 53.6). However, in- 
creasing the connectivity between phenotypes results 
in random search and decreases the efficiency of EAs. 
As for low-locality representations, a genotype search 
step does not result in a similar phenotype but cre- 
ates a random solution. Therefore, guided search is no 
longer possible but becomes random search. As a result, 
we obtain reduced performance of EAs on straightfor- 
ward problems when using non-synonymously redun- 
dant representations. 

Fig. 53.6a,b The effects of local search steps for (a)
synonymously versus (b) non-synonymously redundant
representations. The arrows indicate search steps

On average, non-synonymously redundant representations
transform straightforward (as well as misleading)
problems into difficult problems where the
fitness differences between two solutions are not
correlated to their distances. Easy problems become
more difficult. Therefore, we do not recommend us- 
ing non-synonymously redundant encodings. A more 
detailed discussion of non-synonymously redundant 
representations can be found in [53.3, Sect. 3.1] 
and [53.60]. 


Bias of Synonymously Redundant Representations. 
The use of synonymously redundant representations al- 
lows local search to generate neighboring solutions. 
Small variations of genotypes cannot result in large 
phenotypic changes but result either in the same or 
a similar phenotype (Fig. 53.6, left). 

To describe relevant properties of synonymously
redundant representations, we can use
the order of redundancy, $k_r$, which is defined as
$k_r = \log(|\Phi_g|)/\log(|\Phi_p|)$ [53.3, Sect. 3.1]. $k_r$
measures the amount of redundant information in the
encoding. Furthermore, we are especially interested
in biases of synonymously redundant representations.
$r$ measures a bias and denotes the number of genotypes
that represent the optimal solution. When using
non-redundant representations, every phenotype is
assigned to exactly one genotype ($r = 1$). In general,
$1 \le r \le |\Phi_g| - |\Phi_p| + 1$.

Synonymously redundant representations are unbi- 
ased (uniformly redundant) if each phenotype is, on 
average, encoded by the same number of genotypes. 
In contrast, encodings are biased (non-uniformly redun- 
dant) if some phenotypes are represented by a different 
number of genotypes. Rothlauf and Goldberg [53.80] 
studied how the bias of synonymously redundant
representations influences the performance of EAs. If
representations are uniformly redundant, unbiased search
operators generate each phenotype with the same prob- 
ability as for a non-redundant representation. Further- 
more, variation operators have the same effect on the 
genotypes and phenotypes and the performance of EAs 
using a uniformly and synonymously redundant encod- 
ing is similar to non-redundant representations. 
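
The order of redundancy and the number of genotypes per phenotype can be computed directly for small encodings. The sketch below uses two assumed example mappings from 3-bit genotypes to four phenotypes, one uniformly and one non-uniformly redundant; the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product

def order_of_redundancy(n_genotypes, n_phenotypes):
    """Ratio of the logarithms of the genotype and phenotype space sizes."""
    return math.log(n_genotypes) / math.log(n_phenotypes)

def genotypes_per_phenotype(genotypes, f_g):
    """How many genotypes represent each phenotype."""
    return Counter(f_g(g) for g in genotypes)

genotypes = list(product((0, 1), repeat=3))

# Uniformly redundant: the last bit is ignored, two genotypes per phenotype.
uniform_counts = genotypes_per_phenotype(genotypes, lambda g: g[:2])

# Non-uniformly redundant: the phenotype is the number of ones (0 to 3).
biased_counts = genotypes_per_phenotype(genotypes, sum)

k_r = order_of_redundancy(len(genotypes), len(uniform_counts))
r_upper_bound = len(genotypes) - len(biased_counts) + 1
print(k_r, dict(uniform_counts), dict(biased_counts), r_upper_bound)
```

If the optimal phenotype in the second encoding were the one with two ones, it would be represented by r = 3 genotypes. If it were the all-ones phenotype, r = 1 and the optimum would be underrepresented.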

The situation is different for non-uniformly redundant
encodings. The probability $P$ of finding the
correct solution follows $P \propto 1 - \exp(-r/2^{k_r})$ [53.3].
Therefore, uniformly redundant representations do not
change the behavior of EAs. Only by increasing $r$,
which means overrepresenting optimal solutions, does
the performance of EAs increase. In contrast, the
performance of EAs decreases if the optimal solution is
underrepresented. Therefore, non-uniformly redundant
representations can only be used advantageously if
a priori information exists regarding optimal solutions.
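
The qualitative effect of r can be seen by evaluating the expression that P is proportional to. The sketch below reads the relation as 1 − exp(−r/2^{k_r}), with k_r the order of redundancy; the function name and the chosen values of r and k_r are assumptions, and the result is a relative quantity rather than an actual success probability.

```python
import math

def relative_success(r, k_r):
    """Quantity that P is proportional to: 1 - exp(-r / 2**k_r)."""
    return 1.0 - math.exp(-r / 2 ** k_r)

k_r = 2
values = {r: relative_success(r, k_r) for r in (1, 2, 4, 8)}
for r, value in values.items():
    print(r, round(value, 3))
```

For a fixed order of redundancy the expression grows monotonically with r, which matches the statement that overrepresenting the optimal solution helps and underrepresenting it hurts.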


For more information on redundant representations, we 
refer the reader to [53.80]. 


Search Operators 
Search operators are applied either to genotypes or phe- 
notypes and subsequently create new solutions. Search 
operators can be either biased or unbiased. In the un- 
biased case, each solution in the search space has the 
same probability of being created. If some phenotypes 
have higher probabilities to be created by applying 
a search operator to a randomly chosen solution, we call 
this a bias towards those phenotypes [53.3, 98]. 
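
Whether an operator is biased in this sense can be estimated empirically by applying it to randomly chosen solutions and counting how often each solution is produced. The two operators below (flipping a random bit, and setting a random bit to one) and the sampling parameters are assumed examples.

```python
import random
from collections import Counter
from itertools import product

def flip_random_bit(x, rng):
    """Unbiased example: invert one randomly chosen bit."""
    i = rng.randrange(len(x))
    return x[:i] + (1 - x[i],) + x[i + 1:]

def set_random_bit(x, rng):
    """Biased example: force one randomly chosen bit to 1."""
    i = rng.randrange(len(x))
    return x[:i] + (1,) + x[i + 1:]

def offspring_distribution(operator, n_bits=3, samples=8000, seed=0):
    """Apply the operator to uniformly chosen solutions and count the results."""
    rng = random.Random(seed)
    space = list(product((0, 1), repeat=n_bits))
    return Counter(operator(rng.choice(space), rng) for _ in range(samples))

unbiased = offspring_distribution(flip_random_bit)
biased = offspring_distribution(set_random_bit)
all_zeros, all_ones = (0, 0, 0), (1, 1, 1)
print(unbiased[all_zeros], unbiased[all_ones])
print(biased[all_zeros], biased[all_ones])
```

The second operator can never create the all-zeros solution and creates the all-ones solution about twice as often as the first one. Such a bias is helpful only if solutions with many ones are indeed of high quality.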

Using biased search operators can be helpful if 
some knowledge about the structure of high-quality 
solutions exists and the search operators are biased 
such that high-quality solutions are preferred. Then, the 
average fitness of solutions that are generated by the bi- 
ased search operator is higher than randomly generated 
solutions or search operators without a bias. Identify- 
ing high-quality solutions is often difficult, as exact 
optimization methods usually need exponential effort 
to find optimal solutions for relevant (often NP-hard) 
problems, and heuristic optimization methods usually 
do not provide any guarantee on solution quality. Pos- 
sible approaches to overcome these problems are to 
exactly solve small problem instances and to deduce the 
structure of high-quality solutions from the solutions 
obtained. However, this assumes that relevant prop- 
erties of optimal solutions are independent from the 
problem size. Second, solutions of higher quality can 
be identified by heuristic optimization methods if we in- 
crease the time spent on the heuristic search. Although 
heuristic optimization methods usually do not provide 
any guarantee of finding optimal solutions, the proba- 
bility of finding high-quality solutions increases with 
the time spent on the heuristics search. For an example, 
we refer the reader to [53.99]. 

Problems can occur if a bias induced by a search 
operator is either too strong or towards solutions that 
have a large distance to an optimal solution. If the bias 
is too strong, the diversity of a population is reduced and the
individuals in a population quickly converge towards 
those solutions to which the search operators are biased. 
Then, after a few search steps, heuristic search is no 
longer possible. Furthermore, the performance of EAs 
decreases if a bias exists towards solutions that have 
a large distance to optimal solutions. The biased search 
operators push a population of solutions in the wrong 
direction and it is difficult for EAs to find optimal solu- 
tions. Therefore, biased search operators should be used 
with care. 

Standard search operators for binary, integer, or
continuous genotypes (Sect. 53.2.5) are unbiased. The 
situation is slightly different for standard recombina- 
tion operators. Recombination operators never intro- 
duce new solution features but only recombine existing 
properties. Thus, once some solution features are lost 
in a population of solutions they can never be regained 
later by using recombination operators alone. Given 
a randomly generated and unbiased initial population 
of solutions, the iterative application of recombination 
operators can result in a random fixation of some decision
variables, which reduces the diversity of solutions that
exist in a population. This effect is known as genetic
drift. The existence of genetic drift is widely known
and has been addressed in the field of population genetics
[53.100–104] and also in the field of evolutionary
algorithms [53.105–108].
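
The fixation effect described above can be illustrated with a small simulation. The following is only a sketch under stated assumptions: binary genotypes, uniform crossover, parents drawn uniformly at random (no selection pressure), and no mutation. The function names and parameter values are illustrative and are not taken from the chapter.

```python
import random


def uniform_crossover(a, b, rng):
    """Return a child whose genes are each copied from one of the two parents."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]


def fixed_loci(population):
    """Indices of positions at which all individuals carry the same allele."""
    length = len(population[0])
    return {i for i in range(length) if len({ind[i] for ind in population}) == 1}


def drift(pop_size=20, length=30, generations=200, seed=1):
    """Apply recombination only and record the fixed loci in every generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    history = [fixed_loci(population)]
    for _ in range(generations):
        # Parents are chosen uniformly at random; no mutation is applied.
        population = [
            uniform_crossover(rng.choice(population), rng.choice(population), rng)
            for _ in range(pop_size)
        ]
        history.append(fixed_loci(population))
    return history


history = drift()
# Number of fixed decision variables every 40 generations.
print([len(h) for h in history[::40]])
```

Because recombination only copies existing alleles, a decision variable that has become fixed stays fixed in all later generations, so the set of fixed loci can only grow.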

When using more sophisticated phenotypes and 
direct search operators, identifying bias of search op- 
erators may be difficult. For example, in standard 
approaches, programs and syntactical expressions are 
encoded as trees of variable size. Daida et al. [53.109] 
showed that the two standard search operators in genetic
programming (sub-tree swapping crossover and
sub-tree mutation) are biased as they do not effectively
search all tree shapes. In particular, very full
or very narrow tree solutions are extraordinarily difficult
to find, even when the fitness function provides
good guidance to the optimum solutions. Therefore, genetic
programming approaches will perform poorly on
problems where optimal solutions require full or narrow
trees. Furthermore, since the search operators do
not find solutions that are at both ends of this fullness
spectrum, problems may arise if we use those
search operators to solve problems whose solutions
are restricted to a particular shape, of whatever degree
of fullness. Hoai et al. [53.110] studied approaches to
overcome these problems and introduced a new tree-based
representation and local insertion and deletion
search operators with a lower bias.

In general, the bias of search operators can be
measured by comparing the properties of randomly
generated solutions with solutions that are created by
subsequent applications of search operators. Examples
of how to analyze the bias of search operators can be
found in [53.2].


53.4 Summary and Conclusions


This chapter discusses the design of representations and
search operators. Representations and search operators
cannot be designed independently of each other, as together
they define the structure of the search space.
Section 53.1 summarized the benefits of representations
and gave an overview of standard genotypes.
Section 53.2 reviewed design guidelines for local, as
well as recombination operators and gave an overview
of standard search operators. Section 53.3 discussed
possibilities for a problem-specific design of representations
and search operators.

Representations and search operators should have
high locality; this means that applying a local search
operator to a genotype should result in a neighboring
phenotype. If we have some knowledge about the properties
of high-quality solutions, we can bias evolutionary
algorithms by either incorporating problem-specific
heuristics in the representation, using biased representations,
or biased search operators.

For problems of practical relevance, we assume that
the metric of a search space is meaningful and, on
average, the fitness differences between neighboring
solutions are smaller than between randomly chosen solutions.
Search operators and representations should be
designed in such a way that they fit the metric of the
search space. If local as well as recombination-based
search operators are not able to generate similar solutions,
intensification of search is not possible, and EAs
behave like random search. For a representation which
introduces an additional genotype-phenotype mapping,
we must make sure that it does not alter the character
of the search operators. Therefore, its locality must
be high; this means that the phenotype metric must fit
the genotype metric. Low locality of a representation
randomizes the search and leads to low performance of
evolutionary algorithms.

There is a general trade-off between the effective- 
ness and application range of optimization methods. 
Usually, the more problems that can be solved with 
one particular optimization method, the lower its re- 
sulting average performance. Therefore, standard EAs 
that are not problem-specific often only work for small 
or toy problems. As the problem becomes larger and 
more realistic, performance degrades. To improve the 
performance for selected optimization problems, we 
must design EAs in a more problem-specific way. By
assuming that most problems in the real world have 
high locality, EAs already exploit a specific property 
of problems, namely their high locality. We can further
increase the performance of EAs if we have some
idea about properties of high-quality solutions. Such
problem-specific knowledge can be exploited by introducing
a bias into EAs. The bias should consider this
knowledge and, for example, concentrate search on solutions
that are expected to be of high quality or avoid
solutions expected to be of low quality.


54. Stochastic Local Search Algorithms: An Overview


Holger H. Hoos, Thomas Stützle


In this chapter, we give an overview of the main 
concepts underlying the stochastic local search 
(SLS) framework and outline some of the most rel- 
evant SLS techniques. We also discuss some major 
recent research directions in the area of stochas- 
tic local search. The remainder of this chapter is 
structured as follows. In Sect. 54.1, we situate the 
notion of SLS within the broader context of fun- 
damental search paradigms and briefly review the 
definition of an SLS algorithm. In Sect. 54.2, we 
summarize the main issues and trends in the 
design of greedy constructive and iterative im- 
provement algorithms, while in Sects. 54.3-54.5, 
we provide a concise overview of some of the 
most widely used simple, hybrid, and popula- 
tion-based SLS methods. Finally, in Sect. 54.6, 
we discuss some recent topics of interest, such 
as the systematic design of SLS algorithms and 
methods for the automatic configuration of SLS 
algorithms. 


54.1 The Nature and Concept of SLS
54.2 Greedy Construction Heuristics
and Iterative Improvement


Stochastic local search (SLS) algorithms are the method 
of choice for solving computationally hard decision 
and optimization problems from a wide range of ar- 
eas, including computing science, operations research, 
engineering, chemistry, biology and physics. SLS com- 
prises a spectrum of techniques ranging from simple 
constructive and iterative improvement procedures to 
more complex methods, such as simulated anneal- 
ing (SA), iterated local search or evolutionary al- 
gorithms (EAs). As evident from the term stochas- 
tic local search, randomization can, and often does, 
play a prominent role in these methods. Randomized 
choices may be used in the generation of initial so- 
lutions or in the decision which of several possible 
search steps to perform next, sometimes merely to
break ties between equivalent alternatives, and some- 
times to heuristically and probabilistically select from 
large and diverse sets of possible candidates. Judi- 
cious use of randomization can arguably simplify 
algorithm design and help achieve robust algorithm 
behavior. 

The concept of an SLS algorithm has been defined
formally [54.1] and provides a unifying framework
for many different types of algorithms, including
not only the previously mentioned constructive and
iterative improvement procedures, but also a wide range
of more complex search methods commonly known as
metaheuristics.

Greedy constructive and iterative improvement procedures
are important SLS algorithms, since they typically
serve as building blocks for more complex SLS
algorithms, whose performance critically depends on 
the design choices and fine tuning of these underly- 
ing components. Greedy constructive algorithms and 
iterative improvement procedures terminate naturally 
when a complete solution has been generated or a local 
optimum of a given evaluation function is reached, re- 
spectively. One possible way to obtain better solutions 
is to restart these basic SLS procedures from randomly 
chosen initial search positions. However, this approach 
has been shown to be relatively ineffective in practice for
reasonably sized problem instances (and it breaks down for
large instances [54.2]). 

To overcome these limitations, over the last 
decades, a large number of more sophisticated, gen- 
eral-purpose SLS methods [54.1] have been introduced; 
these are often called metaheuristics [54.3], since they 
are based on higher level schemes for controlling one 
or more subsidiary heuristic search procedures. We 
divide these general-purpose SLS methods into three 
broad classes: simple, hybrid and population-based 
SLS methods. Simple SLS methods typically use one 
neighborhood relation during the search and either 
modify the acceptance criterion for search steps, so that
worsening steps are occasionally accepted, or modify
the evaluation function that is used during the local
search process. Examples of simple SLS methods include
SA [54.4, 5] and (simple) tabu search [54.6–9].
A number of SLS methods combine different types of 
search steps — for example, construction steps and per- 
turbative local search steps — or introduce occasional 
larger modifications into current candidate solutions, 
to provide appropriate starting points for subsequent 
iterative improvement search. Examples of such hybrid
SLS methods include greedy randomized adaptive
search procedures (GRASPs) [54.10] and iterated local
search [54.11]. Finally, several SLS methods maintain
and manipulate at each iteration a set, or population,
of candidate solutions, which provides a natural way
of increasing search diversification. Examples of such
population-based SLS methods include EAs [54.12–15],
scatter search [54.16, 17] and ant colony optimization
[54.18–20].

Our classification into simple, hybrid and population-based
SLS methods is not the only possible one,
and certain SLS algorithms could be seen as belonging
to more than one category. For example, many population-based
SLS methods are also hybrid, as they use
different search operators or combine the manipulation
of the population of candidate solutions with iterative
improvement on members of the population to achieve
increased performance. In fact, there is an increasing
trend to design and apply SLS algorithms that are
not merely based on a single, well-established general-purpose
SLS method, but rather flexibly combine elements
of different SLS methods or incorporate mechanisms
taken from systematic search algorithms, such
as branch and bound or dynamic programming. The
conceptual framework of SLS naturally accommodates
this development, and the composition of more complex
SLS algorithms from conceptually simpler components
is explicitly supported, for example, by the concept of
generalized local search machines [54.1]. In this context,
methodological issues concerning the engineering
of SLS algorithms [54.21, 22] are increasingly gaining
importance. Similarly, the exploitation of automatic algorithm
configuration techniques and, more generally,
the programming by optimization paradigm [54.23] enable
the systematic development of high-performance
SLS algorithms.


54.1 The Nature and Concept of SLS


Computational approaches for the solution of hard,
combinatorial problems can all be viewed as performing
some form of search. Essentially, search algorithms
generate and evaluate candidate solutions for the problem
instance at hand. For combinatorial decision problems,
the evaluation of a candidate solution requires
checking whether the candidate solution is a feasible solution
satisfying all given constraints; for combinatorial
optimization problems, it involves computing the value
of the given objective function. For NP-complete decision
problems and NP-equivalent optimization problems,
even the most efficient algorithms known to date
require running time exponential in the instance size in
the worst case, while candidate solutions can be evaluated
in polynomial time.

A candidate solution for an instance of a combinatorial
problem is generally composed of solution
components. Consider, for example, the well-known
traveling salesperson problem (TSP). In the TSP, one is
given a weighted, fully connected graph $G = (V, E, w)$,
where $V = \{v_1, v_2, \dots, v_n\}$ is the set of $|V| = n$ vertices,
$E \subseteq V \times V$ is the set of edges that fully connects the
graph, and $w: E \to \mathbb{R}^+$ is a function that assigns to
each edge $e \in E$ a nonnegative weight $w(e)$. The objective
is to find a minimum-weight Hamiltonian cycle in
G. A candidate solution for a TSP instance can be represented
by a permutation $\pi = (\pi(1), \pi(2), \dots, \pi(n))$ of
the vertex indices, and the objective function $w$ is given
as

\[
w(\pi) = w\bigl(v_{\pi(n)}, v_{\pi(1)}\bigr) + \sum_{i=1}^{n-1} w\bigl(v_{\pi(i)}, v_{\pi(i+1)}\bigr) . \tag{54.1}
\]


In the TSP, a (complete) candidate solution, commonly 
also called a tour, can be seen as consisting of $n$ out of
the $n \cdot (n-1)$ possible edges, and each edge represents
a solution component. 
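
As a concrete reading of the objective function (54.1), the following sketch evaluates a tour given as a permutation. It assumes 0-based vertex indices (the text uses 1-based indices) and a weight matrix `W`, which is a hypothetical four-vertex instance made up for illustration.

```python
def tour_weight(perm, w):
    """Weight of the Hamiltonian cycle that visits the vertices in the order perm.

    perm is a permutation of range(n); w[u][v] is the nonnegative weight of
    the edge (u, v).
    """
    n = len(perm)
    closing_edge = w[perm[n - 1]][perm[0]]  # edge from the last vertex back to the first
    path = sum(w[perm[i]][perm[i + 1]] for i in range(n - 1))
    return closing_edge + path


# Hypothetical symmetric instance with four vertices.
W = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]

print(tour_weight([0, 1, 2, 3], W))  # 10 + (2 + 6 + 3) = 21
print(tour_weight([0, 1, 3, 2], W))  # 9 + (2 + 4 + 3) = 18
```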

Any given tour can be modified by removing two 
edges and introducing two unique new edges such that 
another valid tour is obtained. This modification is an 
example of a perturbation of a complete candidate so- 
lution, and we refer to search algorithms that make 
systematic use of such solution modifications as per- 
turbative search methods. In practice, such perturbative 
search methods iteratively modify a current candidate 
solution according to some rule, and this process ends 
when a given termination criterion is met. 

Perturbative search methods start from some com- 
plete candidate solution. The task of generating such 
candidate solutions is commonly accomplished by con- 
structive search methods or construction heuristics. 
Constructive search methods iteratively extend an ini- 
tially empty candidate solution by one or several solu- 
tion components until a complete candidate solution is 
obtained. Constructive search methods can thus be seen 
as operating in a search space of partial candidate solu- 
tions. An example of a constructive search method is the 
nearest neighbor heuristic for the TSP. An initial ver- 
tex is chosen randomly, and at each construction step, 
the nearest neighbor heuristic follows a minimal weight 
edge to one of the vertices that have not yet been vis- 
ited. These steps are iterated until all vertices have been 
visited, and the tour is completed by returning to the 
initial vertex. 
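
The nearest neighbor heuristic just described can be sketched as follows. The weight matrix `W`, the optional `start` argument and the function name are assumptions made for this illustration; ties between equally close vertices are broken arbitrarily.

```python
import random


def nearest_neighbor_tour(w, start=None, rng=None):
    """Construct a tour by repeatedly moving to the closest unvisited vertex."""
    rng = rng or random.Random()
    n = len(w)
    current = rng.randrange(n) if start is None else start
    tour = [current]
    unvisited = set(range(n)) - {current}
    while unvisited:
        # Construction step: follow a minimum-weight edge to an unvisited vertex.
        current = min(unvisited, key=lambda v: w[current][v])
        tour.append(current)
        unvisited.remove(current)
    # The closing edge back to tour[0] is implicit in the permutation.
    return tour


# Hypothetical symmetric instance with four vertices.
W = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]

print(nearest_neighbor_tour(W, start=0))  # [0, 1, 3, 2]
```

Each intermediate value of `tour` is a partial candidate solution, which is why the heuristic can be viewed as moving through a search space of partial solutions.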

Generally speaking, local search algorithms start at 
some initial search position and iteratively move, based 
on local information, from the current position to neigh- 
boring positions in the search space. Both perturbative 
and constructive search methods match this general de- 
scription. While in the literature, the term local search 
is mostly used for perturbative search methods, it also
applies to constructive search methods: A partial solution
corresponds to a position in the search space of
partial candidate solutions, and the neighbors of a partial
solution are obtained by extending it with one or
more solution components. In fact, there are a number
of well-known generic SLS methods, such as GRASP, 
iterated greedy and ant colony optimization, that are 
based on constructive local search. 

Many local search algorithms use randomized de- 
cisions, for example, for generating initial solutions or 
when determining search steps. We therefore refer to 
such methods as stochastic local search (SLS) algo- 
rithms. The following components need to be specified 
to define an SLS algorithm (for a formal definition, we 
refer to Chap. 1 of [54.1]). 


• Search space: comprises the set of candidate solutions
  (or search positions) for the given problem
  instance.

• Solution set: consists of the search positions that
  are considered to be solutions of the given problem
  instance. In the case of decision problems, the solution
  set comprises all feasible candidate solutions;
  in the case of optimization problems, the solution
  set typically comprises all optimal feasible candidate
  solutions.

• Neighborhood relation: specifies the direct neighbors
  of each candidate solution s, i.e., the search
  positions that can be reached from s in a single
  search step of the SLS algorithm.

• Memory states: hold additional information about
  the search beyond the search position. If an algorithm
  is memoryless, the memory may consist of
  a single, constant state.

• Initialization function: specifies the search initialization
  in the form of a probability distribution over
  initial search positions and memory states.

• Step function: determines the computation of
  search steps by mapping each search position and
  memory state to a probability distribution over its
  neighboring search positions and memory states.

• Termination predicate: used to decide search termination
  based on the current search position and
  memory state.


The formal definition of an SLS algorithm speci- 
fies the initialization function, the step function, and 
the termination predicate as probability distributions, 
which the algorithm samples at each step during any 
given run. In practice, however, the initialization func- 
tion, the step function, and the termination predicate 
will be specified by procedures, and the corresponding
probability distributions are only implicitly defined.
Note that the definition of an SLS algorithm is general 
enough to include deterministic local search algorithms. 
In fact, formally we can describe deterministic local 
search algorithms as special cases of SLS algorithms — 
deterministic decisions can be modeled using degener- 
ate probability distributions (Dirac delta). 

The working principle of an SLS algorithm is then 
as follows. The search process starts from some ini- 
tial search state that is generated by the initialization 
function. While some termination criterion is not satis- 
fied, search steps are performed according to the step 
function. In the case of optimization problems, the 
SLS algorithm keeps track of the best solution found 
so far, which is then returned upon termination of 
the algorithm. In the case of decision problems, the 
SLS algorithm typically stops as soon as a (feasible) 
solution is found or another termination criterion is 
satisfied. 
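
The working principle above can be written down as a short generic loop for the optimization case. This is a sketch rather than the formal definition from [54.1]: the initialization function, step function and termination predicate are passed in as procedures, so their probability distributions are only implicit, and all names below are illustrative. The toy instance (minimizing the number of ones in a bit string by an uninformed random walk, with a step counter as memory state) is made up for the example.

```python
import random


def stochastic_local_search(init, step, terminate, evaluate, rng):
    """Run an SLS algorithm and return the best position found and its value."""
    position, memory = init(rng)
    best, best_value = position, evaluate(position)
    while not terminate(position, memory):
        position, memory = step(position, memory, rng)
        value = evaluate(position)
        if value < best_value:  # keep track of the incumbent (minimization)
            best, best_value = position, value
    return best, best_value


def init(rng):
    """Random bit string of length 12; the memory state is a step counter."""
    return tuple(rng.randint(0, 1) for _ in range(12)), 0


def step(position, steps, rng):
    """Move to a random neighbor obtained by flipping one bit."""
    i = rng.randrange(len(position))
    flipped = position[:i] + (1 - position[i],) + position[i + 1:]
    return flipped, steps + 1


def terminate(position, steps):
    """Stop after a fixed number of search steps."""
    return steps >= 200


best, best_value = stochastic_local_search(init, step, terminate, sum, random.Random(0))
print(best, best_value)
```

A deterministic local search fits the same interface: its procedures simply ignore `rng`.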

In all but the simplest cases, the search process is 
guided by an evaluation function, which measures the 
quality of candidate solutions. The efficacy of this guid- 
ance depends on the properties of the evaluation func- 
tion and the way in which it is integrated into the search 
process. Evaluation functions are generally problem 
specific. For many optimization problems, the objec- 
tive function given by the problem definition is used; 
however, different evaluation functions can sometimes 
provide better guidance, for example, in the sense of ap- 
proximation guarantees [54.24]. In decision problems, 
an appropriate evaluation function has to be defined by 
the algorithm designer. Often, the objective function 
used for optimization variants of the decision prob- 
lem can provide useful guidance. For example, for the 
satisfiability problem in propositional logic (SAT), the 
objective function of MAX-SAT, which, in a nutshell, 
counts the number of constraint violations, provides ef- 
fective guidance. Some SLS methods, such as dynamic 
local search (briefly discussed in Sect. 54.3), modify the 
evaluation function during the search process. 
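
For SAT, the evaluation function mentioned above can be sketched as a count of unsatisfied clauses. The clause encoding (signed integers, where `k` stands for variable k and `-k` for its negation) and the small formula are assumptions chosen for this illustration.

```python
def unsatisfied_clauses(clauses, assignment):
    """Number of clauses that contain no true literal under the assignment.

    clauses is a list of clauses, each a list of nonzero integers;
    assignment maps each variable index to True or False.
    """
    def literal_true(lit):
        value = assignment[abs(lit)]
        return value if lit > 0 else not value

    return sum(1 for clause in clauses if not any(literal_true(l) for l in clause))


# (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
clauses = [[1, 2], [-1, 3], [-2, -3]]

print(unsatisfied_clauses(clauses, {1: True, 2: False, 3: False}))  # 1
print(unsatisfied_clauses(clauses, {1: True, 2: False, 3: True}))   # 0
```

An assignment is a solution of the decision problem exactly when this count is zero.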

The general concept of SLS algorithms, as intro- 
duced above and discussed in depth by Hoos and 
Stützle [54.1], provides a unified view of constructive 
and perturbative local search techniques that range from 
rather simplistic greedy constructive heuristics and iter- 
ative improvement algorithms to rather complex hybrid 
and population-based SLS methods. Population-based 
algorithms, which manipulate sets of candidate solu- 
tions at each iteration, fall under the definition of an 
SLS algorithm by considering search positions consisting
of sets of candidate solutions. In this case, the step
function also operates on sets of candidate solutions for 
the given problem instance. For example, in the case of 
typical EAs, recombination, mutation, and selection can 
all be modeled as operations on sets of candidate solu- 
tions, which are formally parts of a single-step function 
used for mapping one generation to the next. 

It is instructive to contrast the concept of an SLS 
algorithm with that of a metaheuristic. Metaheuristics 
have been described as heuristics that are superimposed 
on another heuristic [54.6], as a [54.25]:


master strategy that guides and modifies other
heuristics to produce solutions beyond those that
are normally generated in a quest for local optimality,


as [54.20]:


a set of algorithmic concepts that can be used to 
define heuristic methods applicable to a wide set of 
different problems, 


and as [54.26]: 


a high-level problem-independent algorithmic 
framework that provides a set of guidelines 
or strategies to develop heuristic optimization 
algorithms. 


However, the term metaheuristic [54.26]: 


is also used to refer to a problem-specific implemen- 
tation of a heuristic optimization algorithm accord- 
ing to the guidelines expressed in such a framework. 


As is evident from these characterizations, there is 
no formal definition of the term metaheuristic, and its 
precise meaning has evolved over time. The term meta- 
heuristic is commonly used to refer to the high-level 
guidance strategies that on many occasions are used
to extend underlying greedy constructive or perturba- 
tive search procedures. Hence, the scope of the term 
metaheuristic differs from that of an SLS algorithm; it 
comprises what can be similarly loosely characterized 
as general-purpose SLS methods, but extends naturally 
to higher-level search strategies involving paradigms 
other than SLS, such as systematic search methods 
based on backtracking. 

Conversely, the term metaheuristic is usually not 
applied to simple SLS procedures (such as random 
sampling, random walk and iterative improvement),
nor to problem-specific SLS algorithms with prov- 
able properties. Therefore, there are SLS algorithms 
based on metaheuristics (such as ant colony opti- 
mization, iterated local search or EAs for various 
problems), SLS algorithms that are not metaheuristics 
(such as 2-opt for the TSP or conflict-directed ran- 
dom walk for SAT) and metaheuristics that are not 
based on SLS (such as various branch and bound
methods and hybrids between systematic and local
search).

Because the notion of an SLS algorithm explicitly 
refers to aspects that are not related to the high-level 
guidance of the search process, such as the choice of 
a neighborhood relation, evaluation function and ter- 
mination predicate, research on SLS also covers the 
design, implementation and analysis of these more 
problem-specific components. 


54.2 Greedy Construction Heuristics and Iterative Improvement 


The main SLS techniques underlying more complex 
SLS methods (or metaheuristics) comprise (greedy) 
constructive search and iterative improvement algo- 
rithms. In the following, we discuss the main principles 
and choices underlying these methods. 

Constructive search procedures (or construction 
heuristics) typically evaluate at each construction step 
the quality of the available solution components based 
on a heuristic function. Greedy construction heuristics 
choose to add at each step a solution component with 
best heuristic value, breaking ties either randomly or 
by means of a secondary heuristic function. For several 
polynomially solvable problems, such as the minimum 
spanning tree problem, greedy construction heuristics 
(for example, Kruskal’s algorithm) are guaranteed to 
produce optimal solutions [54.27]; unfortunately, for 
NP-hard problems, this is generally not the case, due 
to the myopic decisions taken during solution construc- 
tion. 
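
Kruskal's algorithm, mentioned above, is a compact example of a greedy construction heuristic: edges are the solution components, the edge weight is the heuristic value, and a component is added whenever it keeps the partial solution acyclic. The sketch below uses a simple union-find structure; the edge list is a hypothetical instance.

```python
def kruskal(n, edges):
    """Greedy construction of a minimum spanning tree.

    n is the number of vertices; edges is a list of (weight, u, v) tuples.
    Returns the list of edges selected for the tree.
    """
    parent = list(range(n))

    def find(x):
        # Root of the component containing x, with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for weight, u, v in sorted(edges):  # best heuristic value first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # adding the edge does not close a cycle
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree


# Hypothetical instance with four vertices.
edges = [(2, 0, 1), (9, 0, 2), (10, 0, 3), (6, 1, 2), (4, 1, 3), (3, 2, 3)]
tree = kruskal(4, edges)
print(tree, sum(weight for weight, _, _ in tree))
```

The heuristic values (edge weights) are computed once and never change during construction, so this is also an instance of a static construction heuristic in the sense introduced in the following paragraph.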

A useful distinction can be made between static and 
adaptive construction heuristics. In static construction 
heuristics, the heuristic values associated with solution 
components are precomputed before the actual con- 
struction process is executed and remain unchanged 
throughout. In adaptive construction heuristics, the 
heuristic values are recomputed at each construction 
step to take into account the impact of the current partial
solution. Adaptive construction heuristics tend to
be more accurate and result in better quality candidate
solutions than static heuristics, but they are also computationally
more expensive.


Fig. 54.1 A 2-exchange move for the symmetric TSP.
Note that the pair of edges to be introduced is uniquely
determined to ensure that the neighbor is again a tour

Construction heuristics are often used to provide 
good initial candidate solutions for perturbative local 
search algorithms. One of the most basic SLS meth- 
ods is to iteratively improve a candidate solution for 
a given problem instance. Such an iterative improve- 
ment algorithm starts from some initial search position 
and iteratively replaces the current candidate solution s 
by an improving neighboring candidate solution s’. The 
local search is terminated once no improving neighbor 
is available, that is, $\forall s' \in N(s): g(s) \le g(s')$, where $g(\cdot)$
is the evaluation function to be minimized, and $N(s)$ denotes
the set of neighbors of $s$. In the literature, iterative
improvement algorithms are also referred to as iterated
descent or (in the case of maximization problems) hill-climbing
procedures.

Neighborhoods are problem specific, and it is gener- 
ally difficult to predict a priori which of several possible 
neighborhoods results in best performance. However, 
for most problems, standard neighborhoods exist. Un- 
der the k-exchange neighborhood, two candidate solu- 
tions are neighbors if they differ by at most k solution 
components. An example is the 2-exchange neighbor- 
hood for the TSP, where two tours are neighbors if they 
differ by a pair of edges. Figure 54.1 illustrates a move 
in this neighborhood. In a k-exchange neighborhood, 
each candidate solution has $O(n^k)$ direct neighbors,
where n is the number of solution components in each 
candidate solution. Thus, the neighborhood size is ex- 
ponential in k, as is the time to identify improving 
neighbors. While using larger neighborhoods typically 
makes it possible to reach better solutions, finding those 
solutions also takes more time. In other words, there 
is a tradeoff between the quality of the local optima 
reachable by an iterative improvement algorithm and
its run time. In practice, neighborhoods that involve 
a quadratic or cubic time-complexity may already re- 
sult in prohibitive computation times for large problem 
instances. 

The overall time-complexity of searching a given 
neighborhood is determined by its size and the cost of 
evaluating each neighbor. The power of local search 
crucially relies on the fact that caching and incremen- 
tal updating techniques can significantly reduce the cost 
of evaluating neighbors compared to computing the re- 
spective evaluation function values from scratch. For 
example, the quality of a 2-exchange neighbor of a tour 
for a TSP instance with n vertices can be computed 
from the quality of the current tour by subtracting and 
adding two edge weights (that is, two numbers) each; 
computing the weight of such a tour from scratch, on 
the other hand, requires n additions. Sometimes, to 
render the computation of the incremental updates as 
efficient as possible, additional data structures need to 
be implemented, but the net effect is often a very large 
reduction in computational effort. 
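
The incremental evaluation of a 2-exchange neighbor can be sketched as follows, assuming a symmetric weight matrix and a tour stored as a list of vertices. The function names and the instance `W` are illustrative. The delta is obtained from four edge weights, whereas `tour_weight` needs n additions.

```python
def tour_weight(tour, w):
    """Weight of the closed tour, computed from scratch with n additions."""
    n = len(tour)
    return sum(w[tour[i]][tour[(i + 1) % n]] for i in range(n))


def two_exchange_delta(tour, w, i, j):
    """Change in tour weight when the edges leaving positions i and j are replaced.

    Removes (tour[i], tour[i+1]) and (tour[j], tour[j+1]) and introduces
    (tour[i], tour[j]) and (tour[i+1], tour[j+1]); requires 0 <= i < j < n
    and symmetric weights.
    """
    n = len(tour)
    a, b = tour[i], tour[(i + 1) % n]
    c, d = tour[j], tour[(j + 1) % n]
    return (w[a][c] + w[b][d]) - (w[a][b] + w[c][d])


def apply_two_exchange(tour, i, j):
    """Return the neighboring tour obtained by reversing the segment tour[i+1..j]."""
    return tour[:i + 1] + tour[i + 1:j + 1][::-1] + tour[j + 1:]


# Hypothetical symmetric instance with four vertices.
W = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]
tour = [0, 1, 2, 3]
delta = two_exchange_delta(tour, W, 1, 3)
neighbor = apply_two_exchange(tour, 1, 3)
print(neighbor, tour_weight(tour, W) + delta, tour_weight(neighbor, W))
```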

A second important technique for reducing the 
time-complexity of evaluating a given neighborhood 
is based on the idea of excluding from consideration 
neighbors that are unlikely or provably unable to lead 
to improvements. These neighborhood pruning techniques
play a crucial role in many high-performance
SLS algorithms. Examples of such pruning techniques
are the fixed radius searches and nearest neighbor lists
used for the TSP [54.28-30], the use of so-called don’t 
look bits [54.28], as well as reduced neighborhoods for 
the job-shop scheduling problem [54.31] and pre-tests 
for search steps, as done for the single machine total 
weighted tardiness problem [54.32]. 

The speed and performance of iterative improvement
algorithms also depend on the mechanism
used to determine search steps, the so-called pivoting
rule [54.33]. Iterative best improvement chooses at each
step a neighboring candidate solution that yields the largest
improvement in the evaluation function value. Any ties that occur
can be broken either randomly, according to the order
in which the neighborhood is searched, or based on
a secondary criterion (as in [54.34]). In order to find
a most improving neighbor, iterative best improvement
needs to examine the entire neighborhood in each step. 
Iterative first improvement, in contrast, examines the 
neighborhood in some given order and moves to the first 
improving neighboring candidate solution found during 
this neighborhood scan. Iterative first improvement applies
improving search steps earlier than iterative best
improvement, but the amount of improvement achieved
in each step tends to be smaller; therefore, it usually
requires more improvement steps to reach a local optimum.
If a candidate solution is a local optimum, first-
and best-improvement algorithms detect this only by inspecting
the entire neighborhood of that solution; don’t
look bits [54.28, 29] offer a particularly useful mechanism
for reducing the time required by this final check,
the so-called check-out time.

Interestingly, the local optimum found by itera- 
tive first improvement depends on the order in which 
the neighborhood is examined. This property can be 
exploited by using a random order for scanning the 
neighborhood, and repeated runs of random-order first 
improvement algorithms can identify very different lo- 
cal optima, even if each run is started from the same ini- 
tial position [54.1, Sect. 2.1]. Thus, the search process 
in random-order first improvement is more diversified
than in first improvement algorithms that scan local
neighborhoods in a fixed order.

The notion of local optimality is defined with respect
to a specific neighborhood. Thus, changing the
neighborhood during the local search process may provide
an effective means for escaping from poor-quality
local optima, and offers the opportunity to benefit from
the advantages of large neighborhoods without incurring
the computational burden associated with using
them exclusively. In the context of iterative improvement
algorithms, this idea forms the basis of variable
neighborhood descent (VND), a variant of a general-purpose
SLS method known as variable neighborhood
search (VNS) [54.35, 36]. VND uses a sequence of
neighborhoods $N_1, N_2, \ldots, N_k$; this sequence is
typically ordered according to increasing neighborhood
size or increasing time complexity of searching the
neighborhoods. VND starts by using the first neighborhood,
$N_1$, until a local optimum is reached. Every
time the exploration of a neighborhood $N_i$ does not
identify an improving local search step, that is, a local
optimum w.r.t. neighborhood $N_i$ is found, VND
switches to the next neighborhood, $N_{i+1}$, in the given
sequence. Whenever an improving move has been made
in a neighborhood $N_i$, VND switches back to $N_1$ and
continues using the subsequent neighborhoods, $N_2$ etc.,
from there. The search is terminated when a local
optimum w.r.t. $N_k$ has been reached. The central idea of this
scheme is to use small neighborhoods whenever possible,
since they allow for the most efficient local search
process. The VND scheme typically results in a significant
reduction of computation time when compared to
an iterative improvement algorithm that uses the largest
neighborhood only. VND typically finds high-quality
local optima, because upon termination, the resulting
candidate solution is locally optimal with respect to all
k neighborhoods examined.
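
The switching logic of VND can be sketched as follows. This is an
illustrative sketch under assumptions of our own: neighborhoods
are passed as a list of functions, each neighborhood is scanned
with a first-improvement rule, and the toy evaluation function in
the usage note is hypothetical.

```python
def vnd(s, neighborhoods, g):
    """Variable neighborhood descent over neighborhoods N_1, ..., N_k (minimization)."""
    i = 0
    while i < len(neighborhoods):
        # First-improvement scan of the current neighborhood (a choice made here).
        improving = next((n for n in neighborhoods[i](s) if g(n) < g(s)), None)
        if improving is None:
            i += 1                 # local optimum w.r.t. N_i: try the next neighborhood
        else:
            s, i = improving, 0    # improving move made: switch back to N_1
    return s  # locally optimal w.r.t. every neighborhood in the list
```

For example, with g(x) = |x| + 3 for integers not divisible by 4
and g(x) = |x| otherwise, descent with N_1(x) = {x-1, x+1} alone
stops at 12 when started from 13, whereas adding
N_2(x) = {x-4, x+4} lets VND continue to 0.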

Finally, recent years have seen an explosion in the
development of iterative improvement methods that
exploit very large scale neighborhoods, whose size is
typically exponential in the size of the given problem
instance [54.37]. There are two main approaches
to searching these neighborhoods. The first
is to perform a heuristic search in the neighborhood,
since an exact search would be computationally too
demanding. This idea forms the basis of variable-depth
search algorithms, where the number of solution
components that are modified in each step is not
determined a priori. Interestingly, the two best-known
variable-depth search algorithms, the Kernighan-Lin
algorithm for graph partitioning [54.38] and the
Lin-Kernighan algorithm for the TSP [54.39], were
devised in the early 1970s, a fact that illustrates
the lasting interest in these types of methods.
The more recent concept of ejection chains [54.40]
is related to variable-depth search. The second
approach is to devise neighborhoods with a special
structure that allows them to be searched either in
polynomial time or at least very efficiently in
practice [54.37, 41-43]. This is the central idea behind many
recent developments in very large scale neighborhoods,
which include techniques such as Dynasearch [54.32,
44] and cyclic exchange neighborhoods [54.45, 46].
As a result of these research efforts, current
state-of-the-art methods for a variety of combinatorial
problems such as the TSP [54.47] or the single machine
total weighted tardiness problem [54.48] rely on
iterative improvement algorithms based on very large scale
neighborhoods.


54.3 Simple SLS Methods


Iterative improvement algorithms accept only improving
neighbors as new current candidate solutions, and
they terminate when encountering a local optimum. To
allow the search process to progress beyond local
optima, many SLS methods permit moves to worsening
neighbors. We refer to the methods discussed in the
following as simple SLS methods, because they essentially
use only one type of search step, in a single, fixed
neighborhood relation.


54.3.1 Randomized Iterative Improvement


The key idea behind randomized iterative improvement
(RII) is to occasionally perform moves to random
neighboring candidate solutions irrespective of their
evaluation function value. The simplest way of
implementing this idea is to apply, with a given probability
$w_p$, a so-called uninformed random walk step, which
chooses a neighbor of the current candidate solution
uniformly at random, while with probability $1 - w_p$, an
improvement step is performed. Often, the improvement
step will correspond to one iteration of a best
improvement procedure. The parameter $w_p$ is referred
to as walk probability or, simply, noise parameter. RII
algorithms have the property that they can perform
arbitrarily long sequences of random walk steps; the length
of these sequences (i.e., the number of consecutive
random walk steps) follows a geometric distribution
with parameter $w_p$. This allows effective escapes from
local optima and renders RII probabilistically
approximately complete [54.1, Sect. 4.1]. A main advantage
of RII is its ease of implementation (often, only a few
additional lines of code are required to extend an
iterative improvement procedure to an RII procedure);
moreover, its behavior is effectively controlled by a single
parameter.

RII algorithms have been shown to perform quite
well in a number of applications. For example, in the
1990s, minor variations of RII, in which random walk
steps are determined based on the status of constraint
violations rather than chosen uniformly at random, were
state of the art for solving SAT [54.49, 50] and
other constraint satisfaction problems [54.51]. Due to
their simplicity, RII algorithms also facilitate theoretical
analyses, including characterization of performance
as a function of parameter settings [54.52].
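
To illustrate how little code separates iterative improvement
from RII, the following sketch adds a random walk step, taken
with walk probability wp, to a best-improvement loop. The function
name, the step limit, and the tracking of the best solution seen
so far are assumptions made for this example.

```python
import random


def rii(s, neighbors, g, wp, max_steps, rng):
    """Randomized iterative improvement for minimization; returns the best solution seen."""
    best = s
    for _ in range(max_steps):
        nbrs = list(neighbors(s))
        if rng.random() < wp:
            s = rng.choice(nbrs)        # uninformed random walk step
        else:
            cand = min(nbrs, key=g)     # one best-improvement step
            if g(cand) < g(s):
                s = cand
        if g(s) < g(best):
            best = s
    return best
```

With wp = 0 the procedure reduces to plain best improvement and
stalls in the first local optimum it reaches; any wp > 0 gives it
a chance to leave that optimum.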


54.3.2 Probabilistic Iterative Improvement 


Instead of accepting worsening search steps regardless
of the amount of deterioration in evaluation function
value they cause (as is the case for random walk
steps), it may be preferable to have the probability
of acceptance depend on the change of the evaluation
function value incurred. This is the key idea underlying
probabilistic iterative improvement (PII). Unlike RII,
each step of PII involves two phases: first, a neighboring
candidate solution $s' \in N(s)$ is selected uniformly
at random (proposal mechanism); then, a probabilistic
decision is made whether to accept $s'$ as the new
search position (acceptance test). For minimization
problems, the acceptance probability is often based on
the Metropolis condition and defined as

$$
p_{\mathrm{accept}}(T, s, s') :=
\begin{cases}
1 & \text{if } g(s') < g(s) \\
\exp\left(\dfrac{g(s) - g(s')}{T}\right) & \text{otherwise,}
\end{cases}
\qquad (54.2)
$$

where $p_{\mathrm{accept}}(T, s, s')$ is the acceptance probability, g
is the evaluation function to be minimized, and T is
a parameter that influences the probability of accepting
a worsening search step. PII is closely related to
simulated annealing (SA), discussed next; in fact, when
using the acceptance mechanism given above, PII is
equivalent to constant-temperature SA. In light of this
connection, parameter T is also called temperature. For
various applications, such PII procedures have been
shown to perform quite well, provided that T is chosen
carefully [54.53, 54]. It is worth noting that in the
limit $T \to 0$, PII effectively turns into an iterative
improvement procedure (i.e., it never accepts worsening
steps), while for $T \to \infty$, it performs a uniform random
walk.
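
The two phases of a PII step, with the Metropolis condition
(54.2) as acceptance test, can be sketched as below. The function
names are assumptions of this example, and the code requires
T > 0; the limiting case T = 0 would need separate handling.

```python
import math
import random


def p_accept(T, g_s, g_s_prime):
    """Metropolis acceptance probability for minimization, Eq. (54.2); requires T > 0."""
    if g_s_prime < g_s:
        return 1.0
    return math.exp((g_s - g_s_prime) / T)


def pii_step(s, neighbors, g, T, rng):
    """One PII step: uniform proposal followed by a probabilistic acceptance test."""
    s_prime = rng.choice(list(neighbors(s)))            # proposal mechanism
    if rng.random() < p_accept(T, g(s), g(s_prime)):    # acceptance test
        return s_prime
    return s
```

For a worsening step with g(s') - g(s) = 2, the acceptance
probability is exp(-1) at T = 2; it approaches 0 as T becomes
small and 1 as T becomes large, matching the two limiting cases
mentioned in the text.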


54.3.3 Simulated Annealing 


Simulated annealing (SA) [54.4,5] is similar to PII, 
except that the parameter T is modified at run time. 
Following the analogy of the physical annealing of 
solid materials (e.g., metals and glass), which inspired 
SA, the temperature T is initially set to some high 
value and then gradually decreased. At the beginning 
of the search process, high temperature values result 
in relatively high probabilities of accepting worsening 
candidate solutions. As the temperature is decreased, 
the search process becomes increasingly greedy; for 
very low settings of the temperature, almost only im- 
proving neighbors or neighbors with evaluation func- 
tion value equal to the current candidate solution are 
accepted. 

Standard SA algorithms iterate over the same two-stage
process as PII, typically using uniform sampling
(with or without replacement) from the neighborhood
as a proposal mechanism and a parameterized
acceptance test based on the Metropolis
condition (54.2) [54.4, 5]. The modification of temperature T
is managed by a so-called annealing (or cooling) schedule,
which is a function that determines the temperature
value at each search step. One of the most common
choices is a geometric cooling schedule, defined by an
initial temperature, $T_0$, a parameter $\alpha$ between 0 and
1, and a value k, called the temperature length, which
defines the number of candidate solutions that are
proposed at each fixed value of the temperature; every k
steps, the temperature is updated as $T := \alpha \cdot T$.
Important parameters of SA are often determined based on
characteristics of the problem instance to be solved.
For example, the initial temperature may be based on
statistics derived from an initial, short random walk,
the temperature length may be set to a multiple of
the neighborhood size, and the search process may
be terminated when the frequency with which proposed
search steps are accepted falls below a given
threshold.
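
A geometric cooling schedule and its use inside a basic SA loop
can be sketched as follows. This is an illustration under
simplifying assumptions: the search stops after a fixed number of
steps rather than by monitoring the acceptance frequency, and all
names are chosen for this example only.

```python
import itertools
import math
import random


def geometric_schedule(T0, alpha, k):
    """Yield the temperature for each search step: T := alpha * T every k steps."""
    T = T0
    while True:
        for _ in range(k):   # k proposals at each fixed temperature
            yield T
        T *= alpha


def simulated_annealing(s, neighbors, g, schedule, max_steps, rng):
    """Basic SA for minimization with Metropolis acceptance; returns the best solution seen."""
    best = s
    for T in itertools.islice(schedule, max_steps):
        s_prime = rng.choice(list(neighbors(s)))        # uniform proposal
        delta = g(s_prime) - g(s)
        if delta <= 0 or rng.random() < math.exp(-delta / T):
            s = s_prime
            if g(s) < g(best):
                best = s
    return best
```

With T0 = 8, alpha = 0.5 and k = 2, the first six steps are
performed at temperatures 8, 8, 4, 4, 2, 2.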

SA is one of the oldest and most studied SLS 
methods. It has been applied to a very broad range of 
computational problems, and many types of annealing 
schedules, proposal mechanisms, and acceptance tests 
have been investigated. SA has also been subject to 
a substantial amount of theoretical analysis, which has 
yielded various convergence results. For more details 
on SA, we refer to [54.55, 56]. 


54.3.4 Tabu Search 


Tabu search (TS) differs significantly from the previously
discussed SLS methods, in that it makes direct
and systematic use of memory to direct the search
process [54.25]. In its most basic form, which is also called
simple tabu search, TS extends an iterative improvement
procedure with a short-term memory to prevent
the local search process from returning to recently
visited search positions. Instead of memorizing complete
candidate solutions and forbidding these explicitly, TS
usually associates a tabu status with specific solution
components. In the latter case, TS stores for each
solution component the time (i.e., the iteration number)
at which it was last modified. Each solution component
is then considered as potentially tabu if the difference
between the current iteration number and the stored
iteration number is smaller than the value of a parameter
called tabu tenure (or tabu list length). The tabu status
of a local search step is then determined based on
specific tabu criteria, which are a function of the tabu status
of solution components that are affected by it. One
effect is that once a search step has been performed, it is
tabu in that it cannot be reversed for a certain number
of iterations.

Seen from a neighborhood perspective, TS dynam- 
ically restricts the set of neighbors permissible at each 
local search step by excluding neighbors that are cur- 
rently tabu. Since the tabu mechanism through prohi- 
bition of solution components is quite restrictive, many 
simple TS algorithms use an aspiration criterion, which 
overrides the tabu status of neighbors if specific condi- 
tions are satisfied; for example, if a local search step 
leads to a new best solution, aspiration allows it to be 
accepted regardless of its tabu status. 

As an example, consider a simple TS algorithm for 
the TSP, based on the 2-exchange neighborhood. Edges 
that are removed (or introduced) by a 2-exchange step 
may then not be reintroduced into (or removed from) 
the current tour for tt search steps, where tt is the tabu 
tenure. 
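
The interplay of tabu status, tabu tenure and aspiration can be
sketched on a problem that is simpler than the TSP example above:
bit vectors with a 1-flip neighborhood, where each bit is a
solution component. The function names, the deceptive toy
function `trap`, and the convention that a flipped bit stays tabu
for the next tt iterations are assumptions of this sketch.

```python
def simple_tabu_search(x, g, tt, max_iters):
    """Simple tabu search on a bit vector with a 1-flip neighborhood (minimization)."""
    x = list(x)
    best, best_val = x[:], g(x)
    last_flipped = [-(tt + 1)] * len(x)   # iteration at which each bit was last modified
    for it in range(max_iters):
        move, move_val = None, None
        for i in range(len(x)):
            x[i] ^= 1
            val = g(x)                    # evaluate the neighbor obtained by flipping bit i
            x[i] ^= 1
            tabu = it - last_flipped[i] <= tt
            if tabu and not val < best_val:
                continue                  # tabu, and aspiration (new best) does not apply
            if move is None or val < move_val:
                move, move_val = i, val
        if move is None:
            continue                      # all neighbors tabu: wait for a status to expire
        x[move] ^= 1                      # best admissible neighbor, even if worsening
        last_flipped[move] = it
        if move_val < best_val:
            best, best_val = x[:], move_val
    return best, best_val


def trap(x):
    """Hypothetical deceptive function: all ones is optimal, yet fewer ones look better."""
    u = sum(x)
    return 0 if u == len(x) else u + 1
```

Started from the all-zeros vector of length 4, which is a local
optimum of `trap`, a tenure of tt = 2 prevents the search from
undoing its recent flips and leads it to the all-ones optimum; with
tt = 0 the search simply cycles back to the starting point.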

For several problems, even simple TS algorithms 
have been shown to perform quite well. However, the 
performance of TS strongly depends on the tabu tenure 
setting. To avoid the difficulty of finding fixed settings 
suitable for a given problem, mechanisms such as re- 
active tabu search [54.57] have been devised to adapt 
the tabu tenure at run time. Simple TS algorithms can 
be improved in many different ways. In particular, var- 
ious mechanisms have been developed that make use 
of intermediate-term and long-term memory to further 
enhance the performance of simple TS. For a detailed 
description of such techniques, which aim either at in- 
tensifying the search in specific areas of the search 
space or at diversifying the search to explore unvisited 
search space regions, we refer to the book by Glover 
and Laguna [54.25]. 


54.3.5 Dynamic Local Search 


In contrast to the simple SLS methods discussed so far,
dynamic local search (DLS) does not accept worsening
search steps, but rather modifies the evaluation function
during the search in order to escape from local
optima. These modifications of the evaluation function
g are commonly triggered whenever the underlying
local search algorithm, typically an iterative improvement
procedure, has reached a locally optimal solution
with respect to g’, the current evaluation function. Next,
the evaluation function is modified and the subsidiary
local search algorithm is run until a local optimum
(with respect to the new g’) is encountered. These local
search phases and evaluation function updates are
iterated until some termination criterion is met (see
Algorithm 54.1).


Algorithm 54.1 High-level outline of dynamic local search
Dynamic local search (DLS):
    determine initial candidate solution s
    initialize penalties
    while termination criterion is not satisfied do
        compute modified evaluation function g’ from g and penalties
        perform subsidiary local search on s using g’
        update penalties based on s
    end while


The modified evaluation function g’ is typically
computed as the sum of the original evaluation function
and penalties associated with each solution component,
that is

$$
g'(s) := g(s) + \sum_{i \in SC(s)} \mathrm{penalty}(i) , \qquad (54.3)
$$

where g is the original evaluation function, SC(s) is
the set of solution components of candidate solution s,
and penalty(i) is the penalty of solution component i.
Initially, all penalties are set to zero. Variants of DLS
differ in the details of their penalty update mechanism
(e.g., additive vs. multiplicative updates, occasional
reduction of penalties) and the choice of the
solution components whose penalties are adjusted. For
example, guided local search [54.58, 59] uses the following
mechanism for choosing the solution components
whose penalties are increased: First, a utility
value $u(i) := g_i(s)/(1 + \mathrm{penalty}(i))$ is computed for
each solution component i, where $g_i(s)$ measures the
impact of i on the evaluation function; then, the penalties
of solution components with maximal utility are
increased.
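
The two ingredients just described, the penalized evaluation
function (54.3) and a utility-based penalty update in the style of
guided local search, can be sketched as follows. The function
names, the representation of penalties as a dictionary, and the
additive increment of 1 are assumptions of this example.

```python
from collections import defaultdict


def augmented_value(s, g, components, penalty):
    """g'(s) = g(s) + sum of the penalties of the solution components of s, Eq. (54.3)."""
    return g(s) + sum(penalty[i] for i in components(s))


def gls_penalty_update(s, components, impact, penalty):
    """Increase the penalties of the components of s that have maximal utility."""
    utility = {i: impact(i, s) / (1 + penalty[i]) for i in components(s)}
    if not utility:
        return
    u_max = max(utility.values())
    for i, u in utility.items():
        if u == u_max:
            penalty[i] += 1   # additive update; the increment of 1 is an assumption
```

A DLS loop would call `gls_penalty_update` whenever the subsidiary
local search stalls in a local optimum of g’, and then resume the
search using `augmented_value` as the evaluation function.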

DLS algorithms are sometimes referred to as a soft
form of tabu search, since solution components are not
strictly forbidden, but the effect of the penalties
resembles a soft prohibition. There are also conceptual links
to Lagrangian methods [54.60, 61]. DLS algorithms
have been shown to reach state-of-the-art performance
for SAT [54.62] and for the maximum clique
problem [54.63].


54.4 Hybrid SLS Methods


The performance of basic SLS techniques can often 
be improved by combining them with each other. In 
fact, even RII can be seen as a combination of iterative 
improvement and random walk, using the same neigh- 
borhood. Several other SLS methods combine different 
types of search steps, and in the following, we briefly 
discuss some prominent examples. 


54.4.1 Greedy Randomized 
Adaptive Search Procedures 


As mentioned previously, construction heuristics can
be easily and effectively combined with perturbative
local search procedures. While greedy construction
heuristics generally generate only one or very few
different candidate solutions, randomization of the
construction process makes it possible to generate many
different high-quality solutions. The idea underlying
GRASP [54.10, 64] is to combine randomized greedy 
construction with a subsequent perturbative local search 
phase, whose goal is to improve the candidate so- 
lutions produced by the construction heuristic. The 
two phases of solution construction and perturbative 
local search are repeated until a termination crite- 
rion, e.g., maximum computation time, is met. The 
term adaptive in GRASP refers to the fact that the 
hybrid search process typically uses an adaptive con- 
struction heuristic. Randomization in GRASP is real- 
ized based on the concept of a restricted candidate 
list, which contains the best-scoring solution compo- 
nents according to the given heuristic function. In 
the simplest and most common GRASP variants, el- 
ements are chosen uniformly at random from this 
restricted candidate list during the construction pro- 
cess. For a detailed description, various extensions, 
and an overview of applications of GRASP, we refer 
to [54.64]. 
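
The role of the restricted candidate list can be sketched as
follows. This is an illustrative sketch, not a reference
implementation: the names, the convention that lower scores are
better, the fixed list size `rcl_size`, and the callback signatures
are assumptions of this example.

```python
import random


def grasp_construct(components, score, feasible, rcl_size, rng):
    """Randomized greedy construction using a restricted candidate list (RCL)."""
    solution = []
    while True:
        candidates = [c for c in components if feasible(solution, c)]
        if not candidates:
            return solution
        # RCL: the rcl_size best-scoring feasible components (lower score = better).
        rcl = sorted(candidates, key=lambda c: score(solution, c))[:rcl_size]
        solution.append(rng.choice(rcl))   # uniform random choice from the RCL


def grasp(construct, local_search, g, iterations):
    """Repeat construction and perturbative local search; keep the best result."""
    best = None
    for _ in range(iterations):
        s = local_search(construct())
        if best is None or g(s) < g(best):
            best = s
    return best
```

With `rcl_size = 1` the construction degenerates to the
deterministic greedy heuristic; larger values trade greediness for
diversity among the constructed solutions.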


54.4.2 Iterated Greedy Algorithms 


A disadvantage of GRASP is that new candidate
solutions are constructed from scratch and independently
of previously found solutions. Iterated greedy
(IG) algorithms iteratively apply greedy construction
heuristics to generate a chain of high-quality candidate
solutions. The central idea is to alternate between
solution construction and destruction phases,
and thus to combine at least two different types of
search steps. IG algorithms first build an initial, complete
candidate solution s. Then, they iterate over
the following phases, until a termination criterion is
met:


1. Starting from the current candidate solution, s, a de- 
struction phase is executed, during which some 
solution components are removed from s, result- 
ing in a partial candidate solution s’. The solution 
components that are removed in this phase may be 
chosen at random or, for example, based on their 
impact on the evaluation function. 

2. Starting from s’, a construction heuristic is used to 
generate another candidate solution, s”. This con- 
struction heuristic may differ from the one used to 
generate the initial candidate solution. 

3. Based on an acceptance criterion, a decision is
made whether to continue the search from s or s”.

Additionally, it is often useful to further improve
complete candidate solutions by means of
a subsidiary perturbative local search procedure
(see Algorithm 54.2 for a high-level outline of
IG).


Algorithm 54.2 High-level outline of an iterated greedy (IG) algorithm
Iterated greedy (IG):
    construct initial candidate solution s
    perform subsidiary local search on s
    while termination criterion is not satisfied do
        apply destruction to s, resulting in s’
        apply constructive heuristic starting from s’, resulting in s”
        perform subsidiary local search on s” (optional)
        based on acceptance criterion, keep s or accept s := s”
    end while
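
The outline above can be instantiated for permutation problems as
in the following sketch. The choices made here are assumptions for
illustration only: destruction removes d randomly chosen elements,
construction reinserts each of them at its best position, the
acceptance criterion keeps the new solution if it is not worse, and
the optional subsidiary local search is omitted. The evaluation
function g must also be defined on partial permutations.

```python
import random


def iterated_greedy(s, g, d, iterations, rng):
    """Iterated greedy on a permutation: remove d elements, then greedily reinsert them."""
    s = list(s)
    best = s[:]
    for _ in range(iterations):
        removed = rng.sample(s, d)                       # destruction phase
        partial = [e for e in s if e not in removed]     # partial candidate solution s'
        for e in removed:                                # construction phase
            options = [partial[:i] + [e] + partial[i:]
                       for i in range(len(partial) + 1)]
            partial = min(options, key=g)                # greedy best-position insertion
        if g(partial) <= g(s):                           # acceptance criterion (assumed)
            s = partial
        if g(s) < g(best):
            best = s[:]
    return best
```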


The principle underlying IG methods has been 
rediscovered several times, and consequently, can 
be found under various names, including ruin-and- 
recreate [54.65], iterative flattening [54.66], and it- 
erative construction heuristic [54.67]; it has also 
been used in the context of SA [54.68]. IG al- 
gorithms, especially when combined with perturba- 
tive local search methods, have reached state-of-the- 
art performance for a number of problems, includ- 
ing several variants of flowshop scheduling [54.69, 
70]. 



54.4.3 Iterated Local Search 


Iterated local search (ILS) generates a sequence of so- 
lutions by alternating applications of a perturbation 
mechanism and of a subsidiary local search algorithm. 
Consequently, ILS can be seen as a hybrid between the 
search methods underlying the local search and pertur- 
bation phases. 

An ILS algorithm is specified by four main compo- 
nents. The first is the mechanism used for generating 
an initial solution, for example, a greedy constructive 
heuristic. The second is a subsidiary (perturbative) local 
search procedure; typically, this is an iterative improve- 
ment algorithm, but often, other simple SLS methods 
are used. The third component is a perturbation proce- 
dure that introduces a modification to a given candidate 
solution. These perturbations should be complementary 
to the modifications introduced by the subsidiary local 
search procedure; in particular, the effect of the pertur- 
bation procedure should not be easily reversible by the 
local search procedure. The fourth component is an ac- 
ceptance criterion, which is used to decide whether to 
accept the outcome of the latest perturbation and local 
search phase. 

ILS starts by generating an initial candidate solu- 
tion, to which then subsidiary local search is applied. It 
then iterates over the following phases, until a termina- 
tion criterion is met: 


1. Perturbation is applied to the current candidate 
solution s, to obtain an intermediate candidate so- 
lution s’. 

2. Subsidiary local search is applied to s’. 

3. Based on the acceptance criterion, a decision is 
made whether to continue the search from s or s’ 
(see Algorithm 54.3 for a high-level outline of 
ILS). 
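
The phases above can be sketched as a generic procedure whose four
components are passed in as functions. The names and signatures,
as well as the toy components used in the usage note, are
assumptions made for this example.

```python
import random


def iterated_local_search(s, g, local_search, perturb, accept, iterations, rng):
    """Generic ILS for minimization; returns the best local optimum encountered."""
    s = local_search(s)
    best = s
    for _ in range(iterations):
        s_prime = perturb(s, rng)           # phase 1: perturbation
        s_prime = local_search(s_prime)     # phase 2: subsidiary local search
        if accept(s, s_prime):              # phase 3: acceptance criterion
            s = s_prime
        if g(s) < g(best):
            best = s
    return best


# Hypothetical toy components: integers, with g penalizing non-multiples of 4.
def toy_g(x):
    return abs(x) + (0 if x % 4 == 0 else 3)


def toy_descent(x):
    """Iterative improvement in the neighborhood {x - 1, x + 1}."""
    while True:
        better = [n for n in (x - 1, x + 1) if toy_g(n) < toy_g(x)]
        if not better:
            return x
        x = min(better, key=toy_g)
```

In this toy setting, the local optima of `toy_descent` are exactly
the multiples of 4, so a perturbation that jumps by 5 or 6
positions cannot be undone by the subsidiary local search, which
is the complementarity asked for in the text.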


Often, the subsidiary search is based on iterative
improvement and ends in a local optimum; ILS can therefore
be seen as performing a biased random walk in the
space of local optima produced by the given subsidiary
local search procedure. The acceptance criterion
(together with the strength of the perturbation mechanism)
then determines the degree of search intensification: if
only improving candidate solutions are accepted, ILS
performs a randomized first-improvement search in the
space of local optima; if any new local optimum is
accepted, ILS performs a random walk in the space of
local optima.


Algorithm 54.3 High-level outline of iterated local search
Iterated local search (ILS):
    generate initial candidate solution s
    perform subsidiary local search on s
    while termination criterion is not satisfied do
        apply perturbation to s, resulting in s’
        perform subsidiary local search on s’
        based on acceptance criterion, keep s or accept s := s’
    end while


An attractive feature of ILS is that basic versions
can be quickly and easily implemented, especially if
a simple SLS algorithm or an iterative improvement
procedure is already available. Using some additional
refinements, ILS methods define the current state of
the art for solving many combinatorial problems,
including the TSP [54.71]. Similar to IG, ILS is based
on an idea that has been rediscovered several times
and is known under various names, including large-step
Markov chains [54.29] and chained local
optimization [54.72]. There is also a close conceptual connection
with several variants of variable neighborhood search
(VNS) [54.35]; in fact, the so-called basic VNS and
skewed VNS algorithms can be seen as variants of
ILS that adapt the perturbation strength at run time.
For more details on iterated local search, we refer
to [54.73].


54.5 Population-Based SLS Methods


The use of a population of candidate solutions offers
a convenient way to increase diversification in SLS.
For example, population-based extensions of ILS
algorithms have been proposed with this aim in mind [54.74,
75]. A further potential benefit comes from the inherent
parallelizability of most population-based SLS
methods, although the parallelization thus achieved is
not necessarily more effective than the simple and
generic approach of performing multiple independent
runs of an SLS algorithm in parallel (see also [54.1,
Sect. 4.4]). As previously remarked, population-based
methods can be cast into the SLS framework described
in Sect. 54.1 by defining search positions to consist of
sets of candidate solutions and by using neighborhood
relations, initialization, and step functions that operate
on such populations.

Unfortunately, the benefits derived from the use of 
populations come at the cost of increased complex- 
ity, in terms of implementation effort, and parameters 
that need to be set appropriately. In what follows, 
we describe two of the most prominent population- 
based methods, one based on a constructive search 
paradigm (ant colony optimization), and the other based 
on a perturbative search paradigm (evolutionary algo- 
rithms). 


54.5.1 Ant Colony Optimization 


Ant colony optimization (ACO) algorithms were originally
inspired by the trail-following behavior of
real ant species, which allows them to find shortest
paths [54.76, 77]. This biological phenomenon gave
rise to a surprisingly effective algorithm for combinatorial
optimization [54.18, 19]. In ACO, the artificial
ants perform a randomized constructive search that is 
biased by (artificial) pheromone trails and heuristic in- 
formation derived from the given problem instance. The 
pheromone trails are numerical values associated with 
solution components that are adapted at run time to 
reflect experience gleaned from the search process so 
far. 

During solution construction, at each step every
ant chooses a solution component, probabilistically
preferring those with high pheromone trail and heuristic
information values. For illustration, consider the
TSP, the first problem to which ACO was
applied [54.18]. Each edge (i, j) has an associated
pheromone value $\tau_{ij}$ and a heuristic value $\eta_{ij}$, which for
the TSP is typically defined as 1/w(i, j), that is, the
inverse of the edge weight. In ant system [54.19], the first
ACO algorithm for the TSP, an ant located at vertex
i would add vertex j to its current partial tour s’ with
probability

$$
p_{ij} := \frac{\tau_{ij}^{\alpha} \cdot \eta_{ij}^{\beta}}
               {\sum_{l \in N(i)} \tau_{il}^{\alpha} \cdot \eta_{il}^{\beta}} , \qquad (54.4)
$$

where N(i) is the feasible neighborhood of vertex i, i.e.,
the set of all vertices that have not yet been visited in s’,
and $\alpha$ and $\beta$ are parameters that control the relative
importance of pheromone trails and heuristic information,
respectively. Note that the tour construction procedure
implemented by the artificial ants is a randomized
version of the nearest neighbor construction heuristic. In
fact, randomizing a greedy construction heuristic based
on pheromone trails associated with the decisions to
be made would generally be a good initial step toward
an effective ACO algorithm for a combinatorial
problem.
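
The selection rule (54.4) and the resulting randomized tour
construction can be sketched as follows. The matrix representation
of pheromone and heuristic values, the function names, and the
fixed start vertex are assumptions of this example.

```python
import random


def transition_probabilities(i, unvisited, tau, eta, alpha, beta):
    """Probability of moving from vertex i to each unvisited vertex j, Eq. (54.4)."""
    weights = {j: (tau[i][j] ** alpha) * (eta[i][j] ** beta) for j in unvisited}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}


def construct_tour(n, tau, eta, alpha, beta, rng, start=0):
    """Let one artificial ant build a complete tour over vertices 0, ..., n-1."""
    tour, unvisited = [start], set(range(n)) - {start}
    while unvisited:
        p = transition_probabilities(tour[-1], unvisited, tau, eta, alpha, beta)
        vertices = sorted(p)
        nxt = rng.choices(vertices, weights=[p[j] for j in vertices])[0]
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour
```

For uniform pheromone values and alpha = beta = 1, an ant at
vertex 0 with edge weights w(0, 1) = 1 and w(0, 2) = 3 moves to
vertex 1 with probability 0.75 and to vertex 2 with probability
0.25.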

Once every ant has constructed a complete can- 
didate solution, it is typically highly advantageous to 
apply an iterative improvement procedure or a sim- 
ple SLS algorithm [54.20,78]. Next, the pheromone 
trail values are updated by means of two counteracting 
mechanisms. The first models pheromone evaporation 
and decreases some or all pheromone trail values by 
a constant factor. The second models pheromone de- 
posit and increases the pheromone trail levels of solu- 
tion components that have been used by one or more 
ants. The amount of pheromone deposited typically de- 
pends on the quality of the respective solutions. In the 
best performing ACO algorithms, only some of the ants 
with the highest quality solutions are allowed to deposit 
pheromone. The overall result of the pheromone update 
is an increased probability of choosing solution com- 
ponents in subsequent solution constructions that have 
previously been found to occur in high-quality solu- 
tions. ACO algorithms then cycle through these phases 
of solution construction, application of local search, 
and pheromone update until some termination criterion 
is met (see Algorithm 54.4 for a high-level outline of 
ACO). 
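
The two counteracting update mechanisms can be sketched for the
symmetric TSP as below. The evaporation rate `rho`, the deposit
amount 1/length, and the parameter `n_depositing` (how many of the
best ants deposit pheromone) are assumptions of this example;
actual ACO variants differ in exactly these details.

```python
def update_pheromones(tau, tours, tour_length, rho, n_depositing=1):
    """Evaporate all pheromone trails, then let the best ants deposit pheromone."""
    n = len(tau)
    for i in range(n):
        for j in range(n):
            tau[i][j] *= (1.0 - rho)                  # evaporation by a constant factor
    for tour in sorted(tours, key=tour_length)[:n_depositing]:
        deposit = 1.0 / tour_length(tour)             # better tours deposit more
        for a, b in zip(tour, tour[1:] + tour[:1]):   # edges of the closed tour
            tau[a][b] += deposit
            tau[b][a] += deposit                      # symmetric instance
    return tau
```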


Algorithm 54.4 High-level outline of ant colony optimization
Ant colony optimization (ACO):
    initialize pheromone trails
    while termination criterion is not satisfied do
        generate population sp of candidate solutions
            using subsidiary randomized constructive search
        perform subsidiary local search on sp
        update pheromone trails
    end while


Many different variants of ACO algorithms have
been studied. Along with many additional details on
ACO, these are described in the book by Dorigo and
Stützle [54.20]; for more recent surveys, we refer the
reader to [54.79, 80]. The ACO metaheuristic [54.81,
82] provides a general framework for these variants
and a generic view of how to apply ACO algorithms.
ACO is also one of the most successful algorithmic
techniques within the broader field of swarm
intelligence [54.83].


54.5.2 Evolutionary Algorithms 


Evolutionary algorithms (EAs) are a prominent class of 
population-based SLS methods that are loosely inspired 
by concepts from biological evolution. Unlike ACO al- 
gorithms, EAs work with a population of complete can- 
didate solutions. The initial set of candidate solutions 
is typically created randomly, but greedy construction 
heuristics may also be used to seed the population. This 
population then undergoes an artificial evolution, where 
at each iteration, the population of candidate solutions 
is modified by means of mutation, recombination and 
selection. 

Mutation operators typically introduce small, ran- 
dom perturbations into individual candidate solutions. 
The strength of these perturbations is usually controlled 
by a parameter called mutation rate; alternatively, a spe- 
cific, fixed perturbation, akin to a random walk step 
in RII, may be performed. Recombination operators 
generate one or more new candidate solutions by com- 
bining information from two or more parent candidate 
solutions. The most common type of recombination 
is crossover, inspired by the homonymous mechanism 
in biological evolution; it generates offspring by as- 
sembling partial candidate solutions from linear repre- 
sentations of two parents. In addition to mutation and 
recombination, selection mechanisms are used to deter- 
mine the candidate solutions that will undergo mutation 
and recombination, as well as those that will form the 
population used in the next iteration of the evolutionary 
process. Selection is based on the fitness, i.e., the
evaluation function values, of the candidate solutions, such
that better candidate solutions have a higher probability
of being selected.
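
The three operators can be sketched for bit-string representations
as follows. The specific choices (bit-flip mutation, one-point
crossover, binary tournament selection, and keeping the best
individuals among parents and offspring) are assumptions made for
illustration; they are common options, not ones prescribed by the
text.

```python
import random


def mutate(x, rate, rng):
    """Flip each bit independently with probability `rate` (the mutation rate)."""
    return [b ^ 1 if rng.random() < rate else b for b in x]


def one_point_crossover(a, b, rng):
    """Assemble an offspring from a prefix of parent a and a suffix of parent b."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def tournament_select(population, g, rng, size=2):
    """Return the best (lowest g) of `size` randomly drawn individuals."""
    return min(rng.sample(population, size), key=g)


def evolve(population, g, generations, rate, rng):
    """Minimal evolutionary loop for minimization."""
    for _ in range(generations):
        offspring = []
        for _ in range(len(population)):
            child = one_point_crossover(tournament_select(population, g, rng),
                                        tournament_select(population, g, rng), rng)
            offspring.append(mutate(child, rate, rng))
        # Selection for the next iteration: best of parents and offspring.
        population = sorted(population + offspring, key=g)[:len(population)]
    return population
```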

Details of the mutation, recombination and selection
mechanisms all have a strong impact on the performance
of an EA. Generally, the use of problem-specific
knowledge within these mechanisms leads to
better performance. In fact, much research in EAs has
been devoted to the design of effective mutation and
recombination operators; a good example for this is
the TSP [54.84, 85]. To achieve cutting-edge performance
in an EA, it is often useful to improve at least
the best candidate solutions in a given population by
means of a perturbative local search method, such as
iterative improvement. The resulting class of hybrid
algorithms, which are also known as memetic algorithms
(MAs) [54.86], is enjoying increasing popularity as
a broadly applicable method for solving combinatorial
problems (see Algorithm 54.5 for a high-level
outline of an MA).


Algorithm 54.5 High-level outline of a memetic algorithm
Memetic algorithm (MA):
    initialize population p
    perform subsidiary local search on each candidate solution in p
    while termination criterion is not satisfied do
        generate set pr of candidate solutions through recombination
        perform subsidiary local search on each candidate solution of pr
        generate set pm of candidate solutions from p U pr through mutation
        perform subsidiary local search on each candidate solution of pm
        select new population p from candidate solutions in p U pr U pm
    end while


Several other techniques are conceptually related to
evolutionary algorithms but have different roots. Scatter
search and path relinking are SLS methods whose
origins can be traced back to the mid-1970s [54.16].
Scatter search can be seen as a memetic algorithm that
uses special types of recombination and selection
mechanisms. Path relinking corresponds to a specific form of
interpolation between two (or possibly more) candidate
solutions and is thus conceptually related to
recombination operators. Both methods have recently become
increasingly popular; details can be found in [54.17,
87].


54.6 Recent Research Directions


In this section, we concisely discuss three research
directions that we regard as particularly timely and
promising: combinations of SLS and systematic search
techniques, SLS algorithm engineering, and automated
configuration and design of SLS algorithms. For other
topics of interest, such as SLS algorithms for
multiobjective [54.88-90], stochastic [54.91] or dynamic
problems [54.92, 93], we refer to the literature for more
details.


54.6.1 Combination of SLS Algorithms 
with Systematic Search Techniques 


Systematic search and SLS are traditionally seen as 
two distinct approaches for solving challenging com- 
binatorial problems. Interestingly, the particular ad- 
vantages and disadvantages of each of these ap- 
proaches render them rather complementary. There- 
fore, it is hardly surprising that over the last few 
years, there has been increased interest in the ex- 
ploration and development of hybrid algorithms that 
combine ideas from both paradigms. For example, re- 
lated to the area of mathematical programming, the 
term Matheuristics has recently been coined to refer 
to methods that combine elements from mathematical 
programming techniques (which are primarily based 
on systematic search) and (meta)heuristic search algo- 
rithms [54.94]. 

Hybrids between SLS and systematic search fall 
into two main classes. The first of these consists of ap- 
proaches where the systematic search algorithm plays 
the role of the master process, and an SLS proce- 
dure is used to solve subproblems that arise during the 
systematic search process. Probably the simplest, yet
potentially effective, method is to use an SLS algorithm
to provide an initial high-quality (primal) bound on the 
optimal solution of the problem, which is then used 
by the systematic search algorithm for pruning parts of 
the search tree. Several more elaborate schemes have 
been devised, e.g., in the context of column generation 
and separation routines in integer programming [54.95]. 
Other approaches introduce the spirit of local search 
into integer programming solvers; examples of these in- 
clude local branching [54.96] and relaxation-induced 
neighborhood search [54.97]. We refer to [54.95] for 
a recent overview of such combinations. 
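
The simplest hybrid of the first class, an SLS procedure supplying an initial primal bound to a systematic search, can be sketched as follows. The sketch assumes the 0/1 knapsack problem (a maximization problem, so the primal bound is a lower bound), a deliberately naive local search and a depth-first branch and bound with the fractional relaxation as upper bound; none of these choices comes from the cited works.

```python
import random

def local_search_knapsack(values, weights, capacity, iters=2000, seed=0):
    """Very simple SLS: random single-item flips, accepting only feasible,
    non-worsening moves. Returns the value of the solution found."""
    rng = random.Random(seed)
    x = [0] * len(values)
    val = wt = 0
    for _ in range(iters):
        i = rng.randrange(len(values))
        d = 1 - 2 * x[i]                      # +1 adds item i, -1 removes it
        new_wt, new_val = wt + d * weights[i], val + d * values[i]
        if new_wt <= capacity and new_val >= val:
            x[i], wt, val = 1 - x[i], new_wt, new_val
    return val

def branch_and_bound(values, weights, capacity, incumbent=0):
    """Depth-first branch and bound; `incumbent` is a primal bound known in
    advance. Returns (best value, number of search-tree nodes visited)."""
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / weights[i], reverse=True)
    v = [values[i] for i in order]
    w = [weights[i] for i in order]
    n = len(v)
    best, nodes = incumbent, 0

    def upper_bound(k, val, room):
        # Fractional (LP) relaxation over the undecided items k..n-1.
        for i in range(k, n):
            if w[i] > room:
                return val + v[i] * room / w[i]
            room -= w[i]
            val += v[i]
        return val

    def visit(k, val, room):
        nonlocal best, nodes
        nodes += 1
        best = max(best, val)
        if k == n or upper_bound(k, val, room) <= best:
            return                            # leaf, or pruned by the bound
        if w[k] <= room:
            visit(k + 1, val + v[k], room - w[k])
        visit(k + 1, val, room)

    visit(0, 0, capacity)
    return best, nodes
```

Calling `branch_and_bound(values, weights, capacity, incumbent=local_search_knapsack(values, weights, capacity))` returns the same optimal value as a cold start while visiting no more nodes, since a better initial bound can only prune additional subtrees.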

The second class of hybrid approaches is based on 
the idea of using systematic search procedures to deal 
with specific tasks arising while running an SLS al- 
gorithm. Very-large neighborhood search [54.37], as 
discussed in Sect. 54.2, is probably one of the best- 
known examples. Elements of tree search methods can 
also be exploited within constructive search algorithms, 
as exemplified by the use of branch and bound tech- 
niques in ACO algorithms [54.98, 99]. Other examples 
include tour merging [54.100] and the usage of information
derived from integer programming formulations
of optimization problems in heuristic methods [54.101].
We refer to [54.102] for a survey of this general ap- 
proach. A taxonomy of the possible combinations of 
exact and local search algorithms has been introduced 
by Jourdan et al. [54.103]. 

Despite an increasing number of efforts on com- 
bining systematic search methods and SLS methods, as 
reviewed in [54.94], much work remains to be done in 
this direction, especially considering that the two un- 
derlying fundamental search paradigms are developed 
primarily in rather disjoint communities. We believe
that much can be gained by overcoming the traditional
view of these two approaches as competitors and
focusing instead on the synergies that arise from
their complementarity.


54.6.2 SLS Algorithm Engineering 


Despite the impressive successes in SLS research and 
applications — SLS algorithms are now firmly estab- 
lished as the method of choice for tackling a broad 
range of combinatorial problems — there are still sig- 
nificant shortcomings. Perhaps most prominently, there 
is a lack of guidelines and best practices regarding the 
design and development of effective SLS algorithms. 
Current practice is to implement one specific SLS 
method, based on one or more construction heuristics 
or iterative improvement procedures. However, general- 
purpose SLS methods are not fully defined recipes: they 
leave many design choices open, and typically only specific
combinations of these choices will result in an
effective algorithm for a given problem. Even worse,
the underlying basic construction and iterative improve- 
ment procedures have a tremendous influence on the 
final performance of the SLS algorithms built on them, 
and this influence is frequently neglected. 

We firmly believe that a more methodological ap- 
proach needs to be taken toward the design and imple- 
mentation of SLS algorithms. The research direction 
dedicated to developing such an approach is called 
stochastic local search algorithm engineering or, for 
short, SLS engineering; it is conceptually related to 
algorithm engineering [54.104] and software engineer- 
ing [54.105], where similar methodological issues are 
tackled in a different context. Algorithm engineering is 
rather closely related to SLS engineering; it has been 
conceived as an extension to the traditionally more the- 
oretically oriented research on algorithms. Algorithm 
engineering, according to [54.104], deals with the it- 
erative process of designing, analyzing, implementing, 
tuning and experimentally evaluating algorithms. SLS
engineering shares this motivation; however, the algorithms
that are dealt with in the context of SLS
approaches have substantially more complex and un- 
predictable behavior than those typically considered 
in algorithm engineering. There are several reasons 
for this: SLS algorithms are usually used for solv- 
ing NP-hard problems, they allow for many more 
degrees of freedom in the choice of algorithm com- 
ponents, and their stochasticity makes analysis more 
complex. 

From a high-level perspective, an initial approach 
to a successful SLS engineering process would proceed 
in a bottom-up fashion. Starting from knowledge about 
the problem, it would build SLS algorithms by itera- 
tively adding complexity to simple, basic algorithms. 
More concretely, a tentative first attempt at such a pro- 
cess could be as follows: 


1. Study existing knowledge on the problem to be 
solved and its characteristics; 

2. Implement basic and advanced constructive and it- 
erative improvement procedures; 

3. Starting from these, add complexity (for example, 
by moving to simple SLS methods); 

4. Improve performance by gradually adding concepts 
from more complex SLS techniques (for exam- 
ple, perturbations, prohibition mechanisms, popula- 
tions); 

5. Further configure and fine-tune parameters and de- 
sign choices; 

6. If found to be useful: iterate over steps 4-5. 


Obviously, such a process would not necessarily 
strictly follow this outline, but insights gained at later 
stages could prompt revisiting earlier design decisions. 
Several high-performance SLS algorithms have already 
been developed following roughly the process outlined 
above (see [54.106] for an explicit example). 

The SLS engineering process can be supported in
various ways. Algorithm development, implementation
and testing are facilitated by the use of programming
frameworks like ParadisEO [54.107, 108] and EasyLocal++
[54.109, 110], dedicated languages and systems
like COMET [54.111], libraries of data types (such
as LEDA [54.112]), and statistical tools, such as the
comprehensive, open-source R environment [54.113].
We expect that software environments specifically designed
for the automated empirical analysis and design
of algorithms, such as HAL [54.114, 115], will be
especially useful in this context. Tools for the automatic
configuration and tuning of algorithms, discussed
further in the next section, are also of considerable
importance.

Furthermore, we see an improved understanding of 
the relationship between problem and instance features 
on the one side, and the properties and the behavior of 
SLS methods on the other side as key enabling fac- 
tors for advanced SLS engineering approaches. The 
potential insights to be gained are not only of practical 
value to SLS engineering but also of considerable sci- 
entific interest. Progress in this direction is facilitated 
by advanced search space analysis techniques, statis- 
tical methods and machine learning approaches (see, 
e.g., Merz and Freisleben [54.116], Xu et al. [54.117]
and Watson et al. [54.118]). Another promising avenue
for future research involves the integration of
theoretical insights into the design process, for ex- 
ample, by restricting design alternatives or parameter 
choices. 

It is important to note that research toward SLS 
engineering adopts a component-wise view of SLS meth- 
ods. For example, iterated local search (ILS) uses 
perturbations to diversify the search as well as ac- 
ceptance tests (components: perturbations, acceptance 
tests), while evolutionary algorithms prominently in- 
volve the use of a population of solutions (component: 
population of solutions). Each of these components 
can be instantiated in different ways, and various com- 
binations are possible. An effective SLS engineering 
process should provide guidance to the algorithm de- 
signer regarding the choice and configuration of these 
components. It would naturally and incrementally lead 
to combinations of algorithmic components taken from 
different SLS methods (or other paradigms, such as
mathematical programming [54.94]), if these contribute
to desirable performance characteristics of the
algorithm under design. Such an engineering process 
would therefore rather naturally produce hybrid algo- 
rithms that are effective for solving the given computa- 
tional problem. 
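
The component-wise view can be made concrete with a small sketch of iterated local search in which the perturbation and the acceptance test are interchangeable functions. The integer toy landscape, the two acceptance criteria and all names are assumptions chosen for illustration only.

```python
import random

def cost(x):
    """Toy rugged landscape on the integers 0..199 (assumed for illustration)."""
    return abs(x - 137) + 20 * (x % 7)

def local_search(x):
    """Component: iterative improvement in the +/-1 neighborhood."""
    while True:
        better = [y for y in (x - 1, x + 1) if 0 <= y < 200 and cost(y) < cost(x)]
        if not better:
            return x
        x = min(better, key=cost)

def perturb(x, rng, strength=15):
    """Component: random jump of bounded size."""
    return min(199, max(0, x + rng.randint(-strength, strength)))

def accept_better(current, candidate, rng):
    """Component: accept only candidates that are at least as good."""
    return candidate if cost(candidate) <= cost(current) else current

def accept_always(current, candidate, rng):
    """Component: always move to the new local optimum."""
    return candidate

def iterated_local_search(start, perturb, accept, iters=200, seed=0):
    """ILS skeleton; the algorithmic components are passed in as functions."""
    rng = random.Random(seed)
    current = best = local_search(start)
    for _ in range(iters):
        candidate = local_search(perturb(current, rng))
        if cost(candidate) < cost(best):
            best = candidate
        current = accept(current, candidate, rng)
    return best
```

Swapping `accept_better` for `accept_always`, or replacing `perturb`, yields a different algorithm instance without touching the skeleton; this is the kind of design space an SLS engineering process would explore.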

Finally, SLS engineering places more emphasis on
decisions concerning the underlying basic SLS techniques
(such as construction heuristics, neighborhoods,
efficient data structures, etc.) than on the
general-purpose SLS methods (or metaheuristics) used in
a given algorithm design scenario. In fact, in our
experience, such fundamental choices, together with (i)
the level of expertise of the SLS algorithm developer
and implementer, (ii) the time invested in designing and
configuring the SLS algorithm, and (iii) the creative use of
insights into algorithm behavior and its interaction with
problem characteristics, play a considerably more
important role in the design of effective SLS algorithms
than the focus on specific features prescribed by
so-called metaheuristics.


54.6.3 Automatic Configuration 
of SLS Algorithms 


The performance of algorithms for virtually any com- 
putationally challenging problem (and in particular, for 
any NP-hard problem) depends strongly on appropriate 
settings of algorithm parameters. In many cases, there 
are tens of such parameters; for example, the well-known
commercial CPLEX solver for integer programming 
problems has more than 130 user-specifiable param- 
eters that influence its search behavior. Likewise, the 
behavior of most SLS algorithms is controlled by pa- 
rameters, and many design choices can be exposed in 
the form of parameters. This gives rise to algorithms 
with many categorical and numerical parameters. Cate- 
gorical parameters are used to make choices from a dis- 
crete set of design variants, such as search strategies, 
neighborhoods or perturbation mechanisms. Numeri- 
cal parameters often arise as subordinate parameters 
that directly control the behavior of a search strategy 
(e.g., temperature in SA and tabu tenure in simple tabu 
search). The goal in automated algorithm configura- 
tion is to find settings of these parameters that achieve 
optimized performance w.r.t. a performance metric of 
interest (for example, solution quality or computation 
time). 
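
As a minimal sketch of this goal, the following code tunes one numerical and one categorical parameter of a toy target algorithm by plain random search over a set of training instances. The target algorithm (randomized iterative improvement for number partitioning), the parameter names `walk_prob` and `init`, and the use of random search are all assumptions made for illustration; actual configurators use considerably more sophisticated search strategies.

```python
import random

def rii_partition(numbers, walk_prob, init, steps=300, seed=0):
    """Toy target algorithm: randomized iterative improvement for number
    partitioning. Returns the smallest |sum(A) - sum(B)| found."""
    rng = random.Random(seed)
    if init == "greedy":                  # largest first, onto the lighter side
        nums = sorted(numbers, reverse=True)
        signs, diff = [], 0
        for x in nums:
            s = -1 if diff > 0 else 1
            signs.append(s)
            diff += s * x
    else:                                 # uniformly random assignment
        nums = list(numbers)
        signs = [rng.choice((-1, 1)) for _ in nums]
        diff = sum(s * x for s, x in zip(signs, nums))
    best = abs(diff)
    for _ in range(steps):
        i = rng.randrange(len(nums))
        new_diff = diff - 2 * signs[i] * nums[i]
        # Improving moves are always taken; others with probability walk_prob.
        if abs(new_diff) < abs(diff) or rng.random() < walk_prob:
            signs[i], diff = -signs[i], new_diff
            best = min(best, abs(diff))
    return best

DEFAULT = {"walk_prob": 0.1, "init": "random"}   # hypothetical default setting

def mean_cost(cfg, instances):
    """Performance metric: mean solution quality over the instance set."""
    return sum(rii_partition(inst, **cfg) for inst in instances) / len(instances)

def configure(instances, n_samples=30, seed=0):
    """Random-search configurator; the default is always among the candidates."""
    rng = random.Random(seed)
    candidates = [DEFAULT] + [
        {"walk_prob": rng.uniform(0.0, 0.5), "init": rng.choice(("random", "greedy"))}
        for _ in range(n_samples)
    ]
    best = min(candidates, key=lambda cfg: mean_cost(cfg, instances))
    return best, mean_cost(best, instances)
```

Because the default configuration is included among the candidates, the returned setting is never worse than the default on the training set; whether it generalizes to unseen instances depends on how representative the training set is.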

Automated algorithm configuration methods are an 
active area of research and have been demonstrated to 
achieve very substantial performance gains on many 
widely studied and challenging problems [54.119]. So-called
offline configuration methods, which determine
performance-optimizing parameter settings on a representative
set of benchmark instances during a training phase
before algorithm deployment, have arguably been studied
most intensely. These in-


55. Parallel Evolutionary Combinatorial Optimization


El-Ghazali Talbi 


In this chapter, a clear difference is made between 
the parallel design aspect and the parallel imple- 
mentation aspect of evolutionary algorithms (EAs). 
From the algorithmic design point of view, the 
main parallel models for EAs are presented. A uni- 
fying view of parallel models for EAs is outlined. 
This chapter is organized as follows. In Sect. 55.2, 
the main parallel models for designing EAs are 
presented. Section 55.3 deals with the implementation
issues of parallel EAs. In this section, the
main concepts of parallel architectures and parallel
programming paradigms, which interact with
the design and implementation of parallel EAs, are
outlined. The main performance indicators that
can be used to evaluate a parallel EA in terms
of efficiency are detailed. Finally, Sect. 55.4 deals
with the design and implementation of different
parallel models for EAs based on the software
framework ParadisEO.


55.1 Motivation


On the one hand, optimization problems are becoming more and more
complex, and the resources required to solve them are ever increasing.
Real-life optimization problems are often NP-hard and CPU-time and/or
memory consuming. Although the use of evolutionary algorithms (EAs)
allows us to significantly reduce the computational complexity of the
solving algorithm, the latter remains time-consuming for many problems
in diverse domains of application, where the objective function and the
constraints associated with the problem are resource (e.g., CPU, memory)
intensive and the size of the search space is huge. Moreover, more and
more complex and resource-intensive EAs are being developed (e.g.,
hybrid EAs, multiobjective EAs) [55.1].

On the other hand, the rapid development of technology in designing
processors (e.g., multicore processors, dedicated architectures),
networks (local area networks (LAN) such as Myrinet and InfiniBand, or
wide area networks (WAN) such as optical networks), and data storage
makes the use of parallel computing more and more popular. Such
architectures represent an effective opportunity for the design and
implementation of parallel EAs. Indeed, sequential architectures are
reaching physical limitations (speed of light, thermodynamics).
Nowadays, even laptops and workstations are equipped with multicore
processors, which represent one class of parallel architecture.
Moreover, the cost/performance ratio is constantly decreasing. The
proliferation of powerful workstations and fast communication networks
has led to the emergence of dedicated architectures (e.g., GPUs),
clusters of processors (COPs), networks of workstations (NOWs), and
large-scale networks of machines (Grids) as platforms for
high-performance computing.

Parallel and distributed computing can be used in 
the design and implementation of EAs for the following 
reasons: 


• Speed up the search: One of the main goals in
parallelizing an EA is to reduce the search time.
This helps in designing real-time and interactive
optimization methods. This is a very important
aspect for classes of problems with hard
requirements on search time, such as dynamic
optimization problems and time-critical control
problems (e.g., real-time planning).

• Improve the quality of the obtained solutions: Some
parallel models for EAs allow us to improve the
quality of solutions. Indeed, exchanging information
between algorithms will alter their behavior
in terms of searching in the landscape associated
with the problem. The main goal of the cooperation
between algorithms is to improve the quality
of solutions. Both convergence to better solutions
and reduced search time may result. Note
that a parallel model for EAs may be more effective
than a sequential algorithm even on a single
processor [55.2].

• Improve the robustness: A parallel EA may be more
robust in terms of solving different optimization
problems, and different instances of a given problem,
in an effective manner. Robustness may be measured
in terms of the sensitivity of the algorithm to its
parameters.


• Solve large-scale problems: Parallel EAs allow us to
solve large-scale instances of complex optimization
problems. A challenge here is to solve very
large instances that cannot be solved on a sequential
machine. A similar challenge is to solve
more accurate mathematical models associated with
different optimization problems. Improving the
accuracy of mathematical models generally increases
the size of the associated problems to be solved.
Moreover, some optimization problems, such as data
mining problems, require the manipulation of huge
databases.


The implementation point of view deals with the efficiency of
a parallel EA on a target parallel architecture using a given parallel
language, programming environment, or middleware. The focus is on the
parallelization of EAs on general-purpose parallel and distributed
architectures, since this is the most widespread computational
platform. This chapter also deals with the implementation of EAs on
dedicated architectures such as reconfigurable architectures and GPUs
(graphics processing units). Different architectural criteria, which
affect the efficiency of the implementation, will be considered: shared
memory versus distributed memory, homogeneous versus heterogeneous,
shared versus nonshared by multiple users, local network versus large
network. Indeed, those criteria have a strong impact on the deployment
techniques employed, such as load balancing and fault tolerance.
Depending on the type of parallel architecture used, different parallel
and distributed languages, programming environments, and middleware may
be used, such as message passing (e.g., MPI), shared memory (e.g.,
multithreading, OpenMP, CUDA), remote procedure call (e.g., Java RMI,
RPC), high-throughput computing (e.g., Condor), and grid computing
(e.g., Globus).


55.2 Parallel Design of EAs


In terms of designing parallel EAs, three major parallel models are
identified. They correspond to the following three hierarchical levels
(Table 55.1):


• Algorithmic level: In this model, independent or
cooperating self-contained EAs are used. It is
a problem-independent interalgorithm parallelization.
If the different EAs are independent, the search
will be equivalent to the sequential execution of the
algorithms in terms of the quality of solutions.
However, the cooperative model will alter the behavior
of the EAs and enable an improvement in terms of
the quality of solutions.

• Iteration level: In this model, each iteration of
an EA is parallelized. It is a problem-independent
intra-algorithm parallelization. The behavior of the
EA is not altered. The main objective is to speed up
the algorithm by reducing the search time. Indeed,
the iteration cycle of EAs on large populations,
especially for real-world problems, requires a large
amount of computational resources.

• Solution level: In this model, the parallelization
process handles a single solution of the search space. It
is a problem-dependent intra-algorithm parallelization.
In general, evaluating the objective function(s)
or constraints for a generated solution is frequently
the most costly operation in EAs. In this model, the
behavior of the EA is not altered. The objective is
mainly to speed up the search.


Table 55.1 Parallel models of EAs

Parallel model      Problem dependency   Behavior     Granularity   Goal
Algorithmic level   Independent          Altered      EA            Effectiveness
Iteration level     Independent          Nonaltered   Iteration     Efficiency
Solution level      Dependent            Nonaltered   Solution      Efficiency


In the following sections, the different parallel models are detailed
and analyzed in terms of algorithmic design.


55.2.1 Algorithmic-Level Parallel Model


In this model, many EAs are launched in parallel. They may or may not
cooperate to solve the target optimization problem.


Independent Algorithmic-Level Parallel Model
In the independent algorithmic-level parallel model, different EAs are
executed without any cooperation. The different EAs may be initialized
with different populations. Different parameter settings may be used
for the EAs, such as the mutation and crossover probabilities.
Moreover, each search component of an EA may be designed differently:
encoding, search operators (e.g., variation operators), objective
function, constraints, stopping criteria, etc.

This parallel model is straightforward to design and implement. The
master/worker paradigm is well suited to this model. A worker
implements an EA. The master defines the different parameters to be
used by the workers and determines the best solution found among those
obtained by the different workers. In addition to speeding up the
algorithm, this parallel model enables us to improve its
robustness [55.3].

This model raises, in particular, the following question: Is executing
k EAs for a time t each equivalent to executing a single EA for
a time kt? The answer depends on the landscape properties of the
problem (e.g., the presence of multiple basins of attraction, the
distribution of the local optima, and the fitness distance
correlation) [55.4].
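
The independent model with a master and several workers can be sketched as follows. The sketch assumes a toy EA on OneMax and uses a thread pool from the Python standard library purely to illustrate the structure; in CPython, threads do not provide true CPU parallelism, so a process pool or a message-passing layer would be used in practice. All names are illustrative.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def run_ea(seed, mutation_rate, n_bits=30, pop_size=20, generations=40):
    """Worker: one self-contained EA on OneMax. Returns (fitness, solution)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        parents = [max(rng.sample(pop, 2), key=sum) for _ in range(pop_size)]
        children = [[1 - b if rng.random() < mutation_rate else b for b in p]
                    for p in parents]
        pop = sorted(pop + children, key=sum, reverse=True)[:pop_size]
    return sum(pop[0]), pop[0]

def master(settings):
    """Master: start one independent EA per (seed, mutation_rate) setting and
    keep the best solution among those returned by the workers."""
    with ThreadPoolExecutor(max_workers=len(settings)) as pool:
        results = list(pool.map(lambda s: run_ea(*s), settings))
    return max(results, key=lambda r: r[0]), results
```

For example, `master([(1, 0.01), (2, 0.05), (3, 0.1)])` runs three EAs that differ in their initial population (through the seed) and in their mutation rate, with no exchange of information between them.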


Cooperative Algorithmic-Level Parallel Model
In the cooperative model for parallel EAs, different
algorithms exchange information related to the
search with the intent to compute better and more
robust solutions.

In designing a parallel cooperative model for any
EA, the same design questions need to be answered:


• The exchange decision criterion (When?): The
exchange of information between the EAs can be
decided either in a blind (periodic or probabilistic)
way or according to an intelligent adaptive criterion.
Periodic exchange occurs in each algorithm
after a fixed number of iterations; this type of
communication is synchronous. Probabilistic exchange
consists in performing a communication operation
after each iteration with a given probability.
Conversely, adaptive exchanges are guided by some
run-time characteristics of the search. For instance,
they may depend on the evolution of the quality of the
solutions or the search memory. A classical criterion
is related to the improvement of the best found
local solution.

• The exchange topology (Where?): The communication
exchange topology indicates for each EA its
neighbor(s) regarding the exchange of information,
i.e., the source/destination algorithm(s) of the
information. Several works have been dedicated to the
study of the impact of the topology on the quality
of the provided results, and show that cyclic graphs
are better [55.5, 6]. The ring, mesh, and hypercube
regular topologies are often used.

• The information exchanged (What?): This parameter
specifies the information to be exchanged
between the EAs. In general, this information can be
composed of:

– Solutions: This information deals with a selection
of the solutions found during the search. In
general, it contains elite solutions that have been
found, such as the best solution at the current
iteration, local best solutions, the global best
solution, the neighborhood best solution, best
diversified solutions, and randomly selected solutions.
The number of solutions to exchange may be an
absolute value or a given percentage of the
population. Any selection mechanism can be used
to select the solutions.

– Search memory: This information deals with
any element of the search memory that is
associated with the involved EA.

• The integration policy (How?): Analogously to the
information exchange policy, the integration policy
deals with the usage of the received information. In
general, there is a local copy of the received
information. The local variables are updated using the
received ones. For instance, the best found solution
is simply updated by the best between the local
best solution and the neighboring best solution. Any
replacement strategy may be used to integrate the
set of received solutions into the local population.


Fig. 55.1a,b The traditional parallel (a) island (insular) and
(b) cellular models for evolutionary algorithms


Traditional Parallel Models for EAs. Historically, 
the cooperative parallel model has been largely used 
in EAs [55.7]. In sequential genetic algorithms (the 
sequential model is known as the panmictic genetic 
algorithm), the selection takes place globally. Any indi- 
vidual can potentially reproduce with any other individ- 
ual of the population. Among the best-known parallel 
algorithmic-level models for evolutionary algorithms 
are the island model and the cellular model. In the 
island model (also known as the migration model, dis- 
tributed model, multideme EA, or coarse-grained EA) 
for genetic algorithms, the population is decomposed 
into several subpopulations distributed among different 
nodes (Fig. 55.1). Each node is responsible for the evolution
of one subpopulation. It executes all the steps of
a classical EA from the selection to the replacement on 
the subpopulation. Each island may use different pa- 
rameter values and different strategies for any search 
component such as selection, replacement, variation 
operators (mutation, crossover), and encodings. After 
a given number of generations (synchronous exchange),
or when a condition holds (asynchronous exchange),
the migration process is activated. Then, exchanges of 
some selected individuals between subpopulations are 
realized, and received individuals are integrated into the 
local subpopulation. The selection policy of emigrants 
indicates for each island in a deterministic or stochastic 
way the individuals to be migrated. The stochastic or 
random policy does not guarantee that the best individ- 
uals will be selected, but its associated computation cost 
is relatively lower. The deterministic strategy allows the 
selection of the best individuals. The number of emi- 
grants can be expressed as a fixed or variable number 
of individuals, or through a percentage of individuals 
from the population. The choice of the value of this parameter
is crucial. Indeed, if the number of emigrants is
low, the migration process will be less efficient as the is- 
lands will have the tendency to evolve in an independent 
way. Conversely, if the number of emigrants is high, the 
EAs will likely converge to the same solutions [55.8]. In 
EAs, the replacement/integration policy of immigrants 
indicates in a stochastic or deterministic way the local 
individuals to be replaced by the newcomers. The ob- 
jective of the model is to delay the global convergence 
and encourage diversity [55.9, 10]. 
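
A minimal sketch of the island model follows. It assumes a ring topology, periodic (synchronous) migration, deterministic selection of the best individuals as emigrants, and replacement of the worst local individuals by the immigrants; the OneMax objective and all names and default values are illustrative assumptions, and the islands are simulated sequentially rather than placed on separate nodes.

```python
import random

def step(pop, rng, rate=0.05):
    """One generation of a simple elitist EA on OneMax (fitness = number of ones).
    The returned population is sorted best-first."""
    parents = [max(rng.sample(pop, 2), key=sum) for _ in pop]
    children = [[1 - b if rng.random() < rate else b for b in p] for p in parents]
    return sorted(pop + children, key=sum, reverse=True)[:len(pop)]

def island_model(n_islands=4, pop_size=10, n_bits=40, generations=60,
                 migration_interval=10, n_emigrants=2, seed=0):
    rngs = [random.Random(seed + i) for i in range(n_islands)]
    islands = [[[r.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
               for r in rngs]
    for g in range(1, generations + 1):
        islands = [step(pop, r) for pop, r in zip(islands, rngs)]
        if g % migration_interval == 0:              # When: periodic exchange
            # What: copies of the best individuals of each island.
            emigrants = [[ind[:] for ind in pop[:n_emigrants]] for pop in islands]
            for i, pop in enumerate(islands):
                incoming = emigrants[(i - 1) % n_islands]   # Where: ring topology
                # How: immigrants replace the worst local individuals.
                pop[len(pop) - n_emigrants:] = incoming
                pop.sort(key=sum, reverse=True)
    return islands
```

Increasing `n_emigrants` or decreasing `migration_interval` couples the islands more tightly, which corresponds to the trade-off between independent evolution and premature convergence to the same solutions discussed above.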

The other well-known parallel model for EAs, the 
cellular model (also known as the diffusion or fine- 
grained model), may be seen as a special case of the 
island model where an island is composed of a sin- 
gle individual. Traditionally, an individual is assigned 
to a cell of a grid. The selection occurs in the neighborhood
of the individual [55.11–13]. Hence, the selection
pressure is lower than in sequential
EAs. The overlapped small neighborhood in cellular
EAs helps to explore the search space because a slow
diffusion of solutions through the population provides
a kind of exploration, while exploitation takes place
inside each neighborhood. Cellular models applied to
complex problems can have a higher probability of
converging to better solutions than panmictic EAs [55.14,
15].


55.2.2 Iteration-Level Parallel Model


In this parallel model, the focus is on the parallelization
of each iteration of EAs. The iteration-level
parallel model is generally based on the distribution 
of the handled solutions. Indeed, the most resource- 
consuming part in an EA is the evaluation of the 
generated solutions. This model is concerned only
with search mechanisms that are problem-independent
operations, such as the generation of successive populations.
Any search operator of an EA which is not
specific to the tackled optimization problem is involved 
in the iteration-level parallel model. This model keeps 
the sequentiality of the original algorithm, and, hence, 
the behavior of the EA is not altered. 

Parallel iteration level models arise naturally when 
dealing with EAs, since each element belonging to 
the population is an independent unit. The iteration- 
level parallel model involves the distribution of the 
population. The operations commonly applied to 
each of the population elements are performed in 
parallel. 

The population of individuals can be decomposed 
and handled in parallel. In early work on the parallelization
of EAs, the well-known master-worker (also
known as global parallelization) method was used. In
this scheme, a master performs the selection operations 
and the replacement. The selection and replacement 
are generally sequential procedures, as they require 
a global management of the population. The associ- 
ated workers perform the recombination, mutation and 
the evaluation of the objective function. The master 
sends the partitions (subpopulations) to the workers. 
The workers return newly evaluated solutions to
the master. 
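
The master-worker scheme can be sketched as follows. For simplicity, this sketch keeps variation on the master and distributes only the evaluation of the objective function, whereas in the scheme described above the workers would also perform recombination and mutation. The OneMax objective, the thread pool and all names are assumptions for illustration; a real deployment would use processes or message passing for a costly objective function.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def evaluate(solution):
    """Stand-in for a costly objective function (here simply OneMax)."""
    return sum(solution)

def master_worker_ea(n_workers=4, pop_size=16, n_bits=30, generations=30,
                     rate=0.05, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]

    def tournament(fit):
        i, j = rng.sample(range(pop_size), 2)
        return pop[i] if fit[i] >= fit[j] else pop[j]

    with ThreadPoolExecutor(max_workers=n_workers) as workers:
        fit = list(workers.map(evaluate, pop))
        for _ in range(generations):
            # Master: selection is sequential, it needs the whole population.
            parents = [tournament(fit) for _ in range(pop_size)]
            children = [[1 - b if rng.random() < rate else b for b in p]
                        for p in parents]
            # Workers: evaluate in parallel; the master waits for all results.
            child_fit = list(workers.map(evaluate, children))
            # Master: replacement over parents and offspring.
            ranked = sorted(zip(fit + child_fit, pop + children),
                            key=lambda t: t[0], reverse=True)[:pop_size]
            fit = [f for f, _ in ranked]
            pop = [s for _, s in ranked]
    return pop, fit
```

Because all random decisions are taken by the master and the results are collected in order, the outcome does not depend on the number of workers, which illustrates that this model leaves the behavior of the EA unaltered.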

According to the order in which the evaluation 
phase is performed in comparison with the other parts 
of the EA, two modes can be distinguished: 


Fig. 55.2 Parallel asynchronous evaluation of a population
(panel labels: evaluated solutions, FIFO, parallel evaluators)


• Synchronous: In the synchronous mode, the master manages
  the evolution process and performs the selection and
  replacement steps serially. At each iteration, the master
  distributes the set of newly generated solutions among the
  workers and waits for the results to be returned. After the
  results are collected, the evolution process resumes. The
  model does not change the behavior of the EA compared to
  a sequential model. Each synchronous iteration completes
  only when the last evaluated solution has been returned.

• Asynchronous: In the asynchronous mode, the evaluation
  phase is not synchronized with the other parts of the EA.
  The master does not wait for the return of all evaluations
  to perform the selection, reproduction, and replacement
  steps. The steady-state EA is a good example illustrating
  the asynchronous model and its advantages. In the
  asynchronous model applied to a steady-state EA, the
  recombination and the evaluation steps may be done
  concurrently. The master manages the evolution engine and
  two queues of individuals of a given fixed size:
  individuals to be evaluated, and solutions being evaluated.
  The individuals of the first queue wait for a free
  evaluating node. When the queue is full the process blocks.
  The individuals of the second queue are assimilated into
  the population as soon as possible (Fig. 55.2). The
  reproduced individuals are stored in a FIFO data structure,
  which represents the individuals to be evaluated. The EA
  continues its execution in an asynchronous manner, without
  waiting for the results of the evaluation phase. The
  selection and reproduction phases are carried out until the
  queue of nonevaluated individuals is full. Each evaluator
  agent picks an individual from the data structure,
  evaluates it, and stores the results into another data
  structure storing the evaluated individuals. The order of
  evaluation defined by the selection phase may not be the
  same as in the replacement phase. The replacement phase
  consists in receiving, in a synchronous manner, the results
  of the evaluated individuals, and applying a given
  replacement strategy to the current population.


In some EAs (e.g., blackboard-based ones) some information
must be shared. For instance, in ant colony optimization
(ACO), the pheromone matrix must be shared by all ants. The
master has to broadcast the pheromone trails to each worker.
Each worker handles an ant process. It receives the pheromone
trails, constructs a complete solution, and evaluates it.
Finally, each worker sends back to the master the constructed
and evaluated solution. When the master has received all the
constructed solutions, it updates the pheromone trails
[55.16–19].


55.2.3 Solution-Level Parallel Model


In this model, problem-dependent operations performed on
solutions are parallelized. In general, the interest here is
the parallelization of the evaluation of a single solution,
that is, of its objective and/or constraints (this is also
called the acceleration move parallel model). This model is
particularly interesting when the objective function or the
constraints are time and/or memory consuming, and/or
input/output intensive. Indeed, most real-life optimization
problems need the intensive calculation of the objectives
and/or the access to large input files or databases.

Two different solution-level parallel models may be carried
out:


• Functional decomposition: In function-oriented
  parallelization, the objective function(s) and/or
  constraints are partitioned into different partial
  functions. The objective function(s) or the constraints are
  viewed as the aggregation of some partial functions. Each
  partial function is evaluated in parallel. Then, a
  reduction operation is performed on the results returned by
  the computed partial functions. By definition, this model
  is synchronous, so one has to wait for the termination of
  all workers calculating the partial functions.

• Data partitioning: For some problems, the objective
  function may require access to a huge database that could
  not be managed on a single machine. Due to a memory
  requirement constraint, the database is distributed among
  different sites, and data parallelism is exploited in the
  evaluation of the objective function. In data-oriented
  parallelization, the same function is computed on different
  partitions of the input data of the problem. The data is
  then partitioned or duplicated over different workers.


Fig. 55.3 Combination of the three parallel hierarchical
models (algorithmic level: independent or cooperative
self-contained metaheuristics; iteration level: parallel
handling of solutions or populations, by neighborhood or
population partitioning; solution level: parallel handling of
a single solution, by functional or data partitioning)


In the solution-level parallel model, the maximum 
number of parallel operations will be equal to the num- 
ber of partial functions or the number of data partitions. 
A hybrid model can also be used in which a functional 
decomposition and a data partitioning are combined. 
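As a rough sketch of functional decomposition, the objective below is assumed to be the sum of three partial functions (`travel_cost`, `load_penalty`, and `size_cost` are hypothetical names invented for this example). Each partial function is evaluated by its own worker, and the results are reduced by summation; the degree of parallelism equals the number of partial functions.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical partial functions whose sum is the objective value.
def travel_cost(solution):
    return sum(abs(a - b) for a, b in zip(solution, solution[1:]))


def load_penalty(solution):
    return max(0, sum(solution) - 10) * 5


def size_cost(solution):
    return len(solution)


PARTIAL_FUNCTIONS = [travel_cost, load_penalty, size_cost]


def evaluate(solution):
    """Evaluate each partial function in parallel, then reduce the results."""
    with ThreadPoolExecutor(max_workers=len(PARTIAL_FUNCTIONS)) as pool:
        futures = [pool.submit(f, solution) for f in PARTIAL_FUNCTIONS]
        # Synchronous by construction: result() waits for every worker.
        partial_values = [fut.result() for fut in futures]
    return sum(partial_values)  # reduction operation


total = evaluate([3, 1, 4, 1, 5])
```

Data partitioning would follow the same pattern, with one function applied to several slices of the input data instead of several functions applied to one solution.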


55.2.4 Hierarchical Combination 
of the Parallel Models 


The three presented models for parallel EAs may be used in
conjunction within a hierarchical structure [55.20,21]
(Fig. 55.3). The degree of parallelism associated with this
hybrid model is very high. Indeed, this hybrid model is very
scalable; the degree of concurrency is k * m * n, where k is
the number of EAs used, m is the size of the population, and
n is the number of partitions or tasks associated with the
evaluation of a single solution.


55.3 Parallel Implementation of EAs


Parallel implementation of EAs deals with the efficient 
mapping of a parallel model of EAs on a given parallel 
architecture. 


55.3.1 Parallel and Distributed Architectures 


Parallel architectures are evolving quickly. The main 
criteria of parallel architectures, which will have an 
impact on the implementation of parallel EAs, are: 
memory sharing, homogeneity of resources, resource 
sharing by multiple users, scalability, and volatility 
(Fig. 55.4). Those criteria will be used to analyze differ- 
ent parallel models and their efficient implementation. 
A guideline is given for the efficient implementation of 
each parallel model of EAs according to each class of 
parallel architectures. 


Shared Memory/Distributed Memory Architectures. 
In shared memory parallel architectures, the proces- 
sors are connected by a shared memory. There are 
different interconnection schemes for the network (e.g., 
bus, crossbar, multistage crossbar). This architecture 
is easy to program. Conventional operating systems 
and programming paradigms of sequential program- 
ming can be used. There is only one address space for 
data exchange but the programmer must take care of 
synchronization in memory access, such as the mutual 
exclusion in critical sections. This type of architecture
scales poorly (from 2 to 128 processors in current
technologies) and has a higher cost. Examples of such shared
memory architectures are symmetric multiprocessor (SMP)
machines and multicore processors.

In distributed memory architectures, each processor has its
own memory. The processors are connected by a given
interconnection network using different topologies (e.g.,
hypercube, 2D or 3D torus, fat-tree, and multistage
crossbars). This architecture is harder to program; data
and/or tasks have to be explicitly distributed to processors.
Exchanging information is also explicitly handled using
message passing between nodes (synchronous or asynchronous
communications). The cost of communication is not negligible
and must be minimized to design an efficient parallel EA.
However, this architecture scales well in terms of the number
of processors. In recent years, clusters of workstations
(COWs) have become one of the most popular parallel
distributed memory architectures. This class of architectures
offers a good ratio between cost and performance.


Homogeneous/Heterogeneous Parallel Architectures. Parallel
architectures may be characterized by the homogeneity of the
used processors, communication networks, operating systems,
etc. For instance, COWs are in general homogeneous parallel
architectures. The proliferation of powerful workstations and
fast communication networks has led to the emergence of
heterogeneous networks of workstations (NOWs) as platforms
for high-performance computing. This type of architecture is
present in any laboratory, company, campus, institution, etc.
These parallel platforms are generally composed of a large
number of individually owned heterogeneous workstations
shared by many users.


Shared/Nonshared Parallel Architectures. Most massively
parallel machines (MPPs) and clusters of workstations (COWs)
are generally nonshared by the applications. Indeed, at
a given time, the processors composing those architectures
are dedicated to the execution of a single application. NOWs
constitute a low-cost hardware alternative to run parallel
algorithms but are in general shared by multiple users and
applications.


Fig. 55.4 Hierarchical and flat classification of target
parallel architectures for EAs


Local Network (LAN)/Wide-Area Network (WAN). Massively
parallel machines, clusters, and local networks of
workstations may be considered as tightly coupled
architectures. Large networks of workstations and grid
computing platforms are loosely coupled and are affected by
a higher cost of communication. During the last decade, grid
computing systems have been largely deployed to provide
high-performance computing platforms. A computational grid is
a scalable pool of heterogeneous and dynamic resources
geographically distributed across multiple administrative
domains and owned by different organizations [55.22]. Two
types of grids may be distinguished:


• High-Performance Computing Grid (HPC grid): This grid
  interconnects supercomputers or clusters via a dedicated
  high-speed network. In general, this type of grid is
  nonshared by multiple users (at the level of processors).

• Desktop Grid: This class of grids is composed of numerous
  individually owned workstations connected via a
  nondedicated network such as the internet. This grid is
  volatile and shared by multiple users and applications.


Peer-to-peer networks have been developed in parallel to grid
computing technologies. Peer-to-peer infrastructures have
been focused on sharing data and are increasingly popular for
sharing computation.


Volatile/Nonvolatile Parallel Architectures. Desktop grids
constitute an example of volatile parallel architectures. In
a volatile parallel architecture, the availability of
resources varies dynamically in time and space. In a desktop
grid or a large network of shared workstations, volatility is
not the exception but the rule. Due to the large-scale nature
of the grid, the probability of resource failure is high. For
instance, desktop grids are failure-prone by nature (nodes
may reboot, shut down, or fail).


Table 55.2 Characteristics of the main parallel
architectures. Hom: Homogeneous, Het: Heterogeneous

Architecture    Memory       Homogeneity  Sharing    Network  Volatility
SMP, multicore  Shared       Hom          Yes or No  Local    No
COW             Distributed  Hom or Het   No         Local    No
NOW             Distributed  Het          Yes        Local    Yes
HPC grid        Distributed  Het          No         Large    No
Desktop grid    Distributed  Het          Yes        Large    Yes


Table 55.2 recapitulates the characteristics of the main
parallel architectures according to the presented criteria.
Those criteria will be used to analyze the efficient
implementation of the different parallel models of EAs.


55.3.2 Dedicated Architectures 


Dedicated hardware refers to programmable hardware or
specific architectures that can be designed or reused to
execute a parallel EA. The best-known examples of dedicated
hardware are field-programmable gate arrays (FPGAs) and
graphics processing units (GPUs) (Fig. 55.4).

FPGAs are hardware devices that can be used to implement
digital circuits by means of a programming process (not to be
confused with evolvable hardware, where the architecture is
reconfigured using EAs) [55.23]. The use of Xilinx FPGAs to
implement different EAs is increasingly popular. The design
and prototyping of an FPGA-based hardware board to execute
parallel EAs may restrict the design of some search
components. However, for some specific challenging
optimization problems with a high use rate, such as in
bioinformatics, dedicated hardware may be a good alternative.

A GPU is a dedicated graphics rendering device for
a workstation, personal computer, or game console. Recent
GPUs are very efficient at manipulating computer graphics,
and their parallel SIMD structure makes them more efficient
than general-purpose CPUs for a range of complex algorithms
[55.24]. The main companies producing GPUs are AMD (ATI
Radeon series) and NVIDIA (NVIDIA GeForce series). The
efficient implementation of EAs on GPUs is a challenging
issue [55.25].


55.3.3 Parallel Programming Environments 
and Middlewares 


The architecture of the target parallel machine strongly
influences the choice of the parallel programming model to
use. There are two main parallel programming paradigms:
shared memory and message passing (Fig. 55.5).

Fig. 55.5 Main parallel programming languages, programming
environments and middlewares

Two main alternatives exist to program shared 
memory architectures: 


• Multithreading: A thread may be viewed as a lightweight
  process. Different threads of the same process share some
  resources and the same address space. The main advantages
  of multithreading are the fast context switch, the low
  resource usage, and the possible overlap of communication
  and computation. Each thread can be executed on a different
  processor or core. Multithreaded programming may be done
  with libraries such as the standard Pthreads library
  [55.26] or with programming languages such as Java, through
  Java threads [55.27].

• Compiler directives: One of the standard shared memory
  paradigms is OpenMP. It is a set of compiler directives
  interfaced with the languages Fortran, C, and C++ [55.28].
  Those directives are inserted in a program to specify which
  sections of the program are to be parallelized by the
  compiler. CUDA, by contrast, is not directive based: it
  extends C and C++ with constructs for programming NVIDIA
  GPUs.
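To illustrate the shared memory paradigm and the need for mutual exclusion in critical sections, the following sketch lets several threads update a shared best-so-far solution. It is written in Python only for illustration (the chapter itself cites Pthreads, Java threads, and OpenMP); the class `SharedBest` and the toy objective are assumptions made for this example.

```python
import threading


class SharedBest:
    """Best-so-far solution stored in the address space shared by all threads."""

    def __init__(self):
        self.fitness = float("inf")
        self.solution = None
        self._lock = threading.Lock()

    def update(self, solution, fitness):
        # Critical section: the compare-and-replace must not interleave.
        with self._lock:
            if fitness < self.fitness:
                self.fitness = fitness
                self.solution = solution


def worker(candidates, best):
    for x in candidates:
        best.update(x, (x - 7) ** 2)  # toy objective, minimum at x = 7


best = SharedBest()
chunks = [range(start, 40, 4) for start in range(4)]  # 4 interleaved slices of 0..39
threads = [threading.Thread(target=worker, args=(chunk, best)) for chunk in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the lock, two threads could both pass the comparison and overwrite each other's result, which is the kind of synchronization error the programmer has to guard against on shared memory machines.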


Distributed memory parallel programming envi- 
ronments are based mainly on the following three 
paradigms: 


• Message passing: Message passing is probably the most
  widely used paradigm to program parallel architectures. In
  the message passing paradigm, the processes of a given
  parallel program communicate by exchanging messages in
  a synchronous or asynchronous way. Well-known programming
  environments based on message passing are sockets and the
  message passing interface (MPI).

• Remote procedure call: Remote procedure call (RPC) is
  a traditional way of programming parallel and distributed
  architectures. It allows a program to cause a procedure to
  execute on another processor.

• Object-oriented models: As in sequential programming,
  parallel object-oriented programming is a natural evolution
  of RPC. A classical example of such a model is Java RMI
  (Remote Method Invocation).
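The message passing paradigm can be illustrated with plain sockets. In the sketch below, a master sends solutions to an evaluator and blocks on each reply (synchronous communication). The setup is an assumption made to keep the example self-contained: both ends live in one process and are connected with `socket.socketpair()`, the evaluator runs in a thread, and messages are JSON values separated by newlines. A real deployment would use separate processes or machines, or MPI.

```python
import json
import socket
import threading


def evaluator(conn):
    """Worker: receive one solution per line, reply with its fitness."""
    with conn, conn.makefile("rw") as stream:
        for line in stream:                          # ends when the master closes
            solution = json.loads(line)
            fitness = sum(v * v for v in solution)   # toy objective (assumed)
            stream.write(json.dumps(fitness) + "\n")
            stream.flush()


master_end, worker_end = socket.socketpair()
threading.Thread(target=evaluator, args=(worker_end,), daemon=True).start()

results = []
with master_end, master_end.makefile("rw") as stream:
    for solution in ([1, 2, 3], [0, 0], [4]):
        stream.write(json.dumps(solution) + "\n")      # send
        stream.flush()
        results.append(json.loads(stream.readline()))  # blocking receive
```

All data exchange is explicit here: nothing is shared between the two ends except the messages themselves.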


In the last decade, considerable work has been carried out on
the development of grid middlewares. The Globus toolkit is
the de facto standard grid middleware. It supports the
development of distributed service-oriented computing
applications [55.29].

It is not easy to propose a guideline on which environment to
use in programming a parallel EA. The choice depends on the
target architecture, the parallel model of EAs, and the user
preferences. Some languages, such as C and C++, are more
system oriented. More portability is obtained with Java, but
at the price of less efficiency. This tradeoff is the
classical efficiency/portability compromise. A Fortran
programmer will be more comfortable with OpenMP. RPC models
are better suited to implementing services. Condor is an
efficient and easy way to implement parallel programs on
shared and volatile distributed architectures such as large
networks of heterogeneous workstations and desktop grids,
where fault tolerance is ensured by a checkpoint/recovery
mechanism. The use of MPI within Globus is more or less
suited to high-performance computing (HPC) grids. However,
the user has to deal with complex mechanisms such as dynamic
load balancing and fault tolerance. Table 55.3 presents
a guideline depending on the target parallel architecture.


Table 55.3 Parallel programming environments for different
parallel architectures

Architecture    Examples of suitable programming environment
SMP, multicore  Multithreading library within an operating system (e.g., Pthreads)
                Multithreading within languages: Java
                OpenMP interfaced with C, C++, or Fortran
COW             Message passing library: MPI interfaced with C, C++, Fortran
Hybrid ccNUMA   MPI or hybrid models: MPI/OpenMP, MPI/Multithreading
NOW             Message passing library: MPI interfaced with C, C++, Fortran
                Condor or object models (Java RMI)
HPC grid        MPICH-G (Globus) or GridRPC models (Netsolve, Diet)
Desktop grid    Condor-G or object models (Proactive)


55.3.4 Performance Evaluation 


For sequential algorithms, the main performance mea- 
sure is the execution time as a function of the input 
size. In parallel algorithms, this measure also depends 
on the number of processors and the characteristics of 
the parallel architecture. Hence, some classical perfor- 
mance indicators such as speedup and efficiency have 
been introduced to evaluate the scalability of parallel al- 
gorithms [55.30]. The scalability of a parallel algorithm 
measures its ability to achieve performance propor- 
tional to the number of processors. 

The speedup $S_N$ is defined as the time $T_1$ it takes to
complete a program with one processor divided by the time
$T_N$ it takes to complete the same program with $N$
processors:

$$S_N = \frac{T_1}{T_N} . \tag{55.1}$$


One can use wall-clock time instead of CPU time. The CPU time
is the time a processor spends in the execution of the
program, and the wall-clock time is the time of the whole
program including the input and output. Conceptually the
speedup is defined as the gain achieved by parallelizing
a program. If $S_N > N$ (resp., $S_N = N$), a superlinear
(resp., linear) speedup is obtained [55.14]. Mostly,
a sublinear speedup $S_N < N$ is obtained. This is due to the
overhead of communication and synchronization costs. The case
$S_N < 1$ means that the sequential time is smaller than the
parallel time, which is the worst case. This can happen if
the communication cost is much higher than the execution
cost.

The efficiency $E_N$ using $N$ processors is defined as the
speedup $S_N$ divided by the number of processors $N$:

$$E_N = \frac{S_N}{N} . \tag{55.2}$$

Conceptually the efficiency can be defined as how well $N$
processors are used when the program is computed in parallel.
An efficiency of 100% means that all of the processors are
fully used all the time. For some large real-life
applications, it is impossible to have the sequential time as
the sequential execution of the algorithm cannot be
performed. Then, the incremental efficiency $E_{NM}$ may be
used to evaluate the efficiency of extending the number of
processors from $N$ to $M$ processors:

$$E_{NM} = \frac{N \times E_N}{M \times E_M} . \tag{55.3}$$
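The speedup (55.1) and efficiency (55.2) can be checked numerically with a few lines of code. The timings below are invented for illustration only; the helper names `speedup`, `efficiency`, and `classify` are not taken from any library.

```python
def speedup(t1, tn):
    """Speedup as in (55.1): one-processor time divided by N-processor time."""
    return t1 / tn


def efficiency(t1, tn, n):
    """Efficiency as in (55.2): speedup divided by the number of processors."""
    return speedup(t1, tn) / n


def classify(t1, tn, n):
    """Name the regime discussed in the text for a given measurement."""
    s = speedup(t1, tn)
    if s > n:
        return "superlinear"
    if s == n:
        return "linear"
    if s < 1:
        return "slowdown"      # parallel run slower than the sequential one
    return "sublinear"


t1 = 120.0                                # hypothetical sequential time (s)
timings = {2: 64.0, 4: 36.0, 8: 24.0}     # hypothetical parallel times (s)
report = {n: (speedup(t1, tn), efficiency(t1, tn, n), classify(t1, tn, n))
          for n, tn in timings.items()}
```

With these made-up numbers the speedup keeps growing with the number of processors while the efficiency falls, which is the usual sublinear picture.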


Different definitions of speedup may be used depending on the
definition of the sequential time reference $T_1$. There is
no single best measure, since none of the measures dominates
the others globally. The choice of a given definition depends
on the objective of the performance evaluation analysis. It
is therefore important to specify clearly the choice and the
objective of the analysis.

The absolute speedup is used when the sequential time $T_1$
corresponds to the best-known sequential time to solve the
problem. Unlike other scientific domains such as numerical
algebra, where the best sequential algorithm is known for
some operations, in EA search it is difficult to identify the
best sequential algorithm. So, the absolute speedup is rarely
used. The relative speedup is used when the sequential time
$T_1$ corresponds to the parallel program executed on
a single processor.

Moreover, different stopping conditions may be 
used: 


• Fixed number of iterations: This condition is the most used
  to evaluate the efficiency of a parallel EA. Using this
  definition, a superlinear speedup $S_N > N$ is possible.
  This is due to the characteristics of the parallel
  architecture, where there are more resources (e.g., size of
  main memory and cache) than in a single processor
  (Fig. 55.6a). For instance, the search memory of an EA
  executed on a single processor may be larger than the main
  memory of that processor, and then some swapping will be
  carried out, which represents an overhead in the sequential
  time. When using a parallel architecture, the whole memory
  of the EA may fit in the main memory of its processors, and
  then the memory swapping overhead will not occur.

• Convergence to a solution with a given quality: This
  measure is interesting to evaluate the effectiveness of
  a parallel EA. It is only valid for parallel models of EAs
  based on the algorithmic level, which alters the behavior
  of the sequential EA. A superlinear speedup is possible and
  is due to the characteristics of the parallel search
  (Fig. 55.6b). Indeed, the order of searching different
  regions of the search space may be different from that of
  sequential search; the sequences of visited solutions in
  parallel and sequential search are different. This is
  similar to the superlinear speedups obtained in exact
  search algorithms such as branch and bound (this phenomenon
  is called speedup anomaly) [55.31].


Most evolutionary algorithms are stochastic algorithms
(scatter search, if considered as an evolutionary algorithm,
is a deterministic algorithm). When the stopping condition is
based on the quality of the solution, one cannot use the
speedup metric as defined previously, because the execution
times are random variables. The original definition may be
extended to the average speedup

$$S_N = \frac{E(T_1)}{E(T_N)} , \tag{55.4}$$

where $E(\cdot)$ denotes the expected (average) execution
time over several runs. The same seed for the generation of
random numbers must be used for a fairer experimental
performance evaluation.
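A small sketch of the average speedup (55.4), assuming the run times of several independent runs are available as lists; the values are invented. Note that (55.4) is a ratio of mean times, which in general differs from the mean of the per-run ratios.

```python
from statistics import mean


def average_speedup(seq_times, par_times):
    """Average speedup as in (55.4): mean sequential time over mean parallel time."""
    return mean(seq_times) / mean(par_times)


seq = [100.0, 120.0, 110.0]   # hypothetical times to reach the target quality, 1 processor
par = [30.0, 25.0, 35.0]      # hypothetical times with N processors
s_avg = average_speedup(seq, par)
mean_of_ratios = mean(a / b for a, b in zip(seq, par))  # a different statistic
```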

The speedup metrics have to be reformulated for 
heterogeneous architectures. The efficiency metric may 
be used for this class of architectures. Moreover, it 
can be used for shared parallel machines with multiple 
users. 


55.3.5 Main Properties of Parallel EAs 


The performance of a parallel EA on a given parallel
architecture depends mainly on its granularity. The
granularity of a parallel program is the amount of
computation performed between two communications. It is
measured as the ratio between the computation time and the
communication time. The three parallel models (algorithmic
level, iteration level, and solution level) have a decreasing
granularity from coarse-grained to fine-grained. The
granularity has an important impact on the speedup: the
larger the granularity, the better the obtained speedup.

Fig. 55.6a,b Superlinear speedups for a parallel EA.
(a) Parallel architecture source: memory hierarchy (P:
processor, M: main memory, C: cache; search memory,
interconnection network). (b) Parallel search source:
parallel search trajectories (panel labels: objective, search
space, different initial solutions, local search 1 to n,
first local optima)


The degree of concurrency of a parallel EA is the maximum
number of parallel processes at any time. This measure is
independent of the target parallel architecture. It indicates
the number of processors that can be employed usefully by the
parallel EA.

Asynchronous communications and the overlap of computation
and communication are also an important issue for an
efficient parallel implementation. Indeed, most current
processors integrate different parallel elements such as the
arithmetic logic unit (ALU), floating point unit (FPU),
graphical processing unit (GPU), direct memory access (DMA),
etc. Most of the computation takes place in cache. Hence, the
random access memory (RAM) bus is often free and can be used
by other elements such as the DMA, so input/output operations
can be overlapped with computation tasks.

Scheduling the different tasks composing a parallel EA is
another classical issue to deal with for an efficient
implementation. Different scheduling strategies may be used,
depending on whether or not the number and the location of
the units of work (tasks, data) depend on the load state of
the target machine:


• Static scheduling: This class represents parallel EAs in
  which both the number of tasks of the application and the
  location of work (tasks, data) are fixed at compile time.
  Static scheduling is useful for homogeneous architectures,
  and for heterogeneous architectures that are nonshared and
  nonvolatile. Otherwise, when there are noticeable load or
  power differences between processors, the search time of an
  iteration is determined by the maximum execution time over
  all processors, presumably on the most highly loaded
  processor or the least powerful processor. A significant
  number of tasks are then often idle, waiting for other
  tasks to complete their work.

• Dynamic scheduling: This class represents parallel EAs for
  which the number of tasks is fixed at compile time, but the
  location of work is determined and/or changed at run time.
  The tasks are dynamically scheduled on different processors
  of the parallel architecture. Dynamic load balancing is
  important for shared (multiuser) architectures, where the
  load of a given processor cannot be determined at compile
  time. Dynamic scheduling is also important for irregular
  parallel EAs in which the execution time cannot be
  predicted at compile time and varies during the search. For
  instance, this happens when the evaluation cost of the
  objective function depends on the solution.

  Many dynamic load-balancing strategies may be applied. For
  instance, during the search, each time a processor finishes
  its work, it issues a work request. The degree of
  parallelism of this class of scheduling algorithms is not
  related to load variations in the target machine. When the
  number of tasks exceeds the number of idle nodes, multiple
  tasks are assigned to the same node. Moreover, when there
  are more idle nodes than tasks, some of them will not be
  used.

• Adaptive scheduling: Parallel adaptive algorithms are
  parallel computations with a dynamically changing set of
  tasks. Tasks may be created or killed as a function of the
  load state of the parallel machine. A task is created
  automatically when a node becomes idle. When a node becomes
  busy, the task is killed. Adaptive load balancing is
  important for volatile architectures such as desktop grids.
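The difference between static and dynamic scheduling can be made concrete with a small simulation. This is a sketch under simplifying assumptions: task costs are known numbers, processors are identical, communication is free, static scheduling is a round-robin assignment, and dynamic scheduling hands the next task to whichever processor becomes idle first (the work-request strategy described above).

```python
import heapq


def static_makespan(costs, n_procs):
    """Tasks are assigned round-robin before the run; no rebalancing."""
    loads = [0.0] * n_procs
    for i, c in enumerate(costs):
        loads[i % n_procs] += c
    return max(loads)


def dynamic_makespan(costs, n_procs):
    """Each processor requests the next task as soon as it becomes idle."""
    finish_times = [0.0] * n_procs     # min-heap of times at which processors free up
    heapq.heapify(finish_times)
    for c in costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + c)
    return max(finish_times)


# Irregular evaluation costs: every fourth task is expensive.
costs = [8, 1, 1, 1, 8, 1, 1, 1, 8, 1, 1, 1]
static = static_makespan(costs, 4)     # processor 0 receives all expensive tasks
dynamic = dynamic_makespan(costs, 4)
```

In this contrived case the static assignment puts all the expensive tasks on one processor, so the iteration time is set by that processor while the others sit idle; the work-request strategy spreads the load.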


For some parallel and distributed architectures such as
shared networks of workstations and grids, fault tolerance is
an important issue. Indeed, in volatile shared architectures
and large-scale parallel architectures, the fault probability
is relatively high. Checkpointing and recovery techniques
constitute one answer to this problem. Application-level
checkpointing is much more efficient than system-level
checkpointing. Indeed, in system-level checkpointing,
a checkpoint of the global state of a distributed application
composed of a set of processes is carried out. In
application-level checkpointing, only minimal information is
checkpointed (e.g., population of individuals, generation
number). Compared to system-level checkpointing, a reduced
cost is then obtained in terms of memory and time.
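A minimal sketch of application-level checkpointing, assuming that the state to save is just the generation number and the population, as in the example given above. The file name and the JSON format are choices made for this illustration; a complete implementation would typically also save the state of the random number generator.

```python
import json
import os
import tempfile


def save_checkpoint(path, generation, population):
    """Write only the minimal search state; replace the old file atomically."""
    state = {"generation": generation, "population": population}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # a crash mid-write leaves the last checkpoint intact


def load_checkpoint(path):
    """Return (generation, population), or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        state = json.load(f)
    return state["generation"], state["population"]


workdir = tempfile.mkdtemp()
ckpt = os.path.join(workdir, "ea_checkpoint.json")   # hypothetical file name
save_checkpoint(ckpt, 42, [[0.1, 0.2], [0.3, 0.4]])
restored = load_checkpoint(ckpt)
```

After a failure, the EA would call `load_checkpoint` at startup and resume from the restored generation instead of restarting the search.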

Finally, security issues may be important for large-scale
distributed architectures such as grids and peer-to-peer
systems (multidomain administration, firewalls, etc.) and for
some specific applications of industrial concern, such as
medical and bioinformatics research applications.


55.3.6 Algorithmic-Level Parallel Model 


Granularity 

The algorithmic-level parallel model has the largest
granularity. Indeed, the time for exchanging the information
is in general much less than the computation time of an EA.
There are relatively low communication requirements for this
model. The higher the frequency of exchange and the larger
the size of the exchanged information, the smaller the
granularity. This parallel model is the most suited to
large-scale distributed architectures over the internet such
as grids. Moreover, the trivial model with independent
algorithms is convenient for low-speed networks of
workstations over an intranet. As there is no essential
dependency and communication between the algorithms, the
speedup is generally linear for this parallel model.

For an efficient implementation, the frequency of exchange
(resp., the size of the exchanged data) must be correlated to
the latency (resp., bandwidth) of the communication network
of the parallel architecture.

To optimize the communication between processors, the
exchange topology can be specified according to the
interconnection network of the parallel architecture. The
specification of the different parameters associated with the
blind or intelligent migration decision criterion (migration
frequency/probability and improvement threshold) is
particularly crucial on a computational grid. Indeed, due to
the heterogeneous nature of computational grids, these
parameters must be specified for each EA in accordance with
the machine it is hosted on.


Scalability 
The degree of concurrency of the algorithmic-level parallel
model is limited by the number of EAs involved in solving the
problem. In theory, there is no limit. However, in practice,
it is limited by the resources available on the target
parallel architectures, and also by the effectiveness of
using a large number of EAs.


Synchronous Versus Asynchronous Communications
The implementation of the algorithmic-level model is either
asynchronous or synchronous. The asynchronous mode associates
with each EA an exchange decision criterion, which is
evaluated at each iteration of the EA from the state of its
memory. If the criterion is satisfied, the EA communicates
with its neighbors. The exchange requests are managed by the
destination EAs within an undetermined delay. The reception
and integration of the received information is thus performed
during the next iterations. However, in a computational grid
context, due to hardware and/or software heterogeneity, the
EAs could be at different evolution stages, leading to the
noneffect and/or supersolution problem. For instance, the
arrival of poor solutions at a very advanced stage will not
bring any contribution, as these solutions will likely not be
integrated (noneffect). In the opposite situation, where very
good solutions arrive at an early stage, the cooperation will
lead to premature convergence (supersolution).

From another point of view, as it is nonblocking,
the model is more efficient and fault tolerant, provided
that a threshold of wasted exchanges is not exceeded.
In the synchronous mode, the EAs perform
a synchronization operation at a predefined iteration by 
exchanging some data. Such operation guarantees that 
the EAs are at the same evolution stage, and so prevents 


the noneffect and supersolution problem quoted before. 
However, in heterogeneous parallel architectures, the
synchronous mode is less efficient in terms of consumed
CPU time. Indeed, the evolution process often hangs
on powerful machines waiting for the less powerful
ones to complete their computation. The synchronous
model is also not fault tolerant, as a fault of a single EA
blocks the whole model in a volatile
environment. Hence, the synchronous mode is globally
less efficient on a computational grid.

Asynchronous communication is more efficient 
than synchronous communication for shared architec- 
tures such as NOWs and desktop grids (e.g., multiple 
users, multiple applications). Indeed, as the load of 
networks and processors is not homogeneous, the use 
of synchronous communication will degrade the per-
formance of the whole system. The least powerful
machine will determine the performance.

On a volatile computational grid, it is difficult to ef-
ficiently maintain topologies such as rings and tori.
Indeed, the disappearance of a given node (i.e., EA) re-
quires a dynamic reconfiguration of the topology. Such
reconfiguration is costly and makes the migration pro-
cess inefficient. Designing a cooperation between a set
of EAs without any fixed topology may be considered. For
instance, a communication scheme in which the target
EA is selected randomly is more efficient for volatile
architectures such as desktop grids. Many experimental
results show that such a topology allows a significant im-
provement of the robustness and quality of solutions.
The random topology is therefore conceivable and even
recommended in a computational grid context.
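
The random-target communication scheme can be sketched as follows. This is only an illustration under stated assumptions: the names `Island`, `pick_random_target`, and `migrate` are hypothetical (they are not part of any framework discussed in this chapter), fitness values stand in for individuals, and maximization is assumed.

```python
import random

class Island:
    """Hypothetical stand-in for one EA; the population is a list of fitness values."""
    def __init__(self, name, population):
        self.name = name
        self.population = list(population)
        self.alive = True

def pick_random_target(source, islands, rng):
    """Choose a random live island other than the source; None if no peer is left."""
    candidates = [isl for isl in islands if isl.alive and isl is not source]
    return rng.choice(candidates) if candidates else None

def migrate(source, islands, rng, n_migrants=1):
    """Send copies of the best individuals to a randomly chosen live island."""
    target = pick_random_target(source, islands, rng)
    if target is None:
        return None  # no peer reachable: the exchange is simply skipped
    migrants = sorted(source.population, reverse=True)[:n_migrants]
    # Replace the worst individuals of the target (maximization assumed).
    target.population.sort()
    target.population[:len(migrants)] = migrants
    return target

rng = random.Random(0)
islands = [Island(f"ea{i}", [i, i + 1, i + 2]) for i in range(4)]
islands[2].alive = False  # a node disappears; no topology has to be repaired
chosen = migrate(islands[0], islands, rng)
```

Because the target is drawn from the set of currently live EAs at each exchange, the loss of a node requires no reconfiguration step, which is the property the paragraph above relies on.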


Scheduling 
Concerning the scheduling aspect, in the algorithmic- 
level parallel model the tasks correspond to EAs. 
Hence, the different scheduling strategies will differ as 
follows: 


• Static scheduling: The number of EAs is constant
and correlated to the number of processors of the 
parallel machine. A static mapping between the EAs 
and the processors is realized. The localization of 
EAs will not change during the search. 

• Dynamic scheduling: EAs are dynamically sched-
uled on different processors of the parallel architec- 
ture. Hence, the migration of EAs during the search 
between different machines may happen. 

• Adaptive scheduling: The number of EAs involved
in the search will vary dynamically. For exam-
ple, when a machine becomes idle, a new EA is
launched to perform a new search. When a ma-
chine becomes busy or faulty, the associated EA is
stopped.


Fault Tolerance 
The memory state of the algorithmic-level parallel 
model required for the checkpointing mechanism is 
composed of the memory of each EA and the in- 
formation being migrated (i. e., population, generation 
number). 


55.3.7 Iteration-Level Parallel Model 


Granularity 

A medium granularity is associated with the iteration- 
level parallel model. The ratio between the evaluation 
of a partition and the communication cost of a parti- 
tion determines the granularity. This parallel model is 
then efficient if the evaluation of a solution is time- 
consuming and/or there are a large number of candidate 
solutions to evaluate. The granularity will depend on the 
number of solutions in each subpopulation. 


Scalability 
The degree of concurrency of this model is limited by 
the size of the population. The use of large populations 
will increase the scalability of this parallel model. 


Synchronous Versus Asynchronous Communications
Introducing asynchronism in the iteration-level parallel 
model will increase the efficiency of parallel EAs. In the 
iteration-level parallel model, asynchronous communi- 
cations are related to the asynchronous evaluation of 
partitions and construction of solutions. Unfortunately, 
this model is more or less synchronous. Asynchronous 
evaluation is more efficient for heterogeneous or shared 
or volatile parallel architectures. Moreover, asynchro- 
nism is necessary for optimization problems where the 
computation cost of the objective function (and con- 
straints) depends on the solution and different solutions 
may have different evaluation cost. 

Asynchronism may be introduced by relaxing the 
synchronization constraints. For instance, steady-state 
algorithms may be used in the reproduction phase. 

The two main advantages of the asynchronous 
model over the synchronous model are fault tolerance 
and robustness when fitness computation times vary
widely. Whereas some time-out
detection can be used to address the former issue, the 
latter one can be partially overcome if the grain is set 


to very small values, as individuals will be sent out for 
evaluations upon request of the workers. Therefore, the 
model is blocking and, thus, less efficient on a hetero- 
geneous computational grid. Moreover, as the model 
is not fault tolerant, the disappearance of an evaluat- 
ing agent requires the redistribution of its individuals to 
other agents. As a consequence, it is essential to store 
all the solutions not yet evaluated. The scalability of the 
model is limited to the size of the population. 


Scheduling 
In the iteration-level parallel model, tasks correspond to 
the construction/evaluation of a set of solutions. Hence, 
the different scheduling strategies will differ as follows: 


• Static scheduling: Here, a static partitioning of the
population is applied. For instance, the population 
is decomposed into equal size partitions depend- 
ing on the number of processors of the parallel 
homogeneous nonshared machine. A static map- 
ping between the partitions and the processors is 
realized. For a heterogeneous nonshared machine, 
the size of each partition must be initialized ac- 
cording to the performance of the processors. The 
static scheduling strategy is not efficient for vari- 
able computational costs of equal partitions. This 
happens for optimization problems where different 
costs are associated with the evaluation of solutions. 
For instance, in genetic programming individuals 
may widely vary in size and complexity. This makes 
a static scheduling of the parallel evaluation of the 
individuals not efficient [55.32, 33]. 

• Dynamic scheduling: A static partitioning is applied
but a dynamic migration of tasks can be carried out 
depending on the varying load of processors. The 
number of tasks generated may be equal to the size 
of the population. Many tasks may be mapped on 
the same processor. Hence, more flexibility is ob- 
tained for the scheduling algorithm. For instance, 
the approach based on the master-workers cycle 
stealing may be applied. To each worker is first al- 
located a small number of solutions. Once it has 
performed its iterations, the worker requests from 
the master additional solutions. All the workers are 
stopped once the final result is returned. Faster and 
less loaded processors handle more solutions than 
the others. This approach allows us to reduce the 
execution time compared to the static one. 

• Adaptive scheduling: The objective in this model
is to adapt the number of partitions generated
to the load of the target architecture. More effi-
cient scheduling strategies are obtained for shared,
volatile, and heterogeneous parallel architectures
such as desktop grids.
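
The difference between static and dynamic scheduling can be illustrated with a small simulation. This is a sketch under assumptions rather than a description of any particular framework: the evaluation costs and worker speeds are invented, communication is treated as free, and the function names `static_makespan` and `dynamic_makespan` are hypothetical.

```python
import heapq

def static_makespan(costs, speeds):
    """Equal-size contiguous partitions, one per worker, fixed before the run."""
    n, p = len(costs), len(speeds)
    size = -(-n // p)  # ceiling division
    finish = []
    for w, speed in enumerate(speeds):
        part = costs[w * size:(w + 1) * size]
        finish.append(sum(part) / speed)
    return max(finish)

def dynamic_makespan(costs, speeds, chunk=1):
    """Master-worker scheme: each idle worker requests the next chunk of solutions."""
    # Heap of (time at which the worker becomes idle, worker id).
    idle = [(0.0, w) for w in range(len(speeds))]
    heapq.heapify(idle)
    end = 0.0
    for start in range(0, len(costs), chunk):
        t, w = heapq.heappop(idle)  # earliest available worker
        t += sum(costs[start:start + chunk]) / speeds[w]
        end = max(end, t)
        heapq.heappush(idle, (t, w))
    return end

costs = [1, 1, 1, 1, 8, 8, 8, 8]  # hypothetical, uneven evaluation costs
speeds = [1.0, 2.0]               # hypothetical speeds of a heterogeneous machine
t_static = static_makespan(costs, speeds)
t_dynamic = dynamic_makespan(costs, speeds)
```

With these invented numbers the faster worker ends up evaluating more solutions under the dynamic scheme, so the overall completion time is shorter than with the fixed equal-size partitions.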


Fault Tolerance 
The memory of the iteration-level parallel model re- 
quired for the checkpointing mechanism is composed 
of different partitions. The partitions are composed of 
a set of (partial) solutions and their associated objective 
values. 


55.3.8 Solution-Level Parallel Model 


Granularity 
This parallel model has a fine granularity. There are
relatively high communication requirements for this
model. In the functional decomposition parallel model,
the granularity will depend on the ratio between the 
evaluation cost of the subfunctions and the commu- 
nication cost of a solution. In the data decomposition 
parallel model, it depends on the ratio between the eval- 
uation of a data partition and its communication cost. 

The fine granularity of this model makes it less suit- 
able for large-scale distributed architectures where the 
communication cost (in terms of latency and/or band- 
width) is relatively important, such as in grid computing 
systems. Indeed, its implementation is often restricted
to clusters, networks of workstations, or shared-memory
machines.


Scalability 
The degree of concurrency of this parallel model is
limited by the number of subfunctions or data parti-
tions. Although its scalability is limited, using the
solution-level parallel model in conjunction with the
two other parallel models makes it possible to extend the
scalability of a parallel EA.


Synchronous Versus Asynchronous Communications
The implementation of the solution-level parallel model 
is always synchronous following a master-workers 
paradigm. Indeed, the master must wait for all partial 
results to compute the global value of the objective 
function. The execution time $T$ will be bounded by
the maximum time $T_i$ of the different tasks. An excep-
tion occurs for hard-constrained optimization problems,
where feasibility of the solution is first tested. The mas- 
ter terminates the computations as soon as a given task 
detects that the solution does not satisfy a given hard 
constraint. Due to its heavy synchronization steps, this 


parallel model is worth applying to problems in which 
the calculations required at each iteration are time con-
suming. The relative speedup may be approximated as
follows:

$$S_n = \frac{T}{\alpha + T/n} , \tag{55.5}$$

where $\alpha$ is the communication cost.
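
A short numerical illustration of the approximation $T / (\alpha + T/n)$ from (55.5) follows. The values of `T` and `alpha` are invented, and reading `n` as the number of parallel tasks is an assumption, since the text does not define it explicitly.

```python
def relative_speedup(T, n, alpha):
    """Approximate speedup T / (alpha + T / n), following Eq. (55.5)."""
    return T / (alpha + T / n)

T = 100.0    # hypothetical sequential evaluation time
alpha = 5.0  # hypothetical communication cost
speedups = {n: relative_speedup(T, n, alpha) for n in (1, 2, 4, 10, 100)}
limit = T / alpha  # value the expression approaches as n grows
```

Under this approximation the speedup grows with `n` but never exceeds `T / alpha`, which is consistent with the remark that the model pays off only when the calculations are time consuming relative to the communication cost.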


Scheduling 
In the solution-level parallel model, tasks correspond 
to subfunctions in the functional decomposition and to 
data partitions in the data decomposition model. Hence, 
different scheduling strategies will differ as follows: 


• Static scheduling: Usually, the subfunctions or data
are decomposed into equal size partitions depending 
on the number of processors of the parallel machine. 
A static mapping between the subfunctions (or data 
partitions) and the processors is applied. As for the 
other parallel models, this static scheme is efficient 
for parallel homogeneous nonshared machines. For 
a heterogeneous nonshared machine, the size of 
each partition in terms of subfunctions or data must 
be initialized according to the performance of the 
processors. 

• Dynamic scheduling: Dynamic load balancing will
be necessary for shared parallel architectures or 
variable costs for the associated subfunctions or 
data partitions. Dynamic load balancing may be eas- 
ily achieved by evenly distributing at run time the 
subfunctions or the data among the processors. In 
optimization problems, where the computing cost 
of the subfunctions is unpredictable, dynamic load 
balancing is necessary. Indeed, a static scheduling 
cannot be efficient because there is no appropri- 
ate estimation of the task costs (i. e., unpredictable 
costs). 

• Adaptive scheduling: In adaptive scheduling, the
number of subfunctions or data partitions gener- 
ated is adapted to the load of the target architecture. 
More efficient scheduling strategies are obtained for 
shared, volatile and heterogeneous parallel architec- 
tures such as desktop grids. 


Fault Tolerance 
The memory of the solution-level parallel model re- 
quired for the checkpointing mechanism is straightfor- 
ward. It is composed of the solution and its partial 
objective value calculations. 

Depending on the target parallel architecture, Table 55.4
presents a general guideline for the efficient implementation
of the different parallel models of EAs. For each parallel
model (algorithmic level, iteration level, and solution level),
the table shows its characteristics according to the outlined
criteria (granularity, scalability, asynchronism, scheduling
and fault tolerance).

Table 55.4 Efficient implementation of parallel EAs according to some performance metrics and used strategies

| Property | Algorithmic level | Iteration level | Solution level |
|---|---|---|---|
| Granularity | Coarse (frequency of exchange, size of information) | Medium (nb. of solutions per partition) | Fine (eval. subfunctions, eval. data partitions) |
| Scalability | Number of EAs | Neighborhood size, population size | Nb. of subfunctions, nb. data partitions |
| Asynchronism | High (information exchange) | Moderate (eval. of solutions) | Exceptional (feasibility test) |
| Scheduling and fault tolerance | EA | Solution(s) | Partial solution(s) |


55.4 Parallel EAs Under ParadisEO


Designing generic software frameworks to deal with
the design and efficient transparent implementation
of parallel and distributed EAs is an important challenge.
Indeed, efficient implementation of parallel EAs
is a complex task, which depends on the type of the
parallel architecture used. In designing a software
framework for parallel EAs, one has to keep in mind the
following important properties: portability, efficiency,
ease of use, and flexibility in terms of parallel
architectures and models.

Several white-box frameworks for the reusable design
of parallel EAs have been proposed and are available
from the Web. The most important of them are:
DREAM (distributed resource evolutionary algorithm
machine) [55.34], ECJ (Java evolutionary computation)
[55.35], JDEAL (Java distributed evolutionary
algorithms library) and Distributed BEAGLE (distributed
Beagle engine advanced genetic learning environment)
[55.36]. These frameworks are reusable as
they are based on a clear object-oriented conceptual
separation. They are also portable as they are developed
in Java; the exception is the last system, which
is programmed in C++. However, they are limited
regarding the parallel distributed models. Indeed, in
DREAM and ECJ only the island model is implemented
using Java threads and TCP/IP sockets. DREAM is
particularly deployable on peer-to-peer platforms.
Furthermore, JDEAL provides only the master-worker model
(iteration-level parallel model) using TCP/IP sockets.
The latter also designs the synchronous migration-based
island model, but implemented on a single
processor.

Few frameworks available on the Web are devoted
to EAs and their hybridization. MALLBA [55.37],
MAFRA (Java Memetic Algorithms Framework)
[55.38] and ParadisEO are good examples of such
frameworks. MAFRA is developed in Java using
design patterns [55.39]. It is strongly hybridization-oriented,
but it is very limited regarding parallelism
and distribution. MALLBA and ParadisEO have numerous
similarities. They are C++/MPI open source
frameworks. They provide all the previously presented
distributed models, and different hybridization mechanisms.
However, they are quite different as ParadisEO
is more flexible thanks to the finer granularity of
its classes. Moreover, ParadisEO also provides the
MPI-based communication layer and Pthreads-based
multithreading. MALLBA is deployable on wide area
networks using NetStream, a message passing service
upon MPI [55.37]. ParadisEO is deployable on grid
computing platforms using the Globus toolkit [55.21].

ParadisEO-PEO offers transparent implementation 
of the different parallel models on different archi- 
tectures using suitable programming environments. 
ParadisEO-PEO offers an easy implementation of the 
three main parallel models. The algorithmic-level par- 
allel model allows several optimization algorithms to 
cooperate and exchange any kind of data. The iteration- 
level parallel model proposes to parallelize and dis- 
tribute a set of identical operations. In the solution-level 
parallel model, any calculation block specific to the op- 
timization problem can be divided into smaller units to 
speed-up the treatment and gain efficiency. 

ParadisEO contains three interconnected modules
(Fig. 55.7): EO for evolutionary algorithms
(population-based metaheuristics), MO for single
solution-based metaheuristics (e.g., local search, tabu
search, simulated annealing), and MOEO for multiobjective
evolutionary algorithms. ParadisEO offers
transparency in the sense that the user does not have to
deal explicitly with parallel programming. One has
just to instantiate the needed ParadisEO components.
The implementation is portable on distributed-memory
machines as well as on shared-memory multiprocessors.
The user does not have to manage the communications
and thread-based concurrency. Moreover, the same
parallel design (i.e., the same program) is portable
over different architectures. Hence, ParadisEO-PEO
has been implemented on different parallel programming
environments and middlewares (MPI, Pthreads,
Condor, Globus, CUDA), which are adapted to different
target architectures (shared and distributed memory,
cluster and network of workstations, desktop and
high-performance grid computing platforms, GPUs)
(Fig. 55.7). The deployment of the presented parallel
and distributed models is transparent for the user.

Fig. 55.7 ParadisEO-PEO implementation under different parallel programming environments and middlewares


55.5 Conclusions and Perspectives


Parallel and distributed computing can be used in the
design and implementation of EAs to speed up the
search, to improve the quality of the obtained solutions,
to improve the robustness, and to solve large-scale
problems. The clear separation between parallel design and
parallel implementation aspects of EAs is important to
analyze parallel EAs. The most important lessons of
this chapter can be summarized as follows:


• In terms of parallel design, the different parallel
models for mono-objective EAs have been unified.
Three hierarchical parallel models have been
extracted: algorithmic-level, iteration-level, and
solution-level parallel models.

• In terms of parallel implementation, the question of
an efficient mapping of a parallel model of EAs on
a given parallel architecture and programming
environment (i.e., language, library, and middleware) is
handled. The focus was on the key criteria of
parallel architectures that influence the efficiency of
an implementation of parallel EAs.

• The use of the ParadisEO-PEO software framework
allows the parallel design of the different
parallel models of EAs. It also allows their transparent
and efficient implementation on different
parallel and distributed architectures (e.g., clusters
and networks of workstations, multicores, GPUs,
high-performance computing and desktop grids) using
suitable programming environments (e.g., MPI,
Threads, Globus, Condor, CUDA).


One of the perspectives in the coming years is to
achieve Petascale performance. The emergence of het-
erogeneous platforms composed of multicore chips and
many-core chips technologies will speed up the achieve-
ment of this goal. In terms of programming models,
cloud computing will become an important alterna-
tive to traditional high-performance computing for the
development of large-scale EAs that harness massive
computational resources. This is a great challenge as 
nowadays cloud frameworks for parallel EAs are just 
emerging. 

In the future design of high-performance comput- 
ers, the ratio between power and performance will be 
increasingly important. The power represents the elec- 
trical power consumption of the computer. An excess in 
power consumption uses unnecessary energy, generates 
waste heat and decreases reliability. Very few vendors 
of high-performance architecture publicize the power 
consumption data compared to the performance data 
(the web site www.green500.org ranks the top 500 ma- 
and quality are more difficult to handle by traditional
sequential approaches. Moreover, parallel models for 
optimization and learning problems under the presence 
of uncertainty have to be deeply investigated. 
56. How to Create Generalizable Results


Thomas Bartz-Beielstein 


Basically, this chapter tries to find answers for the 
following fundamental questions in experimental 
research. 


(Q-1) How can problem instances be generated? 
(Q-2) How can experimental results be generalized? 


The chapter is structured as follows. Sec- 
tion 56.2 introduces real-world and artificial 
optimization problems. Algorithms are described 
in Sect. 56.3. Objective functions and statistical 
models are introduced in Sect. 56.4; these models 
take problem and algorithm features into con- 
sideration. Section 56.5 presents case studies that 
illustrate our methodology. The chapter closes with 
a summary and an outlook. 


56.1 Test Problems in Computational Intelligence


Computational intelligence (CI) methods have gained 
importance in several real-world domains such as pro- 
cess optimization, system identification, data mining, 
or statistical quality control. Tools to determine the ap-
plicability of CI methods in these application domains
in an objective manner are missing. Statistics provide 
methods for comparing algorithms on certain data sets. 
In the past, several test suites were presented and con- 
sidered as state of the art. However, these test suites 
have several drawbacks, namely: 


• Problem instances are mostly artificial and have no
direct link to real-world settings.

• Since there is a fixed number of test instances,
algorithms can be fitted or tuned to this specific
and very limited set of test functions. As a conse-
quence, studies (benchmarks) provide insight into how
these algorithms perform on this specific set of test
instances, but no insight into how they perform in
general.

• Statistical tools for comparisons of several algo-
rithms on several test problem instances are rela-
tively complex and not easy to analyze.


We propose a methodology to overcome these dif- 
ficulties. This methodology, which generates problem 
classes rather than uses one instance, is constructed as 
follows: 


1. First, we pre-process the underlying real-world 
data. 

2. In a second step, features from these data are ex- 
tracted. This extraction relies on the assumption 
that mathematical variables can be used to represent 
real-world features. For example, decomposition 
techniques can be applied to model the underlying 
data structures, if we are using time-series data. The 
original time series is deconstructed into a number
of component series, where each of these reflects 
a certain type of behavior, e.g., a trend or seasonal- 
ity [56.1]. We obtain an analytical model of the data. 

3. Then, we parameterize this model. Based on 
this parametrization and randomization, we can 
generate infinitely many new problem instances. 

4. If no real-world data are available, problem in- 
stances can be generated using test-problem gen- 
erators. The generation of test problems, which 
are well-founded and have practical relevance, has 
been an on-going field of research for several 
decades. 

5. From this infinite set, we can draw a limited num- 
ber of problem instances which will be used for the 
comparison. 

6. Since problem instances are selected randomly, we 
apply random and mixed models for the analy- 
sis [56.2]. Mixed models include fixed and ran- 
dom effects. A fixed effect is an unknown con- 
stant. Its estimation from the data is a common 
practice in analysis of variance (ANOVA) or re- 
gression. A random effect is a random variable. 
We estimate the parameters that describe its dis- 
tribution, because — in contrast to fixed effects — 


it makes no sense to estimate the random effect 
itself. 
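
The idea in step 6 of estimating the distribution of a random effect, rather than the effect itself, can be sketched for the simplest case: one algorithm, $q$ randomly drawn instances, and $r$ runs per instance. The sketch uses the classical ANOVA (method-of-moments) estimator for a balanced one-way random-effects model, which is a swapped-in textbook technique and not necessarily the estimation procedure used in [56.3]. The function name `variance_components` and the data are hypothetical.

```python
from statistics import mean

def variance_components(results):
    """results[i][j]: j-th run on the i-th randomly drawn instance (balanced design).

    Model assumed here: y_ij = mu + tau_i + eps_ij, where tau_i is the random
    instance effect. Returns (estimate of mu, variance of tau, error variance).
    """
    q, r = len(results), len(results[0])
    grand = mean(y for row in results for y in row)
    inst_means = [mean(row) for row in results]
    # Mean square between instances and mean square within instances.
    ms_between = r * sum((m - grand) ** 2 for m in inst_means) / (q - 1)
    ms_within = sum((y - m) ** 2
                    for row, m in zip(results, inst_means)
                    for y in row) / (q * (r - 1))
    # Method-of-moments estimate, truncated at zero.
    sigma2_instance = max(0.0, (ms_between - ms_within) / r)
    return grand, sigma2_instance, ms_within

# Hypothetical results: 3 instances, 2 runs each.
results = [[1, 3], [5, 7], [9, 11]]
grand, sigma2_instance, sigma2_error = variance_components(results)
```

The estimated instance variance describes how strongly performance varies across the problem class, which is the quantity of interest when instances are sampled at random.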


This chapter combines ideas from two approaches: 
problem generation and statistical analysis of com- 
puter experiments. The work presented by Chiarandini 
and Goegebeur [56.3] provides the basis of our sta- 
tistical analysis. They present a systematic and well- 
developed framework for mixed models. Related mod- 
eling approaches were suggested by McGeoch [56.4] 
and Birattari [56.5]. Gallagher and Yuan [56.6] present 
a problem instance (landscape) generator that is pa- 
rameterized by a small number of parameters, and the 
values of these parameters have a direct and intuitive 
interpretation in terms of the geometric features of the 
landscapes that they produce. Castiñeiras et al. [56.7] 
present a parameterizable benchmark generator for bin 
packing instances based on the well-known Weibull dis- 
tribution. Using the shape and scale parameters of the 
Weibull distribution, the authors generate benchmarks 
that contain a variety of item size distributions. They 
report that for all bin capacities, the number of bins re- 
quired in an optimal solution increases as the Weibull 
shape parameter increases. Using this feature, scalabil- 
ity is enabled. 


56.2 Features of Optimization Problems 


56.2.1 Problem Classes and Instances 


Nowadays, it is common practice in optimization to
choose a fixed set of problem instances in advance and
to apply classical ANOVA or regression analysis. In
many experimental studies a few problem instances $\pi_i$
($i = 1, 2, \ldots, q$) are used and the results of some runs of
the algorithms $a_j$ ($j = 1, 2, \ldots$) on these instances are
collected. The instances can be treated as blocks and all
algorithms are run on each single instance. Results are
grouped per instance $\pi_i$. Analyses of these experiments
shed some light on the performance of the algorithms
on those specific instances. However, the interest of the
researcher should not be just the performance of the al-
gorithms on those specific instances chosen, but rather
the generalization of the results to the entire class $\Pi$.
Generalizations about the algorithm's performance on
new problem instances are difficult or impossible in this
setting.

Based on ideas from Chiarandini and Goege-
beur [56.3], to overcome this difficulty, we propose
the following approach: a small set of problem in-
stances $\{\pi_i \in \Pi \mid i = 1, 2, \ldots, q\}$ is chosen at random
from a large set, or class $\Pi$, of possible instances of
the problem. Problem instances are considered as factor
levels. However, this factor is of a different nature from
the fixed algorithmic factors in the classical ANOVA
setting. Indeed, the levels are chosen at random and the
interest is not in these specific levels but in the prob-
lem class $\Pi$ from which they are sampled. Therefore,
the levels and the factor are random. Consequently,
our results are not based on a limited, fixed number of
problem instances. The instances are randomly drawn from an
infinite set, which enables generalization.


56.2.2 Feature Extraction 
and Instance Generation 


A problem class $\Pi$ can be generated in different man-
ners. We will consider artificial and natural problem
class generators. Artificially generated problems allow
feature generation based on some predefined charac-
teristics. They are basically theory driven, i.e., the
researcher defines certain features such as linearity or
multimodality. Based on these features, a model (for-
mula) is constructed. By integrating parameters into this
formula, many problem instances can be generated by
parameter variation. We will exemplify this approach
in the following paragraph. The second way, which will
generate natural problem classes, uses a three-stage ap-
proach. First, the real-world system and its components
are described. Then, features are extracted from the real-
world system. Based on this feature set, a model is
defined. By adding parameters to this model, new problem
instances can be generated. There is also a third way to
generate test instances: if we are lucky, a large amount of data is
available. In this case, we can sample a limited number
of problem instances from the larger set of real-world
data. The statistical analysis is similar for these three
cases.


Artificial Test Functions 

Several problem instance generators have been pro- 
posed over the last years. For example, Gallagher and 
Yuan present a landscape test generator, which can be 
used to set up problem instances for continuous, bound- 
constrained optimization problems [56.6]. The Max-set 
of Gaussian landscape generator (MSG) uses the max- 
imum of m weighted Gaussian functions 


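$$G(x) = \max_{i \in \{1, 2, \ldots, m\}} \bigl( w_i \, g_i(x) \bigr) ,$$

where $g : \mathbb{R}^n \to \mathbb{R}$ denotes an $n$-dimensional Gaussian
function

$$g(x) = \left( \frac{\exp\left( -\tfrac{1}{2} (x - \mu) \Sigma^{-1} (x - \mu)^{\mathsf{T}} \right)}{(2\pi)^{n/2} \, |\Sigma|^{1/2}} \right)^{1/n} ,$$

$\mu$ is an $n$-dimensional vector of means, and $\Sigma$ is an
$(n \times n)$ covariance matrix. The mean of each Gaussian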
corresponds to an optimum on the landscape and the 
location of all optima is known. The global optimum 
is the one with the largest value. We will use the MSG 
problem instance generator in Sect. 56.5 to demonstrate 
our approach. 
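
A minimal sketch of such a landscape generator is given below. It is not the implementation from [56.6]: to stay within the Python standard library it assumes a diagonal covariance matrix, and the sampling ranges for means, variances, and weights as well as the names `gaussian_component` and `make_msg_instance` are hypothetical choices made for illustration.

```python
import math
import random

def gaussian_component(x, mu, var):
    """n-th root of an n-dimensional Gaussian density with diagonal covariance."""
    n = len(x)
    quad = sum((xk - mk) ** 2 / vk for xk, mk, vk in zip(x, mu, var))
    norm = (2 * math.pi) ** (n / 2) * math.sqrt(math.prod(var))
    return (math.exp(-0.5 * quad) / norm) ** (1.0 / n)

def make_msg_instance(m, n, rng, lower=-1.0, upper=1.0):
    """Draw one random landscape: the maximum of m weighted Gaussians in n dimensions."""
    comps = []
    for _ in range(m):
        mu = [rng.uniform(lower, upper) for _ in range(n)]
        var = [rng.uniform(0.05, 0.15) for _ in range(n)]  # assumed variance range
        w = rng.uniform(0.5, 1.0)                          # assumed weight range
        comps.append((w, mu, var))

    def G(x):
        return max(w * gaussian_component(x, mu, var) for w, mu, var in comps)

    return G, comps

rng = random.Random(1)
G, comps = make_msg_instance(m=5, n=2, rng=rng)
```

Calling `make_msg_instance` repeatedly with different random states yields different instances from the same problem class, which is how parameter variation produces many instances from one formula.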


Natural Problem Classes 
This section exemplifies the three fundamental steps for 
generating real-world problem instances, namely: 


1. Describing the real-world system and its data 
2. Feature extraction and model construction 
3. Instance generation. 


We will illustrate this procedure by using the classic 
Box and Jenkins airline data [56.8]. These data contain 


the monthly totals of international airline passengers 
from 1949 to 1961. The feature extraction is based on 
methods from time-series analysis. Because of its sim- 
plicity the Holt-Winters method is popular in many 
application domains. It is able to adapt to changes in 
trends and seasonal patterns. The Holt-Winters predic-
tion function requires the estimation of three param-
eters, i.e., $\alpha$, $\beta$, and $\gamma$, which can be estimated from
the original time-series data. Their optimal values are deter-
mined by minimizing the squared one-step prediction
error. To generate new problem instances, these param- 
eters can be slightly modified. Based on these modified 
values, the model is re-fitted. Finally, we can extract the 
new time series. One typical result from this instance 
generation is shown in Fig. 56.1. Bartz-Beielstein [56.9] 
describes this procedure in detail. 
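
The procedure (fit the three parameters, modify them slightly, re-run the model, extract a new series) can be sketched as follows. Several points are assumptions made for this illustration: an additive Holt-Winters variant with a crude initialization from the first period is used, the parameters are chosen by a coarse grid search rather than a numerical optimizer, the perturbation is uniform noise of width `scale`, and the input is a synthetic series rather than the airline data. None of the function names comes from [56.9].

```python
import itertools
import math
import random

def holt_winters(y, period, alpha, beta, gamma):
    """Additive Holt-Winters; returns one-step-ahead predictions and their SSE."""
    level = sum(y[:period]) / period
    trend = (sum(y[period:2 * period]) - sum(y[:period])) / period ** 2
    season = [v - level for v in y[:period]]  # crude initial seasonal terms
    preds, sse = [], 0.0
    for t in range(period, len(y)):
        s = season[t % period]
        pred = level + trend + s
        preds.append(pred)
        sse += (y[t] - pred) ** 2
        new_level = alpha * (y[t] - s) + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        season[t % period] = gamma * (y[t] - new_level) + (1 - gamma) * s
        level = new_level
    return preds, sse

def fit(y, period, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Coarse grid search for (alpha, beta, gamma) minimizing the one-step SSE."""
    return min(itertools.product(grid, repeat=3),
               key=lambda p: holt_winters(y, period, *p)[1])

def new_instance(y, period, params, rng, scale=0.05):
    """Perturb the fitted parameters slightly and re-run the model."""
    perturbed = [min(1.0, max(0.0, p + rng.uniform(-scale, scale))) for p in params]
    return holt_winters(y, period, *perturbed)[0], perturbed

# Synthetic seasonal series with trend (an assumption; not the airline data).
period = 12
y = [100 + 2 * t + 10 * math.sin(2 * math.pi * t / period) for t in range(48)]
params = fit(y, period)
base, _ = holt_winters(y, period, *params)
rng = random.Random(42)
instance, perturbed = new_instance(y, period, params, rng)
```

Each call to `new_instance` with a fresh random draw gives another time series that shares the trend and seasonal structure of the original data, so the set of obtainable instances is effectively unlimited.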

To illustrate the wide applicability of this approach,
we will list further real-world problem domains, which
are the subject of our current research:


@ Smart metering: The development of accurate fore- 
casting methods for electrical energy consumption 
profiles is an important task. We consider time se- 
ries collected from a manufacturing process. Each 
time series contains quarter-hourly samples of the 
energy consumption of a bakery. A detailed data de- 
scription can be found in [56.10]. 

© Water industry: Canary is a software developed by 
the United States Environmental Protection Agency 
(US EPA) and Sandia National Laboratories. Its 
purpose is to detect events in the context of wa- 
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Fig. 56.1 Holt—Winters problem instance generator. The solid line 
represents the real data, the dotted line predictions from the Holt- 
Winters model and the fine dotted line modified predictions, respec- 
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ter contamination. An event is in this context de- 
fined as a certain time period where a contaminant 
significantly deteriorates the water quality. Dis- 
tinguishing events from (i) background changes, 
(Gii) maintenance and modification due to operation, 
and (iii) outliers is an essential task, which was 
implemented in the Canary software. Therefore, 
deviations are compared to regular patterns and 
short term changes. The corresponding data con- 
tains multi-variate time-series data. It is a selection 
from a larger dataset shipped with the open source 
event-detection software CANARY developed by 
US EPA and Sandia National Laboratories [56.11]. 
• Finance: The data are real-world data from intraday foreign
  exchange (FX) trading. The FX market is a financial market for
  trading currencies to enable international trade and investment. It
  is the largest and most liquid financial market in the world.
  Currencies can be traded via a wide variety of different financial
  instruments, ranging from simple spot trades to highly complex
  derivatives. We use three foreign exchange (currency rate) time
  series collected from Bloomberg. Each time series contains hourly
  samples of the change in currency exchange rate [56.12].

One typical goal in forecasting is the minimization of the forecast
errors, i.e., the differences between real (observed) values, say
$y_i$, and predicted values, say $\hat{y}_i$. This goal can be
considered as an optimization problem.

As stated in Sect. 56.2.2, the statistical analysis is similar for
artificial and natural problem classes. Our goal can be stated as
follows: For a given problem class $\Pi$, which can be artificial or
natural, we try to determine whether an optimization algorithm
$\alpha$ or several algorithm instances $\alpha_j$ show similar
behavior on randomly selected problem instances $\pi_i \in \Pi$. This
question will be formulated as a statistical hypothesis. Based on the
related statistical framework, we can determine confidence intervals
for the performance of the algorithm on unseen problem instances.


56.3 Algorithm Features


56.3.1 Factors and Levels

Evolutionary algorithms (EA) belong to the large class of bio-inspired
search heuristics. They combine specific components, which may be
qualitative, like the recombination operator, or quantitative, like
the population size. Our interest is in understanding the contribution
of these components. In statistical terms, these components are called
factors. The interest is in the effects of the specific levels chosen
for these factors. Hence, we say that the levels and, consequently,
the factors are fixed. Although modern search techniques like
sequential parameter optimization or Pareto genetic programming
[56.13] allow multi-objective performance measures (solution quality
versus variability or description length), we restrict ourselves to
analyzing the effect of these factors on a univariate measure of
performance. We will use the quality of the solutions returned by the
algorithm at termination as the performance measure.


56.3.2 Example: Evolution Strategy

Evolution strategies (ES) are prominent representatives of
evolutionary algorithms, which also include genetic algorithms and
genetic programming [56.15]. They can be classified as generic
population-based metaheuristic optimization algorithms for global
optimization that in some sense mimic natural evolution. Evolution
strategies are applied to hard real-valued optimization problems.
Mutation is performed by adding a normally distributed random value
to each
vector component. The standard deviation of these random values is
modified by self-adaptation. Evolution strategies can use a population
of several solutions. Each solution is considered as an individual and
consists of object and strategy variables. Object variables represent
the position in the search space, whereas strategy variables store the
step sizes, i.e., the standard deviations for the mutation. We analyze
the basic ES variant, which was proposed in [56.14].

Fig. 56.2 The evolutionary cycle, basic working scheme of all ES and
EA. Terms common for describing evolution strategies are used,
alternative terms are added below in brown

Table 56.1 Settings of exogenous parameters of an ES. Recombination
operators are labeled as follows: 1 = no, 2 = dominant,
3 = intermediate, 4 = intermediate as in [56.14]. Mutation uses the
following encoding: 1 = no mutation, 2 = self-adaptive mutation

Parameter | Symbol | Name | Range | Value
mue | $\mu$ | Number of parent individuals | $\mathbb{N}$ | 5
nu | $\nu = \lambda/\mu$ | Offspring-parent ratio | $\mathbb{R}_+$ | 2.0
sigmaInit | $\sigma_i^{(0)}$ | Initial standard deviations | $\mathbb{R}_+$ | 1.0
nSigma | $n_\sigma$ | Number of standard deviations. $d$ denotes the problem dimension | $\{1, d\}$ | 1
 | $c_\tau$ | Multiplier for individual and global mutation parameters | $\mathbb{R}_+$ | 1.0
tau0 | | | $\mathbb{R}_+$ | 0.0
tau | | | $\mathbb{R}_+$ | 1.0
rho | $\rho$ | Mixing number | $\{1, \mu\}$ | 2
sel | $\kappa$ | Maximum age | $\mathbb{R}_+$ | 1.0
mutation | | Mutation | $\{1, 2\}$ | 2
sreco | $r_\sigma$ | Recombination operator for strategy variables | $\{1, 2, 3, 4\}$ | 3
oreco | $r_x$ | Recombination operator for object variables | $\{1, 2, 3, 4\}$ | 2
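The mutation of object and strategy variables described above can be sketched as follows. This is a minimal illustration under stated assumptions: the log-normal update of the step sizes is the common textbook form of self-adaptation and is not taken from this chapter, and the names `mutate`, `tau_global`, and `tau_local` are hypothetical.

```python
import math
import random


def mutate(x, sigma, tau_global, tau_local, rng):
    """Self-adaptive Gaussian mutation of one ES individual.

    x     -- object variables (position in the search space)
    sigma -- strategy variables (one step size per coordinate)
    """
    # One random factor shared by all coordinates of this individual.
    shared = tau_global * rng.gauss(0.0, 1.0)
    # Step sizes are mutated first (log-normal rule, assumed here) ...
    new_sigma = [s * math.exp(shared + tau_local * rng.gauss(0.0, 1.0))
                 for s in sigma]
    # ... and then used to perturb each vector component.
    new_x = [xi + si * rng.gauss(0.0, 1.0) for xi, si in zip(x, new_sigma)]
    return new_x, new_sigma


rng = random.Random(42)
child_x, child_sigma = mutate([0.5, -0.5], [1.0, 1.0], 0.5, 0.7, rng)
```

Because the step sizes are multiplied by a positive factor, they stay positive without any explicit bound checking.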

Mutation means neighborhood-based movement in search space, which
includes the exploration of the outer space currently not covered by
a population, whereas recombination rearranges existing information
and so focuses on the inner space. Selection is meant to introduce
a bias towards better fitness values. A concrete ES may contain
specific mutation, recombination, or selection operators, or call them
only with a certain probability, but the control flow is usually left
unchanged. Each of the consecutive cycles is termed a generation. The
control flow is shown in Fig. 56.2.

Concerning the representation, it should be noted that most empirical
studies are based on canonical forms such as binary strings or
real-valued vectors, whereas many real-world applications require
specialized, problem-dependent ones. Table 56.1 summarizes important
ES parameters. This chapter presents two case studies. The first case
study is based on a fixed ES parameter setting, whereas the second
case study modifies the recombination operator for object variables.
We are convinced that the applicability of the methods presented in
this chapter goes far beyond the simplified case studies. Our main
contribution is a framework that allows conclusions that are not
limited to a small number of problem instances but extend to problem
classes.


56.4 Objective Functions

We will use the following optimization framework: an ES is applied as
a minimizer on the test function $f(x)$. Formally speaking, let $S$
denote some set, e.g., $S \subseteq \mathbb{R}^n$. We are seeking
values $f^*$ and $x^*$ such that $f^* = \min_{x \in S} f(x)$ and
$x^* = \arg\min_{x \in S} f(x)$. This approach can be extended in many
ways. For example, if $S$ denotes time-series data, then an
optimization algorithm can be applied to minimize the empirical mean
squared prediction error.

Test problem instances will be drawn from Gallagher's and Yuan's MSG
test function generator. The following parameters can be used to
specify the MSG generator:


• The number of Gaussian components $m$.
• The mean vector $\mu$ of each component.
• The covariance matrix $\Sigma$ of each component.
• The weight $w_i$ of each component.
• A maximum threshold $t \in [0, 1]$ can be specified for local
  optima and the fitness value of the global optimum $G^*$. Local
  optima are randomly generated within $[0, t \times G^*]$.


The following tuple can be used to specify an MSG generator

$$\Pi := \left([-c, c]^n,\; n,\; m,\; D_\mu,\; \{D_\Sigma\},\; \{t, G^*\}\right), \tag{56.1}$$

where $c \in \mathbb{R}$ defines the boundary constraints of the
search space, $n$ the search space dimensionality, $m$ the number of
Gaussian components, $D_\mu$ the distribution used to generate the
mean vectors of components, $D_\Sigma$ the distribution or procedures
used to generate covariances of components, $t \in [0, 1]$ the
threshold for local optima, and $G^*$ the function value of the global
optimum.

Fig. 56.3a-i Nine test problem instances from $\Pi_{\text{MSG}}$,
generated with the MSG landscape generator as specified in (56.2).
These figures exemplify how numbers and locations of the randomly
generated optima can vary. Usually, the optima are evenly distributed
in the search space. In some settings, there are a few dominating
optima as can be seen in part (g)


Based on (56.1), we have specified the following MSG landscape
generator for our experiments

$$\Pi_{\text{MSG}} := \left([-1, 1]^2,\; 2,\; 10,\; U[-1, 1]^2,\; \{U[0.05, 0.15],\, U[-\pi/4, \pi/4]\},\; \{0.8, 1\}\right). \tag{56.2}$$

With this setting, the mean vector of each component is generated
randomly within $[-1, 1]^2$. The covariance matrix of each component
is generated with the procedure $D_\Sigma$ in three steps:

1. A diagonal matrix $S$ with eigenvalues is generated.
2. An orthogonal matrix $T$ is generated through $n(n-1)/2$ rotations
   with random angles in $[-\pi/4, \pi/4]$.
3. The covariance matrix is generated as $T^\top S T$.
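For the two-dimensional setting of (56.2), the generator can be sketched as follows. This is an illustration under assumptions, not the original implementation: the landscape value is taken to be the maximum over weighted, unnormalized Gaussian components; $U[0.05, 0.15]$ is assumed to draw the diagonal entries of $S$; the component weights follow the threshold and global optimum value $\{0.8, 1\}$ from (56.2); and all function names are hypothetical.

```python
import math
import random


def make_component(rng):
    """Draw one 2-D Gaussian component following the three steps above."""
    mean = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
    s1, s2 = rng.uniform(0.05, 0.15), rng.uniform(0.05, 0.15)  # diagonal of S
    angle = rng.uniform(-math.pi / 4, math.pi / 4)  # n(n-1)/2 = 1 rotation
    c, s = math.cos(angle), math.sin(angle)
    # Covariance T^T S T for the rotation T = [[c, -s], [s, c]].
    off = c * s * (s2 - s1)
    cov = ((c * c * s1 + s * s * s2, off),
           (off, s * s * s1 + c * c * s2))
    return mean, cov


def make_instance(rng, m=10, threshold=0.8, g_star=1.0):
    """One problem instance: m components; the first one is the global optimum."""
    components = [make_component(rng) for _ in range(m)]
    weights = [g_star] + [rng.uniform(0.0, threshold * g_star)
                          for _ in range(m - 1)]
    return components, weights


def msg_value(x, components, weights):
    """Landscape value: maximum over the weighted Gaussian components."""
    best = 0.0
    for (mean, cov), w in zip(components, weights):
        dx, dy = x[0] - mean[0], x[1] - mean[1]
        det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
        # Quadratic form (x - mean)^T cov^{-1} (x - mean) for a 2 x 2 matrix.
        quad = (cov[1][1] * dx * dx - 2.0 * cov[0][1] * dx * dy
                + cov[0][0] * dy * dy) / det
        best = max(best, w * math.exp(-0.5 * quad))
    return best


def objective(x, components, weights, g_star=1.0):
    """Distance to the optimum, to be minimized."""
    return g_star - msg_value(x, components, weights)


components, weights = make_instance(random.Random(124))
value_at_origin = objective((0.0, 0.0), components, weights)
```

Calling `make_instance` with different random states yields different members of the same problem class, which is what the case studies below sample from.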


The weight $w_i$ of the component corresponding to the global optimum
is set to 1, while the other weights are randomly generated within
$[0, 0.8]$. The nine problem instances $\pi_i \in \Pi_{\text{MSG}}$
($i = 1, \ldots, 9$) from Fig. 56.3 were generated with this
parametrization.

Note that we are using the distance to the optimum as an objective
function in our experiments. Our objective function reads
$G^* - f(x)$, because we are considering minimization problems. Other
measures of interest might be the gap percent of optimality

$$\frac{G^* - f(x)}{G^*} \times 100$$

or computation time, etc., see, e.g., [56.16].


56.5 Case Studies

Bartz-Beielstein [56.9] introduced the following acronyms for
classifying optimization designs [56.17]:

• SASP: one single algorithm and one single problem instance
• SAMP: one single algorithm and multiple problem instances
• MASP: multiple algorithms and one single problem instance
• MAMP: multiple algorithms and multiple problem instances


56.5.1 Single Problem Designs: SASP and MASP

In SASP we analyze the performance of an optimization algorithm
$\alpha$ on a single problem instance $\pi$. An optimization problem
has a set of input data which instantiate the problem. This might be
a function in continuous optimization or the locations and distances
between cities in a traveling salesman problem. In the following, we
will use $Y$ to denote the random performance measure obtained by $r$
runs of algorithm $\alpha$ on problem instance $\pi$. Because many
optimization algorithms such as evolutionary algorithms are
randomized, their performance $Y$ on one instance is a random
variable. It might be described by a probability density/mass function
$p(y \mid \pi)$. Running the algorithm with different random seeds on
one problem instance, we collect sample data $y_1, \ldots, y_r$, which
are independent and identically distributed (i.i.d.).

There are situations in which SASP is the method of first choice.
Real-world problems, which have to be solved only once in a very
limited time, are good examples for using SASP optimizations. MASP
shares several characteristics with SASP. Because of their limited
capacities for generalization, SASP and MASP will not be investigated
further in this study.


56.5.2 SAMP: Single Algorithm, Multiple 
Problems 


Fixed-Effects Models
This setup is commonly used for testing an algorithm on a given
(fixed) set of problem instances. Standard assumptions from analysis
of variance (ANOVA) lead us to propose the following fixed-effects
model [56.2]

$$Y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \tag{56.3}$$

where $\mu$ is an overall mean, $\tau_i$ is a parameter unique to the
$i$-th treatment (problem instance factor), and $\varepsilon_{ij}$ is
a random error term for replication $j$ on problem instance $i$.
Usually, the model errors $\varepsilon_{ij}$ are assumed to be
normally and independently distributed with mean zero and variance
$\sigma^2$. If problem instance factors are considered fixed, i.e.,
nonrandom, the stochastic behavior of the response variable originates
from the algorithm. This implies the experimental results

$$Y_{ij} \sim \mathcal{N}(\mu + \tau_i, \sigma^2), \quad i = 1, \ldots, q, \; j = 1, \ldots, r, \tag{56.4}$$

and that the $Y_{ij}$ are mutually independent. Results from
statistical analyses remain valid only on the specific instances.
Furthermore, SAMP with a fixed set of problem instances is subject to
criticism, e.g., that algorithms are trained for this specific setup
of test instances (overfitting).

In order to make the results of the analysis independent of the
specific instances and dependent instead on the class of instances
from which the specific instances are drawn, Chiarandini and Goegebeur
propose randomized and mixed models for the experimental analysis of
optimization algorithms as an extension of (56.3) [56.3]. In contrast
to model (56.3), these models allow generalizations of results to the
whole class of instances.


Randomized Models
In the following, we consider a population or class of instances
$\Pi$. The class $\Pi$ consists of a large, possibly infinite, number
of problem instances $\pi_i$, $i = 1, 2, 3, \ldots$ Let $p(\pi)$
denote the probability of sampling instance $\pi$. The performance $Y$
of the algorithm $\alpha$ on the class $\Pi$ is described by the
probability function

$$p(y) = \sum_{\pi \in \Pi} p(y \mid \pi)\, p(\pi). \tag{56.5}$$

If we run an algorithm $\alpha$ $r$ times on instance $\pi$, then we
receive $r$ replicates of $\alpha$'s performance, denoted by
$Y_1, \ldots, Y_r$. These $r$ observations are i.i.d., i.e.,

$$p(y_1, \ldots, y_r \mid \pi) = \prod_{j=1}^{r} p(y_j \mid \pi). \tag{56.6}$$

So far, we have considered $r$ replicates of the performance measure
$Y$ on one problem instance $\pi$. Now we consider several randomly
sampled problem instances. Over all the instances, the joint
probability distribution of the observed performance measures is
obtained by marginalizing over all instances

$$p(y_1, \ldots, y_r) = \sum_{\pi \in \Pi} p(y_1, \ldots, y_r \mid \pi)\, p(\pi). \tag{56.7}$$

Extending the model (56.7) to the case where one algorithm with
several parameter settings or several algorithms are analyzed leads to
mixed models, which will be discussed in Sect. 56.5.3.


Example SAMP: ES on $\Pi_{\text{MSG}}$ (Random-Effects Design)
The simplest random-effects experiment is performed as follows. For
$i = 1, \ldots, q$, a problem instance $\pi_i$ is drawn randomly from
the class of problem instances $\Pi$. On each of the sampled $\pi_i$,
the algorithm $\alpha$ is run $r$ times using different seeds for
$\alpha$. Due to $\alpha$'s stochastic nature, we obtain,
conditionally on the sampled instance, $r$ replications of the
performance measure that are i.i.d.
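The sampling scheme of this design can be sketched as a small experiment harness. The sketch is illustrative only: `run_samp`, `toy_instance`, and `toy_algorithm` are hypothetical names, and the toy stand-ins replace the MSG generator and the ES used in the chapter.

```python
import random


def run_samp(sample_instance, run_algorithm, q, r, seed=0):
    """Collect a q x r table: q randomly drawn instances, r seeded runs each."""
    rng = random.Random(seed)
    results = []
    for _ in range(q):
        instance = sample_instance(rng)  # one instance drawn from the class
        runs = [run_algorithm(instance, rng.randrange(2 ** 31))
                for _ in range(r)]       # r replications with different seeds
        results.append(runs)
    return results


# Toy stand-ins (assumptions): an "instance" is a single number and the
# "algorithm" returns that number plus seed-dependent noise.
def toy_instance(rng):
    return rng.uniform(-1.0, 1.0)


def toy_algorithm(instance, seed):
    return instance + random.Random(seed).gauss(0.0, 0.1)


y = run_samp(toy_instance, toy_algorithm, q=9, r=10)
```

The nested list `y[i][j]` then plays the role of the observations on instance $i$ in replication $j$ that the following analysis is based on.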


Let $Y_{ij}$ ($i = 1, \ldots, q$; $j = 1, \ldots, r$) denote the
random performance measure obtained in the $j$-th replication of
$\alpha$ on $\pi_i$. We are interested in drawing conclusions about
$\alpha$'s performance on a larger set of problem instances from $\Pi$
and not just on those $q$ problem instances included in the
experiment. A systematic approach to accomplish this task comprises
the following steps:

• SAMP-1 algorithm and problem instances
• SAMP-2 ANOVA and restricted maximum likelihood estimator (REML)
  model building
• SAMP-3 validation of the model assumptions
• SAMP-4 hypothesis testing
• SAMP-5 confidence intervals and prediction.


SAMP-1 Algorithm and Problem Instances. The goal of this case study is
to analyze whether one algorithm shows a similar performance on
a class of problem instances, say $\Pi_{\text{MSG}}$. A random-effects
design will be used to model the results. We illustrate the
decomposition of the variance of the response values into (i) the
variance due to the problem instance and (ii) the variance due to the
algorithm and derive results, which are based on hypothesis testing as
introduced in (56.12).

We consider one algorithm, an ES, which is run $r = 10$ times on a set
of randomly generated problem instances. The ES is parameterized with
the default setting from Table 56.1. These parameters are kept
constant during the experiment. Nine instances are drawn from the set
of problem instances $\Pi_{\text{MSG}}$. Problem instances were
generated with the MSG landscape generator as specified in (56.2). The
corresponding problem instances are shown in Fig. 56.3.

The null hypothesis reads "There is no instance effect." Since we are
considering the SAMP case, our experiment is based on one ES instance
only. There are 90 observations, because 10 repeats were performed on
9 problem instances. Figure 56.4 shows the performance of the ES on
these nine instances. The variable fSeed is used to denote the problem
instance number $i$.


SAMP-2 ANOVA and REML Model Building.

ANOVA Model Building. The following analysis is based on the linear
statistical model

$$Y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \quad i = 1, \ldots, q, \; j = 1, \ldots, r, \tag{56.8}$$

where $\mu$ is an overall mean and $\varepsilon_{ij}$ is a random
error term for replication $j$ on instance $i$. Note that in contrast
to the fixed-effects model from (56.3), $\tau_i$ is a random variable
representing the effect of instance $i$. The stochastic behavior of
the response variable originates from both the instance and the
algorithm. This is reflected in (56.8), where both $\tau_i$ and
$\varepsilon_{ij}$ are random variables. The model (56.8) is the
so-called random-effects model, cf. [56.2] or [56.3].

Fig. 56.4a,b Performance of the ES on nine test problem instances.
(a) Problem instances plotted versus algorithm performance.
(b) Problem instances plotted against logarithmic performance. Smaller
values are better

We assume that $\tau_1, \ldots, \tau_q$ are i.i.d.
$\mathcal{N}(0, \sigma_\tau^2)$ and $\varepsilon_{ij}$,
$i = 1, \ldots, q$, $j = 1, \ldots, r$, are i.i.d.
$\mathcal{N}(0, \sigma^2)$. If $\tau_i$ is independent of
$\varepsilon_{ij}$ and has variance $V(\tau_i) = \sigma_\tau^2$, the
variance of any observation is $V(Y_{ij}) = \sigma^2 + \sigma_\tau^2$.
Similar to the partition in classical ANOVA, the variability in the
observations can be partitioned into a component that measures the
variation between treatments and a component that measures the
variation within treatments. Based on the fundamental ANOVA identity
$SS_{\text{total}} = SS_{\text{treat}} + SS_{\text{err}}$, we define

$$MS_{\text{treat}} = \frac{SS_{\text{treat}}}{q-1} = \frac{r \sum_{i=1}^{q} (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2}{q-1}$$

and

$$MS_{\text{err}} = \frac{SS_{\text{err}}}{q(r-1)} = \frac{\sum_{i=1}^{q} \sum_{j=1}^{r} (Y_{ij} - \bar{Y}_{i\cdot})^2}{q(r-1)},$$

where $\bar{Y}_{i\cdot}$ denotes the mean of the $r$ observations on
instance $i$ and $\bar{Y}_{\cdot\cdot}$ the overall mean. It can be
shown that

$$E(MS_{\text{treat}}) = \sigma^2 + r\sigma_\tau^2 \quad \text{and} \quad E(MS_{\text{err}}) = \sigma^2, \tag{56.9}$$

cf. [56.2]. Therefore, the estimators of the variance components are

$$\hat{\sigma}^2 = MS_{\text{err}}, \tag{56.10}$$

$$\hat{\sigma}_\tau^2 = \frac{MS_{\text{treat}} - MS_{\text{err}}}{r}. \tag{56.11}$$


The corresponding ANOVA table is shown in Table 56.2. Based on ANOVA
calculations, (56.11) yields the estimate of the instance variance
component, $\hat{\sigma}_\tau^2 = -0.4848257$, and (56.10) yields the
estimate of the error variance component, $\hat{\sigma}^2 = 11.32854$.
The model variance can be determined as
$\hat{\sigma}^2 + \hat{\sigma}_\tau^2 = 10.84372$. The mean
$\hat{\mu} = -12.05554$ from (56.8) can be extracted. Finally, the
p value in the ANOVA table is calculated as 0.7979083.

Note that we have obtained a negative variance estimate. Since
negative variances are not feasible, we can set the negative estimate
to zero and proceed with these modified values. A more elegant way is
presented in the following.

Table 56.2 ANOVA table for one-factor fixed-effects and random-effects
models

Source of variation | Sum of squares | Degrees of freedom | Mean square | EMS fixed | EMS random
Treatment | $SS_{\text{treat}}$ | $q-1$ | $MS_{\text{treat}}$ | $\sigma^2 + \frac{r \sum_{i=1}^{q} \tau_i^2}{q-1}$ | $\sigma^2 + r\sigma_\tau^2$
Error | $SS_{\text{err}}$ | $q(r-1)$ | $MS_{\text{err}}$ | $\sigma^2$ | $\sigma^2$
Total | $SS_{\text{total}}$ | $qr-1$ | | |
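The moment estimators (56.10) and (56.11), including the truncation of a negative estimate at zero, can be sketched as follows for a balanced design. The function name and the returned dictionary keys are illustrative, and the two tiny data sets are made up for the example.

```python
def variance_components(y):
    """ANOVA (method of moments) estimates for a balanced one-factor design.

    y[i][j] is the j-th replicate on instance i.
    """
    q, r = len(y), len(y[0])
    row_means = [sum(row) / r for row in y]
    grand_mean = sum(row_means) / q
    ss_treat = r * sum((m - grand_mean) ** 2 for m in row_means)
    ss_err = sum((v - m) ** 2 for row, m in zip(y, row_means) for v in row)
    ms_treat = ss_treat / (q - 1)
    ms_err = ss_err / (q * (r - 1))
    sigma2_tau = (ms_treat - ms_err) / r          # (56.11), may be negative
    return {
        "mean": grand_mean,
        "ms_treat": ms_treat,
        "ms_err": ms_err,
        "sigma2": ms_err,                         # (56.10)
        "sigma2_tau": sigma2_tau,
        "sigma2_tau_truncated": max(0.0, sigma2_tau),
    }


# Instances differ clearly: positive instance variance component.
clear = variance_components([[1.0, 3.0], [5.0, 7.0]])
# Instance means coincide: the raw estimate becomes negative.
flat = variance_components([[0.0, 4.0], [1.0, 3.0]])
```

The second data set reproduces the situation described above in miniature: the within-instance mean square exceeds the between-instance mean square, so the raw estimate is negative and only the truncated value is a feasible variance.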


Restricted Maximum Likelihood. In some cases, the standard ANOVA,
which was used in our example, produces a negative estimate of
a variance component. This can be seen in (56.11): if
$MS_{\text{err}} > MS_{\text{treat}}$, negative values occur. By
definition, variance components are nonnegative. Methods that always
yield nonnegative variance components have been developed. Here, we
will use restricted maximum likelihood (REML) estimators. The ANOVA
method of variance component estimation, which is a method of moments
procedure, and REML estimation may lead to different results. Output
from an R-based analysis with the function lmer from the package lme4
reads as follows (fSeed denotes the problem instance) [56.18]:


Linear mixed model fit by REML
Formula: yLog ~ 1 + (1 | fSeed)
   Data: samp.df
   AIC   BIC logLik deviance REMLdev
 475.6 483.1 -234.8    469.3   469.6
Random effects:
 Groups   Name        Variance Std.Dev.
 fSeed    (Intercept)  0.000   0.0000
 Residual             10.893   3.3004
Number of obs: 90, groups: fSeed, 9

Fixed effects:
            Estimate Std. Error t value
(Intercept) -12.0555     0.3479  -34.65


Compared to the ANOVA setting, different values for $\hat{\sigma}^2$,
$\hat{\sigma}_\tau^2$, and $\hat{\mu}$ were obtained. However, the
REML-based analysis also shows that the variability in the response
observations can be attributed to the variability of the algorithm.


SAMP-3 Validation of the Model Assumptions. Before performing
hypothesis testing based on the models introduced in SAMP-2, the
validity of the model assumptions has to be investigated. If the model
is adequate, the residuals should exhibit no structure. Residuals are
plotted against fitted values to check the assumption of
homoscedasticity, and quantile-quantile (Q-Q) plots are used to check
whether the residuals meet the normality assumption. Q-Q plots of the
residuals are shown in Fig. 56.5 for the raw and the log-transformed
responses. These plots provide a good way to compare the distribution
of a sample with a theoretical distribution. Large deviations from the
line indicate non-normality of the sample data. These Q-Q plots
indicate that a log transformation of the response might be useful in
our setting.


Fig. 56.5a,b (a) Q-Q plot of the residuals for raw data. (b) Q-Q plot
for the log-transformed responses

SAMP-4 Hypothesis Testing. Testing hypotheses about individual
treatments (instances) is useless because the problem instances
$\pi_i$ are here considered as samples from some larger population of
instances $\Pi$. We test hypotheses about the variance component
$\sigma_\tau^2$, i.e., the null hypothesis

$$H_0: \sigma_\tau^2 = 0 \quad \text{versus} \quad H_1: \sigma_\tau^2 > 0. \tag{56.12}$$

Under $H_0$, the algorithm performance is identical on all problem
instances (all treatments are identical), i.e., $r\sigma_\tau^2$ is
very small. Based on (56.9), we conclude that
$E(MS_{\text{treat}}) = \sigma^2 + r\sigma_\tau^2$ and
$E(MS_{\text{err}}) = \sigma^2$ are similar. Under the alternative,
variability exists between treatments. Standard analysis shows that
$SS_{\text{err}}/\sigma^2$ is distributed as chi-square with $q(r-1)$
degrees of freedom. Let $F_{u,v}$ denote the $F$ distribution with $u$
numerator and $v$ denominator degrees of freedom. Under $H_0$, the
ratio

$$F_0 = \frac{SS_{\text{treat}}/(q-1)}{SS_{\text{err}}/(q(r-1))} = \frac{MS_{\text{treat}}}{MS_{\text{err}}}$$

is distributed as $F_{q-1,\,q(r-1)}$. To test hypotheses in (56.8), we
require that $\tau_1, \ldots, \tau_q$ are i.i.d.
$\mathcal{N}(0, \sigma_\tau^2)$, that $\varepsilon_{ij}$,
$i = 1, \ldots, q$, $j = 1, \ldots, r$, are i.i.d.
$\mathcal{N}(0, \sigma^2)$, and that all $\tau_i$ and
$\varepsilon_{ij}$ are independent of each other. These considerations
lead to the decision rule to reject $H_0$ at the significance level
$\alpha$ if

$$f_0 > F(1-\alpha;\; q-1,\; q(r-1)), \tag{56.13}$$

where $f_0$ is the realization of $F_0$ from the data observed. An
intuitive motivation for the form of the statistic $F_0$ can be
obtained from the expected mean squares. Under $H_0$, both
$MS_{\text{treat}}$ and $MS_{\text{err}}$ estimate $\sigma^2$ in an
unbiased way, and $F_0$ can be expected to be close to one. On the
other hand, large values of $F_0$ give evidence against $H_0$.

Regarding the SAMP case, we obtain the following values: Based on
(56.9) and (56.13), we can determine the F statistic and the p value.
We get $MS_{\text{treat}} = MS_{\text{err}} = 10.89275$ and $f_0 = 1$,
which results in a large p value: 0.4426363. The null hypothesis
$H_0: \sigma_\tau^2 = 0$ from (56.12) cannot be rejected, i.e., we
conclude that there is no instance effect. A similar conclusion was
obtained from the ANOVA method of variance component estimation as
introduced in Table 56.2.
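The test can be sketched with the standard library only. The upper tail probability of the F distribution is expressed through the regularized incomplete beta function, which is evaluated here with a textbook continued-fraction scheme; this is an assumption of the sketch and not necessarily how the p values above were computed, and the helper names are hypothetical.

```python
import math


def _betacf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even and odd steps of the recurrence.
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def _betai(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def f_test(ms_treat, ms_err, q, r):
    """Statistic f0 and p value for H0: sigma_tau^2 = 0 (balanced design)."""
    f0 = ms_treat / ms_err
    d1, d2 = q - 1, q * (r - 1)
    # P(F > f0) = I_{d2 / (d2 + d1 f0)}(d2 / 2, d1 / 2)
    p_value = _betai(d2 / 2.0, d1 / 2.0, d2 / (d2 + d1 * f0))
    return f0, p_value


f0, p_value = f_test(10.89275, 10.89275, q=9, r=10)
```

With the mean squares quoted above, `f0` equals 1 and the p value is far above any common significance level, so the decision rule (56.13) does not reject $H_0$.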


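SAMP-5 Confidence Intervals and Prediction. An unbiased estimator of
the overall mean $\mu$ is

$$\bar{y}_{\cdot\cdot} = \frac{1}{qr} \sum_{i=1}^{q} \sum_{j=1}^{r} y_{ij}.$$

Its variance is given by

$$V(\bar{y}_{\cdot\cdot}) = V\!\left(\frac{1}{qr} \sum_{i=1}^{q} \sum_{j=1}^{r} Y_{ij}\right) = \frac{r\sigma_\tau^2 + \sigma^2}{qr}.$$

With (56.9) and (56.10), we obtain an estimator of the variance of the
overall mean $\mu$ as

$$\hat{V}(\bar{y}_{\cdot\cdot}) = \frac{MS_{\text{treat}}}{qr}.$$

Since

$$\frac{\bar{y}_{\cdot\cdot} - \mu}{\sqrt{MS_{\text{treat}}/(qr)}} \sim t_{q-1},$$

the confidence limits for $\mu$ can be derived as

$$\bar{y}_{\cdot\cdot} \pm t_{1-\alpha/2;\,q-1} \sqrt{\frac{MS_{\text{treat}}}{qr}}. \tag{56.14}$$

We conclude the SAMP case study with prediction of the algorithm's
performance on a new instance from the same class. Based on (56.14),
we obtain the following 95% confidence interval:
$[2.6773 \times 10^{-6};\; 1.262 \times 10^{-5}]$. Again, confidence
intervals from the REML and ANOVA methods are very similar.
Summarizing, we can conclude that the ES performs similarly on
instances from $\Pi_{\text{MSG}}$, which were generated with (56.2).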
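The confidence limits (56.14) amount to a one-line computation once the $t$ quantile is known. In the sketch below the quantile is passed in by the caller (2.306 is the 0.975 quantile of the $t$ distribution with 8 degrees of freedom); the function name is hypothetical, the numbers are illustrative, and the result is on the log scale of the response, so it need not reproduce the interval reported above.

```python
import math


def mean_confidence_interval(y_bar, ms_treat, q, r, t_quantile):
    """Confidence limits for the overall mean in a balanced random-effects design."""
    half_width = t_quantile * math.sqrt(ms_treat / (q * r))
    return y_bar - half_width, y_bar + half_width


# Log-scale mean and mean square from the SAMP case study, q = 9, r = 10.
lower, upper = mean_confidence_interval(-12.0555, 10.89275, q=9, r=10,
                                        t_quantile=2.306)
# Assumption: limits on the original scale follow by exponentiation.
lower_raw, upper_raw = math.exp(lower), math.exp(upper)
```

Note that the width depends on $MS_{\text{treat}}$ and not on $MS_{\text{err}}$, because the instances, not the single runs, are the sampling units for the class mean.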


56.5.3 MAMP: Multiple Algorithms, Multiple 
Problems 


In the MAMP case study, fixed effects are included in the conditional
structure of (56.6), which leads to a mixed model. Instead of one
fixed algorithm as in the SAMP case, we consider either several
algorithms or algorithms with several parameters. Both situations can
be treated while considering algorithms as levels of a fixed factor,
whereas problem instances are drawn randomly from the population of
instances $\Pi_{\text{MSG}}$:

• MAMP-1 algorithm and problem instances
• MAMP-2 ANOVA and REML model building
• MAMP-3 validation of the model assumptions
• MAMP-4 hypothesis testing:
  1. Random effects
  2. Fixed effects
• MAMP-5 confidence intervals and prediction.

MAMP-1 Algorithm and Problem Instances
We aim at comparing the performance of the ES with different
recombination operators over an instance class. More precisely, we
have four ES instances using recombination operators $\{1, 2, 3, 4\}$
and nine instances randomly sampled from the class $\Pi_{\text{MSG}}$
as illustrated in Fig. 56.3. Each run is repeated ten times. In this
study, $4 \times 9 \times 10 = 360$ observations were used. We are
interested in the following questions:

• Is there an instance effect?
• Do the mean performances of the ES with different recombination
  operators differ?
• Do the instance-algorithm interactions contribute to the variability
  of the response?

A first visual inspection, which plots the performance of the
algorithm within each problem instance, is shown in Fig. 56.6. In
eight of the nine instances, the linear regression line has a negative
slope and the intercepts do not differ very much. This indicates that
there is no significant interaction between the fixed and the random
factors.

Fig. 56.6 Four algorithms (ES with modified recombination operators)
on nine test problem instances. Each panel represents one problem
instance and problem instances are labeled from 124 to 130.
Performance is plotted against the level of the recombination operator


MAMP-2 ANOVA and REML Model Building
The variability in the performance measure can be decomposed according
to the following mixed-effects ANOVA model

$$Y_{ijk} = \mu + \alpha_j + \tau_i + \gamma_{ij} + \varepsilon_{ijk}, \tag{56.15}$$

where $\mu$ is an overall performance level common to all
observations, $\alpha_j$ is a fixed effect due to the algorithm $j$,
$\tau_i$ is a random effect associated with instance $i$,
$\gamma_{ij}$ is a random interaction between instance $i$ and
algorithm $j$, and $\varepsilon_{ijk}$ is a random error for
replication $k$ of algorithm $j$ on instance $i$. We assume that the
$\alpha_j$'s are fixed effects such that
$\sum_{j=1}^{h} \alpha_j = 0$ and that the random elements $\tau_i$
are i.i.d. $\mathcal{N}(0, \sigma_\tau^2)$, $\gamma_{ij}$ are i.i.d.
$\mathcal{N}(0, \sigma_\gamma^2)$, $\varepsilon_{ijk}$ are i.i.d.
$\mathcal{N}(0, \sigma^2)$, and $\tau_i$, $\gamma_{ij}$, and
$\varepsilon_{ijk}$ are mutually independent random variables.
Similarly to (56.6), the conditional distribution of the performance
measure given the instance and the instance-algorithm interaction is
given by

$$Y_{ijk} \mid \tau_i, \gamma_{ij} \sim \mathcal{N}(\mu + \alpha_j + \tau_i + \gamma_{ij},\, \sigma^2), \tag{56.16}$$

with $i = 1, \ldots, q$, $j = 1, \ldots, h$, and $k = 1, \ldots, r$.
The marginal model reads (after integrating out the random effects
$\tau_i$ and $\gamma_{ij}$):

$$Y_{ijk} \sim \mathcal{N}(\mu + \alpha_j,\, \sigma^2 + \sigma_\tau^2 + \sigma_\gamma^2). \tag{56.17}$$

Based on these statistical assumptions, hypothesis tests can be
performed about fixed and random factor effects. Using the mixed model
(56.16), we are interested in testing whether there is a difference
between the factor level means $\mu + \alpha_j$ ($j = 1, \ldots, h$).
The hypotheses for testing the fixed effects can be formulated as

$$H_0: \alpha_j = 0 \;\; \forall j \quad \text{against} \quad H_1: \exists\, \alpha_j \neq 0. \tag{56.18}$$

Regarding random effects, tests about particular levels are useless.
This is similar to the random-effects model (56.8). Again, we perform
tests on the variance components $\sigma_\tau^2$ and $\sigma_\gamma^2$
instead. These can be formulated as follows

$$H_0: \sigma_\tau^2 = 0, \; H_1: \sigma_\tau^2 > 0 \quad \text{and} \quad H_0: \sigma_\gamma^2 = 0, \; H_1: \sigma_\gamma^2 > 0, \tag{56.19}$$


respectively. If all treatment (problem instance) combinations have
the same number of observations, i.e., if the design is balanced, the
test statistics for these hypotheses are ratios of mean squares that
are chosen such that the expected mean squares of the numerator differ
from the expected mean squares of the denominator only by the variance
components of the random factor under test. Chiarandini and Goegebeur
[56.3] present the resulting analysis of variance, which is shown in
Table 56.3.

Table 56.3 Expected mean squares and consequent appropriate test
statistics for a mixed two-factor model with $h$ levels of the fixed
factor, $q$ levels of the random factor, and $r$ repeats (after
[56.3])

Effects | Mean squares | Df | Expected mean squares | Test statistics
Fixed factor | MSA | $h-1$ | $\sigma^2 + r\sigma_\gamma^2 + rq \frac{\sum_{j=1}^{h} \alpha_j^2}{h-1}$ | MSA/MSAB
Random factor | MSB | $q-1$ | $\sigma^2 + r\sigma_\gamma^2 + rh\sigma_\tau^2$ | MSB/MSAB
Interaction | MSAB | $(h-1)(q-1)$ | $\sigma^2 + r\sigma_\gamma^2$ | MSAB/MSE
Error | MSE | $hq(r-1)$ | $\sigma^2$ |

ANOVA Model Building
The ANOVA table for the experiments from the MAMP case study is shown
in Table 56.4. Equating the observed mean squares in the lines of the
ANOVA table to their expected values and solving for the variance
components leads to the following equations [56.2]

$$\hat{\sigma}_\tau^2 = \frac{MSB - MSAB}{hr} = 0.593502,$$

$$\hat{\sigma}_\gamma^2 = \frac{MSAB - MSE}{r} = 0.306907,$$

$$\hat{\sigma}^2 = MSE = 4.664423.$$

Table 56.4 ANOVA for the MAMP case

Mean squares | Factors | Df | Sum Sq | Mean Sq | F value | Pr(>F)
MSA | objreco | 3 | 154.59 | 51.53 | 11.05 | 0.0000
MSB | fSeed | 8 | 251.79 | 31.47 | 6.75 | 0.0000
MSAB | objreco:fSeed | 24 | 185.60 | 7.73 | 1.66 | 0.0288
MSE | Residuals | 324 | 1511.27 | 4.66 | |

Next, we will compare these results to the REML-based analysis of the
mixed model.

REML Model Building
We have specified sum contrasts instead of the default treatment
contrasts used in lmer(). Again, fSeed represents the problem
instance, whereas the algorithm instance $\alpha_j$,
$j = 1, \ldots, 4$, is represented by objreco.

Linear mixed model fit by REML
Formula: yLog ~ objreco + (1 | fSeed) + (1 | fSeed:objreco)

Random effects:
 Groups        Name        Variance Std.Dev.
 fSeed:objreco (Intercept) 0.30691  0.55399
 fSeed         (Intercept) 0.59351  0.77039
 Residual                  4.66442  2.15973
Number of obs: 360, groups: fSeed:objreco, 36; fSeed, 9

Fixed effects:
            Estimate Std. Error t value
(Intercept)  -6.0222     0.2956 -20.370
objreco1      0.6176     0.2539   2.433
objreco2      0.6918     0.2539   2.725
objreco3     -0.6671     0.2539  -2.628

As can be seen from the Random effects section of the REML model
output, the estimated variances for the problem instance and the
instance-algorithm interaction random effects are
$\hat{\sigma}_\tau^2 = 0.59351$ and $\hat{\sigma}_\gamma^2 = 0.30691$,
respectively. The Fixed effects section presents the estimates of the
fixed-effects model parameters, i.e., objreco.
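The method-of-moments estimates of the MAMP case can be recomputed from the sums of squares and degrees of freedom in Table 56.4 together with the residual mean square stated in the text. The sketch below is illustrative; the function name and dictionary keys are hypothetical.

```python
def mixed_model_components(ms_a, ms_b, ms_ab, ms_e, h, r):
    """Moment estimates and test ratios for the balanced two-factor mixed model."""
    return {
        "sigma2_tau": (ms_b - ms_ab) / (h * r),   # instance effect
        "sigma2_gamma": (ms_ab - ms_e) / r,       # instance-algorithm interaction
        "sigma2": ms_e,                           # residual
        "f_fixed": ms_a / ms_ab,                  # MSA/MSAB
        "f_random": ms_b / ms_ab,                 # MSB/MSAB
        "f_interaction": ms_ab / ms_e,            # MSAB/MSE
    }


# Mean squares = Sum Sq / Df for objreco, fSeed, and objreco:fSeed;
# the residual mean square is the value quoted in the text.
estimates = mixed_model_components(
    ms_a=154.59 / 3,
    ms_b=251.79 / 8,
    ms_ab=185.60 / 24,
    ms_e=4.664423,
    h=4,    # four recombination operators
    r=10,   # ten repeats
)
```

Up to rounding of the tabulated sums of squares, `sigma2_tau` and `sigma2_gamma` agree with the values 0.5935 and 0.3069 obtained from both the ANOVA equations and the REML fit. Note that the F values printed in Table 56.4 equal each mean square divided by the residual mean square (for example, 51.53/4.66 ≈ 11.05), so for the two main effects they differ from the ratios against MSAB computed here.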


MAMP-3 Validation of the Model Assumptions 
Again, a check of the diagnostic plots (Fig. 56.7) reveals 
that a log transformation of the response improves the 
model adequacy. 


MAMP-4a Hypothesis Testing: Random Effects 
We will consider random effects first. Regarding problem
instances, tests about levels are meaningless. Hence,
we perform tests about the variance components $\sigma_\tau^2$
and $\sigma_\gamma^2$, which were presented in (56.19). First, we
test the null hypothesis, which states that the components
of the random effects are zero. Based on
the ANOVA from Table 56.3, we obtain the values
for the MAMP case that are shown in Table 56.4.
The values reveal that there are main factor effects
(fixed and random), but no significant interaction
effects.
Fig. 56.7a,b (a) Q-Q plot of the residuals for raw data. (b) Q-Q plot for the log-transformed responses


Alternatively, we can compute the likelihood ratios 
of models with and without the factors under observa- 
tion. 


    Data: mamp.df
    Models:
    mamp.lmer2: yLog ~ objreco + (1 | fSeed)
    mamp.lmer3: yLog ~ objreco + (1 | fSeed) + (1 | fSeed:objreco)
               Df    AIC    BIC  logLik  Chisq Chi Df Pr(>Chisq)
    mamp.lmer2  6 1616.7 1640.0 -802.35
    mamp.lmer3  7 1616.6 1643.8 -801.31 2.0929      1      0.148


These tests indicate that there are also no significant 
instance-algorithm interactions. Additional likelihood- 
ratio tests show that the fixed factor and random factor 
effects are significant. 
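The likelihood-ratio comparison of the two nested models can be reproduced from the reported log-likelihoods. This is a minimal sketch under the assumption that the statistic is referred to a chi-square distribution with one degree of freedom (the models differ by one variance parameter). The function name is hypothetical, and the printed log-likelihoods are rounded, so the statistic differs slightly from the reported 2.0929.

```python
import math


def likelihood_ratio_test(loglik_reduced, loglik_full):
    """Chi-square likelihood-ratio test for nested models differing by one parameter."""
    stat = 2.0 * (loglik_full - loglik_reduced)
    # Survival function of a chi-square variable with 1 df:
    # P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value


# Log-likelihoods of mamp.lmer2 (reduced) and mamp.lmer3 (full) as printed above.
stat, p = likelihood_ratio_test(-802.35, -801.31)
print(f"Chisq = {stat:.4f}, p = {p:.3f}")
```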


MAMP-4b Hypothesis Testing: Fixed Factor Effects
Regarding fixed factors, we are interested in testing for
differences in the factor level means $\mu + \alpha_i$. These tests
were formulated in (56.18), i.e., we are testing $H_0$: all
$\alpha_i$ are equal to 0 versus $H_1$: at least one $\alpha_i \neq 0$.
Here, we use the test statistic from [56.2, p. 523]. The
appropriate test statistic for testing that the means of the
fixed factor effects are equal, i.e., that $H_0$ is true, is


$$F_0 = \frac{MSA}{MSAB} = \frac{154.59/3}{185.6/24} = 6.663362,$$


with values taken from Table 56.4. The reference distribution
is $F_{h-1,(h-1)(q-1)}$. We calculate the p value
for the test on the fixed-effect term. The p value obtained
is 0.002; hence, the results collected indicate that
the factor recombination (objreco) has a statistically
significant impact on the performance of the algorithm.
Using sum contrasts implies that $\sum_i \alpha_i = 0$. The
point estimates for the mean algorithm performance
with the j-th fixed factor setting can be obtained by
$\hat{\mu}_{\cdot j} = \hat{\mu} + \hat{\alpha}_j$. The fixed factor
effects can be estimated in the mixed model as

$$\hat{\mu} = \bar{y}_{\cdot\cdot\cdot}, \qquad \hat{\alpha}_j = \bar{y}_{\cdot j \cdot} - \bar{y}_{\cdot\cdot\cdot},$$

which results in the following estimates: $\hat{\alpha}_1 = 0.6175519$,
$\hat{\alpha}_2 = 0.6918047$, $\hat{\alpha}_3 = -0.6671266$, and
$\hat{\alpha}_4 = -0.6423659$.

Fig. 56.8 Paired comparison plots. Results from four ES
instances with different recombination operators are shown
in this plot
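The F test for the fixed factor can be checked numerically. The sketch below recomputes $F_0$ from the sums of squares and evaluates the upper tail of the F distribution with 3 and 24 degrees of freedom. It uses only the standard library: the tail probability is written as a regularized incomplete beta function and integrated with Simpson's rule. The helper name is hypothetical, and this integration is just one simple way to get the value, not the procedure used in the chapter.

```python
import math


def f_survival(f_value, d1, d2, steps=2000):
    """Return P(F > f_value) for an F(d1, d2) distributed variable.

    Uses P(F > f) = I_x(d2/2, d1/2) with x = d2 / (d2 + d1 * f), where I_x is
    the regularized incomplete beta function. The integral is evaluated with
    Simpson's rule; the sketch assumes f_value > 0 and d2 >= 2.
    """
    a, b = d2 / 2.0, d1 / 2.0
    x = d2 / (d2 + d1 * f_value)

    def integrand(t):
        return t ** (a - 1.0) * (1.0 - t) ** (b - 1.0)

    width = x / steps  # steps must be even for Simpson's rule
    total = integrand(0.0) + integrand(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(i * width)
    integral = total * width / 3.0
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return integral / math.exp(log_beta)


# Sums of squares and degrees of freedom as used in the text.
f0 = (154.59 / 3) / (185.6 / 24)
p_value = f_survival(f0, 3, 24)
print(f"F0 = {f0:.6f}, p = {p_value:.4f}")
```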

The same estimates were obtained with the REML
analysis, as can be seen from the REML model output
in Sect. 56.5.3. The corresponding fixed effects are
shown in the Fixed effects section of the REML
output. For example, we obtain the following value:
objreco1 = $\hat{\alpha}_1$ = 0.6176.
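Under sum contrasts, the REML output lists only three of the four level effects; the fourth follows from the sum-to-zero constraint. A minimal sketch, using the rounded coefficients from the Fixed effects section (the function name is hypothetical):

```python
def level_effects_from_sum_contrasts(coefs):
    """Append the effect of the last level implied by the sum-to-zero constraint."""
    return list(coefs) + [-sum(coefs)]


intercept = -6.0222                       # estimated overall mean of yLog
effects = level_effects_from_sum_contrasts([0.6176, 0.6918, -0.6671])
level_means = [intercept + a for a in effects]

for j, (a, m) in enumerate(zip(effects, level_means), start=1):
    print(f"objreco level {j}: effect = {a:+.4f}, mean = {m:.4f}")
```

The implied fourth effect agrees with the ANOVA-based estimate reported above to about four decimal places.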


MAMP-5 Confidence Intervals and Prediction
We generate paired comparison plots, which are
based on confidence intervals. The wrapper function
intervals() from Chiarandini and Goegebeur [56.3]
was used for visualizing these confidence
intervals as shown in Fig. 56.8. When intervals overlap,
we conclude that there is no significant difference.
Here, we can conclude that the recombination operators
(1) and (2) show a similar performance,
whereas performances between (3) and (2) are
different.

Intermediate recombination of the object variables,
i.e., (3) and (4), results in a significant improvement
of the performance.


56.6 Summary and Outlook


In order to answer question (Q-1), we propose an approach
to generate natural problem classes, which are
based on real-world data. If no such data are available,
artificial problem generators such as MSG can be used.
Since our approach uses a model, say M, to generate
new problem instances, one conceptual problem arises:
this approach is not applicable if the final goal is the
determination of a model for the data, because M is by
definition the best model in this case, and the search for
good models will result in M. However, there is a simple
solution to this problem. In this case, the feature
extraction and model generation should be skipped and
the original data should be modified by adding some
noise or performing transformations on the data. Nevertheless,
if applicable, the model-based approach is
preferred, because it sheds some light on the underlying
problem structure.

The model-based approach can be used to generate
infinitely many test-problem instances. Instead of using
a fixed number of problem instances, we propose:


1. Using randomly generated problem instances
2. Treating the problem instance as a random factor.


Algorithms with different parameterizations are
tested on this set of randomly generated problem instances.
This experimental setup requires modified
statistics, so-called random-effects models or mixed
models. This approach may lead to objective evaluations
and comparisons. If normality assumptions are
met, confidence intervals can be determined, which
forecast the behavior of an algorithm on unseen problem
instances. Furthermore, results can be generalized
in real-world settings. This gives an answer to question
(Q-2).

In order to demonstrate the applicability of our ap- 
proach, the performance of an evolution strategy was 
analyzed. The first SAMP example illustrates that the
selection of the problem instance from the problem
class $\Pi_{\mathrm{MSG}}$ has no significant impact on the
performance of the optimization algorithm. Furthermore,
confidence intervals, which can be used to predict the 
performance of the algorithm on a problem class, were 
determined. The MAMP case exemplifies how to ana- 
lyze the effect of different algorithm parameter settings 
on the performance. Four variants of the recombina- 
tion operator and nine problem instances were used. 
The analysis reveals that the choice of the recombina- 
tion operator has a significant effect on the algorithm’s 
performance: the performance of the algorithm differs 
with different recombination operators. Intermediate
recombination of the object variables results in a performance
improvement. We demonstrated that the problem
instances contribute significantly to the variability in 
the response and that there is no significant instance— 
algorithm interaction. 

The software that was used in this study will be inte- 
grated into the R package SPOT (sequential parameter 
optimization toolbox) [56.19]. 
57.1 Intelligence and Computation


Developments in computational intelligence (CI) are
driven by real-world applications. Over the years a lot
of CI has become ubiquitous to the average user and
is deeply interwoven into the way modern design,
research and development is done.

In our view, CI is human intelligence assisted and
(dramatically) enhanced by computational modeling.
Intelligence is the capability to predict, and, in theory,
there are two directions to get to prediction through
computing: data-driven modeling and first principle
modeling. In reality though, since even fundamental
models and theories have to be validated by data,
everything is data driven. For this reason, from now on
we will focus on data-driven computational modeling,
and say that it exists to enhance predictive capabilities
of the human or business. While prediction is the ultimate
goal and computational modeling is the means to
achieve this goal, we will use concepts of predictive
analytics and (data-driven) computational modeling as if
they were the same.

Computational modeling methods allow us to generate
various hypotheses about a specific problem based
on observations in an objective way. The mental models
that the scientists develop during this process help
them to filter through these hypotheses and come up
with new experiments that either support or falsify
some of the previous hypotheses or lead to new ones.
This process supports the scientific method and significantly
accelerates technological development and
innovation.
There are many examples of new computational
methods empowering problem solving in the areas of 
material science, energy management, plant optimiza- 
tion, sensory evaluation science, broadband technology, 
social science (economic modeling), infectious disease 
prevention, etc. And while success in many cases is un- 
deniable, two main challenges still remain. 

First, there is an education gap to bridge before 
modern CI techniques can reach their full potential, are 
widely accepted, and become as natural as performing 
experiments in the lab. While many engineering educational
programs are embracing these techniques and
helping to raise awareness of the useful methods in
data-driven modeling and computational statistics, the
majority of programs in pure sciences tend to ignore them
for the most part. There is still a considerable (psychological,
cognitive, educational) barrier for experimental
scientists (biologists, chemists, physicists, computer
scientists) to fully exploit the potential of CI. People
will happily save an hour of computing time by spending
an additional week in the lab, while in some cases it
makes much more sense to spend a week of computing 
time to spare one experiment in the lab (consider, e.g., 
an expensive car crash-test). We appeal to educational 
programs to nurture the interest in computation among 
graduates and facilitate the joint projects of academia 
with industry targeted at the use and further develop- 
ment of computational intelligence methods for real- 
world problems. 

Second, there is a development gap in the pro- 
duction of scalable off-the-shelf CI algorithms. The 
parallelization bottleneck seems to affect most CI 
methods when they are executed on massively par- 
allel architectures. The fact that computational ad- 
vances in hardware (exa-scale computing) happen at 
a much faster pace than advances in the design of 
scalable CI algorithms raises the question: Up to 
which moment can we get more intelligence, i.e., 
more predictive capability, with more computational 
power? 


57.2 Computational Modeling for Predictive Analytics 


While many barriers remain in improving the incorpo- 
ration of CI in classical education, in solving the new 
(previously unthinkable) challenges, and in further in- 
novating CI technology, the current time is a perfect 
moment to make this happen. 

First of all, the recognition of the indispensability of
CI across all industries and all sciences grows, as does
the number of required CI practitioners (computational
statisticians, data scientists, modelers). The report of
Manyika et al. on Big Data [57.1] predicts a potential
gap of 50–60% (300 000 people) in demand relative to
the supply of well-educated analytical talent in the USA
by 2018. The data science and big data movement has
grown in the last decade to become a buzzword
omnipresent in scientific magazines, technology reviews,
and business offerings.

While we are happy that the attention of the aver- 
age user is being focused on the importance and impact 
of computational modeling, we are also concerned with 
the fact that too many details are omitted and almost 
everything (business strategies, CI methods, targets for 
predictive analytics, etc.) gets thrown onto one pile. 

While Big Data is occupying the minds of future
engineers, data scientists, and business majors as the next
big thing to watch and a synonym of predictive analytics,
we want to balance the story some more and provide
a full picture of what we think constitutes predictive
analytics by computational modeling. While business and
industry are striving to become data driven these days,
they seek CI strategies to compete, innovate, and capture
value. Success and impact of CI will be generated only
if the right strategies are used in the right place.

Success of CI in industry will be awarded to meth- 
ods that create impact measured in attaining the new 
level of understanding and knowledge, in units of dol- 
lars. In Fig. 57.1 we sketch a relation between the 
degree of intelligence and the level of competitive ad- 
vantage from [57.2]. Further on, we will use the terms 
predictive analytics and computational modeling (for 
predictive analytics to sustain human intelligence) as if 
they were the same. 

We distinguish three pillars of computational mod- 
eling for predictive analytics: business analytics applied 
to big data (millions to billions of records, dozens 
to hundreds of variables), process analytics applied to 
medium-sized data (tens of thousands of records, hun- 
dreds of variables), and research analytics applied to 
precious data (tens to thousands of records, dozens to 
hundreds to thousands of variables) (Fig. 57.2). 


57.2.1 Business Analytics 


Business analytics is the part of predictive analytics
associated with big data. In recent years, other sciences
also created big data problems, so the field could
be called big data analytics. The distinguishing feature
of business analytics is the fact that it is used
to inspect big data streams to provide a quick and 
simple analysis with immediate value reliably and con- 
sistently. Because of the size, big data already offers 
tremendous challenges in stages preceding analytics — 
in storage, retrieval, and visualization. These imply that 
the predictive goals can only be modest, except when 
big computing facilities and specialized data bases are 
available (like it happens in environmental and biolog- 
ical research, Internet search, smart grids, etc.). Main 
goals here are: 


• Visualization (e.g., dashboards).

• Recommendation (e.g., studying customer habits
and preferences to recommend a new suitable product
item). Recommendation uses network analysis
to select relevant or similar items.

• Identification of (simple) trends to enhance customer
experience and increase surplus. Trends are
typically found using time series analysis.

• Binary classification to distinguish out-of-the-ordinary
data points from the prototypes following the
trends (credit risk analysis, fraud detection, spam
identification).


Because of the memory limitations, the challenge 
in business analytics is to quickly give an answer to 
simple questions with the main focus on algorithms for 
in- and out-of-memory computation and visualization. 
Industries benefitting most from business analytics are 
retail, banking, insurance, health-care, telecommunica- 
tions, and social networks. 

For example, at large multinational manufactur- 
ing companies, business analytics predominantly re- 
volves around the multivariate forecasting of sup- 
ply and demand. The expected prices and volumes 
of feedstocks and raw materials as well as the ex- 
pected demand for various products are important 
to minimize risk and optimize production as well 
as the supply chain. Classical statistical forecasting 
techniques are the main workhorse for this area and 
the main challenges consist of being able to gather 
the required data, dealing with possibly large num- 
bers of candidate inputs and outputs for the models 
and properly dealing with the hierarchies that exist, 
e.g., products-markets-industry resulting in an explo- 
sion in the number of models that have to be built and 
maintained. 
Fig. 57.1 Davenport and Harris [57.2] have wonderfully adapted
the graphics from SAS software. The graph above eloquently
explains why to use predictive modeling to excel, compete, and
capture value


[Figure 57.2 panel labels: big data, high data redundancy, immediate
value; medium-sized data, high deployment constraints, immediate
value; precious data (context matters), customized solutions,
long-term value]

Fig. 57.2 Predictive modeling has three components: Business
analytics, predictive analytics, and research analytics


57.2.2 Process Analytics 


Process analytics exploits medium-size data to gener- 
ate time-sensitive prediction of an industrial process 
(e.g., manufacturing, process monitoring, remote sens- 
ing, etc.) with immediate value. 

Process analytics models must be very robust, sim- 
ple (mostly linear), and concise to be deployed in real 
industrial processes. 

This well understood and probably most conserva- 
tive area of predictive analytics has experienced a big 
change in the last years. A couple of decades ago, pro- 
cess optimization and control groups had more people 
and less pressure. Nowadays, pressure for integrating
production workflows has increased together with the
need to meet tighter quality specifications, much tighter
emission thresholds, requirements to reduce production,
operation, and energy costs, and to maximize
throughput. The sensor side has changed as well: sensors
have become much more sophisticated and much more
abundant. Human interference has also decreased
due to the (sometimes exaggerated) drive toward automation
and cost reduction.

All these factors have dramatically increased the de- 
mand for reliable optimization and control. In general, 
process analytics models must be very robust, simple 
(mostly linear) and concise to be deployed in real in- 
dustrial processes. The main challenge for this industry 
is to integrate more sophisticated models and adopt new 
computational methods for process analytics to adapt to 
the changing world of new requirements while main- 
taining robustness over a wide process range. At the 
time when this chapter was written data-driven mod- 
eling was still considered exotic for the field of process 
analytics, and model deployment is still heavily con- 
strained. 

The main goals in process analytics are process 
forecasting and process optimization and control. 

The challenge in process forecasting is to build
models that strike a balance between model interpretability
and long-term (real-time) predictive
power. The technological challenge of successful CI
methods is the capability to identify driving features 
in a large set of correlated features. For example, 
think of a problem of predicting the quality of a man- 
ufactured plastic using the smallest subset of avail- 
able factors controlling the production process — pres- 
sures, temperatures, flows. Robust feature selection is 
as important as good prediction accuracy — models 
that are too bulky will never be accepted by process 
engineers. 

The main challenge in process control is the mul- 
tiobjective nature of control specifications and subse- 
quent optimization problems. Consider an example of 
manufacturing and wholesaling thin sheets of metal. 
The thickness of the sheet is an important quality 
characteristic that should not fall below a predefined 
minimum, or the product will be considered off-spec. If 
due to the processing condition the thickness variabil- 
ity is high (sheets are several meters wide and tens of 
meters long, rolled at high speeds, high temperatures), 
penalty for off-spec material is high, and costs for raw 
steel are also high — the manufacturer faces a delicate 
problem of making the sheet thicker than the allowed 
minimum to keep the clients happy but not too thick 
to keep the production costs down. These competing 


objectives usually require a multiobjective approach to 
process optimization. 
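The trade-off in the sheet-metal example can be made concrete with a Pareto filter over candidate settings. The sketch below is illustrative only: the candidate thicknesses, cost figures, and off-spec risks are hypothetical numbers, and both objectives are assumed to be minimized.

```python
def pareto_front(candidates):
    """Return labels of candidates not dominated by any other candidate.

    Each candidate is (label, cost, risk); both objectives are minimized.
    A candidate is dominated if another one is no worse in both objectives
    and strictly better in at least one.
    """
    front = []
    for label, cost, risk in candidates:
        dominated = any(
            c <= cost and r <= risk and (c < cost or r < risk)
            for _, c, r in candidates
        )
        if not dominated:
            front.append(label)
    return front


# Hypothetical target thicknesses: (label, material cost, off-spec risk).
settings = [
    ("1.00 mm", 100.0, 0.20),
    ("1.02 mm", 102.0, 0.08),
    ("1.05 mm", 105.0, 0.01),
    ("1.05 mm, unstable process", 105.0, 0.05),
    ("1.10 mm", 110.0, 0.01),
]
front = pareto_front(settings)
print(front)
```

The remaining settings are the ones a process engineer would still have to choose among; the filter removes only options that are worse on every count.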

Process analytics relies on a rich data set coming 
from the many sensors in a typical plant. Mature plat- 
forms exist that store this, often high-frequency, sensor 
data in databases and plant information systems. The 
primary intent for this data is to run the various plant 
control and quality control systems but archived data 
are often available for predictive modeling as well. The 
use of models that predict the aging and lifetime of cat- 
alysts and the associated changes in optimal settings for 
the plant are good examples. 

Another example is the building of so-called
soft sensors that link difficult measurements (e.g.,
grab samples that need to be brought to the lab
for analysis, with results only becoming available after
some time) to some of the easier high-frequency
measurements (e.g., temperatures, flows, and
pressures). These models then serve as substitutes for
the difficult measurements at a high frequency and
can be calibrated if needed when the slow measurements
become available. There are also many opportunities
to use the demand and supply forecasting models
from the business analytics side to optimize production
and to find the product mix that is optimal for a given
scenario. As an extreme example, it may be cheaper
to shut down a plant for a while than to continue
production when demand is forecasted to be very weak.
The level and amount of coupling that is possible between
demand-supply forecasts and actual production
can vary significantly and depends on many factors,
but it is clear that much more is possible in this
area.
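The soft-sensor idea described in this passage can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a deployed design: the linear form, the coefficient values, the variable names, and the simple bias-update rule are all hypothetical.

```python
class SoftSensor:
    """Linear soft sensor with bias recalibration from slow lab results."""

    def __init__(self, intercept, weights):
        self.intercept = intercept
        self.weights = weights      # maps fast-measurement name to coefficient
        self.bias = 0.0             # correction learned from lab samples

    def predict(self, fast):
        """Estimate the slow lab value from fast measurements."""
        raw = self.intercept + sum(w * fast[name] for name, w in self.weights.items())
        return raw + self.bias

    def recalibrate(self, fast_at_sampling, lab_value, rate=0.5):
        """Shift the bias toward the error observed at sampling time."""
        error = lab_value - self.predict(fast_at_sampling)
        self.bias += rate * error


# Hypothetical coefficients and readings.
sensor = SoftSensor(1.0, {"temperature": 0.02, "flow": -0.1})
reading = {"temperature": 200.0, "flow": 5.0}
before = sensor.predict(reading)
sensor.recalibrate(reading, lab_value=5.0)   # lab result arrives later
after = sensor.predict(reading)
print(before, after)
```

The update rate controls how strongly a single lab result moves the model; a small rate gives a smoother but slower correction.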

Examples of industries employing process analytics
are manufacturing, chemical engineering, energy, and
environmental science.


57.2.3 Research Analytics 


Research analytics is used to accelerate the devel- 
opment of new products and systems. This task is 
fundamentally different from all the ones mentioned 
previously as it is usually applied to small, com- 
plex, and precious data, is heavily dependent on 
problem context and provides long-term value without
immediate rewards. (By small we mean any
data set where the number of records is comparable
to or even smaller than the number of features.
In this way, gene expression data with thousands
of variables taken over dozens of individuals is
small.)

Research analytics provides very customized solutions
and requires a close collaboration between
analysts/modelers and subject matter experts.

Research analytics is by nature much less generic 
and becomes very dependent on the specific product 
that is being developed. In research, once you have pre- 
dictive analytics, then there is only a small step to make 
from optimization of existing products to the design of 
new ones. One example of a research analytics success 
story is the development of an application to predict 
the exact color of a plastic part based on the composition
of the colorants and the specific grade of plastic
being used, see [57.3]. Robust color prediction models
led to the capability of actually designing colorant
compositions in silico directly from customer specifications.
The models also provided the specifications that
were necessary to even let the customers produce that
part themselves.

How far one is able to take this depends on the fidelity
of the models as well as the quality of the data
that is available. Another example of research analytics
at work is the design of new coatings and catalysts
based on high throughput experimentation where all
the available data is being used to build models on
the fly. These models are then used to design the next
experiments such that the information gain is maximum.
The requirements for the modeling process are
quite high because everything is embedded in a
high-throughput workflow but the benefits are also huge.
Significant speedups in the total design time as well as
the performance of new products can be achieved this
way.

We stress that because research analytics is an
enhancement to human intelligence for the development
of new products and systems, the benefits of its
application scale proportionally with the size of the problem
and the impact of that particular product or system. For
big enough problems the benefits quickly get into the
hundreds of thousands to millions of dollars.

Research analytics can help drive innovation in all
industry segments, particularly in materials science,
formulation design, pharmaceuticals, engineering,
simulation-based optimization in research, bio-engineering,
healthcare, telecommunications, etc. In the coming
10 years, we will continue to see the trend of innovation
enabled to a large extent by predictive modeling.


57.3 Methods


Over many years of exercising process and research
analytics, we built up a practice of using predictive
modeling as the integration technology for real-world
problem solving. In the last 8 to 10 years, predictive
modeling for computational intelligence has evolved from
the solution of last resort to the mainstream approach
to industrial problem solving (prediction, control, and
optimization). It is technology that glues together
fundamental modeling and domain expertise, high-performance
computing and computer science, empirical
modeling and mathematics: a heaven for an inquiring
mind and interdisciplinary enthusiast.

Predictive modeling is a bridge that connects theory
and facts (data) to enable insight and system understanding.
The theory for poorly understood problems
is often based on simplifying assumptions, on which
the fundamental models are built. The facts, or empirical
evidence, are often affected by high uncertainty and
a limited observability of the system’s behavior.

Predictive modeling applied iteratively to a growing
set of facts tests the theory against the data and
extrapolates models built on the data to confirm or adjust
the theory until the theory and facts start to agree.
The validation always lies in the hands of a subject matter
expert who in the case of success accepts both the
theory and the designed predictive models as plausible
and interesting. While the real validation comes when
models are deployed and keep generating value, without
the first step of intriguing the subject matter expert
the project does not have a chance to succeed.

To clear any obstacles toward the acceptance of 
models by the domain expert the models should be: 


1. Interpretable 
2. Parsimonious 
3. Accurate 

4. Extrapolative 
5. Trustable, and 
6. Robust. 


In an industrial setting, the capability of having
a trustable prediction of the output within and outside
the training range is as important as interpretability and
the possibility of integrating information from first
principles, low maintenance and development costs with no
(or negligible) operator interference, robustness with
respect to the variability in data, and the ability to detect
novelties in data to attune itself toward changes in the
system’s behavior.

There is no single technique producing models that 
are guaranteed to fulfill all of the requirements above, 
but rather there is a continuum of methods (and hybrids) 
offering different tradeoffs in these competing objec- 
tives. 

Commonly used predictive modeling techniques in- 
clude linear regression, and nonlinear regression [57.4], 
boosting, regression random forests [57.5], radial-ba- 
sis functions [57.6], neural networks [57.7], support 
vector machines (SVM) [57.8, 9], and symbolic regres- 
sion [57.10, 11] (see more in [57.12]). 

In Fig. 57.3, we place some of the most common 
methods for predictive modeling for process and re- 
search analytics in the objective space of development 
time versus the level of a priori knowledge about the 
problem. When identifying which methods to use, other
objectives (like interpretability and extrapolative
capability) must also be taken into account. The time axis is
depicted on a log scale, and the exact development time 
depends on implementation and a particular algorithm 
flavor. 

Support vector machines and ensemble-based neu- 
ral networks lose to linear, nonlinear, and regularized 
regression in interpretability, but have advantages for 
problems where little a priori information is known 
about the system, and no assumptions on model struc- 
tures can be made (see Fig. 57.3). 

Regression random forests and symbolic
regression [57.13–15] have further advantages for problems
where not only model structures but also the variable
drivers (significant factors) are unknown.


[Figure 57.3: a priori knowledge (vertical axis) versus development
time (horizontal axis). Variables and model structure known: linear
regression, nonlinear regression. Variables known, model structure
not known: SVMs, NNs. Variables and model structure not known:
random forests, symbolic regression.]

Fig. 57.3 Predictive modeling methods as competing
tradeoffs in development time versus the level of a priori
knowledge about the problem


Random forests have proved to be robust and very
efficient for predicting the response within the training
range and for identifying the most significant variables,
but because they do not possess extrapolative properties
they can only be used in problems where no extrapolation
is necessary. Recent studies [57.16] indicate
that variable selection information obtained by random
forests can lose meaning if correlated variables are
present in the data and affect the response differently.
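The lack of extrapolative capability follows from the piecewise-constant form of tree-based predictors. The sketch below does not implement a random forest: a single binned-mean regressor stands in for it as a deliberately simplified, hypothetical substitute that shares the relevant property, namely that every input beyond the training range falls into the outermost cell and receives the same prediction.

```python
import bisect


def fit_binned_means(xs, ys, n_bins=5):
    """Fit a piecewise-constant model: the mean response in equal-width bins."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(1, n_bins)]   # interior bin edges
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in zip(xs, ys):
        k = bisect.bisect_right(edges, x)
        sums[k] += y
        counts[k] += 1
    overall = sum(ys) / len(ys)
    # Empty bins fall back to the overall mean.
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return edges, means


def predict(model, x):
    """Inputs beyond the training range land in the outermost bin."""
    edges, means = model
    return means[bisect.bisect_right(edges, x)]


# Noise-free linear response y = 3x observed on the training range [0, 1].
xs = [i / 49 for i in range(50)]
ys = [3.0 * x for x in xs]
model = fit_binned_means(xs, ys)

print(predict(model, 0.5), predict(model, 2.0), predict(model, 10.0))
```

The predictions at 2.0 and 10.0 coincide and stay below the largest training response, whereas the true response keeps growing.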

In business analytics, when the speed of model
development is the main goal, linear regression and
regularized learning are the only remaining options.
(Recent developments in predictive modeling for big data
also focus on the feature generation problem,
where the set of original data variables is expanded
to a much larger set of new features, i.e., transformations
of the original variables, to which regularized linear
regression is applied, much like in support vector
regression.)

In process analytics, when the driving input factors
are known, ensemble-based neural networks, support
vector regression, and ensemble-based symbolic regression
are the modeling alternatives.

If very little is known about the process or system,
experiments demonstrate correlations among
input variables, and concise interpretable models are
required, symbolic regression is the only resort,
which comes at the price of a higher development time
(Fig. 57.3).

We stress the importance of using ensembles of pre- 
dictive models irrespective of which modeling method 
is used. Ensemble disagreement used as a trustability 
measure defines the confidence of prediction and is cru- 
cial for reliable extrapolation. (It cannot be stressed 
enough that all prediction in a space of dimensionality 
above 3 is mostly extrapolation, even when evaluated 
inside the training range.) 
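Ensemble disagreement as a trustability measure can be illustrated with a small bootstrap ensemble. The sketch swaps in simple linear fits for the symbolic-regression and neural-network ensembles discussed in this chapter; the data, the ensemble size, and the function names are hypothetical. The spread of the member predictions serves as the confidence indicator.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b * x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def bootstrap_ensemble(xs, ys, n_models=50, seed=0):
    """Fit one line per bootstrap resample of the training data."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    while len(models) < n_models:
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        if len(set(bx)) < 2:        # skip degenerate resamples
            continue
        models.append(fit_line(bx, [ys[i] for i in idx]))
    return models


def predict_with_disagreement(models, x):
    """Return the ensemble mean and the spread of member predictions."""
    preds = [a + b * x for a, b in models]
    return statistics.fmean(preds), statistics.stdev(preds)


# Hypothetical noisy data on the training range [0, 1].
noise = random.Random(1)
xs = [i / 19 for i in range(20)]
ys = [2.0 + 3.0 * x + noise.gauss(0.0, 0.2) for x in xs]
models = bootstrap_ensemble(xs, ys)

inside = predict_with_disagreement(models, 0.5)    # within the training range
outside = predict_with_disagreement(models, 5.0)   # far outside it
print(inside, outside)
```

The disagreement grows with the distance from the training data, which is the warning signal one wants before trusting an extrapolated prediction.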

We deal mostly with process and research analytics.
In our experience, the aspect of trustability via
ensembles of global transparent models, coupled with
the massive algorithmic efficiency gains and the ability
to easily handle real-world data with spurious and
correlated inputs, has led to symbolic regression largely
replacing neural networks and support vector machines
in our industrial modeling. Our experience also is that
symbolic regression models tend to extrapolate well and
also provide warning of that extrapolation.

The reason for symbolic regression being successful
for process and research analytics is the fact that all
real-world modeling problems we have seen up to now
contained only about a dozen relevant inputs (never more
than 25 variables, in most cases less than 10) which
were truly significantly related to the response. Because
symbolic regression searches for plausible models in
a space of all possible structures from the given set of
potential inputs and allowed functional transforms, the
computational complexity increases nonlinearly with
the dimensionality of the true design space. For this reason,
symbolic regression effortlessly identifies dozens
of driving variables among tens to hundreds of candidates
(albeit using hours of multicore computing time).
But it should not be used for problems where hundreds
of inputs are significantly related to the response and


57.4 Workflows 


Although there is no universal solution for predictive 
modeling and no size fits all, especially for research an- 
alytics, nothing is as important for a successful solution 
as a good modeling workflow. 

We would like to make a case for the utmost impor- 
tance of workflows and the need to nurture and actively 
proliferate them through all CI projects. In the next 
section, we give an example on how a successful work- 
flow developed in a project from flavor science could be 
seamlessly applied to a project in video quality predic- 
tion. And because predictive modeling for CI will soon 
be used in nearly all industry segments and research 
domains, we believe that it is the responsibility of CI 
practitioners to facilitate innovation through prolifer- 
ation and popularization of (interpretable) workflows 
allowing straightforward application in new domains. 

The most general approach to practical predictive 
modeling is depicted in Fig. 57.3. 

We view this generic framework as an iterative feed- 
back loop between three stages of problem solving (just 
as it usually happens in real-life applications): 


1. Data generation, analysis and adaptation 
2. Model development, and 
3. Problem analysis and reduction. 


An important observation is made in the To- 
ward 2020 Science report edited by Emmott and Ri- 
son [57.17]: 


What is surprising is that science largely looks at 
data and models separately, and as a result, we miss 
the principal challenge — the articulation of mod- 
elling and experimentation. Put simply, models both 
consume experimental data, in the form of the con- 


should be filtered out of thousands of candidates. We 
claim though that no methods are available to solve the 
latter kind of problems because the necessary amount 
of data capturing true input-output relationships will 
never be collected. 

Although tremendous progress has been made 
over the past decade in terms of the efficiency and 
quality of symbolic regression model development, 
we also have made corresponding advances from 
a holistic perspective encompassing the overall mod- 
eling workflows from data collection through model 
deployment. 


text or parameters with which they are supplied, and 
yield data in the form of the interpretations that are 
the product of analysis or execution. Models them- 
selves embed assumptions about phenomena that 
are subject of experimentation. The effectiveness of 
modeling as a future scientific tool and the value of 
data as a scientific resource are tied into precisely 
how modelling and experimentation will be brought 
together. 


This is exactly the challenge of predictive modeling 
workflows — a holistic approach to bring together data, 
models, and problem analysis into one generic frame- 
work. Ultimately, we want to automate this iterative 
feedback loop over data analysis and generation, model 
development, and problem reduction as much as possi- 
ble, not in order to eliminate the expert, but in order to 
free as much thinking time for the expert as possible. 

This philosophical shift away from human replace- 
ment in the modeling workflow toward human aug- 
mentation has been very important in the last decade. 
A successful workflow must offer suites which mine the 
developed models to identify driving factors, variable 
combinations, and key variable transforms that lead to 
insight as well as robust prediction. 


57.4.1 Data Collection and Adaptation 


Very often, especially in big companies, and especially
for process analytics, CI practitioners do not have
access to data creation and experiment planning. This
gap is typical of situations where multivariate data is
given and there is no possibility of gathering better
sampled data.

In other situations, it is possible to plan the
experiments and gather new observations of the response
for desired combinations of input variables, but the
assumption always is that these experiments are very
expensive, i.e., require long computation, simulation,
or experimentation time. Such a situation is most
common in research analytics and metamodeling for the
design and analysis of simulation experiments.

The questions to ask at the data collection and adap- 
tation stage are: How to design experiments within the 
available timing and cost budget to optimally cover 
the design space (possibly containing spurious vari- 
ables)? How can available data and developed models 
guide design-space exploration in the next iterations? 
Is available data well sampled? Is it balanced? What 
is the information content of performed experiments? 
Is there redundancy in the data and how to minimize 
it? 


57.4.2 Model Development 


In model development, the focus is on automatic
creation of collections of diverse data-driven models
that infer hidden dependencies from the given data and
provide insight into the problem, process, or system in
question.

Irrespective of which modeling engines are used at this
stage, the questions of how best to generate, evaluate,
select, and validate models given particular data
features (size and dimensionality) are of great
importance. Model quality in general, i.e.,
generalization, interpretability, efficiency,
trustworthiness, and robustness, is the main focus of
model analysis leading to the next stage.


57.4.3 Problem Analysis and Reduction


The stage of problem analysis and reduction supposes
that developed models are carefully scrutinized,
filtered, and validated to infer preliminary conclusions
on problem difficulty. The focus is on driving inputs,
assessment of variable contribution, linkages among
variables, dimensionality analysis, and construction of
trustable model ensembles. The latter, if defined well,
will contribute to intelligent data collection in the
style of active learning.

With the goal of augmenting human intelligence by
computation, we emphasize the critical need for a human,
an inquiring mind who will test the theory, the facts
(data), and their interpretations (models) against each
other to iteratively develop a convincing story where
all elements fit and agree.


57.5 Examples


57.5.1 Hybrid Intelligent Systems
for Process Analytics


A good example of a unified workflow for process
analytics is the hybrid intelligent systems framework
popularized at the Core R&D department of the Dow
Chemical Company in the late 1990s.

The methodology was developed to improve soft sensor
performance (performance of predictive models), to
shorten development time, and to minimize maintenance.
It employed different intelligent system components:
genetic programming, support vector machines, and
analytic neural networks [57.18].

The process analytics in this framework consists of the
following steps after data collection:


1. Data preprocessing and compression. Support vector
regression using the ε-insensitive margin is used to
identify and remove data outliers and compress data
to a representative set of prototypes (support
vectors). The result is a clean and compressed data set.

2. Preliminary variable selection using ensemble-based
stacked analytic neural networks [57.19]. The result of
this step is a ranking of input variables and
quantification of variable contribution based on
iterative input elimination and retraining.

3. Convolution parameter estimation to identify
relevant time lags of significant inputs using
appropriate convolution functions.

4. Development of transparent predictive models using
symbolic regression via genetic programming and final
variable selection using symbolic regression models.

5. Model selection and analytical function validation.

6. Online implementation.

7. Soft sensor maintenance to guarantee robustness of
process prediction.


Examples of the use of this workflow for reactor 
modeling can be found in [57.18]. 

We all practiced the hybrid intelligent systems
workflow in the past, but the massive algorithmic
efficiency gains in ensemble-based symbolic regression
via genetic programming over the last decade
[57.14, 20] have led us to simplify the workflow and
largely eliminate steps one and two, replacing them by
direct application of symbolic regression.


57.5.2 Symbolic-Regression Workflow 
for Process Analytics 


The major modeling engine breakthrough was the in- 
corporation of a multiobjective viewpoint; this intro- 
duced orders of magnitude improvements in model 
development speed while simultaneously allowing the 
analyst to choose the proper balance between complex- 
ity and accuracy post facto. In essence, the data could 
now define the appropriate model structure and driving 
inputs, which became the main reason for symbolic re- 
gression’s success for predictive modeling. 
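
The multiobjective viewpoint can be made concrete with a short sketch that filters a set of candidate models down to the non-dominated ones in complexity versus error space. The model names, complexity values, and error values below are invented for illustration, and the brute-force filter is only a sketch, not the selection scheme of any particular symbolic regression engine.

```python
def pareto_front(models):
    """Keep models that no other model beats on both complexity and error.

    Each model is a tuple (name, complexity, error); lower is better
    for both criteria.
    """
    front = []
    for name, complexity, error in models:
        dominated = any(
            c <= complexity and e <= error and (c < complexity or e < error)
            for _, c, e in models
        )
        if not dominated:
            front.append((name, complexity, error))
    return sorted(front, key=lambda m: m[1])

# Hypothetical candidates: (name, complexity, error).
candidates = [
    ("m1", 5, 0.30),
    ("m2", 9, 0.12),
    ("m3", 9, 0.20),   # dominated by m2
    ("m4", 20, 0.10),
    ("m5", 25, 0.15),  # dominated by m4
]
front = pareto_front(candidates)
```

The analyst can then pick a model from the front after the run, which corresponds to choosing the balance between complexity and accuracy post facto.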

Other conceptual advances, such as ordinal genetic
programming, interval arithmetic, Lamarckian evolution,
and secondary optimization objectives (age, model
dimensionality, nonlinearity, etc.), have brought us to
the current situation where we can largely inject data
into a (properly designed) symbolic regression engine
and interesting and useful models will emerge.

The symbolic regression workflow now follows Fig. 57.4,
with model development done using Pareto-aware symbolic
regression [57.14].


Distillation Tower Example 
The dataset comes from an industrial problem on mod- 
eling gas chromatography measurements of the compo- 
sition of a distillation tower and is available online at 
http://www.symbolicregression.com. 


Fig. 57.4 Generic iterative model-based problem solving
workflow (after [57.15]). The diagram links three
stages: data collection & adaptation, model development,
and problem analysis & reduction


A chemical reaction typically generates a variety of 
chemicals along with the one (or several) of interest. 
One method of isolating the mixture coming from the 
reactor into various purified components is to use a dis- 
tillation column. The (hot gaseous) input stream is fed 
into the bottom and on the way to the top goes through 
a series of trays having successively cooler tempera- 
tures. The temperature at the top is the coolest. Along 
the way, different components will condense at different 
temperatures and be isolated (with some statistical dis- 
tribution on the actual components). With vapors rising 
and liquids falling through the column, purified frac- 
tions (different chemical compounds) can be retrieved 
from the various trays. The distillation column is very 
important for the chemical industry because it allows 
continuous operation as opposed to a batch process and 
is relatively efficient. 

This distillation column problem contains nearly
7000 records and 23 potential input variables
(a mixture of flows, pressures, and temperatures) in
addition to the quality metric and material balance.
The response variable is the concentration of
a purified component at the top of the distillation
tower. This quality variable needs to be modeled as
a function of relevant inputs only. The range of the
measured quality metric is very broad and covers most
of the expected operating conditions in the
distillation column.

To design the test data, we sorted the samples by their
response values and selected every third and seventh
sample for the validation set and every fourth and
eighth sample for the test set. The remaining points
formed the training set.
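
One possible reading of this splitting rule is sketched below. It assumes that positions are counted cyclically in blocks of eight along the sorted samples, so that positions 3 and 7 of each block go to validation and positions 4 and 8 to test; the text does not state the block length, so this is an interpretation, and the function name and record layout are hypothetical.

```python
def split_by_sorted_response(records, response_index=-1, block=8):
    """Split records into train/validation/test after sorting by response.

    Assumption: 1-based positions within consecutive blocks of
    `block` samples; 3 and 7 -> validation, 4 and 8 -> test.
    """
    ordered = sorted(records, key=lambda r: r[response_index])
    train, validation, test = [], [], []
    for i, rec in enumerate(ordered):
        pos = i % block + 1
        if pos in (3, 7):
            validation.append(rec)
        elif pos in (4, 8):
            test.append(rec)
        else:
            train.append(rec)
    return train, validation, test

# Toy data: (input, response) pairs with the response in the last column.
records = [(x, float(x)) for x in range(16)]
train, validation, test = split_by_sorted_response(records)
```

Because the samples are sorted by response before splitting, all three sets span the full response range, which matches the intent stated in the caption of Fig. 57.7.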

Many input variables in the data are heavily cor- 
related. Because symbolic regression can deal with 
correlated variables, we used all 23 inputs in the first 
round of modeling to perform initial variable impor- 
tance analysis. 

The workflow that follows exploratory data analysis 
is described below: 


1. Initial modeling: We allocated 2 hours of computing
time on a quad-core machine to perform 24 independent
20-minute runs of symbolic regression by genetic
programming using Evolved-Analytics' DataModeler
[57.14]. All symbolic regression runs used basic
arithmetic operators augmented by negation and square
as primitives. All models were stored on disk, and all
other settings were left at the defaults of the
symbolic regression function of [57.14]. In total, more
than 3000 symbolic regression models were generated
during the 24 independent runs.

2. Variable importance analysis: Presence-based
importances were computed for all models. Figure 57.5
demonstrates that only a handful of variables are
identified as drivers ([57.14] suggests using an
importance threshold of 20%).

3. Variable combination analysis: All developed models
were analyzed for dimensionality and most frequent
variable combinations. In Fig. 57.6, one can see model
subsets niched according to constituting variable
combinations. The bottom graph suggests that variables
colTemp1, colTemp3, and colTemp5 might be sufficient
for describing the response, since they cover the knee
of the Pareto front in complexity vs. accuracy space.

4. Variable contribution analysis: Models were
simplified by identifying and eliminating the least
contributing variable. Variable combination analysis
was repeated for the simplified models and resulted in
identifying colTemp1 and colTemp3 as new candidates for
a sufficient subspace.

5. New runs performed on a subset of input variables
identified as drivers: A new batch of independent
symbolic regression runs was applied to the same data,
using only colTemp1 and colTemp3 as the candidate input
variables. As expected, models generated in this
experiment demonstrated that the same
complexity-accuracy tradeoffs can be achieved in
a two-dimensional rather than a 23-dimensional input
space.

Fig. 57.5 Variable presence in developed symbolic
regression models

6. Ensemble generation using developed models and
a validation set: The final model ensemble was
generated automatically using the developed symbolic
regression models and the validation data set. It was
augmented by quadratic and cubic models on the two
variable drivers.

7. Ensemble prediction validation using test data:
Ensemble prediction and ensemble disagreement were
finally evaluated on the test data. The initial
requirement that the prediction error not exceed 5-7%
of the standard deviation was met by all ensemble
models. Ensemble prediction is graphed in Fig. 57.7.
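
The presence-based importance used in step 2 can be sketched in a few lines. The sketch assumes that importance is simply the fraction of models in which a variable appears and that a driver must exceed the 20% threshold strictly; the model contents and the names flow2 and pressure1 are made up, and this is not the DataModeler implementation.

```python
def presence_importance(models, variables):
    """Fraction of models in which each variable appears."""
    return {v: sum(v in m for m in models) / len(models) for v in variables}

# Each model is represented only by the set of inputs it uses (toy data).
models = [
    {"colTemp1", "colTemp3"},
    {"colTemp1", "colTemp3", "colTemp5"},
    {"colTemp1"},
    {"colTemp3", "flow2"},
    {"colTemp1", "colTemp3"},
]
variables = ["colTemp1", "colTemp3", "colTemp5", "flow2", "pressure1"]

importance = presence_importance(models, variables)
drivers = [v for v in variables if importance[v] > 0.20]
```

In this toy set only colTemp1 and colTemp3 pass the threshold, mirroring the kind of reduction from many candidate inputs to a few drivers that the workflow aims for.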


This example demonstrates the use of a good model 
development workflow. An ensemble similar to the 
one described here has been deployed for controlling 
a gas chromatography measurement in a real distilla- 
tion column. 


57.5.3 Sensory Evaluation Workflow 
for Research Analytics 


A flavor design case study is an example of a more
specialized workflow [57.21]. In sensory evaluation,
scientifically designed experiments are used to define
a small set of mixtures that can be presented
aromatically to evaluators to identify the ingredients
that drive the hedonic response (positively or
negatively) of a target panel of consumers. Each
panelist is asked how much they like the flavor, on
a 9-point scale ranging from like extremely to dislike
extremely. Details of the study can be found in
[57.21]. Our focus here is the workflow that allowed us
to evaluate the consistency of liking preferences in
the target population and to gain insight into how to
design or identify flavors that most consumers would
consistently like.

The data for this project was provided by Givaudan
Flavors Corp. It falls into the category of precious
data. It consists of sensory evaluation scores of
36 mixed flavors containing seven ingredients,
evaluated by 69 human panelists. In other words, the
data has seven input variables (flavor ingredients),
36 records (flavors), and 69 response measurements per
record (Fig. 57.8).

Because of the high variability of response values per
flavor, panelist responses were modeled individually.
Because transparent and diverse input-response models
were required to approximate this challenging data set,
modeling was done using ensemble-based symbolic
regression.

Fig. 57.6 Complexity-accuracy tradeoffs for most
frequent variable combinations in the distillation
column example

Fig. 57.7 Prediction of the final ensemble of symbolic
regression models on test data. All models seem to
agree on the unseen test data set. This should not be
surprising, because the training, validation, and test
sets were designed to cover the full range of operating
conditions

For each panelist, a standard workflow was applied to
identify the driving ingredients that cause changes in
the panelist's liking [57.22].

Fig. 57.8a-d Example of panel segmentation by
propensity to like from [57.22]: (a) decision regions
for evaluating cumulative distribution for liking score
density model, (b) hard to please panelists,
(c) neutral panelists, (d) easy to please panelists

When developed, model ensembles predicting individual
responses could be bootstrapped to a richer set of
virtual mixtures (tens of thousands of flavors instead
of the available 36). The bootstrapped responses were
used to cluster the target population into three
segments: easy to please, i.e., (cyber)individuals who
consistently give high ratings to most flavors; hard to
please, i.e., individuals who consistently use a low
range of scores for all flavors; and neutral panelists,
whose preferential range is centered around the medium
score (neither like nor dislike). Such segmentation of
the target population by people's propensity to like
products turned out to be very useful in several other
applications beyond flavor design. It focuses product
development by giving insight into the fundamental
variability in the preferences of the target audience.
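
A deliberately simplified sketch of such a segmentation is given below. It assigns each (virtual) panelist to a segment by the mean of the predicted scores on the 9-point scale. The cut-offs 3.5 and 6.5, the panelist names, and the scores are all hypothetical, and the study itself uses decision regions on the cumulative distribution of the liking score density (Fig. 57.8a) rather than a plain mean.

```python
from statistics import mean

def segment_panelist(scores, low=3.5, high=6.5):
    """Assign a segment from predicted liking scores on a 1..9 scale.

    Assumption: thresholds on the mean score; `low` and `high`
    are illustrative values, not taken from the study.
    """
    avg = mean(scores)
    if avg >= high:
        return "easy to please"
    if avg <= low:
        return "hard to please"
    return "neutral"

# Predicted scores of three hypothetical panelists over five virtual flavors.
predicted = {
    "panelist_A": [8, 7, 9, 8, 7],
    "panelist_B": [2, 3, 1, 2, 3],
    "panelist_C": [5, 4, 6, 5, 5],
}
segments = {name: segment_panelist(s) for name, s in predicted.items()}
```

In practice the scores would come from each panelist's model ensemble evaluated on the tens of thousands of virtual mixtures mentioned above.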

The standard workflow for variable importance
estimation, applied to model ensembles forecasting the
scores of individual panelists, also allowed us to
segment the target population by ingredients that drive
liking in the same direction. Such segmentation of the
consumer market, combined with cost analysis for new
product design, offers visualization and analysis of
beneficial tradeoffs for product specialization.

The third outcome of this study was the development of
a model-guided optimization workflow for designing
optimal virtual mixtures. Multi-objective optimization
using swarm intelligence was used to find tradeoffs in
the flavor design space that simultaneously maximize
the average liking score and minimize the variance in
liking across virtual panelists.

Such a model-guided optimization workflow, combined
with the standard ensemble-based modeling workflow,
presents a strong motivation for the development of
a targeted data collection system for designing new
products.

We should point out that despite the very custom design
and specialized domain of sensory evaluation in food
science, the workflow could successfully be applied in
the very different domain of video quality prediction.
Ensemble-based symbolic regression was used to model
the perceived quality of perturbed video frames, and
the results were used to predict customer satisfaction
and to segment the representative population of video
viewers by propensity to notice perturbations and
sensitivity to particular perturbations [57.23].


57.6 Conclusions


In this chapter, we discussed how computational
intelligence leads to predictive analytics to produce
business impact. We identified three main areas of
predictive analytics: business analytics, which deals
mainly with visualization and forecasting; process
analytics, which aims to improve optimization and
control of manufacturing processes; and research
analytics, which aims at speeding up and improving
product and process design. All three areas have the
potential to save and earn many millions of dollars but
deal with very different data sources, context,
information content, amount of available domain
knowledge, and time and prediction requirements for
value generation. Driven by different motivations, the
areas consequently employ different predictive modeling
methods.

We presented several predictive modeling methods in the
context of different prediction requirements, solution
development, and deployment constraints. We emphasized
that there is no single method that fits all problems,
but rather there is a continuum of methods, and each
problem dictates selection of a method by specific time
requirements and the amount of available a priori
subject-matter knowledge (Fig. 57.3).


We stressed the importance of good and stable pre- 
dictive modeling workflows for success in CI projects 
and provided several examples of such workflows for 
process and research analytics, illustrating that re- 
search analytics projects require highly customized 
approaches. 

We point out that successful CI projects are amplifiers
that necessarily keep the human in the loop and vastly
enhance her or his capabilities. Because of this,
integrating CI into the various process and business
workflows is essential.

It is clear that our ability to generate data, as well
as our ability to analyze it and produce actionable
knowledge, is quickly expanding. The challenge remains
how to develop scalable CI algorithms that keep up with
the ever-rising tide of data, given that computational
advances in hardware (massive parallelization,
exa-scale computing) are developing at a much faster
pace than the CI algorithms.

A question that still puzzles us is: Can we get more
intelligence with more computational power, and where
(and whether) does it stop? Undoubtedly, the right
answer lies in the development of new algorithms that
can tackle the new challenges: advanced material
design, problems in bio-informatics, complex-system
modeling in social sciences, and social networks. We
expect the largest impact of predictive modeling to
happen in the areas of research and process analytics,
in the design of new products and new processes.
Examples of design problems that can be assisted by
data-driven CI methods for research analytics are the
development of advanced materials: photovoltaic cells,
alternative fuels, bio-degradable replacements for
paints and plastics, composite materials, and
sustainable food sources. On the process analytics
side, we would like to see CI methods used for
optimization of water purification, emission control in
combustion processes, simulation-based optimization of
social events on a world scale (terror attacks,
revolutions, pandemic spread), efficiency optimization
of manufacturing cycles, garbage minimization, and
recycling.

It cannot be stressed enough that the dynamics around
CI are changing: instead of being an optional addition
to the arsenal of problem-solving tools and methods, CI
is becoming indispensable for dealing with and making
progress on this new breed of real-world problems. The
only way for CI practitioners to bring CI to prime time
is to develop scalable algorithms, proliferate good
workflows, and implement them in great applications.


58. Solving Phase Equilibrium Problems
by Means of Avoidance-Based Multiobjectivization


[...] real-world engineering optimization problems with
a certain characteristic. Despite their low
dimensionality, finding the desired optima is difficult
as their basins of attraction are small and surrounded
by the much larger basin of the global optimum, which
unfortunately resembles a physically impossible and
therefore unwanted solution. We tackle such problems by
means of a multiobjectivization-assisted multimodal
optimization algorithm which explicitly uses problem
knowledge concerning where the sought solutions are not
in order to find the desired ones. The method is
successfully applied to three phase-equilibrium
problems and shall be suitable also for tackling
difficult multimodal optimization problems from other
domains.


58.1 Coping with Real-World Optimization Problems
58.2 The Phase-Equilibrium Calculation Problem
58.3 Multiobjectivization-Assisted Multimodal Optimization: MOAMO
58.3.1 Basics of Multiobjective Optimization
58.4 Solving General Phase-Equilibrium Problems
58.4.1 Ternary Liquid-Liquid Equilibrium: Water/Methanol/MMA
58.4.2 Three Phase Equilibria: Water/MMA and Water/Furfural
58.4.3 Obtaining the Phase Diagrams
58.5 Conclusions and Outlook
References


58.1 Coping with Real-World Optimization Problems 


A multitude of methods from within and beyond evo- 
lutionary computation (EC) has been applied to real- 
valued multimodal optimization problems. These are 
generally considered the harder, the more basins of 
attraction they contain, and the less smooth the fit- 
ness landscape is. Additionally, a search space that 
extends over a large number of dimensions is said to 
complicate search for the desired global or good local 
optima [58.1]. 

However, in a real-world setting, even
a low-dimensional problem may turn out to be quite
difficult. This can stem from different factors, one of
which would be a very small extent of the basins that
contain the sought optima. Figure 58.1 visualizes the
fitness landscape of an optimization problem that
possesses this property. The application background
will be detailed in Sect. 58.2, but for now it suffices
to know that there are only two variables a and b, and
that the desired minima (function values do not depend
on variable order and are thus symmetric to the main
diagonal) are located near (0.650, 0.001) and
(0.001, 0.650), respectively. It is easy to see that
the appropriate basins are small; in the figure, they
are hardly recognizable at all.

Another complicating factor would be uncertainty about
the relative target function value of the sought
optima. If it is not a priori known whether we are
looking for a global or only a certain local optimum,
there is no way around enumerating all existing optima
and choosing the right solution out of these afterward.
Such difficulties may occur in cases where it is not
possible to integrate the whole available
application-specific knowledge into the established
target function, i.e., if its value must be obtained by
simulation and the existing simulation tool is not able
to represent all important features of the real system.

Fig. 58.1a,b Visualizations of the two-dimensional
example problem. In the bottom panel, the search space
is transformed by a square root. The desired optima are
marked with white dots. Note that the diagonal consists
of globally optimal but undesired (trivial) solutions

Obviously, there are several workarounds to overcome
the difficulties imposed by this problem:


• Applying a transformation to the search space, so
that the local optima at the lower boundaries occupy
more space. This is shown in Fig. 58.1.

• Only initializing the optimization algorithm with
solutions on the boundaries of the search space. In
this case, we sometimes start from very near to the
local optima, and thus have a higher chance to find
them.

• Exploiting the symmetry of the landscape by a special
representation. This can be done by enforcing a > b and
would help, e.g., recombination operators of
evolutionary algorithms (EAs).
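
The first and third workaround can be illustrated with a small sketch. It assumes that both variables are scaled to [0, 1], that the optimizer works in square-root coordinates (u, v) with a = u², b = v² (the transformation named in the caption of Fig. 58.1), and that symmetry is exploited by always representing a point with the larger coordinate first. The function names are hypothetical, and a >= b is used instead of the strict a > b so that diagonal points remain representable.

```python
import math

def to_problem_space(u, v):
    """Map search coordinates (u, v) in [0, 1]^2 to (a, b) = (u^2, v^2)."""
    return u * u, v * v

def to_search_space(a, b):
    """Inverse mapping: square-root transform of the problem variables."""
    return math.sqrt(a), math.sqrt(b)

def canonical(a, b):
    """Exploit f(a, b) = f(b, a): represent each point with a >= b."""
    return (a, b) if a >= b else (b, a)

# The strip b < 0.01 covers 1% of the b-axis, but 10% of the v-axis.
share_before = 0.01
share_after = math.sqrt(0.01)

# Both desired minima collapse onto one representative.
rep = canonical(0.001, 0.650)
u, v = to_search_space(*rep)
```

The tenfold widening of the boundary strip is what makes the small basins near the lower boundaries easier to hit in the transformed space.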


However, all these approaches depend on the location of
the desired optima. Any algorithm exploiting this
expert knowledge would necessarily show worse
performance on problems without these special features,
as predicted by the no free lunch theorem [58.2].
Instead, a more general method, which uses information
on where the desired optima are not, will be discussed
and evaluated in this chapter.

Many different EAs may be used to tackle this global or
multimodal optimization problem because they are able
to detect several optima simultaneously or
subsequently. The latter may be achieved by multistart
approaches such as sequential niching [58.3], whereas
the former is established by means of diversity
maintenance. That is, candidate solutions of the search
populations are prevented from converging to the same
region by implicitly or explicitly keeping them apart
[58.4]. Prominent examples are crowding [58.5] and
fitness sharing [58.6], and their successors. More
recent approaches include, but are not limited to,
UEGO [58.7], clearing [58.8], species conservation
[58.9], clustering-based niching [58.10], and cellular
EA (CEA) [58.11]. Although there is no commonly
accepted formal definition of what a niching method is
[58.12], most of these algorithms may be subsumed under
the term niching EA. They all use the distance between
candidate solutions (diversity) as an implicit
criterion which shall be maximized.

However, nothing prevents us from utilizing a diversity
criterion directly. A step in this direction has been
taken in the shifting balance GA [58.13]. But although
it employs a separate diversity evaluation via
subpopulation distance computation, it finally resorts
to a single objective by weighting the distance and
target function values.

In [58.14], we established a more radical approach: we
employ diversity in search space as an additional
objective and treat the resulting combined problem with
an evolutionary multiobjective algorithm (EMOA). The
expected benefit is twofold:


• It enables placing solution candidates in basins that
would otherwise go unnoticed due to their small size.

• We obtain a good overview of the available
interesting search space regions in a single run.
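
The idea of treating diversity as a second objective can be sketched as follows. The sketch assumes that diversity is measured as the Euclidean distance to the nearest other candidate and that both objectives are minimized, so the distance enters with a negative sign. The measure, the function names, and the placeholder target function are illustrative assumptions and not the formulation of [58.14].

```python
import math

def nearest_neighbor_distance(point, population):
    """Distance from `point` to the closest other member of the population."""
    return min(math.dist(point, other) for other in population if other is not point)

def biobjective(population, target):
    """Return (target value, negated nearest-neighbor distance) per candidate.

    Both components are to be minimized by the multiobjective algorithm.
    """
    return [(target(p), -nearest_neighbor_distance(p, population)) for p in population]

def sphere(p):
    """Placeholder target function, not the one used in the chapter."""
    return sum(x * x for x in p)

population = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0)]
objectives = biobjective(population, sphere)
```

Here the isolated candidate (1.0, 1.0) has the worst target value but the best diversity value, so it is not dominated by the two crowded candidates and would survive selection, which is how candidates can be kept in small, remote basins.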


As we presume that this approach is not only applicable
to the thermodynamic problems treated in this work but
also to real-valued engineering problems with similar
properties, it is also followed and further extended
here. Other related multiobjectivization approaches are
discussed in Sect. 58.3 after introducing the problem
context.


58.2 The Phase-Equilibrium Calculation Problem 


The knowledge of phase equilibria is required for the 
design and optimization of separation processes which 
are essential parts of typical chemical plants. The aim of 
a phase-equilibrium calculation is to quantitatively re- 
late the variables (in particular, temperature T, pressure 
p, and mole fraction x) which describe the state of equilibrium
of two or more homogeneous phases [58.15].

In any problem concerning the equilibrium distributions
of $k$ components between two phases, one must
always begin with the equality of the chemical potential

$$ \forall i \in \{1, \ldots, k\}: \quad \mu_i' = \mu_i''. \tag{58.1} $$


To establish the relation of $\mu_i$ to $T$, $p$, and $x_i$, it is
convenient to introduce a certain auxiliary function such as
the fugacity coefficient $\varphi_i(T, p, x_i)$, which can be
calculated by a thermodynamic model. (We use the domain-specific
notation with the upper index denoting different
phases and the lower index standing for separate substances.)
Then, (58.1) can be rewritten as

$$ \forall i \in \{1, \ldots, k\}: \quad x_i' \varphi_i' = x_i'' \varphi_i''. \tag{58.2} $$

Typically, the calculation is performed at constant temperature
and pressure, and the remaining concentrations
$x_i'$ and $x_i''$, respectively, are to be found. The fugacity
coefficient $\varphi_i$ of component $i$ in the mixture is calculated
as

$$ \ln \varphi_i = \frac{\mu_i^{\mathrm{res}}}{RT} - \ln Z, \tag{58.3} $$

with $Z$ being the compressibility factor, defined as

$$ Z = \frac{pv}{RT}, \tag{58.4} $$


where v is the molar volume, and R is the gas constant. 


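The residual chemical potential $\mu_i^{\mathrm{res}}$ is given by

$$ \mu_i^{\mathrm{res}} = a^{\mathrm{res}} + RT(Z - 1) + \frac{\partial a^{\mathrm{res}}}{\partial x_i} - \sum_{j=1}^{k} x_j \left( \frac{\partial a^{\mathrm{res}}}{\partial x_j} \right), \tag{58.5} $$

where $(\partial a^{\mathrm{res}}/\partial x_i)$ is a partial derivative of the residual
Helmholtz energy with respect to the mole fraction
stated in the denominator, while all other mole fractions
are considered constant.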

The residual Helmholtz energy according to the 
perturbed chain statistical associating fluid theory 
(PC-SAFT) is considered as the sum of different con- 
tributions resulting from repulsion (hard chain), van 
der Waals attraction (dispersion), and hydrogen bond- 
ing (association) 


$$ a^{\mathrm{res}} = a^{\mathrm{hc}} + a^{\mathrm{disp}} + a^{\mathrm{assoc}}. \tag{58.6} $$


The detailed equations for each contribution can be 
found in [58.16] and [58.17]. 

Solving phase-equilibrium problems according
to (58.2) may lead to trivial solutions, i.e., $x_i' = x_i''$,
which are mathematically correct but have no physical
meaning (except at the so-called critical demixing
point). To avoid obtaining them, the initial guesses for
the minimization procedure must not be too far away
from the correct solutions, provided that the correct solutions
are known.

In the case of polymer solutions, initialization is 
very critical, because the concentration of the polymer 
in the solvent-rich phase can be in the magnitude of 
1072, which is a numerical challenge for simulation 
programs [58.18]. Another difficulty arises as the num- 
ber of components in the mixture increases. All these 
challenges point out the need for a robust algorithm to 
solve the phase-equilibrium calculation for an arbitrary 
number of components and phases, and which is also 
applicable to polymer solutions. 

Figure 58.1 actually shows a phase-equilibrium
problem, namely a simple two-component mixture of
water and pentanol. This type of liquid-liquid equilibrium
(LLE) data is necessary for the design and
optimization of liquid-liquid extractors and decanters.
The two variables correspond to the concentrations of
water in the water-rich phase (for the larger of the two)
and in the pentanol-rich phase (for the smaller one).
Under the assumption that $a > b$, and that $w$ stands for
water and $p$ for pentanol, we have $a = x_w'$ and $b = x_w''$.
The remaining mole fractions $x_p'$ and $x_p''$ can be obtained
indirectly as $x_p' = 1 - x_w'$ and $x_p'' = 1 - x_w''$, because for
every phase, the following equality holds:

$$ \sum_{i=1}^{k} x_i' = \sum_{i=1}^{k} x_i'' = 1. \tag{58.7} $$


For this two-component problem, two equations of type
(58.2) have to be satisfied, resulting in two error values
$e_w = |x_w' \varphi_w' - x_w'' \varphi_w''|$ and $e_p = |x_p' \varphi_p' - x_p'' \varphi_p''|$. A feasible
solution to the problem shall exhibit errors below
$10^{-10}$ due to practical requirements. In the following,
$e_w$ and $e_p$ are aggregated into a single target function
value by using the sum of squares, which is to be minimized
(note the vector notation)

$$ f(\mathbf{x}', \mathbf{x}'') = e_w^2 + e_p^2. \tag{58.8} $$
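To make the structure of the target function (58.8) concrete, the following minimal Python sketch evaluates it for a binary mixture. PC-SAFT is far too involved to reproduce here, so a two-suffix Margules activity-coefficient model with an arbitrarily chosen parameter A = 3 stands in for the fugacity coefficients. This substitution, all function names, and all numbers are assumptions made for illustration and are not taken from the chapter. The sketch shows that every trivial point (identical phases) yields a zero error, just like the nontrivial demixing solution.

```python
import math

A = 3.0  # hypothetical Margules parameter (assumption, not a PC-SAFT value)


def activity(x1):
    """Stand-in for the fugacity coefficients of components 1 and 2 in one phase."""
    x2 = 1.0 - x1
    return math.exp(A * x2 * x2), math.exp(A * x1 * x1)


def target(x1_a, x1_b):
    """Sum of squared isofugacity errors, analogous to (58.8).

    x1_a, x1_b: mole fraction of component 1 in the first and second phase.
    The second mole fraction follows from the summation condition (58.7).
    """
    g1_a, g2_a = activity(x1_a)
    g1_b, g2_b = activity(x1_b)
    e1 = abs(x1_a * g1_a - x1_b * g1_b)
    e2 = abs((1.0 - x1_a) * g2_a - (1.0 - x1_b) * g2_b)
    return e1 * e1 + e2 * e2


def symmetric_nontrivial_root(lo=1e-6, hi=0.4, iters=200):
    """Bisection on ln(s/(1-s)) - A(2s-1) = 0; valid only for this symmetric toy model."""
    def h(s):
        return math.log(s / (1.0 - s)) - A * (2.0 * s - 1.0)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if h(lo) * h(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


s = symmetric_nontrivial_root()
print(target(0.3, 0.3))    # trivial point: both phases identical
print(target(s, 1.0 - s))  # nontrivial demixing solution of the toy model
print(target(0.2, 0.8))    # an arbitrary non-equilibrium point
```

Because the error vanishes along the whole line of identical phases, a local optimizer started at a random point is easily drawn to a trivial solution, which is the difficulty the chapter addresses.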


Table 58.1 The sought optima at different temperatures

Mole fraction   40 °C        60 °C        90 °C
x_w'            0.74698      0.7097       0.65084
x_w''           0.00020913   0.00038142   0.00082809


In Fig. 58.1, (58.8) is modeled at a temperature of 
90°C, for which the sought optimum is located near 
the coordinates (0.650,0.001). As system properties 
change with temperature and pressure, the pursued 
optimum also moves through the search space. Table 58.1
lists approximate solutions for different
temperatures and a constant pressure of 1.0132 bar.
The trivial solutions are the only feature representative
of all phase-equilibrium problems. Thus, this is
the only information that shall be exploited in the
following.


58.3 Multiobjectivization-Assisted Multimodal Optimization: MOAMO 


As seen in Sect. 58.1, the optimization problem at hand 
is inherently multimodal. That is, local optimization 
schemes are only successful if started from a region 
near the desired nontrivial solution. To make things 
worse, the basin of attraction of the undesired triv- 
ial solutions may largely dominate the search space as 
found for the very simple LLE problem (two phases, 
two components: water/pentanol). Hitting the basin
of attraction of the desired solution can be very difficult,
and if this fails, the final outcome of
quasi-Newton or similar algorithms will be a trivial
solution.

Stochastic optimization methods like EAs and other
metaheuristics employ a more globally oriented optimization
scheme. Several attempts using these methods
have been made on equilibrium detection problems in
recent years, namely genetic algorithms (GA) and simulated
annealing in [58.19] or differential evolution (DE)
and tabu search (TS) in [58.20]. The algorithms have
been mostly used in their canonical form with some
parameter tuning and a concluding local optimization
step by means of a quasi-Newton method. Alternative
approaches applied artificial neural networks for learn- 
ing and predicting phase equilibria as in [58.21], the 
authors of which evolve the neural networks by means 
of genetic programming (GP), and [58.22], where the 
authors employ a real-coded GA to optimize initial 
weights and biases of the neural network before it is 
further refined using a quasi-Newton method. Where
enough training data is available, the binodal curves of
equilibria can be learned and predicted for the missing
areas.

Some recent metaheuristic attempts concentrate on 
the global (multimodal) nature of the optimization 
problem to find equilibrium points for rather difficult 
systems where global optima are located in relatively 
small basins. [58.23] use tabu search, [58.24] a ran- 
dom tunneling method, and [58.25] a DE hybrid with 
TS components. While we agree that looking elsewhere 
for even better solutions is mandatory for a multi- 
modal problem, it may be even more rewarding to 
obtain a good overview over large portions of the search 
space before climbing down into the individual optima. 
This has been attempted by using a refined version of 
the algorithm of [58.26] which has been applied to 
phase stability problems by [58.27]. The base algorithm 
GLOBAL has been developed further in [58.28]. As the 
latter methods start from a random sample, it may how- 
ever happen that either the initial sample is too small so 
that important optima are missed, or it is relatively large 
and thus costly. 

The optimization concept suggested in this work
therefore relies on an evolutionary multiobjective algorithm
(EMOA) approach in order to generate a spectrum
of possible near-optimal solutions before applying
a local search method to these. We term
it multiobjectivization-assisted multimodal optimization
(MOAMO). The method was successfully applied
to the two-phase, two-component water/pentanol system
in [58.14]. Here, we demonstrate that it is viable
for more complicated equilibrium problems with more
phases and components. Although not yet tried on
polymer problems, this ultimate goal seems to be within
reach as very small basins of attraction can be attained
reliably.

Fig. 58.2 The general concept of the MOAMO approach

Figure 58.2 shows the main concept of the 
MOAMO approach. The key idea is to use a population- 
based multiobjective algorithm as a preprocessing step 
for generating search points in the different basins of 
attraction of the tackled problem, the basin of the non- 
trivial optimum being among them. To do this, the 
practitioner first has to formulate an additional objec- 
tive function. This second objective is then employed 
to obtain good coverage of the search space despite 
the high attraction of certain areas. We label this type
of multiobjectivization avoidance-based because application
knowledge about where the sought optimum is
not located helps to transform the single-objective optimization
problem into a multiobjective one that is easier to
solve. More precisely, it enables detecting several different
basins, among them many that would most likely
have gone unnoticed with the single-objective approach
alone.

For this specific application, the distance to the trivial
solution (equal concentrations) is taken into account.
From then on, the system can work autonomously. In
the next step, the multiobjective optimization is carried
out. The obtained search points are then fed one
by one into a local optimization method until a satisfactory
nontrivial solution is found. For this local search,
only the original objective is relevant. We employed
the algorithm of [58.29] and the covariance matrix
adaptation evolution strategy (CMA-ES) of [58.30] for
this last step. The experimental results suggest that
especially the latter seems well suited for the task.
However, one may resort to another method here (e.g.,
quasi-Newton or similar standard optimization algorithms
as described in [58.31]) if it is deemed more
appropriate. To avoid superfluous local optimization
steps on candidate solutions that are close to each
other, this phase may be preceded by a clustering
step so that one tries a representative of each group
of solutions first and then proceeds in a round-robin
fashion.
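The second stage described above (feeding candidates one by one into a local search until a satisfactory nontrivial solution appears) can be sketched as follows. This is only an illustration under stated assumptions: a simple one-dimensional compass search stands in for the CMA-ES or a quasi-Newton method, the candidate list is hard-coded instead of coming from a multiobjective run, and the toy objective with a trivial minimum at 0 and a sought minimum at 2 is hypothetical.

```python
def compass_search(f, x, step=0.1, min_step=1e-12):
    """Very simple 1-D local search (stand-in for CMA-ES or quasi-Newton)."""
    fx = f(x)
    while step > min_step:
        for cand in (x - step, x + step):
            fc = f(cand)
            if fc < fx:
                x, fx = cand, fc
                break
        else:
            step *= 0.5  # no improvement in either direction: refine the step
    return x, fx


def first_nontrivial(candidates, f, is_trivial, tol=1e-10):
    """Run the local search from each candidate in turn and return the
    first satisfactory nontrivial solution, or None if there is none."""
    for start in candidates:
        x, fx = compass_search(f, start)
        if fx < tol and not is_trivial(x):
            return x
    return None


def toy_objective(x):
    """Hypothetical error function: trivial minimum at 0, sought minimum at 2."""
    return x * x * (x - 2.0) ** 2


def is_trivial(x):
    return abs(x) < 1e-6


# Candidates as they might be delivered by the multiobjective first stage.
solution = first_nontrivial([0.3, 0.6, 1.7], toy_objective, is_trivial)
print(solution)
```

Only the original (single) objective enters this stage; the distance objective has done its work once the candidates are spread over different basins.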

The idea of simplifying a difficult single-objective
problem by a multiobjective approach has some precursors
in evolutionary computation and has been termed
multiobjectivization by [58.32]. The approach can
be divided into two general categories, namely
multiobjectivization by adding objectives and
multiobjectivization by the decomposition of a scalar objective
function.

For the latter one, it can be proven that the approach
can only decrease the number of local optima [58.33].
It was, for example, successfully applied
to protein structure prediction problems in [58.34, 35].
MOAMO belongs to the category of multiobjectivization
by adding objectives. No theoretical guarantees of
benefits can be given [58.36] for this approach, but
nonetheless it has already been tried in several different
ways [58.37-40]. However, these applications
somewhat remain in the tradition of evolutionary multiobjective
algorithms that already contain diversity-preserving
mechanisms. The second objectives suggested
all refer to the current population or single individuals
thereof and do not take characteristics of the actual
problem into account. MOAMO differs strongly: instead
of a population-relative objective, it employs an absolute
distance objective, namely the distance to the known
trivial solutions. The MOAMO approach is therefore
especially well suited to phase equilibrium problems,
as the fugacity equations do not allow one to conclude
directly where the sought solution is, but at least they
provide information on where it is not. It has been
demonstrated in [58.14] that by using the multiobjective
EA as a preprocessing step, the important basin
can be located with a much smaller number of function
evaluations than would be needed by sampling
the search space randomly, even if the basin is very
small.

Fig. 58.3 (a) Pareto dominance for minimization: of the four
points shown, $p_1$ and $p_2$ are non-dominated, whereas $p_3$ and
$p_4$ are dominated. (b) A non-dominated front between
objectives $f_1$ and $f_2$, consisting of points $p_1$ to $p_4$.
$c_1$ to $c_4$ denote the hypervolume contribution of each point
(the space not covered by any other point) against the
reference point $r$

Fig. 58.4 Working scheme of the SMS-EMOA. Termination is
done according to predefined conditions, e.g., a certain budget of
fitness evaluations

In the following, basic EMOA concepts are sum- 
marized and the particular multiobjective optimization 
algorithm employed in our experiments is introduced, 
namely the SMS-EMOA by [58.41] and [58.42]. 


58.3.1 Basics of Multiobjective Optimization 


Multiobjective optimization fundamentally relies on
Pareto dominance. A point in the objective space of
two or more objective functions is dominated if there
is at least one other point that is not worse in any
objective and better in at least one (Fig. 58.3a). As
the optimization progresses, the population approaches
the Pareto front, which represents the set of optimal
compromises and consists of non-dominated points
only.
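As a small illustration of the dominance relation for minimization, the following Python sketch tests dominance between objective vectors and filters a set down to its non-dominated points. The function names and the four sample points are chosen here for illustration only.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def nondominated(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


points = [(1.0, 4.0), (3.0, 1.0), (2.0, 5.0), (4.0, 4.5)]
print(nondominated(points))  # → [(1.0, 4.0), (3.0, 1.0)]
```

Here (2.0, 5.0) is dominated by (1.0, 4.0), and (4.0, 4.5) is dominated by both remaining points, so only two points survive.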

Several criteria exist to judge the quality of whole
populations within the algorithm run (as a means to
determine the next search steps) and thereafter to assess
optimization success. One of the most popular
is the hypervolume, the amount of objective space
covered by the population with regard to a reference
point, as illustrated in the right panel of
Fig. 58.3.
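For two objectives, the hypervolume and the exclusive contribution of each point can be computed by slicing the dominated region into rectangles. The sketch below assumes minimization, a set of mutually non-dominated points, and a reference point that is worse than every point in both objectives; the names and sample data are illustrative.

```python
def hypervolume_2d(front, ref):
    """Area dominated by a mutually non-dominated 2-D front, bounded by ref."""
    area, prev_f2 = 0.0, ref[1]
    # Sorting by f1 in ascending order yields descending f2 on a non-dominated front.
    for f1, f2 in sorted(front):
        area += (ref[0] - f1) * (prev_f2 - f2)
        prev_f2 = f2
    return area


def contributions(front, ref):
    """Hypervolume that is lost when each point is removed on its own."""
    total = hypervolume_2d(front, ref)
    return [total - hypervolume_2d(front[:i] + front[i + 1:], ref)
            for i in range(len(front))]


front = [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0)]
ref = (5.0, 5.0)
print(hypervolume_2d(front, ref))  # → 11.0
print(contributions(front, ref))   # → [1.0, 4.0, 1.0]
```

The middle point contributes most because it covers the largest area that no other point covers.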

The S-metric selection evolutionary multiobjective 
algorithm (SMS-EMOA) is a further development of 
the popular NSGA2 (nondominated sorting genetic al- 
gorithm 2) by [58.43]. Figure 58.4 displays its major 
steps. Starting from a usually randomly placed popu- 
lation, a loop begins with deriving one new individual 
(search point) and adding it to the population. The 
domination count of each individual is computed by 
counting how many other individuals dominate it. If 
such dominated individuals exist, the one with the 
largest domination count is deleted. Otherwise, the 
hypervolume contribution of each individual is deter- 
mined (Fig. 58.3b), and the individual with the smallest 
contribution is deleted. If the current state does not ful- 
fill the termination criterion (e.g., a predefined budget 
of function evaluations) the loop starts over. After ter- 
minating, the remaining population is the result set. 
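The loop described above can be sketched in Python for the two-objective case. This is a minimal illustration rather than the reference implementation of [58.41]: the Gaussian mutation, the clipping to [0, 1], the toy problem, and all names are assumptions, and the hypervolume routine only handles two objectives with a reference point that bounds all objective values.

```python
import random


def dominates(a, b):
    """True if a Pareto-dominates b (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def hypervolume_2d(front, ref):
    """Area dominated by a mutually non-dominated 2-D set, bounded by ref."""
    area, prev_f2 = 0.0, ref[1]
    for f1, f2 in sorted(front):
        area += (ref[0] - f1) * (prev_f2 - f2)
        prev_f2 = f2
    return area


def index_to_remove(objs, ref):
    """Selection step: choose the individual that leaves the population."""
    counts = [sum(dominates(q, p) for q in objs) for p in objs]
    if max(counts) > 0:
        return counts.index(max(counts))  # largest domination count
    total = hypervolume_2d(objs, ref)
    contrib = [total - hypervolume_2d(objs[:i] + objs[i + 1:], ref)
               for i in range(len(objs))]
    return contrib.index(min(contrib))  # smallest hypervolume contribution


def sms_emoa(evaluate, population, ref, budget, sigma=0.1, seed=0):
    """Steady-state loop: add one offspring, then delete one individual."""
    rng = random.Random(seed)
    pop = [(x, evaluate(x)) for x in population]
    for _ in range(budget):
        parent = rng.choice(pop)[0]
        # Gaussian mutation clipped to [0, 1] is an assumption of this sketch.
        child = [min(1.0, max(0.0, xi + rng.gauss(0.0, sigma))) for xi in parent]
        pop.append((child, evaluate(child)))
        del pop[index_to_remove([o for _, o in pop], ref)]
    return pop


def toy(x):
    """Hypothetical bi-objective test problem on [0, 1]."""
    return (x[0] ** 2, (x[0] - 1.0) ** 2)


start = [[0.1 * i] for i in range(1, 6)]
final = sms_emoa(toy, start, ref=(1.1, 1.1), budget=200)
print(sorted(round(x[0], 3) for x, _ in final))
```

Since exactly one individual enters and one leaves per iteration, the population size stays constant, and the budget directly equals the number of additional function evaluations.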
58.4 Solving General Phase-Equilibrium Problems


We present the results of phase-equilibrium calculations 
for the three-component system water/methanol/MMA 
as well as for the three-phase systems water/MMA and 
water/furfural. The corresponding optimization prob- 
lems have four, three, and three decision variables. 
PC-SAFT uses statistical mechanics for its sim- 
ulation of thermodynamic systems and thus requires 
a calibration of some pure-component parameters and 
one binary parameter. The aim of this calibration is 
to achieve a consistency between the calculated phase 
equilibria and results of physical experiments. Car- 
rying out this task manually for a single substance 
takes up several days of work for a chemical engi- 
neer, although the data of the physical experiments 
are already available in the literature [58.44]. These 
data contain series of measurements of temperature, 
density, and pressure for the vapor and the liquid 
phase of each substance. Among the several parameters
that model the molecular properties, there are
five per substance that have to be estimated. These
are the number of sphere segments $m$, the segment
diameter $\sigma$, the segment energy parameter $\varepsilon/k$, an
association energy $\varepsilon^{A_iB_i}/k$, and the effective association
volume $\kappa^{A_iB_i}$. Two different association sites are
assigned to all the considered substances. If the substance
is non-self-associating, then the association energy
as well as the association volume are set to zero. Besides
the five (three) parameters per substance, the
model requires one parameter $k_{ij}$ that is characteristic
for each binary mixture. The respective values for
all these parameters were taken from [58.45,46] and
are summarized in Tables 58.2 and 58.3. The applicability
of PC-SAFT to model the mentioned systems
in good agreement with experimental data has been
shown in [58.46].

Table 58.2 PC-SAFT pure-component parameters for
considered components

Substance                   m       σ       ε/k       ε^{AiBi}/k  κ^{AiBi}
Water                       1.0656  3.0007  366.5121  2500.6706   0.0349
Methyl methacrylate (MMA)   3.0632  3.6238  265.6874  0           0.0349
Methanol                    1.5255  3.2300  188.9046  2899.4906   0.0352
Furfural                    4.1604  3.0204  270.0700  0           0.0349

Table 58.3 PC-SAFT binary parameters

Binary system    k_ij
Water/MMA        0
Water/methanol   -0.05
Water/furfural   -0.006
MMA/methanol     0

The following experiments show that the MOAMO
approach provides a reliable and fast tool for the detection
of equilibrium points which are difficult to find
with standard optimization tools such as a gradient or
quasi-Newton search.


58.4.1 Ternary Liquid-Liquid Equilibrium:
Water/Methanol/MMA


In Fig. 58.5, the ternary phase diagram of
water/methanol/MMA at 50 °C and 1.013 bar with two
liquid phases is shown. The calculation of the tie-lines
was performed for different fixed concentrations of
MMA in one liquid phase ($x_{\mathrm{MMA}}$), see Table 58.4, at
constant temperature and pressure.

Fig. 58.5 Phase diagram of water/methanol/MMA system at 50 °C
and 1.013 bar. The symbols are experimental data from [58.47]
and [58.48]. The line is the calculation result of PC-SAFT.


Pre-Experimental Planning
The first objective (58.9) is generated from the error values
output by PC-SAFT. These refer to the departure
from the equilibrium state between every two phases of
one component as given by (58.1).


$$ f_1(\mathbf{x}', \mathbf{x}'') = \sum_{i=1}^{3} \left| x_i' \varphi_i' - x_i'' \varphi_i'' \right| \tag{58.9} $$


Different formulations of the second objective for the 
SMS-EMOA were tried and several of them work well. 
Therefore, a generalization of the distance criterion for 
the two-component two-phase case in Sect. 58.1 was 
chosen. It measures the Euclidean norm of a vector 
of concentration differences (slightly shifted to allow 
for minimization) and is easily extendable for more 
components 


$$ f_2(\mathbf{x}', \mathbf{x}'') = \sqrt{2} - \left\lVert \mathbf{x}' - \mathbf{x}'' \right\rVert_2 \tag{58.10} $$
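The distance criterion can be written in a few lines of Python. The sketch reads (58.10) as the constant √2 minus the Euclidean distance between the two composition vectors; √2 is the largest possible distance between two mole-fraction vectors whose entries are non-negative and sum to one, so the value is zero for two different pure components and largest for the trivial solution. The function name is hypothetical.

```python
import math


def distance_objective(x_a, x_b):
    """Second objective: sqrt(2) minus the Euclidean distance of two phases.

    x_a, x_b: mole-fraction vectors of the two phases (same length).
    Smaller values mean that the phases are further apart.
    """
    return math.sqrt(2.0) - math.dist(x_a, x_b)


print(distance_objective((0.2, 0.3, 0.5), (0.2, 0.3, 0.5)))  # trivial solution: worst value
print(distance_objective((1.0, 0.0, 0.0), (0.0, 1.0, 0.0)))  # maximally separated phases
```

Because the function only needs the two vectors, it extends to more components without any change.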


Experimental Task 

The task for MOAMO in this experiment is to reliably
reach the sought optimum for all indicated MMA
concentrations; that is, the number of individuals converging
to the optimum in the local search phase shall
be considerably larger than 1 on average. Furthermore,
the MOAMO-based approach shall find the optimum
considerably faster than a naive multistart local search
procedure.


Setup 
For each of the concentrations indicated in Table 58.4,
MOAMO is run five times with 30 individuals in the
multiobjective first step. Each search point contained in
the last population is then optimized by a local search
procedure (CMA-ES is employed for this second step).
For each local search, it is recorded whether the undesired
trivial solution or the sought optimum is obtained
or whether the search did not converge. Other than population
size and run length (30 and 5000), the SMS-EMOA parameters
are chosen as in [58.41].

Table 58.4 MOAMO with 30 individuals, remaining population
put into local optimization and rate of success and
convergence to trivial solution, averaged over five runs.
Where the sum of optimum and trivial is below 30, some
local searches did not converge. The last column gives the
empirical success probabilities for random start points of
the local search

x_MMA   Optimum   Trivial   Success rate (%)
0.05    25.2      3.8       45.0
0.15    0.0       30.0      Del
0.25    5.8       24.0      3.6
0.35    19.6      10.2      3.9
0.45    25.8      4.0       BD
0.55    29.4      0.6       2.9
0.65    28.8      0.0       Bell
0.75    29.0      0.6       Bull
0.85    23.2      0.6       2.3

In order to perform a comparison, the local search 
procedure (CMA-ES) is started 1000 times for each 
MMA concentration from a randomized start point and 
the rate of success for converging to the sought optimum
is recorded. The CMA-ES terminates if progress
or adapted step sizes decrease below $10^{-12}$, as usual.


Observations 
Table 58.4 comprises the results for the MOAMO ap- 
proach and in comparison the success rates for the 
random start local search procedure. Run lengths of the 
CMA-ES are not given in detail, but mostly range be- 
tween 2600 and 5000 evaluations. 

For the MMA concentrations from 0.25 to 0.85,
both methods are consistent: MOAMO obtains the
sought optimum from at least 60% of the last popu- 
lation’s search points, while the success rates of the 
random start local search vary between 2 and 4%. How- 
ever, 0.05 and 0.15 are special cases: In the first case, 
the problem is obviously not that hard as the random 
start local search also detects the sought optimum often, 
and in the second case, the MOAMO approach com- 
pletely fails. 


Discussion 
The most striking result of the experiment is that hard- 
ness of the problem for the two compared approaches 
seems uncorrelated. An MMA concentration of 0.05 
is much more easily solved by the random start lo- 
cal search than any other, but the success rates for 
MOAMO do not reflect this. For 0.15, the opposite hap- 
pens as the problem poses average difficulties for the 
random start local search procedure, but is very hard 
for MOAMO. We conjecture that this is an exception as 
we are almost at the critical point here, where concen- 
trations in both phases differ less and less. Presumably, 
trivial solution and sought optimum are too similar to
be separated in the SMS-EMOA phase via the distance
objective. However, we can be satisfied with the
results for the other concentrations, where the MOAMO
approach reliably detects the sought optimum and is
much faster than the random start local search procedure,
even if the effort for the first (multiobjective)
phase is considered (which is on the order of one or two
local searches).
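A rough calculation illustrates the speed difference. If each local search succeeds independently with probability p, the number of searches until the first success is geometrically distributed with mean 1/p. The sketch below applies this to the success rates reported above (2 to 4 % for random starts, at least 60 % for MOAMO candidates); treating the MOAMO candidates as independent draws is a simplifying assumption made here.

```python
def expected_searches(p):
    """Mean number of independent local searches until the first success."""
    return 1.0 / p


def success_within(p, n):
    """Probability of at least one success among n independent local searches."""
    return 1.0 - (1.0 - p) ** n


for label, p in (("random start, 2 %", 0.02),
                 ("random start, 4 %", 0.04),
                 ("MOAMO candidate, 60 %", 0.60)):
    print(label, round(expected_searches(p), 1), round(success_within(p, 5), 3))
```

Under these assumptions, random restarts need on the order of 25 to 50 local searches, whereas MOAMO needs about two plus a first phase that costs roughly one or two further local searches.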


58.4.2 Three Phase Equilibria: 
Water/MMA and Water/Furfural 


We now turn to an application of the MOAMO ap- 
proach on 2 component/3 phase systems in order to 
detect the heteroazeotrope point (a 3-phase equilib- 
rium). The first objective is again obtained from the 
phase equilibrium equations and differs from the one 
chosen for the 3 component/2 phase system (58.9) in the 
number of relevant phase equations. Due to transitivity, 
four error values remain here. Additionally, a quadratic
form is chosen here instead of the absolute value form
used in the previous case, under the assumption that
the quadratic form simplifies the local optimization task
(quasi-Newton as well as evolutionary optimization
methods usually perform better in this case):

$$ f_1(\mathbf{x}', \mathbf{x}'', \mathbf{x}''') = \sum_{i=1}^{2} \left[ \left( x_i' \varphi_i' - x_i'' \varphi_i'' \right)^2 + \left( x_i'' \varphi_i'' - x_i''' \varphi_i''' \right)^2 \right]. \tag{58.11} $$


Fig. 58.6 Phase diagram of water/MMA system at 1
bar. The symbols are experimental data from [58.49]
and [58.50]. Lines are calculation results of PC-SAFT
(k_ij = 0) with the MOAMO approach

As for the previous system, it is necessary to determine
a suitable second (distance) criterion for the multiobjective
first step. However, for three phases, the approach
taken in [58.14] has to be generalized in a different
way than done for three components. Interestingly, our
preliminary test showed that it is sufficient to consider
only one component and its three phases to create a distance
criterion. We may use mutual phase concentration
differences of phases 1 and 2, 2 and 3, and 1 and
3 to aggregate an objective function. (Note that Euclidean
distances have been employed in the previous
section; however, our tests show that for the multiobjective
MOAMO step, the choice of the distance norm
itself is not very important and Manhattan distances as
used here are also sufficient.)


$$ f_2(\mathbf{x}', \mathbf{x}'', \mathbf{x}''') = 2 - \sum_{i=1}^{2} ( x \ldots \tag{58.12} $$


Alternatively, the phase concentration differences can
also be stated as three separate criteria, resulting in
a four-objective problem for the SMS-EMOA.

The following experiment will show whether the aggregated
formulation or the separate criteria are more
advisable.

The binary system water/MMA in Fig. 58.6 exhibits
heteroazeotropic behavior at 1 bar. According to the
phase rule, only one variable can be fixed to determine
the heteroazeotrope, in this case the pressure.
The temperature of the heteroazeotrope was found at
81.93 °C, and the MMA concentrations in the three
phases were 0.841826, 0.488033, and 0.002577.

The identification of the heteroazeotrope point for
water/furfural at 1 bar was more complicated than for the
previous system due to the fact that two sought water
concentrations are close to each other (0.911822 and
0.973374), see Fig. 58.7. The third water concentration
was found at 0.507017 and the heteroazeotrope temperature
was determined at 97.64 °C.




Fig. 58.7 Phase diagram of water/furfural system at 1 bar.
The symbols are experimental data (VLE and LLE) from [58.51].
Lines are calculation results of PC-SAFT (k_ij = -0.006) with
the MOAMO approach. The horizontal axis shows x_water


Pre-Experimental Planning 

Taking over the SMS-EMOA parameters (population
size and run length) from the previous experiment led
to unreliable behavior for the two systems tested
here. Seemingly, they are more difficult to solve than
the given three-component/two-phase system. Therefore,
the population size is doubled to 60 individuals and
the run length is slightly increased to 6000
evaluations.


Experimental Task 

As the last paragraph indicated that the problems in 
this section are even more difficult than that of
Sect. 58.4.1, there is no point in testing against random 
start local search again. Instead, it shall be determined if 
the aggregated (58.12) or the separate criteria approach 
(58.13) is more suitable for solving the problems with 
MOAMO. To enable a decision between the two, a sig- 
nificant difference in success rates is required. 


Setup 
For each of the two systems (water/MMA and water/furfural)
and each of the problem formulations (aggregated/separate),
30 MOAMO runs are performed
and the number of successes is recorded. A run is
successful if at least one of the local search steps obtains
the sought optimum; the number of successful
local searches is not recorded. As before, we employ
the combination of SMS-EMOA and CMA-ES. The
resulting values for the first objective function (computed
from the error values output by PC-SAFT) shall
be below $10^{-15}$ in this case, which requires modifying the
CMA-ES internal stopping criteria accordingly. Its initial
step size is set to 0.01. The SMS-EMOA parameters
are set as in the previous experiment except for population
size and run length, which are modified as documented
above.

Table 58.5 Success rates for detecting the heteroazeotrope
point via MOAMO approach under different formulation
of the distance criterion

System           Distance criterion   Success rate (%)
Water/MMA        Aggregated           100.0
Water/MMA        Separate             50.0
Water/furfural   Aggregated           O83
Water/furfural   Separate             36.7


Observations 
The number of successful runs is given in Table 58.5. 
The aggregated approach seems to consistently perform 
better than the one with separate criteria, and the success
rates hint at the second system posing more
difficulty than the first one.


Discussion 

Fortunately, the much simpler (aggregated) approach is 
also the more reliable for both systems. The much larger
objective space in the first phase (four instead of two
objective functions) obviously outweighs the benefits of
a correct mapping by far. Furthermore, for higher numbers
of phases, the number of objectives would grow
faster than linearly, so that in conclusion, the aggregated
approach is much more suitable than the one with separate
objective functions.


58.4.3 Obtaining the Phase Diagrams 


Once the heteroazeotrope point is detected, a phase di- 
agram of the system may be obtained by systematic 
exploration of the two-phase equilibria at different tem- 
peratures. We simply increase or decrease the temper- 
ature (which is a free variable for two-phase systems) 
by 1°C and take the solution for the last tempera- 
ture step as initial point for a local search (executed 
by the CMA-ES) on every binodal curve. Figures 58.6 
and 58.7 have been generated by means of this method. 
(Note that this is different from the common approach
of first detecting several two-phase equilibria by means of
a quasi-Newton method and then inferring
the heteroazeotrope point from these.)
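The temperature stepping described above is a simple continuation scheme: each local search is warm-started from the solution of the previous temperature. The following Python sketch illustrates the idea under stated assumptions. A one-dimensional compass search replaces the CMA-ES, and the toy objective, whose minimizer drifts linearly with temperature, is hypothetical and stands in for the PC-SAFT error function.

```python
def compass_search(f, x, step=0.01, min_step=1e-12):
    """Simple 1-D local search used here as a stand-in for the CMA-ES."""
    fx = f(x)
    while step > min_step:
        for cand in (x - step, x + step):
            fc = f(cand)
            if fc < fx:
                x, fx = cand, fc
                break
        else:
            step *= 0.5  # no improvement in either direction: refine the step
    return x


def trace_branch(objective, x_start, t_start, t_end, local_search, dt=1.0):
    """Follow one binodal branch by stepping the temperature in increments dt."""
    n_steps = int(round(abs(t_end - t_start) / dt))
    sign = 1.0 if t_end >= t_start else -1.0
    x, branch = x_start, []
    for k in range(1, n_steps + 1):
        t = t_start + sign * k * dt
        # Warm start: the previous solution is the initial point at the new temperature.
        x = local_search(lambda z, t=t: objective(z, t), x)
        branch.append((t, x))
    return branch


def toy_objective(x, t):
    """Hypothetical error function whose minimizer drifts with temperature."""
    return (x - (0.5 + 0.002 * t)) ** 2


branch = trace_branch(toy_objective, 0.66, 80.0, 85.0, compass_search)
for t, x in branch:
    print(t, round(x, 6))
```

Because consecutive solutions lie close together, each local search starts inside the correct basin, so the trivial solutions are no longer a concern once the first nontrivial point is known.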
58.5 Conclusions and Outlook


In this chapter, a multistage method named MOAMO 
(multiobjectivization-assisted multimodal optimiza- 
tion) was presented. It is especially designed for 
difficult multimodal direct search problems as arising 
in phase equilibrium detection. However, the method 
is very well applicable whenever some problem 
knowledge is available concerning where the global 
optimum is not. The experimental analysis, performed 
on three different systems with either three compo- 
nents and two phases or two components and three 
phases, has shown that the approach is reliable and 
fast. It outperforms random multistart local search 
by a large margin under nearly all tested conditions. 
Two important properties of the approach need to be 
emphasized: 


• Unlike many attempts to solve phase-equilibrium
problems by means of evolutionary or related al- 
gorithms, MOAMO utilizes known features of the 
problem to direct the search and thereby avoids 
spending too much effort in repeatedly approaching 
59. Modeling and Optimization
of Machining Problems 


Dirk Biermann, Petra Kersting, Tobias Wagner, Andreas Zabel 


In this chapter, applications of computational
intelligence methods in the field of production engineering
are presented and discussed. Although
a special focus is placed on applications in machining,
most of the approaches can be easily transferred to 
respective tasks in other fields of production engi- 
neering, e.g., forming and coating. The complete 
process chain of machining operations is consid- 
ered: The design of the machine, the tool, and the 
workpiece, the computation of the tool paths, the 
model selection and parameter optimization of the 
empirical or simulation-based surrogate model, 
the actual optimization of the process parameters, 
the monitoring of important properties during the 
process, as well as the posterior multicriteria de- 
cision analysis. For all these steps, computational 
intelligence techniques provide established tools. 
Evolutionary and genetic algorithms are commonly 
utilized for the internal optimization tasks. Model- 
ing problems can be solved using artificial neural 
networks. Fuzzy logic represents an intuitive way 
to formalize expert knowledge in automated de- 
cision systems. 


In production engineering and particularly in the field 
of machining, improvements in materials, coatings, 
tools, and machines continuously provide potentials for 
improving the processes. In order to exploit these poten- 
tials, however, optimal setups of the changing processes 
have to be found. Since modern production processes 
involve many complex subsystems, as well as preced- 
ing and subsequent steps, all these systems and steps 
have to be adapted for achieving the optimal result. 

In this chapter, it is shown that computational intel- 
ligence (CI) provides methods to assist in achieving this 
ambitious aim. A particular focus is on the applications 
of evolutionary computation (EC) in machining, but 
also artificial neural networks (NN) and fuzzy logic are 
considered. A comprehensive overview is presented by 
considering several subsystems, as well as the preceding 
and subsequent steps in the operating sequence. In 
this respect, the chapter complements the common surveys 
in the literature [59.1–5], which often focus only 
on the modeling and optimization of the actual process. 
In order to assist interested engineers in choosing 
a suitable method for their problem, the solutions of- 
fered by CI are structured according to the specific 
subproblems to be solved in a machining problem. To 
keep the big picture still apparent, these subproblems 
are integrated into the complete operating sequence in 
the following section. They are then discussed accord- 
ing to their chronological order in the sequence. The 
chapter is concluded with summarizing remarks on CI 
applications in the field of production engineering. 
59.1 Elements of a Machining Process 


An overview of the elements and steps to be consid- 
ered when optimizing a machining process is shown 
in Fig. 59.1. The focus of the considerations is the 
actual process. The results of this process, however, 
depend significantly on its elements, in particular on 
the mechanical properties and the dynamic characteristics 
of the machine, the geometry and the properties of 
the tools, as well as the layout of the workpiece, which 
determines the required machining operations. All these 
elements can be individually optimized to improve the 
results of the process. For the latter, often complex 
numerical control (NC) paths for the machines have 
to be generated using computer-assisted manufactur- 
ing (CAM) software. To accomplish this, a model of 
the final workpiece geometry is required. If no such 
model is available, e.g., after manual modifications of 
a prototype, CI-based methods can assist in computing 
an optimized workpiece model for the CAM software. 
However, even if a model is available, the NC paths 
computed by the CAM software can be far from opti- 
mal due to the complexity of the process, e.g., in five- 
axis milling operations. In this case, the subsequent op- 
timization of the position-dependent parameters of the 
NC code, such as the inclination angles α and β and 
the feed rate f [59.6], can significantly increase the 
efficiency of the process. 

Fig. 59.1 Overview of the elements and steps of an arbitrary 
machining process 

When all components of the actual process have been 
selected and fixed, the optimization of the 
adjustable process parameters can begin. CI-based 
techniques usually rely on a self-organizing 
process. In order to let the self-organization take 
effect, a high number of experiments is required. Since 
a real-world experiment involves high costs, it can be- 
come necessary to use a surrogate model on which 
the method is applied. In this case, however, additional 
problems have to be solved. It has to be decided which 
kind of model (empirical, analytical, physical, numerical) 
is applied and which type or realization of this kind 
of model is implemented. For example, an empirical model can 
be computed using artificial neural networks, Gaussian 
processes, or regression techniques. As soon as a model 
is chosen, the parameters of this model (internal coeffi- 
cients, material constants, etc.) have to be adapted with 
respect to the given application. This often represents an 
additional nonlinear optimization problem which can 
be solved using techniques of EC. 

Moreover, the robustness of the process can be 
increased by a monitoring-based process control. To 
accomplish this, dynamic characteristics of the pro- 
cess, such as acoustic emission signals and force 
measurements, are analyzed online and control op- 
erations are initiated as soon as these characteristics 
show suspicious patterns. In this kind of applica- 
tion, however, it is necessary to automatically detect 
what indeed is a suspicious pattern. Fuzzy logic and 
NNs have proven to be capable of performing these 
tasks. 
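
As a minimal sketch of how a fuzzy rule might flag a suspicious pattern, the following Python snippet combines a force signal and an acoustic emission signal. The membership breakpoints, the normalization of both signals to [0, 1], and the alarm threshold are illustrative assumptions and are not taken from the cited work.

```python
def ramp_membership(x, low, high):
    """Degree in [0, 1] to which x is 'high': 0 below low, 1 above high."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def suspicion_degree(force, emission):
    """Rule: IF force is high AND emission is high THEN pattern is suspicious.

    Both inputs are assumed to be normalized to [0, 1]. The fuzzy AND is
    realized by the minimum operator. All breakpoints are hypothetical.
    """
    force_high = ramp_membership(force, 0.5, 0.9)
    emission_high = ramp_membership(emission, 0.4, 0.8)
    return min(force_high, emission_high)


def control_action_required(force, emission, threshold=0.5):
    """Trigger a control operation once the rule fires strongly enough."""
    return suspicion_degree(force, emission) >= threshold
```

In practice, the rule base and the membership functions would be derived from expert knowledge about the specific process, and an NN could take over where such knowledge cannot be stated explicitly.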

A lot of information can be obtained in order to 
analyze the process and its results. This information 
can be gathered either by measurements during and 
after the process or by performing simulation studies. 
It usually forms the basis for the calculation 
of the actual objectives. In this context, machining 
processes have to be optimized with respect to several 
conflicting aims, e.g., a simultaneous minimization 
of tool wear and maximization of the material 
removal rate. Even if multiobjective optimization 
techniques are used, a lot of detail can be lost in this 
formalization step. Often the first version of the 
objectives does not lead to the desired results. Additional 
objectives have to be defined or preferences have to 
be integrated. In order to allow a deeper understanding 
of the process to be obtained and a refinement 
of the objectives to be made, an intuitive visualization 
and exploration of the detailed information is 
required. For this task, CI-based techniques can again be 
used. 
59.2 Design Optimization 


The optimal design of a machine, tool, or workpiece 
is a great challenge in the field of production engineering. 
The optimization task is often conducted as 
an iterative manual process which is based on expert 
knowledge and which can be very costly and time-consuming. 
Roy et al. [59.7] gave an extensive overview of 
the recent advances in automated and interactive design 
optimization. They presented a classification of the op- 
timization problems and discussed the most important 
optimization approaches and techniques. In the follow- 
ing subsections, examples of successful applications of 
CI for the optimization of machine, tool, and workpiece 
designs are provided. 


59.2.1 Optimal Design of Machines 


Designing machines necessitates the consideration of 
multiple objectives, such as geometric accuracy and 
costs. Liu and Liang [59.8], for instance, presented an 
approach combining a modified Chebyshev program- 
ming method for the scalarization of these objectives 
and a particle swarm optimization algorithm for evolv- 
ing the machine designs. They were dealing with re- 
configurable machine tools, so not only the process 
accuracy and investment costs of the machine layouts, 
but also the configurability was considered. Signifi- 
cant changes in the shape of the product could thus be 
easily adapted. Mekid and Khalid [59.9] discussed an 
optimization method based on a multiobjective genetic 
algorithm for the design of three-axis micromilling 
machines. They took user requirements (for example 
the workspace volume), axis positions, and geomet- 
ric errors of the machine into account. For the latter, 
they used a mathematical error model of the three-axis 
milling machines. 
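
To illustrate the idea of scalarizing several design objectives, the following sketch implements the plain weighted Chebyshev distance to an ideal point. It is not the modified variant used in the cited work. The design names, objective values, weights, and ideal point are hypothetical.

```python
def chebyshev_scalarization(objectives, weights, ideal_point):
    """Weighted Chebyshev distance of an objective vector to an ideal point.

    All objectives are assumed to be minimized.
    """
    return max(
        w * abs(f - z) for f, w, z in zip(objectives, weights, ideal_point)
    )


# Hypothetical machine designs: (geometric error, investment cost).
designs = {"A": (4.0, 10.0), "B": (2.0, 14.0), "C": (3.0, 11.0)}
ideal = (2.0, 10.0)    # best value observed for each objective
weights = (0.5, 0.5)   # assumed preference weights

# The design with the smallest scalarized value is preferred.
best = min(
    designs,
    key=lambda name: chebyshev_scalarization(designs[name], weights, ideal),
)
```

A single-objective optimizer such as a particle swarm algorithm can then work on the scalarized value. Varying the weights shifts the preferred compromise between the objectives.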


59.2.2 Tool Optimization 


Designing machining tools is a very difficult optimiza- 
tion task since not only complex geometries, but also 
different machining criteria have to be taken into ac- 
count [59.10]. Abele and Fujara, for example, presented 
a simulation approach for optimizing the drill geometry 
based on a genetic algorithm [59.11]. They considered 
not only the structural stiffness of the tool during 
their optimization run, but also the coolant flow 
resistance and the chip evacuation capability. 
They also defined the machinability, especially 
the grindability of the chip flute, as a constraint. In order 
to take all these criteria into account, different simulation 
approaches have to be used (Sect. 59.4). Abele 
and Fujara used, for example, the finite element method 
in order to analyze the structural stiffness. The cutting 
forces were computed using a semiempirical cutting 
force model. Additionally, a model of the grinding 
wheel had to be determined in order to evaluate the 
grindability of the optimized drill geometry. Another 
application was presented by Jared et al. [59.12], who 
integrated a GA into the computer-aided design software 
CATIA. In one of their case studies, the volume and the 
tip deflection of a cutting tool were minimized by 
automatically parameterizing the lengths of and angles between 
segments of a 2-D (two-dimensional) profile, which 
was then extruded to form the actual tool. 


59.2.3 Workpiece Layout Optimization 


The layout of products can usually be described as 
a multiobjective optimization problem. For example, the 
design of aerospace structures always faces a tradeoff 
between the stiffness and the weight of the 
products [59.13]. The layout of a cooling system, e.g., for 
a turbine blade [59.13], is a tradeoff between the 
machining quality, the cooling effect, and the production 
costs. Weinert et al. [59.14–17] developed a simulation 
system for optimizing the layout of mold temperature 
control systems in order to minimize the production cycle 
times and costs, and to maximize the product quality. 
They developed an efficient simulation system in order 
to evaluate the effect and homogeneity of the tempering 
of the design layout and to estimate the manufacturing 
costs [59.18]. Using fast but sufficiently accurate 
evaluation methods, a computer-aided optimization of the 
temperature control system based on multiobjective 
optimization methods, like NSGA-II [59.19] and 
SMS-EMOA [59.20], became possible [59.21–24]. Nevertheless, 
this optimization task is very complex and the 
engineer’s experience is still necessary. For this reason, 
Biermann et al. [59.25] combined the computer-aided 
optimization system with the possibility of user interaction 
so that a visual real-time manipulation of target 
functions is possible. Dürr and Jurklies [59.26] presented 
a fuzzy expert system in order to use the expert 
knowledge in a computer-assisted way. 
59.3 Computer-Aided Design and Manufacturing 


In the modern construction process, computer-aided de- 
sign (CAD) software is used for all design tasks — for 
example for the model of the workpiece. This model 
is the basis for the generation of the NC paths by 
CAM software. However, if only a physical prototype 
exists or manual modifications of the original model 
have been performed, methods to compute a respec- 
tive model are required. To accomplish this, the original 
object is scanned and a point-based representation is ob- 
tained. From this point data, a new CAD model has to 
be calculated or the original model has to be adapted. 
This process is called surface reconstruction or reverse 
engineering. 

When a model of the workpiece is available, 
NC paths can be generated based on CAM software 
for most machining processes. For complex five-axis 
milling processes, however, the results of standard 
CAM software are not always optimal with respect to 
the requirements of the specific machine and process. In 
this case, CI-based techniques can be used to improve 
the NC paths generated by the CAM software. 


59.3.1 Surface Reconstruction 


The optimization of the visual quality of triangulations 
with respect to different quality criteria was successfully 
performed using evolutionary algorithms by Weinert 
et al. [59.27]. Based on an initial triangulation, as 
provided by the software of the scanning system, edges 
were flipped in order to minimize the total length of all 
edges, the surface area, the sum of angles between nor- 
mals, and the total absolute curvature. It was found that 
the latter is best suited for generating visually smooth 
surfaces. 

Small tolerances in the representation of the origi- 
nal object, however, result in a huge number of required 
scan points. Current scanners are able to provide this 
dense and precise set of scan points, but the result- 
ing triangulations become very large and difficult to 
handle. Approximating triangulations tackle this prob- 
lem. The number of control points for the triangles 
is independent of the size of the point set and usu- 
ally considerably smaller than the number of scan 
points. Weinert et al. [59.28] documented the capabil- 
ities of a standard evolution strategy to optimize the 
control point positions of approximating triangulations. 
In order to avoid an uncontrolled expansion of the tri- 
angulation, balancing strategies based on mass-spring 
systems were integrated. 


Unfortunately, even approximating triangulations 
produce a nonsmooth surface and are therefore not con- 
venient for the later computation of NC paths. Nonuni- 
form rational B-splines (NURBS) [59.29] are another 
popular mathematical model for free-form surfaces 
in CAD software. The most important advantages of 
NURBS over triangulations are their smoothness, their 
compact definition, the possibility for an intuitive local 
manipulation, as well as the ability to combine NURBS 
patches to larger structures. Mehnen et al. [59.30, 31] 
applied an evolution strategy to the coordinates of the 
NURBS’s control points in order to minimize the dis- 
tance between the scan points and their projection to 
the NURBS. Wagner et al. [59.32] did the same using 
a real-valued genetic algorithm. They also proposed 
another distance indicator that is based on a sampling 
of the NURBS and that is much cheaper to evaluate. 
The use of the sampling-based distance measure in 
combination with an equation-solver-based hybrid 
real-valued genetic algorithm significantly reduced the 
runtime of the optimization. This approach was further 
enhanced [59.16] to a two-step approach, in which the 
single-objectively optimized solution is used as the initial 
individual for a multiobjective optimization. As an 
additional objective, the smoothness of the NURBS was 
considered. This objective was also considered by Jared 
et al. [59.12] in their GA-based optimization of NURBS 
in CATIA. 
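
The idea of a sampling-based distance indicator can be sketched as follows. Instead of projecting every scan point onto the surface, the surface is evaluated once on a grid and each scan point is compared with its nearest sample. This is only an illustration under simplifying assumptions: a flat patch stands in for a NURBS evaluation, and the function names and the grid resolution are hypothetical rather than those of the cited approach.

```python
import math


def sample_surface(surface, n):
    """Evaluate surface(u, v) on an n-by-n grid over the unit square."""
    return [
        surface(i / (n - 1), j / (n - 1)) for i in range(n) for j in range(n)
    ]


def sampled_distance(scan_points, samples):
    """Mean distance from each scan point to its nearest surface sample."""
    total = 0.0
    for p in scan_points:
        total += min(math.dist(p, s) for s in samples)
    return total / len(scan_points)


def flat_patch(u, v):
    """Stand-in for a NURBS evaluation: the plane z = 0 over the unit square."""
    return (u, v, 0.0)


samples = sample_surface(flat_patch, 11)
scan = [(0.5, 0.5, 0.1), (0.2, 0.8, 0.0)]   # hypothetical scan points
error = sampled_distance(scan, samples)
```

An evolutionary algorithm would vary the control points behind `surface` and use `error` as the fitness value. The grid resolution trades accuracy of the indicator against evaluation cost.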

In addition, Weinert et al. [59.33] combined 
NURBS with constructive solid geometries [59.34] 
in a hybrid evolutionary algorithm/genetic program- 
ming approach. By these means, the construc- 
tional logic behind the workpiece could also be 
approximated. 


59.3.2 Optimization of NC Paths 


The five-axis milling process offers the possibilities 
to tilt the milling tool and, thus, to use shorter and 
therewith stiffer tools. This allows complex free-form 
surfaces to be machined in one workpiece clamping, 
and the engagement conditions to be adapted [59.35]. 
An improvement of the machining results and a reduc- 
tion of the machining time can be achieved. However, in 
contrast to the three-axis process, the generation of the 
NC paths particularly for the machining of free-form 
surfaces is much more complex [59.6]. 

Weinert and Stautner [59.36] presented an algo- 
rithm for converting three- into five-axis milling paths 
in which the position of the tool tip is retained from 
the three-axis NC program. An optimization approach 
based on an evolution strategy was used to improve 
the tool movement [59.37]. To accomplish this, they de- 
veloped a fast simulation system of the five-axis milling 
process based on a discrete dexel model of the work- 
piece (Sect. 59.4) [59.38]. 

The NC paths generated for a five-axis milling pro- 
cess are often not smooth enough since the kinematic 
behavior of the specific milling machine is not taken 
into account. Zabel et al. developed a simulation ap- 
proach which is placed in the process chain between 
the CAM system and the real milling process [59.39]. 
The five-axis tool movement is optimized taking the 
tool axis configuration and the dynamic behavior of 
the milling machine into account. For this purpose, 
methods of evolutionary computation and wavelet theory 
were combined [59.35]. In 2007, Mehnen et al. 
integrated a multiobjective optimization algorithm into 
this simulation system which combined the variation 
of a modern single-objective approach with the se- 
lection mechanism of a classical multiobjective opti- 
mization algorithm in order to optimize the tool move- 
ment [59.40]. 

One challenging task during the optimization of 
the five-axis milling process is the avoidance of col- 
lisions between the milling tool and the workpiece. 
Kersting and Zabel [59.6] developed an efficient sim- 
ulation approach, which maps the high-dimensional 
restriction area on a two-dimensional matrix struc- 
ture. They showed that the use of a multipopula- 
tion multiobjective evolutionary algorithm in the re- 
striction-free area improved the corresponding Pareto 
fronts [59.41]. 


59.4 Modeling and Simulation of the Machining Process 


The optimization of real-world applications using CI- 
based or classical optimization approaches requires that 
a performance value or vector can be obtained for all 
possible settings of the input parameters, whereby the 
performance values are usually calculated based on 
measurements during or after the actual process. In or- 
der to achieve a near-optimal result, however, far more 
than 100 different parameter vectors have to be evalu- 
ated — even for low-dimensional problems. This amount 
of real experiments is often impossible due to the costs 
related to them. As a possible solution, the use of em- 
pirical or physical (simulation) models can significantly 
reduce the number of required experiments since most 
of the evaluations can be performed on the model. For 
both kinds of approaches, CI techniques have already 
been successfully used. Some examples are presented 
in the following subsections. 


59.4.1 Empirical Modeling 


For the use of empirical models, real or simulated ex- 
periments are still required in order to build up a data 
base for the training of the model. In contrast to the 
direct optimization of the process, however, these ex- 
periments are performed as a block of moderate size in 
the beginning of the optimization. Afterward, the model 
allows new parameter settings to be predicted based on 
the information obtained from training data. The deter- 
mination of near-optimal solutions can be performed on 
the model. 
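
The workflow described above (a block of experiments first, then optimization on the model) can be sketched with a deliberately simple empirical model. The example fits a quadratic polynomial by least squares and reads the predicted optimum from the fitted coefficients. The feed values and roughness responses are synthetic and purely illustrative. In practice, an NN or a Gaussian process would often replace the polynomial.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a + b*x + c*x**2 via the normal equations."""
    rows = [[1.0, x, x * x] for x in xs]
    # Normal equations (A^T A) coef = A^T y for the basis [1, x, x^2].
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    m = [ata[i] + [aty[i]] for i in range(3)]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    # Back substitution.
    coef = [0.0, 0.0, 0.0]
    for i in reversed(range(3)):
        known = sum(m[i][j] * coef[j] for j in range(i + 1, 3))
        coef[i] = (m[i][3] - known) / m[i][i]
    return coef


# Block of (synthetic) experiments: feed value -> measured roughness.
feeds = [0.1, 0.2, 0.3, 0.4, 0.5]
roughness = [1.36, 1.04, 1.04, 1.36, 2.0]

a, b, c = fit_quadratic(feeds, roughness)
# For c > 0 the model predicts its minimum at the vertex of the parabola.
predicted_best_feed = -b / (2.0 * c)
```

The predicted optimum should be confirmed by a real experiment, since the model is only trustworthy within the range covered by the training data.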


The number of available empirical models is very 
large [59.42]. Nevertheless, NNs have often shown their 
capability to empirically model responses from machining 
processes. For instance, the material removal 
rate of an abrasive jet drilling process was successfully 
predicted by using an NN trained with error 
backpropagation [59.43]. As input parameters, varying gas 
pressure, nozzle inside diameter, abrasive flow rate, size of 
the medium particle, and standoff distance were considered. 
Similarly, the ablating depth obtained for 
specific values of the peak power, pulsing frequency, 
and overlapping in a laser drilling process could be 
predicted using NNs [59.44]. Casalino et al. [59.45] 
showed that NNs achieve higher prediction accuracies 
than regression techniques in predicting surface roughness 
and resultant forces for varying cutting speed, 
feed rate, and radial depth in milling. In the same vein, 
NNs were used for the prediction of the specific cutting 
constants resulting from different cutting speeds, 
feeds, inclination angles α and β, cutting depths, and 
cutting widths [59.46]. With respect to tool wear, the 
wheel life of a cylindrical grinding wheel was modeled 
using a feedforward backpropagation NN. A direct pre- 
diction of the tool wear was also accomplished using 
NN [59.47, 48]. Moreover, the thermal expansion of the 
Y-axis ball screw was predicted based on temperature 
measurements at different points of the machine struc- 
ture [59.49]. 

In addition, CI-based techniques can also indirectly 
be used for empirical modeling. As soon as complex 
empirical models, such as Gaussian processes, support 
vector machines, or other kernel machines, are used, the 
determination of the optimal model parameters is a separate 
nonlinear optimization problem. Evolutionary 
algorithms, in particular the covariance matrix adaptation 
evolution strategy (CMA-ES) [59.50], have proven 
suitable for solving these problems [59.51, 52]. 


59.4.2 Physical Modeling for Simulation 


In cases where sufficient knowledge about the physi- 
cal laws of the process is available, simulation models 
based on equations representing these physical laws 
are likely to be superior to the very general formulations 
of the empirical models. Nevertheless, these 
models also have parameters that are related to the 
properties of the material, tool, and machine. Since these 
parameters often cannot be measured, their values 
are usually set by minimizing the error between the 
predictions of the simulation and a training set of 
observations from real-world experiments. As a consequence, 
EC is a valuable tool for calibrating simulation models, 
which was shown to be superior to classical data fitting 
tools [59.53]. 
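
Such a calibration can be sketched with a very small evolutionary algorithm. The example below uses a (1+1) evolution strategy with a success-based step-size rule to fit two parameters of a hypothetical power-law force model to synthetic observations. The model, the data, and all settings are assumptions made for illustration. They do not reproduce the cited study.

```python
import random


def force_model(h, params):
    """Hypothetical power-law model F = k * h**(1 - m) (an assumption)."""
    k, m = params
    return k * h ** (1.0 - m)


def calibration_error(params, observations):
    """Sum of squared deviations between model predictions and observations."""
    try:
        return sum((force_model(h, params) - f) ** 2 for h, f in observations)
    except OverflowError:
        # Extreme mutants may overflow; treat them as infinitely bad.
        return float("inf")


def one_plus_one_es(loss, start, sigma=0.1, iterations=2000, seed=1):
    """(1+1)-ES: one parent, one Gaussian mutant, elitist selection."""
    rng = random.Random(seed)
    parent = list(start)
    parent_loss = loss(parent)
    for _ in range(iterations):
        child = [p + sigma * rng.gauss(0.0, 1.0) for p in parent]
        child_loss = loss(child)
        if child_loss <= parent_loss:
            parent, parent_loss = child, child_loss
            sigma *= 1.5             # success: enlarge the step size
        else:
            sigma *= 1.5 ** -0.25    # failure: shrink it (about 1/5 successes)
    return parent, parent_loss


# Synthetic "measurements" generated from known parameters (k=2.0, m=0.3).
observations = [
    (h, force_model(h, (2.0, 0.3))) for h in (0.05, 0.1, 0.2, 0.3, 0.4)
]
start = [1.0, 0.0]
initial_loss = calibration_error(start, observations)
best_params, best_loss = one_plus_one_es(
    lambda p: calibration_error(p, observations), start
)
```

Because selection is elitist, the calibration error never increases from one iteration to the next. For real calibration tasks, a population-based strategy such as CMA-ES is usually the more robust choice.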

In an exemplary application, the dynamic behav- 
ior of manufacturing systems was characterized by its 
frequency response function. This function can be modeled 
by a superposition of decoupled damped harmonic 
oscillators, whereby each oscillator has three parameters 
(mass, natural frequency, and damping) [59.54]. In 
order to minimize the deviation between the measured 
frequency response function and that of the superposed 
oscillators, an interactive approach based on evolutionary 
algorithms was successfully implemented [59.54]. 
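
The oscillator model can be written down compactly. The sketch below assumes viscous damping expressed as a damping ratio and angular frequencies in rad/s, so that each oscillator contributes the receptance 1 / (m (ω_n² − ω² + 2iζω_nω)). The function names and the squared-magnitude error measure are illustrative choices, not necessarily those of the cited approach.

```python
def frf(omega, oscillators):
    """Receptance at angular frequency omega.

    oscillators is a list of (mass, natural_frequency, damping_ratio) tuples.
    The contributions of the decoupled oscillators are simply added.
    """
    return sum(
        1.0 / (m * complex(w_n**2 - omega**2, 2.0 * zeta * w_n * omega))
        for m, w_n, zeta in oscillators
    )


def frf_deviation(oscillators, measured):
    """Sum of squared magnitude errors against measured (omega, H) pairs."""
    return sum(abs(frf(w, oscillators) - h) ** 2 for w, h in measured)


# Hypothetical single-mode example: m = 2 kg, w_n = 10 rad/s, zeta = 0.05.
mode = [(2.0, 10.0, 0.05)]
static_compliance = frf(0.0, mode)       # equals 1 / (m * w_n**2)
resonance_amplitude = abs(frf(10.0, mode))
```

An evolutionary algorithm would then search the masses, natural frequencies, and damping ratios that minimize `frf_deviation` for a measured response.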

An open issue in the simulation of machining pro- 
cesses is the modeling of the extremely high strain 
rates which can only rarely be covered by classical 
material models and tensile tests. As a possible solu- 
tion, EC can be used as a submodule of a simulation 
in order to predict the deformation and flow characteristics 
for high strain rates. For instance, Weinert 
et al. used symbolic regression by means of a genetic 
programming system to evolve mathematical formulae 
that describe the trajectories of single particles of 
steel based on recordings of a high-speed camera during 
the turning process [59.55, 56]. Teti et al. [59.57] 
employed NNs to reconstruct the stress-strain curve of 
the workpiece material from experimental data of tensile 
tests. They found that the trained NN is capable 
of predicting workpiece material properties in a wide 
range of temperature and strain rate values. A hybrid 
simulation model based on physical equations and the 
empirical stress-strain prediction was finally proposed. 
Two recent overviews of hybrid models for simulation 
which also incorporate CI techniques were provided by 
Jawahir et al. [59.58, 59]. 


59.5 Optimization of the Process Parameters 


In this section, possible applications of EC methods for 
the optimization of the actual process parameters are 
discussed. Since a recent survey book on the model-based 
optimization of process parameters exists [59.1], 
only a short summary of possible applications is pro- 
vided. In contrast to this survey, the following presen- 
tation does not distinguish between different processes, 
as the aspects related to the use of EC are independent 
of the actual process, e.g., milling, turning, or grinding. 

As already discussed in the previous section, it 
is mandatory to approximate the process quality indi- 
cators by means of analytical, empirical, or physical 
models. In the literature, no direct application of EC 
optimization techniques to machining processes has 
been reported so far. Instead, polynomial or process-related 
equations were usually fitted to experimental 
data [59.60–78]. Neural networks [59.63, 79–83], other 
empirical models [59.51, 62, 84], and simulation 
models [59.85, 86] were also popular to accomplish this 
task. 

For the actual optimization, two important deci- 
sions on the formulation of the problem have to be 
taken in order to choose the EC method. These deci- 
sions are concerned with the representation of the input 
parameters and the objectives. In most cases, continu- 
ously defined input parameters, such as feed and cutting 
speed, are to be optimized. This relates to techniques 
such as evolution strategies, particle swarm optimiza- 
tion, and real-valued genetic algorithms (GAs). If also 
discrete parameters, such as the cooling concept or 
tool material, are considered, special evolution 
strategies [59.65, 87] or binary GAs may be better suited. 
With respect to the objectives, it has to be decided 
whether a single optimal solution or a set of tradeoffs 
is desired. In the former case, almost all EC tech- 
niques can directly be used. Due to the complexity of 
production engineering problems, however, a suitable 
scalarization of the different objectives has to be found 
in order to achieve reasonable results. In the latter case 
of searching for an approximation of the trade-off struc- 
ture, it is important that the algorithm is capable of 
coping with multiple objectives which have to be con- 
sidered in parallel [59.51, 72, 74, 78, 79, 84, 86]. 
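
When a set of tradeoffs is desired, the basic operation is a comparison by Pareto dominance. The following sketch filters the nondominated solutions from a list of objective vectors. All objectives are assumed to be minimized, and the example values (tool wear and negated material removal rate) are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# (tool wear, -material removal rate): maximization is turned into
# minimization by negating the second objective.
candidates = [(1.0, -5.0), (2.0, -8.0), (2.5, -7.0), (3.0, -9.0), (1.5, -4.0)]
front = pareto_front(candidates)
```

Multiobjective evolutionary algorithms build their selection on such comparisons and add a diversity measure so that the approximation covers the whole tradeoff structure.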

In the literature, the use of continuous input vari- 
ables and single-objective formulations is established. 
The most popular EC methods are particle swarm op- 
timization (PSO) [59.63, 68, 75, 76, 81-83, 85, 88] and 
standard GA or evolutionary algorithm (EA) [59.60, 
62, 64, 67, 69, 77, 80]. The use of specifically designed 
heuristics [59.71, 75, 89] is rather uncommon. Never- 
theless, the formulation of the problem and the design 
of the algorithm should aim at incorporating as much 
knowledge as possible into the optimization [59.16]. 

Unfortunately, the generality of CI-based techniques 
often results in problem formulations which are 
not fully elaborated. An important factor often 
neglected when optimizing production engineering 
problems is the uncertainty about the external process 
variables, e.g., properties of the tool or material. 
Although modern algorithms are capable of incorporating 
them into the optimization [59.90], only a few 
applications actually take these uncertainties into 
account [59.70]. More specifically, two sources of 
uncertainty can be considered [59.91]: perturbations in 
the input variables, e.g., due to online control, and 
environmental uncertainties, such as outdoor temperature, 
humidity, and the already mentioned external 
process variables. A detailed overview of such factors 
can be found in the literature [59.92]. A comprehensive 
survey of possible problem formulations and 
respective optimization approaches was presented by 
Beyer and Sendhoff [59.91]. In production-engineering 
applications, however, classical statistical methods 
are usually used to cope with these problems. The 
potential of CI-based techniques has not yet been 
exploited. 


59.6 Process Monitoring 


The analysis of different process variables, such as 
the cutting forces, acoustic emission, or temperatures, 
allows conclusions about the process-dependent 
state of the machining process and its 
components (tools, machines, workpieces, etc.) to be 
drawn and provides the possibility for an adaptive 
process control [59.93]. The idea of process monitoring 
is to measure, visualize, and analyze the values 
of these variables during the machining process. Teti 
et al. [59.93] gave an extensive overview of advanced 
monitoring of machining operations describing sensor 
systems for machining, signal processing, monitoring 
scopes, and the decision-making support systems. 
In order to evaluate the measured values, cognitive 
computing methods (for example genetic algorithms, 
fuzzy logic, or NNs) can be used. In contrast to 
the rule-based fuzzy logic approach, NNs do not store 
the knowledge in an explicit form. A survey of the 
successful applications of these techniques for the 
advanced monitoring of machining operations was 
provided by Teti et al. [59.93]. It is thus omitted in this 
section. 


59.7 Visualization 


In the field of production engineering, the complex 
optimization problems are often characterized by multiple 
objectives and restrictions. Additionally, the decision 
space can be high dimensional, as for example in the 
case of optimizing NC paths (Sect. 59.3.2) [59.6]. In 
order to analyze the optimization problems and the 
applied optimization approach, an intuitive visualization 
of the data resulting from the evolutionary process is 
advisable [59.94]. For this purpose, Pohlheim [59.95] 
reviewed several visualization techniques in order to 
obtain a better understanding of the optimization 
process of real-world problems. He recommended the 
use of three diagrams in order to analyze the 
optimization algorithm: a convergence diagram, a 
visualization of the change of the best individual during the 
optimization, and a diagram of the objective 
values of all individuals in the population over all 
generations. 

Müller et al. discussed techniques for an intuitive 
visualization and interactive analysis of Pareto sets applied 
to production engineering systems [59.94]. They 
analyzed different visualization and analysis methods 
in order to gain insight into both the optimization prob- 
lem and the optimization algorithm, and to support 
an intuitive decision-making process. For this purpose, 
they presented a simultaneous visualization of the decision 
and the objective space. An interactive navigation 
through the solution sets supports the user in detecting 
specific process characteristics [59.94]. This also helps to 
redesign the objective formulation in cases where the 
optimization results are not in agreement with the ac- 
tual preferences of the decision maker. 

In order to support the trade-off analysis in multi- 
ple dimensions, Obayashi and Sasaki [59.96] presented 
a visualization approach based on self-organizing maps 
(SOMs). The idea is to map from the high-dimen- 


59.8 Summary and Outlook 


This chapter focused on applications of CI in the op- 
timization of machining problems. For this purpose, 
the whole process chain — from the design of a ma- 
chine, tool, or workpiece, as well as the corresponding 
optimization of process parameters, to the process mon- 
itoring and subsequent analysis of the results — was 
taken into account. Different modeling and simula- 
tion techniques, which are necessary to optimize real- 
world problems, were discussed. Successful examples 
in the field of production engineering were compiled to 
present the applicability of the CI methods. In conclu- 
sion, evolutionary and genetic algorithms are general 
and powerful solvers for nonlinear optimization tasks,

Modern air vehicle design has been increasingly driven
by environmental as well as operational constraints. En- 
vironmental concerns, including emissions and noise, 
are gaining increasing importance in the design and 
operations of commercial aircraft. Taking into account 
the current prognoses for the growth in air traffic, the 
above-mentioned challenges become even more signif- 
icant [60.1—4]. In this context, the development and 
assessment of new theoretical methodologies represents 
a cornerstone for reducing the experimental load, ex- 
ploring trade-offs, and proposing alternatives along the 
design path. The fidelity of such methods is essential
to reproduce real-life phenomena with a significant degree
of accuracy and to take them into account from
the very beginning of the design process. Due to the in- 
trinsic complexity of aircraft design, the design space 
is often huge and difficult to explore fully, so that fast 
semi-empirical tools and rules [60.5—7], derived from 
classical configuration data, have been traditionally ap- 
plied. However, they exhibit a severe lack of accuracy 
when designing novel and unconventional concepts. 
Therefore, highly accurate analysis methods have been 
continuously introduced both in geometric representation
and physical modeling, but the main drawback
is that they are computationally expensive. For example,
the solution of the Navier-Stokes equations around
complex aerodynamic configurations requires a huge
amount of computational resources even on modern
state-of-the-art computing platforms. This turns out to be
an even bigger issue when hundreds or thousands of 
analysis evaluations, like in parametric or optimization 
studies, have to be performed. In order to speed up the 
computation while keeping a high level of fidelity, the 
scientific community is increasingly focusing on sur- 
rogate methodologies like meta-models, multi-fidelity 
models, or reduced-order models. These can provide 
a compact, accurate, and computationally efficient rep- 
resentation of aircraft design performance. 

The present chapter will give details and references 
about the adoption of surrogate models and, in particular,
reduced-order models within the aerodynamic
shape optimization context. In Sect. 60.1, the aerody- 
namic design problem and its approximated version 
will be introduced. Then, an overview of various surro- 
gate models and surrogate-based optimization methods 
will be given. In Sects. 60.3 and 60.4, the concept of 
model order reduction will be recalled and the per- 
formance analysis of reduced-order models based on 
proper orthogonal decomposition (POD) will be discussed.
In Sect. 60.5, some techniques to adaptively
and globally improve the accuracy of POD-based sur- 
rogates will be presented. Finally, Sect. 60.6 will be 
devoted to the analysis and comparison of the perfor- 
mances of various surrogate-based optimization meth- 
ods with respect to the aerodynamic shape design of 
a transonic airfoil. 


60.1 The Aerodynamic Design Problem 


A broad class of aircraft design applications can be 
numerically modeled with the minimization of a func- 
tion f which depends on two sets of variables: the 
design variables w, which the designer can directly
control, and the state variables x, which provide the
evolution of the system representing the underlying 
physics. The design problem can be formulated as the 
non-linear programming problem 


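$$
\begin{aligned}
\min_{w}\quad & f(w, x) \\
\text{subject to}\quad & r(w, x) = 0, \quad h(w, x) = 0, \\
& g(w, x) \le 0, \\
& w_L \le w \le w_U .
\end{aligned}
\tag{60.1}
$$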


f is the objective function which the designer wants 
to minimize to improve performance. In aircraft de- 
sign, typical objective functions are weight, noise, drag, 
aerodynamic efficiency, or a combination thereof.
r(w, x) is the set of state equations, which links the
design variables to the state variables and usually
represents the governing laws of physics. In aerody- 
namic design, the state equations are modeled through 
computational fluid dynamics, e.g., the Navier-Stokes 
equations, which relate scalar or vector field (state) vari- 
ables, like pressure or velocity, to the design variable 
vector. In a shape optimization problem, the design vector
is made dependent on the aircraft component shape
by means of a parameterization approach. The vectors
g(w, x) and h(w, x) are filled, respectively, with inequality
and equality constraint functions, which must
be satisfied for a design candidate to be considered fea- 
sible. Typical constraint functions in aircraft design are 
related to the generation of a minimum lift level to 
balance the weight or a threshold pitching moment
coefficient to allow for trim. $w_L$ and $w_U$ are the lower
and upper bounds of the design variables and thus 
specify the range of allowable values for the design vec- 
tor w. 
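
As a purely illustrative sketch of formulation (60.1), the toy problem below uses one design variable and one state variable. The state equation, objective, inequality constraint, and bounds are invented for illustration and carry no aerodynamic meaning; the only point is that each evaluation of f first requires solving r(w, x) = 0 for the state x.

```python
import math


def solve_state(w, tol=1e-12, max_iter=200):
    """Solve the toy state equation r(w, x) = x - 0.5*cos(x) - w = 0 for x.

    The fixed-point iteration x <- w + 0.5*cos(x) stands in for an
    expensive solver; it converges because the map is a contraction.
    """
    x = w
    for _ in range(max_iter):
        x_new = w + 0.5 * math.cos(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x


def evaluate(w, w_lo=0.0, w_hi=2.0):
    """Return (f, feasible) for a candidate design w of the toy problem."""
    if not (w_lo <= w <= w_hi):      # bound constraints w_L <= w <= w_U
        return float("inf"), False
    x = solve_state(w)               # enforce r(w, x) = 0
    f = (x - 1.0) ** 2 + 0.1 * w     # toy objective f(w, x)
    g = 0.5 - x                      # toy inequality constraint g(w, x) <= 0
    return f, g <= 0.0


# Brute-force search over the bounded design range (affordable only for a toy).
candidates = [i / 100 for i in range(201)]
results = [(w, *evaluate(w)) for w in candidates]
f_best, w_best = min((f, w) for w, f, ok in results if ok)
```

Even in this toy setting, the cost of the search is the number of candidates times the cost of one state solve, which is the product that surrogate modeling tries to reduce.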


60.1.1 Problem Approximation 


The computational time required to solve this prob- 
lem is basically affected by two parameters: the num- 
ber of function evaluations required to minimize the 
objective function and the cost of a single evalua- 
tion. Given a vector w*, the latter is dominated by 
the computational effort needed to solve the state 
equations 


r(w*,x) =0. 


The adoption of a surrogate model reduces the cost
per objective function evaluation. A surrogate model
replaces the expensive objective f and constraint
functions g, h with less expensive, lower-fidelity
models $\hat{f}$ and $\hat{g}$, $\hat{h}$. Concerning reduced-order
modeling, it can be observed that the dimensionality
of the optimization problem is twofold: the state vector
dimension and the design vector dimension. As the first is
usually much larger than the second, model reduction
can be applied to make explicit the dependency
of x on w and to solve for the state variables as functions of
the design ones. In other words, unlike response surfaces
and meta-models, an approximation $\hat{x}$ of the state
variables is available thanks to the model order reduction.
As a consequence, the reduced-order approximate
form of the optimization problem (60.1) can be cast as

$$
\begin{aligned}
\min_{w}\quad & \hat{f}(w) \\
\text{subject to}\quad & \hat{h}(w) = 0, \quad \hat{g}(w) \le 0, \\
& w_L \le w \le w_U ,
\end{aligned}
\tag{60.2}
$$

where the dependence on the state variables has been
dropped and the state vector has an explicit approximate
relation with the design vector: $x = \hat{x}(w)$.


60.2 Literature Review of Surrogate-Based Optimization 


This section presents a survey of the most relevant
surrogate-based optimization concepts. The topics have 
been widely discussed in the recent past, thanks to 
their innovative character and broad application areas. 
The introduction of surrogate models as fitness approx- 
imation within an evolutionary optimization system 
mitigates the demand for large computational resources 
associated with such search algorithms, allowing us to
find a proper balance between the complete exploration 
of huge design spaces and limited cost. To this aim, 
reduced-order modeling through POD is a step for- 
ward, as a modal decomposition of an ensemble of 
functions, derived from numerical simulations, is per- 
formed to extract the most relevant patterns in the data 
set. Hence, compared to standard, interpolating meta- 
models, which are usually trained on an integral func- 
tion representing the objective, reduced-order models 
should assure a deeper insight into the phenomena 
modeled. 

Surrogate-based optimization (SBO) has been introduced
to reduce the number of function evaluations
in many engineering optimization problems. In common
aircraft design practice, surrogates can be used as quick
evaluators in several tasks: parametric analyses over the
design space, optimization and control, and uncertainty
quantification. A special challenge is their use in
global optimization, as state-of-the-art methods
often require more function evaluations
than can comfortably be afforded. A well-established
approach consists in fitting some kind of response function
to basic data obtained by evaluating the objectives
and constraints at a few points. The resulting surfaces,
which are cheap to evaluate, can provide fast answers in terms
of trade-off analysis and optimization, as well as an
intuitive sketch of the behavior by means of simple
visualization. The basic process consists of the following steps:


• Sampling the design space: once the design variables
  have been chosen, a sampling plan is defined and some
  initial sample designs are analyzed with an accurate
  solver.
• Surrogate model selection and construction:
  a surrogate model type is selected and used to build
  a meta-model of the underlying problem.
• Model validation: the model is checked according to some
  statistical metrics and, if it is not accurate enough, a search
  is carried out using the model to identify new design points
  for analysis.
• Model updating: the new results are added to
  those already available and a new meta-model is built
  (repeating the last three steps).
• Optimization: the refined surrogate is used to provide
  objective/constraint functions.
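
The steps above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the procedure of any cited reference: `true_function` is a made-up one-dimensional stand-in for an accurate solver, the surrogate is a Gaussian radial basis function fit, the search is a plain grid scan of the surrogate, and the stopping rule (`min_gap`) is an arbitrary choice.

```python
import numpy as np


def true_function(w):
    """Toy stand-in for an expensive analysis (an assumption)."""
    return (w - 0.6) ** 2 + 0.1 * np.sin(12.0 * w)


def fit_rbf(w_s, f_s, width=0.2):
    """Fit a Gaussian RBF surrogate through the samples (w_s, f_s)."""
    kernel = np.exp(-((w_s[:, None] - w_s[None, :]) / width) ** 2)
    # lstsq is used instead of solve to tolerate near-duplicate samples.
    weights = np.linalg.lstsq(kernel, f_s, rcond=None)[0]

    def predict(w):
        w = np.atleast_1d(w)
        return np.exp(-((w[:, None] - w_s[None, :]) / width) ** 2) @ weights

    return predict


def sbo(budget=12, n_init=5, min_gap=0.01):
    w_s = np.linspace(0.0, 1.0, n_init)          # sampling plan
    f_s = true_function(w_s)                     # accurate analyses
    grid = np.linspace(0.0, 1.0, 401)            # candidate designs
    while len(w_s) < budget:
        model = fit_rbf(w_s, f_s)                # surrogate construction
        w_new = grid[np.argmin(model(grid))]     # search on the surrogate
        if np.abs(w_s - w_new).min() < min_gap:
            break                                # nothing new to learn here
        w_s = np.append(w_s, w_new)              # model updating
        f_s = np.append(f_s, true_function(w_new))
    best = int(np.argmin(f_s))                   # best TRUE evaluation so far
    return w_s[best], f_s[best], len(w_s)


w_best, f_best, n_evals = sbo()
```

Note that the returned optimum is always a point at which the true function was evaluated; the surrogate only proposes where to evaluate next.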

As SBO covers so many topics, the literature on 
the subject is huge. Many ideas have been proposed in
the last 20 years, and they can be classified by design space
dimension, surrogate method, search algorithm, updating
algorithm, and application area. Hence, an exhaustive
survey of all the possible ideas for each topic
and all their possible combinations would go beyond 
the scope of the present research. Generally speaking, 
surrogate models can be roughly divided into three 
classes: data fit surrogates, multi-fidelity models, and 
reduced-order models. Data fitting models rely on the 
approximation of data (response values, gradients, and 
Hessians) generated from the high-fidelity model. In 
order to give a global behavior to surrogate methods, 
the whole design space must be sampled in advance 
by using design of experiments. Global approxima- 
tions, often referred to as response surface methods, 
can be obtained with polynomial regression [60.8], 
Gaussian processes, Kriging interpolation [60.9], radial
basis function networks [60.10], multivariate adaptive
regression splines [60.11], and artificial neural
networks [60.12].

A second class of surrogates is the hierarchical one
(also called multi-fidelity or variable fidelity). Unlike 
the data fit surrogates, they do not need to be trained 
on a sampling dataset, but they rely on a lower fi- 
delity approximation which, however, is still inspired 
by the physical behavior of the system. Multi-fidelity 
models are classified according to the way they achieve
the fidelity reduction: examples in aerodynamics are
coarser mesh discretization, partially converged solutions
[60.13], and model fidelity reduction [60.14-16]
(e.g., using the Euler model instead of the Navier-Stokes
equations by neglecting the effects of fluid
viscosity and heat transfer). The name multi-fidelity 
usually refers to the capability of mixing and exploiting 
both high-fidelity and lower-fidelity models in an effi- 
cient way so as to keep the fidelity of the former only 
when it is needed and to take advantage of the higher 
speed of the latter otherwise. 

A third class is represented by reduced-order mod- 
els. A reduced-order model (ROM) is mathematically 
derived from a high-fidelity model using a projec- 
tion technique. It consists in computing a set of basis 
functions (e.g., eigenmodes, left singular vectors) upon 
which the available dataset (ensemble) is projected to 
compute the unknown model parameters. The model re- 
duction is obtained by capturing the principal dynamics 
of the system and neglecting those that are less significant from
a physical point of view. Hence, similarly to data fit
surrogates, reduced-order models require the a priori
solution of the expensive high-fidelity model. The ad- 
vantage of reduced-order models with respect to data 
fits is that the most significant features of the flow field 
can be derived by approximation, thus offering the po- 
tential to keep more physics within the surrogate. The 
proper orthogonal decomposition or principal compo- 
nent analysis (PCA) is an elegant and powerful data- 
reduction method for non-linear physical systems. Its 
application as a surrogate to the aerodynamic optimiza- 
tion of aircraft components is the core of the present 
chapter. 

Hereinafter, a more in-depth look is given at the various
methods of constructing a surrogate model and,
in particular, at surrogate-assisted optimization.
Jones et al. [60.17] were among the first to propose
a response surface methodology based on modeling 
the objective and constraint functions with stochastic 
processes (Kriging). The so-called design and analy- 
sis of computer experiments (DACE) stochastic process 
model was built as a sum of regression terms and 
normally distributed error terms. The main conceptual
assumption was that the lack of fit due only to
the regression terms can be considered as entirely due
to modeling error, not measurement error or noise, 
because the training data are derived from a determin- 
istic simulation. Hence, by assuming that the errors 
at different points in the design space are not inde- 
pendent and the correlation between them is related 
to the distance between the computed points, the au- 
thors came up with an interpolating surrogate model 
that is able to provide not only the prediction of ob- 
jectives/constraints at a desired sample point, but also 
an estimation of the approximation error. After the 
construction of such a surrogate model, the latter powerful
property is exploited to perform an efficient global
optimization (EGO), which can be considered the
progenitor of a long and still-growing chain of
SBO methods. Indeed, the authors found a proper balance between
the need to exploit the approximation surface (by
sampling where it is minimized) with the need to im- 
prove the approximation (by sampling where prediction 
error may be high). This was done by introducing the 
expected improvement (EI) concept, already proposed 
by Schonlau et al. [60.18], which is an auxiliary func- 
tion to be maximized instead of the original objective. 
The EI function is designed in order to provide a proper 
balance between exploration and exploitation. 
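
For reference, a minimal sketch of the expected improvement in its commonly used closed form for minimization is given below. It assumes a Gaussian predictor with mean `y_hat` and standard error `s` at the candidate point and a best sampled value `f_min`; the function and argument names are illustrative only, and returning zero when `s` is zero is a convention adopted here for already sampled points.

```python
import math


def expected_improvement(y_hat, s, f_min):
    """Expected improvement over f_min for a Gaussian prediction N(y_hat, s^2).

    EI = (f_min - y_hat) * Phi(z) + s * phi(z), with z = (f_min - y_hat) / s,
    where Phi and phi are the standard normal CDF and PDF.
    """
    if s <= 0.0:
        return 0.0  # convention: no uncertainty, no expected improvement
    z = (f_min - y_hat) / s
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (f_min - y_hat) * cdf + s * pdf


# Same prediction, different uncertainty: the more uncertain point scores higher.
ei_uncertain = expected_improvement(y_hat=1.0, s=2.0, f_min=0.5)
ei_certain = expected_improvement(y_hat=1.0, s=0.1, f_min=0.5)
```

The first term rewards points predicted to be better than the current best (exploitation), while the second rewards points with large predicted error (exploration).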

In a further work, Jones [60.19] proposed a tax- 
onomy of global SBO methods. Seven methods were 
identified and classified according to whether they were 
interpolating (cubic splines, thin-plate splines, mul- 
tiquadrics, Kriging) or not (quadratic polynomials), 
whether they provided statistical information (Kriging) 
or not (splines), and whether the method for select- 
ing search points (updating the model by adding new 
sample points) was two stage (probability/expected 
improvement) or one stage (goal-seeking, credibility 
function). 

Gutmann [60.10] reported excellent numerical re- 
sults for a spline-based implementation of Method 7 
and proved the convergence of the method. Compared 
to previous methods, Method 7 required a high number 
of true function evaluations to find the global opti- 
mum, but, as Jones wrote, this is the price we pay for 
the additional robustness. An overview of SBO tech- 
niques was also presented by Queipo et al. [60.20] and 
Simpson et al. [60.21]. They covered some of the most 
popular methods in design space sampling, surrogate 
model construction, model selection and validation, 
sensitivity analysis, and surrogate-based optimization. 
Forrester and Keane [60.22] recently proposed a re- 
view of some advances in surrogate-based optimization. 
An important lesson learned is that only calling the
true function can confirm the results coming from the
surrogate model. Indeed, the path towards the global 
optimum is made of iterative steps where, even exploit- 
ing some surrogate model, only the best results coming 
from the true function evaluations are taken as optimal 
or sub-optimal design. The true function evaluation also 
has to be invoked to improve the surrogate model. With 
the term in-fill criteria we usually mean some princi- 
ples that allow us to intelligently place new points (in- 
fill points) at which the true function should be called. 
The selection of in-fill points, also referred to as adaptive
sampling or model updating, represents the core of
a surrogate-based optimization method and helps to im- 
prove the surrogate prediction in promising areas of the 
objective space. 

The right choice of the number of points which the 
initial sampling plan would comprise and the ratio be- 
tween initial/in-fill points has been the focus of several 
recent studies. However, it must be emphasized that no 
universal rules exist, as each choice should be care- 
fully evaluated according to the design problem (e.g., 
the number of variables, computational budget, type of 
surrogate). Forrester and Keane [60.22] assumed that 
there is a maximum budget of function evaluations, so 
as to define the number of points as a fraction of this 
budget. They identified three main cases according to 
the aim of the surrogate construction: pure visualization 
and design space comprehension, model exploitation, 
and balanced exploration/exploitation. In the first case, 
the sampling plan should contain all of the budgeted 
points, as no further refinement of the model is fore- 
seen. In the exploitation case, the surrogate can be used 
as the basis for an in-fill criterion, which means some 
computational budget must be saved for adding points 
to improve the model. They also proposed to reserve 
less than one half of the points for the exploitation 
phase, as a small amount of surrogate enhancement is 
possible during the in-fill process. In the third case, that 
is the two-stage balanced exploitation/exploration in-fill
criterion, as also shown by Sóbester et al. [60.23],
they suggested employing one third of the points in the 
initial sample while saving the remaining for the in- 
fill stage. Indeed, such balanced methods rely less on 
the initial prediction, and so fewer points are required. 
Concerning the choice of the surrogate, the authors ob- 
served that it should depend on the problem size, i. e., 
the dimensionality of the design space, the expected 
complexity, the cost of the true analyses, and the in- 
fill strategy to be adopted. 

However, for a given problem, there is no general
rule. The proper choice could emerge after applying various
model selection and validation criteria. The accuracy
of a number of surrogates could be compared by as- 
sessing their ability to predict a validation data set. 
Therefore, part of the true computed data should be 
used for validation purposes only and not for model 
training. This approach can be infeasible when the true 
evaluations are computationally expensive. To over- 
come this issue, Goel et al. [60.24] proposed a weighted 
average of an ensemble of surrogates. For example, 
a better model can be achieved by combining Kriging, 
which might accurately predict the non-linear aspects 
of a function, and polynomials to better capture the 
regression trends. Forrester also underlined that some 
in-fill criteria and certain surrogate models are some- 
what intimately connected. For a surrogate model to 
be considered suitable for a given in-fill criterion, the
mathematical machinery of the surrogate should ex- 
hibit the capability to adapt to unexpected, local non- 
linear behavior of the true function to be mimicked. 
From this point of view, polynomials can be imme- 
diately excluded since a very high order would be 
required to match this capability, implying a high num- 
ber of sampling points. In principle, the convergence 
to a local optimum might be achieved by simply min- 
imizing the surrogate, evaluating the true function at 
the minimum point and updating the model database 
with the new point. Conversely, a global search would 
require a surrogate model able to provide an estimate 
of the error it commits when predicting. Thus, the 
authors suggested the use of Gaussian process-based 
methods like Kriging, although citing the work of Gut- 
mann [60.10] as an example of a one-stage goal seeking 
approach employing various radial basis functions. Finally,
some suitable convergence criteria
to stop the surrogate in-fill process were proposed. In
an exploitation case, i.e., when minimizing the sur- 
rogate prediction, one can rather obviously choose to 
stop when no further significant improvement is de- 
tected. On the other hand, when an exploration method 
is employed, one is interested in obtaining a satisfying 
prediction everywhere, so that one can decide to stop 
the in-filling when some generalization error metrics, 
e.g., cross-validation, fall below a certain threshold. 
When using the probability or expectation of improve- 
ment, a natural choice is to consider the algorithm 
converged when the probability is very low or the ex- 
pected improvement drops below a percentage of the 
range of objective function values observed. However, 
the authors also observed that a discussion on conver- 
gence criteria may be interesting and fruitful, but in 
many real engineering problems we actually stop when
we run out of available time or resources, as dictated
by design cycle scheduling or costs. This is what typically
happens in aerodynamic design, where the high
dimensionality of the design space and expensive computer
simulations often do not allow us to reach the
global optimum of the design problem but suggest that
we consider even a premature, sub-optimal solution as
a converged point.


60.3 POD-Based Surrogates


In this section a review of the mathematical core of
POD is presented. POD is a mathematical procedure
that allows us to perform a modal decomposition of
a large set of multi-dimensional data so as to derive
a dimensionality reduction and describe the original system
with far fewer unknowns. The mathematical development
of POD for fluid flow applications, in particular,
is described in some detail in [60.25]. Here, the main
aspects related to the construction of a reduced-order
model through singular value decomposition are presented,
and mainly the use of this technique for steady-state
problems is addressed.


60.3.1 Model Order Reduction


Physics-based approximation concepts require a deep
understanding of the governing equations and the numerical
methods employed for their solution. The substantial
difference between a reduced-order model and
a data fit model consists in retaining an explicit dependency
between the state variables, related to the governing
equations, and the design parameters. In other words,
reduced-order models operate on the dimensionality of
the discretization of the state equations rather than
on the design space. Thus, such models are partially
independent of a notable increase in the number of
design variables. A reduced-order model, in fact, mimics
the basic structure of the problem and not just
a functional relationship between input and output parameters.
Hence, the main advantage of using reduced-order
models lies in their being mostly insensitive to the
curse of dimensionality.

To illustrate the model order reduction concept,
consider the discrete mathematical model (e.g., the
Navier-Stokes equations) of a physical system written
in the form

$$ R(w, x(w)) = 0, \tag{60.3} $$

where $w \in \mathbb{R}^{p}$ is the vector of design variables and
$x \in \mathbb{R}^{q}$ the discretized vector of state (or field) variables
(velocity, energy, density). Note that x is an implicit vector
function of the design variable vector. Unlike classical
data fit methods (e.g., Kriging, RBF), which work on
local or integral values of the state variables, reduced-order
methods provide an approximation of the
state vector in the form

$$ \hat{x} = c_1 \phi_1 + \cdots + c_M \phi_M = \Phi c , $$

where

$$ \Phi = \{\phi_1, \ldots, \phi_M\} \in \mathbb{R}^{q \times M} $$

is a matrix of known basis vectors and

$$ c = \{c_1, \ldots, c_M\} \in \mathbb{R}^{M} $$

is a vector of unknown coefficients. The underlying 
approximation is that the state vector lies in the sub- 
space spanned by a set of basis vectors. Obviously, 
this is not true for each state vector, but a proper 
choice of an orthonormal basis can lead to the mini- 
mization of the approximation error in a least squares 
sense. This is how a proper orthogonal decomposi- 
tion is derived. Following this approach, the problem 
of representing a state vector with q unknowns can be 
recast into a problem with M unknowns and, as usually
$q \gg M$, it is possible to obtain an approximation
of x very efficiently. The estimation of the vector c
can be obtained with different techniques, classified as
intrusive and non-intrusive. The former introduce the
approximation into the governing equations and find the
coefficients by minimization of the residual norm; the
latter employ data fit techniques trained on a set of
known coefficients.
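
As a brief worked step (a sketch, assuming that the basis vectors are orthonormal, i.e., $\Phi^{T}\Phi = I_M$, and that a full state vector $x$ is available for projection), the least squares coefficients follow in closed form from the normal equations:

```latex
\begin{aligned}
c^{*} &= \arg\min_{c \in \mathbb{R}^{M}} \lVert x - \Phi c \rVert_2^2
\quad\Longrightarrow\quad
\Phi^{T}\Phi \, c^{*} = \Phi^{T} x
\quad\Longrightarrow\quad
c^{*} = \Phi^{T} x , \\
\lVert x - \Phi c^{*} \rVert_2^2 &= \lVert x \rVert_2^2 - \lVert \Phi^{T} x \rVert_2^2 .
\end{aligned}
```

Projections of this kind, computed for the available state solutions, are one way to obtain the known coefficients on which a non-intrusive data fit can be trained; the second line gives the part of the state vector that the M-dimensional subspace cannot represent.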

The basis vectors can be computed starting from 
state solutions of the discrete governing equations 
which correspond to M different values of the parameters
w. As a consequence, the matrix $\Phi$ contains the
basis vectors of the subspace

$$ \Phi = \mathrm{span}\{x(w_1), x(w_2), \ldots, x(w_M)\} \subset \mathbb{R}^{q} . \tag{60.4} $$

For instance, the state solutions $x_i = x(w_i)$ are obtained
by solving the Reynolds-averaged Navier-Stokes
(RANS) equations on M different configurations generated
by applying the parameterization method
to M design vectors $w_i$. The definition of the
M design sites at which to compute the solutions is not
a trivial issue; generally speaking, standard design of
experiments techniques are used to sample the design
space with good coverage properties, but, as will be
discussed in the next sections, this approach may lead to
erroneous results when facing highly multi-modal, highly
non-linear problems. Indeed, the quality of the approxi- 
mation strongly depends on the location of training data 
in the design space. 


60.3.2 POD Theory and Solution 


The construction and training of POD-based surrogate 
models for aerodynamic applications are described in 
detail in [60.26, 27]. The singular value decomposition 
(SVD) solution of the POD basis vectors and coeffi- 
cients is used for steady-state problems. This approach 
is normally preferred to the eigenvalue/eigenvector so- 
lution, as it is faster and easier to implement. POD 
modeling is specifically focused on compressible aero- 
dynamic problems, hence the space domain will be 
represented by the discretized volume occupied by the 
flowing air, and the snapshot vectors will be defined 
from computed flow fields. The column vectors of the 
snapshot matrix (here also referred to as the ensemble 
matrix) contain the volume grid and flow variables as 
computed with a computational fluid dynamics (CFD) 
solver. 

The SVD solution allows us to obtain an optimal
basis in the sense of the maximization of the averaged
projection of the ensemble onto it. Hence, each snapshot
vector can be retrieved as a linear combination of
the POD basis. If a fluid dynamics problem is approximated
with a suitable number of snapshots from which
a rich set of basis vectors is available, the singular values
decay rapidly and a small number of basis
vectors is adequate to reconstruct and approximate the
snapshots, as they preserve the most significant share of
the ensemble energy. In this way, POD provides an
efficient means of capturing the dominant features of
a multi-degree-of-freedom system and representing it to
the desired precision by using the relevant set of modes.
The reduced-order model is derived by projecting the 
CFD model onto a reduced space spanned by only some 
of the proper orthogonal modes or POD eigenfunctions. 
This process realizes a kind of lossy data compression 
of the original ensemble. 
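
The truncation described above can be sketched with NumPy as follows. The snapshot matrix here is synthetic (two spatial patterns combined with a scalar parameter), the 99.9% energy threshold is an arbitrary choice, and the helper name `pod_basis` is hypothetical; a real ensemble would hold CFD flow fields in its columns.

```python
import numpy as np


def pod_basis(snapshots, energy=0.999):
    """Return the leading POD modes of a (q, M) snapshot matrix.

    Modes are the left singular vectors; enough are kept so that the
    retained squared singular values reach the requested energy fraction.
    """
    U, s, _ = np.linalg.svd(snapshots, full_matrices=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    n_modes = int(np.searchsorted(cumulative, energy)) + 1
    n_modes = min(n_modes, s.size)
    return U[:, :n_modes], s


# Synthetic ensemble (an assumption): M = 20 fields on q = 200 points.
xi = np.linspace(0.0, 1.0, 200)
params = np.linspace(0.5, 1.5, 20)
S = np.column_stack(
    [a * np.sin(np.pi * xi) + a**2 * np.cos(2.0 * np.pi * xi) for a in params]
)

Phi, singular_values = pod_basis(S)
coeffs = Phi.T @ S                 # projection coefficients, one column per snapshot
S_rec = Phi @ coeffs               # lossy reconstruction of the ensemble
rel_err = np.linalg.norm(S - S_rec) / np.linalg.norm(S)
```

Because the retained energy fraction is at least the threshold, the relative reconstruction error in the Frobenius norm is bounded by the square root of the discarded energy fraction.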

The resulting POD-based reduced-order model can 
be used in an optimization process to predict state so- 
lutions that are not included in the original ensemble. 
This useful feature requires the transformation of the 
projection coefficients from the discrete sample space 
for which they have been computed to a continuous 
space. In other words, by itself the POD model does 
not have a predictive feature globally, i.e., over the 
whole design space. Among the possible options to 
accomplish this task, here a functional relation is estab- 
lished between the POD coefficients, which represent 
the projection of a generic CFD flow field onto the 
set of POD basis vectors, and the design variables. It 
is well known that regression techniques are partic- 
ularly suitable to fit experimental data, as they filter 
the random noise out from the data. This behavior is 
less desirable when working with deterministic computer
simulations. In this case, one asks the
data fit model to exactly reproduce the sample data
used for training and to consistently capture the local
data trends. A radial basis function (RBF) network
meets these criteria and, therefore, is used here to
interpolate the POD coefficients over the whole design 
space. 
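
A minimal sketch of this coupling is shown below, assuming Gaussian basis functions with a fixed, hand-picked width and a synthetic one-parameter ensemble in place of CFD snapshots; `rbf_fit` and `snapshot` are hypothetical helpers, and the kernel type and width actually used for the results in this chapter are not implied.

```python
import numpy as np


def rbf_fit(W, C, width=0.25):
    """Interpolate POD coefficients C (M, k) over design vectors W (M, p)."""
    d2 = np.sum((W[:, None, :] - W[None, :, :]) ** 2, axis=-1)
    weights = np.linalg.solve(np.exp(-d2 / width**2), C)

    def predict(w):
        d2_new = np.sum((W - np.asarray(w)) ** 2, axis=-1)
        return np.exp(-d2_new / width**2) @ weights

    return predict


def snapshot(a, xi=np.linspace(0.0, 1.0, 200)):
    """Synthetic stand-in for a computed flow field (an assumption)."""
    return a * np.sin(np.pi * xi) + a**2 * np.cos(2.0 * np.pi * xi)


W_train = np.linspace(0.5, 1.5, 9).reshape(-1, 1)         # M = 9 designs, p = 1
S = np.column_stack([snapshot(w[0]) for w in W_train])    # (q, M) ensemble
U, s, _ = np.linalg.svd(S, full_matrices=False)
Phi = U[:, :2]                                            # two dominant modes
C_train = (Phi.T @ S).T                                   # (M, 2) coefficients

predict_coeffs = rbf_fit(W_train, C_train)
w_new = np.array([1.1])                                   # design not in the ensemble
x_pred = Phi @ predict_coeffs(w_new)                      # surrogate state vector
x_true = snapshot(w_new[0])
rel_err = np.linalg.norm(x_pred - x_true) / np.linalg.norm(x_true)
```

The interpolant reproduces the training coefficients at the sampled designs, which is the property asked of a data fit trained on deterministic simulations; its accuracy at unsampled designs depends on the sampling density and on the kernel width.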


60.4 Application Example of POD-Based Surrogates 


This section proposes an application of POD-based re- 
duced-order models to the transonic flow around an 
airfoil. Indeed, as the POD model should introduce 
more physics within a surrogate approximation, one 
is interested in comparing such a novel approach to 
standard methodologies in order to establish, though in 
a preliminary way, its advantages and drawbacks. We
consider a typical case in aircraft aerodynamics which,
although far from being of industrial interest, retains
its main physical features. The aerodynamic case
is represented by the steady, viscous air flow around 
a scaled RAE 2822 airfoil. This geometry was selected 
as it is a standard object in CFD numerical modeling 
and validation [60.28]. The POD snapshots are ob- 
tained by perturbing the RAE 2822 airfoil by means 
of the parameterization described later on. A mixed
POD/CFD approach (zonal POD) is proposed to increase
the accuracy level of the surrogate model in
transonic conditions. 


60.4.1 Parameterization 
and Design Space Definition 


In the present context, surrogate modeling is aimed at 
providing a fast and accurate tool to speed up the pro- 
cess in aerodynamic shape design. As a consequence, 
one of the most important issues is to show its suitabil- 
ity and applicability to the shape optimization problem. 
Indeed, the definition of the design space through shape 
modification parameters typically involves a complex, 
often highly non-linear relation between the flow field 
and the design variables. Moreover, modifying an air- 
craft component (e.g., a wing airfoil) requires several 
parameters, thus enlarging the dimensions of the design 
space. It is straightforward, then, that the complexity of 
the problem increases and approaches a real-world ap- 
plication level. The class-shape transformation (CST) 
method [60.29] provides an analytical form to represent 
various geometries of aeronautical interest and it shows 
the interesting properties of continuity, differentiability, 
and reproducibility of a huge number of test shapes. It 
allows us to specify the airfoil contour as a product of 
a class function, which in the proposed case defines the 
rounded leading edge/pointed trailing edge airfoil class, 
and a shape function obtained as a linear combination 
of n-th-order Bernstein polynomials. The design vector
is

$$w = \left(A_0^{u}, A_1^{u}, \ldots, A_n^{u}, A_0^{l}, A_1^{l}, \ldots, A_n^{l}\right),$$

where the first and last parameters $A_0^{u/l}$, $A_n^{u/l}$ are related,
respectively, to the upper and lower leading edge radius
$R_{le}^{u/l}$ ($A_0^{u/l} = \sqrt{2 R_{le}^{u/l}}$)
and to the trailing edge closure angle $\beta^{u/l}$
($A_n^{u/l} = \tan \beta^{u/l}$), as is shown in detail in [60.29].

In the present context, seventh-order Bernstein 
polynomials are considered, hence each airfoil side 
is described by eight design variables. The design 
space $D_W$ is then a subset of $\mathbb{R}^{16}$. A scaled 14%
thickness ratio RAE 2822 airfoil is selected as base- 
line airfoil. The airfoil geometry is shown in Fig. 60.1, 
where the x-coordinate is the abscissa along the airfoil 
chord and y is given by the aforementioned approach. 


The values of the corresponding design parameters, 
which define the RAE 2822 profile according to the 
parameterization, and their range of variation, which 
defines the design space, are reported in Table 60.1. 
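
The CST construction described above (class function times a Bernstein-polynomial shape function) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class-function exponents 0.5 and 1.0 are the usual choice for a rounded leading edge and pointed trailing edge, the coefficient ordering $A_0, \ldots, A_7$ taken from Table 60.1 is assumed, and any trailing-edge thickness term is omitted. The names `cst_side`, `UPPER`, and `LOWER` are hypothetical.

```python
from math import comb

import numpy as np

# Baseline coefficients from Table 60.1; the ordering A_0..A_7 is an assumption.
UPPER = [0.1293, 0.1282, 0.1771, 0.1219, 0.2393, 0.1662, 0.1976, 0.2110]
LOWER = [-0.1280, -0.1483, -0.1080, -0.2580, -0.0918, -0.1079, -0.0561, 0.0638]


def cst_side(x, coeffs, n1=0.5, n2=1.0):
    """Return y(x) = C(x) * S(x) for one airfoil side, with x in [0, 1]."""
    x = np.asarray(x, dtype=float)
    n = len(coeffs) - 1  # Bernstein order (7 for eight coefficients)
    # Class function: rounded leading edge, pointed trailing edge (assumed exponents).
    class_fn = x**n1 * (1.0 - x) ** n2
    # Shape function: linear combination of n-th-order Bernstein polynomials.
    shape_fn = sum(
        a * comb(n, i) * x**i * (1.0 - x) ** (n - i) for i, a in enumerate(coeffs)
    )
    return class_fn * shape_fn


x = np.linspace(0.0, 1.0, 101)  # abscissa along the chord
y_upper = cst_side(x, UPPER)
y_lower = cst_side(x, LOWER)
```

Perturbing any of the sixteen coefficients within its range changes the contour smoothly, which is what makes the design vector convenient for sampling.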


60.4.2 Design of Experiments 


The location of sample points is an important issue with 
respect to the cost and accuracy of any surrogate model. 
The design of experiment theory (DOE) [60.30] was 
developed to provide experimentalists with a tool to 
optimally choose the independent variable values for 
a limited number of experiments. The aim is generally 
to use the results of the experiments to study and inves- 
tigate the response and sensitivity of some dependent 
quantity to the independent variables. Classical DOE 
methods were originally designed to alleviate the ef- 
fects of noise, so they have been generally employed 
in conjunction with regression techniques. However, 
computer experiments are not subject to random er- 
rors, hence it is worthwhile using a different strategy to 
obtain as much information as possible about the input-output
dependence. A variety of methods have been
developed to fill the design space in an optimal sense. 
One of the most widely adopted is the Latin hyper- 
cube sampling (LHS) technique. It was first proposed 
by McKay et al. [60.31] as an alternative to Monte Carlo 
techniques for the design of computer experiments. The 
basic principle of LHS is to bound the randomness of 
the sample selection in a given region. In fact, given 
that t is the number of design variables, each design
variable range is divided into m intervals or bins of
equal probability. This generates a total of m × t bins
in the whole space. Within each bin only one sample 
is allocated randomly. This ensures that a one-dimen- 
sional projection onto the parameter space will produce 
one sample in each bin, thus eliminating the correla- 
tions between the design variables. LHS is useful for 
the initialization of POD-based surrogate models, but, 
as will be detailed later on, it exhibits some major 
Fig. 60.1 Baseline geometry, RAE 2822 airfoil

Table 60.1 Design parameters, values, and ranges

Parameter   Baseline value   Range
A_0^u       0.1293           [0.1293, 0.2293]
A_1^u       0.1282           [0.1282, 0.2282]
A_2^u       0.1771           [0.1771, 0.2771]
A_3^u       0.1219           [0.1219, 0.2219]
A_4^u       0.2393           [0.2393, 0.3393]
A_5^u       0.1662           [0.1662, 0.2662]
A_6^u       0.1976           [0.1976, 0.2976]
A_7^u       0.2110           [0.2110, 0.3110]


limits that prevent it from being used as a standard 
sampling technique for optimization purposes. Indeed, 
LHS is optimal in the sense of design space cover- 
age, but it does not allow for refining the sampling 
distribution according to enrichment or improvement 
criteria, e.g., design space exploration or objective func- 
tion minimization. 
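
The LHS principle described above (m equal-probability bins per variable, one sample per bin, independent random pairing across variables) can be sketched as follows. This is an illustrative implementation on the unit hypercube; the function names `latin_hypercube` and `scale_to_bounds` are hypothetical, and a uniform distribution on each variable range is assumed.

```python
import numpy as np


def latin_hypercube(m, t, rng=None):
    """Return an (m, t) Latin hypercube sample on the unit hypercube."""
    rng = np.random.default_rng(rng)
    # One random point inside each of the m bins, for every variable.
    u = (np.arange(m)[:, None] + rng.random((m, t))) / m
    # Shuffle the bin order independently for each variable.
    for k in range(t):
        u[:, k] = rng.permutation(u[:, k])
    return u


def scale_to_bounds(u, lower, upper):
    """Map unit-hypercube samples to the box [lower, upper]."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    return lower + u * (upper - lower)


# Example: 180 design sites in a 16-dimensional design space.
sites_unit = latin_hypercube(180, 16, rng=0)
```

Every one-dimensional projection of `sites_unit` has exactly one point per bin, but the sample is fixed once drawn: nothing in the procedure lets later points react to what the model has learned, which is the limitation discussed above.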


60.4.3 Zonal POD 


The POD surrogate model is mainly designed as a ROM
within a shape optimization process, where the geome- 
try and, hence, the volume mesh vary with the design 
site. Moreover, the application is focused on transonic 
aerodynamics with potential flow separations and shock 
waves. Therefore, care must be taken with the defi- 
nition of the snapshot domain and how to extract the 
integral quantities of interest (e.g., aerodynamic force 
coefficients) from the snapshot structure. Indeed, as the 
snapshots are expressed through a linear combination of 
POD modes, shock waves, flow separations, and other 
non-linearities present in the training ensemble would 
be captured and replicated in the POD modes, so that any
prediction of a new snapshot would be likely to carry
the footprint of those flow features with it. On average
this is desirable behavior for a physics-based approach,
but near the optima, which should exhibit a shockless
and fully attached flow, a POD approximation of this
type would hide the potential improvement behind the
trace of the original snapshots. This issue is of
paramount importance and can be tackled by introducing
and combining two concepts: zonal POD and adaptive
sampling. The first reduces the inherent variability of
the snapshots by means of domain partitioning, so that
the POD basis does not have to capture all the physics
within the field.
The second technique will allow us to enrich the POD 
approximation by sampling at new points, which are 
optimal in the sense of exploration/model improvement 
balance. In this section, the discussion will be focused 


Table 60.1 (continued)

Parameter   Baseline value   Range
A_0^l       -0.1280          [-0.2280, -0.1280]
A_1^l       -0.1483          [-0.2483, -0.1483]
A_2^l       -0.1080          [-0.2080, -0.1080]
A_3^l       -0.2580          [-0.3580, -0.2580]
A_4^l       -0.0918          [-0.1918, -0.0918]
A_5^l       -0.1079          [-0.2079, -0.1079]
A_6^l       -0.0561          [-0.1561, -0.0561]
A_7^l       0.0638           [-0.0362, 0.0638]

on the zonal POD approach. The basic idea proposed 
in [60.26] is to use a mixed full-order/reduced-order 
model (FOM/ROM) by splitting the solution domain 
into two sub-domains: the FOM (i.e., the CFD RANS
model) is used only in the vicinity of the surface to
accurately solve the near-wall boundary layer,
non-linearities (e.g., shock waves), and flow separations where
they occur; the ROM (i.e., the POD surrogate model) is
exploited to reconstruct the flow field far from the solid 
wall, where a smoother and weakly varying solution is 
expected. 

Figure 60.2 shows a sketch of the domain decom- 
position. The POD-based surrogate model is built on 
the spatial domain defined in the light gray region. 
Once the POD model has been trained, the surrogate 
response on the FOM/ROM boundary interface is ex- 
tracted and used as boundary conditions to iterate the 
full-order CFD solver in the inner domain (dark gray). 
Details about the specific boundary condition formula- 
tion across the two domains can be found in [60.26]. 

A useful advantage of the zonal POD is that any 
aerodynamic coefficient or surface distribution of inter- 
est (e.g., pressure or skin friction distributions) can be 
directly extracted from the CFD solution in the inner
domain. On the other hand, when the full POD model
is used (i.e., trained on the whole domain), the surrogate
model would provide a prediction of the mesh and
state variables, so that a properly designed condensation
procedure has to be applied to retrieve the integral
coefficients like lift ($C_l$), drag ($C_d$), and pitching
moment ($C_m$) coefficients.

Fig. 60.2 Zonal approach, FOM/ROM domains and volume mesh


60.4.4 Model Training, Validation, 
and Error Analysis 


Training and validation are key phases that POD surro- 
gates must undergo for assessment of their performance 
and potential. Validation means measuring the good- 
ness of a surrogate model with respect to a so-called 
truth response (e.g., the CFD solution) and, therefore, 
drawing information to eventually optimize it. The goal 
is to evaluate the potential of the model to globally ap- 
proximate the design space. Once a surrogate model 
has been trained, classical validation is carried out by 
sampling the design space once more, evaluating the
full and reduced-order models on the new sampling set, 
and computing a set of statistics from the data obtained. 
This approach requires computing new CFD solutions 
and, for this reason, it is computationally expensive. 
In order to reduce the number of full order computa- 
tions, the validation points could be represented by the 
same set used for training, provided that a cross-vali- 
dation technique is used (e.g., leave-one-out). Indeed, 
cross-validation implies the partitioning of a sample of 
data into complementary subsets: one subset is used for 
training, the other one for validation or testing. The 
variability due to the choice of the partitions is usu- 
ally reduced by performing multiple rounds of cross- 
validation and averaging over the rounds. Here a classi- 
cal validation is performed, while cross-validation will 
be used later on in auto-adaptive sampling, when the 
estimation of the quality of the POD model basis and 
coefficients will be required. 


Model Training and Setup 
Before getting into the validation process, the 
POD/ROM models have to be trained, so that an initial 
Latin hypercube sampling is done on the design space 
made of 16 variables (see Sect. 60.4.1). The size M 
of the training sampling is chosen to be very large 
(M = 180) to cover each design variable with a sufficient
number of samples. The set of design sites $\{w_i\}$
is then transformed into the physical representation of
the airfoil geometry through the chosen parameterization.
The baseline geometry is a modified RAE 2822
airfoil, scaled to 14% thickness-to-chord ratio to amplify
compressibility effects. A total of 180 calls of the volume
mesh generator and CFD solver are launched in paral- 
lel at fixed flow conditions to compute the flow field 
around each airfoil shape. Due to a proper selection 
of the baseline geometry and design weight ranges, 
a wide and varied distribution of shock wave locations 
and flow separations is obtained through the training 
dataset. This is a highly desirable feature to test the 
predictive capability of such a physics-based surrogate 
model. The Mach number is 0.729, the Reynolds number
is 6.5 × 10^6, and the flow angle of attack is 2°.
Fully turbulent flow is assumed. For each airfoil shape, 
a single-block structured volume mesh made of 25 186 
points (12288 cells) is computed by means of an auto- 
matic hyperbolic grid generator. Using fixed topology, 
mesh parameters, and sizes, standard quality grids are 
obtained for each geometry. The first cell at the wall 
is placed so as to have a unit y+ at the specified flow 
conditions. A sketch of the volume mesh around the 
baseline airfoil is shown in Fig. 60.2. 

With reference to Fig. 60.3, the mesh partitioning 
is applied to define the FOM/ROM domains, which are 
required to be non-overlapping and adjoining. This can 
be easily done when a structured mesh is available as 
the grid lines can be used as interfaces between do- 
mains. To this aim, the d parameter is introduced as 
the distance of the FOM/ROM interface from the airfoil 
leading edge. Indeed, different POD-based reduced-or- 
der models can be defined by varying this distance 
and, hence, reducing or increasing not only the size but 
also the inherent variability of the snapshot set. This 
mechanism has to be carefully considered beside the 
coexistence of eight heterogeneous variables (spatial 
coordinates, density, pressure and velocity) in the same 
snapshots, as it could introduce a bias in the correlation 
process. For example, the POD reduction could give 
more importance to the flow features related to the snap- 
shot variables which exhibit the largest absolute values 
or the widest range of variation. To avoid this, a scaling 
operator is applied to the snapshot set prior to feeding 
the POD model. The scaling factors are designed so as
to map each variable to the interval [0, 1] by normalizing
as follows

$$\tilde{x}_h = \frac{x_h - \min(x_h)}{\max(x_h) - \min(x_h)}, \tag{60.5}$$

where $x_h$ is the vector containing the h-th flow variable
in the snapshot $s = (x_1, x_2, \ldots, x_8)^T$ and the minimum
and maximum are taken over the vector $x_h$. In the
present investigation the scaling factors are defined
once for each variable and kept constant even when
varying the FOM/ROM domains (and hence the snapshot
size).
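
A minimal sketch of the min-max scaling step follows. It assumes, for illustration only, that each snapshot column stacks the variables in contiguous blocks of equal length and that the extrema of each variable are taken over the whole training ensemble so that the factors can be stored and reused; the name `minmax_scale_blocks` is hypothetical.

```python
import numpy as np


def minmax_scale_blocks(snapshots, n_vars):
    """Scale each variable block of a snapshot matrix to [0, 1].

    snapshots: (N, M) array, one snapshot per column; each column is assumed
    to stack n_vars variable blocks of equal length N / n_vars.
    Returns the scaled matrix and the per-variable minima and maxima.
    """
    n_rows, n_snap = snapshots.shape
    blocks = snapshots.reshape(n_vars, n_rows // n_vars, n_snap)
    # Extrema per variable, over all grid points and all snapshots (assumption).
    lo = blocks.min(axis=(1, 2), keepdims=True)
    hi = blocks.max(axis=(1, 2), keepdims=True)
    # A variable that is constant over the ensemble would give hi == lo here.
    scaled = (blocks - lo) / (hi - lo)
    return scaled.reshape(n_rows, n_snap), lo.ravel(), hi.ravel()


# Example: two variables with very different magnitudes, four snapshots.
rng = np.random.default_rng(0)
raw = np.vstack([rng.uniform(0.0, 1.0, (5, 4)), rng.uniform(1e4, 1e5, (5, 4))])
scaled, lo, hi = minmax_scale_blocks(raw, n_vars=2)
```

Without this step the second variable would dominate the correlation matrix simply because of its magnitude, which is the bias the scaling operator is meant to remove.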

Dealing with a zonal approach, we do not know 
a priori the optimal distance of the domain interface 
from the airfoil surface. Hence, in order to assess the 
effects of changing the interface position, three domain 
partitions are considered, corresponding to three differ- 
ent settings of the domain interface. As a consequence, 
three POD-surrogate models (here referred to with the 
initials SM) are built exploiting the CFD data obtained 
in the training phase. SM1 is a POD model with d = 0,
i.e., the full-order model is not invoked in the validation,
as the POD approximation is used to get the flow field
everywhere and no boundary condition is exchanged;
the snapshot size N is 201 488. SM2 is a POD model
with d = 0.35, i.e., the full-order domain is the dark
gray one in Fig. 60.3, while the POD approximation
is used to get the flow field anywhere else; the snapshot
size N is 91 792. SM3 is a POD surrogate model
with d = 1.25, similar to the previous one, but now the
full-order domain is the light gray one in Fig. 60.3; the
snapshot size N is 75 232. In addition, two more surrogates
are introduced and trained on the same dataset to act as
standard meta-models. SM4 is a Kriging interpolation
model with Gaussian correlation using the aerodynamic
efficiency $C_l/C_d$ as response function (Dakota package
implementation [60.32]). SM5 is a quadratic polynomial
regression model using the aerodynamic efficiency
$C_l/C_d$ as response function. Given the number t of
design variables, at least (t + 1)(t + 2)/2 design sites
should be evaluated to train this type of model. In the
present case, (t + 1)(t + 2)/2 = 153, hence the size
of the a-priori sampling is sufficient.
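
The sample-size requirement for the quadratic regression model follows from counting its unknown coefficients: one constant, t linear terms, and t(t + 1)/2 quadratic terms. The sketch below builds the corresponding design matrix to make the count explicit; the function name `quadratic_design_matrix` is hypothetical and the code is not tied to any particular regression package.

```python
from itertools import combinations_with_replacement

import numpy as np


def quadratic_design_matrix(w):
    """Return the full quadratic design matrix for sites w of shape (m, t)."""
    w = np.atleast_2d(np.asarray(w, dtype=float))
    m, t = w.shape
    columns = [np.ones(m)]                      # constant term
    columns += [w[:, i] for i in range(t)]      # linear terms
    columns += [                                # squares and cross products
        w[:, i] * w[:, j] for i, j in combinations_with_replacement(range(t), 2)
    ]
    return np.column_stack(columns)


# With t = 16 design variables the model has 1 + 16 + 136 = 153 coefficients.
n_coefficients = quadratic_design_matrix(np.zeros((1, 16))).shape[1]
```

With 180 training sites and 153 coefficients the least-squares problem is overdetermined, but only by a small margin.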

Fig. 60.3 FOM/ROM domains with varying interface

Fig. 60.4a,b Effect of zonal interface on the energy
amount captured by POD

For each of the POD-based approximations, the
ensemble energy content threshold ε is reported in
Fig. 60.4a as a function of the number of POD modes.
It is evident that SM1 requires a large number of
modes even to reproduce a relatively low energy level
(95%), while SM3 performs considerably better (97%)
with just four modes preserved. SM2 requires more
modes with respect to SM3 because the corresponding 
ROM domain embeds part of the supersonic region on 
the airfoil suction side; Fig. 60.4b clarifies this issue 
as it reports the FOM/ROM domains (as in Fig. 60.3) 
superimposed with the local Mach number contours. 
The solution is here computed around an airfoil se- 
lected within the ensemble database. While the SM3 
ROM domain is quite far off the supersonic region, 
the SM2 FOM/ROM interface lies across it, thus in- 
troducing a stronger source of variability (and of slight 
discontinuity due to the shock wave) into the ensemble. 
Therefore, for each model a given energy level is obtained
with a different number of modes. In order to make
a fair assessment, the models will not be compared using
a pre-defined number of modes, but at a comparable
energy level (at least 95%). Indeed, the number of
preserved modes is ten for SM1 (95%), seven for SM2
(96.4%), and four for SM3 (97%).
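
The mode-selection rule used here (keep the smallest number of modes whose cumulative energy reaches a threshold) can be sketched with a singular value decomposition. This is an illustrative sketch only: subtracting the ensemble mean before the decomposition and measuring energy by squared singular values are assumptions, and `pod_modes_for_energy` is a hypothetical name.

```python
import numpy as np


def pod_modes_for_energy(snapshots, energy=0.95):
    """Return the POD basis truncated at a given energy content.

    snapshots: (N, M) array with one snapshot per column.
    Returns (basis, singular_values, cumulative_energy, k).
    """
    # Centering the ensemble is an assumption of this sketch.
    fluctuations = snapshots - snapshots.mean(axis=1, keepdims=True)
    u, s, _ = np.linalg.svd(fluctuations, full_matrices=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    # Smallest k such that the first k modes capture at least `energy`.
    k = min(int(np.searchsorted(cumulative, energy)) + 1, len(s))
    return u[:, :k], s, cumulative, k
```

An ensemble dominated by a few smooth structures reaches the threshold with very few modes, whereas an ensemble containing moving shocks spreads its energy over many modes, which is the behavior reported for SM3 and SM1, respectively.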


Error Estimation 
For each design candidate, the aerodynamic efficiency
$E = C_l/C_d$ is computed and used to assess some error
measures: the percentage error

$$\mathrm{PE}_i = \frac{\hat{E}_i - E_i}{E_i} \times 100, \quad i = 1, \ldots, M;$$

the mean percentage error (MPE)

$$\mathrm{MPE} = \frac{1}{M} \sum_{i=1}^{M} \mathrm{PE}_i;$$

the standard deviation of the percentage error (SDPE);
and the R-squared coefficient of determination,
where index i denotes the i-th sample of the DOE
validation plan, the hat quantities refer to the surro- 
gate predictions, while the hatless ones to the full-order 
predictions. This type of error measure provides a pic- 
ture of how the POD model reconstruction error is 


propagated on a surface integral, as only aerodynamic 
force coefficients appear. Therefore, it is very useful 
to understand the suitability of the surrogate model 
to approximate the fitness function in an aerodynamic 
optimization process, which usually requires the evalu- 
ation of aero-coefficients. However, in order to ensure 
a more general error analysis, the mean absolute per- 
centage error between the exact CFD computation and 
the predicted value is introduced at snapshot level as 


$$\mathrm{MAPE}_i = \frac{1}{N} \sum_{j=1}^{N} \left| \frac{s_{ij} - \hat{s}_{ij}}{s_{ij}} \right| \times 100,$$

where N is the snapshot size and $s_{ij}$ ($\hat{s}_{ij}$) is the j-th element
of the computed (predicted) snapshot vector at the i-th
validation site.
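
The error measures of this subsection can be sketched as below. Several details are assumptions made for illustration: the sign convention of the percentage error (prediction minus truth, consistent with the deviations listed in Table 60.3), the sample standard deviation with M − 1 in the denominator for the SDPE, and the usual definition of R-squared as one minus the ratio of residual to total sum of squares. All function names are hypothetical.

```python
import numpy as np


def percentage_error(e_true, e_pred):
    """Signed percentage error of the predicted efficiency at each site."""
    e_true = np.asarray(e_true, dtype=float)
    e_pred = np.asarray(e_pred, dtype=float)
    return (e_pred - e_true) / e_true * 100.0


def goodness_of_fit(e_true, e_pred):
    """Return (MPE, SDPE, R-squared) over a validation plan."""
    e_true = np.asarray(e_true, dtype=float)
    e_pred = np.asarray(e_pred, dtype=float)
    pe = percentage_error(e_true, e_pred)
    mpe = pe.mean()
    sdpe = pe.std(ddof=1)  # sample standard deviation (assumption)
    ss_res = np.sum((e_true - e_pred) ** 2)
    ss_tot = np.sum((e_true - e_true.mean()) ** 2)
    return mpe, sdpe, 1.0 - ss_res / ss_tot


def snapshot_mape(s_true, s_pred):
    """Mean absolute percentage error over the N entries of one snapshot."""
    s_true = np.asarray(s_true, dtype=float)
    s_pred = np.asarray(s_pred, dtype=float)
    return np.mean(np.abs((s_true - s_pred) / s_true)) * 100.0


# Check against Table 60.3: CFD extrema versus the SM1 predictions.
pe_sm1 = percentage_error([46.43, 20.61], [52.19, 22.84])
```

The snapshot-level measure is undefined wherever an entry of the true snapshot is exactly zero, so in practice such entries would need separate treatment.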

Finally, monotonicity is one of the properties a good
surrogate should have in an optimization process. Given
two true data $f(w_i)$ and $f(w_j)$ and the corresponding
surrogate predictions $\hat{f}(w_i)$ and $\hat{f}(w_j)$, a surrogate model is
monotonic when

$$f(w_i) < f(w_j) \;\Rightarrow\; \hat{f}(w_i) < \hat{f}(w_j). \tag{60.6}$$

This property can be global (i.e., valid for each
$w_i, w_j \in D_W \subset \mathbb{R}^t$) or local. In order to measure the
monotonicity, the following metric is introduced


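$$G = \sum_{i=1}^{M} \sum_{j=1}^{i} -\min\left(0, \frac{\Delta \hat{E}_{ij}}{\Delta E_{ij}}\right), \tag{60.7}$$

where $\Delta E_{ij} = E_i - E_j$ and $\Delta \hat{E}_{ij} = \hat{E}_i - \hat{E}_j$. The G index
can assume any non-negative value: a zero value indicates
global monotonicity, and the higher the magnitude, the
more significant the monotonicity loss.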
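
A pairwise monotonicity index in the spirit of (60.7) can be sketched as follows. The exact form used here is an assumption: each pair of validation sites contributes the negative part of the ratio between the predicted and the true efficiency difference, pairs with j < i are visited once, and pairs with identical true values are skipped to avoid a division by zero. The name `monotonicity_index` is hypothetical.

```python
import numpy as np


def monotonicity_index(e_true, e_pred):
    """Sum of order violations between true and predicted efficiencies."""
    e_true = np.asarray(e_true, dtype=float)
    e_pred = np.asarray(e_pred, dtype=float)
    g = 0.0
    for i in range(len(e_true)):
        for j in range(i):
            d_true = e_true[i] - e_true[j]
            d_pred = e_pred[i] - e_pred[j]
            if d_true == 0.0:
                continue  # ordering undefined for equal true values
            # A negative ratio means the surrogate reverses the ordering.
            g += -min(0.0, d_pred / d_true)
    return g
```

A surrogate that preserves the ordering of every pair returns zero, while a surrogate that exactly mirrors the true values returns the number of pairs, M(M − 1)/2.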


Validation Results and Analysis 
The validation plan is generated with a new Latin 
hypercube sampling of size M = 50. The goodness- 
of-fit for each model is estimated; the results are sum- 
marized in Table 60.2. The surrogate models can be 


Table 60.2 Surrogate goodness-of-fit estimation

Surrogate   R-squared   MPE (%)   SDPE (%)      G             Ranking
SM1         0.5876      10.33     48.14         1597.22       4
SM2         0.8899      4.55      12.85         647.83        2
SM3         0.9791      2.30      [illegible]   [illegible]   1
SM4         0.8657      4.56      26.61         853.98        3
SM5         0.06074     15.62     171.62        1761.64       5


Fig. 60.5a,b Correlation plot of the prediction of surrogate models
(model CL/CD versus exact CL/CD): (a) SM1, SM2, SM5; (b) SM3, SM4


ranked as reported in the rightmost column: SM3 ex- 
hibits superior performances for each quality index, 
while the quadratic polynomial surface is very poor 
in approximating the objective function. SM3 performs 
very well even on the SDPE estimate, which measures 
the variation of the percentage prediction error along 
the validation sampling. Hence, the prediction errors 
at any validation site are comparable and close to the 
mean value. This is a very desirable feature for a sur- 
rogate model designed for optimization. On the other 
hand, SM5 shows very poor performances because 
such a polynomial regression is unable to approximate 
a multi-modal, rapidly changing objective function. 
Looking at the figures in Table 60.2, models SM2 
and SM4 show similar results, even if they differ com- 
pletely in methodology and construction. This is a use- 
ful indication when seeking the proper balance between 
the FOM and ROM domains (i.e., to determine the
distance d): the POD surrogate accuracy increases by 
moving the FOM/ROM interface away from the airfoil 


Table 60.3 Surrogate estimations of aerodynamic efficiency
for best and worst validation airfoils

Surrogate   ID of max   ID of min   max     min     Δ max (%)     Δ min (%)
CFD         12          22          46.43   20.61   0             0
SM1         12          22          52.19   22.84   12.40         10.81
SM2         12          22          48.27   20.45   3.85          -0.77
SM3         12          22          46.40   20.12   -0.07         -2.39
SM4         26          22          47.40   19.86   [illegible]   [illegible]
SM5         12          39          54.62   16.78   17.63         -18.59


surface, and there exists a peculiar value of the distance 
d for which its predictive power is very close to stan- 
dard and efficient interpolation techniques. As shown 
in Table 60.2, the monotonicity measure is coherent 
with the previously introduced indicators and, consider- 
ing the big difference between SM3 and other models, 
it provides additional evidence of the quality of this 
model. 

Figure 60.5 reports the correlation plot between the 
prediction of the model and the true CFD data. Again, 
SM2, SM3, and SM4 are globally closer to the lin- 
ear trend, resulting in a better fit. The correlation plot 
highlights another significant feature of the SM3 model:
it generally underestimates the aerodynamic efficiency.
For further comparisons, Table 60.3 summarizes the
validation set indices where each model predicts the
highest and lowest efficiencies, the corresponding values
of aerodynamic efficiency, and the percentage error
with respect to the CFD datum. This is useful for evaluating
the capability of the model to identify the global
extrema of the objective function. It is observed that
only SM4 leads to a wrong estimation of the position of 
the optimal airfoil while SM5 underestimates the per- 
formance of the worst profile. 

The last two properties, i.e., the capability to pre- 
serve the monotonicity of the dataset and to correctly 
identify the best/worst candidates, are crucial aspects in 
SBO, so that models SM2 and SM3 seem to be more 
suitable for this purpose. 

Fig. 60.6 Snapshot error prediction

The accuracy of POD models is also evaluated and
compared in terms of the point-to-point snapshot error.
Figure 60.6 shows the results for each snapshot
belonging to the validation plan (again ranging from 1 
to 50). The error index is plotted in logarithmic scale. 
It turns out that, in strong transonic conditions, train- 
ing a POD model on the full CFD domain (SM1) would 
lead to misleading results in the prediction phase, as the 
model would not be able to catch the highly non-linear 
trends that characterize this kind of flow. Indeed, the 
high number of POD modes required and the low good- 
ness-of-fit performance suggest that further modeling 
is needed to optimize the computation of the basis vec- 
tors and modal coefficients in transonic aerodynamics. 
In the next sections, we will introduce some adaptive 
sampling concepts to globally improve the reduced-or- 
der models predictions. 

A final comparison is possible in terms of POD 
model accuracy versus computational time and cell sav- 
ing. In particular, the R-squared prediction error can be 
taken as a measure of the model accuracy, while the 
time saving index TS and the cell saved index CS are 
defined as 


$$TS = \frac{T_{FULL} - T_{SM}}{T_{FULL}}, \qquad CS = \frac{N_{FULL} - N_{SM}}{N_{FULL}}, \tag{60.8}$$


where T and N are, respectively, the computational time 
for 1000 CFD iterations and the number of solved com- 
putational cells. The subscripts FULL and SM refer to 
the full grid CFD computation and the CFD compu- 
tation on the smaller FOM domain. In Fig. 60.7a the 
three indices (R-squared, TS, and CS) are plotted against 


the distance d of the FOM/ROM interface from the 
airfoil leading edge. It shows that a clear trade-off 
exists between accuracy and time/cell saving and pro- 
vides useful guidelines to tailor the choice of the best 
POD model to the basic requirements of the target ap- 
plication. For instance, if the target is to do a pre- 
screening of the objective space, one could use a faster 
and less accurate POD model that, however, guarantees 
the preservation of the physics. Figure 60.7b presents
a comparison between surrogate models in terms of the 
MPE and SDPE. The plot shows a graphical picture 
of the results in Table 60.2 concerning the accuracy 
of the POD models with moving FOM/ROM inter- 
face and the comparison with more classical meta- 
models. 


Conclusions 

Three POD/ROM models have been trained and com- 
pared: the first one consisted in feeding the POD en- 
semble with the full field, hence without any domain 
decomposition; in the second and third one, the zonal 
approach was applied by defining two different values 
of the distance of the interface from the airfoil lead- 
ing edge. Results showed that the model accuracy is 
strongly dependent on the distance parameter, mainly 
because of the presence of the supersonic expansion 
lobe and the pressure jump across the shock wave on the 
airfoil suction side. In fact, the SM3 model showed su- 
perior performance with respect to both the other POD 
models and standard interpolation techniques like Krig- 
ing and regression methods like quadratic polynomial 
fitting. It also allows us to obtain a very accurate re- 
construction of airfoil surface distributions and, hence, 
of aerodynamic coefficients, which are very often the 
actual target of aerodynamic design. Another important 
conclusion of the work is that it seems completely mis- 
leading to base the POD ensemble on the full flow field 
when transonic conditions and shape modifications act 
together. Indeed, as the POD reconstruction is a linear 
combination of POD modes, capturing the combined 
non-linear effects of boundary layer and compressibil- 
ity is hardly possible when the position and intensity of 
the shock wave and its interaction with the boundary 
layer vary too much. Globally, the proposed POD sur- 
rogate model was shown to have many characteristics 
that make it suitable for aerodynamic design. However, 
a trade-off was found between POD model accuracy 
and resource saving as a function of the distance pa- 
rameter: the smaller the full-order domain, the shorter 
the computational time required but also the less accu- 
rate the reconstruction. 
Fig. 60.7a,b Performance of surrogate models as a function of FOM/ROM interface positioning


Indeed, one of the key points of the research is to
overcome the accuracy issues by optimally selecting the
training candidates. In the proposed example, we se- 
lected 180 sites to a-priori sample the 16-dimensional 
design space, but in principle we do not have any in- 
formation about the appropriate size and locations of 
the sample points. Intuitively, we would like to have 
a sampling strategy that would fill the space in an 
efficient manner and would allocate more points in 


regions of the design space where the simulation response
is strongly non-linear or where an optimum is likely to be
found. In industrial practice, this would mean, given
a computational budget, improving the quality of the 
POD surrogate by intelligently choosing the training 
samples. Conversely, given a POD model with a cer- 
tain quality level, the rationale would be to reach 
the same accuracy with less high-fidelity computa- 
tions. 


60.5 Strategies for Improving POD Model Quality: Adaptive Sampling 


In order to address the issues raised in the previous
section, a set of strategies has been proposed to update 
and enhance the surrogate/POD model through adap- 
tive DOE techniques. Indeed, the selection of the design 
sites to be included in the POD ensemble, instead of 
being fully derived from an a-priori sampling strategy, 
can be tailored to match specific POD-related improve- 
ment requirements. Adaptive sampling strategies can 
be properly designed to account for these requirements 
by means of the so-called in-fill criteria. While a- 
priori sampling techniques do not use any information 
about the model prediction, adaptive techniques incre- 
mentally select new sampling points by exploiting the 
input/output relation observed at the previous stages. 
Hence, some adaptive DOE strategies for POD-based 
reduced-order models are proposed. The main refer- 
ences are the works by Goblet and Lepot [60.33] and 
Sainvitu et al. [60.34]. 


60.5.1 Rationale 


Adaptive sampling is aimed at improving the modal ba- 
sis or the modal coefficients set, which represents the 
core of POD modeling. Indeed, given a POD model
built on a snapshot ensemble $\{s(w_1), s(w_2), \ldots, s(w_M)\}$,
the aim is to find a new point $w_{new}$ in the design
space so that the new POD model, built on the new
set $\{s(w_1), s(w_2), \ldots, s(w_M), s(w_{new})\}$, will provide
improved predictions and better exploration of the design
space at the same time. The fundamental idea is
to realize a proper trade-off between local accuracy and 
global design space exploration. On the one hand, we 
would like to sample near those design sites whose im- 
portance is higher (exploitation). On the other hand, 
knowing the relative distances between new sampling 
points and snapshot sites, we would like to sample far 
away from the known points in order to potentially 
enrich the global prediction of the POD model (exploration).
In other words, we need to know how close
a new candidate sample is to the training set and
how much the nearest training sample weighs on the
POD model. Of course, the meaning of distance and 
importance needs to be mathematically defined, but the 
underlying idea is to combine the information about the 
relative importance of the snapshot and the nearest dis- 
tance in order to satisfy both requirements. This leads to 
the definition of a potential of enrichment in the general 
form 


$$V_e(w_1, w_2, \ldots, w_M, y) = R(w_1, w_2, \ldots, w_M, y) \times I_e(w_1, w_2, \ldots, w_M, y), \tag{60.9}$$

where y is a generic point in the design space and
$\{w_1, w_2, \ldots, w_M\}$ is the usual set of training points.


A new sample can be obtained by simply maximizing
the enrichment function. The function R gives a measure
of the distance of y from the set $\{w_1, w_2, \ldots, w_M\}$,
according to a certain norm, and helps to obtain good
space-filling properties. The function $I_e$ can be referred
to as an importance function, as it has to provide the
information about the quantity to be improved. A natural
candidate would be a measure of the error at each new
point, if the surrogate model were designed so as to
provide an error estimation at each y. Otherwise, in
a surrogate-based optimization process, the function $I_e$
could be directly linked to the approximation of the
objective function to drive the search for a new optimized
design sample. With reference to the POD modeling
presented so far, the importance function should be
closely tied to the quality of the modal basis
$\{\phi_1, \ldots, \phi_M\}$ or the quality of the RBF models built on
the modal coefficients $\{\alpha_1(w), \alpha_2(w), \ldots, \alpha_M(w)\}$.
Hence, two different approaches can be followed, both
based on the leave-one-out cross-validation technique.
For the sake of clarity, the superscript $-j$ will indicate
that the referenced element (basis vector, coefficient
model, SVD matrices) has been obtained by means of
a leave-one-out process, i.e., by removing the j-th
sample from the training set and re-computing the model.


60.5.2 Improvement of the Modal Basis 


The first strategy consists in defining the importance 
function as a measure of the relative influence of each 
snapshot on the modal basis. This requires evaluating 
how much the modal basis changes when removing the 
snapshots one by one. The relative influence of the j-th 


snapshot on the modal basis is defined as 


; Is (Wj) 
Lw) = =r. (60.10) 
HLO EL wo 
where 
M 
hm) => Oi l (60.11) 


=i A(s) 


is the influence of the j-th snapshot on the modal basis,
$\sigma_i$ is the singular value associated with the i-th basis
vector, and $\phi_i^{-j}$ is the i-th basis vector obtained after the
substitution of the j-th snapshot vector with a null vector
in the ensemble matrix. Goblet and Lepot [60.33] show
how to efficiently compute $\phi_i^{-j}$ for each j. According
to (60.9), this choice of the importance function drops
the dependency on y, so that the dependency on y is
carried entirely by the distance function.

As was mentioned above, a new optimal sample can be found by maximizing the potential of enrichment. However, in order to avoid solving a maximization problem, the design space is heavily sampled with a Latin hypercube technique, e.g., with 100 times as many points as the dimension of the design space. Then, the Euclidean distance of each new sampled point $y_i$, $i = 1, \ldots, l$ (with $l$ equal to 100 times the design-space dimension), from each of the snapshot sites $w_k$, $k = 1, \ldots, M$, is computed and, for each $y_i$, the distance from the nearest snapshot $w_{\bar{k}}$ is stored as $\Delta(w_{\bar{k}}, y_i)$. This represents the distance function. Hence, for each new candidate $y_i$, the potential of enrichment can be written as $V_\phi(y_i) = \Delta(w_{\bar{k}}, y_i)\,\hat{I}_\phi(w_{\bar{k}})$. Finally, a new sample point is selected at $w_{new} = \arg\max_{y_i} V_\phi(y_i)$.
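The selection step just described (dense Latin hypercube sampling, nearest-snapshot distance, product with the snapshot influence, arg max) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the `per_dim` and `seed` arguments, and the hand-written Latin hypercube routine are hypothetical and are not taken from the chapter; the influence values are assumed to be already available, one per snapshot.

```python
import math
import random


def latin_hypercube(n_points, bounds, rng):
    """Return n_points Latin hypercube samples; bounds holds (low, high) per dimension."""
    columns = []
    for low, high in bounds:
        # One stratum per point, jittered inside the stratum, then shuffled.
        strata = [(k + rng.random()) / n_points for k in range(n_points)]
        rng.shuffle(strata)
        columns.append([low + s * (high - low) for s in strata])
    return [tuple(col[i] for col in columns) for i in range(n_points)]


def select_new_sample(snapshots, influence, bounds, per_dim=100, seed=0):
    """Pick the candidate maximizing (distance to nearest snapshot) x (its influence)."""
    rng = random.Random(seed)
    candidates = latin_hypercube(per_dim * len(bounds), bounds, rng)
    best, best_potential = None, -math.inf
    for y in candidates:
        # Index of the nearest snapshot site (Euclidean distance).
        k_near = min(range(len(snapshots)), key=lambda k: math.dist(snapshots[k], y))
        potential = math.dist(snapshots[k_near], y) * influence[k_near]
        if potential > best_potential:
            best, best_potential = y, potential
    return best, best_potential
```

With two snapshots of very different influence, the selected point falls in the region whose nearest snapshot is the influential one, which is the intended effect of weighting the space-filling distance by the importance function.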


60.5.3 Improvement of the Modal Coefficients


The second adaptive method is conceived to improve 
the quality of the RBF networks built on the POD 
modal coefficients. Two sub-strategies are proposed: 
the first aims at improving the worst modal coefficient, 
the second is designed to improve all coefficients simul- 
taneously. 


First Sub-Strategy 
This strategy is applied when one of the coefficient models $\{\alpha_1(w), \ldots, \alpha_M(w)\}$ exhibits low quality with respect to the others. Therefore, we first need to select a modal coefficient. The quality of the i-th modal coefficient $\alpha_i(w)$ is estimated by using a leave-one-out process and computing a weighted form of the Pearson correlation coefficient as


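$$\rho_\sigma(\alpha_i) = \frac{\sigma_i}{\sum_{k=1}^{M} \sigma_k} \times \frac{\mu(\alpha_i\,\alpha_i^{-j}) - \mu(\alpha_i)\,\mu(\alpha_i^{-j})}{\sqrt{\mu(\alpha_i^2) - [\mu(\alpha_i)]^2}\;\sqrt{\mu\big((\alpha_i^{-j})^2\big) - [\mu(\alpha_i^{-j})]^2}} , \qquad (60.12)$$

where $\mu$ is the arithmetic mean operator applied to a generic dataset $K = \{k_1, \ldots, k_M\}$

$$\mu(K) = \frac{1}{M} \sum_{j=1}^{M} k_j .$$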


Weighting the correlation coefficient is needed be- 
cause the POD modal coefficients are not equally 
important but they can be ranked according to the mag- 
nitude of the corresponding singular values. 

The modal coefficient with the lowest value of the weighted correlation coefficient is selected and tagged as $\hat{\imath}$. The importance function is then defined as

$$I_{\alpha_{\hat{\imath}}}(w_j) = \left| \alpha_{\hat{\imath}}(w_j) - \alpha_{\hat{\imath}}^{-j}(w_j) \right| ,$$

i.e., the absolute error of the $\hat{\imath}$-th model when leaving out the j-th snapshot. The choice of the distance function is the same as in the previous case. Hence, for each new candidate $y_i$, the potential of enrichment (with respect to the worst coefficient model) is defined as

$$V_{\alpha_{\hat{\imath}}}(y_i) = \Delta(w_{\bar{k}}, y_i)\, I_{\alpha_{\hat{\imath}}}(w_{\bar{k}}) .$$

Finally, a new sample point is selected at $w_{new} = \arg\max_{y_i} V_{\alpha_{\hat{\imath}}}(y_i)$.


The evaluation of the quantities $\alpha_i^{-j}(w)$ is very expensive as, for each j, a new model has to be computed. However, when using RBF network interpolators for POD coefficient models, the leave-one-out procedure can be performed at no extra cost by using the efficient formula provided by Rippa [60.35].
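Rippa's result can be illustrated with a small self-contained sketch. For an RBF interpolant with interpolation matrix $A$ and weights $c$ solving $A c = f$, the leave-one-out error at site $j$ equals $c_j / (A^{-1})_{jj}$, so a single inversion replaces $M$ refits. The Gaussian kernel, the `shape` parameter, and all function names below are assumptions made for illustration only; the chapter does not specify its RBF implementation. A brute-force refit is included solely to check the identity.

```python
import math


def gaussian_kernel_matrix(sites, shape=1.0):
    """Interpolation matrix A[j][k] = exp(-(shape * |x_j - x_k|)^2)."""
    return [[math.exp(-(shape * math.dist(a, b)) ** 2) for b in sites] for a in sites]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (adequate for small systems)."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def rippa_loo_errors(sites, values, shape=1.0):
    """Leave-one-out errors e_j = c_j / (A^-1)_jj from one matrix inversion."""
    n = len(values)
    inv = invert(gaussian_kernel_matrix(sites, shape))
    weights = [sum(inv[j][k] * values[k] for k in range(n)) for j in range(n)]
    return [weights[j] / inv[j][j] for j in range(n)]


def brute_force_loo_errors(sites, values, shape=1.0):
    """Reference: refit without sample j, then evaluate the error at site j."""
    errors = []
    for j in range(len(sites)):
        rest = [s for k, s in enumerate(sites) if k != j]
        rest_vals = [v for k, v in enumerate(values) if k != j]
        m = len(rest)
        inv = invert(gaussian_kernel_matrix(rest, shape))
        w = [sum(inv[a][b] * rest_vals[b] for b in range(m)) for a in range(m)]
        pred = sum(w[a] * math.exp(-(shape * math.dist(sites[j], rest[a])) ** 2)
                   for a in range(m))
        errors.append(values[j] - pred)
    return errors
```

The absolute values of these errors are exactly the quantities $|\alpha_i(w_j) - \alpha_i^{-j}(w_j)|$ needed by the importance functions of this section, computed once per coefficient model.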


Second Sub-Strategy 

This sub-strategy is used when the quality of all coeffi- 
cient models is comparable and it is very similar to the 
improvement of the modal basis. The importance func- 
tion is defined as a measure of the relative influence 
of each snapshot on the whole set of coefficient mod- 
els. This requires evaluating the absolute error of each 
modal coefficient model when removing the snapshots 
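one by one. The relative influence of the j-th snapshot on the whole set of coefficient models is defined as

$$\hat{I}_\alpha(w_j) = \frac{I_\alpha(w_j)}{\sum_{k=1}^{M} I_\alpha(w_k)} ,$$

where

$$I_\alpha(w_j) = \sum_{i=1}^{M} \sigma_i\, I_{\alpha_i}(w_j) = \sum_{i=1}^{M} \sigma_i \left| \alpha_i(w_j) - \alpha_i^{-j}(w_j) \right| . \qquad (60.13)$$

As in the previous cases, the potential of enrichment $V_\alpha$ (with respect to all POD coefficients) is defined as

$$V_\alpha(y_i) = \Delta(w_{\bar{k}}, y_i)\, \hat{I}_\alpha(w_{\bar{k}}) .$$

Finally, a new sample point is selected at $w_{new} = \arg\max_{y_i} V_\alpha(y_i)$.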
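As a worked illustration of (60.13), the singular-value-weighted sum of leave-one-out errors and its normalization over the snapshots can be written as below. The function name and the data layout (`loo_errors[i][j]` holding the leave-one-out error of the i-th coefficient model at the j-th snapshot) are assumptions chosen for this sketch.

```python
def relative_influence(sigma, loo_errors):
    """Normalized influence of each snapshot on the whole set of coefficient models.

    sigma[i]         : singular value of the i-th POD mode
    loo_errors[i][j] : alpha_i(w_j) - alpha_i^{-j}(w_j)
    """
    n_snapshots = len(loo_errors[0])
    # Raw influence: sum over modes of sigma_i * |leave-one-out error|.
    raw = [
        sum(s * abs(err[j]) for s, err in zip(sigma, loo_errors))
        for j in range(n_snapshots)
    ]
    total = sum(raw)
    return [r / total for r in raw]


# Two modes, two snapshots: raw influences are 0.4 and 0.6.
influence = relative_influence([2.0, 1.0], [[0.1, -0.3], [0.2, 0.0]])
```

The resulting values sum to one and can be multiplied by the nearest-snapshot distance of each candidate to obtain the potential of enrichment.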


60.6 Aerodynamic Shape Optimization by Surrogate Modeling and Evolutionary Computing


The POD surrogates as well as the adaptive sampling techniques described in Sects. 60.3 and 60.5 have been included within an evolutionary optimization loop. The aim is to assess POD-based surrogate models as fitness function evaluators in a shape optimization problem in transonic flow. Several approaches are proposed, differing in the key ingredients of the methodology: the construction of the POD model (full/zonal approach), the strategy chosen to compute the training sample (a-priori, auto-adaptive), and the strategy to exploit the optimization results (single optimization, iterative optimization with real-time updating). The optimization approaches share the same target, i.e., to improve the aerodynamic performance of the scaled RAE 2822 airfoil. The surrogate-based shape optimization process consists of an a-priori design of experiments module (e.g., LHS), the CST parameterization module, an in-house developed automatic mesh generator, the ZEN CFD flow solver [60.36], the POD/ROM module, which also encloses the adaptive sampling techniques, and the in-house adaptive genetic algorithm optimization library ADGLIB [60.15, 37-39].


60.6.1 Problem Definition 


The geometry parameters, the design variable ranges, 
the parameterization technique, and the design point 
from Sect. 60.4.1 will be used. Here, we define the 
airfoil shape optimization problem in terms of objec- 
tive/constraint function specifications as 


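$$
\begin{aligned}
&\underset{w \in D_w \subset \mathbb{R}^{16}}{\text{minimize}} && -\frac{C_l}{C_d} \\
&\text{subject to} && \left(\frac{t}{c}\right)_{\max} = 0.14 , \quad C_l \ge 0.5 , \\
& && C_m \ge -0.05 , \quad C_m \le 0.05 .
\end{aligned}
\qquad (60.14)
$$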


In other words, the goal is to maximize the aerodynamic efficiency $C_l/C_d$ while keeping a minimum level of lift generation ($C_l \ge 0.5$) and pitching moment controllability ($|C_m| \le 0.05$). Moreover, a geometric constraint is added in order to set the airfoil maximum thickness-to-chord ratio $t/c$ at 14%: this constraint is implicitly treated within the parameterization. The constraint functions are actually treated as quadratic penalties, hence the constrained optimization is transformed into the following unconstrained problem

$$
\begin{aligned}
\underset{w \in D_w \subset \mathbb{R}^{16}}{\text{minimize}} \quad -\frac{C_l}{C_d} &+ K[\min(C_l - 0.5, 0)]^2 \\
&+ K[\min(C_m + 0.05, 0)]^2 \\
&+ K[\min(-C_m + 0.05, 0)]^2 ,
\end{aligned}
\qquad (60.15)
$$

where $K$ is a constant weight (equal to $10^4$) which amplifies the relative importance of possible constraint violations. For instance, a unit penalty will be applied to the objective function in the case of an airfoil having a pitching moment of +0.06, since the violation of 0.01 squared and multiplied by $10^4$ gives 1.
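The penalized objective (60.15) is simple enough to be written out directly. The sketch below is a minimal transcription, assuming the aerodynamic coefficients are already available from a solver or surrogate; the function and argument names are illustrative and do not come from the chapter.

```python
K = 1.0e4  # constant penalty weight of (60.15)


def penalty(cl, cm, weight=K):
    """Quadratic penalties for C_l >= 0.5 and -0.05 <= C_m <= 0.05."""
    return weight * (
        min(cl - 0.5, 0.0) ** 2      # lift below the minimum level
        + min(cm + 0.05, 0.0) ** 2   # pitching moment below -0.05
        + min(-cm + 0.05, 0.0) ** 2  # pitching moment above +0.05
    )


def penalized_objective(cl, cd, cm, weight=K):
    """Negative aerodynamic efficiency plus the constraint penalties."""
    return -cl / cd + penalty(cl, cm, weight)


# A pitching moment of +0.06 violates the bound by 0.01: penalty = 1e4 * 0.01**2.
unit_penalty = penalty(cl=0.6, cm=0.06)
```

Because each term is zero inside the feasible region and grows quadratically outside it, small violations near a constraint boundary cost little, which is what makes the resulting problem less stiff than a strictly constrained one.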


60.6.2 Optimization Strategies and Setup 


In Sect. 60.4, three POD-based surrogate models were introduced, trained, and validated against an independent dataset. Here, we propose a set of numerical experiments to assess their potential as fitness evaluators and their suitability for an evolutionary optimization problem. Several optimization approaches were set up and tested in order to cover, as far as possible, all the issues concerning surrogate/ROM training and prediction. Table 60.4 summarizes the characteristics of each optimization in terms of: fitness evaluator, optimization algorithm, POD energy threshold (when using POD as surrogate), high-fidelity computational budget, i.e., the total number of computations with the ZEN RANS solver during the optimization process, number of a-priori LHS samples $M_{apr}$, number of adaptively added samples $M_{adp}$, and number of surrogate-based optima $M_{opt}$, which are iteratively added to the ensemble database. It must be noted that not all the optimization strategies use POD as a surrogate; in particular, optimizations tagged as Kriging-driven Genetic Algorithm (KGA) and EGO have been performed by using a Kriging method as the fitness evaluator and the EGO algorithm [60.17], based on Kriging and expected improvement evaluation, to compute new optimal samples. The EGO algorithm represents a modern state-of-the-art method in surrogate-based global optimization. In the following, the terms truth or true will indicate the results obtained with the Zonal Euler-Navier-Stokes (ZEN) CFD solver, as it is adopted as the reference high-fidelity simulation tool. Each optimization method is described in detail here:


• DGA (Direct Genetic Algorithm): a plain, brute-force genetic optimization with the full high-fidelity solver ZEN called as fitness evaluator.

• FPGA1 (Full POD Genetic Algorithm 1): a surrogate-based optimization where the aerodynamic analysis is carried out through a POD model built on the complete flow field of a set of 180 initial samples. This case corresponds to the POD-driven standalone mode and the surrogate POD evaluator is the one presented as SM1. No zonal approach is used. The POD energy content is 85%. The snapshot size N is 201 488.

• FPGA2 (Full POD Genetic Algorithm 2), FPGA3 (Full POD Genetic Algorithm 3): same as FPGA1, but the POD models are defined by increasing the energy content (95 and 99%, respectively).

• MPGA1 (Mixed-flow POD Genetic Algorithm 1): a surrogate-based optimization where the zonal CFD/POD model is trained on the initial design space sampling (180 snapshots) and adopted as the objective function evaluator throughout the optimization cycle. The FOM domain is defined at a distance d = 1.25 chord length from the airfoil's leading edge. The POD model used here has already been validated as SM3 in previous sections. The POD energy threshold is set at 95%. The snapshot size is 75 232.

• MPGA2 (Mixed-flow POD Genetic Algorithm 2): same as MPGA1, but the POD energy content is increased up to 99%.

• KGA: a surrogate-based optimization where a Kriging meta-model, built on the objective function, is coupled to the genetic optimization. Here, the DAKOTA package [60.32] is used both for optimization process control and algorithm capabilities. The John Eddy Genetic Algorithm (JEGA) library [60.40] was used for optimization purposes. In particular, the single-objective genetic algorithm (SOGA) was used to perform optimization on a single objective function with general constraints. Kriging is initially trained on the 180 samples dataset. Then, a classical surrogate-based iterative optimization scheme is performed, consisting in building the surrogate, optimizing the surrogate objective, evaluating the minimizers with the truth (CFD) model, and rebuilding the surrogate. In the present optimization, 10 SBO iterations are performed.

• EGO: the key idea in EGO [60.17-19] is to exploit the Gaussian process capability to provide both the prediction at a new input location as well as the uncertainty associated with that prediction.

• AFPGA1 (Adaptive Full POD Genetic Algorithm 1), AFPGA2 (Adaptive Full POD Genetic Algorithm 2), AFPGA3 (Adaptive Full POD Genetic Algorithm 3): the surrogate model employed is the same as FPGA3, but the training method is different and an adaptive sampling strategy is added. In particular, it was decided to follow a different approach: the aim was to check whether, with a limited computational budget, better results can be obtained by adaptively training the POD model. Hence, the surrogate training phase was split into three contributions: an a-priori contribution, sampling the design space with the LHS technique and producing $M_{apr}$ samples, an iterative, adaptive sampling aimed at improving the modal basis and enriching the ensemble dataset with $M_{adp}$ samples, and a series of $M_{opt}$ genetic optimizations, each producing an optimal candidate to update the ensemble and recompute the surrogate. The last phase is also called real-time updating. The three strategies differ in the relative amount of these three contributions, as highlighted in Table 60.4, keeping the total computational budget fixed. The POD energy content is 99%. The snapshot size N is 201 488.

• AMPGA1: the surrogate model employed is the SM2. The FOM/ROM interface is defined at d = 0.35 chord length from the airfoil's leading edge. However, the training method is different as it embeds a-priori, auto-adaptive, and optimal samples as described earlier. The POD energy content is 99%. The snapshot size N is 91 792.

• AMPGA2: the surrogate model employed is again the SM3, but it differs from MPGA2 because the training method embeds a-priori, auto-adaptive, and optimal samples as described before. The POD energy content is 99%. The snapshot size N is 75 232.


Table 60.4 Optimization approaches

Opt tag   Fitness evaluator   Optimizer     POD energy (%)   Budget hi-fi   M_apr   M_adp   M_opt
DGA       ZEN                 ADGLIB        -                9600           0       0       0
FPGA1     standalone POD      ADGLIB        85               180            180     0       0
FPGA2     standalone POD      ADGLIB        95               180            180     0       0
FPGA3     standalone POD      ADGLIB        99               180            180     0       0
MPGA1     zonal POD           ADGLIB        95               180            180     0       0
MPGA2     zonal POD           ADGLIB        99               180            180     0       0
KGA       Kriging             Dakota SOGA   -                190            180     0       10
EGO       Kriging             Dakota EGO    -                553            153     400     0
AFPGA1    standalone POD      ADGLIB        99               96             32      16      48
AFPGA2    standalone POD      ADGLIB        99               96             16      32      48
AFPGA3    standalone POD      ADGLIB        99               96             4       44      48
AMPGA1    zonal POD           ADGLIB        99               112            8       56      48
AMPGA2    zonal POD           ADGLIB        99               96             8       40      48


The optimization setup is the same for all the approaches, except for AMPGA1 and AMPGA2. A population of 64 individuals is evolved for 150 generations with an 80% bit crossover rate and a 2% bit mutation rate. The genetic evolution is repeated every time a new optimal sample has to be added to the ensemble. Hence, a total of 9600 evaluations (64 individuals times 150 generations) is required for each optimization process. The setups of AMPGA1 and AMPGA2 differ slightly because the surrogate models adopted are more expensive (Fig. 60.7a). In order to increase the frequency of model updating stages, a population of 48 individuals is evolved for just 10 generations and the process is repeated 48 times to iteratively provide new optimal samples. The new feature is that each optimization step is a restart of the previous one with re-evaluation of the population candidates, as the surrogate model has meanwhile been updated. In other words, the idea is to update the surrogate model more frequently (after just 10 genetic algorithm (GA) generations instead of 150), even if with smaller amounts of improvement (10 generations are not enough to converge the GA).


Fig. 60.8a,b Non-adaptive POD-driven optimization history


By looking at the details of the SBO approaches 
described so far, it seems quite natural to divide 
them into two main classes: non-adaptive (FPGAx, 
MPGAx), i.e., those without any adaptation/real-time 
updating, and adaptive optimizations (KGA, EGO, 
AFPGAx, AMPGAx). Consequently, the presenta- 
tion of the results obtained will follow this logical 
sequence. 


60.6.3 Non-Adaptive Optimization Results 


Figure 60.8 shows the convergence history of the three 
FPGA optimizations compared to the plain DGA (solid 
black line) on the left and the two MPGA optimization 
histories on the right. The graphs show the sequence 
of the best candidates for each generation. It must be 
pointed out that while the DGA predictions (solid black 
lines) are obtained with the CFD solver, the POD-based 
predictions (dash, dotted, and dash-dotted lines) are the 
surrogate ones. For example, the dash-dotted line does 
not indicate that FPGA1 reached objective levels signif- 
icantly lower than DGA, but simply that the predicted 
values of the airfoil performances were significantly 
overestimated. The plot clearly highlights that what- 
ever the energy content, the full-POD approximation 
is not able to match the true data during the search 
process. Moreover, the general trend is towards an over- 
estimation of the aerodynamic characteristics, which 
leads to lower values of the objective function. On the 
other hand, the MPGA model agreement with the CFD 
progress is very satisfying, both in terms of trends and 
accuracy. 


60.6.4 Adaptive Optimization Results 


Figure 60.9a shows the convergence history of the iterative SBSO (which stands for surrogate-based shape optimization) KGA run. As was already mentioned, it is made of ten sequential surrogate optimizations; at the end of each of them, the optimal candidate is re-evaluated with the CFD solver and injected in the training set, so that an updated surrogate is available. The left-hand figure compares the surrogate and true prediction of the optimal candidate at each iteration. After about 6-7 SBO iterations, the Kriging model has been improved enough to predict very closely to the CFD solver. In Fig. 60.9b, the convergence history corresponding to the tenth SBO iteration is superimposed with the DGA run: a noticeable agreement is found, both in the initial drop in the fitness function and in the final plateau. Among the SBO minimizers, the ninth iteration shows the lowest true objective function value, so that it will be considered as the actual KGA optimum.

Fig. 60.9a-c Kriging-based optimization

Figure 60.9c reports the convergence history of the 
EGO optimization. Dark gray circles depict the initial 
DOE sampling (153 candidates), while light gray cir- 
cles denote the subsequent 400 candidates found by 
minimizing the expected improvement. The graph also 
reports the expected improvement values in gray white- 
filled circles and logarithmic scale (right axis). It is 
clearly evident how the progressive decrease of the EIF 
produces a better quality of the Kriging model, which 
in turn results in a minimization of the true objective 
function. 

The convergence histories of the AFPGA1, AFPGA2, and AFPGA3 optimizations are reported in Fig. 60.10a together with the objective function values computed on the training points. In the plot, each point represents a single high-fidelity evaluation, the squares depict the a-priori and adaptive training sites, while the circles connected with lines represent the sequence of optima from the $M_{opt}$ GA optimizations. It is fairly evident that adaptive sampling is often helpful, as it allows us to find sub-optimal solutions even before optimization (see AFPGA1 and AFPGA2). On the other hand, the somewhat disappointing behavior in the optimization step is due to the fact that the surrogate underestimates the objective function, thus pushing the surrogate-based optimizer to explore uninteresting design space regions. In particular, results show that the more adapted the initial sampling, the smaller the underestimation. Hence, the ratio $M_{apr}/M_{adp}$ should be kept low. However, another important feature is related to the AFPGA3 method; it shows that by lowering the ratio $M_{apr}/M_{adp}$ too much (down to 0.09), the performance of the method deteriorates, as the final AFPGA3 optimum is worse than the previous ones. Indeed, leaving too much room for adaptive criteria seems to produce a sampling with very poor exploratory capabilities.
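The sample counts of Table 60.4 make the ratio discussed here easy to verify. The short sketch below recomputes $M_{apr}/M_{adp}$ for the adaptive POD runs and checks that the three contributions add up to the listed high-fidelity budget; the dictionary layout and function names are illustrative choices, while the numbers are those of Table 60.4.

```python
# (M_apr, M_adp, M_opt, hi-fi budget) for the adaptive POD runs of Table 60.4
RUNS = {
    "AFPGA1": (32, 16, 48, 96),
    "AFPGA2": (16, 32, 48, 96),
    "AFPGA3": (4, 44, 48, 96),
    "AMPGA1": (8, 56, 48, 112),
    "AMPGA2": (8, 40, 48, 96),
}


def apriori_to_adaptive_ratio(m_apr, m_adp):
    """Ratio between a-priori and adaptively added training samples."""
    return m_apr / m_adp


def budget_is_consistent(m_apr, m_adp, m_opt, budget):
    """The hi-fi budget is the sum of the three training contributions."""
    return m_apr + m_adp + m_opt == budget


ratios = {tag: apriori_to_adaptive_ratio(r[0], r[1]) for tag, r in RUNS.items()}
for tag, ratio in ratios.items():
    print(f"{tag}: M_apr/M_adp = {ratio:.2f}")
```

The ratio ranges from 2.0 for AFPGA1 down to about 0.09 for AFPGA3, which is the value quoted in the text for the run where too little a-priori sampling degrades the result.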

These considerations give a helpful hint about the right combination of a-priori and adaptive sampling: the ratio $M_{apr}/M_{adp}$ should be kept between 0.1 and 0.5, which is in line with the value of the EGO and goal-seeking methods proposed by Jones [60.19]. This information is exploited in tuning the parameters for the AMPGA1 and AMPGA2 optimizations. Figure 60.10b shows the true objective functions of the training samples and of the sequence of optimal candidates. Even if AMPGA1 performs quite well, it exhibits similar characteristics to the AFPGAx optimizations. On the other hand, the AMPGA2 optimum outperforms the optima seen so far and, as will be clear in the next section, it is the only candidate to get very close to the truth optimum, i.e., the DGA optimum.

Fig. 60.10a,b Convergence history of POD-based optimization


Table 60.5 Optimal candidates, objective function breakdown

Opt. run ID   Truth obj.    Predicted obj.   Penalty       C_l     C_d      C_m
DGA           [illegible]   -51.18           1.025         0.619   0.0118   -0.0602
MPGA1         -48.70        -50.86           [illegible]   0.578   0.0116   -0.0606
FPGA3         -38.33        -73.45           0.608         0.553   0.0142   -0.0578
KGA           -47.65        -51.94           0.585         0.612   0.0127   -0.0576
EGO           -49.71        -49.71           0.530         0.618   0.0123   -0.0573
AFPGA1        -49.24        -47.14           1.12          0.635   0.0126   -0.0606
AFPGA2        -49.20        [illegible]      0.551         0.631   0.0127   -0.0574
AFPGA3        -48.13        -47.88           1.29          0.583   0.0118   -0.0614
AMPGA1        -48.58        -44.61           0.567         0.576   0.0117   -0.0575
AMPGA2        -51.13        -50.31           0.947         0.612   0.0117   -0.0597
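The entries of Table 60.5 can be cross-checked against the penalized objective: the truth objective should equal $-C_l/C_d$ plus the penalty, and the penalty should equal $10^4$ times the squared pitching-moment violation. The sketch below performs this check for the rows whose entries are legible; the data layout and function names are assumptions made for illustration, and the agreement is only expected to within the rounding of the tabulated coefficients (the objective is quite sensitive to the fourth decimal of $C_d$).

```python
K = 1.0e4  # penalty weight of the penalized objective

# (truth objective, penalty, C_l, C_d, C_m) as listed in Table 60.5
ROWS = {
    "FPGA3": (-38.33, 0.608, 0.553, 0.0142, -0.0578),
    "KGA": (-47.65, 0.585, 0.612, 0.0127, -0.0576),
    "EGO": (-49.71, 0.530, 0.618, 0.0123, -0.0573),
    "AFPGA1": (-49.24, 1.12, 0.635, 0.0126, -0.0606),
    "AFPGA2": (-49.20, 0.551, 0.631, 0.0127, -0.0574),
    "AFPGA3": (-48.13, 1.29, 0.583, 0.0118, -0.0614),
    "AMPGA1": (-48.58, 0.567, 0.576, 0.0117, -0.0575),
    "AMPGA2": (-51.13, 0.947, 0.612, 0.0117, -0.0597),
}


def moment_penalty(cm, weight=K):
    """Quadratic penalty for violating |C_m| <= 0.05."""
    return weight * max(abs(cm) - 0.05, 0.0) ** 2


def recomputed_objective(cl, cd, penalty):
    """Negative aerodynamic efficiency plus the tabulated penalty."""
    return -cl / cd + penalty


objective_gap = {
    tag: abs(recomputed_objective(cl, cd, pen) - truth)
    for tag, (truth, pen, cl, cd, cm) in ROWS.items()
}
penalty_gap = {
    tag: abs(moment_penalty(cm) - pen)
    for tag, (truth, pen, cl, cd, cm) in ROWS.items()
}
```

For every listed row the lift constraint is inactive ($C_l > 0.5$), so the whole penalty comes from the pitching moment, consistent with the remark that all optima slightly violate that constraint.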


60.6.5 Optima Analysis 


This section gives details about the optima computed with each of the methodologies presented. In the following, ten optimal candidates will be considered to assess the optimization results, namely the optima from runs DGA, FPGA3, MPGA1, KGA, EGO, AFPGA1, AFPGA2, AFPGA3, AMPGA1, and AMPGA2. FPGA3 and MPGA1 have been selected among the FPGAx and MPGAx optima because they are the closest to the high-fidelity DGA optimum. The objective function breakdown for each optimal candidate is summarized in Table 60.5. The table reports both the true data, obtained by re-computing each design with the CFD solver, and the predicted objective function as calculated by the surrogate model. None of the optima satisfies the pitching moment constraint because the quadratic penalty function and its weight, chosen in the problem definition, purposely do not enforce this constraint strictly, so as to have a less stiff optimization problem. Indeed, getting precisely within the constraint boundaries would probably have penalized the aerodynamic efficiency, i.e., the actual objective function, too much, while applying small penalties near a constraint boundary gives more flexibility to the search for the optimal design.


Non-Adaptive Optima
Among the non-adaptive methods (here KGA is considered as non-adaptive for the sake of comparison), the optimal designs coming from MPGA1 and KGA are the closest to the plain (DGA) one in terms of global performance. The MPGA1 optimum matches the DGA constraint violation almost perfectly, while the KGA design performs even better on pitching moment but at the cost of a slightly lower aerodynamic efficiency. The FPGA3 design, although using 75 POD modes, does not belong to an optimal sub-set but exhibits a small penalty. The best surrogate solution is MPGA1, where a weak shock appears on the suction side but at a lower lift level. Indeed, the optimal leading edge radius is almost twice the DGA value, and this feature causes an over-expansion on the suction side, which in turn makes the shock wave occur more upstream and more strongly. The Kriging-based best candidate shows a reduced rear loading to limit nose-down pitching moment and trim drag associated with the rear location of the center of pressure. This beneficial feature is counterbalanced by the lift production increase in the fore airfoil part and, consequently, by a more pronounced pressure jump across the shock wave.

Fig. 60.11a,b Computed optima in the surrogate vs truth objective plane


Adaptive Optima
In order to highlight the undertaken improvement path, Fig. 60.11 reports a correlation plot where the whole set of optima is depicted in the surrogate objective vs. true objective plane. Two different zooming levels are set, as they reflect the non-adaptive and adaptive process: the FPGA1, FPGA2, and FPGA3 optima show very large discrepancies between true value and surrogate prediction, hence they are located very far from the line of perfect fit. However, a trend is observable: by increasing the POD energy content (i.e., passing from FPGA1 to FPGA3), the best candidate gets closer to the true optimum. By looking at the top part of the figure, a clustering of the remaining optima is observable, so that a closer look is offered in the bottom figure for better understanding. Among the adaptive optima, AMPGA2 and EGO produce the best results and demonstrate the benefits of suitably coupling the zonal approach and an intelligent design space sampling. Indeed, these optimal candidates are the closest to the target point in the sense of the Euclidean distance in the objective plane. From a stricter aerodynamic point of view, turning on the adaptive criteria brought the quality of the optimization result to a level that is very close to the expectations of the designer, as a shockless profile featuring a gentle re-compression on the suction side represents the golden goal for the proposed optimization problem.


60.7 Conclusions


The aim of the present book chapter was to review and investigate ad-hoc computational techniques to ease the solution of complex aerodynamic shape optimization problems such as those commonly encountered in aerospace design at industrial level. Among the various approaches that are the subject of research investigations, we chose to focus on ad-hoc surrogate methods. In particular, we demonstrated that the well-known proper orthogonal decomposition approach is not adequate to provide reliable predictions in peculiar aerodynamic conditions like transonic flow and when the boundary of the computational domain changes, as in shape optimization. We proposed a zonal approach to de-couple the strong non-linearities occurring near the body-wall from the POD approximation. This zonal approach proved to give reliable results at a reduced computational cost compared to the full CFD simulation. Furthermore, we showed that the zonal approach can give an accurate approximation of the true optimum when trained with specifically designed adaptive sampling techniques. The latter have been purposely conceived to improve the POD model machinery, namely the basis vectors and coefficients. By using such an intelligent design of experiment method, the high-fidelity computational budget can be further reduced and the overall performance of the design loop increases. The beneficial effects of this approach have been illustrated by a comparison of several surrogate-based optimization processes on the shape design of a two-dimensional airfoil. The extension of the methodology to complex three-dimensional problems is straightforward and under way. Indeed, one of the main advantages of the proposed methodology is its relative insensitivity to the curse of dimensionality of the design parameter space. On the other hand, the larger snapshot size
required by three-dimensional CFD flow fields, where 
millions of unknowns may be handled, does not repre- 
61. Knowledge Discovery in Bioinformatics


Julie Hamon, Julie Jacques, Laetitia Jourdan, Clarisse Dhaenens 


Biomedical research progresses rapidly, in par- 
ticular in the area of genomic and postgenomic 
research. Hence many challenges appear for bio- 
statistics and bioinformatics to deal with the large 
amount of data generated. After presenting some 
of these challenges, this chapter aims at pre- 
senting evolutionary combinatorial optimization 
approaches proposed to deal with knowledge dis- 
covery in bioinformatics. Therefore, the chapter 
will focus on three main tasks of data mining (as- 
sociation rules, feature selection, and clustering) 
widely encountered in bioinformatics applications. 
For each of them, a description of the task will 
be given as well as information about their uses 
in bioinformatics. Then, some evolutionary approaches proposed to cope with such a task will be presented and discussed.

61.1 Challenges in Bioinformatics


Biomedical research progresses rapidly, in particular in the area of genomic and postgenomic research. Hence many challenges appear for biostatistics and bioinformatics to deal with the large amount of data generated. These data, related to the sequencing of the genome, may deal, for example, with the identification of more than 1 million single nucleotide polymorphisms (SNPs), corresponding to genetic variations, that can be used to carry out genome-wide association studies (GWAS). Analyzing such data requires advanced methods able to deal with such a large amount of information and with its specificities. This is the reason why knowledge discovery approaches have been proposed to either:


1. Extract interesting rules or 
2. Reduce the dimensionality of the data or 
3. Classify/cluster data. 


All these knowledge discovery tasks have been addressed by several communities: statistics, machine learning, and combinatorial optimization. The latter is the subject of this chapter, and a recent review reports synergies between operations research and data mining [61.1]. In this chapter, we focus on evolutionary combinatorial optimization and see how it may be used to extract knowledge for bioinformatics.

Many problems arise in bioinformatics. In order to illustrate this chapter, three types of applications will be mainly used. They are described hereafter:


@ Microarray — Gene expression data: A typical 
bioinformatics application requiring knowledge dis- 
covery deals with DNA microarray data analysis. 
Indeed, DNA microarray experiments are of great 
interest and importance for biologists; thanks to 
their ability to simultaneously measure the expres- 
sions and interactions of thousands of genes. Such 
experiments are used to point out, for example, 
genes of predisposition to some diseases such as 
diabetes, cancer, etc. These experiments are gener- 
ating huge amounts of data that need to be analyzed. 
Those data are mainly represented in gene ex- 
pression matrix. Some experiments add the time 
parameter (to analyze the evolution of the expres- 
sions after a stress, for example) and report results 
at different time points. This special case is some- 
times called 3-D-microarray (three-dimensional mi- 
croarray). For microarray data analysis several data 
mining approaches have been proposed (associa- 
tion rule discovery, feature selection, clustering, and 
bi-clustering) [61.2] and benchmarks are available 
to compare efficiency of methods. Hence it will 
provide a good illustrative application along this 
chapter. As the number of genes to consider is huge 
many heuristics, and in particular evolutionary algo- 
rithms, have been proposed to deal with such data. 

• Genome-wide association studies: Another
interesting approach to find genetic susceptibility for
disease is to track genetic variations. Indeed, as
indicated by Moore et al. in their recent study of
Bioinformatics challenges for genome-wide
association studies (GWAS), the sequencing of the
human genome has made it possible to identify more
than 1 million SNPs (genetic variations) across the
genome that can be used to carry out GWAS in order to
reveal the genetic basis of disease susceptibility
[61.3]. The first approaches used to deal with these
massive amounts of GWAS data, mainly based on
biostatistics, have enabled the discovery of new
associations. However, as such approaches consider
only one SNP at a time and most of the time ignore the
genomic and environmental contexts, more complex
approaches that consider genotype-phenotype
relationships have to be proposed. Given the large
number of SNPs to consider and the complexity of the
relationships to discover, the knowledge discovery
paradigm has been used to deal with such data, and
optimization approaches have been proposed.

• Protein analysis: Many identified proteins are still
not completely known. For example, their function may
still be unknown (even if the sequence is known).
However, the knowledge of their functions is crucial
for the development of new drugs. Hence, automated
function prediction is an active research field, and
computational techniques that use high-throughput
experimental data (protein and genome sequences,
protein interaction networks, phylogenetic profiles,
etc.) have been developed. Once again, such
experiments produce a large amount of data that needs
to be analyzed.


Considering this variety of applications, this chapter 
aims at presenting evolutionary combinatorial opti- 
mization approaches proposed to deal with knowledge 
discovery in bioinformatics. Therefore, the chapter will 
focus on three main tasks of data mining widely en- 
countered in bioinformatics applications. These tasks 
are: association rules, feature selection, and clustering 
(unsupervised classification). For each of them, a
description of the task is given, as well as information
about its uses in bioinformatics. Then, some
evolutionary approaches proposed to cope with the task
are presented and discussed. A table provides an
overview of these approaches and serves as a guideline
for the reader to know which type of approach to use in
a specific context.


61.2 Association Rules by Evolutionary Algorithm in Bioinformatics 


61.2.1 Association Rules Discovery 


Task Description 
The problem of discovering association rules was first 
formulated in [61.4] and was called the market-basket 


problem. The initial problem was the following: given 
a set of items and a large collection of sales records, 
which consist of a transaction date and the items pur- 
chased in that transaction, the task is to find significant 
relationships between the items contained in different 
transactions. Since this first application, many other
problems, in particular in bioinformatics, have been 
studied with association rules that may be defined in 
a more general way. Let us consider a database com- 
posed of transactions (records or objects) described ac- 
cording to several — maybe many — attributes (features 
or columns). Association rules provide a very simple 
(but useful) way to present correlations or other rela- 
tionships among attributes (features) expressed in the 
form A => C, where A is the antecedent part (condition) 
and C the consequent part (prediction). A and C are 
sets of attributes that are disjoint. The best-known
algorithm to mine association rules is A-priori,
proposed by Agrawal and Srikant [61.5]. This two-phase
algorithm first finds all frequent item sets (sets of
items, or attributes, that often occur together within
transactions), that is, those that have at least a given
minimum level of support (which measures the frequency
of an item set or of a rule). This is done via an
efficient search exploiting the downward closure
property of support: every subset of a frequent item
set is itself frequent. In the second phase, the rules
that reach a given minimum level of confidence are
generated from these frequent item sets. Many
improvements upon the initial method, as well as
efficient implementations (including parallel
implementations), have been proposed
to be able to deal with very large databases [61.6-8].
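
As an illustration of the two measures used by A-priori, the following minimal Python sketch computes the support of an item set and the confidence of a rule A => C on a toy database. The function names, the gene identifiers, and the data are assumptions made for the example only; they do not come from the works cited above.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[Transaction]) -> float:
    """Estimate of P(C | A), computed as support(A and C) / support(A)."""
    s_a = support(antecedent, transactions)
    if s_a == 0.0:
        return 0.0  # the antecedent never occurs, so the rule is never triggered
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, transactions) / s_a


# Hypothetical toy data: each transaction lists the genes flagged as
# over-expressed in one sample.
transactions = [
    frozenset({"g1", "g2", "g3"}),
    frozenset({"g1", "g2"}),
    frozenset({"g1", "g3"}),
    frozenset({"g2", "g3"}),
]

print(support({"g1", "g2"}, transactions))       # 2 of 4 transactions
print(confidence({"g1"}, {"g2"}, transactions))  # 2 of the 3 containing g1
```

The downward closure property can be checked on this example: the support of {g1, g2} cannot exceed the support of {g1}, which is why an item set can be discarded as soon as one of its subsets is infrequent.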

We note that a specific case of rule mining deals with 
classification rules where the consequent is the same for 
every rule. This may be seen as a straightforward
classification task; however, the models and methods
used for this are close to those used more generally in
rule mining; hence it will be considered in this section.

The task of discovering effective association rules 
may be seen as a combinatorial optimization problem, 
as rules are combinations of attributes. Each attribute 
may participate in the rule in the antecedent or the
consequent part. Each attribute may have several values
that have to be checked. As the number of attributes
may be very large (up to several thousands), the number
of possible rules (choice of the attributes that
participate in the rule and of their values) may be very
large. Therefore, efficient methods (heuristic
approaches and, in particular, evolutionary approaches)
are needed.
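
To give an order of magnitude, the short Python sketch below counts the rules A => C that can be built from n items when A and C are nonempty and disjoint. It is a simplified setting chosen for illustration (items are either present or absent, so attribute values are ignored), and the function name is ours.

```python
def count_rules(n_items: int) -> int:
    """Number of rules A => C with A and C nonempty and disjoint.

    Each item is placed in A, in C, or left out (3**n assignments);
    the assignments with A empty (2**n) or C empty (2**n) are removed,
    and the one with both empty, removed twice, is added back.
    """
    return 3 ** n_items - 2 ** (n_items + 1) + 1


for n in (2, 10, 100):
    print(n, count_rules(n))
```

Even in this simplified setting, the count grows as 3^n, so exhaustive enumeration is out of reach for the thousands of attributes found in microarray data.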


Use in Bioinformatics 
In their survey, Atluri et al. present different types of as- 
sociation patterns and discuss some of their applications 
in bioinformatics [61.9]. They indicate that associa- 
tion rules discovery has not been widely used yet in 
bioinformatics except to deal with microarray data and 
data on genetic variations (SNP data), for which several
works exist. Their feeling is that association rules
have been underutilized in bioinformatics, and they
propose, in their article, to give hints on how to
exploit the potential benefits of such an approach to
deal with protein function prediction and, in
particular, to address the noise and incompleteness
issues of current protein interaction network data.

In addition, association rules discovery allows the 
integration of external biological information with gene 
expression data. For example, Carmona-Saez et al. pro- 
pose an approach based on co-occurrence patterns, 
that integrates gene annotations and expression data 
to discover intrinsic associations among both data 
sources [61.10]. 


61.2.2 Evolutionary Approaches 
for Association Rules 
in Bioinformatics 


Motivations 
Association rules are a very general model and may 
overcome some drawbacks of other classical knowledge 
discovery tasks such as classification. For example, 
considering microarray data in which relationships
between genes are searched for, using classification
forces a gene that participates in several relations to
be classified in a single group. Classification also has
difficulty pointing out relations between genes
belonging to the same group. Finally, classification is
made according to the whole set of experiments, which
does not allow relationships between genes to be
exhibited in a subset of conditions only. Association
rules may overcome these drawbacks by providing
relationships between genes that occur in certain
conditions.

However, one of the drawbacks of classical association
rules discovery approaches (the A-priori algorithm, for
example) is the role played by the support measure.
Indeed, allowing low-support rules to be identified
(rare rules may carry very important information in the
context of bioinformatics) generates a huge number of
rules that are difficult to interpret. Hence, other
types of approaches, using different quality measures,
have to be proposed.

For example, Khabzaoui et al. proposed to analyze
microarray data with an association rule-based
technique. They modeled this problem as
a multiobjective combinatorial optimization problem
(which allows quality criteria other than the support
to be used) and solved it using an evolutionary
algorithm based on a genetic algorithm, for which
specific mechanisms (mutation and crossover operators,
elitism, and so on) were designed [61.11]. In order to
improve the quality of the rules obtained, cooperative
approaches have also been proposed [61.12].


Overview 

Firstly, this section introduces common approaches
used in evolutionary rule mining, such as learning
classifier systems (LCSs), the rough sets approach, and
genetic programming. Secondly, the Pittsburgh and
Michigan rule designs are detailed. Finally,
implementation details of genetic algorithms for rule
mining with applications in bioinformatics are
presented, and a summary table is provided.


Some Classical Approaches. Learning classifier systems
(LCSs) come from the machine learning community
and are useful for classification tasks using
classification rules. LCSs evolve a population of
classifiers (decision trees, rules, or rule sets) using
a genetic algorithm and a credit assignment module that
rewards good classifiers. A more detailed introduction
to LCSs can be found in [61.13]. Some bioinformatics
applications have been realized with the GAssist
algorithm [61.14] and its successor BioHEL
(bioinformatics-oriented hierarchical evolutionary
learning) [61.15].

The rough set approach consists in finding
approximation sets of features: a lower set, whose
features allow us to identify objects that certainly
belong to the approximated set, and an upper set, whose
features describe objects that possibly belong to the
approximated set. Rules can be generated from these
resulting sets. More complete information about rough
set theory and applications can be found in [61.16].
In [61.17, 18], the Rosetta toolkit was used to solve
bioinformatics problems; Vinterbo and Øhrn explained
in [61.19] the implementation of a genetic algorithm
for rough sets in the Rosetta toolkit and showed that
this algorithm produces smaller rules with better
predictability. They measured the predictability with
the AUC score (area under the ROC curve). The ROC
(receiver operating characteristic) curve is often used
in data mining to assess the performance of
classification algorithms, especially ranking
algorithms. It is plotted using the true positive rate
(known as sensitivity) and the false positive rate (also
called 1 − specificity) as axes. More
details can be found in [61.20].
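
The construction of the ROC curve and of the AUC score can be sketched as follows in Python. This is a minimal illustration under our own assumptions (scores where higher means more likely positive, labels coded 0/1, trapezoidal integration); it is not the implementation used in the Rosetta toolkit.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return the ROC curve as (false positive rate, true positive rate) points."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all examples tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under a piecewise-linear curve, by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical classifier scores and true classes (1 = positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(auc(roc_points(scores, labels)))  # 8 of the 9 positive/negative pairs are ranked correctly
```

The AUC obtained this way equals the proportion of (positive, negative) pairs in which the positive object receives the higher score, which explains its use for ranking algorithms.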

Genetic programming has been used to extract rules 
from biological data. For example, Pappa and Frei- 
tas proposed an original approach to predict protein 
postsynaptic activity [61.22]. Since there are many
classification algorithms, and the choice of the
algorithm has an important impact, they chose to design
an algorithm that searches for a good classification
algorithm. To this end, they use grammar-based genetic
programming (GGP). The main difference from the
genetic programming approach implemented by Yang
et al. [61.23] is the use of a grammar.


Rule Design. When mining rules, two designs are
available: in the Michigan design, each solution is
a rule, while in the Pittsburgh design, each solution is
a rule set. The Pittsburgh design has a larger search
space; moreover, fitness and operators are harder to
implement. However, with this design there is no need
to use a covering algorithm to encourage rules from the
same solution (rule set) to cover different objects.
With the Michigan design, without any covering
algorithm, solutions (rules) can cover the same objects.
This point may cause problems when searching for
classification rules. In [61.26], Bacardit and Butz
compared Michigan and Pittsburgh LCSs. They concluded
that both are suitable for data mining. Michigan LCSs
tend to overfit the data (rules are too specific), while
Pittsburgh LCSs are sometimes
too general and miss some search subspaces.
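
The difference between the two designs can be made concrete with the following Python sketch. The `Rule` class, the ordered interpretation of a rule set, and the toy objects are assumptions made for this illustration; actual LCS implementations such as GAssist or BioHEL use richer representations.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Rule:
    conditions: Dict[str, int]  # attribute -> required (discretized) value
    label: str                  # predicted class

    def covers(self, obj: Dict[str, int]) -> bool:
        """True if the object satisfies every condition of the rule."""
        return all(obj.get(attr) == value for attr, value in self.conditions.items())


# Michigan design: one individual = one rule; the population is the model.
# Pittsburgh design: one individual = a whole rule set.
MichiganIndividual = Rule
PittsburghIndividual = List[Rule]


def predict(rule_set: PittsburghIndividual, obj: Dict[str, int], default: str = "unknown") -> str:
    """Classify with the first matching rule (rule set read as an ordered list)."""
    for rule in rule_set:
        if rule.covers(obj):
            return rule.label
    return default


def covered_by(rules: List[Rule], objects: List[Dict[str, int]]) -> List[Set[int]]:
    """For each rule, the indices of the objects it covers (to inspect overlap)."""
    return [{i for i, o in enumerate(objects) if r.covers(o)} for r in rules]


# Hypothetical discretized expression levels.
objects = [
    {"geneA": 1, "geneB": 0},
    {"geneA": 1, "geneB": 1},
    {"geneA": 0, "geneB": 1},
]
r1 = Rule({"geneA": 1}, "tumor")
r2 = Rule({"geneA": 1, "geneB": 0}, "tumor")
r3 = Rule({"geneA": 0}, "normal")

michigan_population = [r1, r2, r3]       # three individuals
pittsburgh_individual = [r1, r3]         # one individual

print(covered_by(michigan_population, objects))  # r1 and r2 both cover object 0
print(predict(pittsburgh_individual, objects[2]))
```

In the Michigan population, r1 and r2 overlap on the same object, which is the situation a covering algorithm is meant to discourage; in the Pittsburgh individual, the fitness is computed on the rule set as a whole.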


Genetic Algorithm Design. As many genetic algo- 
rithms have been proposed to deal with rule mining 
(Table 61.1), this paragraph presents the main compo- 
nents used: 


• Initial population: Most of the time, the initial
population is composed of random individuals. However,
Cho et al. initialized their population with rules
generated by a neural network [61.21]. In [61.14], the
population is initialized by iteratively choosing one
object and generating a rule covering it.

• Fitness function: It often contains an objective on
the rule size to limit the bloat effect and overfitting,
both responsible for generating specific and
complicated rules. This is an application of the minimum
description length (MDL) principle. It is frequently
associated with a performance measure: quality of
hitting sets in rough sets, accuracy, coverage,
confidence, or AUC. As many measures have been
proposed, we will not detail all of them, but refer
to the review of Geng and Hamilton on rule
interestingness measures [61.27]. Pappa and Freitas
recommend using sensitivity × specificity as the
fitness function when the class distribution is
unbalanced (in their protein data, only 6.04% of objects
had the positive class) [61.22]. The majority of
bioinformatics applications use an aggregation to
combine these multiple objectives. Weights are
sometimes introduced to balance objectives. For
example, Bacardit uses an automatic weighting function
that changes weights while the algorithm is running
[61.28]. However, weights can be difficult to
configure; multiobjective algorithms can
overcome this problem [61.11, 29].

• Encoding: For the greater part, encoding is binary.
In the rough sets approach, it is of fixed size and
matches selected features of the approximated sets.
Rules can be encoded as binary vectors, as lists of
values, or as lists of features and values. In [61.15],
a rule representation for real-valued features is used:
hyper-rectangles, instead of fuzzy rule approaches or
data preparation with discretization methods. The
encoding of Pappa and Freitas differs from the others
in that it uses a grammar derivation tree [61.22]. This
is needed since they search for classification
algorithms, and not for rules.

• Operators: The most commonly used crossover
operators are 1-point crossovers. Casillas et al.
proposed adapted crossover and mutation operators to
deal with rule overlapping when using rule sets
[61.29]. Parents are mainly selected using fitness
proportionate selection, such as stochastic universal
sampling (SUS) or roulette wheel selection. Less
frequently, elite selection and tournament selection
are used.


Table 61.1 Overview of applications of evolutionary rule mining in bioinformatics

| Application | EA | Approach | Design | Evaluation function | Encoding | Operators | Reference |
|---|---|---|---|---|---|---|---|
| Protein structure prediction | GA | LCS | Pittsburgh | Size, accuracy | Hyper rectangle | Tournament selection, 1-point crossover | [61.15] |
| Protein binding | GA | NN (hybrid) | Michigan | Confidence, size | Fixed size value-vectors | RWS, 1-point crossover | [61.21] |
| Protein classification | GGP | | | Normalized accuracy | Grammar derivation tree | | [61.22] |
| Protein binding | GA | Rough sets | Michigan | Size, coverage | Binary | SUS | [61.18] |
| Microarray | GP | | Michigan | Size, coverage | Binary | Elite selection, cut & splice crossover | [61.23] |
| Microarray | MOEA | | Michigan | Support, jmeasure, interest, surprise, confidence | | RWS | [61.11] |
| Microarray | GA | LCS | Pittsburgh | Accuracy, misclassification rate | Binary | Tournament selection, 1-point crossover | [61.24] |
| 3-D-Microarray (time series) | GA | Rough sets | Michigan | Size, coverage | Binary | SUS | [61.25] |

MOEA: multiobjective evolutionary algorithm, GA: genetic algorithm, GP: genetic programming, GGP: grammar-based genetic
programming, SUS: stochastic universal sampling, LCS: learning classifier systems, NN: neural network, RWS: roulette wheel
selection
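
Three of the components listed in this paragraph (a fitness suited to unbalanced classes, stochastic universal sampling, and 1-point crossover) can be sketched in Python as follows. This is a minimal sketch under our own assumptions (binary class labels coded 0/1, non-negative fitness values, individuals stored as lists); it does not reproduce the code of any of the cited works.

```python
import random
from typing import List, Sequence, Tuple


def sensitivity_times_specificity(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fitness for unbalanced classes: sensitivity * specificity (labels are 0/1)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    tn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity * specificity


def sus_select(fitnesses: Sequence[float], n_select: int, rng: random.Random) -> List[int]:
    """Stochastic universal sampling over non-negative fitness values.

    A single random offset places n_select equally spaced pointers on the
    cumulative fitness line; the indices of the selected parents are returned.
    """
    total = sum(fitnesses)
    if total <= 0:
        # Degenerate case: fall back to a uniform random choice.
        return [rng.randrange(len(fitnesses)) for _ in range(n_select)]
    step = total / n_select
    start = rng.uniform(0, step)
    chosen: List[int] = []
    index, cumulative = 0, fitnesses[0]
    for k in range(n_select):
        pointer = start + k * step
        while cumulative < pointer and index < len(fitnesses) - 1:
            index += 1
            cumulative += fitnesses[index]
        chosen.append(index)
    return chosen


def one_point_crossover(a: List[int], b: List[int], rng: random.Random) -> Tuple[List[int], List[int]]:
    """Swap the tails of two equal-length parents (length >= 2) at one random cut."""
    cut = rng.randint(1, len(a) - 1)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


rng = random.Random(0)  # fixed seed, for reproducibility of the example only
print(sensitivity_times_specificity([1, 1, 0, 0], [1, 0, 0, 0]))
print(sus_select([4.0, 1.0, 1.0, 2.0], 4, rng))
print(one_point_crossover([0] * 6, [1] * 6, rng))
```

Compared with roulette wheel selection, which draws one random number per parent, SUS uses a single draw for all parents, so an individual holding a share s of the total fitness is selected at least floor(s × n_select) times.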


61.3 Feature Selection for Classification and Regression by Evolutionary Algorithm in Bioinformatics


61.3.1 Feature Selection 


Feature selection is an active research domain in the
statistics (variable selection) and data mining
communities. Used jointly with classification
(or clustering), feature selection can significantly
improve the comprehensibility of the resulting
classifier models and often builds a model that
generalizes better to unseen points. The main
idea of feature selection is to choose a subset of input
variables by eliminating features with little or no
predictive information. Hence, finding the correct
subset of predictive features is an important problem
in itself.

Feature selection for classification can be divided
into three classes depending on how the selection
process is combined with the classifier: the wrapper
approach, the filter approach, and the embedded
approach. The wrapper approach uses learning algorithms
during the feature selection process and assesses the
selected features by the learning algorithm's
performance using, for example, accuracy, sensitivity,
or specificity. The filter approach considers
statistical characteristics of a dataset directly,
without involving any learning algorithm. In the
embedded approach, the learning algorithm uses its own
embedded feature selection algorithm (either explicit
or implicit). Note that a hybrid approach is sometimes
used: a filter approach is first adopted to reduce the
number of features to consider, and a wrapper approach
is then applied to the remaining features to select
features in a more accurate way.

The general task of feature selection can be
formulated as an optimization problem. Binary values
of the variables $x_i$ are used in order to indicate the
presence ($x_i = 1$) or the absence ($x_i = 0$) of the
feature $i$ in the optimal feature set. Then, the problem
is formulated as
$\max_{x = (x_1, \ldots, x_n) \in \{0,1\}^n} F(x)$
for a function $F$ that has to be determined regarding
the context (filter, wrapper, or embedded approach and
application under study). In filter approaches, many
different statistical feature selection measures, such
as the correlation feature selection (CFS) measure, the
minimal-redundancy-maximal-relevance (mRMR) measure,
the discriminant function, or the Mahalanobis distance,
have been used to assign a score to each feature. In
wrapper approaches, classification algorithms may be
used to assign to a selection of features a score that
represents the ability of the selection to lead to
a correct classification. Such classical algorithms are
KNN (k nearest neighbors), SVMs (support vector
machines), NNs (neural networks), etc.
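
The formulation above can be illustrated on a very small instance by enumerating all binary vectors x, as in the following Python sketch. The scoring function used here (sum of per-feature relevance values minus a fixed cost per selected feature) is a purely hypothetical stand-in for F; it is not one of the measures cited above.

```python
from itertools import product
from typing import Callable, Sequence, Tuple


def best_subset(n_features: int,
                score: Callable[[Sequence[int]], float]) -> Tuple[Tuple[int, ...], float]:
    """Maximize score(x) over all x in {0,1}^n by exhaustive enumeration."""
    best_x: Tuple[int, ...] = tuple([0] * n_features)
    best_value = score(best_x)
    for x in product((0, 1), repeat=n_features):
        value = score(x)
        if value > best_value:
            best_x, best_value = x, value
    return best_x, best_value


# Hypothetical relevance of four features and cost per selected feature.
relevance = [0.9, 0.1, 0.6, 0.05]
cost = 0.2


def toy_score(x: Sequence[int]) -> float:
    """Toy F(x): total relevance of the selected features minus a size penalty."""
    return sum(r * xi for r, xi in zip(relevance, x)) - cost * sum(x)


print(best_subset(len(relevance), toy_score))  # features 0 and 2 are selected
```

The enumeration visits 2^n vectors, which is only feasible for a handful of features; this is the reason why heuristic searches such as evolutionary algorithms are used on real bioinformatics data.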

As reported by Kim et al., traditional approaches
to feature selection with a single criterion have shown
some limitations [61.30]. Therefore, they propose to
consider this problem as a multiobjective one and
present an adaptation of ELSA (evolutionary local
selection algorithm), inspired by artificial life models
of adaptive agents, to cope with this multiobjective
problem. Another multiobjective approach may be found
in Garcia-Nieto et al., where a multiobjective genetic
algorithm is used for cancer diagnosis from gene
selection in microarray datasets [61.31].


Use in Bioinformatics 
As indicated by Saeys et al., feature selection in
bioinformatics is motivated by the high-dimensional
nature of modeling tasks (sequence analysis, microarray
analysis, spectral analyses, literature mining,
etc.) [61.32]. Note that, in contrast with other
dimensionality reduction techniques (based on
projection or compression, for example), feature
selection techniques do not modify data. Thus, they
preserve the original semantics of the variables, which
helps the interpretability of results.

In their review of Feature selection techniques in
bioinformatics, Saeys et al. identify three classes of
problems where feature selection is involved [61.32]:


• Sequence analysis
• Microarray analysis
• Mass spectra analysis.


Sequence analysis deals with the study of either the 
content of the sequence or its signal. As far as the con- 
tent is concerned, the prediction of subsequences that 
code for proteins requires a feature selection to cope 
with the large number of features that can be extracted 
from a sequence and the lack of samples available. 
Recently, feature selection approaches have also been 
used for other applications such as the recognition of 
promoter regions or the prediction of microRNA tar- 
get. Regarding the signal analysis, the aim is mainly 
to identify more or less conserved signals in the se- 
quence (motifs), representing binding sites for proteins. 
Therefore, regression approaches are proposed to relate 
motifs to gene expression levels and feature selection 
can be used to search for the best motif. 

Microarray analysis, as already mentioned, poses a great
challenge for computational techniques because of its
large dimensionality. Saeys et al. give in their survey
an overview of the most influential techniques [61.32].
In particular, genetic algorithms can be used to deal
with microarray data in wrapper-type approaches.

Mass spectra analysis deals with the analysis of 
thousands of signal intensity measures. This context, 
even if data are different, is very similar to microarray 
analysis and feature selection is an important step to 
reduce the dimensionality of the problem. Genetic al- 
gorithms and nature inspired algorithms have been pro- 
posed to deal with such data, using wrapper approaches. 


61.3.2 Evolutionary Approaches for Feature 
Selection for Classification 
and Regression in Bioinformatics 


Motivations 
With the development of technologies, many bioinfor- 
matics applications deal with large datasets, often with 
more features than objects (samples). However, among 
these features, some are irrelevant or redundant. That is
why feature selection aims to select a subset of relevant
features. By reducing the dimension of the problem, this
method can reduce the computational time and improve
prediction accuracy. Indeed, including nonsignificant
features can introduce noise and may mask significant
ones.

Traditional feature subset selection methods are
sequential and based on greedy heuristics. For example,
sequential forward selection (SFS) starts with an empty
subset and iteratively adds some features, whereas the
sequential backward selection (SBS) starts with the full
feature set and iteratively removes features [61.47]. An
important drawback of these methods is that they
consider one feature at a time, ignoring possible
interactions between features. Fairly recently, more
advanced methods such as evolutionary algorithms have
been proposed
to explore the space of feature subsets [61.35, 48].
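
The greedy behavior of SFS, and the drawback mentioned above, can be observed with the following Python sketch. The stopping rule (stop when no single feature improves the score) and the toy scoring function, in which features 0 and 1 are only useful together, are assumptions chosen for this illustration.

```python
from typing import Callable, List, Tuple


def sequential_forward_selection(n_features: int,
                                 score: Callable[[List[int]], float],
                                 max_size: int) -> Tuple[List[int], float]:
    """Greedy SFS: add, one at a time, the feature that most improves the score."""
    selected: List[int] = []
    best_value = score(selected)
    while len(selected) < max_size:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        value, feature = max((score(selected + [f]), f) for f in candidates)
        if value <= best_value:
            break  # no single feature improves the current subset
        selected.append(feature)
        best_value = value
    return selected, best_value


def interacting_score(subset: List[int]) -> float:
    """Toy score: features 0 and 1 are informative only when used together."""
    if 0 in subset and 1 in subset:
        return 1.0
    if 2 in subset:
        return 0.6
    return 0.0


print(sequential_forward_selection(3, interacting_score, max_size=3))
print(interacting_score([0, 1]))  # the subset that SFS cannot reach
```

On this example, SFS stops with feature 2 alone, whereas the pair {0, 1} has a better score: neither feature 0 nor feature 1 looks useful when examined one at a time.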


Overview 
Evolutionary algorithms for feature selection in
bioinformatics are rarely used in a filter approach, as
such approaches ignore the effects of the selected
feature subset on the performance of the classifier and
do not consider existing correlations between features.
Hence, considering jointly the feature subset selection
and the classification (or regression) process is more
promising. This can be performed by three different
approaches: embedded [61.40], hybrid [61.41], or, more
frequently, wrapper approaches.


Table 61.2 Overview of evolutionary feature selection applications in bioinformatics

| Application | EA | Approach | Classifier | Evaluation function | Encoding | Operators | Reference |
|---|---|---|---|---|---|---|---|
| Microarray (Cancer) | GA | W | KNN | CA (LOOCV) | Binary | | [61.33] |
| Mass spectra (Cancer) | GA | W | KNN | CA (LOOCV) | Discrete | | [61.34] |
| Microarray (Cancer) | Hybrid GA | W | KNN | CA (LOOCV), # features | Binary | Rank-based RWS, m-point crossover | [61.35] |
| Microarray | GA | W | SVM | Sensitivity, specificity, geometric mean | Binary | RWS, 1-point/2-point crossover | [61.36] |
| Microarray (Cancer) | GA | W | AP/SVM | CA (LOOCV) | | SUS and RWS, uniform and 1-point crossover | [61.37] |
| Microarray (Cancer) | GA | W | SVM | CA (10-fold), # features, feature cost | Binary | RWS, 2-point crossover, elitist replacement | [61.38] |
| Microarray (Cancer) | GA | W | SVM | CA (10-fold) | Binary | Specific SSOCF crossover | [61.39] |
| Microarray (Cancer) | GA | E | SVM | CA (10-fold), # features | Binary + coefficient vector | SUS, specific crossover, specific mutation | [61.40] |
| Microarray (Cancer) | GA | H | SVM | CA (LOOCV) | Binary | RWS, random 1-point crossover, multiuniform mutation | [61.41] |
| Microarray (Cancer) | GP | | | CA (10-fold), # features | Binary | Reproduction, homo(hetero)geneous crossover | [61.42] |
| Microarray (Cancer) | GP | | | AUC-ROC | | Generational, tournament selection | [61.43] |
| Microarray (Cancer) | MOEA | W | GS | CA (LOOCV), # features | Binary | Elitist + ranking selection | [61.44] |
| Microarray (Cancer) | MOEA (NSGA-II) | W | SVM | Sensitivity, specificity (10-fold), # features | Binary | SSOCF crossover, bit-flip mutation (uniform, one reduction, zero reduction) | [61.31] |
| Microarray (Diabetes/obesity) | Parallel GA | | | EH-DIALL, CLUMP | Discrete | Uniform crossover, specific mutation | [61.45] |
| Mass spectra (Regression) | GA | W | MLR, PLS | RMSEP (data-splitting) | Binary | | [61.46] |

EA: evolutionary algorithm, MOEA: multiobjective evolutionary algorithm, GA: genetic algorithm, GP: genetic programming, GPSO:
geometric particle swarm optimization. Approach: wrapper (W), embedded (E), hybrid (H). Classifiers: KNN: k nearest neighbor, SVM: support
vector machine, AP: all paired, MLR: multiple linear regression, PLS: partial least square. Evaluation functions: CA: classification accuracy,
AUC: area under curve, LOOCV: leave-one-out cross-validation, RMSEP: root-mean-square error of prediction. Operators: SUS: stochastic
universal sampling, RWS: roulette wheel selection, SSOCF: subset size-oriented common features

Any evolutionary algorithm can be used for feature 
selection. For example, some methods use genetic pro- 
gramming [61.42], but the vast majority uses genetic 
algorithms (GAs). Table 61.2 reports some works about 
evolutionary feature selection applications in bioinfor- 
matics. These works are described according to the 
application field, the evolutionary algorithm used, the 
approach (embedded, hybrid, wrapper), the classifier 
used (when one is used), the evaluation function and the 
specificities about encoding and operators. This table 
helps to identify tendencies of the use of evolutionary 
algorithms for feature selection in bioinformatics: 


• Encoding: Solutions are mainly encoded as binary
vectors of size n (the initial number of features),
each bit indicating whether a feature is selected or
not. However, in their studies, Jourdan et al. [61.45]
and Li et al. [61.34] propose to use a discrete vector
encoding, where each solution is described by the
list of the selected features, which is particularly
well adapted to large datasets as encountered in
bioinformatics.

• Evaluation functions: As explained before,
evolutionary algorithms are mainly used jointly with
a classification algorithm such as KNN [61.35] or
SVM [61.36]. Using such a classifier allows us to
evaluate the potential of the selection to lead to
a good classification by computing the classification
accuracy. This accuracy can be computed with various
methodologies such as k-fold cross-validation
(10-fold, for example), leave-one-out
cross-validation (LOOCV), or the bootstrap
methodology. The 0.632 bootstrap was found to be the
best estimator in [61.49], but the drawback of this
method is its computational cost in comparison to
LOOCV. For this reason, most authors use LOOCV,
which is fast and almost unbiased [61.35].
When dealing with larger datasets, 10-fold
cross-validation can also be used [61.38].
The evaluation function can also take into account
other parameters such as the number of features
[61.39]. In this context (feature subset size
minimization and performance maximization), feature
selection can be viewed as a multiobjective
optimization problem [61.31, 44].

• Operators: In terms of operators, some works
use specific ones [61.39], but classical operators
are mainly used. For example, we may cite SUS or
roulette wheel selection (RWS) [61.37] for the
selection, the 1-point or 2-point crossovers [61.36]
for the evolution of solutions, bit-flip mutation,
etc.
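
The tendencies listed above (binary encoding, wrapper evaluation by cross-validated accuracy, bit-flip mutation) can be combined as in the following Python sketch. A 1-nearest-neighbor classifier evaluated by LOOCV stands in for the KNN and SVM classifiers of Table 61.2; the size penalty `size_weight` and the toy dataset are hypothetical and serve only the illustration.

```python
import random
from typing import List, Sequence


def loocv_1nn_accuracy(X: Sequence[Sequence[float]], y: Sequence[int],
                       mask: Sequence[int]) -> float:
    """Leave-one-out accuracy of a 1-NN classifier restricted to selected features."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0  # an empty selection cannot classify anything
    correct = 0
    for i in range(len(X)):
        best_dist, best_label = float("inf"), None
        for k in range(len(X)):
            if k == i:
                continue  # the held-out object is not its own neighbor
            dist = sum((X[i][j] - X[k][j]) ** 2 for j in cols)
            if dist < best_dist:
                best_dist, best_label = dist, y[k]
        if best_label == y[i]:
            correct += 1
    return correct / len(X)


def fitness(X, y, mask: Sequence[int], size_weight: float = 0.01) -> float:
    """Aggregated fitness: accuracy minus a small penalty per selected feature."""
    return loocv_1nn_accuracy(X, y, mask) - size_weight * sum(mask)


def bit_flip(mask: Sequence[int], rate: float, rng: random.Random) -> List[int]:
    """Bit-flip mutation: each bit is flipped independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in mask]


# Hypothetical data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 5.0], [0.1, 1.0], [0.2, 9.0], [1.0, 5.2], [1.1, 0.9], [0.9, 9.1]]
y = [0, 0, 0, 1, 1, 1]

for mask in ([1, 0], [0, 1], [1, 1]):
    print(mask, loocv_1nn_accuracy(X, y, mask))
print(bit_flip([1, 0, 1, 0], rate=0.25, rng=random.Random(0)))
```

On this toy dataset, keeping the noisy feature degrades the LOOCV accuracy, which illustrates why the wrapper evaluation guides the search toward smaller, more relevant subsets. Replacing the aggregated fitness by the pair (accuracy, number of features) gives the multiobjective formulation mentioned above.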


Feature selection is often used for classification, in 
order to predict a discrete trait and to classify sam- 
ples (disease or not, for example). However, to predict 
a quantitative trait (a value indicating the good dispo- 
sition for a treatment, for example), regression is used 
instead of classification. The problem is the same: if
too many features are available, including
nonsignificant ones, the regression method will have
difficulty giving good results. Hence, feature selection
may also be used in a regression context. For example,
Broadhurst et al. combined a genetic algorithm with
multiple linear regression (MLR) or with partial least
squares (PLS) regression on a mass spectrometry
problem [61.46].


61.4 Clustering by Evolutionary Algorithm in Bioinformatics 


61.4.1 Clustering 


Task Description 
Clustering or unsupervised classification aims at de- 
composing or partitioning a (usually multivariate) 
dataset into groups so that objects in a group are similar 
to each other, and are as different as possible from ob- 
jects of other groups. A survey of clustering algorithms 
can be found in [61.50]; thus we will just introduce gen- 
eralities below. 

Clustering techniques can be broadly divided into
three main types: partitional, hierarchical, and
overlapping. Partitional and hierarchical clusterings
produce a hard partition of data, as an object must
belong to one and only one group, whereas in
overlapping clustering objects may belong to several
groups. In clustering, the number of groups can be
known and fixed before performing the clustering, or
must be determined directly by
the algorithm.

For partitional methods, the most common
algorithm is k-means [61.51], which is often described
as a local search. For hierarchical clustering, two
distinct types of methods are identifiable: the
agglomerative ones and the divisive ones.
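
The local search character of k-means can be seen in the following Python sketch, which alternates the assignment of objects to their nearest center and the update of the centers until the assignment no longer changes. This is a minimal sketch of the standard iteration (often attributed to Lloyd), with initial centers given by the caller; the data and function names are assumptions for the example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(p: Sequence[float], q: Sequence[float]) -> float:
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points: Sequence[Point], centroids: Sequence[Point],
           max_iter: int = 100) -> Tuple[List[int], List[Point]]:
    """Alternate assignment and centroid update until labels stop changing."""
    centers = [tuple(c) for c in centroids]
    labels: List[int] = []
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # local optimum reached: no object changes cluster
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


def within_cluster_ss(points, labels, centers) -> float:
    """Objective minimized by k-means: within-cluster sum of squares."""
    return sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
print(within_cluster_ss(points, labels, centers))
```

Since neither step can increase the within-cluster sum of squares, the procedure converges, but only to a local optimum that depends on the initial centers; this dependence is one motivation for the evolutionary approaches discussed in Sect. 61.4.2.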

Use in Bioinformatics
Clustering is the most popular method currently used in 
the first step of gene expression matrix analysis. Clus- 
tering is appropriate when there is no a priori knowl- 
edge about the data. In such circumstances, the only 
possible approach is to study the similarity between dif- 
ferent samples or experiments. There are two straight- 
forward ways to study the gene expression matrix: 
comparing expression profiles of genes by comparing 
rows in the expression matrix and comparing expres- 
sion profiles of samples by comparing columns in the 
matrix. By comparing rows, we may find similarities or
differences between genes and thus draw conclusions
about the correlation between two genes. If we find
that two rows are similar, we can hypothesize that the
respective genes are co-regulated and possibly
functionally related. By comparing samples, we can find
which
genes are differentially expressed in different situations.
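
The comparison of rows and columns of the expression matrix can be sketched as follows in Python. The Pearson correlation coefficient is used here as the similarity measure; this choice, like the gene names and expression values, is an assumption made for the illustration, since the text does not impose a particular measure.

```python
from math import sqrt
from typing import Sequence


def pearson(u: Sequence[float], v: Sequence[float]) -> float:
    """Pearson correlation coefficient between two expression profiles."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a constant profile carries no correlation information
    return cov / (norm_u * norm_v)


# Hypothetical expression matrix: rows are genes, columns are samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}

# Row comparison: similarity between the profiles of two genes.
print(pearson(expression["geneA"], expression["geneB"]))  # close to +1
print(pearson(expression["geneA"], expression["geneC"]))  # close to -1

# Column comparison: similarity between the profiles of two samples.
samples = list(zip(*expression.values()))
print(pearson(samples[0], samples[3]))
```

In this toy matrix, geneA and geneB vary together across samples and would be candidates for co-regulation, whereas geneA and geneC vary in opposite directions.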


61.4.2 Evolutionary Approaches 
for Clustering in Bioinformatics 


Motivations 

Evolutionary clustering has been widely used in
bioinformatics, as datasets are particularly large and
classical methods often lead to suboptimal solutions.
A good survey of the use of
evolutionary algorithms for clustering can be found
in [61.52]. The authors proposed a classification of
algorithms taking into consideration different aspects
of evolutionary data clustering:


• Fixed or variable number of clusters
• Cluster-oriented or nonoriented operators
• Context-sensitive or context-insensitive operators
• Binary, integer, or real encoding
• Centroid-based, medoid-based, label-based,
tree-based, or graph-based representations.


Other surveys cover genetic-based clustering [61.53]
and multiobjective clustering [61.54, 55].


Overview 
Evolutionary algorithms for clustering bioinformatics
data are applied both with a fixed number of clusters and
with a variable number of clusters. The majority of the
applications concern microarray data. Table 61.3 presents
some works through their important components
(application field, evolutionary algorithm used, fixed or
variable number of clusters (is k known or not?), etc.).
Below, we attempt to describe the tendencies of existing
methods by separating the two cases: a fixed or a variable
number of clusters. Finally, some information is given
about biclustering, which is increasingly used in
bioinformatics.


Fixed Number of Clusters. The number of clusters
can be fixed before searching for a clustering model with
an evolutionary algorithm. In this case, the number of
clusters (often denoted by k) can be set by a domain
expert (here, a biologist, for example) or by using
some specific criteria, such as the naive criterion $k \approx \sqrt{n/2}$,
where n is the number of objects, or more specific
ones based on an information criterion approach such 
as the Akaike information criterion (AIC), Bayesian in- 
formation criterion (BIC), or the deviance information 
criterion (DIC): 


• Encoding: For a fixed number of clusters, several
  encodings are possible: binary, integer, and real. With
  a binary encoding, both prototype-based and
  partition-based representations can be realized. For
  integer encoding, two representations are usual: the
  label-based encoding, where each gene represents an
  object and its value indicates the label of the cluster
  the object is assigned to [61.60], and the medoid-based
  encoding, which represents the prototype of each
  cluster (the object that depicts the cluster). The real
  encoding is used to represent the coordinates of the
  center of each cluster and corresponds to the
  centroid-based representation [61.57, 62, 63].

• Fitness function: Another specific component of
  evolutionary clustering of bioinformatics data is the
  fitness function. Many clustering validity criteria
  exist and can be adapted to measure the quality
  of a solution of an evolutionary algorithm. Some
  examples are: minimization of the sum of
  within-cluster distances, minimization of the sum of the
  squared Euclidean distances of the objects from their
  respective cluster means [61.57], minimization of
  the distortion of the clusters (intracluster diversity),
  etc.

• Operators: Some authors use classical operators such
  as the 1-point crossover, but many articles point out
  the drawbacks of classical genetic operators [61.57]
  and prefer context-sensitive operators [61.56].
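
To make the encoding and fitness components above more concrete, here is a minimal sketch of a label-based individual evaluated with the sum of squared Euclidean distances of the objects from their cluster means. The function names and the toy data are assumptions made for illustration; they are not taken from any of the cited works.

```python
def cluster_means(points, labels, k):
    """Mean of each cluster under a label-based encoding (None if a cluster is empty)."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, c in zip(points, labels):
        counts[c] += 1
        for d in range(dim):
            sums[c][d] += p[d]
    return [
        [s / counts[c] for s in sums[c]] if counts[c] else None
        for c in range(k)
    ]

def sse_fitness(points, labels, k):
    """Sum of squared Euclidean distances to the cluster means (to be minimized)."""
    means = cluster_means(points, labels, k)
    return sum(
        sum((x - m) ** 2 for x, m in zip(p, means[c]))
        for p, c in zip(points, labels)
    )

# Hypothetical objects (e.g., genes described by two expression values).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# Two label-based individuals: position i holds the cluster label of object i.
good_individual = [0, 0, 1, 1]
bad_individual = [0, 1, 0, 1]

good_fitness = sse_fitness(points, good_individual, k=2)
bad_fitness = sse_fitness(points, bad_individual, k=2)
```

An evolutionary algorithm with a fixed k would keep a population of such label vectors and prefer those with a lower fitness value; a centroid-based encoding would store the means themselves instead of the labels.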


Variable Number of Clusters.


• Encoding: For a variable number of clusters, where
  evolutionary algorithms aim at optimizing both the
  number of clusters and the partition of objects, the
  previously mentioned representations can be used,
  but there are also some new ones. For example, the
  number of clusters can be stored in the
  representation [61.61]. There are also some rule-based
  representations [61.66], graph-based representations, etc.

• Operators: As the encodings can be more complex
  than in the case of a fixed number of clusters [61.61],
  operators are adapted to the representation and the
  context of clustering.

• Fitness function: Concerning the evaluation, the
  authors often use clustering validity criteria [61.67].
  We can also observe that the silhouette coefficient is
  often used to evaluate the quality of a clustering [61.60].
  The authors can also add some problem-related
  functions, as in [61.65].


Table 61.3 Overview of applications of evolutionary clustering in bioinformatics

| Application | EA | K? | Evaluation function | Encoding | Operators | Reference |
|---|---|---|---|---|---|---|
| Microarrays | GA | Y | Interestingness measure | Set of clusters + label of objects | Specific crossover, specific mutation | [61.56] |
| Microarrays (HL-60, HD-4cl, Yeast) | Memetic (GA) | Y | Minimum sum of squares | Centroid | k-means (LS), uniform + specific crossover, split mutation | [61.57] |
| Microarrays (AD400_10_10, Yeast, Human, Rats) | GA | N | XB (cluster validity index) | Centroid | RWS, specific crossover, mutation value | [61.58] |
| Microarrays | MOEA | N | Overall deviation + connectivity | Set of clusters | Binary tournament, specific crossover, no mutation | [61.59] |
| Microarrays | GA | N | Silhouette + K | Label | Mutation: split + eliminate a cluster, specific crossover | [61.60] |
| Microarrays (Lung + Leukemia) | GA | N | Silhouette + VRC | Label + K + distance + centroid | Centroid-based crossover | [61.61] |
| Microarrays | GA | N | Bayesian validation | Centroid | RWS, 1-point crossover, mutation value | [61.62, 63] |
| Protein structure | Chaotic GA | N | Max clustering coefficient | Binary | Specific crossover, specific mutation | [61.64] |
| Protein-protein functional interaction | MOEA | N | Cluster size + 3 problem-related functions | Centroid | No crossover, mutation: split + delete, merge | [61.65] |

K?: is the number of clusters known? (Y: yes, N: no), MOEA: multiobjective evolutionary algorithm, GA: genetic algorithm, K: number of
clusters, VRC: variance ratio criterion, XB: Xie-Beni cluster validity index, LS: local search, RWS: roulette wheel selection


Biclustering. Biclustering (also called co-clustering
or two-mode clustering) aims to compute biclusters
(or co-clusters), which are associations
of (possibly overlapping) sets of objects with sets of
features. A biclustering algorithm computes
simultaneously linked partitions on both rows and columns.
Many formulations of the biclustering problem have
been proposed, such as the hierarchical model, the
biclustering model, and the pattern-based model. The term
biclustering was introduced by Cheng and Church in [61.68].
Up to now, in the context of bioinformatics, biclustering
approaches have been proposed mainly to deal
with microarray data [61.69]. As clusters may overlap
in the two dimensions of the matrix and no constraint
is given on their size, it may be possible to find
a very large number of significant biclusters. Hence,
to obtain a concise description of the data through
biclusters, the size aspect is often considered as an
additional objective. This leads to multiobjective models
for which MOEAs have been proposed [61.70, 71].


61.5 Conclusion


Bioinformatics research generates a lot of data, and the
knowledge extracted from these data is still basic. Much
more knowledge could be discovered with the proposition
of new data mining methods. Many of these data
mining problems can be modeled as combinatorial
optimization problems, and efficient algorithms such as
evolutionary algorithms can be used to explore the huge
search space of these problems. Some research has been
conducted in this direction, and the aim of this chapter was
to present the tendencies of these works. It shows that
some promising results have been obtained for several
applications in bioinformatics.

However, there is much room for future research 
since the problems in bioinformatics presented in this 
chapter require even more effective approaches to gain 
important knowledge from the biological and biomedi- 
cal experiments. In particular, information about the do- 
main under study is still underutilized within research 
methods. Biological aspects should be more present and 
this may be done thanks to a more accurate model- 
ing or the incorporation of biological concepts within 
62. Integration of Metaheuristics
and Constraint Programming 


Luca Di Gaspero 


A promising research line in the optimization
community regards the hybridization of
exact and heuristic methods. In this chapter
we survey the specific integration of two
complementary optimization paradigms, namely
constraint programming, for the exact part, and
metaheuristics.
62.1 Constraint Programming and Metaheuristics


Constraint programming (CP) [62.1,2] is an effective 
methodology for the solution of combinatorial prob- 
lems that has been successfully applied in many do- 
mains. In a nutshell, CP is a declarative programming 
paradigm based on the idea of describing the relations 
(i. e., constraints) between variables that must hold in all 
solutions of the combinatorial problem at hand. For ex- 
ample, in the solution to a Sudoku puzzle, the numbers 
to be placed must be unique with respect to columns, 
rows, and blocks of the board. 

CP has an interdisciplinary nature, since it re- 
lies on contributions and methods from the communi- 
ties of logic programming (LP), artificial intelligence 
(AI), and operations research (OR). Indeed, the sim- 
ple declarative modeling language of CP, consisting 
of variables and constraints, is very similar to those 
available in classical LP languages such as Prolog. 
The solution method features constraint propagation
which, in essence, is a reasoning or inference procedure
typical of AI. Finally, especially for optimization
problems, the solution process makes use of OR-inspired
branch and bound procedures and/or of dedicated
OR solvers for specific types of variables/constraints
(e.g., the simplex method for real variables and linear
constraints).


A CP model is an encoding of the problem state- 
ment using the basic CP building blocks, i. e., variables 
and constraints. Once a CP model of the problem un- 
der consideration has been stated, a CP solver is used to 
systematically search the solution space by alternating 
deterministic phases (constraint propagation) and non- 
deterministic phases (variable assignment, tree search), 
thus exploring implicitly or explicitly the whole search
space. In this respect, CP belongs to the family of
complete (or exact) solution methods. In other words, CP
guarantees finding the (optimal) solution of the problem
or proving that the problem is not satisfiable.

A different approach is usually taken by metaheuris- 
tics [62.3], such as local search [62.4], evolutionary 
algorithms [62.5], and ant colony optimization [62.6], 
just to name a few. These methods are incomplete, since 
they rely on heuristic information to focus on inter- 
esting areas of the search space and, in general, do 
not explore it entirely but are stopped after a given 
time limit. As a consequence, these algorithms do not 
guarantee finding the (optimal) solution, trading com- 
pleteness for a (possibly) greater (empirical) efficiency 
in the solution process. 

Just looking at completeness, it seems that the
clear choice for solving combinatorial problems would
be to always prefer CP over metaheuristics as the
solution method. However, in practice completeness
is hindered by the high computational effort due to
the worst-case complexity of the problems considered
(usually NP-complete or NP-hard). Therefore, for
practical purposes, the execution of CP solvers
is also terminated before the whole search space has been
explored, and a number of heuristics are used to focus
the search on the regions where it is more likely to
find solutions of the problem. Consequently, CP
and metaheuristics can be seen as complementary
approaches.

Although these two kinds of methods have been
studied individually by separate scientific communities
(for historical reasons), in recent years we have
witnessed an increasing interest in the integration of the
methods. In many cases, indeed, each approach has its
own strengths and weaknesses, and the general aim of
method integration is to create hybrid algorithms that
enhance the strengths of both approaches and (possibly)
overcome some of the weaknesses. In this respect,
Yunes maintains a web page listing a number of success
stories of hybrid solution methods [62.7], that is, papers
describing integrated approaches that outperform single
optimization methods.

A number of conferences and workshops specifically
aimed at bringing together researchers working
on the integration of solution techniques for combinatorial
problems have also recently started. Notable examples
are the series of CP-AI-OR conferences [62.8, 9],
started in 1999, and the Hybrid Metaheuristics
workshops [62.10-16], started in 2004. The scope of these
conferences is not limited to the integration of CP
techniques with metaheuristics, but they also consider
hybridization among other methods.

Additionally, a few surveys on the integration of
complete methods with metaheuristics have appeared in
the literature [62.17-19]. However, these surveys either
deal with a particular class of metaheuristics (i.e., local
search) [62.17, 19] and/or with a different class of
complete methods (integer linear programming) [62.17,
18]. Jourdan et al. [62.20] also took CP methods into
account, but they provide mostly a taxonomy of
cooperation between optimization methods rather than
surveying the specific integrations. Wallace and Azevedo
et al. [62.21, 22] survey hybrid algorithms, but from
a constraint programming viewpoint and mainly in the
setting of hybrid exact methods. In their recent review
of hybrid metaheuristics, Blum et al. [62.23] include
a section on the integration of CP with local search and
ant colony optimization (ACO). However, to the best
of our knowledge, at present no specific survey on the
integration of metaheuristics and constraint programming
has been published in the literature. This work
tries to fill this gap and to review the different
approaches specifically employed in the integration of
CP methods within metaheuristics.

The chapter is organized as follows. In Sect. 62.2 
the basic concepts of the constraint programming 
paradigm are introduced. They include modeling 
(Sect. 62.2.1), solution methods (Sect. 62.2.2), and CP 
systems (Sect. 62.2.3). The integration of CP with meta- 
heuristics is presented in Sect. 62.3, which is organized 
on the basis of the metaheuristic type involved in the 
integration. Finally, in Sect. 62.4 some conclusions are 
drawn. 


62.2 Constraint Programming Essentials 


In this section, we will briefly describe the essential 
concepts of CP, which are needed to understand the 
following sections. The readers interested in a more de- 
tailed introduction to CP are referred to the book of 
Apt [62.1] and to the recent comprehensive Handbook 
of Constraint Programming [62.2]. 

In order to apply constraint programming to a com- 
binatorial problem one first needs to model it through 
the specific formalism of constraint satisfaction or con- 
strained optimization problems. Afterwards, the model 
can be solved by a CP solver, which alternates the anal- 
ysis of constraints with tree search. Let us review these 
basic concepts. 


62.2.1 Modeling 


Constraint satisfaction problems (CSPs) are a useful 
formalism for modeling many real-world problems, ei- 
ther discrete or continuous. Remarkable examples are 
planning, scheduling, timetabling, just to name a few. 
A CSP is generally defined as the problem of associ- 
ating values (taken from a set of domains) to variables 
subject to a set of constraints. A solution of a CSP is 
an assignment of values to all the variables so that the 
constraints are satisfied. In some cases not all solutions 
are equally preferable and we can associate a cost func- 
tion to the variable assignments. In these cases, we talk 
about constrained optimization problems (COPs), and
we are looking for a solution that (without loss of gen- 
erality) minimizes the cost value. These concepts are 
formally introduced in the following. 


Constraint Satisfaction Problems 


Given:

• $X = \{x_1, \dots, x_k\}$ is a set of variables.
• $D = \{D_1, \dots, D_k\}$ is a set of domains associated to
  the variables. In other words, each variable $x_i$ can
  assume value $d_i$ if and only if $d_i \in D_i$.
• $\mathcal{C}$ is a set of constraints, i.e., mathematical relations
  over $Dom = D_1 \times \cdots \times D_k$.


We say that a tuple $(d_1, \dots, d_k) \in Dom$ satisfies
a constraint $C \in \mathcal{C}$ if and only if $(d_1, \dots, d_k) \in C$.

A constraint satisfaction problem (CSP) $P$, described
by the triple $(X, D, \mathcal{C})$, is the problem of
finding the tuples $d = (d_1, \dots, d_k) \in Dom$ that satisfy
every constraint $C \in \mathcal{C}$. Such tuples are called solutions
of the CSP, and the set of solutions of $P$ is denoted
by $sol(P)$.

$P$ is said to be consistent or satisfiable if and only if
$sol(P) \neq \emptyset$.

Notice that, depending on the modeling of the com- 
binatorial problem at hand, we could be interested in 
determining different properties of the CSP. In the ex- 
treme case, for example, one could just want to know 
whether the problem is satisfiable, regardless of the ac- 
tual solutions. The most common case is to search and 
provide a single solution to the problem, whereas some- 
times one could be interested in all the solutions. 
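
The definitions above can be illustrated with a very small example. The sketch below encodes a hypothetical CSP as variables, domains, and constraints (each constraint given as a scope and a Python predicate, which is an assumption of this example), and enumerates sol(P) by brute force over Dom. It is only meant to make the notation tangible; actual CP solvers do not enumerate Dom in this way.

```python
from itertools import product

# Hypothetical toy CSP: x1, x2, x3 take values in {1, 2, 3},
# must be pairwise different, and x1 must be smaller than x2.
variables = ["x1", "x2", "x3"]
domains = {"x1": [1, 2, 3], "x2": [1, 2, 3], "x3": [1, 2, 3]}
constraints = [
    (("x1", "x2", "x3"), lambda a, b, c: len({a, b, c}) == 3),
    (("x1", "x2"), lambda a, b: a < b),
]

def solutions(variables, domains, constraints):
    """Enumerate sol(P) by checking every tuple of Dom against every constraint."""
    for values in product(*(domains[x] for x in variables)):
        assignment = dict(zip(variables, values))
        if all(rel(*(assignment[x] for x in scope)) for scope, rel in constraints):
            yield assignment

sol = list(solutions(variables, domains, constraints))

# The three questions mentioned in the text:
satisfiable = len(sol) > 0                 # is P consistent?
one_solution = sol[0] if sol else None     # a single solution
all_solutions = sol                        # all the solutions
```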


Constrained Optimization Problems
A constrained optimization problem (COP) $O = (X, D, \mathcal{C}, f)$
is a CSP $P = (X, D, \mathcal{C})$ with an associated
objective function $f : sol(P) \to E$, where $(E, \le)$
is a well-ordered set (typically, $E$ is one of the sets
$\mathbb{N}$, $\mathbb{Z}$, $\mathbb{R}$).

Unlike in the previous case, the tuples $d \in sol(O)$
that satisfy every constraint are called feasible
solutions, and the set of these tuples is usually assumed
to be non-empty. A solution of the COP $O$ is a feasible
solution $\bar{e} \in sol(O)$ for which the value of the objective
function $f$ is minimized, i.e.,


$$\forall d \in sol(O): \quad f(\bar{e}) \le f(d).$$
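
As a small worked instance of the COP definition, the following sketch enumerates the feasible solutions of a hypothetical two-variable problem and returns one that minimizes f. The problem data and the name `solve_cop` are assumptions made for this example; exhaustive enumeration is used only because the instance is tiny.

```python
from itertools import product

# Hypothetical COP: x, y in {0, ..., 4}, subject to x + y >= 4 and x != y,
# minimizing f(x, y) = 2x + 3y.
variables = ["x", "y"]
domains = {"x": range(5), "y": range(5)}

def feasible(a):
    """True if the assignment satisfies every constraint."""
    return a["x"] + a["y"] >= 4 and a["x"] != a["y"]

def f(a):
    """Objective function to be minimized."""
    return 2 * a["x"] + 3 * a["y"]

def solve_cop(variables, domains, feasible, f):
    """Return a feasible solution with minimal objective value, or None."""
    feasible_solutions = []
    for values in product(*(domains[x] for x in variables)):
        assignment = dict(zip(variables, values))
        if feasible(assignment):
            feasible_solutions.append(assignment)
    return min(feasible_solutions, key=f) if feasible_solutions else None

best = solve_cop(variables, domains, feasible, f)
```

Here the minimizing feasible solution is x = 4, y = 0 with objective value 8; every other feasible assignment has a value of at least 9.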


Observations 
A few observations about this formalism are worth noting.
First, notice that the general framework does not
impose any restriction on either the type of domains
and constraints or on the form of the objective
function that can be used to express the problem. The basic
type of domain is a finite set of integer values (also 
known as a finite domain), but there are other possibili- 
ties that enhance the expressive power of the modeling 
framework and capture some combinatorial substruc- 
tures of the problem more naturally. For example, it 
is possible to deal with variables whose values are fi- 
nite (multi)sets, (hyper)graphs, real valued intervals, or 
resources of a scheduling problem. Moreover, the
kinds of constraints that can be employed are also quite
rich and include arithmetic constraints, set constraints,
permutation, counting, and other types of combinatorial
constraints, resource scheduling constraints, path
constraints on graphs, and constraints expressible through
regular expressions, just to name a few possibilities
(see [62.24] for a comprehensive set of constraints and
their implementation in actual CP systems).

These features clearly make the modeling phase 
easier and more precise with respect to other for- 
malisms such as integer linear programming. Indeed, 
part of the combinatorial structure of the problem 
can be directly captured by the use of complex do- 
mains/constraints and, as for the objective function, 
there is no general limitation on its form, in particular, 
there is no assumption of linearity. 

Another important point to be noticed regards the
role of constraints. Unlike other modeling
formalisms, which distinguish between constraints that
must be satisfied (called hard constraints) and those that
should preferably be satisfied (soft constraints), in the
original CSP/COP framework constraints are all hard
and the solution methods, described in the following
section, consider it mandatory to satisfy all of them.
There have been several attempts in the CP literature to
include soft constraints in the general framework (see,
e.g., [62.25] for a review), but the most common way to
handle them is to include a measure of their violation in
the objective function of the problem.
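
The common treatment of soft constraints mentioned above, i.e., adding a measure of their violation to the objective function, can be sketched as follows. The weights, the violation measures, and the scheduling-flavored example are assumptions made for illustration only.

```python
def penalized_objective(assignment, cost, soft_constraints):
    """Original cost plus the weighted violation of each soft constraint.

    soft_constraints is a list of (weight, violation) pairs, where
    violation(assignment) returns 0 when the constraint is satisfied
    and a positive amount otherwise.
    """
    penalty = sum(weight * violation(assignment)
                  for weight, violation in soft_constraints)
    return cost(assignment) + penalty

# Hypothetical example: an activity with a start and an end time.
def duration(a):
    return a["end"] - a["start"]

soft_constraints = [
    (5, lambda a: max(0, 9 - a["start"])),   # preferably do not start before 9
    (2, lambda a: max(0, a["end"] - 17)),    # preferably end by 17
]

early = {"start": 8, "end": 16}
late = {"start": 9, "end": 18}

early_value = penalized_objective(early, duration, soft_constraints)
late_value = penalized_objective(late, duration, soft_constraints)
```

With these (assumed) weights, starting one hour too early is penalized more heavily than ending one hour too late, so the second assignment has the lower penalized value even though its duration is longer.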


62.2.2 Solution Methods 


CP solution methods basically exploit a form of tree 
search that interleaves a branching phase with an anal- 
ysis of constraints called constraint propagation. These 
two components are described in the following. 


Branching and Tree Search 
Once the combinatorial problem has been modeled as 
a CSP or a COP, CP solves it by constructing a solution 
by a process that exploits a non-deterministic variable
assignment, where one variable is selected together with
one value in its current domain. This phase is also called
labeling, using (constraint) logic programming
terminology, and a solution to the problem is a complete
labeling. The process proceeds by recursively checking
whether the current labeling can be extended to a
consistent solution or, in the negative case, undoing the
current assignment.

The pseudocode of the procedure, called (chrono- 
logical) backtracking, is given in Algorithm 62.1. 
The procedure is at first called with the full
set of variables and the empty labeling, i.e., as
Backtracking($X$, $\emptyset$, $\mathcal{C}$, $Dom$). The procedure performs
an implicit form of tree search, where a branch is
identified by the selection of one variable (a node of the
search tree) and all the possible values for that variable
(the edges).

Note that, at each step of the recursive procedure,
the choice of the variable and of the value to branch on is
non-deterministic. Therefore, heuristics can be used to
guide these choices and enhance performance.

In addition, there are other possibilities to define
a branching rule. For example, instead of selecting
a possible value for the variable selected (i.e., the
assignment $x_i := v$), the branching rule could split the
domain of a given variable $x_i$ in two by selecting a value
$v \in D_i$ and adding the constraint $x_i \le v$ on one branch
and $x_i > v$ on the other.


Consistency and Constraint Propagation 
The check for solution consistency does not need all 
the variables to be instantiated, in particular for de- 
tecting the unsatisfiability of the CSP with respect 
to some constraint. For example, in Algorithm 62.2, 
the most straightforward implementation of the proce- 
dure Consistent(L, C, Dom) is reported. The procedure 
simply checks whether the satisfiability of a given con- 
straint can be ascertained according to the current label- 
ing (i. e., if all of the constraint variables are assigned). 
However, the reasoning about the current labeling with 
respect to the constraints of the problem and the do- 
mains of the unlabeled variables does not necessarily 
need all the variables appearing in a constraint to be in- 
stantiated. Moreover, the analysis can prune (as a side 
effect) the domains of the unlabeled variables while 
preserving the set of solutions sol(P), making the ex- 
ploration of the subtree more effective. This phase is 
called constraint propagation and is interleaved with 
the variable assignment. In general, the analysis of each
constraint is repeated until a fixed point for the current
situation is reached. If one of the domains becomes
empty, consistency cannot be achieved
and, consequently, the procedure returns fail.

Different notions of consistency can be employed.
For example, one of the most common and most studied
notions is hyper-arc consistency [62.26]. For a k-ary
constraint C it checks the compatibility of a value v in
the domain of one of the variables with the currently
possible combinations of values for the remaining k − 1
variables, pruning v from the domain if no supporting
combination is found. The algorithms that maintain
hyper-arc consistency have a complexity that is
polynomial in the size of the problem (measured in terms of
number of variables/constraints and size of domains).
Other consistency notions have been introduced in the
literature, each having different pruning capabilities and
a different computational complexity, which usually
grows with the pruning effectiveness.
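
The support check behind hyper-arc consistency can be sketched as follows. This is a deliberately naive version written for this example: the names `revise` and `propagate`, the representation of domains as Python sets, and the representation of constraints as (scope, predicate) pairs are all assumptions, and the search for supports simply enumerates the combinations of the other variables, so it is only suitable for constraints of small arity. It also assumes that no variable appears twice in the same scope.

```python
from itertools import product

def revise(scope, relation, domains):
    """Prune values without support for one constraint; return True if a domain changed."""
    changed = False
    for i, x in enumerate(scope):
        others = [domains[y] for j, y in enumerate(scope) if j != i]
        for v in list(domains[x]):
            # v is supported if some combination of the other variables
            # satisfies the constraint together with x = v.
            supported = any(
                relation(*(combo[:i] + (v,) + combo[i:]))
                for combo in product(*others)
            )
            if not supported:
                domains[x].remove(v)
                changed = True
    return changed

def propagate(constraints, domains):
    """Repeat the analysis of each constraint until a fixed point is reached.

    Returns False as soon as a domain becomes empty (fail), True otherwise.
    """
    changed = True
    while changed:
        changed = False
        for scope, relation in constraints:
            if revise(scope, relation, domains):
                changed = True
            if any(len(domains[x]) == 0 for x in scope):
                return False
    return True

# Hypothetical example: x < y < z with all domains equal to {1, 2, 3}.
domains = {"x": {1, 2, 3}, "y": {1, 2, 3}, "z": {1, 2, 3}}
constraints = [
    (("x", "y"), lambda a, b: a < b),
    (("y", "z"), lambda a, b: a < b),
]
ok = propagate(constraints, domains)
```

In this small example propagation alone reduces every domain to a single value (x = 1, y = 2, z = 3), so no branching would be needed; in general propagation only prunes the domains and must be interleaved with tree search.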

One of the major drawbacks of (practical) consistency
notions is that they are local in nature, that is, they
just look at the current situation (partial labeling and
current domains). This means that they cannot
detect future inconsistencies due to the interaction
of variables. A basic technique, called forward checking,
can be used to mitigate this problem. This method
exploits a one-step look-ahead with respect to the current
assignment, i.e., it simulates the assignment of a pair
of variables, instead of a single one, thus evaluating the
next level of the tree through a consistency notion. This
technique can be generalized to several other problems.
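
A common way to realize forward checking for binary constraints is sketched below: after a tentative assignment, the values of the not yet labeled variables that are incompatible with it are removed, and a failure is reported as soon as one of these domains is wiped out. The function name, the argument layout, and the restriction to binary constraints are assumptions of this example, not a description of a specific CP system.

```python
def forward_check(var, value, domains, binary_constraints, assigned):
    """Return pruned copies of the domains after var := value, or None on a wipe-out.

    binary_constraints is a list of ((x, y), relation) pairs; assigned is the
    set of variables labeled before var.
    """
    new_domains = {x: set(d) for x, d in domains.items()}
    new_domains[var] = {value}
    for (x, y), rel in binary_constraints:
        if x == var and y not in assigned:
            new_domains[y] = {w for w in new_domains[y] if rel(value, w)}
            if not new_domains[y]:
                return None
        elif y == var and x not in assigned:
            new_domains[x] = {w for w in new_domains[x] if rel(w, value)}
            if not new_domains[x]:
                return None
    return new_domains

# Hypothetical example: three mutually different variables, two values only.
def different(u, w):
    return u != w

domains = {"a": {1, 2}, "b": {1, 2}, "c": {1, 2}}
constraints = [(("a", "b"), different), (("b", "c"), different), (("a", "c"), different)]

after_a = forward_check("a", 1, domains, constraints, assigned=set())
after_b = forward_check("b", 2, after_a, constraints, assigned={"a"})
```

After labeling a with 1, the domains of b and c are both reduced to {2}, but the conflict between b and c is not yet visible, which illustrates the local nature of the check; it is detected one level later, when labeling b empties the domain of c.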


Algorithm 62.1 Backtracking(U, L, C, Dom)
  if U = ∅ then
    return L
  end if
  pick variable x_i ∈ U
      /* possibly x_i is selected non-deterministically */
  for v ∈ D_i do   /* try to label x_i with value v */
    Dom′ ← Dom
    if Consistent(L ∪ {x_i := v}, C, Dom′) then
        /* consistency notions can be different and have
           side effects on Dom′ */
      r ← Backtracking(U \ {x_i}, L ∪ {x_i := v}, C, Dom′)
      if r ≠ fail then
        return r   /* a consistent assignment has been found for
                      the variables in U \ {x_i} with respect to
                      x_i := v */
      end if
    end if
  end for
  return fail   /* backtrack to the previous variable (no
                   consistent assignment has been found for x_i) */


Algorithm 62.2 Consistent(L, C, Dom)
  for c ∈ C do
    if all variables in c are labeled in L ∧ c is not
       satisfied by L then
      return fail
    end if
  end for
  return true
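
Algorithms 62.1 and 62.2 can be transcribed into Python roughly as follows. This is a sketch under simplifying assumptions made for this example: constraints are (scope, predicate) pairs, the next variable is simply the first unlabeled one, values are tried in domain order, and the consistency check has no side effects on the domains (no propagation). The n-queens model used to exercise it is likewise only an illustration.

```python
FAIL = None  # sentinel standing for "fail" in the pseudocode

def consistent(labeling, constraints):
    """Algorithm 62.2: check only the constraints whose variables are all labeled."""
    for scope, relation in constraints:
        if all(x in labeling for x in scope):
            if not relation(*(labeling[x] for x in scope)):
                return False
    return True

def backtracking(unlabeled, labeling, constraints, domains):
    """Algorithm 62.1: chronological backtracking over the unlabeled variables."""
    if not unlabeled:
        return labeling
    x = unlabeled[0]  # fixed choice; a heuristic could select another variable
    for v in domains[x]:  # try to label x with value v
        extended = {**labeling, x: v}
        if consistent(extended, constraints):
            result = backtracking(unlabeled[1:], extended, constraints, domains)
            if result is not FAIL:
                return result
    return FAIL  # no consistent value for x: backtrack

def n_queens(n):
    """Hypothetical model: variable q is the column, its value is the row."""
    variables = list(range(n))
    domains = {q: list(range(n)) for q in variables}
    constraints = [
        ((i, j), lambda a, b, d=j - i: a != b and abs(a - b) != d)
        for i in range(n) for j in range(i + 1, n)
    ]
    return backtracking(variables, {}, constraints, domains)

solution_4 = n_queens(4)
solution_3 = n_queens(3)  # the 3-queens problem has no solution
```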




Algorithm 62.3 BranchAndBound(U, L, C, Dom, f, b, L_b)
  if U = ∅ then
    if f(L) < b then
      b ← f(L); L_b ← L
    end if
  else
    pick variable x_i ∈ U
        /* possibly x_i is selected non-deterministically */
    for v ∈ D_i do   /* try to label x_i with value v */
      Dom′ ← Dom
      if Consistent(L ∪ {x_i := v}, C, Dom′) ∧
         bound(f, L ∪ {x_i := v}, Dom′) < b then
          /* additionally verify whether the current
             solution is bounded */
        BranchAndBound(U \ {x_i}, L ∪ {x_i := v}, C, Dom′, f, b, L_b)
      end if
    end for
  end if


Branch and Bound

In the case of a COP, the problem is solved by exploring
the set $sol(O)$ in the way described above, storing the best
value of $f$ found so far, as sketched in Algorithm 62.3. In
addition, a constraint analysis ($bound(f, L \cup \{x_i := v\}, Dom')$)
based on a partial assignment and on the best value
already computed may allow the search tree to be pruned
considerably. This complete search heuristic is called
(with a slight ambiguity with respect to the same
concept in operations research) branch and bound.
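
A possible Python rendering of Algorithm 62.3 is sketched below. Several simplifications are assumptions of this example: the objective is a separable sum of per-variable costs (so that a simple optimistic bound can be computed), the incumbent value b and labeling L_b are kept in a mutable dictionary, and no constraint propagation is performed.

```python
import math

def consistent(labeling, constraints):
    """Check the constraints whose variables are all labeled (as in Algorithm 62.2)."""
    return all(
        relation(*(labeling[x] for x in scope))
        for scope, relation in constraints
        if all(x in labeling for x in scope)
    )

def bound(labeling, unlabeled, domains, cost):
    """Optimistic (lower) bound: cost so far plus the cheapest value of each free variable."""
    assigned = sum(cost[x][v] for x, v in labeling.items())
    optimistic = sum(min(cost[x][v] for v in domains[x]) for x in unlabeled)
    return assigned + optimistic

def branch_and_bound(unlabeled, labeling, constraints, domains, cost, best):
    """Explore the search tree, updating best['value'] and best['labeling'] in place."""
    if not unlabeled:
        value = sum(cost[x][v] for x, v in labeling.items())
        if value < best["value"]:
            best["value"], best["labeling"] = value, dict(labeling)
        return
    x, rest = unlabeled[0], unlabeled[1:]
    for v in domains[x]:
        extended = {**labeling, x: v}
        # Descend only if the labeling is consistent and may still improve on b.
        if consistent(extended, constraints) and \
                bound(extended, rest, domains, cost) < best["value"]:
            branch_and_bound(rest, extended, constraints, domains, cost, best)

# Hypothetical assignment problem: workers a, b, c take distinct tasks 0, 1, 2.
variables = ["a", "b", "c"]
domains = {x: [0, 1, 2] for x in variables}
cost = {
    "a": {0: 1, 1: 5, 2: 9},
    "b": {0: 4, 1: 2, 2: 8},
    "c": {0: 7, 1: 6, 2: 3},
}
def different(u, w):
    return u != w
constraints = [(("a", "b"), different), (("a", "c"), different), (("b", "c"), different)]

best = {"value": math.inf, "labeling": None}
branch_and_bound(variables, {}, constraints, domains, cost, best)
```

Once the first complete labeling with value 6 has been stored, every other branch of this instance is cut by the bound test before reaching a leaf.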


62.2.3 Systems 


A number of practical CP systems are available.
They mostly differ with regard to the targeted
programming language and the modeling features available.
For historical reasons, the first constraint programming
systems were built around a Prolog system.
For example, SICStus Prolog [62.27] was one of
the first logic programming systems supporting
constraint programming; it is still developed and
released under a commercial license. Another
Prolog-based system specifically intended for constraint
programming is ECLiPSe [62.28], which, unlike
SICStus Prolog, is open source. Thanks to their
longevity, both systems cover many of the modeling
features described in the previous sections (such
as different types of domains, rich sets of constraints,
etc.).

Another notable commercial system specifically
designed for constraint programming is the ILOG CP
Optimizer, now developed by IBM [62.29]. This system
offers modeling capabilities either by means of
a dedicated modeling language (called OPL [62.30]) or
by means of a callable library accessible from different
imperative programming languages such as C/C++,
Java, and C#. The modeling capabilities of the system
are mostly targeted at scheduling problems, featuring
a very rich set of constructs for this kind of problem.
Interestingly, this system is currently available
at no cost for researchers through the IBM Academic
Initiative.

Open source alternatives that can be interfaced with 
the most common programming languages are the C++ 
libraries of Gecode [62.31], and the Java libraries of 
Choco [62.32]. Both systems are well documented and 
constantly developed. 

A different approach has been taken by other authors,
who developed a number of modeling languages
for constraint satisfaction and optimization problems
that can be interfaced to different types of
general-purpose CP solvers. A notable example is MiniZinc
[62.33], which is an expressive modeling language for
CP. MiniZinc models are translated into a lower-level
language, called FlatZinc, that can be compiled and
executed, for example, by Gecode, ECLiPSe, or SICStus
Prolog.

Finally, a mixed approach has been taken by the 
developers of Comet [62.34]. Comet is a hybrid CP 
system featuring a specific programming/modeling lan- 
guage and a dedicated solver. The system has been 
designed with hybridization in mind and, among other 
features, it natively supports the integration of meta- 
heuristics (especially in the family of local search 
methods) with CP. 
Unlike Wallace [62.21], we review the
integration of CP with metaheuristics from the perspective
of metaheuristics, and we classify the approaches
on the basis of the type of metaheuristic employed.
Moreover, following the categorization of Puchinger
and Raidl [62.18], we are mostly interested in reviewing
the integrative combinations of metaheuristics and
constraint programming, i.e., those in which constraint
programming is embedded as a component of a
metaheuristic to solve a subproblem or vice versa.

Indeed, the types of collaborative combinations are 
either straightforward (e.g., collaborative-sequential ap- 
proaches using CP as a constructive algorithm for 
finding a feasible initial solution of a problem) or rather 
uninvestigated (e.g., parallel or intertwined hybrids of 
metaheuristics and CP). 


62.3.1 Local Search and CP 


Local search methods [62.4] are based on an iterative
scheme in which the search moves from the current
solution to an adjacent one on the basis of the exploration
of a neighborhood obtained by perturbing the current
solution.

The hybridization of constraint programming with 
local search metaheuristics is the most studied one and 
there is an extensive literature on this subject. 


CP Within Local Search 

The integration of CP within local search methods is
the most mature form of integration. It dates back
to the mid 1990s [62.35], and two main streams can be
identified in this respect. The first one consists in
defining the search of the candidate neighbor (e.g.,
the best one) as a constrained optimization problem.
The neighborhoods induced by these definitions can
be quite large; therefore, a variant of this technique
is known by the name of large neighborhood search
(LNS) [62.36]. The other kind of integration, later
named constraint-based local search (CBLS) [62.34],
is based on the idea of expressing local search
algorithms by exploiting constraint programming primitives
in their control (e.g., for constraint checks during the
exploration of the neighborhood) [62.37]. In fact, the
two streams have a non-empty intersection, since the
CP primitives employed in CBLS could be used to
explore the neighborhood in an LNS fashion. In the
following sections we review some of the work in these
two areas.


A few surveys on the specific integration between 
local search and constraint programming exist, for ex- 
ample [62.38, 39]. 


Large Neighborhood Search. In LNS [62.36, 40] an
existing solution is not modified by applying just small
perturbations; instead, a large part of the problem
is perturbed and searched for improving solutions
in a sort of re-optimization approach. This part can be
represented by a set $F \subseteq X$ of released variables, called
fragment, which determines the neighborhood relation
$\mathcal{N}$. Precisely, given a solution $s = (d_1, \dots, d_k)$ and a set
$F \subseteq \{x_1, \dots, x_k\}$ of free variables, then


$$\mathcal{N}(s, F) = \{(e_1, \dots, e_k) \in sol(O) : (x_i \notin F) \rightarrow (e_i = d_i)\}.$$


Given F, the neighborhood exploration is performed 
through CP methods (i. e., propagation and tree search). 

The pseudocode of the general LNS procedure is 
shown in Algorithm 62.4. Notice that in the proce- 
dure there are a few hotspots that can be customized. 
Namely, one of the key issues of this technique con- 
sists in the criterion for the selection of the set F 
given the current solution s, which is denoted by 
SelectFragment(s) in the algorithm. The most straight- 
forward way to select it is to randomly release a per- 
centage of the problem variables. However, the vari- 
ables in F could also be chosen in a structured 
way, i. e., by releasing related variables simultaneously. 
In [62.41], the authors compare the effectiveness of 
these two alternative choices in the solution of a job- 
shop scheduling problem. 

Also the upper bounds employed for the branch and
bound procedure can be subject to a few design
alternatives. A possibility, for example, is to set the bound
value to $f(s_b)$, the value of the best solution found so far, so
that the procedure is forced to search at each step only
for improving solutions. This alternative can enhance
the technique when the propagation on the cost
functions is particularly effective in pruning the domains
of the released variables. At the opposite extreme,
the upper bound could be set to an infinite value
so that a solution is searched for regardless of whether or not
it improves the cost function with respect to the
current incumbent.

Moreover, another design point is the solution
acceptance criterion, which is implemented by the
AcceptSolution function. In general, all the classical
local search solution acceptance criteria are applicable,
depending on the neighborhood selection
criterion employed. For example, in the case of
randomly released variables a Metropolis acceptance
criterion could be adequate to implement a sort of
simulated annealing.

Finally, the TerminateSearch criterion is one of 
those usually adopted in non-systematic search meth- 
ods, such as the expiration of a time/iteration budget, 
either absolute or relative, or the discovery of an opti- 
mal solution. 


Algorithm 62.4 LargeNeighborhoodSearch (X, C, Dom, f)
 1: create a (feasible) initial solution s_0 = (d_1^0, ..., d_k^0)
    /* possibly random or finding the first feasible
       solution of the full CP model */
 2: s_b ← s_0
 3: i ← 0
 4: while not TerminateSearch(i, s_i, f(s_i), s_b) do
 5:   F ← SelectFragment(s_i)
      /* strategy for selecting the released variables */
 6:   L ← {X_j = d_j^i : X_j ∉ F}
 7:   U ← F
 8:   BranchAndBound(U, L, C, Dom, f, ChooseBounds(s_i, s_b))
      /* neighborhood exploration */
 9:   if AcceptSolution(L) then
10:     s_{i+1} ← L
11:     if f(s_{i+1}) < f(s_b) then
12:       s_b ← s_{i+1}
13:     end if
14:   else
15:     s_{i+1} ← s_i
16:   end if
17:   i ← i + 1
18: end while
19: return s_b
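
The loop of Algorithm 62.4 can be illustrated with a minimal Python sketch. It makes several simplifying assumptions that are not part of the chapter: the fragment is chosen uniformly at random, the bound is always $f(s_b)$, a neighbor is accepted whenever one is found, and exhaustive enumeration of the released variables stands in for CP propagation and branch and bound. The function name, its parameters, and the toy problem are hypothetical.

```python
import itertools
import random


def large_neighborhood_search(domains, feasible, cost, initial,
                              fragment_size=2, iterations=500, seed=0):
    """Toy LNS loop mirroring Algorithm 62.4 (illustrative sketch only)."""
    rng = random.Random(seed)
    current = list(initial)          # s_i
    best = list(initial)             # s_b
    n = len(current)
    for _ in range(iterations):      # TerminateSearch: fixed iteration budget
        # SelectFragment: release a random subset of the variables.
        fragment = rng.sample(range(n), fragment_size)
        # ChooseBounds: only values strictly better than f(s_b) are of interest.
        bound = cost(best)
        candidate = None
        # Neighborhood exploration: enumerating the released variables
        # stands in for CP propagation and branch and bound.
        for values in itertools.product(*(domains[j] for j in fragment)):
            trial = list(current)
            for j, value in zip(fragment, values):
                trial[j] = value
            if feasible(trial) and cost(trial) < bound:
                candidate, bound = trial, cost(trial)
        # AcceptSolution: keep the neighbor only if an improving one exists.
        if candidate is not None:
            current = candidate
            best = list(candidate)
    return best


# Toy problem (assumed): four all-different variables, weighted-sum cost.
weights = [1, 2, 3, 4]
solution = large_neighborhood_search(
    domains=[range(4)] * 4,
    feasible=lambda s: len(set(s)) == len(s),
    cost=lambda s: sum(w * v for w, v in zip(weights, s)),
    initial=[0, 1, 2, 3],
)
```

Because only improving neighbors are accepted here, the current solution and the incumbent coincide; a Metropolis-style acceptance rule would separate the two.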


LNS has been successfully applied to routing problems
[62.36, 42-45], nurse rostering [62.46], university
course timetabling [62.47], protein structure prediction
[62.48, 49], and car sequencing [62.50].

Cipriano et al. propose GELATO, a modeling
language and a hybrid solver specifically designed for 
LNS [62.51-53]. The system has been tested on a set 
of benchmark problems, such as the asymmetric travel- 
ing salesman problem, minimum energy broadcast, and 
university course timetabling. 

The development of the LNS technique in the
wider perspective of very large neighborhood search
(VLNS) was recently reviewed by Pisinger and
Ropke [62.54]. Carchrae and Beck [62.55] also propose
a methodological contribution to this area with
some design principles for LNS.


Constraint-Based Local Search. The idea of encod- 
ing a local search algorithm by means of constraint 
programming primitives was originally due to Pesant 
and Gendreau [62.35, 56], although in their papers they 
focus on a framework that allows neighborhoods to be 
expressed by means of CP primitives. The basic idea 
is to extend the original CP model of the problem with 
a sort of surrogate model comprising a set of variables
and constraints that intensionally describe a neighborhood
of the current solution.

A pseudocode of CBLS defined along these lines
is reported in Algorithm 62.5. The core of the procedure
is at line 5, which determines the neighborhood
model on the basis of the current solution. The main
components of the neighborhood model are the new set
of variables $Y$ and the constraints $C_{X,Y}$ that act as an
interface of the neighborhood variables $Y$ with respect to
those of the original problem $X$. For example, the classical
swap neighborhood, which perturbs the value of two
variables of the problem by exchanging their values,
can be modeled by the set $Y = \{y_1, y_2\}$, consisting of the
variables to exchange, and with the interface constraints

$$(y_1 = i \land y_2 = j) \iff (s'_i = s_j \land s'_j = s_i) \quad \forall i, j \in \{1, \ldots, n\},$$

where $s'$ denotes the neighboring solution obtained from the
current solution $s$.

Moreover, an additional component of the neighborhood
model is the evaluator of the move impact $\Delta f$,
which can usually be computed incrementally on the
basis of the single move.

It is worth noticing that the use of different mod- 
eling viewpoints is common practice in constraint pro- 
gramming. In classical CP modeling the different view- 
points usually offer a convenient way to express some 
constraint in a more concise or more efficient manner. 
The consistency between the viewpoints is maintained 
through the use of channeling constraints that link the 
different modelings. Similarly, although with a different 
purpose, in CBLS the linking between the full problem 
model and the neighborhood model is achieved through 
interface constraints. 


Algorithm 62.5 ConstraintBasedLocalSearch (X, C_X, Dom_X, f)
 1: create a (feasible) initial solution s_0 = (d_1^0, ..., d_k^0)
    /* possibly random or finding the first feasible
       solution of the original CP model */
 2: s_b ← s_0
 3: i ← 0
 4: while not TerminateSearch(i, s_i, f(s_i), s_b) do
 5:   (Y, C_{X,Y}, Dom_Y, Δf) ← NeighborhoodModel(s_i)
 6:   L ← ∅
 7:   U ← Y
 8:   BranchAndBound(U, L, C_{X,Y}, Dom_Y, Δf)
      /* neighborhood exploration */
 9:   if AcceptSolution(L) then
10:     s_{i+1} ← Apply(L, s_i)
11:     if f(s_{i+1}) < f(s_b) then
12:       s_b ← s_{i+1}
13:     end if
14:   else
15:     s_{i+1} ← s_i
16:   end if
17:   i ← i + 1
18: end while
19: return s_b
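
As a rough illustration of the swap neighborhood and of an incrementally computed move impact, the following Python sketch explores all swaps of a solution and applies the best one until no swap improves. It is only a sketch under stated assumptions: the cost function (a weighted sum), the best-improvement acceptance rule, and the names `delta_swap` and `swap_local_search` are hypothetical and are not taken from Comet or from the cited papers.

```python
def delta_swap(s, w, i, j):
    """Change in f(s) = sum(w[p] * s[p]) if s[i] and s[j] are exchanged.

    Only the two affected terms are inspected, so the evaluation is O(1)
    instead of recomputing the whole cost.
    """
    return (w[i] - w[j]) * (s[j] - s[i])


def swap_local_search(s, w):
    """Best-improvement descent over the swap neighborhood."""
    s = list(s)
    n = len(s)
    while True:
        # Neighborhood exploration: the move variables (y1, y2) = (i, j)
        # range over all pairs of positions.
        delta, i, j = min(
            ((delta_swap(s, w, i, j), i, j)
             for i in range(n) for j in range(i + 1, n)),
            default=(0, 0, 0),
        )
        if delta >= 0:              # no improving neighbor: local optimum
            return s
        s[i], s[j] = s[j], s[i]     # Apply(L, s_i)


def full_cost(s, w):
    """Non-incremental cost, used only to cross-check delta_swap."""
    return sum(wp * sp for wp, sp in zip(w, s))
```

For example, `swap_local_search([0, 1, 2, 3], [1, 2, 3, 4])` ends in the decreasing arrangement, since any pair that is still in increasing order yields an improving swap under this assumed cost.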


This stream of research has been revamped thanks
to the design of the Comet language [62.34, 57], the aim
of which is specifically to support declarative components
inspired by CP primitives for expressing local
search algorithms. An example of such primitives is
differentiable invariants [62.58], which are declarative
data structures that support incremental differentiation
to effectively evaluate the effect of local moves (i.e., the
$\Delta f$ in Algorithm 62.5). Moreover, Comet supports control
abstractions [62.59, 60] specifically designed for
local search, such as the neighbors construct, which
aims at expressing the unions of heterogeneous neighborhoods.
Finally, Comet has also been extended to
support distributed computing [62.61].

The embedding of local search within a constraint
programming environment and the employment of
a common programming language make it possible
to automate the synthesis of CBLS algorithms from
a high-level model expressed in Comet [62.62, 63]. The
synthesizer analyzes the combinatorial structure of the 
problem, expressed through the variables and the con- 
straints, and combines a set of basic recommendations, 
which are the basic constituents of the synthesized 
algorithm. 


Other Integrations. The idea of exploring with local
search a space of incomplete solutions (i.e., those
where not all variables have been assigned a value)
exploiting constraint propagation has been pursued,
among others, by Jussien and Lhomme [62.64] for
an open-shop scheduling problem. Constraint propagation
employed in the spirit of forward checking
and, more in general, look-ahead has been effectively
employed, among others, by Schaerf [62.65] and Prestwich
[62.66], respectively, for scheduling and graph
coloring problems.


Local Search Within CP 

Moving to the integration of local search within con- 
straint programming, the most common utilization of 
local search-like techniques consists in limiting the ex- 
ploration of the tree search only to paths that are “close” 
to a reference one. An example of such a procedure is 
limited discrepancy search (LDS) [62.67], an incom- 
plete method for tree search in which only neighboring 
paths of the search tree are explored, where the proxim- 
ity is defined in terms of different decision points called 
discrepancies. Only the paths (i. e., complete solutions) 
with at most k discrepancies are considered, as outlined 
in Algorithm 62.6. 


Algorithm 62.6 LimitedDiscrepancySearch (X, C, Dom, f, k)
 1: s* ← FirstSolution(X, C, Dom, f)
 2: s_b ← s*
 3: for i ∈ {1, ..., k} do
 4:   for t ∈ {s : s differs w.r.t. s* for exactly i variables} do
 5:     if Consistent(t, Dom) ∧ f(t) < f(s_b) then
 6:       s_b ← t
 7:     end if
 8:   end for
 9: end for
10: return s_b
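
Algorithm 62.6 can be transcribed almost literally into Python. The sketch below enumerates, for each discrepancy level, every assignment that differs from the reference solution in exactly that many variables; it is a direct and deliberately naive reading of the outline above, not the tree-based implementation used in CP solvers. The names `limited_discrepancy_search`, `consistent`, and `cost`, and the toy problem, are assumptions made for the example.

```python
from itertools import combinations, product


def limited_discrepancy_search(reference, domains, consistent, cost, k):
    """Return the best consistent assignment within k discrepancies."""
    best = tuple(reference)                       # s_b starts as s*
    for i in range(1, k + 1):
        # Choose which i variables deviate from the reference solution.
        for positions in combinations(range(len(reference)), i):
            # Each chosen variable must take a value other than its
            # reference value, so exactly i variables differ.
            alternatives = [
                [v for v in domains[p] if v != reference[p]]
                for p in positions
            ]
            for values in product(*alternatives):
                t = list(reference)
                for p, v in zip(positions, values):
                    t[p] = v
                t = tuple(t)
                if consistent(t) and cost(t) < cost(best):
                    best = t
    return best


# Toy problem (assumed): all-different variables, weighted-sum cost.
weights = [1, 2, 3, 4]
domains = [range(4)] * 4
all_different = lambda s: len(set(s)) == len(s)
weighted_sum = lambda s: sum(w * v for w, v in zip(weights, s))

within_two = limited_discrepancy_search(
    (0, 1, 2, 3), domains, all_different, weighted_sum, k=2)
within_four = limited_discrepancy_search(
    (0, 1, 2, 3), domains, all_different, weighted_sum, k=4)
```

With at most two discrepancies only a single swap of the reference is reachable, whereas four discrepancies allow the fully reversed assignment.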


Another approach due to Prestwich [62.68] is called 
incomplete dynamic backtracking. Differently from 
LDS, in this approach proximity is defined among par- 
tial solutions, and when backtracking needs to take 
place it is executed by randomly unassigning (at most) 
b variables. This way, the method can be seen as
a local search on partial solutions. In fact, the method
also features other CP machinery, such as forward 
checking, which helps in boosting the search. 

An alternative possibility is to employ local search 
in constraint propagation. Local probing [62.69, 70] is 
based on the partition of constraints into the set of
easy and hard ones. At each choice point in the search
tree the set of easy constraints is dealt with by a local
search metaheuristic (namely simulated annealing),
while the hard constraints are considered by classical
constraint propagation. This idea generalizes the
approach of Zhang and Zhang [62.71], who first pre- 
sented such a combination. Another similar approach 
was taken by Sellmann and Harvey [62.72], who used 
local search to propagate redundant constraints. 

In [62.73] the authors discuss the incorporation of 
the tabu search machinery within CP tree search. In 
particular, they look at the memory mechanisms for 
limiting the size of the tree and the elite candidate list 
for keeping the most promising choices in order to be 
evaluated first. 


62.3.2 Genetic Algorithms and CP 


A genetic algorithm [62.5] is an iterative metaheuris- 
tic in which a population of strings, which represent 
candidate solutions, evolves toward better solutions in 
a process that mimics natural evolution. The main com- 
ponents of the evolution process are crossover and 
mutation operators, which, respectively, combine two 
parent solutions generating an offspring and mutate 
a given solution. Another important component is the 
strategy for the offspring selection, which determines 
the population at the next iteration of the process. 

To the best of our knowledge, one of the first at- 
tempts to integrate constraint programming and genetic 
algorithms is due to Barnier and Brisset [62.74]. They 
employ the following genetic representation: given
a CSP with variables $\{X_1, \ldots, X_k\}$, the $i$-th gene in
the chromosomes is related to the variable $X_i$ and it
stores a subset of the domain $D_i$ that is allowed to be
searched. Each chromosome is then decoded by CP,
which searches for the best solution of the sub-CSP in- 
duced by the restrictions in the domains. The genetic 
operators used are a mutation operator that changes val- 
ues on the subdomain of randomly chosen genes and 
a crossover operator that is based on a recombination of 
the set-union of the subdomains of each pair of genes. 
The method was applied to a vehicle routing problem 
and outperformed both a CP and a GA solver. 
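
The sub-domain representation described above can be sketched in Python as follows. This is an illustrative reading rather than the implementation of [62.74]: exhaustive enumeration replaces the CP search used for decoding, and the crossover shown (each child gene is a random non-empty subset of the union of the parents' genes) is only one plausible recombination of the set-union. All function names are hypothetical.

```python
import itertools
import random


def decode(chromosome, feasible, cost):
    """Best solution of the sub-CSP induced by the gene sub-domains.

    Returns None when the restricted domains admit no feasible solution.
    Enumeration stands in for the CP search of the original method.
    """
    best = None
    for s in itertools.product(*chromosome):
        if feasible(s) and (best is None or cost(s) < cost(best)):
            best = s
    return best


def crossover(parent_a, parent_b, rng):
    """Child gene i is a random non-empty subset of gene_a[i] | gene_b[i]."""
    child = []
    for gene_a, gene_b in zip(parent_a, parent_b):
        union = sorted(set(gene_a) | set(gene_b))
        size = rng.randint(1, len(union))
        child.append(sorted(rng.sample(union, size)))
    return child


# Hypothetical example: three variables, all-different constraint.
rng = random.Random(1)
parent_a = [[0, 1], [2], [3]]
parent_b = [[1], [0, 2], [2, 3]]
child = crossover(parent_a, parent_b, rng)
decoded = decode([[0, 1], [0, 1], [2]],
                 feasible=lambda s: len(set(s)) == len(s),
                 cost=lambda s: s[0])
```

The fitness of a chromosome is then the cost of its decoded solution, so the genetic search operates on regions of the search space rather than on single assignments.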

A different approach, somewhat similar to local 
probing, was used in [62.75] for tackling a production 
scheduling problem. In this case, the problem variables 
are split into two sets, defining two coupled subprob- 
lems. The first set of variables is dealt with by the 
genetic algorithm, which determines a partial schedule. 
This partial solution is then passed to CP for complet- 
ing (and optimizing) the assignment of the remaining 
variables. 

Finally, CP has been used as a post-processing
phase for optimizing the current population in the spirit
of memetic algorithms. In [62.76] CP actually acts
as an infeasibility repairing method for a university
course timetabling problem, whereas in [62.77] the
optimization performed by CP on a flow-shop scheduling
problem is an alternative to the classical local search
applied in memetic algorithms. This approach is
illustrated in Algorithm 62.7.


Algorithm 62.7 A Memetic Algorithm with CP for
Flow-Shop Scheduling (adapted from [62.77])
 1: generate an initial population P = {p_1, ..., p_{|P|}} of
    permutations of n jobs (each composed of k tasks
    T_{ij} whose start time and end time are denoted by σ_{ij}
    and η_{ij}, respectively)
 2: g ← 0
 3: while not TerminateSearch(g, P, min_{p∈P} f(p)) do
 4:   select p_1 and p_2 from P by binary tournament
 5:   c ← p_1 ⊗ p_2   /* apply crossover */
 6:   if f(c) > min_{p∈P} f(p) then
 7:     mutate c under probability p_m
 8:   end if
 9:   decode c = (c_1, ..., c_n) to the set of precedence
      constraints C = {η_{k,c_j} ≤ σ_{1,c_{j+1}} : j = 1, ..., n−1}
10:   L ← ∅
11:   U ← {σ_{ij}, η_{ij} : i = 1, ..., k, j = 1, ..., n}
12:   BranchAndBound(U, L, C ∪ {η_{ij} ≤ σ_{i+1,j} : i = 1, ..., k−1,
      j = 1, ..., n}, Dom, f)
13:   if f(c) > max_{p∈P} f(p) then
14:     discard c
15:   else
16:     select r by reverse binary tournament
17:     c replaces r in P
18:   end if
19:   g ← g + 1
20: end while
21: return arg min_{p∈P} f(p)


62.3.3 ACO and CP 


Ant colony optimization [62.6] is an iterative construc- 
tive metaheuristic, inspired by ant foraging behavior. 
The ACO construction process is driven by a proba- 
bilistic model, based on pheromone trails, which are 
dynamically adjusted by a learning mechanism. 

The first attempt to integrate ACO and CP is due 
to Meyer and Ernst [62.78], who apply the method for 
solving a job-shop scheduling problem. The proposed 
procedure employs ACO to learn the variable and value 
ordering used by CP for branching in the tree search. 
The solutions found by the CP procedure are fed back
to the ACO in order to update its probabilistic model.
In this approach, ACO can be conceived as a master 
online-learning branching heuristic aimed at enhancing 
the performance of a slave CP solver. 

A slightly different approach was taken by
Khichane et al. [62.79, 80]. Their hybrid algorithm
works in two phases. At first, CP is employed to sample 
the space of feasible solutions, and the information col- 
lected is processed by the ACO procedure for updating 
the pheromone trails according to the solutions found 
by CP. In the second phase, the learned pheromone in- 
formation is employed as the value ordering used for 
CP branching. This approach, differently from the pre- 
vious one, uses the learning capabilities of ACO in an 
offline-learning fashion. 

More standard approaches in which CP is used to 
keep track of the feasibility of the solution constructed 
by ACO and to reduce the domains through constraint 
propagation have been used by a number of authors. 
Khichane et al. apply this idea to job-shop scheduling
[62.78] and car sequencing [62.79, 81]. Their general
idea is outlined in Algorithm 62.8, where each ant
maintains a partial assignment of values to variables.
The choice to extend the partial assignment with a new
variable/value pair is driven by the pheromone trails
and the heuristic factors in lines 7-8 through a standard
probabilistic selection rule. Propagation is employed at
line 10 to prune the possible values for the variables not
included in the current assignment.
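
The interplay between pheromone-guided value selection and propagation in Algorithm 62.8 can be sketched in Python as below. Several choices here are assumptions made for brevity and are not taken from [62.79]: variables are selected by smallest remaining domain rather than by pheromone, no heuristic factor is used, propagation is plain forward checking on a binary `conflicts` predicate, and the pheromone update rewards the longest partial assignment of each cycle. All names and parameter values are hypothetical.

```python
import random


def ant_construct(domains, conflicts, pheromone, rng):
    """One ant: extend an assignment, pruning domains by forward checking."""
    live = {x: set(d) for x, d in domains.items()}
    assignment = {}
    while len(assignment) < len(domains):
        # Variable choice (assumed): smallest live domain, ties by name.
        x = min((y for y in domains if y not in assignment),
                key=lambda y: (len(live[y]), y))
        if not live[x]:
            return assignment, False          # Failure: domain wiped out
        values = sorted(live[x])
        weights = [pheromone[(x, v)] for v in values]
        v = rng.choices(values, weights=weights)[0]
        assignment[x] = v
        # Propagate: drop values of unassigned variables that conflict.
        for y in domains:
            if y not in assignment:
                live[y] = {u for u in live[y] if not conflicts(x, v, y, u)}
    return assignment, True


def ant_cp(domains, conflicts, n_ants=5, cycles=100, rho=0.1,
           tau_min=0.01, tau_max=1.0, seed=0):
    """Repeat ant construction until a complete assignment is found."""
    rng = random.Random(seed)
    pheromone = {(x, v): tau_max for x, d in domains.items() for v in d}
    for _ in range(cycles):
        results = [ant_construct(domains, conflicts, pheromone, rng)
                   for _ in range(n_ants)]
        for assignment, complete in results:
            if complete:
                return assignment
        # Evaporate, then reward the longest partial assignment.
        longest = max((a for a, _ in results), key=len)
        for key in pheromone:
            pheromone[key] = max(tau_min, (1 - rho) * pheromone[key])
        reward = 1.0 / (1 + len(domains) - len(longest))
        for x, v in longest.items():
            pheromone[(x, v)] = min(tau_max, pheromone[(x, v)] + reward)
    return None


# Hypothetical example: 2-color the cycle 0-1-2-3-0.
edges = {(0, 1), (1, 2), (2, 3), (3, 0)}


def conflicts(x, v, y, u):
    return v == u and ((x, y) in edges or (y, x) in edges)


coloring = ant_cp({i: [0, 1] for i in range(4)}, conflicts)
```

Because every value chosen by an ant has survived propagation from all earlier choices, any complete assignment returned is consistent with the binary constraints.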


Another work along this line is due to Benedettini
et al. [62.82], who integrate a constraint propagation
phase for Boolean constraints to boost an ACO approach
for a bioinformatics problem (namely, haplotype
inference). Finally, in the same spirit of the previous idea,
Crawford et al. [62.83, 84] employ a look-ahead
technique within ACO and apply the method to solve set
covering and set partitioning problems.


Algorithm 62.8 Ant Constraint Programming
(adapted from [62.79])
 1: initialize all pheromone trails to τ_max
 2: g ← 0
 3: repeat
 4:   for k ∈ {1, ..., n} do
 5:     A_k ← ∅
 6:     repeat
 7:       select a variable x_j ∈ X so that x_j ∉ var(A_k)
          according to the pheromone trail τ_j
 8:       choose a value v ∈ D_j according to the
          pheromone trail τ_{jv} and a heuristic factor η_{jv}
 9:       add {x_j := v} to A_k
10:       Propagate(A_k, C)
11:     until var(A_k) = X or Failure
12:     update pheromone trails using {A_1, ..., A_n}
13:   end for
14: until var(A_i) = X for some i ∈ {1, ..., n} or
    TerminateSearch(g, A_i)


62.4 Conclusions


In this chapter we have reviewed the basic concepts of
constraint programming and its integration with
metaheuristics. Our main contribution is the attempt to give
a comprehensive overview of such integrations from the
viewpoint of metaheuristics.

We believe that the reason why these integrations
are very promising resides in the complementary
merits of the two approaches. Indeed, on the one hand,
metaheuristics are, in general, more suitable to deal
with optimization problems, but their treatment of
constraints can be very awkward, especially in the case of
tightly constrained problems. On the other hand,
constraint programming is specifically designed for finding
feasible solutions, but it is not particularly effective
for handling optimization. Consequently, a hybrid
algorithm that uses CP for finding feasible solutions and
metaheuristics to search among them has good chances
to outperform its single components.

Despite the important steps made in this field during 
the last decade, there are still promising research oppor- 
tunities, especially in order to investigate topics such 
as collaborative hybridization of CP and metaheuristics 
and validate existing integration approaches in the yet 
uninvestigated area of multiobjective optimization. We 
believe that further research should devote more attention
to these aspects.


63. Graph Coloring and Recombination


Rhyd Lewis 


It is widely acknowledged that some of the most 
powerful algorithms for graph coloring involve the 
combination of evolutionary-based methods with 
exploitative local search-based techniques. This 
chapter conducts a review and discussion of such 
methods, principally focussing on the role that re- 
combination plays in this process. In particular we 
observe that, while in some cases recombination 
seems to be usefully combining substructures in- 
herited from parents, in other cases it is merely 
acting as a macro perturbation operator, helping 
to reinvigorate the search from time to time. 


63.1 Graph Coloring 


Graph coloring is a well-known NP-hard combinato- 
rial optimization problem that involves using a minimal 
number of colors to paint all vertices in a graph such 
that all adjacent vertices are allocated different col- 
ors. The problem is more formally stated as follows: 
given an undirected simple graph $G = (V, E)$, with vertex
set $V$ and edge set $E$, our task is to assign each vertex
$v \in V$ an integer $c(v) \in \{1, 2, \ldots, k\}$ so that:

• $c(v) \neq c(u)$ $\forall \{v, u\} \in E$
• $k$ is minimal.


Though essentially a theoretical problem, graph 
coloring is seen to underpin a wide variety of 
seemingly unrelated operational research problems, 
including satellite scheduling [63.1], educational 
timetabling [63.2,3], sports league scheduling [63.4], 
frequency assignment problems [63.5,6], map color- 
ing [63.7], airline crew scheduling [63.8], and compiler 
register allocation [63.9]. The design of effective algo- 
rithms for graph coloring thus has positive implications 
for a large range of real-world problems. 

Some common terms used with graph coloring are
as follows:

• A coloring of a graph is called complete if all vertices
  $v \in V$ are assigned a color $c(v) \in \{1, \ldots, k\}$;
  else it is considered partial.

• A clash describes a situation where a pair of adjacent
  vertices $u, v \in V$ are assigned the same color
  (that is, $\{u, v\} \in E$ and $c(v) = c(u)$). If a coloring
  contains no clashes, then it is considered proper;
  else it is improper.

• A coloring is feasible if and only if it is both complete
  and proper.

• The chromatic number of a graph $G$, denoted $\chi(G)$,
  is the minimal number of colors required in a feasible
  coloring. If a feasible coloring uses $\chi(G)$ colors,
  it is considered optimal.

• An independent set is a subset of vertices $I \subseteq V$
  that are mutually non-adjacent. That is, $\forall u, v \in I$,
  $\{u, v\} \notin E$. Similarly, a clique is a subset of vertices
  $C \subseteq V$ that are mutually adjacent: $\forall u, v \in C$,
  $\{u, v\} \in E$.
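
The definitions above translate directly into small Python predicates. The sketch assumes a graph given as a vertex collection and a list of edges, with a coloring stored as a dictionary from vertices to colors (uncolored vertices are simply absent); these representations and the function names are choices made for the example.

```python
def clashes(edges, coloring):
    """Edges whose two endpoints are both colored and share a color."""
    return [(u, v) for u, v in edges
            if u in coloring and v in coloring and coloring[u] == coloring[v]]


def is_complete(vertices, coloring):
    """Every vertex has been assigned a color."""
    return all(v in coloring for v in vertices)


def is_proper(edges, coloring):
    """No clash occurs among the colored vertices."""
    return not clashes(edges, coloring)


def is_feasible(vertices, edges, coloring):
    """Feasible means complete and proper."""
    return is_complete(vertices, coloring) and is_proper(edges, coloring)


def is_independent_set(edges, subset):
    """No edge has both endpoints inside the subset."""
    members = set(subset)
    return not any(u in members and v in members for u, v in edges)


def is_clique(edges, subset):
    """Every pair of distinct vertices in the subset is joined by an edge."""
    members = sorted(set(subset))
    present = {frozenset(e) for e in edges}
    return all(frozenset((u, v)) in present
               for i, u in enumerate(members) for v in members[i + 1:])


# Hypothetical example: a triangle a-b-c with a pendant vertex d on c.
vertices = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
full = {"a": 1, "b": 2, "c": 3, "d": 1}
partial = {"a": 1, "b": 1}
```

In this example `full` is feasible, while `partial` is both partial and improper because the adjacent vertices a and b share color 1.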


Given these definitions, we might also view graph
coloring as a type of partitioning/grouping problem
where the aim is to split the vertices into a set of subsets
$U = \{U_1, \ldots, U_k\}$ such that $U_i \cap U_j = \emptyset$ $(1 \le i < j \le k)$.
If $\bigcup_{i=1}^{k} U_i = V$, then the partition represents a complete
coloring. Moreover, if all subsets $U_1, \ldots, U_k$ are
independent sets, the coloring is also feasible.

Fig. 63.1 A simple graph (left) and a feasible five-coloring (right)

To exemplify these concepts, Fig. 63.1 shows an
example graph with ten vertices, together with a
corresponding coloring. In this case the presented coloring
is both complete and proper, and therefore feasible.
It is also optimal because it uses just five
colors, which happens to be the chromatic number
in this case. The graph also contains one clique
of size 5 (vertices $v_1, v_3, v_4, v_6$, and $v_7$), and
numerous independent sets, such as vertices $v_2, v_3, v_8$,
and $v_9$. As a partition, this coloring is represented
$U = \{\{v_1, v_{10}\}, \{v_7, v_8\}, \{v_3, v_5\}, \{v_2, v_4, v_9\}, \{v_6\}\}$.

It should be noted that various subsidiary problems
related to the graph coloring problem are also
known to be NP-hard. These include computing the
chromatic number itself, identifying the size of the
largest clique, and determining the size of the largest
independent set in a graph [63.10,11]. In addition,
the decision variant of the graph coloring problem,
which asks: given a fixed positive integer $k$, is
there a feasible $k$-coloring of the vertices? is
NP-complete.


63.2 Algorithms for Graph Coloring


Graph coloring has been studied as an algorithmic problem
since the late 1960s and, as a result, an abundance
of methods have been proposed. Loosely speaking,
these methods might be grouped into two main classes:
constructive methods, which build solutions step-by-step,
perhaps using various heuristic and backtracking
operators; and stochastic search-based methods, which
attempt to navigate their way through a space of candidate
solutions while optimizing a particular objective
function.

The earliest proposed algorithms for graph coloring
generally belong to the class of constructive methods.
Perhaps the simplest of these is the first-fit (or
greedy) algorithm. This operates by taking each vertex
in turn in a specified order and assigning it to the
lowest indexed color where no clash is induced, creating
new colors when necessary [63.12]. A development
on this method is the DSATUR algorithm [63.13, 14] in
which the ordering of the vertices is determined dy- 
namically — specifically, by choosing at each step the 
uncolored vertex that currently has the largest number 
of different colors assigned to adjacent vertices, break- 
ing ties by taking the vertex with the largest degree. 
Other constructive methods have included backtracking 
strategies, such as those of Brown [63.15] and Kor- 
man [63.16], which may ultimately perform complete 
enumerations of the solution space given excess time. 
A survey of backtracking approaches was presented by 
Kubale and Jackowski [63.17] in 1985. 
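
The two constructive heuristics described above can be sketched compactly in Python. The graph is assumed to be given as an adjacency dictionary, colors are the integers 1, 2, ..., and ties that remain after the degree rule in DSATUR are broken by dictionary order, which is a choice made for this example only.

```python
def lowest_free_color(adjacency, color, v):
    """Smallest color index not used by an already colored neighbor of v."""
    used = {color[u] for u in adjacency[v] if u in color}
    c = 1
    while c in used:
        c += 1
    return c


def first_fit(adjacency, order):
    """Greedy coloring: take the vertices in the given order."""
    color = {}
    for v in order:
        color[v] = lowest_free_color(adjacency, color, v)
    return color


def dsatur(adjacency):
    """Greedy coloring with a dynamic, saturation-based vertex order."""
    color = {}
    while len(color) < len(adjacency):
        def priority(v):
            # Saturation: number of different colors on adjacent vertices;
            # ties are broken by the larger degree.
            saturation = len({color[u] for u in adjacency[v] if u in color})
            return (saturation, len(adjacency[v]))
        v = max((u for u in adjacency if u not in color), key=priority)
        color[v] = lowest_free_color(adjacency, color, v)
    return color


# Hypothetical example: the path a-b-c-d, which is 2-colorable.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
greedy = first_fit(path, ["a", "d", "b", "c"])
saturated = dsatur(path)
```

On this path the unlucky order a, d, b, c forces first-fit to open a third color, whereas the saturation rule finds a two-coloring.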

Many of the more recent methods for graph coloring 
have followed the second approach mentioned above, 
which is to search a space of candidate solutions and 
attempt to identify members that optimize a specific 
objective function. Such methods can be further classi- 
fied according to the composition of their search spaces, 
which can comprise:

(a) The set of all feasible solutions (using an undefined
number of colors)

(b) The set of complete colorings (proper and im- 
proper) for a fixed number of colors k 

(c) The set of proper solutions (partial and complete), 
also for a fixed number of colors k. 


Algorithms following scheme (a) have been con- 
sidered by, among others, Culberson and Luo [63.18], 
Mumford [63.19], Erben [63.20], and Lewis [63.21]. 
Typically, these methods consider different permuta- 
tions of the vertices, which are then fed into a construc- 
tive method (such as first-fit) to form feasible solutions. 
An intuitive cost function in such cases is simply the 
number of colors used in a solution, though other more 
fine-grained functions have been suggested, such as the 
following due to Erben [63.20] 


$$f_1 = \frac{\sum_{U_i \in U} \left( \sum_{v \in U_i} \deg(v) \right)^2}{|U|} \qquad (63.1)$$


Here, the term $\left( \sum_{v \in U_i} \deg(v) \right)$ gives the sum of the
degrees of all vertices assigned to a color class $U_i$. The aim
is to maximize $f_1$ by making increases to the numerator
(by forming large color classes that contain high-degree
vertices), and decreases to the denominator (by
reducing the number of color classes).
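
A small Python sketch shows how a cost of the form (63.1) rewards solutions with fewer, larger color classes. The squared degree sum follows the equation as given above; the function name and the example graph are assumptions.

```python
def erben_cost(partition, degree):
    """Sum of squared per-class degree totals, divided by the class count."""
    return sum(sum(degree[v] for v in color_class) ** 2
               for color_class in partition) / len(partition)


# Hypothetical example: the path a-b-c-d, with degrees 1, 2, 2, 1.
degree = {"a": 1, "b": 2, "c": 2, "d": 1}
two_classes = [{"a", "c"}, {"b", "d"}]        # degree totals 3 and 3
three_classes = [{"a", "d"}, {"b"}, {"c"}]    # degree totals 2, 2 and 2
```

Here the two-class partition scores $(3^2 + 3^2)/2 = 9$ while the three-class partition scores $(2^2 + 2^2 + 2^2)/3 = 4$, so maximizing the function favors the coloring with fewer colors.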

On the other hand, algorithms following scheme (b) 
operate by first proposing a fixed number of colors k. At 
the start of a run, each vertex will be assigned to one of 
the k colors using heuristics, or randomly. However, this 
may involve the introduction of one or more clashes, 
resulting in a complete, improper k-coloring. The cost 
of such a solution might then be evaluated using the 
following cost function, which is simply a count on the 
number of clashes 


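$$f_2 = \sum_{\forall \{v, u\} \in E} g(v, u), \quad \text{where} \quad g(v, u) = \begin{cases} 1 & \text{if } c(v) = c(u) \\ 0 & \text{otherwise.} \end{cases} \qquad (63.2)$$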


The strategy in such approaches is to make alterations 
to a solution such that the number of clashes is re- 
duced to zero. If this is achieved k can be reduced; 
alternatively if all clashes cannot be eliminated, k can 
be increased. This strategy has been quite popular in
the literature, involving the use of various stochastic
search methodologies, including simulated annealing
[63.22, 23], tabu search [63.24], greedy randomized
adaptive search procedure (GRASP) methods [63.25],
iterated local search [63.26, 27], variable neighborhood
search [63.28], ant colony optimization [63.29], and
evolutionary algorithms (EA) [63.30-35].
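
The clash-counting cost of (63.2) is straightforward to compute, and the effect of recoloring a single vertex can be evaluated by inspecting only its neighbors, which is what most of the search methods listed above rely on. The following Python sketch illustrates both; the names `clash_count` and `recolor_delta` and the example graph are hypothetical.

```python
def clash_count(edges, coloring):
    """Number of edges whose endpoints share a color, as in (63.2)."""
    return sum(1 for v, u in edges if coloring[v] == coloring[u])


def recolor_delta(adjacency, coloring, vertex, new_color):
    """Change in the clash count if `vertex` alone moves to `new_color`."""
    old_color = coloring[vertex]
    if new_color == old_color:
        return 0
    gained = sum(1 for u in adjacency[vertex] if coloring[u] == new_color)
    removed = sum(1 for u in adjacency[vertex] if coloring[u] == old_color)
    return gained - removed


# Hypothetical example: the cycle 0-1-2-3-0 with k = 2 colors.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
adjacency = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
improper = {0: 1, 1: 1, 2: 1, 3: 2}
delta = recolor_delta(adjacency, improper, 1, 2)
repaired = dict(improper)
repaired[1] = 2
```

In this example the improper coloring has two clashes, and moving vertex 1 to color 2 removes both, which the incremental evaluation reports without rescanning every edge.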

Finally, scheme (c) also involves using a fixed 
number of colors k; however in this case, rather than 
allowing clashes to occur in a solution, vertices that 
cannot be feasibly assigned to a color are placed into 
a set of uncolored vertices S. The aim is, therefore, to 
make changes to a solution so that these vertices can 
eventually be feasibly colored, resulting in S = Ø. This 
approach has generally been less popular in the litera- 
ture than scheme (b), though some prominent examples 
include the simulated annealing approach of Morgenstern
[63.36], the tabu search method of Blöchliger and
Zufferey [63.37], and the EA of Malaguti et al. [63.38].
More recently, Hertz et al. [63.39] also suggested an 
algorithm that searches different solution spaces dur- 
ing different stages of a run. The idea is that when the 
search is deemed to have stagnated in one space, a pro- 
cedure is used to alter the current solution so that it 
becomes a member of another space (e.g., clashing ver- 
tices are uncolored by transferring them to S). Once this 
has been done, the search can then be continued in this 
new space where further improvements might be made. 


63.2.1 EAs for Graph Coloring 


In this section we now examine the ways in which EAs 
have been applied to the graph coloring problem, partic- 
ularly looking at issues surrounding the recombination 
of solutions. 


Assignment-Based Operators 
Perhaps the most intuitive way of applying EAs to
the graph coloring problem is to view the task as one
of assignment. In this case, a candidate solution can
be viewed as a mapping of vertices to colors $c : V \rightarrow
\{1, \ldots, k\}$, and a natural chromosome representation is
a vector $(c(v_1), c(v_2), \ldots, c(v_{|V|}))$, where $c(v_i)$ gives the
color of vertex $v_i$ (the solution given in Fig. 63.1 would
be represented by (1,4,3,4,3,5,2,2,4,1) under this
scheme). However, it has long been argued that this
sort of approach brings disadvantages, not least because
it contradicts a fundamental design principle of EAs:
the principle of minimum redundancy [63.40], which
states that each member of the search space should be
represented by as few distinct chromosomes as possible.
To expand upon this point, we observe that under
this assignment-based representation, if we are given
a solution using $l \le k$ colors, the number of different
chromosomes representing this solution will be
${}^{k}P_{l} = k!/(k-l)!$ due
to the arbitrary way in which colors are allocated labels.
(For example, swapping the labels of colors 2 and 4
in Fig. 63.1's solution would give a new chromosome
(1,2,3,2,3,5,4,4,2,1), but the same solution.) Of course,
this implies a search space that is far larger than
necessary.

Furthermore, authors such as Falkenauer [63.41] 
and Coll et al. [63.42] have also argued that tradi- 
tional recombination schemes such as 1, 2, and n-point 
crossover with this representation have a tendency to 
recklessly break up building-blocks that we might want 
promoted in a population. As an example, consider a re- 
combination of the two example chromosomes given 
in the previous paragraph using two-point crossover: 
(1,4,3,4,3,5,2,2,4,1) crossed with (1,2,3,2,3,5,4,4,2,1) 
would give (1,4,3,4,3,5,4,4,4, 1) as one of the offspring. 
Here, despite the fact that the two parent chromosomes
actually represent the same feasible solution, the
resultant offspring seems to have little in common with its
parents, having lost one of its colors and had a number
of clashes introduced. Thus, it is concluded
by these authors that such operations actually consti- 
tute more of a random perturbation operator, rather than 
a mechanism for combining meaningful substructures 
from existing solutions. Nevertheless, recent algorithms 
following this scheme are still reported in the litera- 
ture [63.43]. 
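
The redundancy and disruption effects discussed above can be reproduced with the two example chromosomes from the text. The Python sketch below is illustrative only; the helper names are hypothetical and the cut points (after positions 6 and 8) are those that yield the offspring quoted above.

```python
from math import factorial


def swap_labels(chromosome, x, y):
    """Exchange the labels of colors x and y."""
    mapping = {x: y, y: x}
    return tuple(mapping.get(c, c) for c in chromosome)


def as_partition(chromosome):
    """Vertex partition induced by a chromosome, ignoring color labels."""
    classes = {}
    for vertex, color in enumerate(chromosome, start=1):
        classes.setdefault(color, set()).add(vertex)
    return {frozenset(members) for members in classes.values()}


def two_point_crossover(p1, p2, a, b):
    """Child takes positions a..b-1 from p2 and the rest from p1."""
    return p1[:a] + p2[a:b] + p1[b:]


def equivalent_chromosomes(k, l):
    """Number of chromosomes encoding one solution that uses l of k colors."""
    return factorial(k) // factorial(k - l)


parent1 = (1, 4, 3, 4, 3, 5, 2, 2, 4, 1)
parent2 = swap_labels(parent1, 2, 4)
child = two_point_crossover(parent1, parent2, 6, 8)
```

Both parents induce the same partition, yet the child contains only four distinct colors, and with $k = l = 5$ there are 120 chromosomes encoding that one solution.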

In recognition of the proposed disadvantages of 
the assignment-based representation, Coll et al. [63.42] 
proposed a procedure for relabeling the colors of one 
of the parent chromosomes before applying crossover. 
Consider two (not necessarily feasible) parent solutions
represented as partitions: $U_1 = \{U_{1,1}, \ldots, U_{1,k}\}$ and
$U_2 = \{U_{2,1}, \ldots, U_{2,k}\}$. Now, using $U_1$ and $U_2$, a complete
bipartite graph $K_{k,k}$ is formed. This bipartite graph
has $k$ vertices in each partition, and the weights between
two vertices $i, j$ from different partitions are defined as
$w_{i,j} = |U_{1,i} \cap U_{2,j}|$. Given $K_{k,k}$, a maximum weighted
matching can then be determined using any suitable
algorithm (e.g., the Hungarian algorithm [63.44] or auction
algorithm [63.45]), and this matching can be used
to re-label the colors in one of the chromosomes.

Figure 63.2 gives an example of this procedure and 
shows how the second parent can be altered so that 
its color labelings maximally match those of parent 1. 
In this case, we note that the color classes {v1, v10},
{v3, v5}, and {v6} occur in both parents and will be
preserved in any offspring produced via a traditional 
crossover operator. However, this will not always be the 
case and will depend very much on the best matching 
that is available in each case. 
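The relabeling step can be sketched as follows. Two assumptions should be noted: the maximum weighted matching is found here by brute force over all k! label mappings, which is a stand-in for the Hungarian or auction algorithm and is only practical for small k; and the second parent's chromosome is an assumed example chosen to be consistent with the relabeling stated in the caption of Fig. 63.2. The function name `relabel_to_match` is hypothetical.

```python
from itertools import permutations


def relabel_to_match(parent1, parent2, k):
    """Relabel parent2's colors (1..k) to agree maximally with parent1.

    Brute force over all label mappings; a stand-in for a proper
    maximum weighted matching algorithm.
    """
    # w[i][j] = number of vertices colored i+1 in parent1 and j+1 in parent2
    w = [[0] * k for _ in range(k)]
    for c1, c2 in zip(parent1, parent2):
        w[c1 - 1][c2 - 1] += 1
    # perm[j] is the 0-based parent-1 label assigned to parent-2 color j+1
    best = max(permutations(range(k)),
               key=lambda perm: sum(w[perm[j]][j] for j in range(k)))
    return tuple(best[c - 1] + 1 for c in parent2)


parent1 = (1, 4, 3, 4, 3, 5, 2, 2, 4, 1)
parent2 = (3, 2, 1, 2, 1, 5, 4, 4, 4, 3)  # assumed example parent
relabeled = relabel_to_match(parent1, parent2, 5)
```

Under these assumptions the mapping found is 1 to 3, 2 to 4, 3 to 1, 4 to 2, and 5 to 5, after which the two chromosomes differ in a single gene (vertex v9).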

A further scheme for color relabeling that also ad- 
dresses the issue of redundancy was proposed by Tucker 
et al. [63.46]. This method involves representing solu- 
tions using the assignment-based scheme, but under the 
following restriction 


$c(v_1) = 1$, (63.3)
$c(v_{i+1}) \le \max\{c(v_1), \ldots, c(v_i)\} + 1$. (63.4)


Chromosomes obeying these labeling criteria might, 
therefore, be considered as being in their canonical 
form such that, by definition, vertex v1 is always colored
with color 1, v2 is always colored with color 1 or 2,
and so on. (The solution given in Fig. 63.1 would be 
represented by (1,2,3,2,3,4,5,5,2,1) under this scheme.) 
However, although this ensures a one-to-one correspon- 
dence between the set of chromosomes and the set of 
vertex partitions (thereby removing any redundancy), 
research by Lewis and Pullin [63.47] demonstrated that 
this scheme is not particularly useful for graph color- 
ing, not least because minor changes to a chromosome 
(such as the recoloring of a single vertex) can lead to major
changes to the way colors are labeled, making the prop- 
agation of useful solution substructures more difficult to 
achieve when applying traditional crossover operators. 
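One simple way to obtain a labeling satisfying (63.3) and (63.4) is to renumber colors in order of first appearance. The sketch below assumes this construction (the function name `canonical_form` is hypothetical) and also illustrates the sensitivity just described, using an arbitrary single-vertex recoloring chosen for illustration.

```python
def canonical_form(chromosome):
    """Renumber colors in order of first appearance (1, 2, 3, ...)."""
    mapping = {}
    result = []
    for color in chromosome:
        if color not in mapping:
            mapping[color] = len(mapping) + 1
        result.append(mapping[color])
    return tuple(result)


original = (1, 4, 3, 4, 3, 5, 2, 2, 4, 1)
canonical = canonical_form(original)

# Recolor only vertex v1 (from color 1 to color 2) and renormalize.
perturbed = canonical_form((2,) + original[1:])
changed_genes = sum(1 for x, y in zip(canonical, perturbed) if x != y)
```

Here a single recoloring changes three genes of the canonical chromosome, none of which is the gene of the recolored vertex itself, because the labels of two whole color classes are exchanged.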


Fig. 63.2 Example of the relabeling procedure proposed by Coll et al. [63.42]. Here, parent 2 is relabeled as 1 → 3, 2 → 4, 3 → 1, 4 → 2, and 5 → 5

Partition-Based Operators
Given the proposed issues with the assignment-based
approach, the last 15 years or so have also seen a number
of articles presenting recombination operators focussed
on the partition (or grouping) interpretation of
graph coloring. The philosophy behind this approach 
is that it is actually the color classes (and the vertices 
that are assigned to them) that represent the underly- 
ing building blocks of the graph coloring problem. In 
other words, it is not the color of individual vertices 
per se, but the way in which vertices are grouped that 
form the meaningful substructures. Consequently, the 
focus should be on the design of operators that are 
successfully able to combine and promote these within 
a population. 

Perhaps the first major work in this area was due 
to Falkenauer [63.48] in 1994 (and later [63.41]) who 
argued in favor of the partition interpretation in the jus- 
tification of his grouping genetic algorithm (GGA) — 
an EA methodology specifically designed for use with 
partitioning problems. Falkenauer applied this GGA 
to two important operational research problems: the 
bin-packing problem and bin-balancing problem, with 
strong results being reported. In subsequent work, Er- 
ben [63.20] also tailored the GGA for graph coloring. 
Erben’s approach operates in the space of feasible col- 
orings and allows the number of colors in a solution 
to vary. Solutions are then stored as partitions, and 
evaluated using (63.1). In this approach, recombination 
operates by taking two parent solutions and randomly 
selecting a subset of color classes from the second. 
These color classes are then copied into the first par- 
ent, and all color classes coming from the first parent 
containing duplicate vertices are deleted. This opera- 


tion results in an offspring solution that is proper, but 
most likely partial. Thus uncolored vertices are then 
reinserted into the solution, in this case using the first-fit 
algorithm. A number of other recombination operators 
for use in the space of feasible solutions have also been 
suggested by Mumford [63.19]. These operate on per- 
mutations of vertices, which are again decoded into 
solutions using the first-fit algorithm. 

Another recombination operator that focusses on 
the partition interpretation of graph coloring is due 
to Galinier and Hao, who in 1999 proposed an EA 
that, at the date of writing, is still understood to be 
one of the best performing algorithms for graph color- 
ing [63.33, 38, 49, 50]. Using a fixed number of colors k, 
Galinier and Hao’s method operates in the space of 
complete (proper and improper) k-colorings using cost 
function f2 (63.2). A population of candidate solutions
is then evolved using local search (based on tabu
search) together with a specialized recombination oper- 
ator called greedy partition crossover (GPX). The latter 
is used as a global operator and is intended to guide the 
search over the long term, gently directing it towards fa- 
vorable regions of the search space (exploration), while 
the local search element is used to identify high quality 
solutions within these regions (exploitation). 

Fig. 63.3 Demonstration of the GPX operator using k = 3

a) Parent 1: U1 = {v1, v2, v3}, U2 = {v4, v5, v6, v7}, U3 = {v8, v9, v10}
   Parent 2: U1 = {v3, v4, v5, v7}, U2 = {v1, v6, v9}, U3 = {v2, v8, v10}
   Offspring: U1 = {}, U2 = {}, U3 = {}
   Select the color with most vertices and copy to the child (U2 from parent 1 here). Delete copied vertices from both parents.
b) Parent 1: U1 = {v1, v2, v3}, U2 = {}, U3 = {v8, v9, v10}
   Parent 2: U1 = {v3}, U2 = {v1, v9}, U3 = {v2, v8, v10}
   Offspring: U1 = {v4, v5, v6, v7}, U2 = {}, U3 = {}
   Select the color with most vertices in parent 2 and copy to child. Delete copied vertices from both parents.
c) Parent 1: U1 = {v1, v3}, U2 = {}, U3 = {v9}
   Parent 2: U1 = {v3}, U2 = {v1, v9}, U3 = {}
   Offspring: U1 = {v4, v5, v6, v7}, U2 = {v2, v8, v10}, U3 = {}
   Select the color with most vertices in parent 1 and copy to the child. Delete copied vertices from both parents.
d) Parent 1: U1 = {}, U2 = {}, U3 = {v9}
   Parent 2: U1 = {}, U2 = {v9}, U3 = {}
   Offspring: U1 = {v4, v5, v6, v7}, U2 = {v2, v8, v10}, U3 = {v1, v3}
   Having formed k colors, assign any missing vertices to random colors.
e) Offspring: U1 = {v4, v5, v6, v7}, U2 = {v2, v8, v10, v9}, U3 = {v1, v3}
   A complete (though not necessarily proper) solution results.

The idea behind GPX is to construct offspring
using large color classes inherited from the parent solutions.
A demonstration of how this is done is given
in Fig. 63.3. As is shown, the largest (not necessarily
proper) color class in the parents is first selected and
copied into the offspring. Then, in order to avoid duplicate
vertices occurring in the offspring at a later stage,
these copied vertices are removed from both parents. 
To form the next color, the other (modified) parent is 
then considered and, again, the largest color class is 
selected and copied into the offspring, before again re- 
moving these vertices from both parents. This process 
is continued by alternating between the parents until the 
offspring’s k color classes have been formed. At this 
point, each color class in the offspring will be a subset 
of a color class existing in one or both of the parents. 
That is 

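$\forall U_i \in \mathcal{U}_c \;\; \exists U_j \in (\mathcal{U}_1 \cup \mathcal{U}_2) : U_i \subseteq U_j$, (63.5)

where $\mathcal{U}_c$, $\mathcal{U}_1$, and $\mathcal{U}_2$ represent the offspring, and
parents 1 and 2, respectively.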

One feature of the GPX operator is that on produc- 
tion of an offspring’s k color classes, some vertices may 
be missing (this occurs with vertex v9 in Fig. 63.3).
Galinier and Hao [63.33] suggest assigning these un- 
colored vertices to random classes, which of course 
could introduce further clashes. This element of the 
procedure might, therefore, be viewed as a type of per- 
turbation (mutation) operator in which the number of 
random assignments (the size of the perturbation) is de- 
termined by the construction stages of GPX. However, 
Glass and Prügel-Bennett [63.49] observe that GPX’s
strategy of inheriting the largest available color class at 
each step (as opposed to a random color class) generally 
reduces the number of uncolored vertices. This means 
that the amount of information inherited directly from 
the parents is increased, reducing the potential for dis- 
ruption. Once a complete offspring is formed, it is then 
modified and improved via a local search procedure be- 
fore being inserted into the population. 
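The construction stage of GPX described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than Galinier and Hao's implementation: solutions are given as lists of vertex sets, the operator always starts with parent 1 (as in Fig. 63.3), ties between equally large classes are broken by taking the first one, and the subsequent local search is omitted. The function name `gpx` and its return values are hypothetical.

```python
import random


def gpx(parent1, parent2, k, rng=random):
    """Greedy partition crossover on two partitions (lists of vertex sets)."""
    parents = [[set(c) for c in parent1], [set(c) for c in parent2]]
    vertices = set().union(*parents[0])
    child = []
    for step in range(k):
        donor = parents[step % 2]           # alternate, starting with parent 1
        largest = set(max(donor, key=len))  # copy of the largest class
        child.append(largest)
        for parent in parents:              # delete copied vertices everywhere
            for cls in parent:
                cls -= largest
    uncolored = vertices - set().union(*child)
    before_repair = [set(c) for c in child]
    for v in sorted(uncolored):             # random assignment of leftovers
        rng.choice(child).add(v)
    return child, before_repair, uncolored


p1 = [{1, 2, 3}, {4, 5, 6, 7}, {8, 9, 10}]
p2 = [{3, 4, 5, 7}, {1, 6, 9}, {2, 8, 10}]
child, before_repair, uncolored = gpx(p1, p2, 3, random.Random(0))
```

With the two parents of Fig. 63.3, the three inherited classes are {v4, v5, v6, v7}, {v2, v8, v10}, and {v1, v3}, leaving v9 to be assigned at random.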

Since the proposal of GPX by Galinier and 
Hao [63.33], further recombination schemes based on 
this method have also been suggested, differing primar- 
ily in the criteria used for selecting the color classes 
that are inherited by the offspring. Lü and Hao [63.34],
for example, extended the GPX operator to allow more
than two parents to play a part in producing a single
offspring (Sect. 63.5). On the other hand, Porumbel
et al. [63.35] suggest that instead of choosing the largest
available color class at each stage of construction,
classes with the least number of clashes should be
prioritized, with class size (and information regarding
the degrees of the vertices) then being used to break
ties. Malaguti et al. [63.38] also use a modified version
of GPX with an EA that navigates the space of partial,
proper solutions. In all of these cases the authors
combined their recombination operators with a local
search procedure in the same manner as Galinier and
Hao [63.33] and, with the problem instances considered,
the reported results are generally claimed to be
competitive with the state of the art.


Assessing the Effectiveness of EAs for Graph Coloring
In recent work carried out by the author of this chapter
[63.50], a comparison of six different graph coloring
algorithms was presented. This study was quite broad
and used over 5000 different problem instances. Its
conclusions were also rather complex, with each method
outperforming all others on at least one class of problems.
However, a salient observation was that the GPX-based
EA of Galinier and Hao [63.33] was by far the
most consistent and high-performing algorithm across
the comparison.

In the remainder of this chapter we pursue this matter
further, particularly focussing on the role that GPX
plays in this performance. Under a common EA framework,
described in Sect. 63.3, we first evaluate the
performance of GPX by comparing it to two other
recombination operators (Sect. 63.4). Using information
gained from these experiments, Sect. 63.5 then looks
at how the performance of the GPX-based EA might
be enhanced, particularly by looking at ways in which
population diversity might be prolonged during a run.
Finally, conclusions and a further discussion surrounding
the virtues of recombination in this problem domain
are presented in Sect. 63.6.


63.3 Setup


The EA used in the following experiments operates in
the same manner as Galinier and Hao’s [63.33]. To
form an initial population, a modified version of the
DSATUR algorithm is used. Specifically, each individual
is formed by taking the vertices in turn according to
the DSATUR heuristic and then assigning it to the lowest
indexed color i ∈ {1, ..., k} where no clash occurs. Vertices
for which no clash-free color exists are assigned
to random colors at the end of this process. Ties in the
DSATUR heuristic are broken randomly, providing
diversity in the initial population. Each individual is then
improved by the local search routine.

Table 63.1 Details of the five problem instances used in our analysis

#: Name        |V|    Density   Degree (min; med; max)   Degree mean   Degree SD   Best known (colors)
1: Random      1000   0.499     450; 499; 555            499.4         16.1        83
2: Flat(10)    500    0.103     36; 52; 61               [illegible]   4.4         10
3: Flat(100)   500    0.841     393; 421; 445            420.7         7.6         100
4: TT(A)       682    0.128     0; 77; 472               87.4          62.0        27
5: TT(B)       2419   0.029     0; 47; 857               [illegible]   92.3        32
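The construction of an initial individual might be sketched as below. This is an illustrative reading of the description, not the code used in our experiments: the graph is assumed to be a dictionary mapping each vertex to a list of its neighbors, the next vertex is the uncolored one with the most distinctly colored neighbors (its saturation degree), and all ties on saturation degree are broken randomly, which simplifies the usual DSATUR rule. The function name `dsatur_initial_solution` is hypothetical.

```python
import random


def dsatur_initial_solution(adj, k, rng=random):
    """Build a complete (possibly improper) k-coloring with colors 1..k."""
    color = {}
    uncolorable = []
    remaining = set(adj)

    def saturation(v):
        # number of distinct colors already used on neighbors of v
        return len({color[u] for u in adj[v] if u in color})

    while remaining:
        best = max(saturation(v) for v in remaining)
        v = rng.choice(sorted(u for u in remaining if saturation(u) == best))
        remaining.remove(v)
        used = {color[u] for u in adj[v] if u in color}
        free = [c for c in range(1, k + 1) if c not in used]
        if free:
            color[v] = free[0]          # lowest indexed clash-free color
        else:
            uncolorable.append(v)       # dealt with at the end
    for v in uncolorable:
        color[v] = rng.randint(1, k)    # random color, may cause clashes
    return color


graph = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
solution = dsatur_initial_solution(graph, 3, random.Random(1))
```

Calling the function repeatedly with different random states gives different individuals, which is the source of diversity in the initial population.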

The EA evolves the population using recombina- 
tion, local search, and replacement pressure. In each 
iteration two parent solutions are selected at random, 
and the selected recombination operator is used to pro- 
duce one offspring. This offspring is then improved via 
local search and inserted into the population by replac- 
ing the weaker of its two parents. 

The local search element of this EA makes use of 
tabu search — specifically the TABUCOL algorithm of 
Hertz and de Werra [63.24], run for a fixed number of 
iterations. In this method, moves in the search space 
are achieved by selecting a vertex v whose assignment 
to color i is currently causing a clash, and moving it
to a new color j ≠ i. The inverse of this move is then
marked as tabu for the next t steps of the algorithm
(meaning that v cannot be re-assigned to color i until
at least t further moves have been performed). In each
iteration, the complete neighborhood is considered, and 
the non-tabu move that is seen to invoke the largest de- 
crease in cost (or failing that, the smallest increase) is 
performed. Ties are broken randomly, and tabu moves 
are also carried out if they are seen to improve on the 
best solution observed so far in the process. The tabu 
search routine terminates when the iteration limit is 
reached (in which case the best solution found during
the process is taken), or when a zero cost solution is 
achieved. Further descriptions of this method, includ- 
ing implementation details, can be found in [63.51]. 

In terms of parameter settings, in all cases we use 
a population size of 20 (as in [63.34,35]) and set the 
tabu search iteration limit to 16|V|, which approximates 
the settings used in the best reported runs in [63.33]. 
As with other algorithms that use this local search 
technique [63.29, 33, 37], the tabu tenure t is made
proportional to the current solution cost: specifically,
$t = \lfloor 0.6 f_2 \rfloor + r$, where r is an integer uniformly selected
from the range 0 to 9 inclusive.
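The local search described above can be sketched compactly. The code below is a simplified illustration, not the TABUCOL implementation used in our experiments (see [63.51] for that): it assumes integer vertex labels, a symmetric adjacency dictionary, a cost equal to the number of clashing edges, and a tenure computed from the cost before each move; it also recomputes neighborhood costs naively rather than with the usual incremental data structures. The names `cost` and `tabucol` are hypothetical.

```python
import random


def cost(adj, color):
    """Number of clashing edges (each edge counted once)."""
    return sum(1 for v in adj for u in adj[v] if u > v and color[u] == color[v])


def tabucol(adj, color, k, max_iters, rng=random):
    color = dict(color)
    current = cost(adj, color)
    best, best_cost = dict(color), current
    tabu = {}  # (vertex, color) -> last iteration at which the move is tabu
    for it in range(max_iters):
        if best_cost == 0:
            break
        clashing = [v for v in adj if any(color[u] == color[v] for u in adj[v])]
        candidates = []
        for v in clashing:
            same_now = sum(1 for u in adj[v] if color[u] == color[v])
            for j in range(1, k + 1):
                if j == color[v]:
                    continue
                new_cost = current - same_now + sum(1 for u in adj[v] if color[u] == j)
                is_tabu = tabu.get((v, j), -1) >= it
                if not is_tabu or new_cost < best_cost:  # aspiration criterion
                    candidates.append((new_cost, v, j))
        if not candidates:
            continue
        lowest = min(c[0] for c in candidates)
        new_cost, v, j = rng.choice([c for c in candidates if c[0] == lowest])
        tenure = int(0.6 * current) + rng.randint(0, 9)
        tabu[(v, color[v])] = it + tenure  # forbid moving v back to its old color
        color[v] = j
        current = new_cost
        if current < best_cost:
            best, best_cost = dict(color), current
    return best, best_cost


cycle = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
start = {v: 1 for v in cycle}
best, best_cost = tabucol(cycle, start, 3, 200, random.Random(0))
```

The function returns the best coloring seen and its cost, so the result is never worse than the starting solution even when the search itself accepts worsening moves.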

Finally, because this algorithm operates in the space 
of complete k-colorings (proper and improper), values 
for k must be specified. In our case, initial values are 


determined by executing DSATUR on each instance and 
setting k to the number of colors used in the resultant so- 
lution. During runs, k is then decremented by 1 as soon 
as a feasible k-coloring is found, and the algorithm is 
restarted. Computational effort is measured by count- 
ing the number of constraint checks carried out by the 
algorithm, which occur when the algorithm requests in- 
formation about a problem instance, including checking 
whether two vertices are adjacent (by accessing an ad- 
jacency list or matrix), and referencing the degree of 
a vertex. In all trials a cut-off point of $5 \times 10^{11}$ checks
is imposed, which is roughly double the length of the
longest run performed in [63.33]. In our case, this led
to run times of approximately 1 h on our machines (algorithms were
coded in C++ and executed on a PC under Windows XP 
using a 3.0 GHz processor with 3.18 GB of RAM). 


63.3.1 Problem Instances 


For our trials a set of five problem instances is con- 
sidered. Though this set is quite small, its members 
should be considered as case studies that have been 
deliberately chosen to cover a wide range of graph 
structure — a factor that we have found to be very im- 
portant in influencing the relative performance of graph 
coloring algorithms [63.50]. The first three graphs are 
generated using the publicly available software of Cul- 
berson [63.52], while the remaining two are taken from 
a collection of real-world timetabling problems com- 
piled by Carter et al. [63.53]. Names and descriptions 
of these graphs now follow. Further details are also 
given in Table 63.1: 


#1: Random. This graph features |V| = 1000 and is 
generated such that each of the $\binom{|V|}{2}$ pairs of vertices
is linked by an edge with probability 0.5. Graphs of 
this nature are nearly always considered in compar- 
isons of coloring algorithms. 

#2: Flat(10). Flat graphs are generated by partitioning
the vertices into K equi-sized groups and then
adding edges between vertices in different groups
with probability p. This is done such that the variance
in vertex degrees is kept to a minimum. It
is well known that feasible K-colored solutions
to such graphs are generally easy to achieve except
in cases where p is within a specific range of
values, which results in problems that are notoriously
difficult. Such ranges are commonly termed
phase transition regions [63.54]. This particular instance
is generated so that it features a relatively
small number of large color classes (using |V| =
500 and K = 10, implying approximately 50 vertices per color).
A value of p = 0.115 is used, which has been observed
to provide very difficult instances for a range
of different graph coloring algorithms [63.50].

#3: Flat(100). This graph is generated in the same manner
as the previous one, using |V| = 500, K = 100,
and p = 0.85. Solutions thus feature a relatively
large number of small color classes (approximately 5 vertices
per color).

#4: TT(A). This graph is named car_s_91 in the original
dataset of Carter et al. [63.53]. It is chosen because
it is quite large and, unlike the previous three
graphs, the variance in vertex degrees is quite high.
This problem’s structure is also much less regular
than the previous three graphs, which are generated
in a fairly regimented manner.

#5: TT(B). This graph, originally named pur_s_93, is
the largest problem in Carter’s dataset, with |V| =
2419. It is also quite sparse compared to the previous
graph, though it still features a high variance in
vertex degrees (Table 63.1).


The rightmost column of Table 63.1 also gives information
on the best solutions known for each graph.
These values were determined via extended runs of our
algorithms, or due to information provided by the problem
generator.


63.4 Experiment 1


Our first set of experiments looks at the performance of
GPX by comparing it to two additional recombination
operators. To gauge the advantages of using a global
operator (recombination in this case), we also consider the
performance of TABUCOL on its own, which iterates on
a single solution until the run cut-off point is met.

Our first additional recombination operator follows
the assignment-based scheme discussed in Sect. 63.2.1
and, in each application, utilizes the procedure of Coll
et al. [63.42] (Fig. 63.2) to relabel the second parent.
Offspring are then formed using the classical n-point
crossover, with each gene being inherited from either
parent with probability 0.5.

Our second recombination operator is based on
the grouping genetic algorithm (GGA) methodology
(Sect. 63.2.1), adapted for use in the space of k-colorings.
An example is given in Fig. 63.4. Given
two parents, the color classes in the second parent
are first relabeled using Coll et al.’s procedure. Using
the partition-based representations of these solutions,
a subset of colors in parent 2 is then chosen randomly,
and these replace the corresponding colors in a copy
of parent 1. Duplicate vertices are then removed from
color classes originating from parent 1 and uncolored
vertices are assigned to random color classes. Note that
like GPX, before uncolored vertices are assigned, the
property defined by (63.5) is satisfied by this operator;
however, unlike GPX there is no requirement to inherit
larger color classes or to inherit half of its color classes
from each parent.

Fig. 63.4 Demonstration of the GGA recombination operator. Here, color classes in parent 2 are labeled to maximally match those of parent 1
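The GGA-based operator just described can be sketched as follows. The sketch assumes that parent 2 has already been relabeled to match parent 1, that solutions are lists of k vertex sets indexed by color, and that the subset of inherited colors is passed in explicitly so the example is reproducible. The function name `gga_recombine` and the example parents are hypothetical and chosen only for illustration.

```python
import random


def gga_recombine(parent1, parent2, chosen, rng=random):
    """Replace the classes of parent1 indexed by `chosen` with parent2's."""
    k = len(parent1)
    vertices = set().union(*parent1)
    child = [set(parent2[i]) if i in chosen else set(parent1[i])
             for i in range(k)]
    injected = set().union(*(parent2[i] for i in chosen))
    for i in range(k):
        if i not in chosen:
            child[i] -= injected        # remove duplicates from parent-1 classes
    uncolored = vertices - set().union(*child)
    before_repair = [set(c) for c in child]
    for v in sorted(uncolored):         # random assignment of leftovers
        rng.choice(child).add(v)
    return child, before_repair, uncolored


parent1 = [{1, 10}, {7, 8}, {3, 5}, {2, 4, 9}, {6}]
parent2 = [{1, 9}, {7}, {3, 5}, {2, 4, 8}, {6, 10}]
child, before_repair, uncolored = gga_recombine(
    parent1, parent2, chosen={2, 3}, rng=random.Random(0))
```

In this example the third and fourth classes come from parent 2, vertex 8 is deleted from the second class as a duplicate, and vertex 9 is left uncolored before the random repair step. Each class in `before_repair` is a subset of a class in one of the parents, as required by (63.5).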

A summary of the results achieved by the three 
recombination operators (together with TABUCOL) is 
given in Table 63.2. For each instance the same set of 20 
initial populations was used with the EAs, and entries 
in bold signify samples that are significantly different 
to the non-bold EA entries according to a Wilcoxon 
signed-rank test at the 0.01 significance level. For graph 
#1 we see that GPX has clearly produced the best re- 
sults — indeed, even its worst result features two fewer 
colors than the next best solution. However, for graphs 
#2 and #5, no significant difference between the EAs 
is observed, while for #3 and #4, better results are pro- 
duced by the GGA and the n-point crossover. 

Fig. 63.5a,b Run profiles for the instances (mean of 20 runs): (a) #1 (Random). (b) #3 (Flat(100)). Both panels plot the number of colors against the number of checks (×10^11) for GPX, n-point, GGA, and TABUCOL

Figure 63.5 shows run profiles for two example
graphs. We see that in both cases TABUCOL provides
the fastest rates of improvement, though it is eventually
overtaken by at least one of the EAs. Table 63.2,
however, also reveals that TABUCOL performs very 
poorly with graphs #4 and #5. This seems due to the 
high degree variance in these cases, which we observe 
makes the cost of neighboring solutions in the search 
space vary more widely. This suggests a more spiky cost 
landscape in which the use of local search in isolation 
exhibits a susceptibility for becoming trapped at local 
optima (see also [63.50]). 

An important factor behind the differing perfor- 
mances of these EAs is the effect that recombination 
has on the population diversity. To examine this, we first 
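define a metric for measuring the distance between two
solutions: Given a solution $\mathcal{U}$, let $P_{\mathcal{U}} = \{\{u, v\} : c(u) = c(v)\}$,
for all $u, v \in V$, $u \neq v$. The distance between two
solutions $\mathcal{U}_1$ and $\mathcal{U}_2$ can then be defined as

$D(\mathcal{U}_1, \mathcal{U}_2) = \frac{|P_{\mathcal{U}_1} \cup P_{\mathcal{U}_2}| - |P_{\mathcal{U}_1} \cap P_{\mathcal{U}_2}|}{|P_{\mathcal{U}_1} \cup P_{\mathcal{U}_2}|}$. (63.6)

This measure gives the proportion of vertex pairings
(assigned to the same color) that exist in
just one of the two solutions. Consequently, if $\mathcal{U}_1$
and $\mathcal{U}_2$ are identical, then $P_{\mathcal{U}_1} \cup P_{\mathcal{U}_2} = P_{\mathcal{U}_1} \cap P_{\mathcal{U}_2}$,
giving $D(\mathcal{U}_1, \mathcal{U}_2) = 0$. Conversely, if no vertex pair
is assigned the same color in both solutions, $P_{\mathcal{U}_1} \cap P_{\mathcal{U}_2} = \emptyset$,
implying $D(\mathcal{U}_1, \mathcal{U}_2) = 1$. Population diversity can also be
defined as the mean distance between each pair of solutions
in the population. That is, given a set of m
individuals $\mathbf{U} = \{\mathcal{U}_1, \mathcal{U}_2, \ldots, \mathcal{U}_m\}$,

$\mathrm{Diversity}(\mathbf{U}) = \frac{1}{\binom{m}{2}} \sum_{\forall \mathcal{U}_i, \mathcal{U}_j \in \mathbf{U} : i < j} D(\mathcal{U}_i, \mathcal{U}_j)$. (63.7)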
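Both measures are straightforward to compute. The following Python sketch assumes that a solution is a dictionary mapping vertices to colors; the function names are hypothetical, and returning 0 when neither solution contains any same-colored pair is an assumption made here to avoid division by zero, a case the definitions above leave open.

```python
from itertools import combinations


def same_color_pairs(coloring):
    """The set of unordered vertex pairs that share a color."""
    return {frozenset((u, v)) for u, v in combinations(coloring, 2)
            if coloring[u] == coloring[v]}


def distance(c1, c2):
    """Distance (63.6) between two colorings of the same vertex set."""
    p1, p2 = same_color_pairs(c1), same_color_pairs(c2)
    union = p1 | p2
    if not union:  # assumption: no same-colored pairs in either solution
        return 0.0
    return (len(union) - len(p1 & p2)) / len(union)


def diversity(population):
    """Mean pairwise distance (63.7) over a population of colorings."""
    pairs = list(combinations(population, 2))
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


a = dict(enumerate((1, 4, 3, 4, 3, 5, 2, 2, 4, 1), start=1))
b = dict(enumerate((1, 2, 3, 2, 3, 5, 4, 4, 2, 1), start=1))  # relabeling of a
c = dict(enumerate((1, 4, 3, 4, 3, 5, 2, 2, 2, 1), start=1))  # v9 moved
d_ab = distance(a, b)
d_ac = distance(a, c)
div = diversity([a, b, c])
```

Because the measure is built from vertex pairs rather than color labels, a relabeled copy of a solution is at distance 0 from the original, while moving the single vertex v9 changes four of the eight pairs involved and gives a distance of 0.5.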


Table 63.2 Number of colors in the best feasible solution achieved at the cut-off point (mean (min; median; max) of 20 runs)

     GPX                      n-point                  GGA                      TABUCOL
#1   87.00 (87; 87; 87)       93.35 (93; 93; 94)       91.55 (91; 92; 92)       89.10 (89; 89; 90)
#2   12.95 (12; 13; 13)       13.00 (13; 13; 13)       13.00 (13; 13; 13)       13.00 (13; 13; 13)
#3   105.60 (105; 106; 106)   105.05 (105; 105; 106)   105.05 (105; 105; 106)   105.90 (105; 106; 106)
#4   29.05 (28; 29; 30)       28.00 (28; 28; 28)       27.90 (27; 28; 29)       38.20 (32; 37.5; 46)
#5   33.30 (33; 33; 34)       33.15 (32; 33; 34)       33.10 (32; 33; 34)       52.05 (47; 52; 56)

Fig. 63.6a-d Relationship between parental distance and number of uncolored vertices with the GPX (a) and GGA (b)
operators. Also shown is the number of uncolored vertices in the first 10000 applications of GPX (c) and GGA (d)

Considering our results, the two scatter plots of
Fig. 63.6 demonstrate the positive correlation that exists
between parental distance and the number of uncolored
vertices that result in applications of the GPX and
GGA operators. This data was derived from graph #4,
though similar patterns were observed for the other
instances. Note that the correlation is weaker for GGA
for two reasons. First, unlike GPX, which requires
half of the color classes to be inherited from each parent,
with GGA this proportion can vary. Thus if the
majority of color classes are inherited from just one parent,
it is possible to have two very different parents, but
only a small number of uncolored vertices. Second, as 
mentioned earlier GGA shows no bias towards inher- 
iting larger color classes, meaning that the number of 
uncolored vertices can also be higher than GPX, partic- 
ularly when inheriting around half of the color classes 
from each parent. An effect of these patterns is shown 
in the lower graphs of Fig. 63.6, where throughout the 
evolutionary process, the number of uncolored vertices 
occurring during recombination is fewer and less varied 
with GPX. In comparison to GGA, this behavior leads 
to a more rapid loss of diversity, as is demonstrated in 
Fig. 63.7 for two example graphs. 

Whether sustained diversity is a help or hindrance 
with these EAs thus seems to depend on the type of 
graph being tackled. As can be seen in Fig. 63.7, for 


graph #1 GPX is the only recombination operator that 
leads to any sort of population convergence, and it is 
also the algorithm that produces the best solutions given 
sufficient time, suggesting that it is suitably homing in on
high-quality regions of the search space. On the other 
hand, for graphs #3 and #4, GGA’s more sustained di- 
versity (caused and perpetuated by the greater number 
of uncolored vertices that occur during recombination) 
causes the operator to be more disruptive. However, in 
these cases this factor also seems to provide a useful 
diversification mechanism, allowing the algorithm to 
sample wider areas of the search space, leading to better 
results. An extreme case of diversity loss occurs with 
graph #5, which we recall has a low density and high 
degree variance. In this case, when using GPX, large
color classes of low-degree vertices that are formed in
early stages of the algorithm quickly come to dominate
the population, limiting the exploration that then takes
place; indeed, in many runs the algorithm was actually
unable to improve on costs achieved in the initial
population.

Fig. 63.7a,b Population diversity during the first 10000 recombinations with (a) the random (#1) and (b) TT(A) (#4) instances

Figure 63.7 also shows that n-point crossover tends
to maintain diversity for longer periods than GPX in this
case, allowing it to produce superior results for graphs
#3 and #4. However, the sustained diversity is not due
to uncolored vertices (which do not occur with this
operator); rather, it seems due to the naturally occurring
disruption that results from the color labeling issues
mentioned in Sect. 63.2.1.

Finally, we also mention that during our runs with
these EAs, the local search element was observed to be
by far the most expensive part of the algorithm, with
none of the recombination operators consuming more
than 1.8% of the available run time.


63.5 Experiment 2


In this section we now consider ways in which the
results of the GPX operator might be improved,
particularly looking at how we might encourage diversity to
be sustained in the population.

As mentioned in Sect. 63.2.1, Lü and Hao [63.34]
previously proposed extending the GPX operator to
allow offspring to be produced using m ≥ 2 parents. In
this operator, which we call MULTIX, offspring are
constructed in the same manner as GPX, except that at
each stage the largest color class from multiple parents
is chosen to be copied into the offspring. The intention
behind this increased choice is that larger color classes
will be identified, resulting in fewer uncolored vertices
once the k color classes have been constructed. In order
to prohibit too many colors being inherited from one
particular parent, Lü and Hao also make use of a
parameter q, specifying that if the i-th color class in an
offspring is copied from a particular parent, then this
parent should not be considered for further q colors. In
our application of MULTIX we follow the recommendations
of Lü and Hao, choosing m randomly from the
set {2, ..., 6} in each application and using $q = \lfloor m/2 \rfloor$.
Note also that GPX is simply an application of MULTIX
using m = 2 and q = 1.
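The construction stage of MULTIX can be sketched as a direct generalization of GPX. The code below is an illustrative reading of the description, not Lü and Hao's implementation: parents are lists of vertex sets, ties are broken in favor of the earliest parent and class, q is assumed to be smaller than the number of parents so that at least one parent is always available, and the random assignment of uncolored vertices is omitted. The function name `multix_classes` is hypothetical.

```python
def multix_classes(parents, k, q):
    """Build k color classes from m parents.

    A parent that donates a class is barred from donating the next q classes.
    """
    work = [[set(c) for c in p] for p in parents]
    barred_until = [0] * len(work)  # parent p is usable when step >= barred_until[p]
    child = []
    for step in range(k):
        allowed = [p for p in range(len(work)) if step >= barred_until[p]]
        donor = max(allowed, key=lambda p: max(len(c) for c in work[p]))
        largest = set(max(work[donor], key=len))
        child.append(largest)
        barred_until[donor] = step + q + 1
        for parent in work:         # delete copied vertices from every parent
            for cls in parent:
                cls -= largest
    return child


p1 = [{1, 2, 3}, {4, 5, 6, 7}, {8, 9, 10}]
p2 = [{3, 4, 5, 7}, {1, 6, 9}, {2, 8, 10}]
p3 = [{1, 2, 3, 4, 5}, {6, 7}, {8, 9, 10}]
as_gpx = multix_classes([p1, p2], k=3, q=1)
three_parents = multix_classes([p1, p2, p3], k=3, q=1)
```

With m = 2 and q = 1 the parents are forced to alternate, so the first call reproduces the classes built by GPX in Fig. 63.3. With the hypothetical third parent added, every vertex is covered by the inherited classes and no random assignment would be needed.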

Though having the potential to produce good re- 
sults [63.34], an issue with MULTIX is that it could 
result in diversity being lost even more rapidly than with 
GPX, particularly if fewer vertices need to be randomly 
recolored at the end of each application. In [63.34], Lü 
and Hao attempt to deal with this using a mechanism 
whereby offspring are only inserted into the population 
if they are seen to be sufficiently different or better than 
existing members. However, in our case, we suggest 
two alternative methods. 

Fig. 63.8 Example Kempe chain involving, e.g., vertex v7 and color 4 (left), and the resultant coloring due to a color interchange (right)

The first of these involves altering the MULTIX
operator so that it works exclusively with proper colorings.
As noted, GPX and MULTIX currently operate
on colorings in which clashes are permitted; however, 
this could in theory result in large color classes that 
feature many clashes being unduly promoted in the pop- 
ulation, when perhaps the real emphasis should be on 
the promotion of large color classes that are indepen- 
dent sets. The ISETS approach thus operates by first 
iteratively removing clashing vertices from each parent 
(in a random order, until proper colorings are achieved), 
and then using the MULTIX operator to produce an off- 
spring as before. This implies that, before recoloring 
missing vertices, offspring will also be proper, since 
subsets of independent sets are themselves independent 
sets. A further effect is that a greater number of vertices 
might need to be recolored, since vertices originally re- 
moved from the parents could also be missing in the 
resultant offspring. 
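The clash-removal step that precedes recombination in ISETS might be sketched as follows. This is a minimal illustration under stated assumptions: the graph is a symmetric adjacency dictionary, a solution is a list of vertex sets, and a vertex is skipped if earlier removals have already resolved its clashes. The function name `remove_clashes` is hypothetical.

```python
import random


def remove_clashes(adj, classes, rng=random):
    """Remove clashing vertices in random order until each class is independent."""
    classes = [set(c) for c in classes]
    removed = set()
    for cls in classes:
        clashing = [v for v in cls if any(u in cls for u in adj[v])]
        rng.shuffle(clashing)
        for v in clashing:
            # v may no longer clash once some of its neighbors have gone
            if any(u in cls for u in adj[v]):
                cls.remove(v)
                removed.add(v)
    return classes, removed


path = {1: [2], 2: [1, 3], 3: [2], 4: []}
proper_classes, removed = remove_clashes(path, [{1, 2, 3}, {4}], random.Random(0))
```

The removed vertices are returned separately because, as noted above, they may also be missing from the offspring and then need to be recolored.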

Our second proposal for prolonging diversity is to 
make changes directly to an offspring to try to increase 
its distance from its parents before reinsertion into the 
population. One way of doing this would be to in- 
crease the iteration limit of the local search procedure, 
as demonstrated by Galinier and Hao [63.33]. How- 
ever, we find that such an approach can slow the algo- 
rithm unnecessarily, particularly because as the proce- 
dure progresses, movements in the search space (due 
to improving or sideways moves) become less frequent. 
An alternative in this case is to exploit the structure of 
the graph coloring problem via the use of a Kempe chain 
interchange operator. Kempe chains define connected 
sub-graphs that involve exactly two colors, and can be 
generated by taking an arbitrary vertex v and color i, 
such that c(v) ≠ i. An example is given in Fig. 63.8.
Note that when interchanging the colors of vertices in 


a Kempe chain, if the original coloring is proper, then 
so is the new coloring. Thus we have the opportunity 
to quickly alter colorings without compromising their 
quality. 

Our KEMPE approach operates in the same man- 
ner as ISETS, except that before reassigning uncolored 
vertices, a series of randomly selected Kempe chain 
interchanges are performed on the existing proper col- 
oring. In our case, 2k such moves are applied. 
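
A Kempe chain interchange can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments; the adjacency-dictionary representation, the function names, and the small example graph are assumptions made for this sketch.

```python
from collections import deque

def kempe_chain(adj, coloring, v, i):
    """Return the Kempe chain containing v for the colors coloring[v] and i."""
    j = coloring[v]
    if i == j:
        raise ValueError("color i must differ from c(v)")
    chain, queue = {v}, deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            # Grow the chain through neighbors that use one of the two colors.
            if w not in chain and coloring[w] in (i, j):
                chain.add(w)
                queue.append(w)
    return chain

def kempe_interchange(adj, coloring, v, i):
    """Swap the two colors on the Kempe chain and return the new coloring."""
    j = coloring[v]
    chain = kempe_chain(adj, coloring, v, i)
    new_coloring = dict(coloring)
    for u in chain:
        new_coloring[u] = i if coloring[u] == j else j
    return new_coloring

def is_proper(adj, coloring):
    """True if no edge joins two vertices of the same color."""
    return all(coloring[u] != coloring[w] for u in adj for w in adj[u])

# Hypothetical example: the path a-b-c-d-e.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
coloring = {"a": 0, "b": 1, "c": 0, "d": 2, "e": 1}
chain = kempe_chain(adj, coloring, "a", 1)
swapped = kempe_interchange(adj, coloring, "a", 1)
```

In this example the chain stops at vertex d, which uses a third color, so only a, b, and c change color and the coloring stays proper.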

The results achieved by our three modifications are 
summarized in Table 63.3, where bold entries signify 
samples that are significantly different to GPX at signif- 
icance level 0.01. We see that improvements over GPX 
were only obtained on graph #1, where all three vari- 
ants were successful, and graph #4 using the KEMPE 
variant. In practice, we found that MULTIX causes di- 
versity to be lost more quickly than GPX with these 
graphs — however, the ISETS mechanism did not seem 
to alter this behavior a great deal, usually because the 
number of clashing vertices needing to be removed was 
quite small (less than 10). 

Surprisingly, we also found that the KEMPE vari- 
ant was only able to maintain higher levels of diversity 
with instances #4 and #5. For graphs #1, #2, and #3, 
it turns out that when using a suitably low number of 
colors k, the bipartite graphs induced by most pairs of 
color classes in a solution are connected. In these cases, 
all of the vertices belonging to the two color classes are 
included in the Kempe chain, meaning that a color inter- 
change does not alter the structure of the solution, but 
merely produces a relabeling of the two color classes. 
(An example of such a Kempe chain would occur in 
Fig. 63.8 using vertex v3 and color 2.) This is not the 
case for the less structured graphs #4 and #5, where we 


Table 63.3 Number of colors in the best feasible coloring achieved at the cut-off point (mean (min; median; max) from 20 runs)

Graph  GPX                       MULTIX                    ISETS                     KEMPE
#1     87.00 (87; 87; 87)        85.00 (85; 85; 85)        85.05 (85; 85; 86)        85.15 (85; 85; 86)
#2     12.95 (12; 13; 13)        13.00 (13; 13; 13)        13.00 (13; 13; 13)        12.90 (12; 13; 13)
#3     105.60 (105; 106; 106)    105.55 (105; 106; 106)    105.85 (105; 106; 106)    105.30 (105; 105; 106)
#4     29.05 (28; 29; 30)        29.10 (29; 29; 30)        29.00 (28; 29; 30)        28.00 (28; 28; 28)
#5     33.30 (33; 33; 34)        33.30 (33; 33; 34)        33.30 (33; 33; 34)        33.30 (33; 33; 34)
Fig. 63.9 (a) Run profile for TT(A) (graph #4, left), and (b) its diversity over the first 10 000 recombinations


found that diversity could be maintained for longer periods.
However, this only led to significant improvements
in the results for graph #4, whose run profiles are shown
in Fig. 63.9. Also note that these enhanced results still
fail to beat those of the GGA and n-point operators, as
shown in Table 63.2.


63.6 Conclusions and Discussion


In this chapter we have examined the relative performance
of a number of different graph coloring
recombination operators. Using a common evolutionary
framework, we have seen that this performance
varies, particularly due to the underlying structures of
the graphs being tackled.

A desirable property of recombination is that it
should be able to combine meaningful substructures of
existing candidate solutions (parents) in the production
of new, hopefully fitter, offspring. However, does that
process actually occur with any of these operators? Or,
by involving the random reassignment of some vertices,
do the operators simply provide a mechanism by which
large random perturbations are periodically applied to
a solution, helping to re-invigorate the search process?


Again, the answer to such a question seems to de- 
pend on the problem instance at hand. In Table 63.4 
we compare the costs of solutions achieved by the best 
available recombination operator for each instance, to- 
gether with those produced by a corresponding random 
perturbation operator. Specifically, for each graph we 
identified the best run from the EA’s sample of 20 and 
recorded the number of uncolored vertices that resulted 
in each application of recombination. We then used 
these figures, together with the same k-value, to specify 
the number of vertices that would be randomly selected 
and reassigned in each corresponding application of our 
random perturbation operator. In each iteration this algorithm
then operated by selecting two parents, making
a copy of parent 1, randomly perturbing this copy, applying
local search, and finally replacing the weaker of
the two parents.

Table 63.4 Comparison of the best EA and corresponding random perturbation operator. (Cost of best solutions using $f_2$ (63.2) (mean (min; median; max) from 20 runs), and proportion of runs where $f_2 = 0$ (feasibility) was achieved)

                     EA                           Random
Graph  k    Type     Cost                 Feas.   Cost                    Feas.
#1     85   MULTIX   0.00 (0; 0; 0)       1.00    16.80 (4; 17.5; 31)     0.00
#2     12   GPX      2.40 (0; 2; 4)       0.05    7.60 (5; 8; 10)         0.00
#3     105  GPX      0.90 (0; 1; 2)       0.40    1.75 (0; 2; 3)          0.15
#4     2    GGA      1.10 (0; 1; 2)       0.15    1.35 (0; 1; 2)          0.05
#5     32   GGA      1.75 (0; 2; 3)       0.05    1.50 (0; 1.5; 3)        0.15

The results in Table 63.4 indicate that, for graph 
#1, recombination is clearly doing more than just ran- 
domly perturbing solutions since all runs have resulted 
in feasible 85-colorings. However, although recombina- 
tion has achieved significantly lower costs with graph 
#2, the proportion of runs where feasibility has been 
achieved shows no significant difference for any of the 
graphs #2 to #5 (according to McNemar’s test at signif- 
icance level 0.01). We find this observation compelling 
as it might suggest that better results might ultimately 
be achieved using schemes that make more informed 
decisions about the size and frequency of perturbations. 
Indeed, currently the size of random perturbations tends 
to fall as the run progresses (Fig. 63.6); however, it may 
be useful to allow this trend to be reversed, particularly 
if improvements are not achieved for a lengthy period of 
time. In addition, the way in which vertices are chosen 
for random reassignment might also influence perfor- 
mance — for example, we might target those belonging 
to a specific color, those that are causing clashes, those 
that have been assigned to a particular color for the 
longest, and so on. This requires further research. 

An interesting point regarding the structure of solu- 
tions was raised previously by Porumbel et al. [63.35], 
who considered the sizes of the color classes. Specifi-
cally, they propose that when solutions involve a small

64. Metaheuristic Algorithms and Tree Decomposition


Thomas Hammerl, Nysret Musliu, Werner Schafhauser 


This chapter deals with the application of evo- 
lutionary approaches and other metaheuristic 
techniques for generating tree decompositions. 
Tree decomposition is a concept introduced by 
Robertson and Seymour [64.1] and it is used to 
characterize the difficulty of constraint satisfaction 
and NP-hard problems that can be represented 
as a graph. Although, in general, no polynomial 
algorithms have been found for such problems, 
particular instances can be solved in polyno- 
mial time if the treewidth of their corresponding 
graph is bounded by a constant. The process of 
solving problems based on tree decomposition 
comprises two phases. First, a decomposition with 
small width is generated. Basically in this phase 
the problem is divided into several subproblems, 
each included in one of the nodes of the tree de- 
composition. The second phase includes solving 
a problem (based on the generated tree decompo- 
sition) with a particular algorithm such as dynamic 
programming. The main idea is that by decompos- 
ing a problem into subproblems of limited size, 
the whole problem can be solved more efficiently. 
The time for solving the problem based on its tree 
decomposition usually depends on the width of 
the tree decomposition. Thus, it is of high inter- 
est to generate tree decompositions having small 
widths. 

Finding the treewidth of a graph is an NP-hard 
problem [64.2]. In order to solve this problem, 
different algorithms have been proposed in the 
literature. Exact methods such as branch and 
bound techniques can be used only for small 
graphs. Therefore, metaheuristic algorithms based 
on genetic algorithms [64.3], simulated anneal- 
ing [64.4], tabu search [64.5], iterated local 
search [64.6], and ant colony optimization
(ACO) [64.7, 8] have been proposed in the literature
to generate good upper bounds for larger graphs. 
Such techniques have been applied very success- 
fully and they are able to find the best existing 
upper bounds for many benchmark problems in 
the literature. 

In this chapter, we will first introduce the 
concept of tree decomposition, and then give 
a survey on metaheuristic techniques used to 
generate tree decompositions. Three approaches 
based on genetic algorithms, iterated local search, 
and ACO that were proposed in the literature 
will be described in detail. Finally, we will also 
mention briefly two recent approaches that ex- 
ploit tree decompositions within metaheuristic 
search.

64.1 Tree Decompositions


We start with an informal description of tree decompo- 
sition. Suppose that we have to find solutions for the 
graph coloring problem (GCP), which is a well-known 
constraint satisfaction problem (CSP) in the literature. 
For this problem, we have to find a coloring of vertices 
of a given graph in such a way that no two vertices con- 
nected by an edge share the same color. An instance of 
the GCP is shown on the left-hand side of Fig. 64.1. The 
task is now to find a valid coloring just using the colors 
red, green, and blue. 

A naive approach to solve this problem might be 
to try out all possible combinations of variable assign- 
ments and see which ones are valid. In general, there 
are $d^n$ possible combinations, where d is the number of
available colors and n is the number of vertices. 

To solve this problem by tree decomposition, we first
generate a tree decomposition of the corresponding
problem graph. Informally, a tree decomposition is
a tree in which every node contains a group of graph
vertices and which fulfils the following conditions:
each vertex of the graph appears in at least one node of
the tree; if two vertices are connected in the graph, they
must appear together in at least one tree node; and the
connectedness condition must be fulfilled, i.e., if a vertex
appears in two different nodes of the tree, it must also
appear in all nodes on the path between these two nodes.
The formal definition of tree decomposition is given in
the next section.

The corresponding constraint graph of a coloring
problem and a possible tree decomposition are shown
in Fig. 64.1a,b. If we want to solve the graph coloring
problem based on this tree decomposition, we can
start out by solving the subproblems given by each node
in the tree decomposition. Using a naive approach of
trying out all possible combinations of variable assignments,
one has to generate $3^3 = 27$ different solution
candidates for the tree node containing A, B, and C.
Because of the constraints $A \neq B$, $A \neq C$, and $B \neq C$,
only six of them are valid. For the subproblem containing
the vertices C and D we generate $3^2 = 9$ solution
candidates and rule out three of them because of the
constraint $C \neq D$. We can now get all solutions to the
whole problem by joining the subproblem solutions.
To do so, we look at the variables that both subproblems
have in common. In this case, that is the variable C.
Each solution for the subproblem A, B, C is joined with
the solutions for the subproblem C, D that share the same
color for the vertex C. By using the tree decomposition,
we have to generate $27 + 9 = 36$ combinations of variable
assignments in order to determine all solutions, compared
to the $3^4 = 81$ combinations we would have to generate
without the tree decomposition. This difference increases
very quickly with the size of the graph coloring problem
and of constraint satisfaction problems in general. The
smaller the subproblems in the tree decomposition, the
more efficiently we can solve a particular problem. This
motivates our interest in finding tree decompositions of
small width.

Fig. 64.1a,b Instance of the graph coloring problem and
a possible tree decomposition
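
The counting argument above can be reproduced with a short sketch. It assumes the example graph has exactly the edges A–B, A–C, B–C, and C–D (as implied by the constraints in the text); the helper name `solve_subproblem` is introduced only for this illustration.

```python
from itertools import product

COLORS = ("red", "green", "blue")

def solve_subproblem(variables, constraints):
    """Enumerate all colorings of `variables`; return (#candidates, valid ones)."""
    candidates = list(product(COLORS, repeat=len(variables)))
    valid = []
    for candidate in candidates:
        assignment = dict(zip(variables, candidate))
        if all(assignment[a] != assignment[b] for a, b in constraints):
            valid.append(assignment)
    return len(candidates), valid

# One subproblem per node of the tree decomposition.
n_abc, sols_abc = solve_subproblem(("A", "B", "C"),
                                   [("A", "B"), ("A", "C"), ("B", "C")])
n_cd, sols_cd = solve_subproblem(("C", "D"), [("C", "D")])

# Join partial solutions that agree on the shared variable C.
joined = [{**s1, **s2} for s1 in sols_abc for s2 in sols_cd
          if s1["C"] == s2["C"]]

# Brute force over all four variables, for comparison.
n_all, sols_all = solve_subproblem(
    ("A", "B", "C", "D"),
    [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")])
```

Both routes yield the same set of solutions, but the decomposition only enumerates 27 + 9 candidates instead of 81.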

Note that tree decompositions have been used in
several application areas, such as combinatorial
optimization problems, expert systems, and computational
biology. The use of tree decomposition for inference
problems in probabilistic networks is shown in [64.9].
Koster et al. [64.10] propose the application of tree
decompositions to the frequency assignment problem. Tree
decomposition has also been applied to the vertex
cover problem on planar graphs [64.11]. Furthermore,
solving partial constraint satisfaction problems (e.g.,
MAX-SAT) with tree-decomposition-based methods has
been investigated in [64.12]. In computational biology,
tree decompositions have been used for protein structure
prediction [64.13]. Recently, the application of tree
decomposition in answer-set programming has been
investigated in [64.14].

Fig. 64.2a,b A graph G (a) and a tree decomposition
of G (b)


64.1.1 Formal Definitions


The concept of tree decompositions was first introduced
by Robertson and Seymour [64.1]. The formal
definition of tree decomposition is given as follows
[64.1, 15].


Definition 64.1

Let $G = (V, E)$ be a graph. A tree decomposition of G
is a pair $(T, \chi)$, where $T = (I, F)$ is a tree with node
set I and edge set F, and $\chi = \{\chi_i : i \in I\}$ is a family of
subsets of V, one for each node of T, such that:

1. $\bigcup_{i \in I} \chi_i = V$,

2. for every edge $(v, w) \in E$, there is an $i \in I$ with
$v \in \chi_i$ and $w \in \chi_i$, and

3. for all $i, j, k \in I$, if j is on the path from i to k in T,
then $\chi_i \cap \chi_k \subseteq \chi_j$.

The width of a tree decomposition is $\max_{i \in I} |\chi_i| - 1$.
The treewidth of a graph G, denoted by $tw(G)$, is the
minimum width over all possible tree decompositions
of G.
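
The three conditions of Definition 64.1 can be checked mechanically. The sketch below is an illustration under stated assumptions: bags are given as a dictionary from tree-node labels to vertex sets, the tree as a list of edges, and T is assumed (not verified) to be a tree. Condition 3 is tested in its equivalent form that, for every vertex, the tree nodes containing it form a connected subtree.

```python
def is_tree_decomposition(vertices, edges, bags, tree_edges):
    """Check conditions 1-3 of the definition (T is assumed to be a tree)."""
    # Condition 1: the bags cover exactly the vertex set V.
    if set().union(*bags.values()) != set(vertices):
        return False
    # Condition 2: both endpoints of every edge share at least one bag.
    for v, w in edges:
        if not any(v in bag and w in bag for bag in bags.values()):
            return False
    # Condition 3: the tree nodes containing a vertex must be connected in T.
    tree_adj = {i: set() for i in bags}
    for i, j in tree_edges:
        tree_adj[i].add(j)
        tree_adj[j].add(i)
    for v in vertices:
        nodes = {i for i, bag in bags.items() if v in bag}
        start = next(iter(nodes))
        seen, stack = {start}, [start]
        while stack:
            i = stack.pop()
            for j in tree_adj[i]:
                if j in nodes and j not in seen:
                    seen.add(j)
                    stack.append(j)
        if seen != nodes:
            return False
    return True

def width(bags):
    """Width of a decomposition: size of the largest bag minus one."""
    return max(len(bag) for bag in bags.values()) - 1

# The coloring example: edges A-B, A-C, B-C, C-D.
V = ["A", "B", "C", "D"]
E = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
good = {1: {"A", "B", "C"}, 2: {"C", "D"}}
ok = is_tree_decomposition(V, E, good, [(1, 2)])
# C occurs in nodes 1 and 3 but not in node 2 on the path between them.
bad = {1: {"A", "B", "C"}, 2: {"B"}, 3: {"C", "D"}}
not_ok = is_tree_decomposition(V, E, bad, [(1, 2), (2, 3)])
```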


Figure 64.2 shows a graph G and a possible tree
decomposition of G. The width of the tree decomposition
shown is 3.

For a given graph G, the treewidth can be found
from its triangulation. In the following, we give basic
definitions, explain how the triangulation of a graph
can be constructed, and state results that relate the
treewidth to the triangulated graph.

Two vertices u and v of a graph G(V, E) are neighbors
if they are connected by an edge $e \in E$. The
neighborhood of a vertex v is defined as
$N(v) := \{w \mid w \in V, (v, w) \in E\}$. A set of vertices is
a clique if the vertices are pairwise connected. An edge
connecting two nonadjacent vertices of a cycle is called
a chord. A graph is triangulated if there exists a chord
in every cycle of length larger than 3.

A vertex of a graph is simplicial if its neighbors
form a clique. An ordering of nodes $\sigma(1, 2, \ldots, n)$ of
V is called a perfect elimination ordering for G if
for any $i \in \{1, 2, \ldots, n\}$, $\sigma(i)$ is a simplicial vertex in
$G[\sigma(i), \ldots, \sigma(n)]$ [64.16]. In [64.17] it is proved that
the graph G is triangulated if and only if it has a perfect
elimination ordering. Given an elimination ordering of
nodes, the triangulation H of graph G can be constructed
as follows. Initially H = G. Then, in the process of
elimination of vertices, the next vertex in the ordering
is made simplicial by adding new edges that connect all
of its neighbors in the current G and in H.
The vertex is then eliminated from G. This process is
repeated for all vertices in the ordering.

The treewidth of a triangulated graph can be calculated
based on its cliques. For a given triangulated
graph, the treewidth is equal to the size of its largest
clique minus 1 [64.18]. Moreover, the largest clique of
a triangulated graph can be calculated in polynomial time.
The complexity of calculating the largest clique of a
triangulated graph is $O(|V| + |E|)$ [64.18]. For every
graph $G = (V, E)$, there exists a triangulation of G,
$\bar{G} = (V, E \cup E_t)$, with $tw(\bar{G}) = tw(G)$. Thus, finding
the treewidth of a graph G is equivalent to finding a
triangulation $\bar{G}$ of G with the minimum clique size (for
more information see [64.15]).

The process of eliminating nodes from the given
graph G is illustrated in Fig. 64.3. Suppose that we are
given the following elimination ordering: 10, 9, 8, 7, 2,
3, 6, 1, 5, 4. The vertex 10 is eliminated from G first.
When this vertex is eliminated, no new edges are added
to the graphs G and H (graph H is not shown in the
figure), as all neighbors of node 10 are connected. From
the remaining graph G the vertex 9 is eliminated. To
connect all neighbors of vertex 9, two new edges are
added to G and H (edges (5, 7) and (6, 7)). The process
of elimination continues until the triangulation H is
obtained. A more detailed description of the algorithm for
constructing a graph's triangulation for a given elimination
ordering is found in [64.15].
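
The elimination procedure can be sketched as follows. This is a minimal illustration, not the algorithm of [64.15]; the adjacency-set representation and the small 4-cycle example (which is not the graph of Fig. 64.3) are assumptions made here.

```python
def eliminate(adj, ordering):
    """Eliminate vertices in the given order.

    Returns the triangulation H, the list of (vertex, bag) pairs created
    along the way, and the width of the resulting decomposition.
    """
    g = {v: set(ns) for v, ns in adj.items()}  # working copy of G
    h = {v: set(ns) for v, ns in adj.items()}  # triangulation H, starts as G
    bags = []
    for v in ordering:
        nbrs = g[v]
        bags.append((v, {v} | nbrs))
        # Make v simplicial: connect all of its current neighbors in G and H.
        for a in nbrs:
            for b in nbrs:
                if a != b:
                    g[a].add(b)
                    h[a].add(b)
        # Remove v from the working graph.
        for a in nbrs:
            g[a].discard(v)
        del g[v]
    return h, bags, max(len(bag) for _, bag in bags) - 1

# Hypothetical example: the cycle 1-2-3-4-1.
cycle4 = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {3, 1}}
h, bags, w = eliminate(cycle4, [1, 2, 3, 4])
```

Eliminating vertex 1 adds the chord (2, 4), after which no further fill-in edges are needed; the width obtained is 2.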

To generate the tree decomposition during the
vertex elimination process, the nodes of the tree
decomposition are created first. This is illustrated in Fig. 64.3.
When vertex 10 is eliminated, a new tree decomposition
node is created. This node contains the vertex
10 and all other vertices that are connected with
this vertex in the current graph G. Next, the
tree node with vertices {5, 6, 7, 9} is created when the
vertex 9 is eliminated. By the end of the elimination
process, all tree decomposition nodes have been created. The
created tree nodes must then be connected such that the
connectedness condition for vertices is fulfilled. This
is the third condition in the tree decomposition
definition. To fulfil this condition, the tree decomposition
nodes are connected as follows. The tree decomposition
node with vertices {7, 9, 10}, which is created when
vertex 10 is eliminated, is connected with the tree
decomposition node that is created when the next
vertex appearing in {7, 9, 10} is eliminated. In this
case, the node {7, 9, 10} is connected with the
node created when vertex 9 is eliminated, because this
is the next vertex in the ordering that is contained in
{7, 9, 10}. This rule is applied in the same way to connect
the other tree decomposition nodes, and the tree
decomposition in Fig. 64.2 is obtained from the graph.
Note that some of the tree nodes created in the
elimination process are not shown in the tree
decomposition, because they are contained in larger tree
nodes. For example, the node {4, 5, 6}, which is created
by eliminating vertex 6, is already contained in the
node {4, 5, 6, 7}, which is created by eliminating vertex
7. Moreover, the tree nodes created by eliminating
vertices 1, 5, and 4 are also contained in other, larger tree
nodes.

Fig. 64.3 Elimination of vertices 10, 9, 8, 7, 2, 3, 6, 1,
5, 4. When a vertex is eliminated, a tree node containing
the eliminated vertex and its neighbors is created
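
The connection rule can be sketched as below. The sketch labels each tree node by the vertex whose elimination created it and links it to the node of the earliest-eliminated remaining neighbor. It keeps all nodes, including those contained in larger ones, which the text notes are usually omitted. The representation and the 4-cycle example are assumptions for illustration.

```python
def tree_decomposition_from_ordering(adj, ordering):
    """Build bags and tree edges from an elimination ordering."""
    g = {v: set(ns) for v, ns in adj.items()}
    position = {v: k for k, v in enumerate(ordering)}
    bags, tree_edges = {}, []
    for v in ordering:
        nbrs = set(g[v])
        bags[v] = {v} | nbrs
        # Connect the neighbors of v pairwise, then remove v from the graph.
        for a in nbrs:
            g[a] |= nbrs - {a}
            g[a].discard(v)
        del g[v]
        # Link this bag to the bag of the next vertex in the ordering
        # that is contained in it.
        if nbrs:
            next_vertex = min(nbrs, key=position.get)
            tree_edges.append((v, next_vertex))
    return bags, tree_edges

# Hypothetical example: the cycle 1-2-3-4-1.
cycle4 = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {3, 1}}
bags, tree_edges = tree_decomposition_from_ordering(cycle4, [1, 2, 3, 4])
```

Here the bags {3, 4} and {4} are contained in {2, 3, 4} and could be dropped in a post-processing step.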


64.2 Generating Tree Decompositions by Metaheuristic Techniques 


As described in the previous section, the width of the
tree decomposition depends on the elimination ordering
of vertices. Therefore, the task of finding a tree
decomposition with minimal width consists of finding the best
permutation of the graph vertices. This problem is similar
to the traveling salesman problem, but with a different
objective function.

In the last two decades, researchers have proposed
different techniques to find tree decompositions
for different benchmark examples. These include
exact techniques based on tree search and branch
and bound, simple greedy techniques, and metaheuristic
techniques. In this chapter, we focus on metaheuristic
techniques. At the end of this section, we will
also briefly describe other approaches used for tree
decompositions.

The metaheuristic techniques applied to tree
decomposition can be divided into two groups:
population-based/nature-inspired techniques, and local search
techniques. Regarding nature-inspired techniques, the
application of genetic algorithms has been investigated
in [64.19, 20], and ACO has been used in [64.21].
Examples of local search techniques for tree
decompositions are [64.16, 22, 23].


64.2.1 Genetic Algorithms 
for Tree Decomposition 


The application of genetic algorithms to tree
decompositions was first investigated in [64.19]. That
algorithm minimizes a weight associated with the
decompositions of Bayesian networks, which is not
exactly the same as the width of the tree decomposition.
In [64.20], the algorithm was extended to generate
hypertree decompositions and, with some changes
to the fitness function (the width of the tree decomposition
is used as the objective function), was tested on
different problems from the literature. The following
description of the genetic algorithm for tree decomposition
is based on our previous work in [64.20].

Genetic algorithms (GAs) were introduced in
[64.3]. They try to find a good solution for an optimization
problem by imitating the principle of evolution.
Genetic algorithms alter and select individuals from
a population of solutions for the optimization problem.
In the following, we describe frequently used terms
within the field of genetic algorithms:


Population ... a set of candidate solutions.

Individual ... a single candidate solution.

Chromosome ... a set of parameters determining the
properties of a solution.

Gene ... a single parameter.


A genetic algorithm tries to optimize the value of
the objective function of an optimization problem, which
in the context of genetic algorithms is also called the
fitness function. At the beginning, a genetic algorithm
creates an initial population containing randomly or
heuristically created individuals. These individuals are
evaluated and assigned a fitness value, which is the value
of the fitness function for the solution represented by the
individual. The population is evolved over a number of
generations until a halting criterion is satisfied. At each
generation, the population undergoes selection,
recombination (also called crossover), and mutation.

During the selection process, the genetic algorithm
decides which individuals from the current population
are allowed to enter the next population. This decision
is based on the fitness values of the individuals:
individuals of better fitness should enter the next population
with higher probability than individuals of lower fitness.
Individuals that are not selected are discarded and will
not be evolved further.

The recombination process, or crossover, combines
different properties of several parent solutions within
one or more child solutions, also called offspring.
Crossover exchanges properties between the
individuals with the aim of increasing the average quality
of the population.

During the mutation process, individuals are 
slightly altered. Mutation is used to explore new regions 
of the search space and to avoid early convergence to 
local optima. 

In practice, parameters are used in order to control the 
behavior of a genetic algorithm. Typical control param- 
eters are mutation rate, crossover rate, population size, 
and parameters for selection techniques. The choice of 
the control parameters has a crucial effect on the quality 
of the best solution found by a genetic algorithm. 

The genetic algorithm for tree decomposition presented
below is named GA-tw and was implemented
in [64.20]. Algorithm 64.1 presents algorithm GA-tw in
pseudocode notation.

The algorithm takes as input a graph and several 
control parameters. Individual solutions are vertex or- 
derings. Each individual is assigned the width of the 
tree decomposition returned from the corresponding 
vertex ordering as its fitness value. 

Initially GA-tw generates a population consisting 
of randomly created individuals. Tournament selection 
was chosen as the selection technique. Tournament 
selection selects an individual by randomly choosing 
a group of several individuals from the former popula- 
tion. The individual of highest fitness (smallest width) 
within this group is selected to join the next popula- 
tion. This process is applied until enough individuals 
have entered the next population. Finally, after a certain 
number of generations, algorithm GA-tw will return the 
best fitness (smallest width) of an individual found dur- 
ing the search process. 


Crossover and Mutation Operators 
Within the genetic algorithms in [64.20] nearly all types 
of crossover operators and all mutation operators were 
implemented. The same operators were also applied 
in [64.19] for decomposing the moral graph of Bayesian 
networks. 


Algorithm 64.1 Genetic Algorithm for Tree Decompositions (GA-tw)
Input: a graph G = (V, E),
       control parameters for the GA: n, p_m, p_c, s,
       and max_iterations
Output: an upper bound on the treewidth of the graph
t = 0
initialize(population(t), n)
evaluate population(t)
while t < max_iterations do
    t = t + 1
    population(t) = tournament_selection(population(t - 1), s)
    recombine(population(t), p_c)
    mutate(population(t), p_m)
    evaluate population(t)
end while
return the smallest width found during the search
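
Algorithm 64.1 might be realized along the following lines. This is a hedged sketch, not the implementation of [64.20]: the interpretation of n as population size and s as tournament size, the default parameter values, the pairing of consecutive individuals for recombination, and all function names are assumptions. It uses a position-based crossover and an insertion mutation, the two operators described later in this section.

```python
import random

def width_of_ordering(adj, ordering):
    """Width of the decomposition produced by an elimination ordering."""
    g = {v: set(ns) for v, ns in adj.items()}
    width = 0
    for v in ordering:
        nbrs = g.pop(v)
        width = max(width, len(nbrs))      # bag size minus one
        for a in nbrs:
            g[a] |= nbrs - {a}
            g[a].discard(v)
    return width

def pos_crossover(donor, receiver, rng):
    """Keep donor elements at coin-flipped positions; fill the remaining
    positions with the missing elements in the order of the receiver."""
    mask = [rng.random() < 0.5 for _ in donor]
    fixed = {x for x, keep in zip(donor, mask) if keep}
    rest = (x for x in receiver if x not in fixed)
    return [x if keep else next(rest) for x, keep in zip(donor, mask)]

def insertion_mutation(ordering, rng):
    """Move one randomly chosen element to a random position."""
    child = list(ordering)
    x = child.pop(rng.randrange(len(child)))
    child.insert(rng.randrange(len(child) + 1), x)
    return child

def ga_tw(adj, n=20, p_m=0.3, p_c=0.9, s=3, max_iterations=50, seed=0):
    rng = random.Random(seed)
    vertices = list(adj)
    population = [rng.sample(vertices, len(vertices)) for _ in range(n)]
    fitness = [width_of_ordering(adj, ind) for ind in population]
    best = min(fitness)
    for _generation in range(max_iterations):
        # Tournament selection: the smallest width in each random group wins.
        selected = []
        for _slot in range(n):
            group = rng.sample(range(n), s)
            winner = min(group, key=lambda k: fitness[k])
            selected.append(list(population[winner]))
        # Recombine consecutive pairs with probability p_c.
        for k in range(0, n - 1, 2):
            if rng.random() < p_c:
                a, b = selected[k], selected[k + 1]
                selected[k] = pos_crossover(a, b, rng)
                selected[k + 1] = pos_crossover(b, a, rng)
        # Mutate each individual with probability p_m.
        population = [insertion_mutation(ind, rng) if rng.random() < p_m
                      else ind for ind in selected]
        fitness = [width_of_ordering(adj, ind) for ind in population]
        best = min(best, min(fitness))
    return best

# Hypothetical example graphs.
cycle4 = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {3, 1}}
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
upper_bound_cycle = ga_tw(cycle4, max_iterations=5)
upper_bound_star = ga_tw(star, max_iterations=5)
```

The returned value is only an upper bound on the treewidth; its quality depends on the control parameters, as noted above.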


Crossover Operators
• Partially mapped crossover (PMX)
• Cycle crossover (CX)
• Order crossover (OX1)
• Order-based crossover (OX2)
• Position-based crossover (POS)
• Alternating-position crossover (AP).


Mutation Operators
• Displacement mutation operator (DM)
• Exchange mutation operator (EM)
• Insertion mutation operator (ISM)
• Simple-inversion mutation operator (SIM)
• Inversion mutation operator (IVM)
• Scramble mutation operator (SM).


We will describe the crossover and mutation opera- 
tors which returned the best results of algorithm GA-tw 
in more detail. 


Order Crossover (OX1)
The order crossover operator determines a crossover 
area within the parents by randomly selecting two posi- 
tions within the ordering. The elements in the crossover 
area of the first parent are copied to the offspring. Start- 
ing at the end of the crossover area all elements outside 
the area are inserted in the same order in which they 
occur in the second parent. 


Order-Based Crossover (OX2)
The order-based crossover operator selects at random 
several positions in the parent orderings by tossing 
a coin for each position. The elements of the first par- 
ent at these positions are deleted in the second parent. 
Afterward they are reinserted in the order of the second 
parent. 


Position-Based Crossover (POS) 

The position-based crossover operator also starts by
selecting a random set of positions in the parent strings
by tossing a coin for each position. The elements at the
selected positions are exchanged between the parents in
order to create the offspring. The elements missing
after the exchange are reinserted in the order of the second
parent.
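
One common reading of this operator is sketched below. The offspring takes the elements of a donor parent at the selected positions and fills the remaining positions with the missing elements in the order in which they occur in the receiving parent. Which parent supplies the order of the reinserted elements is an assumption of this sketch, as are the function name and the explicitly passed positions (in the GA they result from coin tosses).

```python
def position_based_crossover(receiver, donor, positions):
    """Offspring gets donor elements at `positions`; the remaining
    elements keep the relative order they have in `receiver`."""
    positions = set(positions)
    fixed = {donor[k] for k in positions}
    rest = (x for x in receiver if x not in fixed)
    return [donor[k] if k in positions else next(rest)
            for k in range(len(receiver))]

p1 = [1, 2, 3, 4, 5, 6, 7, 8]
p2 = [2, 4, 6, 8, 7, 5, 3, 1]
selected = {1, 2, 5}                     # zero-based positions
child1 = position_based_crossover(p1, p2, selected)
child2 = position_based_crossover(p2, p1, selected)
```

Calling the function twice with the parent roles swapped yields the two offspring of one exchange.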


Exchange Mutation Operator (EM) 
The exchange mutation operator randomly selects two 
elements in the solution and exchanges them. 


Insertion Mutation Operator (ISM) 
The insertion mutation operator randomly chooses an 
element in the solution and moves it to a randomly se- 
lected position (Fig. 64.5). 
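
Both mutation operators are short enough to state directly. In this sketch the indices are passed explicitly so that the effect is reproducible; in the GA they would be chosen at random. The function names are assumptions.

```python
def exchange_mutation(ordering, a, b):
    """EM: swap the elements at indices a and b."""
    child = list(ordering)
    child[a], child[b] = child[b], child[a]
    return child

def insertion_mutation(ordering, src, dst):
    """ISM: remove the element at index src and reinsert it at index dst."""
    child = list(ordering)
    child.insert(dst, child.pop(src))
    return child

ordering = [1, 2, 3, 4, 5]
swapped = exchange_mutation(ordering, 1, 3)
moved = insertion_mutation(ordering, 1, 3)
```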


The genetic algorithm implemented in [64.19] was
applied to two artificial graphs. This genetic approach
returned competitive results when compared to results
obtained by simulated annealing [64.22]. The algorithm
implemented in [64.20] was evaluated on 62
graphs of the second DIMACS graph coloring
challenge [64.24]. Different experiments were performed to
find the best values for the parameters of the genetic
algorithm, and it turned out that the position-based
crossover operator (POS) and the insertion mutation
operator (ISM) were best suited for finding tree
decompositions of small width. Existing upper bounds
on the treewidth of several DIMACS instances could be
improved.


64.2.2 Ant Colony Optimization 
for Tree Decomposition 


Ant colony optimization (ACO) has been applied
to tree decompositions in [64.21, 25]. The current
section is based on [64.21] and describes different
ant colony optimization variants applied to tree
decomposition.

ACO is a population-based metaheuristic introduced
by Dorigo et al. [64.7, 8]. As the name suggests,
the technique was inspired by the behavior of real
ants. Ant colonies are able to find the shortest path
between their nest and a food source just by depositing
and reacting to pheromones while they are exploring
their environment. The basic principles driving
this system can also be applied to many combinatorial
optimization problems. For a detailed description
of different ACO algorithms and their applications
the reader is referred to the book Ant Colony
Optimization [64.26].

Fig. 64.4 Selected crossover operators for vertex orderings

Fig. 64.5 Selected mutation operators for vertex orderings

The following variants of ACO algorithms for
finding good upper bounds for tree decompositions
were investigated in [64.21, 25]: simple ant system
[64.7, 8], elitist ant system [64.7, 8], rank-based
ant system [64.27], max-min ant system [64.28,
29], and ant colony system [64.30]. Two different
pheromone update strategies were proposed and
two stagnation measures were implemented that
indicate the degree of diversity of the solutions
constructed by the ants. Furthermore, two constructive
heuristics (Min-Degree, Min-Fill) were implemented
and incorporated alternatively into every ACO variant
as a guiding function, and the combination of
ACO with two existing local search methods, Hill
Climbing and Iterated Local Search [64.23], was
investigated.

A simple constraint graph and the corresponding 
ACO construction tree are shown in Fig. 64.6. The con- 
struction tree can be obtained from the constraint graph 
as follows: 


1. Create a root node s that will be the starting point of 
every ant in the colony. 

2. For every vertex of the constraint graph append 
a child node to the root node s. 

3. To every leaf node append a child node for every 
vertex of the constraint graph that is neither repre- 
sented by the leaf node itself nor by an ancestor of 
this node. 

4. Repeat step 3 until there are no nodes left to append. 


All possible elimination orderings for the constraint 
graph can now be represented as a path from the root 
node s to one of the leaf nodes in the construction tree. 
Therefore, each of the ants finds such a path and at 
each node on its way the ant decides where to move 
next probabilistically based on the pheromone trails 
and a heuristic value both associated with the outgoing 
edges. 


Pheromone Trails 
A pheromone trail indicates how favorable it is
to eliminate a certain vertex x after another vertex y.
The more pheromone is located on a trail, the more
likely the corresponding vertex will be chosen by the
ant. One way to represent the pheromone trails of the
construction tree in Fig. 64.6 is the following matrix,

$$
T = \begin{pmatrix}
\tau_{x_1 x_1} & \tau_{x_1 x_2} & \tau_{x_1 x_3} \\
\tau_{x_2 x_1} & \tau_{x_2 x_2} & \tau_{x_2 x_3} \\
\tau_{x_3 x_1} & \tau_{x_3 x_2} & \tau_{x_3 x_3} \\
\tau_{s x_1} & \tau_{s x_2} & \tau_{s x_3}
\end{pmatrix} \qquad (64.1)
$$

In this matrix, each row contains the amounts of
pheromone located on the trails connecting a certain
node with all the other nodes. For example, the first row
contains the pheromone levels related to the node $x_1$,
describing the desirability of eliminating $x_2$ ($\tau_{x_1 x_2}$)
and $x_3$ ($\tau_{x_1 x_3}$), respectively, immediately after $x_1$.
The last row is related to the root node s, which is the
starting point for every ant.

Fig. 64.6a,b Constraint graph G and the ACO construction
tree

All pheromone trails are initialized to the same
value at the beginning of the algorithm. This value is computed
according to the following equation,

$$
\tau_{ij} = \frac{m}{W_\eta}, \quad \forall \tau_{ij} \in T, \tag{64.2}
$$

where $W_\eta$ is the width of the decomposition obtained
using the guiding heuristic (min-degree or min-fill)
while m is the size of the ant colony.
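
As a small illustration of the initialization rule, the sketch below stores the pheromone matrix as a dictionary keyed by (row, column) pairs, with one extra row for the root node s. The function name `init_pheromone` and the dictionary layout are assumptions made for this example only.

```python
def init_pheromone(vertices, colony_size, heuristic_width, root="s"):
    """Set every trail to m / W_eta, including the row of the root node."""
    tau0 = colony_size / heuristic_width
    rows = list(vertices) + [root]
    return {(i, j): tau0 for i in rows for j in vertices}


# Hypothetical numbers: 10 ants, guiding heuristic found width 2.
tau = init_pheromone(["x1", "x2", "x3"], colony_size=10, heuristic_width=2)
print(tau[("s", "x1")])  # 5.0
```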


Heuristic Information 
The ants make their decision about which vertex to
eliminate next not solely based on the pheromone matrix
but also consider a guiding heuristic. Two different
heuristics have been implemented. In order to compute
them, a separate graph in addition to the construction
tree is maintained. This graph is called the elimination
graph because it is obtained from the original constraint
graph by successively eliminating the vertices traversed
by the ant in the construction tree. Further, this graph
is denoted as $E(G, \sigma)$ where G is the original constraint
graph and $\sigma$ is a partial elimination ordering.


Min-Degree. The value for the min-degree heuristic is
computed according to this equation

$$
\eta_{ij} = \frac{1}{d(j, E(G, \sigma)) + 1}. \tag{64.3}
$$

The node i represents the last eliminated node, whereas
j is a node which is not eliminated yet. The expression
$d(j, E(G, \sigma))$ represents the degree of vertex j in
the elimination graph $E(G, \sigma)$.


Min-Fill. The value for the min-fill heuristic is
computed according to this equation

$$
\eta_{ij} = \frac{1}{f(j, E(G, \sigma)) + 1}. \tag{64.4}
$$

The expression $f(j, E(G, \sigma))$ represents the number of
edges that would be added to the elimination graph due
to the elimination of vertex j.
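
The following sketch shows one way to maintain the elimination graph and to evaluate both heuristic values. It assumes an undirected graph without self-loops stored as a dictionary of neighbour sets, and it assumes that both denominators carry a +1 term so that a degree or fill-in of zero does not cause a division by zero. All function names are hypothetical.

```python
def eliminate(graph, v):
    """Return the elimination graph after removing v and joining its neighbours."""
    nbrs = graph[v]
    new = {u: set(adj) - {v} for u, adj in graph.items() if u != v}
    for a in nbrs:
        new[a] |= nbrs - {a}  # the neighbours of v become a clique
    return new


def fill_in(graph, v):
    """Number of edges that eliminating v would add."""
    nbrs = list(graph[v])
    return sum(1 for k, a in enumerate(nbrs)
               for b in nbrs[k + 1:] if b not in graph[a])


def eta_min_degree(graph, j):
    return 1.0 / (len(graph[j]) + 1)  # assumed +1 in the denominator


def eta_min_fill(graph, j):
    return 1.0 / (fill_in(graph, j) + 1)  # assumed +1 in the denominator


path = {1: {2}, 2: {1, 3}, 3: {2}}  # path graph 1 - 2 - 3
print(eta_min_fill(path, 2))  # 0.5, eliminating 2 adds the edge {1, 3}
print(eliminate(path, 2))     # {1: {3}, 3: {1}}
```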


Probabilistic Vertex Elimination 

In the following it is shown how exactly the ants move
from node to node on the construction tree. All of the
ACO variants with the exception of ant colony system
use (64.5) alone to compute the probability $p_{ij}$ of moving
from a node i to another node j, where $\alpha$ and $\beta$
are parameters that can be passed to the algorithm in
order to weight the pheromone trails and the heuristic
values

$$
p_{ij} = \frac{[\tau_{ij}]^{\alpha}\,[\eta_{ij}]^{\beta}}
{\sum_{l \in E(G,\sigma)} [\tau_{il}]^{\alpha}\,[\eta_{il}]^{\beta}}
\quad \text{if } j \in E(G,\sigma). \tag{64.5}
$$


This probability is computed for each vertex left in the 
elimination graph. According to these probabilities, the 
ant decides which vertex to eliminate next. 
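
A minimal sketch of this random proportional rule is given below. It assumes that the pheromone values are stored in a dictionary keyed by (i, j) and that the heuristic values of the remaining vertices are stored in a dictionary keyed by j; the names `transition_probabilities` and `choose_next` and the default parameter values are illustrative only.

```python
import random


def transition_probabilities(i, candidates, tau, eta, alpha=1.0, beta=2.0):
    """Probability of moving from i to each vertex left in the elimination graph."""
    weights = {j: tau[(i, j)] ** alpha * eta[j] ** beta for j in candidates}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}


def choose_next(probabilities, rng=random):
    """Draw the next vertex according to the computed probabilities."""
    vertices = list(probabilities)
    return rng.choices(vertices, weights=[probabilities[v] for v in vertices], k=1)[0]


tau = {("s", "x1"): 1.0, ("s", "x2"): 1.0}
eta = {"x1": 0.5, "x2": 0.25}
probs = transition_probabilities("s", ["x1", "x2"], tau, eta, alpha=1.0, beta=1.0)
next_vertex = choose_next(probs, random.Random(0))
```

With equal pheromone levels, the vertex with the larger heuristic value (here x1) receives two thirds of the probability mass.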

Ant colony system introduces an additional parameter
$q_0$ that constitutes the probability that the ant
makes a greedy move instead of making a probabilistic
decision

$$
j = \begin{cases}
\arg\max_{l \in E(G,\sigma)} \{[\tau_{il}]^{\alpha}\,[\eta_{il}]^{\beta}\}, & \text{if } q \le q_0; \\
\text{(64.5)}, & \text{otherwise}.
\end{cases} \tag{64.6}
$$

If a randomly generated number q in the interval
[0, 1] is less than or equal to $q_0$, then the ant moves to the node
that otherwise would have the highest probability to be
chosen. Ties are broken randomly.
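
The greedy-or-probabilistic decision of ant colony system can be sketched as follows. The data layout (pheromone dictionary keyed by (i, j), heuristic dictionary keyed by j) and the function name `acs_choose` are assumptions made for this example.

```python
import random


def acs_choose(i, candidates, tau, eta, q0, alpha=1.0, beta=2.0, rng=random):
    """Pick the next vertex: greedy with probability q0, else proportional."""
    weights = {j: tau[(i, j)] ** alpha * eta[j] ** beta for j in candidates}
    if rng.random() <= q0:
        best = max(weights.values())
        # Ties between equally attractive vertices are broken randomly.
        return rng.choice([j for j, w in weights.items() if w == best])
    vertices = list(weights)
    return rng.choices(vertices, weights=[weights[j] for j in vertices], k=1)[0]


tau = {("s", "a"): 1.0, ("s", "b"): 1.0}
eta = {"a": 0.5, "b": 0.25}
greedy_pick = acs_choose("s", ["a", "b"], tau, eta, q0=1.0)
print(greedy_pick)  # a
```

Setting `q0=1.0` makes every move greedy, while small values of `q0` make the rule behave almost like the purely probabilistic rule of the other variants.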

Ant colony system also introduces a so-called local
pheromone update. After an ant has constructed its
solution, it removes pheromone from the trails belonging
to its solution according to the following equation,
where $\xi$ is a variant-specific parameter and $\tau_0$ is the
initial amount of pheromone

$$
\tau_{ij} \leftarrow (1 - \xi)\,\tau_{ij} + \xi\,\tau_0. \tag{64.7}
$$
The motivation is to diversify the search so that subse- 
quent ants will more likely choose other branches of the 
construction tree. 
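
A minimal sketch of the local pheromone update is shown below, assuming the pheromone matrix is a dictionary keyed by (i, j) and the root node is labelled "s". The function name `local_update` is hypothetical.

```python
def local_update(tau, ordering, xi, tau0, root="s"):
    """Pull the trails used by one ant back towards the initial value tau0."""
    previous = root
    for vertex in ordering:
        tau[(previous, vertex)] = (1.0 - xi) * tau[(previous, vertex)] + xi * tau0
        previous = vertex


tau = {("s", "a"): 2.0, ("a", "b"): 2.0, ("s", "b"): 2.0}
local_update(tau, ["a", "b"], xi=0.5, tau0=1.0)
print(tau[("s", "a")], tau[("s", "b")])  # 1.5 2.0
```

Only the trails on the ant's own path change; the trail from s to b, which the ant did not use, keeps its value.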


Pheromone Update 

After each of the ants has constructed an elimination
ordering (optionally improved by a local search
afterwards), the values in the pheromone
matrix are updated to reflect the quality of the constructed
solutions. This enables the ants
in the following iteration to make decisions in
a more informed manner. Moreover, pheromone is
gradually removed from the pheromone trails so that
solutions that might have been the best known solutions
in earlier iterations of the algorithm can be
forgotten.


Pheromone Deposition 
In this step, for an elimination ordering $\sigma_k$ that was
constructed by an ant k, the amount of pheromone that
will be deposited for each (i, j) in $\sigma_k$ is determined.
An edge-independent and an edge-specific pheromone
update strategy were considered. The first adds the
same amount of pheromone to all trails belonging to $\sigma_k$,
while the latter adds more or less pheromone to
individual trails depending on the quality of a certain
elimination.

The edge-independent pheromone update strategy
adds the reciprocal value of the tree decomposition's
width to all pheromone trails that are part of $\sigma_k$

$$
\Delta\tau_{ij}^{k} = \begin{cases}
\dfrac{1}{W(\sigma_k)}, & \text{if } (i,j) \text{ belongs to } \sigma_k; \\
0, & \text{otherwise}.
\end{cases} \tag{64.8}
$$


In contrast to the edge-independent update strategy,
the edge-specific update strategy deposits different
amounts of pheromone onto the trails belonging to the
same elimination ordering

$$
\Delta\tau_{ij}^{k} = \begin{cases}
\dfrac{1}{d(j, E(G, \sigma_{kj})) / |E(G, \sigma_{kj})|} \cdot \dfrac{1}{W(\sigma_k)}, & \text{if } (i,j) \text{ belongs to } \sigma_k; \\
0, & \text{otherwise}.
\end{cases} \tag{64.9}
$$

This amount depends on the ratio between the degree
of the vertex j when it was eliminated, $d(j, E(G, \sigma_{kj}))$,
and the number of vertices left in the elimination graph,
$|E(G, \sigma_{kj})|$, at that time ($\sigma_{kj}$ is the partial elimination
ordering that is obtained from $\sigma_k$ by omitting j and all
vertices that are eliminated after j).
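
Both deposition strategies can be sketched together. The example assumes that the width of an ordering is the largest degree of a vertex at the time it is eliminated. The text does not say how a degree of zero is treated in the edge-specific rule (the last vertex always has degree zero), so this sketch simply falls back to the edge-independent amount in that case; that fallback, the guard against a width of zero, and the function name `pheromone_deposit` are our own assumptions.

```python
def pheromone_deposit(graph, ordering, edge_specific=False, root="s"):
    """Return {(i, j): amount} for the trails used by one elimination ordering."""
    g = {u: set(adj) for u, adj in graph.items()}
    steps, width, previous = [], 0, root
    for j in ordering:
        nbrs = g.pop(j)
        # Record degree of j and number of vertices left (j included).
        steps.append((previous, j, len(nbrs), len(g) + 1))
        width = max(width, len(nbrs))
        for a in nbrs:
            g[a].discard(j)
            g[a] |= nbrs - {a}
        previous = j
    base = 1.0 / max(width, 1)  # assumed guard for graphs without edges
    delta = {}
    for i, j, degree, left in steps:
        if edge_specific and degree > 0:
            delta[(i, j)] = (left / degree) * base  # 1 / (degree / left) * 1 / W
        else:
            delta[(i, j)] = base  # assumed fallback when the degree is zero
    return delta


path = {1: {2}, 2: {1, 3}, 3: {2}}
print(pheromone_deposit(path, [1, 2, 3]))
print(pheromone_deposit(path, [1, 2, 3], edge_specific=True))
```

For the path graph the width is 1, so the edge-independent rule deposits 1.0 on every used trail, while the edge-specific rule rewards eliminations of low-degree vertices made while many vertices are still present.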

The selection of ants that deposit pheromone and
the weighting of this pheromone vary between the
different ACO variants. The reader is referred to [64.26]
for a description of these variants.


Pheromone Evaporation 
After the pheromone has been added to the trails, a certain
amount of pheromone is removed. This amount is
determined based on the pheromone evaporation rate $\rho$

$$
\tau_{ij} = (1 - \rho)\,\tau_{ij}, \quad \forall \tau_{ij} \in T. \tag{64.10}
$$

Ant colony system only removes pheromone from the
trails belonging to the best known elimination ordering
$\sigma_{bs}$

$$
\tau_{ij} = (1 - \rho)\,\tau_{ij}, \quad \forall (i, j) \in \sigma_{bs}. \tag{64.11}
$$
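
The two evaporation rules differ only in the set of trails they touch, which the following sketch makes explicit. The dictionary layout and the function name `evaporate` are assumptions for illustration.

```python
def evaporate(tau, rho, ordering=None, root="s"):
    """Evaporate all trails, or only those of `ordering` (ant colony system)."""
    if ordering is None:
        trails = list(tau)
    else:
        trails = list(zip([root] + list(ordering[:-1]), ordering))
    for trail in trails:
        tau[trail] *= (1.0 - rho)


tau = {("s", "a"): 2.0, ("a", "b"): 2.0, ("s", "b"): 2.0}
evaporate(tau, rho=0.5)                        # every trail
evaporate(tau, rho=0.5, ordering=["a", "b"])   # best-so-far ordering only
print(tau)  # {('s', 'a'): 0.5, ('a', 'b'): 0.5, ('s', 'b'): 1.0}
```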

Hybridization with Local Search 
All ACO variants were extended with two local search 
methods for tree decompositions. Both of these algo- 
rithms try to improve the quality of the solutions that 
were constructed by the ant colony by changing the po- 
sitions of certain vertices in the elimination orderings. 
Two local search techniques were used: a hill climbing
algorithm and an iterated local search similar to the
algorithm proposed in [64.23].


Stagnation Measures 
If the distribution of the pheromone on the trails be- 
comes too unbalanced due to the pheromone depo- 
sitions, the ants will generate very similar solutions 
causing the search to stagnate. In order to enable
the algorithm to detect such situations, two stagnation
measures proposed by Dorigo and Stützle [64.26]
were implemented (the variation coefficient
and the $\lambda$ branching factor); they indicate how explorative
the search behavior of the ants is. A detailed description
of stagnation measures is given in [64.25, page 67].

All ACO variants described in [64.21] were evaluated
experimentally with DIMACS Graph Coloring
Challenge instances. MAX-MIN ant system and ant
colony system performed slightly better than the other
variants. Although ant colony optimization in general
could not compete with iterated local search and
genetic algorithms, it could improve the upper bound
for one of the problems.


64.2.3 Iterated Local Search 
for Tree Decomposition 


The application of iterated local search for generating 
tree decompositions has been investigated in [64.23, 
31]. In this section, we give the description of this al- 
gorithm based on these references. 

The algorithm is based on the iterated local search
framework. It includes a simple local search heuristic
to generate good orderings, and an iterative process
in which the algorithm calls a local search technique
with the initial solution produced in the previous
iteration. The algorithm also includes a mechanism for
acceptance of a candidate solution for the next iteration.
Although the construction phase is very important,
choosing the appropriate perturbation at each iteration
as well as the mechanism for acceptance of a solution are
also crucial to obtain good results with an iterated local
search algorithm. The iterated local search algorithm
for tree decomposition is presented below.


Algorithm 64.2 Iterative Heuristic Algorithm (IHA)
  Generate initial solution S1
  BestSolution = S1
  while Termination Criteria is not fulfilled do
    S2 = ConstructionPhase(S1)
    if Solution S2 fulfils the acceptance criteria then
      S1 = S2
    else
      S1 = BestSolution
    end if
    Apply perturbation in solution S1
    Update BestSolution if solution S2 has better
      (or equal) width than the current best solution
  end while
  return BestSolution


The algorithm starts with an initial solution which 
takes an order of nodes as they appear in the input. 
Better initial solutions can also be constructed by us- 
ing other heuristics which run in polynomial time, such 
as maximum cardinality search, min-fill heuristic, etc. 
However, because the proposed method usually reaches
solutions of the quality produced by these heuristics in a
very short time, the algorithm simply starts with the
ordering of nodes given in the input.

After constructing the initial solution the iterative 
phase starts. In this phase, the local search method is 
called iteratively, and then the selected solution is per- 
turbed. Two different local search techniques that can 
be used in the construction phase were proposed. The 
solution returned from the construction phase is ac- 
cepted for the next iteration if it fulfils the specific 
criteria determined by the solution acceptance mech- 
anism. Experiments with different possibilities for the 
acceptance of the solution returned from the construc- 
tion phase were performed. If the solution does not fulfil 
the acceptance criteria this solution is discarded and the 
currently best solution is selected. In the selected so- 
lution, the perturbation mechanism is applied. Different 
possibilities are used for perturbation. The perturbed so- 
lution is given as an input solution in the next call of the 
construction phase. This process continues until the ter- 
mination criterion is fulfilled. 
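
The control flow of the iterative heuristic algorithm can be sketched as a higher-order function. The sketch below follows the order of steps in Algorithm 64.2 (construct, accept, perturb, update best), but the local search and perturbation used in the toy run are deliberately trivial placeholders of our own, not the techniques described in this section; all names are hypothetical.

```python
import random


def ordering_width(graph, ordering):
    """Width of the tree decomposition induced by an elimination ordering."""
    g = {u: set(adj) for u, adj in graph.items()}
    width = 0
    for v in ordering:
        nbrs = g.pop(v)
        width = max(width, len(nbrs))
        for a in nbrs:
            g[a].discard(v)
            g[a] |= nbrs - {a}
    return width


def iterated_local_search(initial, local_search, perturb, accept, width, iterations=50):
    s1, best = list(initial), list(initial)
    for _ in range(iterations):
        s2 = local_search(s1)                      # construction phase
        s1 = s2 if accept(width(s2), width(best)) else best
        s1 = perturb(list(s1))                     # perturbation
        if width(s2) <= width(best):               # better or equal width
            best = list(s2)
    return best


# Toy run on a star graph with centre 0 (placeholder components).
rng = random.Random(1)
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
width = lambda order: ordering_width(star, order)


def random_swap(order):
    cand = list(order)
    a, b = rng.sample(range(len(cand)), 2)
    cand[a], cand[b] = cand[b], cand[a]
    return cand


def keep_if_not_worse(order):
    cand = random_swap(order)
    return cand if width(cand) <= width(order) else list(order)


best = iterated_local_search([0, 1, 2, 3], keep_if_not_worse, random_swap,
                             lambda w, best_w: w < best_w, width)
```

Passing the acceptance rule, perturbation, and local search as functions makes it easy to experiment with the different variants discussed in the remainder of this section.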

Two local search methods were proposed for gener- 
ating a good solution which is used as an initial solution 
with some perturbation in the next call of the same local 
search algorithm. Both techniques are based on the idea 
of moving only vertices in the ordering which cause the 
largest clique during the elimination process. The mo- 
tivation for using this method is to reduce the number 
of solutions that should be evaluated. The first proposed
technique, named LS1, is presented below.


Algorithm 64.3 Local Search Algorithm 1 - LS1 (InputSolution)
  BestLSSolution = InputSolution
  NrNotImprovments = 0
  while NrNotImprovments < MAXNotImprovments do
    In the current solution (InputSolution) select a vertex
      in the elimination ordering which causes the
      largest clique when eliminated; ties are broken
      randomly if there are several vertices which cause
      a clique equal in size to the largest clique
    Swap this vertex with another vertex located in
      a randomly chosen position
    if the current solution is better than BestLSSolution then
      BestLSSolution = InputSolution
      NrNotImprovments = 0
    else
      NrNotImprovments = NrNotImprovments + 1
    end if
  end while
  return BestLSSolution


The proposed algorithm applies a simple heuris- 
tic. In the current solution a vertex is chosen randomly 
among the vertices that produce the largest clique in the 
elimination process. Then the selected vertex is moved 
from its position. Two types of moves were used. In the 
first variant, the vertex is inserted in a random position 
in the elimination ordering, while in the second vari- 
ant the vertex is swapped with another vertex located 
in a randomly selected position, i.e., the two chosen 
vertices change their position in the elimination order- 
ing. The swap move was shown to give better results. 
The heuristic stops if the solution does not improve for 
a certain number of iterations. Experiments with different
values of MAXNotImprovments were performed. LS1 alone is
a simple heuristic and usually cannot produce good re- 
sults for tree decompositions. However, by using this 
heuristic as a local search heuristic in the iterated local 
search algorithm good results for tree decompositions 
are obtained. 
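
A compact Python sketch of LS1 with the swap move is given below. It assumes that the size of the clique caused by a vertex is measured by its degree at elimination time, and the function and parameter names (`ls1`, `max_not_improved`) are our own.

```python
import random


def elimination_degrees(graph, ordering):
    """Degree of each vertex at the moment it is eliminated."""
    g = {u: set(adj) for u, adj in graph.items()}
    degrees = []
    for v in ordering:
        nbrs = g.pop(v)
        degrees.append(len(nbrs))
        for a in nbrs:
            g[a].discard(v)
            g[a] |= nbrs - {a}
    return degrees


def ls1(graph, ordering, max_not_improved=20, rng=None):
    rng = rng or random.Random(0)
    current = list(ordering)
    best, best_width = list(current), max(elimination_degrees(graph, current))
    not_improved = 0
    while not_improved < max_not_improved:
        degrees = elimination_degrees(graph, current)
        worst = max(degrees)
        # Pick one of the vertices causing the largest clique at random.
        pos = rng.choice([p for p, d in enumerate(degrees) if d == worst])
        other = rng.randrange(len(current))
        current[pos], current[other] = current[other], current[pos]
        width = max(elimination_degrees(graph, current))
        if width < best_width:
            best, best_width, not_improved = list(current), width, 0
        else:
            not_improved += 1
    return best, best_width


star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
order, width = ls1(star, [0, 1, 2, 3])
```

As in Algorithm 64.3, the search continues from the swapped ordering even when the swap did not improve the width; only the best ordering seen so far is returned.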

The second proposed heuristic (LS2) is similar to
LS1 but differs in how it explores the neighborhood. In
some iterations of LS2, the neighborhood of the current
solution consists of a single solution, which is generated by
swapping a vertex that causes the largest clique in the
elimination ordering with another vertex located in a
randomly chosen position. This neighborhood is used
in a particular iteration with probability p. Experiments
with different values for parameter p were performed.
With probability 1 - p, the other type of neighborhood
is explored. In this case the neighborhood of the current
solution consists of all solutions which can be obtained
by swapping a vertex that causes the largest
clique in the elimination ordering with one of its neighbors.
The best solution from the generated neighborhood is
selected for the next iteration of LS2. Note that
in this technique the number of solutions that have to
be evaluated is much larger than in LS1. In particular,
in the first phase of the search the node which causes the
largest clique usually has many neighbors, and
the number of solutions to be evaluated when the second
type of neighborhood is used is equal to the size
of the largest clique produced during the elimination
process.


Perturbation 

During the perturbation phase the solution obtained
by the local search procedure is perturbed and the newly
obtained solution is used as an initial solution for
the new call of the local search technique. The main
idea is to avoid a random restart. Instead of a random
restart, the solution is perturbed with larger
moves than those applied in the local search technique.
This enables some diversification that helps to escape
from the local optimum, but avoids beginning from
scratch (as in the case of a random restart), which is very
time consuming. Three perturbation mechanisms were
proposed:


• RandPert: N vertices are chosen randomly and they
  are moved into new random positions in the ordering.

• MaxCliquePer: All nodes that produce the maximal
  clique in the elimination ordering are inserted
  in new randomly chosen positions in the ordering.

• DestroyPartPert: All nodes between two positions
  (selected randomly) in the ordering are inserted
  in new randomly chosen positions in the
  ordering.


The perturbation RandPert just perturbs the solution
with a larger random move and would amount to a kind
of random restart if N is very large. Keeping N
smaller avoids restarting from a completely new solution,
and the perturbed solution does not differ much
from the previous solution. MaxCliquePer concentrates
on moving only vertices which produce the maximal
clique in the elimination ordering. The basic idea
for this perturbation is to apply a technique similar
to the min-conflict heuristic, by moving only the vertices
that cause large treewidth. DestroyPartPert is similar
to RandPert, except that the selected nodes to be
moved are located near each other in the elimination
ordering.
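
The three perturbation mechanisms share the same basic move, namely removing a set of vertices and reinserting each at a random position. The sketch below is one possible reading of the descriptions above; it assumes that the vertices producing the maximal clique are those with the largest degree at elimination time, and all function names are hypothetical.

```python
import random


def elimination_degrees(graph, ordering):
    """Degree of each vertex at the moment it is eliminated."""
    g = {u: set(adj) for u, adj in graph.items()}
    degrees = []
    for v in ordering:
        nbrs = g.pop(v)
        degrees.append(len(nbrs))
        for a in nbrs:
            g[a].discard(v)
            g[a] |= nbrs - {a}
    return degrees


def reinsert(ordering, moved, rng):
    """Remove `moved` from the ordering and reinsert each at a random position."""
    rest = [v for v in ordering if v not in moved]
    for v in moved:
        rest.insert(rng.randrange(len(rest) + 1), v)
    return rest


def rand_pert(ordering, n, rng):
    return reinsert(ordering, rng.sample(list(ordering), min(n, len(ordering))), rng)


def max_clique_pert(graph, ordering, rng):
    degrees = elimination_degrees(graph, ordering)
    worst = max(degrees)
    return reinsert(ordering, [v for v, d in zip(ordering, degrees) if d == worst], rng)


def destroy_part_pert(ordering, rng):
    a, b = sorted(rng.sample(range(len(ordering)), 2))
    return reinsert(ordering, list(ordering[a:b + 1]), rng)


rng = random.Random(0)
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
start = [0, 1, 2, 3]
results = [rand_pert(start, 2, rng), max_clique_pert(star, start, rng),
           destroy_part_pert(start, rng)]
```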

Determining the number of nodes N that will be
moved is complex and may depend on the problem.
To avoid this difficulty, an adaptive perturbation
mechanism was proposed that takes into consideration
the feedback from the search process. The number of
nodes N varies from 2 to some number y (determined
experimentally), and the algorithm begins with a small
perturbation (N = 2). If during the iterative process
(for a determined number of iterations) the local search
technique produces solutions with the same tree width in
more than 20% of cases, the size of the perturbation is
increased by 1; otherwise N is decreased
by 1. This enables an automatic change of the perturbation
size based on the repetition of solutions with the same
width.
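
The adaptive rule leaves some room for interpretation. The sketch below assumes that the widths of the solutions returned by the local search are collected over a window of iterations and that "the same tree width for more than 20% of cases" refers to the share of the most frequent width in that window. This reading, the upper bound of 10, and the function name `adapt_size` are assumptions.

```python
def adapt_size(n, widths, n_min=2, n_max=10, threshold=0.2):
    """Return the perturbation size for the next window of iterations."""
    most_common = max(widths.count(w) for w in set(widths))
    if most_common / len(widths) > threshold:
        return min(n + 1, n_max)   # search keeps returning the same width
    return max(n - 1, n_min)       # search is diverse enough


print(adapt_size(2, [5, 5, 5, 6, 7]))     # 3
print(adapt_size(3, [1, 2, 3, 4, 5, 6]))  # 2
```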

The combination of two perturbations was also
considered. The mixed perturbation applies two perturbations:
RandPert and MaxCliquePer. The algorithm starts with
RandPert, and alternates between the two
perturbations whenever the solution is not improved for a
determined number of iterations. Experiments with different
perturbation sizes for each type of perturbation
were performed.


Acceptance Criterion 
Different techniques can be applied for accepting the
solution obtained by the local search technique. The
following variants for acceptance of a solution for the next
iteration were used:

• The solution returned from the construction phase is
  accepted only if it has a better width than the best
  current existing solution.

• The solution returned from the construction phase is
  always accepted.

• The solution is accepted if its treewidth is smaller than
  the treewidth of the best yet found solution plus x,
  where x is an integer.
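
The three acceptance variants listed above can be written as interchangeable predicates on the width of the new solution and the width of the best solution found so far. The function names are illustrative only.

```python
def accept_if_better(width, best_width):
    """Variant 1: accept only a strict improvement of the best width."""
    return width < best_width


def accept_always(width, best_width):
    """Variant 2: accept every solution from the construction phase."""
    return True


def accept_within(x):
    """Variant 3: accept if the width is smaller than the best width plus x."""
    return lambda width, best_width: width < best_width + x


tolerant = accept_within(3)
print(accept_if_better(12, 10), accept_always(12, 10), tolerant(12, 10))
# False True True
```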


The first variant for accepting a solution is very 
restrictive. In this variant, the solution from the con- 
struction phase is accepted only if it improves the best 
existing solution. Otherwise, the best existing solution 
is perturbed and it is used as input solution for next call 
of the construction phase. In the second variant, the it- 
erated local search applies the perturbation in a solution 
returned from the construction phase, independently 
from the quality of produced solution. The third vari- 
ant is between the first and the second variant, and in 
this case the solution which does not improve the best 
existing solution can be accepted for the next iteration, 
if its width is smaller than the best found width plus 
some bound.


64.2.4 Other Techniques
for Tree Decomposition 


This section gives a short overview on other approaches 
applied for tree decomposition. Examples of com- 
plete algorithms for tree decompositions are [64.32- 
34]. Gogate and Dechter [64.33] reported good results 
for tree decompositions by using branch and bound 
algorithms. They showed that their algorithm is supe- 
rior compared to the algorithm proposed in [64.32]. 
The branch and bound algorithm proposed in [64.33] 
applies different pruning techniques, and provides any- 
time solutions, which are good upper bounds for tree 
decompositions. The algorithm proposed in [64.34] in- 
cludes several other pruning and reduction rules and is 
successful on small graphs. The complete techniques 
described earlier have exponential running time in the 
worst case and can only be used to find the optimal 
width for not too large graphs. 

To generate good upper bounds (which can be suf- 
ficient for many applications) for treewidth several 
greedy heuristic techniques that run in polynomial time 
have been proposed. These heuristics select the order- 
ing of nodes step by step based on different criteria, 
such as the degree of the nodes, the number of edges to 
be added to make the node simplicial, etc. The most popular
techniques are maximum cardinality search (MCS), the
min-fill heuristic, and the minimum degree heuristic.

MCS [64.35] initially selects a random vertex of the 
graph to be the first vertex in the elimination ordering 
(the elimination ordering is constructed from right to 
left). The next vertex will be picked such that it has 
the highest connectivity with the vertices previously 
selected in the elimination ordering. Ties are broken 
randomly. MCS repeats this process iteratively until all 
vertices are selected. 
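
A minimal sketch of maximum cardinality search follows. It assumes a graph stored as a dictionary of neighbour sets with sortable vertex labels; the function name `mcs_ordering` and the seeded random generator are our own choices for reproducibility.

```python
import random


def mcs_ordering(graph, rng=None):
    """Elimination ordering by maximum cardinality search (built right to left)."""
    rng = rng or random.Random(0)
    selected, remaining = [], set(graph)
    while remaining:
        chosen = set(selected)
        # Connectivity = number of already selected neighbours.
        score = {v: len(graph[v] & chosen) for v in remaining}
        best = max(score.values())
        pick = rng.choice(sorted(v for v in remaining if score[v] == best))
        selected.append(pick)
        remaining.remove(pick)
    return selected[::-1]  # first selected vertex ends up rightmost


path = {1: {2}, 2: {1, 3}, 3: {2}}
order = mcs_ordering(path)
```

In the first round all scores are zero, so the first vertex is picked at random, as described above.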

The min-fill heuristic first picks the vertex which 
adds the smallest number of edges when eliminated 
(ties are broken randomly). The selected vertex is made 
simplicial (a vertex of a graph is simplicial if its neigh- 
bors form a clique) and it is eliminated from the graph. 
The next vertex in the ordering will be any vertex that 
adds the minimum number of edges when eliminated 
from the graph. This process is repeated iteratively un- 
til the whole elimination ordering is constructed. 

The minimum degree heuristic picks first the vertex 
with the minimum degree. The selected vertex is made 
simplicial and it is removed from the graph. Further, 
the vertex that has the minimum number of unselected 
neighbors will be chosen as the next node in the elimi- 
nation ordering. This process is repeated iteratively. 
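
The min-fill and minimum degree heuristics differ only in the cost used to pick the next vertex, so both fit into one greedy loop. This is a sketch under our own naming (`greedy_ordering`, `criterion`); it also returns the width of the resulting ordering, taken as the largest degree of a vertex at elimination time.

```python
import random


def greedy_ordering(graph, criterion="min-fill", rng=None):
    """Greedy elimination ordering; returns (ordering, width)."""
    rng = rng or random.Random(0)
    g = {u: set(adj) for u, adj in graph.items()}
    order, width = [], 0

    def cost(v):
        nbrs = list(g[v])
        if criterion == "min-degree":
            return len(nbrs)
        # min-fill: edges needed to make v simplicial
        return sum(1 for k, a in enumerate(nbrs)
                   for b in nbrs[k + 1:] if b not in g[a])

    while g:
        costs = {v: cost(v) for v in g}
        best = min(costs.values())
        v = rng.choice(sorted(u for u in g if costs[u] == best))  # random tie-break
        nbrs = g.pop(v)
        width = max(width, len(nbrs))
        for a in nbrs:  # make v simplicial, then remove it
            g[a].discard(v)
            g[a] |= nbrs - {a}
        order.append(v)
    return order, width


star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
fill_order, fill_width = greedy_ordering(star, "min-fill")
degree_order, degree_width = greedy_ordering(star, "min-degree")
print(fill_width, degree_width)  # 1 1
```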


MCS, min-fill, and min-degree heuristics run in 
polynomial time and usually produce tree decompo- 
sitions in a reasonable amount of time. According 
to [64.33], the min-fill heuristic performs better than 
MCS and min-degree heuristic. Although these heuris- 
tics sometimes give good upper bounds for tree decom- 
positions, more advanced techniques usually provide 
better upper bounds for most problems. Min-degree 
heuristic has been improved by Clautiaux et al. [64.16] 
by adding a new criterion based on the lower bound 
of the treewidth for the graph obtained when the node 
is eliminated. Recently, Kask et al. [64.36] proposed 
an iterative greedy variable ordering algorithm to im- 
prove the greedy heuristics given earlier. We refer 
to [64.15,37] for a survey of different upper bounds 
algorithms. 


64.2.5 Comparison of Algorithms 
for Tree Decomposition 


In this section, we compare results obtained with the
metaheuristic approaches described in this chapter with those
of other existing methods in the literature. The results of these
methods for 62 DIMACS vertex coloring instances are
given. These instances have been used for testing several
methods for tree decompositions proposed in the
literature. The compared methods were executed
on different computers, so we give here only results
regarding the width of the tree decomposition. The reader
is referred to [64.15, 16, 20, 23, 25, 33] for
information about the computers used and the time needed to
generate solutions.

In Tables 64.1 and 64.2, the results for DIMACS
graph coloring instances are presented. The first and
second columns of the tables present the instances and
the number of nodes and edges for each instance.
Column KBH shows the best results obtained
by the algorithms in [64.15]. The TabuS column presents
the results reported in [64.16], and column BB shows
the results obtained with the branch and bound
algorithm proposed in [64.33]. Finally, columns GA,
IHA, and ACO represent, respectively, results
obtained with a genetic algorithm [64.20], iterated local
search [64.23], and ant colony optimization [64.21,
25].
Based on the results given in Tables 64.1 and 64.2
we conclude that, regarding the width of the tree
decomposition, the metaheuristic techniques described
in this chapter give very good results and, for many
instances, the best existing upper bounds for the
treewidth.


Table 64.1 Algorithms comparison regarding treewidth for DIMACS graph coloring instances

Instance      |V|/|E|     KBH  TabuS   BB   GA  IHA  ACO
anna          138/986      12     12   12   12   12   12
david         87/812       13     13   13   13   13   13
huck          74/602       10     10   10   10   10   10
homer         561/3258     31     31   31   31   31   30
jean          80/508        9      9    9    9    9    9
games120      120/638       ?     33    -   32   32   37
queen5_5      25/160       18     18   18   18   18   18
queen6_6      36/290       26     25   25   26   25   25
queen7_7      49/476       35     35   35   35   35   35
queen8_8      64/728       46     46   46   45   45   46
queen9_9      81/1056      59     58   59   58   58   59
queen10_10    100/1470     73     72   72    ?   72    ?
queen11_11    121/1980     89     88   89   87   87   89
queen12_12    144/2596    106    104  110  104  103  109
queen13_13    169/3328    125    122  125  121  121  128
queen14_14    196/4186    145    141  143  141  140  150
queen15_15    225/5180    167    163  167  162  162  174
queen16_16    256/6320    191    186  205  186  186  201
fpsol2.i.1    269/11654    66     66   66   66   66   66
fpsol2.i.2    363/8691     31     31   31   32   31   31
fpsol2.i.3    363/8688     31     31   31   31   31   31
inithx.i.1    519/18707    56     56   56   56   56   56
inithx.i.2    558/13979    35     35   31   35   35   31
inithx.i.3    559/13969    35     35   31   35   35   31
miles1000     128/3216     49     49   49   50   49   50
miles1500     128/5198     77     77   77   77   77   77
miles250      125/387       9      9    9   10    9    9
miles500      128/1170      ?      ?    ?   24   22   25
miles750      128/2113     37     36   37    ?   36   38
mulsol.i.1    138/3925     50     50   50   50   50   50
mulsol.i.2    173/3885     32     32   32   32   32   32
mulsol.i.3    174/3916     32     32   32   32   32   32
mulsol.i.4    175/3946     32     32   32   32   32   32
mulsol.i.5    176/3973     31     31   31   31   31   31
myciel3       11/20         5      5    5    5    5    5
myciel4       23/71        11     10   10   10   10   10
myciel5       47/236       20     19   19   19   19   19
myciel6       95/755       35     35   35   35   35   35
myciel7       191/2360     74     66   54   66   66   66


Table 64.2 Algorithms comparison regarding treewidth for DIMACS graph coloring instances

Instance      |V|/|E|     KBH  TabuS   BB   GA  IHA  ACO
school1       385/19095   244    188    -  185  178  228
school1_nsh   352/14612   192    162    -  157  152  185
zeroin.i.1    126/4100     50     50    -   50   50   50
zeroin.i.2    157/3541     33     32    -   32   32   33
zeroin.i.3    157/3540     33     32    -   32   32   33
le450_5a      450/5714    310    256  307  243  244  304
le450_5b      450/5734      ?    254  309  248  246  308
le450_5c      450/9803    340    272  315  265  266  309
le450_5d      450/9757    326    278  303  265  265  290
le450_15a     450/8168    296    272    -  265  262  288
le450_15b     450/8169    296    270  289  265  258  292
le450_15c     450/16680   376    359  372  351  350  368
le450_15d     450/16750   375    360  371  353  355  371
le450_25a     450/8260    255    234  255  225  216  249
le450_25b     450/8263    251    233  251  227  219  245
le450_25c     450/17343   355    327  349  320  322  346
le450_25d     450/17425   356    336  349  327  328  355
dsjc125.1     125/736      67     65   64   61   60   63
dsjc125.5     125/3891    110    109  109  109  108  108
dsjc125.9     125/6961    119    119  119  119  119  119
dsjc250.1     250/3218    179    173  176  169  167  174
dsjc250.5     250/15668   233    232  231  230  229  231
dsjc250.9     250/27897   243    243  243  243  243  243


64.2.6 Application of Tree Decomposition
in Metaheuristic Techniques


Traditionally, tree decompositions have been used to
solve constraint satisfaction problems exactly by
dynamic programming algorithms. Recently, researchers
have been investigating the incorporation of tree
decomposition within metaheuristic techniques. The
work in this direction is just in the starting phase and, to
the best of our knowledge, only two papers have so far
investigated the application of tree decomposition in
metaheuristic search.

In [64.38] tree-decomposition-based heuristics have
been developed for the two-dimensional bin packing
problem with conflicts. The aim is to find a conflict-free
packing of given items by using a minimal number
of bins. Tree decomposition is applied to decompose
a problem instance into subproblems which can be
solved independently. First a tree decomposition is
obtained, and then each item is assigned to a specific
cluster (this phase is called cluster separation). Then
these clusters are considered as subproblems which are
solved iteratively. Finally, the partial solutions from
subproblems are merged to obtain solutions for the
whole problem.

Another application of tree decomposition is
the approach introduced by Fontaine et al. [64.39],
where tree decomposition is used to guide the
exploration of the search space. The authors propose a method
called decomposition guided VNS (variable neighborhood
search) that exploits the graph of clusters to build
neighborhood structures. By using clusters, better
intensification and diversification are achieved. For example,
the moves are favored in regions that are closely linked
and the search is diversified by selecting new clusters
and therefore exploring new regions of the search space.


64.3 Conclusion


Several metaheuristic approaches based on nature-inspired
strategies and local search have been used
successfully in the literature for generating tree
decompositions. Among these approaches, genetic algorithms and
iterated local search-based algorithms provide the best
upper bounds for many benchmark instances.

Although metaheuristic techniques currently provide
state-of-the-art upper bounds for most problems,
the runtime of such algorithms for large graphs is
still high. Greedy heuristic approaches generate slightly
worse upper bounds, but are more efficient. Therefore,
developing more efficient metaheuristics for tree
decompositions is still a challenging task. Moreover, for
many problems the treewidth is still not known, and the
question is whether the current metaheuristics can still be
improved to find new upper bounds for such problems. To
obtain better upper bounds, it would be interesting to
investigate some other approaches such as memetic
algorithms, large neighborhood search, and other hybrid
techniques. Furthermore, the iterative improvement of
the initial generated tree decomposition (based on
vertex ordering) is an interesting question.

Finally, in some applications, the treewidth is not
the only important parameter for solving problems
based on tree decompositions efficiently. Therefore,
the development of metaheuristics for generating tree
decompositions which optimize other features of tree
decomposition would be of interest in the future.


65. Evolutionary Computation


Jano I. van Hemert 


In this chapter we will focus on the combination 
of evolutionary computation (EC) techniques and 
constraint satisfaction problems (CSPs). Constraint 
programming (CP) is another approach to deal 
with constraint satisfaction problems. In fact, it is 
an important prelude to the work covered here as 
it advocates itself as an alternative approach to 
programming [65.1]. The first step is to formulate 
a problem as a CSP such that techniques from 
CP, EC, combinations of the two, often referred to 
as hybrids [65.2,3], or other approaches can be 
deployed to solve the problem. The formulation
of a problem has an impact on its complexity in
terms of the effort required to either find a solution or
prove that no solution exists. It is, therefore, vital
to spend time on getting this right.

CP defines search as iterative steps over a search
tree where nodes are partial solutions to the problem,
in which not all variables are assigned values.
The search then maintains a partial solution that
satisfies all constraints on the variables with assigned
values. In contrast, EC algorithms sample a space of
candidate solutions where, for each sample point, all
variables are assigned values. None of these candidate
solutions will satisfy all constraints in the problem
until a solution is found. Such algorithms are often
classified as Davis-Putnam-Logemann-Loveland
(DPLL) algorithms, after the first backtracking
algorithm for solving CSP [65.4].

Another major difference is that many constraint
solvers from CP are sound, whereas EC
solvers are not. A solver is sound if it always finds
a solution if it exists. Furthermore, most constraint
solvers from CP can easily be made complete,
although this is often not a desired property for
a constraint solver. A constraint solver is complete
if it can find every solution to a problem.


65.1 Informal Introduction to CSP


For a formal definition please skip to the next section.
A constraint satisfaction problem consists of a set
of variables and each variable must be assigned one
value from its finite set of values, called its domain.
A set of constraints restricts certain simultaneous
assignments. In most CSPs, the objective is to search
for a simultaneous assignment of all the variables such
that all constraints are satisfied, i.e., no forbidden
simultaneous assignment from the set of constraints is
used.

A famous example is the SEND MORE MONEY
puzzle, where each letter must be replaced by a unique
digit such that the following sum holds [65.5]

$$\mathrm{SEND} + \mathrm{MORE} = \mathrm{MONEY} .$$

In this CSP, the variables are $S, E, N, D, M, O, R, Y$
and the domains are $\{1,\ldots,9\}$ for $S, M$ and $\{0,\ldots,9\}$
for $E, N, D, O, R, Y$. The constraint can also be written
as $1000 \times S + 100 \times E + 10 \times N + D + 1000 \times M +
100 \times O + 10 \times R + E = 10000 \times M + 1000 \times O + 100 \times
N + 10 \times E + Y$. Every CSP A can be rewritten into
another CSP B where a bijective mapping exists
between the solutions of A and B, which follows
from the reducibility theorem from complexity theory [65.6].
The solution to this CSP is the assignment
$S = 9, E = 5, N = 6, D = 7, M = 1, O = 0, R = 8, Y = 2$,
which uniquely satisfies the constraint.

Other very well-known constraint satisfaction problems
are map coloring, more commonly known as
vertex coloring (Sect. 65.5.2), and the recreational game
Sudoku, which is equivalent to completing a graph
9-coloring problem on a given specific graph with
81 vertices. A specific EC solution is provided by
Lewis [65.7]. Quite a lot of constraint satisfaction
problems exist; we will first look at CSP in general
within the context of EC as problem solvers. Then
we will discuss several specific constraint satisfaction
problems and the particular EC approaches applied to
these problems. Last, we will provide a brief overview
on using EC for generating problem instances for
CSP.


65.2 Formal Definitions


Slightly different, but equivalent, formal definitions of
CSP exist. The most common definition is:


Definition 65.1 (Constraint Satisfaction Problem)
A CSP is a triple $\langle V, D, C \rangle$:

• $V$ is an n-tuple of variables $V = (v_1, v_2, \ldots, v_n)$,

• Each $v \in V$ has a corresponding m-tuple of values
called its domain, $D_v = (d_1, d_2, \ldots, d_m)$, of which
it can be assigned one, and

• $C = (c_1, \ldots, c_t)$ is a t-tuple of constraints where
each $c \in C$ restricts certain simultaneous variable
assignments from occurring.


The definition of a constraint is often reversed in the
literature where generic CSP is discussed, in that constraints
are defined as the set of assignments that are
allowed rather than forbidden. Note that in generic CSP
literature, variables are often denoted with X, whereas in
graph-oriented problem domains such as graph coloring
and maximum clique, V is adopted.


Definition 65.2 (Solution to a CSP)
A solution is an assignment of variables $(d_1, \ldots, d_n) \in D_1 \times \cdots \times
D_n$ such that for every constraint $c \in C$ on $x_{i_1}, \ldots, x_{i_m}$:

$$(d_{i_1}, \ldots, d_{i_m}) \in c .$$

In the context of one constraint c, we say an assignment
of variables satisfies the constraint c if the
assignment is in c, or violates the constraint c if the
assignment is not in c. A CSP can be insoluble,
more commonly written as insolvable, which means
every assignment of variables will violate at least one
constraint.
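To make the notions of satisfying and violating a constraint concrete, the following Python sketch enumerates candidate assignments for the SEND MORE MONEY puzzle of Sect. 65.1 and keeps those that do not violate the sum constraint. It is a minimal illustration only: the function names `violates` and `solve` are ours, and exhaustive enumeration is used for clarity rather than as a recommended solving technique.

```python
from itertools import permutations

LETTERS = "SENDMORY"


def violates(s, e, n, d, m, o, r, y):
    """Return True if the assignment violates SEND + MORE = MONEY."""
    send = 1000 * s + 100 * e + 10 * n + d
    more = 1000 * m + 100 * o + 10 * r + e
    money = 10000 * m + 1000 * o + 100 * n + 10 * e + y
    return send + more != money


def solve():
    """Enumerate all assignments of distinct digits and keep the solutions."""
    solutions = []
    for digits in permutations(range(10), len(LETTERS)):
        s, m = digits[0], digits[4]
        if s == 0 or m == 0:          # domains of S and M are {1,...,9}
            continue
        if not violates(*digits):
            solutions.append(dict(zip(LETTERS, digits)))
    return solutions


if __name__ == "__main__":
    print(solve())
```

Using permutations of distinct digits encodes the requirement that each letter receives a unique digit, so only the sum constraint needs to be checked explicitly.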

A constraint solver is an algorithm that takes as input
a CSP and produces as output either a solution,
a proof that no solution exists, or a notification of failure.
The input is often referred to as a problem instance, as
a CSP is often defined to cover a class of problems such
as 3-satisfiability. The output can be more than one solution;
in fact it could be every solution. However, as
EC techniques are based on sampling, in principle they
cannot prove that every solution has been found, which
is referred to as not complete. Moreover, they cannot
prove that no solution exists, which is referred to as not
sound. Therefore, constraint solvers based on EC and
other heuristic approaches often terminate after a certain
criterion is met, e.g., a predefined limit is
reached on the number of candidate solutions evaluated
or the computation time spent, or a certain convergence of
the population is reached.

We recommend the following books for further 
reading on constraint satisfaction. For the foundations 
of the problem and basic algorithms, Tsang [65.8]; for 
an introduction with comprehensive overview of con- 
straint programming techniques, Dechter [65.9] and 
Lecoutre [65.10]; and for a more theoretical approach 
Apt [65.1] and Chen [65.11]. 


65.3 Solving CSP with Evolutionary Algorithms


In this chapter we will restrict ourselves to covering
the conceptual mapping required to solve a CSP with 
an evolutionary algorithm. This mapping will con- 
sist of choosing a representation for the problem and 
a corresponding fitness function to determine the qual- 
ity of a solution. Once this mapping is complete, 
the evolutionary algorithm will require other compo- 
nents, such as appropriate variation operators, selection 
mechanisms, and a suitable initialization method for 
the population and termination criteria. All these, and 
other optional variants can be found elsewhere in the 
handbook. 

We will explain the two most common mappings 
using the well-known n-queens on an n x n-chessboard 
problem. These mappings are direct encoding and in- 
direct encoding. First we introduce a conceptual defini- 
tion of the problem. 

The n-queens problem requires the placing of n
queens on an $n \times n$ chessboard such that no queen attacks
any of the other $n - 1$ queens. Thus, a solution
requires that no two queens share the same row, column,
or diagonal. Several common formal definitions
of the problem exist. The most common is to define n
variables $\{q_1, \ldots, q_n\}$, where each variable $q_i$ has a domain
that consists of the row position the queen will
be placed on in its corresponding unique column, i.e.,
$q_i \in \{1, \ldots, n\}\ \forall i = 1, \ldots, n$. The set of constraints consists
of $q_i \neq q_j$ (i.e., not in the same row) and $|q_i - q_j| \neq
|i - j|\ \forall i, j = 1, \ldots, n$ with $i \neq j$ (i.e., not on the same diagonal).


The n-queens problem is no longer considered 
a challenging problem as it has a structure that can be 
exploited to solve very large problems of over 9 million 
queens by repeating a pattern [65.12]. It is, however, an 
excellent problem for explaining characteristics of con- 
straint satisfaction problems and their solvers due to the 
simple 2-D spatial nature of the problem. For instance,
to explain symmetry in CSP, the 8-queens problem can
be used: of its 92 distinct solutions, only 12 remain unique
once variants due to rotational and reflection symmetry
are removed, as shown in Fig. 65.1.
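The two counts for the 8-queens problem can be reproduced with a short enumeration. The Python sketch below is illustrative only; the helper names (`solutions`, `symmetries`, `count_unique`) are ours. It generates all solutions and then groups them under the eight symmetries of the square (four rotations, each with and without a reflection).

```python
from itertools import permutations


def solutions(n):
    """All n-queens solutions as tuples q, with q[i] the row of the queen in column i."""
    found = []
    for q in permutations(range(n)):          # a permutation rules out shared rows
        if all(abs(q[i] - q[j]) != j - i
               for i in range(n) for j in range(i + 1, n)):
            found.append(q)
    return found


def symmetries(q):
    """The eight boards obtained from q by rotations and reflections."""
    n = len(q)
    cells = [(col, q[col]) for col in range(n)]
    variants = []
    for _ in range(4):
        cells = [(r, n - 1 - c) for c, r in cells]        # rotate by 90 degrees
        mirrored = [(n - 1 - c, r) for c, r in cells]     # reflect left-right
        for board in (cells, mirrored):
            row_of = dict(board)
            variants.append(tuple(row_of[c] for c in range(n)))
    return variants


def count_unique(n):
    """Return (number of distinct solutions, number unique under symmetry)."""
    all_solutions = solutions(n)
    canonical = {min(symmetries(q)) for q in all_solutions}
    return len(all_solutions), len(canonical)


if __name__ == "__main__":
    print(count_unique(8))
```

Each solution is represented by the lexicographically smallest of its eight variants, so two solutions fall in the same class exactly when they share that representative.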


65.3.1 Direct Encoding 


With a direct encoding the genotype consists of a vector
where each element corresponds uniquely to one
variable of the CSP; an element $g_i$ contains a value taken
directly from the domain $D_i$ of its corresponding variable.
A wide variety of genetic operators for both mutation
and recombination are applicable to this encoding and
can be found in [65.13]. Most of these operators are
known as discrete or mixed-integer operators.

The genotype is mapped to the phenotype by taking 
into consideration the constraints; it requires a measure- 
ment for determining the quality of candidate solutions. 
Thus, we need to introduce a fitness function. The most 
common fitness function takes the sum of all constraints 
violated by a candidate solution 


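$$\mathrm{fitness}(\vec{g}) = \sum_{c \in C} \mathrm{violated}(c) , \quad \text{where } \mathrm{violated}(c) = \begin{cases} 1 & \text{if } c \text{ is violated by } \vec{g} \\ 0 & \text{if } c \text{ is satisfied by } \vec{g} \end{cases}$$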


The fitness should be minimized and once it reaches 
zero, a solution has been found. 
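As an illustration of this fitness function under direct encoding, the following Python sketch counts the violated constraints of an n-queens genotype. It assumes the formulation of Sect. 65.3 with rows numbered from 0, and treats each row constraint and each diagonal constraint on a pair of columns as one constraint; the function names are ours.

```python
import random


def fitness(g):
    """Number of violated constraints; g[i] is the row of the queen in column i."""
    n = len(g)
    violated = 0
    for i in range(n):
        for j in range(i + 1, n):
            if g[i] == g[j]:                   # row constraint violated
                violated += 1
            if abs(g[i] - g[j]) == j - i:      # diagonal constraint violated
                violated += 1
    return violated


def random_genotype(n, rng):
    """Each gene takes a value directly from the domain {0, ..., n-1}."""
    return [rng.randrange(n) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(1)
    g = random_genotype(8, rng)
    print(g, fitness(g))
```

A genotype with fitness zero, such as `[1, 3, 0, 2]` for n = 4, is a solution.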


65.3.2 Indirect Encoding 


With an indirect encoding the genotype first needs to be
transformed into a full or partial assignment of the variables
of the CSP. This transformation is also referred to as local
search. Depending on the level of sophistication, these
transformations range from as simple as a greedy assignment all
the way to sound search algorithms evaluating a small
part of the CSP.

The most common approach for this representa- 
tion takes as a genotype the permutation of variables 
of the CSP. Many genetic operators are designed to 
maintain a permutation and several are explained in the 
Handbook of Evolutionary Computation [65.13]. The 
permutation is the input to the local search and de- 
termines the order in which variables are processed; 
processing a variable involves trying to assign a value 
such that no constraint is violated and perhaps further 
steps if no value can be assigned without violating at 
least one constraint. 

More advanced encodings may also include the or- 
dering in which to consider values from each variable’s 
domain. From constraint programming we know that 
the order in which variables and values are considered 
has a huge impact on the efficiency of search algo- 
rithms [65.14]; more often it is the search method that 
determines the order using a particular heuristic, such
as choosing the next vertex with the maximum satura- 
tion degree, as is used in DSatur [65.15]. The saturation 
degree for a vertex is defined as the total number of 
colors used for coloring its neighbors. The principle
has been used in many algorithms since its introduction
in 1979.

Fig. 65.1 The 12 unique solutions under symmetry via rotations and reflections for the 8-queens problem

The most common fitness function used with indi- 
rect encoding simply counts the number of unassigned 
variables after the local search terminates. Note that 
two different strategies will influence the resolution of 
this function. If the local search terminates after it first 
encounters a variable it cannot assign, then many candi- 
date solutions will have the same fitness but can still be 
very different. On the other hand, terminating after all 
variables have been considered will give a richer land- 
scape to consider but may incur more computational 
effort. See [65.16] for a comprehensive theoretical and 
empirical analysis of sampling in EC. 
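The indirect encoding and both termination strategies can be sketched for the n-queens problem as follows. This is a minimal illustration under our own assumptions: the genotype is a permutation of the columns, the decoder is a simple greedy assignment of the lowest non-conflicting row, and the names `decode` and `fitness` are hypothetical rather than taken from the cited work.

```python
def conflicts(col_a, row_a, col_b, row_b):
    """True if two queens attack each other via a row or a diagonal."""
    return row_a == row_b or abs(row_a - row_b) == abs(col_a - col_b)


def decode(order, n, stop_at_first_failure=False):
    """Greedy decoder: process columns in the given order and assign to each
    the lowest row that conflicts with none of the queens placed so far."""
    assignment = {}
    for col in order:
        for row in range(n):
            if not any(conflicts(col, row, c, r) for c, r in assignment.items()):
                assignment[col] = row
                break
        else:                                   # no row could be assigned
            if stop_at_first_failure:
                break
    return assignment


def fitness(order, n, stop_at_first_failure=False):
    """Number of unassigned variables after decoding; zero means a solution."""
    return n - len(decode(order, n, stop_at_first_failure))


if __name__ == "__main__":
    print(fitness([0, 1, 2, 3], 4), fitness([0, 1, 2, 3], 4, True))
    print(decode([1, 3, 0, 2], 4))
```

For the order `[0, 1, 2, 3]` on a 4 x 4 board the decoder fails on the third column: continuing past the failure leaves one variable unassigned, whereas stopping at the first failure leaves two, which illustrates the difference in resolution between the two strategies.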


65.3.3 General Techniques to Improve Performance


Over the past two decades, many techniques were de- 
veloped to improve the efficiency and/or the effective- 
ness of EC for solving constraint satisfaction problems. 


Only a handful of these techniques were evaluated on 
more than one problem. Hence, we cannot draw any 
general conclusions about the success of these tech- 
niques. Even worse is that many studies will show 
improvement only compared to their previous results 
or compare their results with an algorithm that has 
already been superseded in terms of performance by 
many other techniques. Often the set of competitor al- 
gorithms is chosen to fall within EC, which severely 
limits the strength of the competition. Therefore, we 
will discuss techniques for improving performance in 
the context of the problems they were developed for. 
Section 65.5 reviews several popular CSPs used for 
developing more efficient and effective evolutionary 
algorithms. 

One approach that has been applied to several CSPs 
with varying success is that of assigning weights to 
constraints to allow biasing the search towards satis- 
fying certain constraints; in the first experiments this 
approach was referred to as penalty functions [65.17]. 
Moreover, the search can be influenced dynamically 
by adapting weights according to heuristics, such as
increasing the weight of the constraint that has been sat- 
isfied the least number of times recently [65.18]. The 
origin of this idea can be found in the self-adaptation 
used in evolution strategies [65.19]. 

With penalty functions, the optimization objectives
replacing the constraints are traditionally viewed as
penalties for constraint violation, hence to be minimized [65.20].
There are two basic types of penalties:


1. Penalty for violated constraints
2. Penalty for wrongly instantiated variables.


Formally, let us assume that we have constraints
$c_i$ ($i \in \{1, \ldots, m\}$) and variables $v_j$ ($j \in \{1, \ldots, n\}$).
Let $C^j$ be the set of constraints involving variable $v_j$.
Then the penalties relative to the two options described
above can be expressed as follows:


1. $f_1(s) = \sum_{i=1}^{m} w_i \times \chi(s, c_i)$, where

$$\chi(s, c_i) = \begin{cases} 1 & \text{if } s \text{ violates } c_i \\ 0 & \text{otherwise} \end{cases}$$

2. $f_2(s) = \sum_{j=1}^{n} w_j \times \chi(s, C^j)$, where

$$\chi(s, C^j) = \begin{cases} 1 & \text{if } s \text{ violates at least one } c \in C^j \\ 0 & \text{otherwise,} \end{cases}$$

where the $w_i$ and $w_j$ are weights that correspond to
a constraint and a variable, respectively. These will
be important later on; for now we assume all these
weights equal to 1.

Obviously, for each of the above functions $f \in
\{f_1, f_2\}$ and for each $s \in S$ we have that $\phi(s) = \mathit{true}$
if and only if $f(s) = 0$. For instance, in the graph 3-coloring
problem the vertices of a given graph $G =
(V, E)$, $E \subseteq V \times V$, have to be colored by three colors
in such a way that no neighboring vertices, i.e., graph
nodes connected by an edge, have the same color. This
problem can be formalized by means of a CSP with $n =
|V|$ variables, each with the same domain $D = \{1, 2, 3\}$.
Furthermore, we have $m = |E|$ constraints, one for each
edge $e = (k, l) \in E$, with $c_e(s) = \mathit{true}$ if and only if $s_k \neq
s_l$. Then the corresponding CSP is $\langle S, \phi \rangle$, where $S =
D^n$ and $\phi(s) = \bigwedge_{e \in E} c_e(s)$. Using the constraint-oriented
penalty function $f_1$ with $w_i = 1$ for all $i \in \{1, \ldots, m\}$
we count the incorrect edges that connect two vertices
with the same color. The variable-oriented penalty function
$f_2$ with $w_j = 1$ for all $j \in \{1, \ldots, n\}$ amounts to
counting the incorrect vertices that have a neighbor with
the same color.

Advantages of indirect encoding:


• Generality, e.g., $f_1, f_2$ are problem-independent
penalty functions

• Reduces problem to simple optimization

• Allows user preferences by weights.


Disadvantages of indirect encoding:


• Loss of information by packing everything into a single
number

• In the case of constrained optimization (as opposed
to CSP as we are handling here) $f_1, f_2$ are reported
to be weak [65.21].


65.4 Performance Indicators


An understanding of the efficiency and effectiveness is
vital when choosing which solver to use or when developing
an algorithm to deal with a specific CSP. In
this section we briefly explain measures for determining
these properties in the context of solving CSP. However,
these properties must be measured using a suite
of benchmark instances and, as EAs are generally randomized
algorithms, with multiple independent runs of
the algorithm on each instance. Choosing an appropriate
suite of benchmark instances is paramount to
making decisions on which algorithm, parameter setting,
or next algorithmic feature to add.


In a sense, the search for a good algorithm is in 
itself an optimization problem. The suite of benchmark
instances only represents the problem, just like
training data in a machine learning problem only represents
all data possibly encountered. Changing an algorithm
and tuning its parameters on the same small suite of 
instances could lead to over-fitting [65.22,23], which 
in turn means the algorithm will have a poorer per- 
formance in the general case. Therefore, the first step 
should be to characterize the problem well and have 
a good representation, e.g., spread, of the instances pos- 
sibly encountered when deployed. 


65.4.1 Efficiency


The time taken by an algorithm to provide a solution is 
an important factor. Even more so in situations where 
solutions are required in real time. Much research is 
devoted to speeding up algorithms, either by cleverly 
exploiting properties of the problem, by parallelization, 
or by balancing aspects of the quality of the solution. 

The most common approach to measuring the ef- 
ficiency of evolutionary algorithms is by counting the 
number of evaluations, i.e., the number of times the 
fitness function is executed. This approach has several 
drawbacks. First, the approach allows comparison only 
with algorithms that use the exact same fitness func- 
tion and spend the most significant part of their time 
on computing that function. Second, the computational 
complexity of the evolutionary algorithm may not be 
dependent on the fitness function. For instance, with the 
indirect encoding described in Sect. 65.3.2, much com- 
putational effort will go into the local search, whereas 
the computation of the fitness is trivial. 

Another common approach is to measure time spent 
as reported by the operating system. This has even 
more drawbacks as the reported numbers will depend 
on the computer programming language used for im- 
plementing the algorithm, the compiler and its setting 
for translating the implementation into machine code, 
the architecture of the computer for executing the ma- 
chine code, and the operating system for hosting the 
execution environment. Variations of these will have an
effect on the reported results and, moreover, as these
environments themselves change over time, future stud- 
ies will find it hard to reproduce results accurately 
or even create meaningful comparisons to reported 
results. 

A more meaningful solution is to count all the
atomic operations that are directly related to the problem.
The operations that must be included should be
those that in theory increase exponentially in number
with larger problems, as CSPs fall under the class of
nondeterministic polynomial-time (NP) problems. The most common
operation will be a conflict check; this is also referred
to as a constraint check, but in the strictest sense, a constraint
check consists of multiple conflict checks [65.8].
For example, when solving the n-queens problem, every
time the algorithm checks $q_i \neq q_j$ for any $q_i$ and $q_j$,
this should be recorded as one check. The same procedure
should be followed for the constraint concerning
diagonal attacks $|q_i - q_j| \neq |i - j|$. The sum of all checks
when the algorithm terminates is the computational effort
spent.

By reporting the number of conflict checks we ensure
that future studies can compare with current results, as
this measurement will not be affected by future changes 
in hardware and software environments. We are mea- 
suring a property of the algorithm here as opposed to 
a property of one implementation of the algorithm run- 
ning in one particular environment. 
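One way to record conflict checks is to route every atomic check through a counter. The sketch below is a minimal illustration for the n-queens problem; the class `CheckCounter` and the function `count_violations` are hypothetical names introduced here, not part of any cited implementation.

```python
class CheckCounter:
    """Counts atomic conflict checks, independent of hardware and language."""

    def __init__(self):
        self.checks = 0

    def same_row(self, q, i, j):
        self.checks += 1                       # one check of q_i != q_j
        return q[i] == q[j]

    def same_diagonal(self, q, i, j):
        self.checks += 1                       # one check of |q_i - q_j| != |i - j|
        return abs(q[i] - q[j]) == abs(i - j)


def count_violations(q, counter):
    """Evaluate a candidate solution while recording every conflict check."""
    violated = 0
    for i in range(len(q)):
        for j in range(i + 1, len(q)):
            violated += counter.same_row(q, i, j)
            violated += counter.same_diagonal(q, i, j)
    return violated


if __name__ == "__main__":
    counter = CheckCounter()
    print(count_violations([1, 3, 0, 2], counter), counter.checks)
```

Sharing one counter across all evaluations of a run yields the total computational effort at termination. Evaluating a single 4-queens candidate in this way costs 12 checks: six pairs of columns, each checked for a row and a diagonal conflict.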

It is important to note that there are subtle differ- 
ences in the reporting used in different studies. Some 
studies report the average number of operations over all 
independent runs, including runs that are unsuccessful, 
i. e., where no solution was found. Other studies report 
the average number of operations to a solution, where 
only the runs that yield a solution are taken into ac- 
count. The former method will produce higher averages 
than the latter if the success rate is less than 1. 


65.4.2 Effectiveness 


Efficiency is only one aspect by which to measure the
success of a constraint solver. The other most important
aspect is that of effectiveness, which measures how
successful an algorithm is in finding or approximating
a solution. The easiest and most commonly used measurement
is that of the success rate, which is defined
for an experiment as the number of runs in which an
algorithm finds a solution divided by the total number
of runs of the same algorithm in that experiment. As
no prior knowledge is required about whether problem
instances are insolvable, this measurement is straightforward
to implement.
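The success rate and the two ways of averaging effort described in Sect. 65.4.1 can be computed from a list of run records. The following Python sketch is illustrative; the record layout (a pair of a solved flag and a number of conflict checks) and the function name `summarize` are our own assumptions, and the numbers in the usage example are made up for demonstration.

```python
def summarize(runs):
    """runs: list of (solved, checks) pairs, one per independent run.

    Returns the success rate, the average number of checks over all runs,
    and the average number of checks over successful runs only
    (None if no run was successful).
    """
    total = len(runs)
    successful = [checks for solved, checks in runs if solved]
    success_rate = len(successful) / total
    average_all = sum(checks for _, checks in runs) / total
    average_to_solution = (sum(successful) / len(successful)
                           if successful else None)
    return success_rate, average_all, average_to_solution


if __name__ == "__main__":
    # Hypothetical records: two successful runs, two that hit the check limit.
    runs = [(True, 100), (True, 300), (False, 1000), (False, 1000)]
    print(summarize(runs))
```

In this made-up example, unsuccessful runs exhaust their budget, so the average over all runs exceeds the average to a solution; reporting which of the two is used avoids ambiguity.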

Another popular measurement in combinatorial op- 
timization is distance to the optimal solution. This 
measurement poses two challenges in the context of 
constraint satisfaction. Unlike a combinatorial opti- 
mization problem, which has the function to optimize, 
a CSP has no such function. As an alternative we could 
use the fitness function, but that is not an inherent 
property of the problem. Also, we often do not know
whether a CSP has a solution and, when it does not,
we do not know the optimal fitness value. Distance
to the optimal solution is rarely used when solving CSP
due to these impracticalities.


65.5 Specific Constraint Satisfaction Problems


Many specific constraint satisfaction problems have 
been addressed in the literature. A full overview 
of these would not provide much benefit, as the 
most likely scenario is that one is looking for pa- 
pers that provide descriptions of algorithms and re- 
sults with those algorithms on a certain problem. 
The exceptions to this are several problems that in 
the literature are used to drive the development of 
algorithms in terms of efficiency and effectiveness. 
These core problems are used over and over to test 
whether new algorithms are better than existing algo- 
rithms. 

Several reasons exist for the choice of these prob- 
lems. Their compact definition means that the problem 
is easy to replicate by everyone and quick to introduce 
in papers. The most popular problems were used in the
1970s when the theory on nondeterministic polynomial-time (NP)
problems was developed, and they were consequently
seen as important intelligent building blocks. Also, test
sets and later problem generators were released in the 
public domain, thereby providing easy access to test 
suites. 

We will use several of these core problems to de- 
scribe the progress of development in evolutionary 
computation for constraint satisfaction problems. For 
each problem we will provide a quick introduction, 
a justification of its importance in terms of practical ap- 
plications, and a set of pointers to problem suites before 
describing the approaches used. 


65.5.1 Boolean Satisfiability Problem 


Given a Boolean formula $\phi$, determine whether an assignment
of the variables in $\phi$ exists that makes it
TRUE. The problem is often referred to as satisfiability and
abbreviated to SAT [65.24]. In SAT, a variable or its negation is
referred to as a literal. Most often the problem is studied
in conjunctive normal form (CNF) where $\phi$ is a conjunction
of clauses where each clause is a disjunction
of literals. Every SAT problem can be reduced to
a 3-CNF-SAT (three variables/clause-conjunctive normal
form-satisfiability) problem [65.25], where each clause has
three literals.

3-CNF-SAT was the first problem to be shown to 
be NP-complete [65.26]. It serves as an important basis 
to proving that other problems are NP-complete, such 
as the maximal clique problem. Such a proof involves 
a polynomial-time reduction from 3-CNF-SAT to the 
other problem [65.6]. 


The following is an example of 3-CNF-SAT: 


• $\phi = (x_1 \vee \neg x_3 \vee x_4) \wedge (x_2 \vee x_1 \vee \neg x_6) \wedge (x_3 \vee x_2 \vee \neg x_5)$

• A solution: $x_1 = 1$, $x_2 = 0$, $x_3 = 1$, $x_4 = 0$, $x_5 = 0$, $x_6 = 0$.


Important practical applications of SAT are model 
checking [65.27], for example, in mathematical proof 
planning [65.28], generic planning problems, espe- 
cially using the planning domain definition language 
(PDDL) [65.29], test pattern generation [65.30], and 
haplotyping in the scientific field of bioinformat- 
ics [65.31]. 

As far as the development of efficient and effective
CSP solvers goes, SAT is the most active field. It has
an annual conference, The International Conference
on Theory and Applications of Satisfiability Testing,
which also hosts an annual competition to determine
the current best solvers. The latter also ensures that new
problem instances are continuously added, which prevents
what is called overfitting [65.32] of the solvers to
an existing set of problem instances.

The general approach to solve satisfiability with EC
is to directly represent the variables in $\phi$ and assign
these either TRUE or FALSE, i.e., these form the domain.
The fitness function used is the number of clauses
violated, which should be minimized.
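For the 3-CNF-SAT example given earlier in this section, the fitness function can be sketched as follows. The representation of clauses as lists of signed integers (a positive number for a variable, a negative number for its negation) is an assumption made here for compactness, and the function name `violated_clauses` is ours.

```python
# Example formula: (x1 or not x3 or x4) and (x2 or x1 or not x6) and (x3 or x2 or not x5)
FORMULA = [[1, -3, 4], [2, 1, -6], [3, 2, -5]]


def violated_clauses(formula, assignment):
    """Fitness to be minimized: the number of clauses not satisfied.

    assignment maps a variable index to True or False.
    """
    def literal_is_true(literal):
        value = assignment[abs(literal)]
        return value if literal > 0 else not value

    return sum(1 for clause in formula
               if not any(literal_is_true(lit) for lit in clause))


if __name__ == "__main__":
    solution = {1: True, 2: False, 3: True, 4: False, 5: False, 6: False}
    print(violated_clauses(FORMULA, solution))
```

The assignment listed in the example yields a fitness of zero, whereas an assignment such as x3 = x5 = x6 = TRUE with all other variables FALSE violates the first two clauses.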

The earliest evolutionary algorithm for SAT was re- 
ported in 1994 by [65.33] and was soon followed by 
the work of Gottlieb and Voss [65.34,35], who were 
looking to improve its performance. Soon after, inde- 
pendent efforts led to parallelized algorithms [65.36, 
37]. In 2000, the first adaptive evolutionary algorithms 
were applied [65.38], which was 3 years after they were 
applied to graph coloring (Sect. 65.5.2). 

The introduction of hybrid evolutionary algorithms 
with local search created a real boost of research ac- 
tivity [65.39-43]. However, a major issue remains with 
research on solving satisfiability with EC, as all studies 
include only local search and evolutionary algorithms 
without comparing to the state-of-the-art DPLL and heuristic
solvers from the annual competition of the satisfiability community.
This holds true even for recent studies such as [65.44]. 
Due to this major gap between the two communities of 
EC and CP, we do not comment on the comparison in 
terms of effectiveness and efficiency. 

New research [65.45] focuses on using EC to
evolve parameter settings for existing sound SAT
solvers, mostly ones based on the Davis-Putnam-Logemann-Loveland
algorithm [65.46]. All modern
SAT solvers have many parameters to tune how the 
search is organized. These parameters are often tuned 
manually, which allows for only a small exploration. 
Using EC, a much larger space can be explored in order 
to create fast SAT solvers for a given benchmark. 


65.5.2 Graph Coloring 


Graph coloring has several variants. The most commonly
used definition is that of graph k-coloring, also
known as the vertex coloring problem. Given a graph of
vertices and edges $(V, E)$ the goal is to find a coloring
of the vertices $V$ of the graph such that no two adjacent
vertices have the same color. If $c(v)$ provides
the color assigned to $v$, then $\forall v, w \in V : c(v) \neq c(w)$ if
$(v, w) \in E$. The objective is to make use of k or fewer colors.
The problem is known to be NP-complete for $k \geq 3$
and to be decidable in linear time for $k \leq 2$.

Graph coloring is an abstract problem that lies at 
the core of many applications. Well-known applications 
are scheduling, most specifically timetabling [65.47], 
register allocation in compilers [65.48], and frequency 
assignment in wireless communication [65.49]. It is 
a well-studied problem as is shown by the number of 
entries in the best-kept bibliography source until April 
2010 with over 450 publications contributing to vertex 
coloring [65.50]. 

The Second DIMACS Implementation Challenge 
in 1992-1993 focused on maximum clique, graph col- 
oring, and satisfiability. The challenge provided not 
only a standard format for graph k-coloring prob- 
lem instances, but also provided a set of problem 
instances that is still popular today. Soon after, in 
1994, Culberson and Luo [65.51] created a problem in- 
stance generator, which can create problem instances 
with a known k and various other properties. Sev- 
eral other generators exist with specific goals, such as 
to hide cliques [65.52], to create register-interference 
graphs [65.53], and to create timetabling problems 
(Sect. 65.5.4). 

The most straightforward approach to solving graph 
k-coloring with EC is to represent a genome as a vec- 
tor of all variables of the problem. This vector can then 
undergo genetic operators suitable for integer represen- 
tations. The fitness function is simply the number of 
violated constraints, which should be minimized until 
a solution is found when the fitness is equal to zero. Un- 
fortunately, this approach leads to algorithms that are 
inefficient and ineffective [65.54]. 
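The fitness based on violated constraints corresponds to the constraint-oriented penalty function f1 of Sect. 65.3.3, and the variable-oriented penalty f2 can be computed from the same data. The sketch below is a minimal illustration with all weights defaulting to 1; the graph, the colorings, and the function names `f1` and `f2` are examples of our own.

```python
def f1(coloring, edges, weights=None):
    """Constraint-oriented penalty: weighted count of edges whose
    endpoints share a color."""
    weights = weights or {edge: 1 for edge in edges}
    return sum(weights[(k, l)] for (k, l) in edges
               if coloring[k] == coloring[l])


def f2(coloring, edges, weights=None):
    """Variable-oriented penalty: weighted count of vertices that have
    at least one neighbor with the same color."""
    weights = weights or {vertex: 1 for vertex in coloring}
    incorrect = set()
    for k, l in edges:
        if coloring[k] == coloring[l]:
            incorrect.update((k, l))
    return sum(weights[vertex] for vertex in incorrect)


if __name__ == "__main__":
    # A triangle on vertices 0, 1, 2 with an extra vertex 3 attached to 2.
    edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
    coloring = {0: 1, 1: 1, 2: 2, 3: 2}
    print(f1(coloring, edges), f2(coloring, edges))
```

In the usage example two edges are incorrect but four vertices are, which shows that the two penalties can rank candidate solutions differently even with unit weights.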


To make EC more efficient and effective for solving 
graph k-coloring, new algorithms have been developed; 
these broadly fall into two categories. The first cate- 
gory consists of adding mechanisms that prevent the 
stagnation of search due to premature convergence. 
The second category consists of alternative representa- 
tions that make use of decoders to map genotypes to 
phenotypes. The two categories are not mutually exclu- 
sive, and studies have included algorithms that combine 
mechanisms from both categories. 

The earliest work on solving graph k-coloring with 
EC includes the following. Fleurent and Ferland suc- 
cessfully considered various hybrid evolutionary algo- 
rithms [65.55] with Tabu search and extended their 
work into a general implementation of heuristic search 
methods in [65.56]. Von Laszewski looked at structured 
operators and used adaptation to improve the convergence
rate of a genetic algorithm [65.57]. Davis designed an 
algorithm [65.58] to maximize the total of weights of 
nodes in a graph colored with a fixed number of col- 
ors. Coll et al. [65.59] discussed graph coloring and 
crossover operators in a more general context. 

Juhos and van Hemert introduced several heuris- 
tics [65.60, 61] for guiding the search of an evolutionary 
algorithm. All these heuristics depend on their novel 
representation that collapses the graph by combining 
nodes assigned with the same color into one hypernode, 
which speeds up further constraint checking as edges 
are merged into hyperedges [65.62]. This representation 
benefits both complete and heuristic methods. 

Moreover, as shown in the results in Fig. 65.2, 
the evolutionary algorithms developed by Juhos and 
van Hemert are able to outperform a complete method 
(Backtracking-DSatur) on very difficult problem in- 
stances where the chromatic number is 10 or 20. These 
algorithms are unable to compete with the complete 
method for smaller chromatic numbers of 3 and 5. 


65.5.3 Binary Constraint Satisfaction Problems


A binary constraint satisfaction problem (BINCSP) is
a CSP where every constraint $c \in C$ restricts at most
two variables [65.63]. Often, network graphs are used
to visualize CSP instances. In Fig. 65.3, we provide
an example of a restricting hypergraph of a BINCSP.
It consists of three variables $V = \{v_1, v_2, v_3\}$, all of
which have domain $D = \{a, b\}$. In a hypergraph every
vertex corresponds to a possible variable assignment,
i.e., $(v, d)$, where $v \in V$ and $d \in D_v$. Every edge indicates
a pair of variable assignments that is forbidden by the
set of constraints $C$. In the example, we show all the
edges that correspond to the following set of forbidden
value pairs:

$$C = \{ \{(v_1,a),(v_2,a)\}, \{(v_1,a),(v_3,b)\}, \{(v_1,b),(v_2,a)\}, \{(v_1,b),(v_2,b)\}, \{(v_1,b),(v_3,a)\}, \{(v_1,b),(v_3,b)\}, \{(v_2,a),(v_3,a)\}, \{(v_2,a),(v_3,b)\} \}$$

Fig. 65.2a-d Results of several evolutionary algorithms against the complete method Backtracking-DSatur; average minimum number of colors used through the phase transition

Fig. 65.3 Example of a |V|-partite hypergraph of a BINCSP with one solution: $\{(v_1, a), (v_2, b), (v_3, a)\}$

For problem instances, studies on BINCSP generally
create large sets of instances using one of many
problem instance generators. Several models to randomly
create BINCSPs have been designed and analyzed [65.63-65].
All of these incorporate a set of
parameters that may be used to control the size and difficulty
of the problems. Often, these parameters can be
used to create a set of problems that go through a phase
transition. That is, we order the set on the parameters
and observe how the algorithms behave when we move
through the parameter space. In most constraint satisfaction
problems we observe that the performance drops
gradually until it reaches a minimum, after which it
rises again. Most researchers test their algorithms in the
region where the minimum is reached. Here the set of
most difficult to solve problem instances is found. We
will discuss these methods next.

The model most often used in empirical research on
binary constraint satisfaction problems is one that uses
four parameters to control, to some degree, the
difficulty of an instance. By varying these global parameters
one can characterize instances that are more likely to be
either more or less difficult to solve. These parameters
are: the number of variables $n = |V|$, the size of each
variable’s domain $m = |D_{v_1}| = |D_{v_2}| = \cdots = |D_{v_n}|$, the
density of constraints $p_1$, and the average tightness of
all the constraints $p_2$. There are two ways of looking at
parameters $p_1$ and $p_2$. We will use the following
definitions.


Definition 65.3 (Density)

The density of a BINCSP is the ratio between the actual
number of constraints $|C|$ and the maximum number
of constraints $\binom{n}{2}$,


$$p_1 = \frac{|C|}{\binom{n}{2}} .$$

Definition 65.4 (Tightness)

The tightness of a constraint $c \in C$ over the variables
$v, w \in V$ of a BINCSP $\langle V, D, C \rangle$ is the ratio between
the total number of forbidden variable assignments $|c|$
and the total number of combinations of variable
assignments possible $m^2 = |D_v||D_w|$,


$$p_2(c) = \frac{|c|}{m^2} .$$

Definition 65.5 (Average Tightness)

The average tightness of a BINCSP $\langle V, D, C \rangle$ is the
sum of the tightness over all constraints divided by the
number of constraints,


$$p_2 = \frac{\sum_{c \in C} p_2(c)}{|C|} .$$
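
The three definitions can be evaluated directly on a small instance. The following is a minimal sketch, assuming all domains have the same size $m$ so that each constraint has $m^2$ possible value combinations; the dictionary layout and function names are hypothetical.

```python
from math import comb


def density(n, constraints):
    """p1: number of constraints divided by the maximum C(n, 2)."""
    return len(constraints) / comb(n, 2)


def tightness(nogoods, m):
    """p2(c): forbidden value pairs of one constraint divided by m^2."""
    return len(nogoods) / (m * m)


def average_tightness(constraints, m):
    """p2: mean tightness over all constraints."""
    return sum(tightness(c, m) for c in constraints.values()) / len(constraints)


# Three variables with domain {a, b}; keys are variable pairs and values
# are the forbidden value pairs of that constraint (illustrative data).
constraints = {
    ("v1", "v2"): {("a", "a"), ("b", "a"), ("b", "b")},
    ("v1", "v3"): {("a", "b"), ("b", "a"), ("b", "b")},
    ("v2", "v3"): {("a", "a"), ("a", "b")},
}
n, m = 3, 2
print(density(n, constraints))            # 1.0
print(average_tightness(constraints, m))  # (3/4 + 3/4 + 2/4) / 3 = 2/3
```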


These definitions give the density and tightness in
terms of a ratio, or in other words, as percentages
of the maximum. Another way of looking at these
two properties uses probabilities [65.66]. We could
define the density of a BINCSP as the probability that
a constraint exists between two variables. The tightness
can be defined in an analogous way,
as the probability that a conflict exists between two
instantiations of two variables. The difference between these
viewpoints becomes apparent in the different
implementations of algorithms that generate BINCSPs:
with uniform generation the ratio in an instance is
determined beforehand, while with probability the ratio will
vary according to a normal distribution. When comparing
studies in which probabilities are used, it is important
to know whether the results are reported against the
probability that was set or against the actual ratio
measured in the whole instance.


Table 65.1 Different models for the general method for
generating binary constraint satisfaction problems

                          Nogoods
Constraints      Probability    Uniform
Probability      Model A        Model C
Uniform          Model D        Model B


The simplest way to empirically test the performance
of an algorithm on solving CSPs is by generating
instances using different settings for the four main
parameters, $n$, $m$, $p_1$, and $p_2$. However, there are two ways
of choosing where to put constraints in a constraint
network. We can choose the number of constraints we want
to have beforehand and then uniformly distribute them
in the constraint network. Alternatively, we can decide
for each possible edge in the constraint network, with
probability $p_1$, whether this edge is inserted, i.e., a
constraint is added. We will call the first model the uniform
model and the second the probability model. The same
categorization holds for nogoods. Given a constraint we
can either distribute $p_2 m^2$ nogoods uniformly or decide
with probability $p_2$ which value pairs become
nogoods. Now we can define four different models and
we will name them according to the models in [65.63,
65]. The models are shown in Table 65.1.
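
The four combinations can be sketched with a single generator that takes two switches. This is an illustrative sketch rather than the generators used in the cited studies; the function name, the flag names, and the rounding of $p_1 \binom{n}{2}$ and $p_2 m^2$ to integers are assumptions.

```python
import random
from itertools import combinations, product


def generate_bincsp(n, m, p1, p2, uniform_constraints, uniform_nogoods, rng):
    """Return {(i, j): set of forbidden (value_i, value_j) pairs} with i < j."""
    edges = list(combinations(range(n), 2))
    if uniform_constraints:
        # Uniform: fix the number of constraints, then place them at random.
        chosen = rng.sample(edges, round(p1 * len(edges)))
    else:
        # Probability: include each possible edge independently with p1.
        chosen = [e for e in edges if rng.random() < p1]

    pairs = list(product(range(m), repeat=2))
    instance = {}
    for edge in chosen:
        if uniform_nogoods:
            # Uniform: exactly round(p2 * m^2) nogoods per constraint.
            nogoods = rng.sample(pairs, round(p2 * m * m))
        else:
            # Probability: each value pair is a nogood with probability p2
            # (a constraint may end up with no nogoods at all).
            nogoods = [p for p in pairs if rng.random() < p2]
        instance[edge] = set(nogoods)
    return instance


# Table 65.1 expressed as (uniform_constraints, uniform_nogoods) flags.
MODELS = {"A": (False, False), "B": (True, True),
          "C": (False, True), "D": (True, False)}

rng = random.Random(0)
model_b = generate_bincsp(10, 5, 0.4, 0.2, *MODELS["B"], rng)
print(len(model_b), {len(c) for c in model_b.values()})  # 18 {5}
```

With Model B the measured density and tightness equal the requested values up to rounding; with Model A both vary from instance to instance.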


Definition 65.6 (Parameter Vector of a BINCSP)

A parameter vector of a binary constraint satisfaction
problem (BINCSP) with $n$ variables and $m$ as each
variable’s domain size is a 4-tuple $\langle n, m, p_1, p_2 \rangle$ of four
parameters: the number of variables $n$, the domain size
of each variable $m$, the density $p_1$, and the average
tightness $p_2$.


We can also characterize a set of binary constraint
satisfaction problems using the parameter vector as a set
$B$ of BINCSP instances where


$$\forall \langle n, m, p_1, p_2 \rangle, \langle n', m', p_1', p_2' \rangle \in B :
n = n' \wedge m = m' \wedge p_1 = p_1' \wedge p_2 = p_2' .$$


Such a set we call a suite of problem instances.
Achlioptas et al. prove in [65.64] that as the number
of variables becomes large almost all instances
created by Models A–D become unsolvable. The
reason lies in the existence of flawed variables. Whenever
a variable $v$ is involved in a constraint and has all its
values incompatible with a value of an adjacent variable $w$,
this variable is called flawed. In terms of compound
labels using the constraint $c$ over variables $v$ and $w$ this is
written as,


$$\forall v \in D_v : \nexists w \in D_w :
\text{satisfies}(\langle (v, v), (w, w) \rangle, c) \wedge c \in C .$$

When the number of variables is increased without 
changing the other parameters, the number of flawed 
variables will increase, thus making it easy to prove 
instances have no solution. To overcome the problems 
a new model is proposed [65.64]: 


Definition 65.7 (Model E)

The graph $C^{\Pi}$ is a random $n$-partite graph with $m$ nodes
in each part that is constructed by uniformly,
independently, and with repetitions selecting $p_e \binom{n}{2} m^2$ edges out
of the $\binom{n}{2} m^2$ possible ones.


The idea behind this model is that the difficulty is
controlled by the tightness and not influenced by the
structure of the constraint network. The parameter $p_e$
is responsible for the average tightness of the BINCSP.
However, it is not the same parameter as the average
tightness $p_2$. Because we allow repetitions in the
process we end up with an average tightness smaller than
or at most equal to $p_e$.

Parameter $p_e$ also influences the value of $p_1$.
In [65.65] we find the proof that using Model E with
fairly small values ($p_e < 0.05$) will result in a fully
connected constraint network ($p_1 = 1$). This is seen as
a flaw in Model E, as many problems do not require
a fully connected constraint network. This has led to
yet another model.
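
The effect of sampling with repetitions can be seen in a small sketch of Model E. The function name and the rounding of the number of draws are assumptions; the code only illustrates that the fraction of distinct forbidden value pairs cannot exceed $p_e$.

```python
import random
from itertools import combinations
from math import comb


def model_e(n, m, pe, rng):
    """Draw round(pe * C(n,2) * m^2) nogoods with repetition; duplicates collapse."""
    edges = list(combinations(range(n), 2))
    draws = round(pe * comb(n, 2) * m * m)
    instance = {}
    for _ in range(draws):
        # A uniform edge of the n-partite graph: a uniform pair of parts,
        # then a uniform node (value) in each of the two parts.
        i, j = rng.choice(edges)
        pair = (rng.randrange(m), rng.randrange(m))
        instance.setdefault((i, j), set()).add(pair)
    return instance


rng = random.Random(1)
n, m, pe = 10, 5, 0.2
instance = model_e(n, m, pe, rng)
distinct = sum(len(c) for c in instance.values())
# Fraction of all C(n,2) * m^2 value pairs that ended up forbidden.
print(distinct / (comb(n, 2) * m * m), "<=", pe)
```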

MacIntyre et al. propose a more generalized version
of Model E called Model F [65.65]. This model starts
out the same way as Model E by generating $p_1 p_2 m^2 \binom{n}{2}$
nogoods. Afterwards, a constraint network is
generated with exactly $p_1 \binom{n}{2}$ edges in the uniform way. All
nogoods that are not in a constraint in the constraint
network are removed from the problem instance. Model
E is the special case of Model F where $p_1 = 1$. The
benefit of Model F is the ability to generate problems
where $p_1 < 1$, which is more realistic with respect to
real-world problems.
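
The three steps of Model F can be sketched as follows. This is a hedged illustration, not the generator from [65.65]: the function name, the return values, and the rounding of the two counts are assumptions.

```python
import random
from itertools import combinations
from math import comb


def model_f(n, m, p1, p2, rng):
    """Return (instance, network) for one Model F sample."""
    edges = list(combinations(range(n), 2))
    # Step 1: as in Model E, draw p1 * p2 * m^2 * C(n,2) nogoods with repetition.
    draws = round(p1 * p2 * m * m * comb(n, 2))
    nogoods = {(rng.choice(edges), (rng.randrange(m), rng.randrange(m)))
               for _ in range(draws)}
    # Step 2: choose exactly p1 * C(n,2) constraint edges uniformly.
    network = set(rng.sample(edges, round(p1 * comb(n, 2))))
    # Step 3: drop every nogood whose variable pair is not in the network.
    instance = {}
    for edge, pair in nogoods:
        if edge in network:
            instance.setdefault(edge, set()).add(pair)
    return instance, network


instance, network = model_f(10, 5, 0.4, 0.2, random.Random(2))
print(len(network), len(instance))
```

With `p1 = 1` the network contains every edge, no nogood is removed, and the procedure reduces to Model E.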

Craenen et al. [65.67] present the largest compari- 
son study of EC and CP approaches for the BINCSP. 
In this study they compare the success rate and av- 
erage number of conflict checks to a solution of 11 
evolutionary algorithms. The best four evolutionary al- 
gorithms are compared with forward checking with 
conflict-directed backjumping [65.68], and the authors
concluded the latter has a superior performance on
every problem instance in the benchmark.

The following heuristic approaches are included in 
the study. In [65.69, 70], Eiben et al. propose to incor- 
porate existing CSP heuristics into genetic operators. 
A study on the performance of these heuristic-based 
operators when solving binary CSPs was published 
in [65.71]. Two heuristic-based genetic operators are 
specified: an asexual operator that transforms one in- 
dividual into a new one and a multi-parent operator 
that generates one offspring using a number of par- 
ents. In [65.72—74], Riff-Rojas introduced an EA for 
solving CSPs that uses information about the con- 
straint network in the fitness function and in the genetic 
operators (crossover and mutation). The fitness func- 
tion is based on the notion of the error evaluation of 
a constraint. Marchiori et al. introduced and investi- 
gated EAs for solving CSPs based on pre-processing 
and post-processing techniques [65.75—77]. Included 
in the comparison is the variant form [65.75, 78] that 
transforms constraints into a canonical form in such 
a way that there is only one single (type of) primi- 
tive constraint; we call this algorithm glass-box. This 
approach is used in constraint programming, where 
CSPs are given in implicit form by means of
formulas of a given specification language. In [65.79,
80] Handa et al. formulate a coevolutionary algorithm
where a population of schemata is parasitic on the host
population. Schemata in this algorithm are individuals
where a portion of the variables has values while all
other variables have do-not-care symbols
represented by asterisks.

The following approaches with an emphasis on adaptive
features are included in the comparison. The first is a
co-evolutionary approach invented by Paredis and evaluated
on different problems, such as neural net learning
[65.81], constraint satisfaction [65.81, 82], and
searching for cellular automata that solve the density
classification task [65.83]. Furthermore, results
on the performance of the co-evolutionary approach
when facing the task of solving binary CSPs are
reported in [65.84, 85]. In the co-evolutionary approach
for CSPs two populations evolve according to
a predator-prey model: a population of candidate
solutions and a population of constraints. In the approach
proposed by Dozier et al. in [65.86] and further
refined and applied in [65.87–89], information about
the constraints is incorporated both in the genetic
operators and in the fitness function. In the microgenetic
iterative descent algorithm the fitness function
is adaptive and employs Morris’ breakout creating
mechanism [65.90] to escape from local optima. The
stepwise adaptation of weights mechanism was
introduced by Eiben and van der Hauw [65.91, 92] as
an improved version of the weight adaptation
mechanism of Eiben et al. [65.93, 94]. The approach has
been studied in several comparisons and often proved
to be a robust technique for solving several specific
CSPs [65.95–97]. A comprehensive study of different
parameters and genetic operators can be found
in [65.98]. The basic idea is that constraints that are
not satisfied, or variables causing constraint violations,
after a certain number of steps must be hard and thus
must be given a high weight (penalty) in the fitness
function.
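
The weight adaptation idea can be sketched with a weighted penalty function and a periodic update. This is a simplified, constraint-based illustration; the function names, the increment `delta`, and the toy constraints are assumptions, and the schedule that decides when the update is applied is left out.

```python
def weighted_penalty(assignment, constraints, weights):
    """Sum of the weights of violated constraints (lower is better)."""
    return sum(w for c, w in zip(constraints, weights) if not c(assignment))


def adapt_weights(best, constraints, weights, delta=1):
    """Raise the weight of every constraint the current best still violates."""
    return [w + delta if not c(best) else w
            for c, w in zip(constraints, weights)]


# Toy CSP over x, y, z: each constraint is a predicate on an assignment.
constraints = [
    lambda a: a["x"] != a["y"],
    lambda a: a["y"] != a["z"],
    lambda a: a["x"] < a["z"],
]
weights = [1, 1, 1]
best = {"x": 1, "y": 1, "z": 0}  # violates the first and third constraint

print(weighted_penalty(best, constraints, weights))  # 2
weights = adapt_weights(best, constraints, weights)
print(weights)                                       # [2, 1, 2]
print(weighted_penalty(best, constraints, weights))  # 4
```

After the update, candidates that keep violating the same constraints look worse, which pushes the search toward satisfying the constraints that have proved hard.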


65.5.4 Examination Timetabling 


Examination timetabling has been studied for many 
years as it is a common problem in many organi- 
zations. Already in 1986, Carter gave an extended 
survey of work on automated timetabling [65.99]. He 
is also responsible for providing problem instances, 
which are still available and popular today [65.100], 
although a more diverse benchmark is used in the 
annual timetabling competition [65.101]. Burke et al. 
provide the most extensive recent surveys of automated 
timetabling in [65.102, 103]. Examination timetabling 
is just one of many problems under the topic of 
timetabling [65.104]. 

Timetabling as a problem has many different
definitions due to different kinds of constraints and
objectives. The definition that is most relevant for
constraint satisfaction is often referred to as examination
timetabling. The most abstract definition simply
consists of a matrix $C$ where $C_{ij} = 1$ if exam $i$ conflicts
with exam $j$ by having common students that must take
both exams, and $C_{ij} = 0$ otherwise. This definition is
equivalent to a graph coloring problem if the objective is
to minimize the number of exam slots required, where
the number of slots equals the number of colors
required for coloring the graph with adjacency matrix $C$.
Hence, an appropriate approach to performance testing
is via graph coloring instances based on examination
timetabling, such as the problem instances labeled SCH
(school) in the graph coloring instances suite provided
by Lewandowski [65.105].
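
The correspondence with graph coloring can be illustrated by building the conflict matrix from enrollment data and coloring it greedily. The enrollment data are hypothetical, and the greedy rule is a simple stand-in for a real solver: it yields a feasible timetable but not necessarily the minimum number of slots.

```python
from itertools import combinations

# Hypothetical enrollment data: student -> exams taken.
enrollment = {
    "s1": ["math", "physics"],
    "s2": ["physics", "chemistry"],
    "s3": ["math", "chemistry"],
    "s4": ["history"],
}
exams = sorted({e for taken in enrollment.values() for e in taken})
index = {e: k for k, e in enumerate(exams)}

# Conflict matrix: C[i][j] = 1 if exams i and j share at least one student.
C = [[0] * len(exams) for _ in exams]
for taken in enrollment.values():
    for a, b in combinations(taken, 2):
        C[index[a]][index[b]] = C[index[b]][index[a]] = 1

# Greedy coloring: give each exam the lowest slot unused by conflicting exams.
slot = {}
for i in range(len(exams)):
    used = {slot[j] for j in range(len(exams)) if C[i][j] and j in slot}
    slot[i] = next(s for s in range(len(exams)) if s not in used)

timetable = {exams[i]: s for i, s in slot.items()}
print(timetable)  # {'chemistry': 0, 'history': 0, 'math': 1, 'physics': 2}
```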


Many problem instances and problem instance gen- 
erators exist. Infrequently, an International Timetabling 
Competition is organized by The International Series 
of Conferences on the Practice and Theory of Auto- 
mated Timetabling. At each event, another definition 
of timetabling problems is tackled. The differences be- 
tween definitions are in the objectives and the soft 
and hard constraints used. Hard constraints are treated 
the same as in constraint satisfaction, whereas soft 
constraints may be violated but will either incur an ad- 
ditional penalty on the objective function or be used 
to prioritize solutions otherwise, for instance, using 
a Pareto front. Corne et al. [65.106] identified five
categories of constraints: unary, binary, capacity, event
spread, and agent preference.

Three approaches exist to solving timetabling prob- 
lems. The first approach is called one-stage optimiza- 
tion. It aggregates all types of constraints of one prob- 
lem, often by summation, into one objective function 
where each type is assigned a weight. The advantage 
is that, in principle, the approach can be applied to 
any set of constraints. In practice, it may prove
difficult to optimize such a function. Representations of
the problem fall into two main categories: direct
encoding (Sect. 65.3.1) [65.107] and indirect encoding
(Sect. 65.3.2) [65.106, 108].
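
The aggregation used in one-stage optimization can be written as a weighted sum of violation counts per constraint type. The constraint names and weights below are illustrative assumptions; choosing weights that keep hard constraints dominant is part of what makes such a function difficult to tune.

```python
def one_stage_objective(violations, weights):
    """Weighted sum of violation counts, one weight per constraint type."""
    return sum(weights[kind] * violations.get(kind, 0) for kind in weights)


# Hypothetical weights: hard constraint types get a large weight so that
# a single hard violation outweighs many soft ones.
weights = {"binary": 1000, "capacity": 1000, "event_spread": 1}
violations = {"binary": 1, "capacity": 0, "event_spread": 4}

print(one_stage_objective(violations, weights))  # 1004
```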

The second approach is called two-stage optimization.
It first solves the problem of finding a feasible
solution where all the hard constraints are satisfied. In
the second stage it searches within the space delimited by
these hard constraints and optimizes only against the
soft constraints. The benefits are that during search we
do not have to distinguish between feasible and
infeasible solutions and, therefore, are not in danger of
the search wandering off into an infeasible part of the
search space. Thompson and Dowsland [65.109] were
the first to report on this approach using simulated
annealing, closely followed by the first EA by Yu and
Sung [65.110].

The third approach uses relaxation of constraints.
Typically, relaxation in timetabling is achieved by
not assigning events to slots or by adding
time slots. An early example of an EA is by
Burke et al. [65.111], where an indirect encoding is
used and additional time slots are used to relax the
problem.


65.6 Creating Rather than Solving Problems


So far we have covered evolutionary computation for 
solving CSP. A contrasting idea proposed first for con- 
straint satisfaction in [65.112] is to use evolutionary 
computation to generate problem instances. Such an 
approach allows a search for problem instances that 
adhere to certain properties as long as these can be mea- 
sured efficiently by a fitness function. 

A straightforward use for such an approach is to 
evolve problem instances that are difficult to solve for 
a particular algorithm. By measuring the efficiency of 
an algorithm to solve instances of a certain problem we 
can then change the instances with the aim of decreas- 
ing the efficiency. Measurements for efficiency of EC 
for CSP are discussed in Sect. 65.4.1. It is important 
to note that the algorithm we are evolving problem in- 
stances for can be of any kind, as long as we can execute 
it on problem instances generated and we can measure 
its efficiency. 

Such hard problem instances identify the weak
spots in the algorithm that tries to solve them. Moreover,
if we can characterize a set of problem instances where
all members of the set are hard for an algorithm, then we
can use that characterization to decide what algorithm
is suitable for solving a new problem instance. This holds
provided that obtaining the characteristics of an
instance takes less effort than solving the actual
problem instance itself [65.113].


65.6.1 Evolving Binary Constraint Satisfaction Problem Instances


The first application to constrained problems was
for the binary constraint satisfaction problem
(Sect. 65.5.3), where problem instances are
represented as a binary vector with each element
corresponding to the element of a conflict matrix
between two variables [65.114]. Even the small
instances investigated in the study led to large vectors,
i.e., with 15 variables each with a domain of size
15, the corresponding vector has $\binom{15}{2} \cdot 15^2 = 23\,625$
elements. Results with problem instances of this size
show problem instances can be created that are far
more difficult to solve than those found when creating
a much larger set of randomly generated instances [65.112].
Furthermore, analysis of these instances provides
an insight as to what structure is responsible for
making instances difficult for the algorithm; two
well-known algorithms from constraint programming
were tested: chronological backtracking [65.115] and
forward checking with conflict-directed
backjumping [65.116].
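
The vector length follows from one $m \times m$ conflict matrix per pair of variables. The sketch below checks the count and shows one possible bit layout; the layout (pairs in lexicographic order, each matrix stored row by row) and the function names are assumptions, not the encoding of the cited study.

```python
from itertools import combinations
from math import comb


def vector_length(n, m):
    """One m x m conflict matrix for each of the C(n, 2) variable pairs."""
    return comb(n, 2) * m * m


def bit_index(n, m, i, j, a, b):
    """Position of the conflict bit for values (a, b) of variables i < j."""
    pair_rank = list(combinations(range(n), 2)).index((i, j))
    return pair_rank * m * m + a * m + b


print(vector_length(15, 15))                 # 23625
print(bit_index(15, 15, 0, 1, 0, 0))         # 0
print(bit_index(15, 15, 13, 14, 14, 14))     # 23624
```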


65.6.2 Evolving Boolean Satisfiability Problem Instances


In [65.114] an evolutionary algorithm is used to evolve 
solvable Boolean satisfiability problem instances that 
are in conjunctive normal form and have three variables 
per clause. A 3-SAT problem is represented by a list 
of natural numbers. A number in the list, i.e., a gene, 
corresponds to a unique clause with three different lit- 
erals. The number of possible unique clauses depends 
on the number of variables and the size of the clause. 
Here, the number of variables is set to 100 and the 
size of the clause is 3, hence there are 1 313 400 unique 
clauses. This representation has strong advantages over
a simple one-gene-for-every-literal approach. Most
importantly, it prevents duplicate variables in clauses,
which reduces the state space; duplicates could otherwise
introduce trivial clauses, e.g., $(x \vee \neg x \vee y)$, or 2-SAT clauses,
e.g., $(x \vee x \vee y)$. Also, the variation operators now
simply become mutation and uniform crossover for lists of
natural numbers over a fixed domain.
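
The stated clause count can be checked arithmetically: 1 313 400 equals $\binom{200}{3}$, the number of ways to choose three distinct literals from the 200 literals over 100 variables. The sketch below assumes that reading and decodes a gene as the rank of a literal triple in lexicographic order; the literal ordering and the `decode` function are hypothetical. Under this reading a clause may still pair a variable with its negation; requiring three distinct variables would give $\binom{100}{3} \cdot 2^3 = 1\,293\,600$ clauses instead.

```python
from itertools import combinations, islice
from math import comb

N_VARS = 100
# Literals as signed integers: v is the positive literal, -v its negation.
literals = [sign * v for v in range(1, N_VARS + 1) for sign in (1, -1)]

print(comb(len(literals), 3))    # 1313400, three distinct literals
print(comb(N_VARS, 3) * 2 ** 3)  # 1293600, three distinct variables


def decode(gene):
    """Return the gene-th triple of distinct literals in lexicographic order."""
    return next(islice(combinations(literals, 3), gene, None))


print(decode(0))
```

Walking the combinations with `islice` is linear in the gene value; a closed-form unranking would be preferable inside an evolutionary loop.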

Two problem solvers are used from the annual SAT
competition [65.117]; both are based on the
Davis–Putnam procedure [65.4]. zChaff [65.118] is based on
Chaff [65.119], a SAT solver that employs a
particularly efficient implementation of Boolean constraint
propagation and a novel low overhead decision
strategy. Relsat [65.120] is explained in [65.121, 122]. In
both solvers, the number of instantiation states is
counted to determine the search effort required.

The change of certain structural properties over the 
duration of evolution was analyzed. Two established 
properties were used: the number of solutions [65.123, 
124] and the backbone size [65.125]. No clear relation- 
ship was identified with these properties. 

However, a new relationship was identified: as
problem instances become more difficult to solve,
the variance in the frequency of variable usage
decreases. In other words, the distribution of variables
throughout the instances is more uniform when
problems are more difficult to solve.


65.6.3 Further Investigations 


The application of evolutionary computation in
problem generation is widespread. Smith-Miles and
Lopes [65.126] provide an extensive review in terms
of measuring instance difficulty in combinatorial
optimization problems, which also discusses studies that
evolve problem instances for constrained optimization
as well as for constraint satisfaction problems.

The maximization of the effort required to solve
a problem instance highlights only one aspect of the
problem difficulty. Another aspect, which looks at
effectiveness, is to maximize the distance between the best
solution a solver is able to reach and the optimal solution.
To compute this distance, we require the fitness of the
optimal solution a priori. Note, however, we do not need
to know what the optimal solution is, only its fitness.
Another approach is to directly compare solvers by
maximizing the difference in some aspect, e.g., efficiency
or effectiveness, between two solvers.


65.7 Conclusions and Future Directions 


Research on solving constraint satisfaction problems 
with evolutionary computation has produced a rich 
set of research papers that contribute solvers, insights 
into solvers and their performance, and heuristic sub- 
routines. One major flaw in this research has re- 
mained consistent over the past 20 years: most stud- 
ies compare performance results only to other evo- 
lutionary or closely associated techniques. Even re- 
cent studies, such as [65.127—129], restrict themselves 
to comparing only results from other heuristic meth- 
ods or have not included alternative techniques at 
all. 

Many studies report on the promising performance
of a particular evolutionary algorithm over another
existing heuristic technique. The few systematic studies
that do compare evolutionary and constraint
programming techniques conclude that constraint programming


66. Swarm Intelligence in Optimization and Robotics


Christian Blum, Roderich Groß 


Swarm intelligence is an artificial intelligence dis- 
cipline, which was created on the basis of the laws 
that govern the behavior of, for example, social 
insects, fish schools, and flocks of birds. The or- 
ganization of these animal societies has always 
mesmerized humans. Therefore, it is surprising 
that it has only been in the second half of the last 
century that some of the most important prin- 
ciples of swarm intelligent behavior have been 
unraveled. A prime example is stigmergy, which 
refers to a self-organization of the animal society 
via changes applied to the environment. 

In this chapter, we provide a concise introduction
to swarm intelligence, with two main research
lines in mind: optimization and robotics. Popular
examples of optimization algorithms based on
swarm intelligence principles are ant colony
optimization and particle swarm optimization. On the
other side, the field of robotics has adopted various
swarm intelligent behaviors for problem solving
and organizing groups of robots. This has
resulted in a separate research field nowadays
known as swarm robotics.


66.1 Overview


Swarm intelligence (SI) [66.1–3] is a subfield of the
more general field of artificial intelligence [66.4]. The
term swarm intelligence was introduced and used for
the first time by Beni et al. [66.5–7] in the context
of cellular robotic systems. Nowadays, SI research is
generally concerned with the design of intelligent
multiagent systems whose inspiration is taken from the
collective behavior of social, or even eusocial, insects
and other animal populations. Examples include
ant colonies, bee hives, wasp colonies, frog populations,
flocks of birds, and fish schools. Among these,
social insects have always played a prominent role in
the inspiration of SI techniques. Even though their
intrinsic ways of functioning have fascinated researchers
for many years, the mechanisms that govern their
behavior remained unknown for a long time. In colonies of
social insects, for example, single colony members are
unsophisticated individuals, yet they are able to achieve
complex tasks in cooperation. Essential colony
behaviors emerge from relatively simple interactions between
the colony’s individual members.

An important aspect of any SI system is
self-organization [66.8]. Originally, the term self-organization
was introduced by the German philosopher Immanuel
Kant [66.9] in an attempt to characterize what makes
organisms so different from other objects. Nowadays, the
term self-organization refers to a process where some
form of global order or coordination emerges from
rather simple interactions between low-level
components of an initially unordered system. Self-organizing
processes are neither directed nor controlled by any
agent or component, neither from inside nor from
outside the system. They are often triggered by random
fluctuations that are amplified by positive feedback and
possibly counterbalanced by negative feedback, which
generally aids in stabilizing the system. The global
properties exhibited by self-organizing systems are thus
the result of this distributed interplay of their
components. As such, self-organization is typically robust
and able to survive and self-repair damage or
perturbations. Historically, self-organization processes have
been studied in physical, chemical, biological, social,
and cognitive systems. Well known examples are
crystallization, molecular self-assembly, and the way in
which neural networks learn to recognize complex
patterns.

During the last 50 years or so, biologists discovered
that many aspects of the collective activities of social
insects are self-organized as well, that is, they
function without a central control. For example, the African
weaver ant constructs nests by pulling leaves together.
Where the gap between leaves exceeds the body length
of an individual ant, multiple ants organize into pulling
chains. Once the leaves are in contact, they are glued
together using silk from larvae, which are carried to the
site by other workers of the colony [66.10]. Other
examples concern the recruitment of fellow colony members
for prey retrieval (Fig. 66.1), the capabilities of termites
and wasps to build sophisticated nests, or the ability
of bees and ants to orient themselves in their
environment. For more examples, we refer the interested reader
to [66.1, 2].

Fig. 66.1 Ants cooperate for retrieving a heavy prey
(photo courtesy of M. J. Blesa)

In the meantime, some of the above mentioned
behaviors have been used as inspiration for the resolution
of technical problems, especially in the context of
optimization and in robotics. This chapter is dedicated
to reviewing some of the algorithms and systems from
these two fields that are, in the opinion of the authors,
the most interesting.


66.2 SI in Optimization


The use of SI techniques for solving optimization
problems already has a rather extensive history. SI
techniques have been used for solving both combinatorial
and continuous optimization problems in static and in
distributed settings. Two of the most well-known SI
techniques for solving optimization problems are ant
colony optimization (ACO) and particle swarm
optimization (PSO). More recently, other techniques such
as the artificial bee colony algorithm have been
developed. Apart from solving optimization problems, SI
techniques are being used for management tasks, for
example, in distributed settings or in online optimization.
The following sections will give a brief overview of this
application field of SI.


66.2.1 Ant Colony Optimization

ACO [66.11] is one of the earliest SI techniques for
optimization. Dorigo and colleagues developed the first
ACO algorithms in the early 1990s [66.12–14]. The
development of these algorithms was inspired by the
observation of ant colonies. Ants are social insects. 
They live in colonies and their behavior is governed 
by the goal of colony survival rather than being fo- 
cused on the survival of individuals. The behavior that 
provided the inspiration for ACO is the ants’ foraging 
behavior, and in particular, how ants of many species 
can find shortest paths between food sources and their 
nest. In order to search for food, ants initially explore 
the area around their nest by means of random walks. 
While moving, ants leave tiny drops of a pheromone 
substance on the ground. Ants are also able to scent 
these pheromones. When choosing their way, they are 
attracted by paths marked by strong pheromone con- 
centrations. When having identified a food source, ants 
evaluate the quantity and the quality of the food and 
carry some of it back to their nest. During the
return trip, the quantity of pheromone that ants leave
on the ground may depend on the quantity and
quality of the food. The pheromone trails will guide other
ants to the food source. It has been shown in [66.15]
that the indirect communication between the ants via
pheromone trails, known as stigmergy [66.16],
enables them to find the shortest paths between their
nest and food sources. Initially, ACO algorithms were
developed with the aim of solving discrete
optimization problems. It should be mentioned, however, that
nowadays the class of ACO algorithms also comprises
methods for the application to problems arising in
networks, such as routing and load balancing [66.17], and
for the application to continuous optimization
problems [66.18].

ACO algorithms may be regarded from different 
perspectives. First of all, as mentioned above, they are 
SI techniques. However, seen from an operations re- 
search perspective, ACO algorithms belong to the class 
of metaheuristics [66.19-21]. The term metaheuristic, 
first introduced in [66.22], has been derived from the 
composition of two Greek words. Heuristic derives
from the verb heuriskein (εὑρίσκειν), which means to
find, while the prefix meta means beyond, in an upper
level. Before this term was widely adopted, metaheuris- 
tics were often called modern heuristics [66.23]. In 
addition to ACO, other algorithms such as evolutionary 
computation, iterated local search, simulated annealing, 
and tabu search, are often regarded as metaheuristics. 
For books and surveys on metaheuristics, we refer the 
reader to [66.19—21, 23]. 


Algorithm 66.1 Ant colony optimization (ACO)
1: while termination conditions not met do
2:   ScheduleActivities
3:     AntBasedSolutionConstruction()
4:     PheromoneUpdate()
5:     DaemonActions() {optional}
6:   end ScheduleActivities
7: end while


From a technical perspective, ACO algorithms work
as follows. Given a combinatorial optimization problem
to be solved, first a finite set $C$ of the so-called solution
components, used for assembling solutions to the
problem, must be defined. Second, a set $T$ of pheromone
values must be defined. This set of values is commonly
called the pheromone model, which is, from a
mathematical point of view, a parameterized probabilistic
model. The pheromone model is one of the central
components of any ACO algorithm. The pheromone
values $\tau_i \in T$ are commonly associated with solution
components. The pheromone model is used to
probabilistically generate solutions to the problem under
consideration by assembling them from the set of
solution components. In general, ACO algorithms attempt
to solve an optimization problem by iterating the
following two steps:


• Candidate solutions are constructed using
  a pheromone model, that is, a parameterized
  probability distribution over the search space.

• The candidate solutions are used to update the
  pheromone values in a way that is deemed to bias
  future sampling toward high-quality solutions.


The pheromone update aims to concentrate the 
search in regions of the search space containing high- 
quality solutions. In particular, the reinforcement of so- 
lution components depending on the solution quality 
is an important ingredient of ACO algorithms. It im- 
plicitly assumes that good solutions consist of good 
solution components. To learn which components con- 
tribute to good solutions can help assemble them into 
better solutions. The main steps of any ACO algorithm 
are shown in Algorithm 66.1. DaemonActions (see 
line 5 of Algorithm 66.1) may include, for example, the 
application of local search to solutions constructed in 
function AntBasedSolutionConstruction(). 
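
The two iterated steps described above can be illustrated with a minimal sketch. It is not one of the ACO variants cited in this chapter: it assumes a small symmetric traveling salesman instance, treats each edge as a solution component, and uses an Ant-System-style update. The function name `aco_tsp` and all parameter values are illustrative assumptions.

```python
import random


def aco_tsp(dist, n_ants=10, n_iters=50, rho=0.1, seed=0):
    """Ant-System-style sketch for a symmetric TSP (illustrative only).

    dist is a square matrix of positive distances between distinct cities.
    """
    rng = random.Random(seed)
    n = len(dist)
    # Pheromone model: one value per solution component (here, an edge i-j).
    tau = [[1.0] * n for _ in range(n)]
    best_tour, best_len = None, float("inf")

    def tour_length(tour):
        return sum(dist[tour[k]][tour[(k + 1) % n]] for k in range(n))

    for _ in range(n_iters):
        # AntBasedSolutionConstruction: sample tours from the pheromone model.
        tours = []
        for _ in range(n_ants):
            tour = [rng.randrange(n)]
            unvisited = set(range(n)) - {tour[0]}
            while unvisited:
                i = tour[-1]
                cand = sorted(unvisited)
                # Assumed bias: pheromone value times inverse distance.
                weights = [tau[i][j] / dist[i][j] for j in cand]
                nxt = rng.choices(cand, weights=weights)[0]
                tour.append(nxt)
                unvisited.remove(nxt)
            tours.append(tour)

        # PheromoneUpdate: evaporate, then reinforce the components of each
        # constructed tour in proportion to its quality (1 / length).
        for i in range(n):
            for j in range(n):
                tau[i][j] *= 1.0 - rho
        for tour in tours:
            length = tour_length(tour)
            for k in range(n):
                a, b = tour[k], tour[(k + 1) % n]
                tau[a][b] += 1.0 / length
                tau[b][a] += 1.0 / length
            if length < best_len:
                best_tour, best_len = tour, length

    return best_tour, best_len
```

A daemon action such as local search would, under the same assumptions, be applied to the entries of `tours` before the pheromone update.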

The class of ACO algorithms comprises several 
variants. Among the most popular ones are MAX- 
MIN Ant System (MMAS) [66.24] and ant colony 
system (ACS) [66.25]. For more comprehensive infor- 
mation, we refer the interested reader to [66.26]. 


66.2.2 Particle Swarm Optimization 


PSO [66.2, 27] is an SI technique for optimization that 
is inspired by the collective behavior of flocks of birds 
and/or fish schools. The first PSO algorithm was intro- 
duced in 1995 by Kennedy and Eberhart [66.28] for the 
purpose of optimizing the weights of a neural network, 
that is, for continuous optimization. In the meantime, 
PSO has also been adapted for its application to discrete 
optimization problems [66.29]. 

In PSO, solutions to the problem under consider- 
ation are labeled particles. The algorithm works on 
a whole set of particles at the same time, the so-called 
swarm. Therefore, PSO can be seen as a population- 
based optimization technique. During the run time of 
the algorithm, particles move through the search space 
on the search for an optimal, or good enough, so- 
lution. Moreover, particles communicate their current 
positions to neighboring particles. The position of each 
particle is updated according to three terms: its so-called
velocity, the difference between its current position
and the best position it has found so far, and the difference
between its current position and the best position found
by its neighbors. This has
the effect that, during the execution of the algorithm, 
the swarm increasingly focuses on areas of the search 
space containing high-quality solutions. The term parti- 
cle swarm was chosen by Kennedy and Eberhart for the 
following reason. Their initial intention was to model 
the movements of flocks of birds and fish schools. As 
their model further evolved toward an algorithm for op- 
timization, the visual plots produced from the results of 
the algorithm rather resembled swarms of mosquitoes. 
The term particle was used because the algorithm makes
use of the term velocity, and particle seemed to be the
most appropriate term in this context.

PSO is closely related to artificial life models. Early 
works by Reynolds on the flocking model known as 
boids [66.30], and Heppner and Grenander’s studies 
on rules governing large numbers of birds flocking 
synchronously [66.31], suggested that bird flocking is 
an emergent behavior resulting from local interactions 
between the birds. These studies laid the foundation 
for the development of PSO for solving optimization 
problems. PSO is — in some way — similar to cellular 
automata (CA), which are often used for generating as- 
tonishing self-replicating patterns based on simple local 
rules. CAs may be characterized by the following three 
main attributes: 


1. Cells are updated in parallel. 

2. The value of each new cell depends only on the old 
values of the cell and its neighbors. 

3. There is no difference in rules for updating different 
cells [66.32]. 


These three attributes also hold for the particles in 
PSO. 

Henceforth, $v_i$ denotes the velocity of the $i$th particle
in the swarm, $x_i$ denotes its position, $p_i$ denotes the personal
best position, and $p_g$ is the best position found by
particles in its neighborhood. In the original PSO algorithm,
$v_i$ and $x_i$, for $i = 1,\ldots,n$, are updated according
to the following two equations [66.28]:


$$v_i \leftarrow v_i + c_1 R_1 \otimes (p_i - x_i) + c_2 R_2 \otimes (p_g - x_i) , \qquad (66.1)$$

$$x_i \leftarrow x_i + v_i , \qquad (66.2)$$

where $R_1$ and $R_2$ are independent functions returning
a vector of values, generated uniformly at random,
from the range [0, 1]. Moreover, $c_1$ and $c_2$ are the so-called
acceleration coefficients. The symbol $\otimes$ refers to
point-wise vector multiplication. As shown in (66.1),
the velocity term $v_i$ of a particle is composed of three
components: the momentum, the cognitive, and the social
terms. The momentum term $v_i$ carries the particle
toward the previous direction; the cognitive term,


$$c_1 R_1 \otimes (p_i - x_i) ,$$


represents a force that pulls the particle toward its
personal-best position; finally, the social part,


$$c_2 R_2 \otimes (p_g - x_i) ,$$


represents a force that influences the new direction 
toward the best position of neighboring particles. Var- 
ious different neighborhood topologies may be used 
for this purpose. Examples include ring, star, and von 
Neumann. The use of rather small neighborhood topolo- 
gies — such as the one induced by the von Neumann 
neighborhood — has generally been shown to lead to 
better results when rather complex problems are ad- 
dressed, whereas larger neighborhoods generally lead 
to a better performance for simpler problems [66.33]. 
Algorithm 66.2 summarizes the basic PSO algorithm. 


Algorithm 66.2 Particle swarm optimization (PSO)
1: Randomly generate an initial swarm
2: while termination conditions not met do
3:   for each particle i do
4:     if f(x_i) < f(p_i) then p_i ← x_i
5:     p_g = min(p_neighbors)
6:     Update velocity (66.1)
7:     Update position (66.2)
8:   end for
9: end while
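
As an illustration, the loop of Algorithm 66.2 with the update rules (66.1) and (66.2) might be sketched as follows. The sketch makes several assumptions that are not part of the original formulation: a star (global) neighborhood, so that $p_g$ is the best personal-best position of the whole swarm; a velocity clamp `v_max` to keep the particles from diverging; and arbitrary parameter values. The function name `pso` is hypothetical.

```python
import random


def pso(f, dim, n_particles=20, n_iters=200, c1=1.5, c2=1.5,
        lo=-5.0, hi=5.0, v_max=1.0, seed=0):
    """Minimize f with a basic PSO sketch (global-best neighborhood)."""
    rng = random.Random(seed)
    # Randomly generate an initial swarm.
    x = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    p = [xi[:] for xi in x]            # personal best positions p_i
    p_val = [f(xi) for xi in x]        # objective values of the p_i

    for _ in range(n_iters):
        for i in range(n_particles):
            # Update the personal best if the current position is better.
            fx = f(x[i])
            if fx < p_val[i]:
                p[i], p_val[i] = x[i][:], fx
            # Assumed star topology: p_g is the best personal best overall.
            g = min(range(n_particles), key=p_val.__getitem__)
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity update (66.1): momentum + cognitive + social term.
                v[i][d] += (c1 * r1 * (p[i][d] - x[i][d])
                            + c2 * r2 * (p[g][d] - x[i][d]))
                # Assumption: clamp the velocity (not part of (66.1)).
                v[i][d] = max(-v_max, min(v_max, v[i][d]))
                # Position update (66.2).
                x[i][d] += v[i][d]

    best = min(range(n_particles), key=p_val.__getitem__)
    return p[best], p_val[best]
```

A ring or von Neumann neighborhood would only change how the index `g` is chosen for each particle.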


The class of PSO algorithms is characterized by 
a multitude of different variants, rendering it impos- 
sible to mention all of them here. However, popular 
variants include the Inertia Weight PSO [66.34], fully 
informed PSO [66.33], and adaptive hierarchical parti- 
cle swarm optimizer [66.35]. Moreover, Frankenstein’s 
PSO [66.36] is a PSO variant that was created by 
analyzing the components of existing PSO variants 
and combining (some of) them in a beneficial way. 
For more information, the interested reader may con- 
sult [66.37]. 


66.2.3 Artificial Bee Colony Algorithm 


The artificial bee colony (ABC) algorithm was first 
proposed by Karaboga and Basturk in 2005 [66.38,
39]. The inspiration for the ABC algorithm is to be
found in the foraging behavior of honey bees, which 
essentially consists of three components: food source 
positions, amount of nectar and three types of honey 
bees, that is, employed bees, onlookers, and scouts. In 
short, the algorithm works as follows. Feasible solu- 
tions to the problem under consideration are modeled as 
food source positions. Moreover, the quality of a feasi- 
ble solution is modeled as the amount of nectar present 
at the corresponding food source position. Each type 
of bee is responsible for one particular operation in the 
context of generating new candidate food source po- 
sitions, that is, new candidate solutions. Specifically, 
employed bees will search in the vicinity of the food 
source position that is presently in their memory; mean- 
while they pass information about good food source 
positions to onlooker bees. Onlooker bees tend to se- 
lect good food source positions from those found by 
the employed bees, and then further search for bet- 
ter food source positions around the selected food 
source position. In case the employed bee and the 
onlookers associated with a food source position are 
not able to find a better food source position, their 
current food source position is abandoned and the em- 
ployed bee associated with this food source becomes 
a scout bee that performs a search for discovering 
new food source positions. If a scout identifies a new 
food source position, it turns into an employed bee 
again. 
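
The interplay of the three bee types can be sketched as follows. This is a simplified illustration under stated assumptions, not the exact algorithm from [66.38, 39]: the objective is assumed to be nonnegative so that nectar can be modeled as 1/(1 + cost), a food source is abandoned after more than `limit` unsuccessful trials, and all names and parameter values are hypothetical.

```python
import random


def abc_minimize(f, dim, n_sources=10, n_iters=100, limit=20,
                 lo=-5.0, hi=5.0, seed=0):
    """Simplified artificial-bee-colony sketch for minimizing f >= 0."""
    rng = random.Random(seed)

    def new_source():
        return [rng.uniform(lo, hi) for _ in range(dim)]

    sources = [new_source() for _ in range(n_sources)]
    costs = [f(s) for s in sources]
    trials = [0] * n_sources          # unsuccessful searches per source
    i0 = min(range(n_sources), key=costs.__getitem__)
    best, best_cost = sources[i0][:], costs[i0]

    def search_near(i):
        # Perturb one coordinate relative to another, randomly chosen source.
        k = rng.choice([j for j in range(n_sources) if j != i])
        d = rng.randrange(dim)
        cand = sources[i][:]
        cand[d] += rng.uniform(-1.0, 1.0) * (sources[i][d] - sources[k][d])
        cand[d] = max(lo, min(hi, cand[d]))
        c = f(cand)
        if c < costs[i]:
            sources[i], costs[i], trials[i] = cand, c, 0
        else:
            trials[i] += 1

    for _ in range(n_iters):
        # Employed bees: search in the vicinity of their own food source.
        for i in range(n_sources):
            search_near(i)
        # Onlookers: prefer sources with more nectar (lower cost).
        nectar = [1.0 / (1.0 + c) for c in costs]
        for _ in range(n_sources):
            i = rng.choices(range(n_sources), weights=nectar)[0]
            search_near(i)
        # Remember the best food source before any source is abandoned.
        i = min(range(n_sources), key=costs.__getitem__)
        if costs[i] < best_cost:
            best, best_cost = sources[i][:], costs[i]
        # Scouts: abandon exhausted sources and look for new ones.
        for i in range(n_sources):
            if trials[i] > limit:
                sources[i] = new_source()
                costs[i] = f(sources[i])
                trials[i] = 0

    return best, best_cost
```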

Essentially, the difference between the ABC algo- 
rithm and other population-based optimization tech- 
niques is to be found in the specific way of managing 
the resources of the algorithm, as suggested by the 
foraging behavior of honey bees. Due to its simplic- 
ity and ease of implementation, the ABC algorithm 
has attracted much attention recently. It should also
be mentioned that, although the algorithm was initially
introduced for continuous optimization, it has since
been adapted to combinatorial
optimization problems as well [66.40, 41].
For a recent survey, we refer to [66.42]. 


66.2.4 Other SI Techniques for Optimization and Management Tasks


In the following, we briefly mention other applica- 
tions of SI techniques for optimization and management 
tasks, the latter especially for what concerns distributed 
environments. They are grouped with respect to their 
natural inspiration. 


Division of Labor (Ants/Wasps) 

In colonies of ants and wasps, for example, there are 
various tasks to be dealt with by the colony members. 
However, the urgency to engage in certain tasks may 
change over time. In 1984, Wilson [66.43] showed that 
the concept of division of labor in colonies of Pheidole 
genus ants allows the colony to adapt to these changing 
demands. Division of labor was later modeled in [66.44, 
45] by means of response threshold models. 

These models were later used in several techni- 
cal applications. In the following, we mention a few 
of them. Nouyan et al. [66.46] consider static and dy- 
namic task allocation problems in which trucks have 
to be painted in painting booths. Another applica- 
tion concerns media streaming in peer-to-peer net- 
works [66.47]. A multiagent system for the schedul- 
ing of dynamic job shops with flexible routing and 
sequence-dependent setups is considered in [66.48]. 
Merkle et al. [66.49] made use of a response threshold 
model for self-organized task allocation in the context 
of computing systems with reconfigurable components. 
Finally, a system for task allocation in distributed
environments is presented in [66.50].


Cemetery Formation (Ants) 

The term cemetery formation refers to a behavior which 
has been observed in ant colonies of the species Phei- 
dole pallidula, among others, which cluster the bodies 
of dead nest mates. This self-organized behavior has 
given rise to several applications, especially in the con- 
text of clustering and sorting. In 1991, a model for the 
clustering and sorting behavior of ants was published 
in [66.51]. Note that clustering refers in this context to 
the formation of piles, and sorting, on the other hand, 
refers to the spatial arrangement of objects according to 
their properties. 

Mainly based on the model from [66.51], several 
algorithms for clustering and sorting were proposed in 
the literature. The first one was presented in [66.52], 
extending the original model to handle numerical data. 
More recent papers include [66.53] which deals with 
clustering and topographic mapping. Finally, the ceme- 
tery formation behavior of ants has also inspired an 
algorithm for dynamic load balancing [66.54]. 


Flashing in Fireflies 
Fireflies are winged beetles that make use of biolumi- 
nescence to attract mates or prey. Moreover, tropical 
fireflies, in particular the ones from Southeast Asia, 
synchronize their light flashes in large groups of indi- 
viduals. This is a self-organized phenomenon which is 
mathematically described by the so-called phase-coupled
oscillator models [66.55]. The benefits of this self-
synchronization are not yet fully understood. Current 
hypotheses consider diet, social interaction, and alti- 
tude. 

The literature contains, at least, two types of tech- 
nical applications that are inspired by different aspects 
of the flashing of fireflies. First, there are applications 
that require some type of self-synchronization. Exam- 
ples include, but are not limited to, a synchronization 
protocol in sensor networks [66.56], the synchroniza- 
tion in overlay networks [66.57], and dynamic pricing 
in online markets [66.58]. Second, the literature offers 
the so-called firefly algorithm (FA) [66.59], which is in- 
spired by the way in which fireflies attract mates or prey. 
This algorithm was initially introduced for continuous 
optimization. It has, however, been adapted for the ap- 
plication to combinatorial optimization as well [66.60]. 


Fish Schooling 
A group of fish that have gathered is commonly called
an aggregation of fish. Such a fish aggregation is called 
unstructured in the case in which the group consists of 
various species of fish having randomly gathered, for 
example, in the vicinity of a food source. If there is 
some social component to this gathering, the fish are 
said to be shoaling. Shoaling fish are aware of each 
other’s presence, adjusting, for example, their swim- 
ming behavior to each other in order to stay together. 
However, their relation is rather loose. If, in contrast, 
an aggregation of fish is more tightly organized, for ex- 
ample, when all fish move at the same speed in the same 
direction, then the aggregation is said to be school- 
ing. Schooling is a self-organized behavior that results 
from local interactions between the fish. This behav- 
ior comes with several advantages such as providing 
a means for social interactions, more successful forag- 
ing, and predator avoidance. 

There are basically two different algorithms for optimization
based on fish schooling to be found in the
literature. The first algorithm is referred to as the artificial
fish swarm algorithm (AFSA). It has, for example,
been applied to the training of feed-forward neural
networks [66.61], multiuser detection [66.62], image
segmentation [66.63], and generally to continuous optimization
[66.64]. The second algorithm is known as
fish school search [66.65].


Self-Desynchronized Croaking (Japanese Tree Frogs)
Different biological studies, for example, [66.66],
have dealt with the croaking of Japanese tree frogs. The
male individuals make use of their croaks in order to attract
females. Moreover, females of this family of frogs
can recognize the source of such a croak and are able
to determine the current location of the corresponding
male. However, this is only possible if no two frogs (that
are close enough to the female) croak at the same time.
In such a case, the female is not able to detect where the
croaks came from. This is why, over time, male frogs
evolved a self-organized way of desynchronizing their
croaks. Aihara et al. [66.67] introduced a first formal
model based on a set of pulse-coupled oscillators for
capturing this behavior. So far, this model has only been
applied to distributed graph coloring [66.68, 69]. However,
the algorithm proposed in [66.69] is currently the
state of the art for this problem.


Nest Building (Termites/Wasps)

Both termites and wasps build highly complex nests
in cooperation. The construction of such nests is well
beyond the capabilities of an individual insect. The
nests of both termites and wasps have a very complex
internal structure. Moreover, termite nests are
extremely large in comparison to individual insects.
Scientists studying the nest-building behavior came
up with probabilistic models for describing (parts of)
the behavior [66.70]. It is nowadays generally accepted
that stigmergy plays a central role in nest
building.

Models for nest building based on stigmergy have
been used mainly in software tools for simulating the
automated building of certain structures. Examples can
be found in [66.71-74].


Swarm robotics refers to the study and use of SI techniques
for the coordination of groups of robots. The
following sections provide a brief overview of this field,
with a focus on swarm robotic systems and the tasks
they accomplish.


66.3.1 Systems 


In the late 1940s, Walter [66.75] built two autonomous 
robots called Machina speculatrix, or simply tortoise, 
which exhibited behaviors resembling those of simple 
animals. The robots had a driving/steering mechanism,
a head light, a photoreceptor, and a bump sensor. They 
were designed to search for and approach light sources 
of moderate intensity. If a robot observed such a source, 
its head light was turned off, otherwise it was turned on. 
In an experiment, the robots were set up in a dark envi- 
ronment, where they approached each other exhibiting 
complex motion patterns. Such mutual recognition al- 
lowed a population of machines to form a sort of 
community, which broke up once an external light 
source was introduced [66.75, p. 129]. This two-robot 
system may be the first self-organizing multirobot sys- 
tem. Interestingly, even a single robot was reported to 
exhibit complex interactions when facing its mirror im- 
age — such a behavior, if observed in an animal, might 
be accepted as evidence of some degree of self-aware- 
ness [66.75, pp. 128-129]. 

In the 1950s, inspired by von Neumann’s kinematic 
model of machine replication [66.76], the first physi- 
cal models of self-replication were built. Penrose and 
Penrose [66.77] studied a system in which passive me- 
chanical parts move on a linear track when the latter is 
subjected to side-to-side agitation. In their default posi- 
tion, the parts do not link under the influence of shaking 
alone. If a seed object composed of two complementary 
parts, one hooked up to the other, is added, it repli- 
cates by interacting with the other parts on the track. 
Jacobson [66.78] implemented a system in which self- 
propelled electromechanical parts move on a circular 
track with several branches. A seed object composed 
of two parts could trigger other parts to assemble into 
identical objects without human intervention. 

In the late 1980s, studies of Fukuda and Nak- 
agawa [66.79-81], Beni [66.5], and Wang and 
Beni [66.82] provided an enormous impetus for the field 
that developed into swarm robotics. Fukuda and Naka- 
gawa proposed a novel type of robotic system called dy- 
namically reconfigurable robotic system (DRRS), which 
can dynamically reorganize its shape and structure 
[...] for a given task and strategic purpose. DRRS 
is made of several cells with built-in intelligence and 
the ability to autonomously connect to and detach from 
one another [66.81, pp. 55-56]. The authors also pre- 
sented a first prototype of this system, the CEBOT 
Mark I [66.80]. At the same time, Beni introduced 
the term cellular robotic system, referring to a sys- 
tem that can encode information as patterns of its 
own structural units [66.5, p. 59]; the units would be 
structural elements, each with built-in intelligence, able 
to move in space and act asynchronously under dis- 
tributed control. Beni and Wang also used the terms 


swarm and swarm intelligence in this context [66.83, 
84]. 

Other early physical implementations of distributed 
robotic systems include the CEBOT Mark II [66.85], 
ACTRESS [66.86], and GOFER [66.87]. 


Hardware Architectures 
Advances in technology, for example, in computers,
manufacturing, and mobile devices, have made it affordable
to study swarms of around 20-1000 physical
robots [66.88] and up to around 1 000 000 robots
in simulation [66.93-95]. At present, most swarm
robotic systems consist of mobile robots that operate 
on the ground. An example is the Kilobot platform 
(Fig. 66.2a), which was designed to facilitate the fab- 
rication and operation of thousands of robots — includ- 
ing their charging, programming and activation all at 
once [66.88]. Other state-of-the-art robotic systems in- 
clude the r-one (Fig. 66.2b), which features, among 
others, a set of IR transmitters and receivers for com- 
munication and relative localization [66.89], and the 
Khepera I-IV [66.96] and e-puck [66.97], which fea- 
ture a range of sensors including a camera. Increasingly, 
swarm robotic systems operate in spaces other than on 
the ground, such as underwater [66.90, 98] (Fig. 66.2c) 
or in the air [66.99, 100]. In some robotic systems, 
the swarms operate and collaborate across multiple 
spaces, such as on the ground and in the air [66.91, 101] 
(Fig. 66.2d,e). 

According to their system architecture, most swarm 
robotic systems can be categorized into either multi- 
robot systems or modular reconfigurable robot systems. 
Multirobot systems are composed of multiple distinct 
robots, which are typically mobile and able to perform 
(collectively) more than one task in parallel (Fig. 66.2a-c).
Modular reconfigurable robot systems are composed
of component modules that can be physically linked 
together to form a robot (Fig. 66.2f). A few hybrid sys- 
tems exist, sharing properties of both multirobot and 
modular reconfigurable robot systems [66.91, 102-104] 
(Fig. 66.2d). 

Of particular interest among systems of modular re- 
configurable robots are those where the robots can build 
themselves [66.105, 106]. The term self-reconfigurable 
denotes the general ability of physical modules to re- 
configure themselves, regardless of whether the process 
is centrally controlled, for example, by an external com- 
puter, or decentralized and autonomous. In the follow- 
ing, we use the term self-assembly to refer to processes 
by which pre-existing components (separate or distinct 
parts of a disordered structure) autonomously organize
into patterns or structures without external intervention.
Self-assembly is responsible for the generation of
much of the order in nature [66.107] and has widely
been applied in the synthesis of products from molecular
components. Increasingly, the potential of self-assembly
processes involving larger components (up
to the centimeter scale) is being recognized [66.108].
In robotic systems, two distinct classes of self-assembling
systems exist [66.109]: (i) systems in which the
components that self-assemble are externally propelled,
and (ii) systems in which the components that self-assemble
are self-propelled.

Fig. 66.2a-f Examples of swarm robotic systems: (a) Kilobots developed by Harvard University [66.88]; (b) r-one
(after [66.89], photo courtesy of J. McLurkin, Rice University); (c) Lily developed in the CoCoRo project (after [66.90],
photo courtesy of T. Schmickl, University of Graz); (d,e) a heterogeneous system studied in the Swarmanoid project
(after [66.91], photo courtesy of M. Dorigo, Université Libre de Bruxelles); (f) Pebbles (after [66.92], photo courtesy of
D. Rus, MIT)


Sensing and Communication 
In most multirobot systems, robots interact with each 
other by using their sensors or some form of com- 
munication. Dudek et al. [66.110] presented a detailed 
taxonomy considering communication range, topology, 
and bandwidth. In the following, we adopt a simpler 
categorization proposed by Cao et al. [66.111]: 


• Interaction via environment refers to the transfer
of information that is mediated through the mem- 
ory of the environment. In this case, robots leave 
persistent signs that stimulate the activity of other 
robots. This kind of indirect communication is also 
referred to as stigmergy [66.16]. Stigmergic com- 
munication is widely used in social insect societies, 
for example, during the construction of mounds by 


termites of Macrotermes bellicosus [66.8], and has 
been implemented in several swarm robotic sys- 
tems [66.112-116]. 

• Interaction via sensing refers to local interactions
that occur between agents as a result of agents 
sensing one another, but without explicit communi- 
cation [66.111, p. 12]. We include in this category 
interactions where agents sense each other indi- 
rectly, that is, where the current presence or motion 
of another agent can be inferred from changes in the 
environment. Note that the boundary to stigmergic 
communication is blurred; for example, consider 
the situation where multiple agents push an object 
simultaneously [66.117-119].

In some social animals, the members of a group observe
a common leader individual. Their actions can
be highly dependent on the observed behavior of 
the leader, as, for instance, during an attack of the 
group [66.120]. In other animals, no recognizable 
leader individual exists; instead, individuals observe 
nearby group members. The latter situation is typi- 
cal for swarm systems. It is reported, for instance, 
for animal groups that exhibit herding, flocking, 
and schooling behavior [66.8]. Note that where the 
groups are not homogeneous, even a minority of in- 
dividuals may be able to influence the rest of the 
group [66.121]. 

In principle, interaction via sensing can be considered
an implicit form of communication, in particular,
as an observed agent can change its action
and thereby influence the behavior of its observers.
Arkin [66.122] referred to the interaction via sensing
category as cooperation without communication,
and showed that it is sufficient to accomplish
tasks that require the cooperation of multiple
robots. Other examples of swarm robotic studies using
interaction via sensing include [66.123-126].

• Interaction via communication refers to interac-
tions involving explicit communication. Thereby, 
information is either broadcast or transferred to 
specific teammates. Information transfer can take 
place through direct physical interactions, such as 
touch. This latter form of communication can also 
be referred to as direct interaction [66.127]. Ex- 
plicit communication can improve the performance 
of a multirobot system. This is typically the case 
where the system benefits from robots being re- 
cruited to certain areas of the environment. Balch 
and Arkin [66.128] studied such an environment 
and showed that it can be sufficient for each robot 
to signal its overall state. The transfer of more
elaborate information, however, would not result in
any significant increase in task performance. Ex- 
plicit communication is commonly used in modular 
reconfigurable robot systems, for example, to ex- 
change information between inter-connected mod- 
ules or to support the docking process of separate 
modules [66.129]. 


Control and Coordination 
Over the last two decades, a range of design methods 
have been proposed for the control of swarm robotic 
systems. They can be broadly classified into behavior- 
based design methods and automated design meth- 
ods [66.130]. 

In behavior-based design methods, the user ap- 
proaches the problem in a bottom-up manner [66.131]. 
A repertoire of behaviors for individual robots is de- 
fined and often refined through a trial-and-error process. 
A common approach is the use of finite state machines. 
Each state defines a basic behavior. Transitions between 
states can be triggered by probability, external events, 
time-outs, and combinations of these [66.132-134]. 
A prominent example is the use of response threshold 
functions, for example, 


$$1 - \exp\left(-\frac{s_i}{\theta_i}\right) \quad \text{or} \quad \frac{s_i^2}{s_i^2 + \theta_i^2} ,$$


which define the probability for an individual to engage
in task $i$ based on the perceived task stimulus $s_i$ and
threshold $\theta_i$. The particular threshold value $\theta_i$ can either
be fixed for each individual from the outset [66.135] or 
learned during its lifetime [66.136, 137]. In both cases, 
the mechanism can facilitate the emergent allocation of 
tasks in groups of otherwise identical individuals (see 
also Sect. 66.2.4). In addition, intentional approaches 
to task allocation have been considered [66.138, 139]. 
These require the agents to cooperate explicitly with 
each other. For example, the decentralized ALLIANCE 
algorithm [66.140, 141] can be used for groups of heterogeneous
robots to perform tasks and subtasks, which
may have ordering dependencies, in a fault-tolerant 
way. It assumes that the robots detect with some prob- 
ability the effect of their own actions as well as the 
actions of other team members. 
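
To make the response threshold mechanism concrete, the following sketch assumes two commonly used forms of the response function, an exponential one and a quadratic one, and shows how individuals with different fixed thresholds respond to the same stimulus. The function names and the numerical values are hypothetical and chosen only for illustration.

```python
import math
import random


def response_exponential(s, theta):
    """Engagement probability 1 - exp(-s / theta), for s >= 0 and theta > 0."""
    return 1.0 - math.exp(-s / theta)


def response_quadratic(s, theta):
    """Engagement probability s^2 / (s^2 + theta^2)."""
    return s ** 2 / (s ** 2 + theta ** 2)


def engages(s, theta, rng, response=response_quadratic):
    """Sample whether an individual engages in the task at stimulus s."""
    return rng.random() < response(s, theta)


def count_engagements(s, theta, n_trials, seed=0):
    """Count how often an individual with threshold theta engages."""
    rng = random.Random(seed)
    return sum(engages(s, theta, rng) for _ in range(n_trials))
```

With the quadratic form, the probability equals 0.5 when the stimulus equals the threshold, so an individual with a low threshold engages at stimulus levels that leave a high-threshold individual almost always idle. This is what allows otherwise identical individuals to specialize.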

Virtual potential fields [66.142, 143] and physicomimetics
[66.144] are another widely used class of
behavior-based design methods. The robots mimic a physical
particle under the influence of a potential field. The latter
guides the particle toward a point of minimal poten- 
tial energy. While the goal point, which the robot shall 
reach, would exert an attractive force on the particle, 
any obstacle would exert a repulsive force. Other robots 
can exert forces on the particle as well. Using this con- 
cept, a wide repertoire of behaviors can be realized, 
such as the collective movement of robots arranged in 
particular formations [66.145], or the tracking of mul- 
tiple moving targets [66.146]. The properties of the 
resulting swarm systems, for example, the cohesion of 
the swarm, can also be formally analyzed [66.147]. 
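
The following sketch illustrates the idea for a single robot in the plane. It assumes a quadratic attractive potential toward the goal and a repulsive potential that is active only within a distance `d0` of an obstacle; this is one common textbook choice and not necessarily the formulation used in the cited works. All names and gains are hypothetical.

```python
import math


def potential_field_force(pos, goal, obstacles, k_att=1.0, k_rep=1.0, d0=1.0):
    """Return the virtual force (fx, fy) acting on a robot at pos."""
    # Attractive force: pulls the robot toward the goal point.
    fx = k_att * (goal[0] - pos[0])
    fy = k_att * (goal[1] - pos[1])
    # Repulsive forces: push the robot away from nearby obstacles.
    for ox, oy in obstacles:
        dx, dy = pos[0] - ox, pos[1] - oy
        d = math.hypot(dx, dy)
        if 0.0 < d < d0:
            mag = k_rep * (1.0 / d - 1.0 / d0) / d ** 2
            fx += mag * dx / d
            fy += mag * dy / d
    return fx, fy


def step(pos, goal, obstacles, gain=0.05):
    """Move the robot a small step along the virtual force."""
    fx, fy = potential_field_force(pos, goal, obstacles)
    return (pos[0] + gain * fx, pos[1] + gain * fy)
```

Other robots can be handled in the same way as obstacles, or with an additional attractive term when cohesion of the swarm is desired.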

Other design methods include the Growing 
Point Language [66.148], the Origami Shape Lan- 
guage [66.149], and Proto [66.150]. These languages 
were developed in the context of Amorphous Comput- 
ing [66.151], which considers systems of massively 
distributed, disordered, asynchronous, and locally 
interacting computational devices. The Proto language 
has been extended for use on mobile devices. This 
extension was validated with a swarm of 40 iRobot 
robots [66.152]. Some amorphous computing ap- 
proaches allow users to specify desired global system 
properties in the language. A compiler then produces 
the local rule set for the agents to achieve these 
properties [66.149]. 

Automated design methods can be grouped into 
reinforcement learning and evolutionary robotics. In re- 
inforcement learning [66.153], an agent interacts with 
its environment by choosing actions and receiving rewards.
Matarić [66.154, 155] applied reinforcement
learning in a swarm robotic context. The robots had to
learn how to collaborate in a foraging task. The robots
were provided with a set of hand-coded behaviors (as
in a behavior-based approach) and were required to 
learn how to correlate appropriate conditions for each 
of these behaviors in order to optimize the higher-level 
behavior [66.155]. The difficulties of using reinforce- 
ment learning in a swarm robotic context are discussed 
in [66.130]. A recent survey of reinforcement learning 
in robotics is reported in [66.156]. 

Evolutionary robotics is an approach to design- 
ing robots, or aspects of them (e.g., morphology, 
control) using evolutionary algorithms [66.157, 158]. 
This approach can also be applied to the design of 
swarm robotic systems. In principle, evolution can 
bypass both the problem of decomposing a given 
task and the problem of identifying basic behaviors 
that achieve the subtasks [66.159, 160]. Early studies 
in evolutionary robotics developed collective behavior
such as herding or flocking in simplistic simulation
environments [66.161-163]. Simulation environments
with physically embodied agents were considered
in [66.159], where neural network controllers for
aggregation were first evolved using a group of five 
robots in a simple simulation environment; the best 
of these controllers were subsequently validated us- 
ing a more detailed simulation model of the robots. 
Quinn et al. [66.164] evolved neural network controllers
for collective motion using a group of three
simulated robots and subsequently tested the best-rated 
network in 100 trials with a group of three physical 
robots. Watson et al. [66.165] went a step further in that 
controllers for a simple phototaxis task were directly 
evolved on a group of eight physical robots. Working 
toward a distributed evolution of robot morphologies in 
hardware, Griffith et al. [66.166] demonstrated a sys- 
tem of template-replicating polymers, which were made 
of reconfigurable modules that slid passively on an 
air table and executed a finite state machine to con- 
trol their connectivity. Recent work on evolutionary 
swarm robotics considers cultural evolution, for ex- 
ample, where behaviors that can be imitated (memes) 
are subject to an evolutionary process. In this setting, the
robots act as both teachers and learners to exchange
memes [66.167].

Several design methods were developed specifically 
for, or mainly adopted in, the context of modular re- 
configurable robot systems. One class of algorithms 
addresses the problem of how to adjust the relative po- 
sitions of modules without changing their connection 
topology. Yim [66.168] proposed the use of gait con- 
trol tables to produce a range of animal-like locomotion 
patterns, such as the walking gaits of hexapods. Each
gait control table specifies for each control cycle and
module a basic action to be performed. The controller 
is executed either from a central place or in a distributed 
fashion. In the latter case, the modules synchronize their 
actions using internal timers. Shen et al. [66.169] pro- 
posed hormone-inspired communication and control, in 
which artificial hormones help modules to synchronize 
actions and discover changes in their topology. For ex- 
ample, a set of independent caterpillar-like robots could 
be connected into a single entity, which would adapt 
its gait to the new topology. In a similar experiment, 
a connected entity was manually split into smaller enti- 
ties that continued to move as independent caterpillars. 
Støy [66.170] proposed a role-based control algorithm 
to let modular robots display periodic locomotion pat- 
terns. A module’s role specifies its actions and how 
to synchronize them with neighbor modules. For com- 
munication, a parent-child architecture is used; thus, 
modules need to be arranged in acyclic graphs. An ex- 
tended version of the control algorithm can also cope 
with cycles. 
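
A gait control table of the kind described above can be pictured as a lookup indexed by control cycle and module. The following minimal Python sketch is illustrative only: the action names, the table contents, and the helper functions `step_actions` and `module_action` are assumptions and are not taken from [66.168].

```python
# Hypothetical gait control table: rows are control cycles, columns are modules.
# Each entry names the basic action a module performs in that cycle.
GAIT_TABLE = [
    ["lift",  "hold",  "hold"],
    ["swing", "lift",  "hold"],
    ["lower", "swing", "lift"],
    ["hold",  "lower", "swing"],
]


def step_actions(table, cycle):
    """Centralized variant: return the actions of all modules for one cycle.

    The table is replayed periodically, so the cycle index wraps around.
    """
    return table[cycle % len(table)]


def module_action(table, module_id, local_timer):
    """Distributed variant: a module reads its own column using its internal
    timer, which is assumed to be synchronized with the other modules."""
    return table[local_timer % len(table)][module_id]
```

In the distributed variant, correct behavior relies entirely on the timers staying synchronized, which is the role played by the internal timers mentioned in the text.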

Another class of algorithms addresses the problem 
of how to adjust the relative positions of modules by 
changing the connection topology [66.106]. One ap- 
proach is to formulate the problem as a search problem. 
For example, in order to reconfigure a lattice-based 
robot from one topology to another, a graph search 
is performed, where the start node of the graph cor- 
responds to the initial topology of the robot and the 
end node corresponds to the desired topology of the 
robot [66.171]. Due to the combinatorial explosion of 
possibilities, an exhaustive search of such graphs is im- 
practical whenever the number of modules is not small. 
State-of-the-art approaches are thus heuristic and con- 
sider ways of reducing the problem complexity. For 
example, Yoshida et al. [66.172] proposed a two-level 
motion planner. A global planner ensures that the robot 
as a whole follows a predefined 3D trajectory. To do 
so, it specifies several candidate paths that bring indi- 
vidual modules from the tail to the head of the robot. 
A motion scheme selector chooses a feasible path for 
each module based on a rule database. Another exam- 
ple is to merge logically a group of nearby modules into 
meta-modules, which, typically, have more advanced 
locomotion abilities than the individual modules. The 
problem is then reduced to developing controllers for 
both meta-modules and modular robots composed of 
meta-modules [66.173]. In principle, modular robots 
can solve the search problem on the fly [66.174]. Other 
than by search, the reconfiguration problem can also 
be attempted by local movement strategies, for example,
random walks [66.175, 176], cellular automata
rules [66.177], gradient rules [66.178, 179], or combi- 
nations of these [66.180]. These approaches naturally 
lead to decentralized implementations, as is desired in 
swarm robotics. 
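
The search formulation of reconfiguration can be sketched as a breadth-first search over configurations. The Python sketch below is a toy model under strong assumptions: modules occupy cells of a 2D square lattice, a move relocates one module to any free cell adjacent to the remaining structure, and the structure must stay connected. The names `successors` and `plan` are hypothetical, and the move model ignores the kinematic constraints of real modules. If the goal is unreachable (for example, because the module counts differ), the search does not terminate on the unbounded lattice.

```python
from collections import deque

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def is_connected(cells):
    """Check 4-connectivity of a set of lattice cells (empty set counts as connected)."""
    cells = set(cells)
    if not cells:
        return True
    start = next(iter(cells))
    seen, stack = {start}, [start]
    while stack:
        x, y = stack.pop()
        for dx, dy in NEIGHBORS:
            c = (x + dx, y + dy)
            if c in cells and c not in seen:
                seen.add(c)
                stack.append(c)
    return len(seen) == len(cells)


def successors(config):
    """Yield all configurations reachable by relocating a single module."""
    for m in config:
        rest = config - {m}
        if not is_connected(rest):
            continue  # removing m would split the robot
        for (x, y) in rest:
            for dx, dy in NEIGHBORS:
                c = (x + dx, y + dy)
                if c not in config:
                    yield frozenset(rest | {c})


def plan(start, goal):
    """Breadth-first search from the initial to the desired configuration."""
    start, goal = frozenset(start), frozenset(goal)
    parent = {start: None}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == goal:
            path = []
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        for nxt in successors(cur):
            if nxt not in parent:
                parent[nxt] = cur
                queue.append(nxt)
    return None
```

Even in this toy model the number of visited configurations grows rapidly with the number of modules, which illustrates why exhaustive search is impractical for all but small robots.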


66.3.2 Tasks 


A range of capabilities have already been demonstrated 
with swarm robotic systems. In the following, a brief 
overview is given. More detailed information is pro- 
vided in Chaps. 71-74 of Part F of this handbook. 
Garnier et al. [66.189] demonstrated how a group of 20 
Alice robots aggregate in a homogeneous environment. 
The robots mimic the aggregation behavior of cock- 
roaches, which are reported to join and leave clusters 
with probabilities that depend on the sizes of clus- 
ters [66.190]. Such probabilistic algorithms have the ad- 
vantage that, as long as the environment is bounded, it 
is not required that the robots initially form a connected 
graph in terms of their sensing and/or communication. 
A deterministic algorithm for aggregation is considered 
in [66.181]. It requires robots to have one binary sen- 
sor, which informs them whether or not there is another 
robot in their line of sight. The robots do not need mem- 
ory and do not need to perform arithmetic computation. 
They rotate on the spot when they perceive another 
robot, and move backward along a circular trajectory 
otherwise. This algorithm was validated with groups of 
40 e-puck robots (Fig. 66.3a). 
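
The deterministic controller described above needs no memory and no arithmetic: it maps the binary sensor reading directly to a pair of wheel speeds. A minimal Python sketch follows. The function name `wheel_speeds` and the normalized speed values are assumptions chosen for illustration and are not the parameters reported in [66.181].

```python
def wheel_speeds(sees_robot, v_back=(-0.7, -1.0), v_spin=(1.0, -1.0)):
    """Memoryless mapping from the binary line-of-sight sensor to
    (left, right) wheel speeds of a differential-drive robot.

    v_spin: equal and opposite speeds, so the robot rotates on the spot.
    v_back: two different negative speeds, so the robot moves backward
            along a circular trajectory.
    """
    return v_spin if sees_robot else v_back
```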

Werfel et al. [66.116] developed a system of robots 
that can simultaneously construct and navigate struc- 
tures from a supply of building blocks (Fig. 66.3b). The 
robots are inspired by termites, which use stigmergic 
rules to construct sophisticated structures, in particular, 
the mounds they inhabit. Given a desired target struc- 
ture, it is possible to generate automatically a set of 
rules to be uploaded onto each robot. Using only local 
information, these rules allow the robots to coordinate 
their activities in a way that avoids conflict. A group of 
three robots constructed several structures, one resem- 
bling a castle. 

Halloy et al. [66.182] showed that hybrid societies 
comprising both cockroaches and robots can collec- 
tively decide to aggregate under either of two shel- 
ters (Fig. 66.3c) and that it is possible for the robots 
to influence the decision-making process. In general, 
such interactive robots could be used to study and 
control animal groups [66.182, 191], including live- 
stock [66.192, 193], and to inform ecological conserva- 
tion policy. 


Following the pioneering simulation works on 
boids [66.30], Turgut et al. [66.183] demonstrated how 
a group of robots can flock through a real environment 
using simple rules. To align with each other, the robots 
used virtual heading sensors, each comprising a digital 
compass and a wireless communication module. Flock- 
ing was demonstrated with 9 Kobot robots in a bounded 
environment (Fig. 66.3d). 

Krieger et al. [66.184] studied algorithms that allow
a group of robots to forage (Fig. 66.3e). The robots
rested in a central place, the nest. A robot would leave 
the nest if the total energy of the colony dropped below 
a threshold. Each robot had its own threshold, which ef- 
fectively enabled the division of labor within the group. 
In addition, a robot would leave the nest when being 
recruited by another robot that had found a cluster of 
food. The pair of robots would then perform a tandem 
run to reach the cluster. The algorithms were tested on 
groups of up to 12 Khepera robots. The groups were 
reported to perform more efficiently when employing 
the division of labor and recruitment mechanisms than 
without such mechanisms. 
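
The threshold rule behind this division of labor can be stated in a few lines. The Python sketch below is a simplified illustration; the function names and the numeric values in the usage note are assumptions and do not reproduce the experimental settings of [66.184].

```python
def should_leave_nest(colony_energy, own_threshold, recruited=False):
    """A robot leaves the nest if it has been recruited, or if the colony
    energy has dropped below the robot's individual threshold."""
    return recruited or colony_energy < own_threshold


def active_foragers(colony_energy, thresholds):
    """Indices of the robots that would leave the nest at this energy level."""
    return [i for i, t in enumerate(thresholds)
            if should_leave_nest(colony_energy, t)]
```

With thresholds of 10, 20, and 30 energy units, for example, only the third robot forages while the colony energy is 25, and all three forage once it falls to 5. Heterogeneous thresholds thus let the number of active foragers track the colony's needs.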

Groß et al. demonstrated how a group of 16
s-bot robots self-assemble into a single composite entity
[66.185]. The process was seeded by one of the
robots activating its light emitting diode (LED) ring
in red. Other robots activated their LED rings in blue.
Once a robot connected to the seed structure, it
switched its LED ring to red too, thereby attracting other
robots to the structure as it grew (Fig. 66.3f). The problem
of self-assembling into arbitrary morphologies of s-bot
robots was considered in [66.194].

Holland and Melhuish [66.186] studied algorithms 
that allow groups of robots to sort (and cluster) ob- 
jects of different types (Fig. 66.3g). Six robots were 
programmed using simple rules, which regulated the 
conditions under which objects of different types were 
picked up and deposited. 

Following the pioneering work of Kube
et al. [66.195, 196], Chen et al. [66.187] proposed
an algorithm for a group of robots to transport ob- 
jects larger than themselves toward a goal location 
(Fig. 66.3h). The robots were programmed to only 
push the object across the portion of its surface where 
the direct line of sight to the goal is occluded by the 
object. The algorithm was proven to work for objects 
of arbitrary convex shape and it was tested with 20 
e-puck robots. 
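
The push rule can be illustrated geometrically. The sketch below assumes, for simplicity, a disc-shaped object and a robot located on its boundary; under this assumption the line of sight to the goal is occluded exactly when the direction to the goal points into the disc. The function `should_push` and the disc assumption are illustrative and are not part of the algorithm as published in [66.187], which covers arbitrary convex shapes.

```python
def should_push(robot_pos, object_center, goal_pos):
    """Return True if a robot on the boundary of a disc-shaped object
    cannot see the goal because the object itself is in the way."""
    # Outward normal of the disc at the robot's position.
    nx = robot_pos[0] - object_center[0]
    ny = robot_pos[1] - object_center[1]
    # Direction from the robot to the goal.
    dx = goal_pos[0] - robot_pos[0]
    dy = goal_pos[1] - robot_pos[1]
    # The sight line enters the disc iff it points against the outward normal.
    return nx * dx + ny * dy < 0
```

Robots on the far side of the object push, whereas robots on the side facing the goal do not, so the net force points roughly toward the goal.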

Ijspeert et al. [66.188] studied an algorithm that allows
a group of robots to pull sticks out of the ground
collaboratively (Fig. 66.3i). Upon encountering a stick,
a robot would only be able to pull it partially out of
the ground. It would then wait for a second robot to
arrive and pull the stick out completely. The optimal
waiting time for the first robot was derived from an analytic
model of the system. The algorithm was validated
using a system of six Khepera robots.

Fig. 66.3a-i Examples of capabilities demonstrated by swarm robotic systems: (a) aggregation (after [66.181]); (b)
construction (after [66.116]; reprinted with permission from AAAS); (c) decision making (after [66.182]; photo courtesy
of J. Halloy, Université Libre de Bruxelles); (d) flocking (after [66.183]; photo courtesy of E. Sahin, Middle East Technical
University); (e) foraging (after [66.184]; photo courtesy of L. Keller, University of Lausanne); (f) self-assembly
(after [66.185]); (g) sorting of objects (after [66.186]; photo courtesy of C. Melhuish, Bristol Robotics Laboratory); (h)
transport of objects (after [66.187]); (i) pulling sticks out of the ground (after [66.188]; reprinted with permission from
Springer)


66.4 Research Challenges


Research challenges concerning the use of swarm intelligence
in optimization are mainly related to increasing
the efficiency of the approaches. More specifically, in addition to
providing an innovative way of problem solving, swarm
intelligence approaches must also be efficient concerning,
for example, computation time in order to be
able to compete with state-of-the-art optimization techniques.
This may often be achieved by hybridizing
swarm intelligence approaches with components taken
from optimization algorithms in other fields such as,
for example, operations research. The interested reader
may find various references to such techniques
in [66.197].

With regard to swarm robotics, a major challenge 
is the transition from systems operating in structured 
indoor environments, as typically found in laborato- 
ries, to the more complex environments found in the 
real world. Over the next decades, swarms of robots
are expected to have an impact in a range of application
scenarios, including cognitive factories, deep sea exploration,
disaster management, precision farming, and
space systems. Working toward more complex environments
also concerns the ability of swarms of robots to
interact safely with humans. Another challenge concerns
the miniaturization of swarm robotic systems.
Most of the current systems comprise centimeter-sized
robots. The swarm robotics approach, however,
should be equally applicable to intelligent autonomous 
devices operating at scales from a millimeter down to 
a micrometer. This could have profound implications, 
for example, on advanced materials and healthcare 
technologies.


[...] transonic airfoil design is the often excessive
number of computational fluid dynamic (CFD)
simulations required to ensure convergence. In
this study, a multiobjective particle swarm op- [...]
[...] Stokes (RANS) solver to assess the performance
of the solutions. The successful integration of
these design tools is facilitated through the ref- [...]


67.1 Airfoil Design


Airfoil design originates from an understanding of the 
fundamental physics of flight, where the aim is to 
identify or conform to the best possible shape for the 
given operating requirements. It has evolved from the 
use of wind tunnel catalogs and traditional cut-and- 
try methods to automated computational frameworks. 
While automated frameworks effectively simplify the 
design process, success is still largely dependent on 
the fidelity of the computational methods, as well as 
the experience of the designer in formulating the problem
[67.1]. This section is devoted to a discussion
of airfoil design optimization architecture. The concepts
that are especially applicable to this study are
introduced, laying the foundations for the proposed 
methodology. 


67.1.1 Airfoil Design Architecture 


The direct method of airfoil design, pioneered by the
work of Hicks and Henne [67.2], refers to the philosophy
of using mathematical optimization methods to
identify the optimal shape that achieves the prescribed
design criteria. The generalized framework for an aerodynamic
shape optimization process is demonstrated in
Fig. 67.1. The success of the direct approach is essentially
dependent on three main components within the
design loop:


• Shape parameterization. All design strategies share
the common requirement that the geometry is represented
by a finite number of design variables.
A method to mathematically parameterize shapes
is, therefore, required so that modifications can be
made via direct manipulation of the design variables.
The number of design variables is directly
proportional to the geometrical degrees of freedom
and, therefore, governs the dimensionality of the
problem.

• Computational flow solver. The objective function
is obtained from the flow solver and it is, therefore,
up to the discretion of the designer to appropriately
formulate the objective and constraint functions,
such that they reflect the design and operating requirements.
The choice of the flow solver ultimately
governs the overall fidelity and efficiency of the optimization
process, since repeated evaluations of the
objective function are required for each candidate
shape.

• Optimization algorithm. The responsibility of the
optimizer is to iteratively determine the shape modifications
required to satisfy the objective, whilst adhering
to any shape or performance constraints. The
optimizer should be robust and applicable to a wide
operational spectrum, yet efficient to guarantee convergence
with the least computational expense.
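
The interplay of the three components can be sketched as a short loop. The Python sketch below is a deliberately simplified illustration and not the framework of this chapter: the design variables are a plain list of numbers, the function `evaluate` is a cheap analytic stand-in for a flow solver, and the optimizer is a basic random perturbation scheme. All names and parameter values are assumptions.

```python
import random


def evaluate(design):
    """Stand-in for the computational flow solver (assumption): a cheap
    analytic objective whose minimum lies at 0.3 in every design variable."""
    return sum((v - 0.3) ** 2 for v in design)


def optimize(n_vars=4, iterations=200, step=0.1, seed=0):
    """Minimal design loop: initialize, perturb, evaluate, keep improvements."""
    rng = random.Random(seed)
    # Shape parameterization: the shape is represented by n_vars design variables.
    design = [rng.uniform(-1.0, 1.0) for _ in range(n_vars)]
    best = evaluate(design)
    history = [best]
    for _ in range(iterations):
        # Optimizer: perturb the candidate shape.
        candidate = [v + rng.gauss(0.0, step) for v in design]
        # Flow solver: one objective evaluation per candidate.
        value = evaluate(candidate)
        if value < best:
            design, best = candidate, value
        history.append(best)
    return design, best, history
```

Each pass through the loop costs one call to `evaluate`, which is why the choice of flow solver dominates the overall cost when that call takes hours rather than microseconds.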


The integration of high-fidelity flow solvers and
flexible parameterization methods for numerical optimization
is still a computationally challenging and
intensive undertaking. The extension to multiple objectives
leads to a more generalized problem formulation,
yet significantly increases the computational cost of
convergence. While all elements of the design loop influence
the efficiency of the process, arguably the most
important element is the optimizer itself. The following
section introduces the optimization paradigm adopted
in this study, derived from the field of computational
swarm intelligence.

Fig. 67.1 Generalized process flowchart for direct airfoil shape optimization (flowchart labels: define objective function(s); define constraints; geometrical shape parameterization; computational flow solver; initialize candidate shape(s); optimization; convergence obtained?; perturb candidate shape(s); optimal shape)


67.1.2 Intelligent Optimization: PSO 


The formation of hierarchies within groups of animals 
is a naturally occurring phenomenon and is simple to 
comprehend. Even humans have the intuitive tendency 
to appoint leaders (e.g., political leaders, military gen- 
erals, etc.). Another interesting phenomenon, which is 
more difficult to perceive, is the self-organized behav- 
ior of groups where a leader cannot be identified. This 
is known as swarming and is evident from the flock- 
ing behavior of birds or fish moving in unison. The 
increasingly cited field of computational swarm intel- 
ligence focuses on the artificial simulation of swarming 
behavior to model a wide range of applications, includ- 
ing optimization [67.3]. 

Particle swarm optimization (PSO) is the stochas- 
tic population-based technique described by Kennedy 
and Eberhart [67.4] in accordance with the principles 
of swarm intelligence. The PSO architecture was de- 
rived from a synthesis of the fields of social psychology 
and engineering optimization. As was eloquently stated 
by the authors in their original paper [67.4]: 


Why is social behavior so ubiquitous in the animal
kingdom? Because it optimizes. What is a good way
to solve engineering optimization problems? Modeling
social behavior.


The dynamics of the swarm are modeled on the 
social-psychological tendency of individuals to learn 
from previous experience and emulate the success of 
others. Similar to most evolutionary techniques, the 
swarm is initialized with a population of random indi- 
viduals (particles) sampled over the design space. The 
particles navigate the multi-dimensional design space 
over a number of iterations or time steps. Each particle
maintains knowledge of its current position in the design
space and of the objective value attained there; the
latter is analogous to the fitness concept of
conventional evolutionary algorithms (EAs). Each particle
also records its personal best position, which is
where the particle has experienced the greatest success. 
Aside from recording personal information, each par- 
ticle also tracks the position of other members in the 
swarm. This level of social interaction between parti- 
cles is coined the swarm topology. Particles may either 
be confined to share information only with their imme- 
diate neighbors, or they may be encouraged to share 
their experiences with the entire swarm. Utilizing this 
information, each particle adjusts its position in the de- 
sign space by accelerating towards the successful areas 
of the design space. The absence of selection is com- 
pensated by this use of leaders to guide the swarm to 
converge to the most successful position. In this way, 
a solution which initially performs poorly may possibly 
be on the future road to success. 
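
The swarm dynamics described above can be made concrete with a short single-objective example. The Python sketch below uses the widely used inertia-weight form of the velocity update with a fully connected (global best) topology. This specific update rule, the parameter values `w`, `c1`, `c2`, and the test function `sphere` are assumptions made for illustration; the text above does not prescribe them.

```python
import random


def sphere(x):
    """Simple test objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


def pso(objective, dim=2, n_particles=15, iterations=100,
        lower=-5.0, upper=5.0, w=0.7, c1=1.5, c2=1.5, seed=1):
    """Minimal global-best PSO for minimization."""
    rng = random.Random(seed)
    pos = [[rng.uniform(lower, upper) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # personal best positions
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # swarm leader
    initial_best = gbest_val
    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Accelerate toward the personal best and the swarm leader.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move and clamp to the bounds of the design space.
                pos[i][d] = min(upper, max(lower, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val, initial_best
```

Note that no particle is ever discarded: a particle with a poor current position is simply pulled toward better regions, which corresponds to the absence of selection mentioned above.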

PSO has steadily gained popularity as a global op- 
timization technique [67.3]. Its increasing use in the 
literature is due to its simple and straightforward imple- 
mentation (despite its intricate origins) and its efficient 
and accurate convergence rates [67.5]. 


67.1.3 Multiobjective Optimization 


Airfoil design problems are often characterized by sev- 
eral interacting or conflicting requirements, which must 
be satisfied simultaneously. In the case of an airfoil 
operating within the transonic regime, airfoil shape optimization
is performed to limit shock and viscous drag
($C_d$) losses, and reduce shock-induced boundary layer
instability at the design Mach number ($M$) and lift
coefficient ($C_l$). This often occurs at the expense of excessive
pitching moments ($C_m$) due to aft loading and
performance degradation under off-design conditions.
Facilitating adequate performance over a wide operational
spectrum requires a search algorithm capable of
handling multiple conflicting objectives.


Let $S \subset \mathbb{R}^n$ denote the design space and let
$x = \{x_1, x_2, \ldots, x_n\} \in S$ denote the decision vector with
lower and upper bounds $x_{\min}$ and $x_{\max}$, respectively. The
generic unconstrained multiobjective problem (MOP) is
thus expressed as

$$\min f(x) = \{f_1(x), \ldots, f_m(x)\} , \tag{67.1}$$

where $f_i(x): \mathbb{R}^n \to \mathbb{R}$ is the i-th component of the objective
vector and m is the number of objectives. The
definition of the optimum must be redefined since in
the presence of conflicting objectives, improvement in
one objective may cause a deterioration in another. It is
often necessary to identify a set of trade-off solutions,
which can all be considered equally optimal. A solution
is termed non-dominated or Pareto optimal (after the
nineteenth century Italian economist Vilfredo Pareto) if
the value of any objective cannot be improved without
deteriorating at least one other objective. The candidate
solutions are defined as $a$ and $b \in S$. The candidate $a$
dominates the candidate $b$ (denoted by $a \prec b$) if

$$\forall i \in \{1, \ldots, m\}: f_i(a) \le f_i(b) \;\wedge\; \exists j \in \{1, \ldots, m\}: f_j(a) < f_j(b) . \tag{67.2}$$


The concept of dominance is illustrated in Fig. 67.2.
The shaded area denotes the area of objective vectors
dominated by $a$. A decision vector $a^*$ is, therefore, non-dominated
or Pareto optimal if there is no other feasible
decision vector $a \ne a^* \in S$ such that $f(a) \prec f(a^*)$. The
Pareto front is the set of objective vectors which correspond
to all non-dominated solutions. Multiobjective
algorithms aim to identify the closest approximation to
the true Pareto front, while ensuring a diverse Pareto
optimal set.
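
The dominance relation (67.2) translates directly into code. The following Python sketch assumes that all objectives are minimized and that each candidate is represented by its objective vector; the function names are chosen for illustration.

```python
def dominates(fa, fb):
    """True if objective vector fa dominates fb: fa is no worse in every
    objective and strictly better in at least one (minimization)."""
    no_worse = all(x <= y for x, y in zip(fa, fb))
    strictly_better = any(x < y for x, y in zip(fa, fb))
    return no_worse and strictly_better


def non_dominated(points):
    """Return the objective vectors that no other point in the set dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

For the four objective vectors (1, 5), (2, 2), (5, 1), and (4, 4), only (4, 4) is dominated, namely by (2, 2); the remaining three form the non-dominated set. This pairwise filter needs a quadratic number of comparisons, which is acceptable for small sets.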


Fig. 67.2 Illustration of dominance on a two-objective
landscape (axes: $f_1$, $f_2$; shaded: dominance region of solution a)

Techniques for Solving MOP 
From a design perspective, the primary aim of mul- 
tiobjective optimization is to obtain Pareto optimal 
solutions which are in the preferred interests of the de- 
signer, or best suit the intended application. Methods 
for solving MOPs are, therefore, characterized by how 
the designer preferences are articulated. As suggested 
by Fonseca and Fleming [67.6], there are three generic 
classes of methods for solving multiobjective problems: 


• A priori methods. The preferences of the designer
are expressed by aggregating the objective functions
into a single scalar through weights or bias, ultimately
making the problem single objective.

• A posteriori methods. The algorithm first identifies
a set of non-dominated solutions, subsequently providing
the designer greater flexibility in selecting
the most appropriate solution.

• Interactive methods. The decision making and optimization
processes occur at interleaved steps, and
the preferences of the designer are interactively refined.


The a priori strategy requires the designer to indi- 
cate the relative importance of each objective before 
performing the optimization. A notable method that 
falls into this category is the weighted aggregation 
method, which is a fairly popular choice for airfoil de- 
sign applications due to its simplicity and capability of 
handling many flight conditions [67.7–9]. Despite its
popularity, there are recognized deficiencies with this 
strategy [67.10]. The prior selection of weights does 
not necessarily guarantee that the final solution will 
faithfully reflect the preferred interests of the designer, 
and varying the weights continuously will not neces- 
sarily result in an even distribution of Pareto optimal 
solutions, nor a complete representation of the Pareto 
front [67.11]. 
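
The incomplete coverage of the Pareto front can be reproduced with a small numerical experiment. The Python sketch below builds a hypothetical non-convex (concave) two-objective front on a quarter circle and sweeps the weight of a weighted sum over an evenly spaced grid. The front, the weight grid, and the function names are assumptions chosen for illustration.

```python
import math


def weighted_sum(f, weights):
    """Aggregate an objective vector into a single scalar."""
    return sum(w * v for w, v in zip(weights, f))


def selected_by_weights(front, n_weights=11):
    """Indices of the front members that minimize the weighted sum for at
    least one weight vector (w, 1 - w) on an evenly spaced grid."""
    chosen = set()
    for k in range(n_weights):
        w1 = k / (n_weights - 1)
        weights = (w1, 1.0 - w1)
        best = min(range(len(front)),
                   key=lambda i: weighted_sum(front[i], weights))
        chosen.add(best)
    return sorted(chosen)


# Hypothetical concave front: points on the quarter circle f1^2 + f2^2 = 1.
CONCAVE_FRONT = [(math.cos(k * math.pi / 20), math.sin(k * math.pi / 20))
                 for k in range(11)]
```

On this front every weight setting selects one of the two end points, so the nine intermediate trade-off solutions are never returned, regardless of how finely the weights are varied.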

Alternatively, the a posteriori methods provide maximum
flexibility to the designer to identify the most preferred
solution, at the expense of greater computational
effort. Generally, these methods involve explicitly solving
for each objective to obtain a set of non-dominated
solutions, a concept which is ideal for population-based
evolutionary algorithms [67.12–14]. While these methods
are computationally more complex, researchers in
aerodynamic design are realizing the benefits of evolutionary
multiobjective optimization (EMO), especially
if there is a certain ambiguity in selecting the final design
[67.15–17]. However, this strategy poses the challenge of
identifying and exploiting the entire Pareto front, which
may be impractical for design applications due to the
excessive number of function evaluations.

While conventional EMO techniques may be com- 
putationally demanding, Fonseca and Fleming [67.12] 
argue that their most attractive aspect is the intermedi- 
ate information generated, which can be exploited by 
the designer to refine preferences and improve conver- 
gence. These interactive methods involve the progres- 
sive articulation of preferences, which originates from 
the multicriteria decision making literature [67.18]. The 
optimization and decision making processes are in- 
terleaved, exploiting the intermediate information pro- 
vided by the optimizer to refine preferences [67.6]. 


Handling Multiple Objectives with PSO 
PSO has been demonstrated to be an effective tool 
for single-objective optimization problems due to its 
fast convergence [67.5]. It has also gained rapid popularity
in the field of multiobjective optimization
(MOO) [67.19]. Since PSO is a population-based technique,
it could ideally be tailored to identify a number
of trade-off solutions to an MOP in one single run, similar
to EMO techniques. Comprehensive surveys on
extending PSO to handle multiple objectives have been 
provided by Engelbrecht [67.20], and more recently by 
Sierra and Coello Coello [67.19]. It was established that 
the primary ambiguity in specifically tailoring PSO to 
handle multiple objectives was the selection of guides 
for each particle to avoid convergence to a single so- 
lution. The selection process for particle leaders must, 
therefore, be restructured, to encourage search diversity 
and to ensure that non-dominated solutions found dur- 
ing the search are maintained. 

Initial attempts to design a multiobjective particle
swarm optimization (MOPSO) algorithm were motivated
by the archive strategy of [67.21]. Coello Coello
and Lechuga [67.22] incorporated the concept of Pareto
dominance in PSO by maintaining two independent
populations: the particle swarm and the elitist archive.
Non-dominated solutions are stored in the archive and
subsequently used as neighborhood leaders. The objective
space is separated into hypercubes, which serve
as a particle anti-clustering mechanism. Solutions in
sparsely populated hypercubes have a higher selection
pressure to be leaders, and solutions in densely populated
hypercubes are removed if the archive limit is
exceeded. This initial approach was later extended by
Mostaghim and Teich [67.23], who studied the concept
of ε-dominance and compared it to existing clustering
techniques for fixing the archive size, with favorable results.
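
The hypercube mechanism can be sketched as follows. This Python sketch is a simplified illustration in the spirit of the archive described above and not a reproduction of the algorithm in [67.22]: the grid is rebuilt from the current archive bounds, the selection weight of a hypercube is taken as the inverse of its occupancy, and pruning removes random members of the most crowded hypercube. These choices, the number of divisions, and all function names are assumptions.

```python
import random
from collections import defaultdict


def cell_of(f, lows, highs, divisions):
    """Index of the hypercube that contains objective vector f."""
    idx = []
    for v, lo, hi in zip(f, lows, highs):
        width = (hi - lo) / divisions
        if width == 0.0:
            width = 1.0  # degenerate objective range: use a single cell
        idx.append(min(divisions - 1, int((v - lo) / width)))
    return tuple(idx)


def build_grid(archive, divisions=4):
    """Group archive members by the hypercube they fall into."""
    m = len(archive[0])
    lows = [min(f[i] for f in archive) for i in range(m)]
    highs = [max(f[i] for f in archive) for i in range(m)]
    grid = defaultdict(list)
    for f in archive:
        grid[cell_of(f, lows, highs, divisions)].append(f)
    return grid


def select_leader(grid, rng):
    """Pick a hypercube with probability inversely proportional to its
    occupancy, then pick one of its members uniformly at random."""
    cells = list(grid)
    weights = [1.0 / len(grid[c]) for c in cells]
    cell = rng.choices(cells, weights=weights, k=1)[0]
    return rng.choice(grid[cell])


def prune(grid, limit, rng):
    """Remove members of the most crowded hypercube until the limit holds."""
    while sum(len(v) for v in grid.values()) > limit:
        crowded = max(grid, key=lambda c: len(grid[c]))
        grid[crowded].pop(rng.randrange(len(grid[crowded])))
        if not grid[crowded]:
            del grid[crowded]
    return [f for members in grid.values() for f in members]
```

Both operations push in the same direction: isolated solutions are more likely to become leaders and less likely to be deleted, which spreads the swarm along the front.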
Fieldsend and Singh [67.24] addressed the computational
complexity of maintaining a restricted archive
by incorporating the dominated tree method. This data
structure allows for an unrestricted archive size, which
interacts with the population to define global leaders.
A turbulence operator (similar to the concept of mutation
in EA) was also implemented, where swarm
members were randomly displaced in the design space
to reduce the probability of premature stagnation. In
the non-dominated sorting particle swarm optimization
(NSPSO) algorithm of Li [67.25], the non-dominated
sorting mechanisms of the non-dominated sorting genetic
algorithm II (NSGA-II) are incorporated. The population
and the personal best position of each particle are
consolidated to form one single population, and the
non-dominated sorting scheme is utilized to rank each
solution. Global guides are selected based on particle
clustering, where a niching or crowding distance metric
is used to further classify non-dominated solutions.
Li later proposed the maximinPSO algorithm [67.26],
which does not use any niching method to maintain
diversity.

Sierra and Coello Coello [67.27] proposed an elitist
archive incorporating the ε-dominance strategy to
maintain global leaders for the swarm. A crowding distance
operator is employed to classify non-dominated
solutions and maintain uniformity. The crowding distance
operator is also used to limit the number of
candidate leaders after each population update, simplifying
the mechanism to control the set of candidate
leaders. A turbulence operator is implemented to encourage
diversity, whereby particles are randomly mutated.
A similar approach by [67.28] was developed in
parallel (although this method does not implement ε-dominance),
where the crowding distance was used to
both define the global guides and truncate the size of the
archive. The proposed algorithm is primarily influenced
by the two latter studies.
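
The crowding distance used in these studies is commonly computed as in NSGA-II. The Python sketch below follows that standard definition and adds a simple truncation helper; it is a generic illustration and not the implementation of [67.27] or [67.28]. The function names are chosen for illustration.

```python
def crowding_distance(front):
    """NSGA-II style crowding distance for a list of objective vectors.

    Boundary solutions of each objective receive an infinite distance;
    interior solutions accumulate the normalized gap between their two
    neighbors along each objective.
    """
    n = len(front)
    if n == 0:
        return []
    m = len(front[0])
    dist = [0.0] * n
    for k in range(m):
        order = sorted(range(n), key=lambda i: front[i][k])
        lo, hi = front[order[0]][k], front[order[-1]][k]
        dist[order[0]] = dist[order[-1]] = float("inf")
        if hi == lo:
            continue  # all values equal: this objective adds no spacing
        for j in range(1, n - 1):
            gap = front[order[j + 1]][k] - front[order[j - 1]][k]
            dist[order[j]] += gap / (hi - lo)
    return dist


def truncate(front, limit):
    """Keep the `limit` least crowded members, preserving their order."""
    d = crowding_distance(front)
    keep = sorted(range(len(front)), key=lambda i: d[i], reverse=True)[:limit]
    return [front[i] for i in sorted(keep)]
```

For the front (0, 1), (0.1, 0.9), (0.5, 0.5), (1, 0), the point (0.1, 0.9) lies closest to its neighbors and is the first to be removed when the archive is truncated to three members.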


Preference-Based Optimization 
The concept of interactive optimization has led to 
an increasing interest in coupling classical interactive 
methods to EMO as an intuitive way of reflecting the 
designer preferences and identifying solutions of inter- 
est to the designer. This has led to the development 
of the preference-based optimization philosophy, which 
provides the motivation for the current study. Compre- 
hensive surveys on preference-based optimization are 
provided by Coello Coello [67.29] and Rachmawati 
and Srinivasan [67.30]. The first recorded attempt at 
incorporating preferences within an evolutionary multiobjective
optimization framework was made by Fonseca
and Fleming [67.31] using the goal programming
approach. Goal programming [67.11] is an ideal ap- 
proach to indicate desired levels of performance for 
each objective, since they closely relate to the final 
solution of the problem. Goals may either represent 
target or ideal values. Fonseca and Fleming later ex- 
tended the approach where an online decision making 
strategy was proposed based on goal and priority in- 
formation [67.6]. A goal programming mechanism for 
identifying preferred solutions for MOP was also pro- 
posed by [67.32]. While the reported frameworks draw 
on the preferred interests of the designer to aid the 
optimization process, the goal programming approach 
is computationally complex, and there is no means of 
specifying any relation or trade-off between the objec- 
tives [67.30]. 

Thiele et al. [67.33] proposed another variant of 
interactive evolutionary multiobjective optimization. 
A coarse representation of the Pareto front is initially 
presented to the designer. The most interesting regions 
are subsequently isolated, and the algorithm then
focuses exclusively on them. This proposal effectively
removes the necessity to predefine target values for each 
objective and provides the designer with a means of 
isolating the preferred trade-offs. However, it is a two- 
stage approach requiring an initial approximation to the 
Pareto front, which may be unnecessarily expensive. 
The integration of other classical preference articula- 
tion methods has also been proposed in the literature. 
A reference point-based evolutionary multiobjective 
optimization framework was proposed by [67.34]. The 
crowding distance operator of the NSGA-II algorithm 
was modified to include the reference point information 
and the extent of the preferred region was controlled by 
ε-dominance. Deb and Kumar also experimented with
the use of other classical preference methods, such as 
the reference direction method [67.35] and the light 
beam search method [67.36]. 

Recently, the use of interactive methods has also 
been integrated within PSO frameworks. Wickramas- 
inghe and Li [67.37] integrated the reference point 
method to both the NSPSO [67.25] and maxim- 
inPSO [67.26] algorithms. Significant improvement in 
convergence efficiency was highlighted, and it was 
demonstrated that final solutions are of higher rele- 
vance to the designer. Wickramasinghe and Li [67.38] 
later extended their approach to handle MOP, by re- 
placing the dominance criteria entirely with the simpler 
distance metric. It was conclusively demonstrated that 
without the use of the reference point, obtaining a final
set of preferred solutions solely through conventional
dominance-based techniques is improbable. 


67.1.4 Surrogate Modeling 


The most prohibitive factor of design optimization
is the cost of evaluating the objective and constraint 
functions. For high-fidelity airfoil design, function eval- 
uations may very well be measured in hours. This 
computational burden ultimately questions the practi- 
cality of performing an optimization study, and is often 
alleviated by simply reducing the level of sophistica- 
tion of the solver. This consequently reduces the fidelity 
of the final design, which is undesirable. Another miti- 
gating strategy which has steadily gained popularity in 
design is the use of inexpensive surrogates or metamod- 
els [67.39]. These models emulate the response of the 
expensive function at an unobserved location, based on 
observations at other locations. Surrogate models are 
not specifically optimization methods, but rather they 
may ideally be used in lieu of the expensive function 
to extract information from the design space during the 
optimization process. 

The insightful texts by Keane and Nair [67.39] and 
Forrester et al. [67.40] provide a detailed account of the 
use of surrogates in design. The most common use is to 
construct a curve fit of an expensive function landscape, 
which can be used to predict results without recourse to 
the original function. This is supported by the assump- 
tion that the inexpensive surrogate will still be usefully 
accurate when predicting sufficiently far from observed 
data points [67.40]. Figure 67.3 illustrates the use of
a surrogate to fit a one-dimensional multi-modal function,
based on four sample observations. It is important
to note, however, that the original function landscape
could potentially represent any deterministic quantity of
the design space. Rather than exactly emulating the
response of a high-fidelity flow solver, the surrogate may,
in fact, be used to bridge the gap between flow solvers
of varying fidelity [67.40]. Alternatively, a surrogate
may be used to interpret or filter noisy landscapes, so
as to eliminate the adverse effects of flow solver
convergence or grid discretization. Surrogates may also be
used for data mining and design space visualization.
Such methodologies are applied to extract useful
information about the relationship between the design space
and the objective space, allowing informed decisions to
be made, which could simplify a seemingly complex
problem.

Fig. 67.3 Constructing an interpolation-based surrogate to
fit a one-dimensional function

For the aforementioned uses of surrogate model- 
ing, the common requirement is to replicate the func- 
tion relationship between the variable inputs and the 
output quantity of interest. This is typically achieved 
by sampling the design space using the exact func- 
tion to sufficiently model the underlying relationship 
within the allowable computational budget. Whether 
the aim is to locally model the design space surround- 
ing an existing design or tune a surrogate to repli- 
cate the global design space is entirely dependent on 
the formation of the sampling plan [67.39]. The con- 
struction of a surrogate model in either case should 
ideally make use of a parallel computing structure. 
A suitable surrogate model $\hat{f}$ of the precise
objective function $f$ should then be constructed to fit the
dataset.
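
As a minimal sketch of this workflow (and not the method used to produce Fig. 67.3), the following Python fragment samples a stand-in function at four locations and builds an interpolating surrogate from the observations. The test function `expensive_function`, the sample locations, and the choice of a cubic Lagrange interpolant are all assumptions made for brevity; a Kriging or radial basis function model would normally take its place.

```python
import math


def expensive_function(x):
    """Stand-in for a costly solver call (hypothetical multi-modal function)."""
    return math.sin(3.0 * x) + 0.5 * x


def build_surrogate(xs, ys):
    """Return a Lagrange interpolant f_hat through the samples (xs, ys)."""
    def f_hat(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return f_hat


# Sampling plan: four precise observations of the expensive function.
samples = [0.0, 1.0, 2.0, 3.0]
observations = [expensive_function(x) for x in samples]

# The surrogate is then queried in place of the expensive function.
f_hat = build_surrogate(samples, observations)
prediction = f_hat(1.5)
```

The surrogate reproduces the four observations exactly; how well it predicts between them depends entirely on whether the sampling plan captures the landscape.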

There are a multitude of popular techniques for 
constructing surrogates in the literature. For a com- 
prehensive review of different methods, the reader is 
referred to (among others) [67.39-42]. The selection 
of the surrogate model is dependent on the information 
that the designer is attempting to extract from the design 
space. Polynomial response surfaces and radial basis 
functions are fairly popular techniques for constructing 
local surrogates, especially if some level of regression 
is desirable. Techniques such as Kriging or support 
vector machines are more ideally suited to global op- 
timization studies, since they offer greater flexibility in 
tuning model parameters and provide a confidence in- 
terval of the predicted output. Neural networks require 
extensive training and validation, yet have also been 
a popular technique for design applications, notably in 
aerodynamic modeling [67.43] and visualization
techniques [67.44, 45].

It was established in Sect. 67.1.1 that the shape
parameterization method essentially governs the
dimensionality of the problem and the attainable shapes, whereas
the objective flow solver dictates the overall fidelity of 
the optimum design. In this section, we present a dis- 
cussion on these elements of the design loop to be used 
in conjunction with the developed optimizer for the sub- 
sequent design process. 


67.2.1 The PARSEC Parameterization Method 


Geometry manipulation is of particular importance in 
aerodynamic design. The selection of the shape param- 
eterization method is an important contributing factor, 
since it will effectively define the objective landscape 
and the topology of the design space [67.46]. If the 
aim of the optimization process is to improve on an 
established design, then perhaps local parameterization 
methods, which offer a greater number of geometri- 
cal degrees of freedom, are desirable. However, the 
large number of variables may cause the convergence 
rate for global design applications to deteriorate. The 
development of efficient parameterization models has, 
therefore, been given significant attention, to increase 
the flexibility of geometrical control with a minimum 
number of design variables. 

For certain applications, it is possible to make use of 
fundamental aerodynamic theory to refine the param- 
eterization method, such that the design variables re- 
late to important aerodynamic or geometric quantities. 
A common method for airfoil shape parameterization 
is the PARSEC method [67.47]. It has the advantage 
of strict control over important aerodynamic features, 
and it allows independent control over the airfoil
geometry for imposing shape constraints. The methodology
is characterized by 11 design variables (Fig. 67.4),
including leading edge radius ($r_{LE}$), upper and lower
thickness locations ($x_{UP}$, $z_{UP}$, $x_{LO}$, $z_{LO}$) and curvatures
($z_{xxUP}$, $z_{xxLO}$), trailing edge direction ($\alpha_{TE}$) and wedge
angle ($\beta_{TE}$), and trailing edge coordinate ($z_{TE}$) and
thickness ($\Delta z_{TE}$). The shape function is modeled via
a sixth-order polynomial function

$$z_k = \sum_{n=1}^{6} a_{n,k}\, x_k^{\,n-1/2}, \tag{67.3}$$

where (x, z) are the shape coordinates and k denotes
either the upper (suction) or lower (pressure) airfoil
surface. The coefficients $a_{n,k}$ are determined from the
geometric parameters.

Fig. 67.4 Airfoil representation via the PARSEC method

A modification by Jahangirian and
Shahrokhi [67.48] was introduced to provide additional
control over the trailing edge curvature. For supercritical
transonic airfoils, this is beneficial to reduce the
probability of downstream boundary layer separation,
which results in increased drag values. A new variable
$\Delta\alpha_{TE}$ was introduced, which directly influences
the additional curvature of the trailing edge. The
modification decouples the trailing edge parameterization by
first defining a smoother upper surface contour and then
constraining the lower surface to intersect the trailing
edge coordinate. Figure 67.5 illustrates the modification
to the trailing edge curvature. The modification is
applied to the upper and lower surfaces as follows
L-tan AQTE 


= 


te [1 +n- (1-7)“] , 7.) 


where the constants 7, jz, and t are set to 0.8, 2, and 6, 
respectively. The modification is applied over the entire
surface, such that $L_{UP} = L_{LO} = c$, where c is the airfoil
chord length.

Fig. 67.5 Additional trailing edge curvature via the
modified PARSEC method

Table 67.1 PARSEC parameter ranges for transonic optimization

Description                   Variable
Leading edge radius           r_LE
Trailing edge direction       α_TE
Trailing edge wedge angle     β_TE
Upper-crest abscissa          x_UP
Upper-crest ordinate          z_UP
Upper-crest curvature         z_xxUP
Lower-crest abscissa          x_LO
Lower-crest ordinate          z_LO
Lower-crest curvature         z_xxLO
Trailing edge curvature       Δα_TE

Table 67.1 presents the upper and lower bound- 
aries for the subsequent optimization case study. These 
boundaries have been selected based on a thorough 
screening study involving a statistical sample of a num- 
ber of benchmark airfoils. 
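
To make the link between the geometric parameters and the coefficients of (67.3) concrete, the sketch below solves for the six upper-surface coefficients on a unit chord. It assumes the commonly quoted PARSEC conditions (leading edge radius, crest location, zero slope and prescribed curvature at the crest, trailing edge ordinate and slope); the sign convention for the trailing edge angles, the function names, and the sample parameter values (chosen inside the ranges of Table 67.1) are illustrative assumptions, and the trailing edge modification of [67.48] is not included.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def parsec_upper_coefficients(r_le, x_up, z_up, z_xx_up, alpha_te, beta_te,
                              z_te=0.0, dz_te=0.0):
    """Coefficients a_1..a_6 of z(x) = sum a_n x^(n - 1/2), upper surface."""
    p = [n - 0.5 for n in range(1, 7)]  # exponents n - 1/2
    A = [
        [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],                       # a_1 = sqrt(2 r_LE)
        [x_up ** e for e in p],                               # z(x_UP) = z_UP
        [e * x_up ** (e - 1.0) for e in p],                   # z'(x_UP) = 0
        [e * (e - 1.0) * x_up ** (e - 2.0) for e in p],       # z''(x_UP) = z_xxUP
        [1.0] * 6,                                            # z(1) = z_TE + dz_TE / 2
        list(p),                                              # z'(1) = tan(alpha - beta / 2)
    ]
    b = [math.sqrt(2.0 * r_le), z_up, 0.0, z_xx_up,
         z_te + 0.5 * dz_te, math.tan(alpha_te - 0.5 * beta_te)]
    return solve(A, b)


def surface(a, x):
    """Evaluate the shape function (67.3) at chordwise position x."""
    return sum(a_n * x ** (n - 0.5) for n, a_n in enumerate(a, start=1))


# Illustrative parameter values taken from within the ranges of Table 67.1.
coeffs = parsec_upper_coefficients(r_le=0.01, x_up=0.4, z_up=0.06,
                                   z_xx_up=-0.4, alpha_te=-0.1, beta_te=0.15)
crest_height = surface(coeffs, 0.4)
```

The lower surface follows the same pattern with its own crest parameters and the opposite sign on the leading edge and wedge angle terms.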


67.2.2 Transonic Flow Solver 


The optimization process is ultimately dependent on 
the selection of the flow solver, since it is the most 
computationally expensive component, and repeated 
evaluations of the objective and constraint functions 
are required for each candidate shape. However if the 
flow solver is not sufficiently accurate, the optimization 
process will converge to shapes that exploit the numer- 
ical errors or limitations, rather than the fundamental 
physics of the problem. For this reason, it is desirable 
to maintain the correct balance between solution accu- 
racy and computational expense, which is dictated by 
the flow regime. For certain problems where the aero- 
dynamic flow field is well behaved, it may be sufficient 
to consider more robust linear solvers. However for 
high-fidelity design it is prudent to consider non-linear 
and more computationally demanding solvers, to ensure 
that optimized shapes provide the anticipated perfor- 
mance requirements in flight. 

The general purpose finite volume code ANSYS 
Fluent is adopted in this study. A pressure-based nu- 
merical procedure is adopted with third-order spatial 
discretization to capture the occurring flow phenomena. 
The momentum equations and pressure-based continuity
equation are solved concurrently, with the
Courant-Friedrichs-Lewy number set at 200. The one-equation
Spalart-Allmaras turbulence model [67.49] is selected,
and turbulent flow is modeled over the entire airfoil
surface. The C-type grid (as represented in Fig. 67.6)
stretches 25 chord lengths aft and normal of the airfoil
section. Resolution of the C-grid is 460 × 65, providing
an affordable mesh size of approximately 30000 elements.
The first grid point is located $2.5 \times 10^{-4}$ units
normal to the airfoil surface, resulting in an average
y-plus value of 120. In the interest of robust and efficient
convergence rates, a full multi-grid (FMG) initialization
scheme is employed, with coarsening of the grid
to 30 cells. In the FMG initialization process, the Euler
equations are solved using a first-order discretization to
obtain a flow field approximation before submission to
the full iterative calculation.

Fig. 67.6 C-type grid for transonic simulation

Table 67.1 (continued)

Variable    Lower bound    Upper bound
r_LE         0.0063         0.0151
α_TE        -0.2405        -0.0026
β_TE         0.0655         0.2618
x_UP         0.3170         0.5250
z_UP         0.0497         0.0683
z_xxUP      -0.5135        -0.2393
x_LO         0.2835         0.3418
z_LO        -0.0603        -0.0478
z_xxLO       0.2535         0.8405
Δα_TE       -0.0080         0.3696


67.3 Optimization Algorithm


The proposed algorithm was primarily motivated by the
studies of Wickramasinghe and Li [67.38]. The prin- 
cipal argument is that for most design applications, 
to explore the entire Pareto front is often unneces- 
sary, and the computational burden can be alleviated by 
considering the immediate interests of the designer. In 
Sect. 67.1.3, a discussion on the benefits of preference- 
based optimization was provided. Drawing on these 
concepts, a preference-based algorithm is proposed, 
where a designer-driven distance metric is used to
quantify the success of a solution as a single scalar. The multiobjective
search effort is coordinated via a MOPSO algorithm. 
The swarm is guided by a reference point, which is an 
intuitive means of articulating the preferences of the 
designer and can ideally be based on an existing or 
target design. This section provides a comprehensive 
discussion on the proposed algorithm, highlighting its 
viability for the intended domain of application. 


67.3.1 The Reference Point Method 


In this research, the swarm is guided by a reference 
point to confine its search focus exclusively on the 
preferred region of the Pareto front as dictated by the 
preferences of the designer. Introducing the preferred 
region provides the designer flexibility to explore other 
interesting alternatives. This hybrid methodology is ad- 
vantageous for navigating high-dimensional and multi- 
modal landscapes, which are typical of aerodynamic 
design problems. Furthermore, inherently considering 
the preferences of the designer provides a feasible 
means of quantifying the practicality of a design. 


The Reference Point Distance Metric 
The reference point method has been integrated into 
MOO algorithms, notably by Deb and Sundar [67.34] 
and Wickramasinghe and Li [67.37,38]. These stud- 
ies highlight the benefits of incorporating preference 
information via the reference point in terms of con- 
vergence. Guided by the information provided by the 
reference point, the swarm can simultaneously identify 
multiple solutions in the preferred region. This provides 
the designer flexibility to explore several preferred de- 
signs, while alleviating the computational burden of 
identifying the entire Pareto front. A reference point 
distance metric following the work of Wickramasinghe 
and Li [67.37] is proposed. This metric provides an in- 
tuitive criterion to select global leaders and assists the 
swarm to identify only solutions of interest to the
designer. The distance of a particle x to the reference point
z is defined as

$$d_z(x) = \max_{i=1,\dots,m} \{ f_i(x) - z_i \}, \tag{67.5}$$

where m is the number of objectives.
A solution a is, therefore, preferred to solution b if
$d_z(a) < d_z(b)$. This condition is an extension of the
condition f(a) < f(b); therefore, the distance metric may, in
fact, substitute the dominance criteria entirely [67.38].
Using this distance metric, the swarm is guided to
preferred regions of the Pareto optimal front. Figure 67.7
illustrates the search directions of the algorithm when
guided by a reference point, and the corresponding
preferred design as a direct result of minimizing the
distance metric $d_z$.

The distinguishing feature of the reference point
distance metric over the Euclidean distance is that
solutions do not converge to the reference point, but to
the preferred region of the Pareto front as dictated by
the search direction. This is illustrated in Fig. 67.8.
All solutions are non-dominated and lie on the circular
arc surrounding the reference point z, and thus the
Euclidean distance to the reference point is equal.
However, since solution 7 has the smallest maximum
translational distance to the reference point compared
to any other solution, it is considered preferred. The
definition of the reference point distance also admits
negative values. If the distance of the preferred solution
$d_z(z') < 0$, then it can simply be considered that the
reference point is dominated, or $z' \prec z$. Since the
designer generally has no prior knowledge of the topology
and location of the Pareto front, a reference point may
be placed in any feasible or infeasible region, as is
shown in Fig. 67.7. It is, therefore, the consensus that
the reference point draws on the experience of the
designer to express the preferred compromise, rather than
specific target values or goals. Similarly, the reference
point distance metric ranks or assesses the success of
a particle as one single scalar, instead of an array of
objective values.

Fig. 67.7 Illustration of the search direction governed by
the reference point
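
The behavior described above can be checked with a few lines of Python. The sketch below implements (67.5) directly; the function name `reference_distance` and the three sample points are illustrative assumptions. The points lie on a circular arc around the reference point, so their Euclidean distances are identical, yet the metric singles out one of them.

```python
import math


def reference_distance(objectives, reference):
    """Reference point distance (67.5): largest signed offset over all objectives."""
    return max(f - z for f, z in zip(objectives, reference))


z = (0.0, 0.0)

# Three objective vectors on a unit circular arc centred on the reference point.
angles = [math.radians(a) for a in (15.0, 45.0, 75.0)]
points = [(math.cos(a), math.sin(a)) for a in angles]

distances = [reference_distance(p, z) for p in points]
preferred = points[distances.index(min(distances))]

# A solution that dominates the reference point yields a negative distance.
negative_case = reference_distance((-0.1, -0.2), z)
```

Here the point at 45 degrees is preferred because its worst objective offset is the smallest, which is the "smallest maximum translational distance" referred to in the text.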


Defining the Preferred Region 
As is demonstrated in Fig. 67.7, if there is no con- 
trol over the solution spread the swarm will explore 
the preferred search direction and converge to the single
solution $z'$ as dictated by the reference point z.
Maintaining a population of particles gives the designer
the possibility to explore a range of interesting
alternatives within a preferred region of the Pareto front.
The aim is, therefore, to identify a set of solutions
surrounding the intersection point $z'$. A threshold
parameter $\delta > 0$ is defined, such that a solution x is
within the preferred region if the following conditional
statement is true

$$d_z(x) < d_z(z') + \delta. \tag{67.6}$$

Figure 67.9 illustrates the preferred region for
a bi-objective problem. The extent of the solution spread is
proportional to $\delta$ and evidently as $\delta \to 0$, the designer
is interested in determining only the most preferred
solution $z'$. Conversely, as $\delta \to \infty$, the designer is
interested in determining all solutions along the Pareto
front, and thus the influence of the reference point
location diminishes.
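
The effect of the threshold in (67.6) is easy to see on a toy front. The sketch below filters a hypothetical linear bi-objective front; the front, the reference point, and the values of the spread parameter are assumptions chosen only for illustration.

```python
def preferred_region(front, reference, delta):
    """Return the members of `front` that satisfy condition (67.6)."""
    distances = [max(f - z for f, z in zip(point, reference)) for point in front]
    d_best = min(distances)  # distance of the most preferred solution z'
    return [p for p, d in zip(front, distances) if d < d_best + delta]


# Hypothetical Pareto front f2 = 1 - f1, sampled at eleven points.
front = [(i / 10.0, 1.0 - i / 10.0) for i in range(11)]
reference = (0.2, 0.2)

narrow = preferred_region(front, reference, 1e-9)   # only the most preferred point
medium = preferred_region(front, reference, 0.15)   # a band around it
wide = preferred_region(front, reference, 10.0)     # the entire front
```

With a very small spread only the most preferred solution survives, whereas a large spread returns the whole front, matching the two limits discussed above.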

Fig. 67.8 Illustration of the reference point distance for
solutions with equal Euclidean distance 


67.3.2 User-Preference Multiobjective PSO: 
UPMOPSO 


The proposed algorithm combines the searching pro- 
ficiency of PSO and the guidance of the reference 
point method. The swarm is guided by the user-defined 
reference point to confine its search to focus exclu- 
sively on the identified preferred region of the Pareto 
front. While the concept of the reference point is fairly 
intuitive, ensuring that the swarm is guided by this 
information to identify preferred solutions is less
straightforward. The algorithm is summarized in
Algorithm 67.1 and further described in the subsequent
steps.


Algorithm 67.1 The UPMOPSO algorithm
 1: OBTAIN user-defined preferences
 2: INITIALIZE swarm
 3: EVALUATE fitness and distance metric
 4: ASSIGN personal best
 5: CONSTRUCT archive
 6: t = 1
 7: repeat
 8:    SELECT global leaders
 9:    UPDATE particle velocity
10:    UPDATE particle position
11:    ADJUST boundary violation
12:    EVALUATE fitness and distance metric
13:    UPDATE personal best
14:    UPDATE archive
15:    t = t + 1
16: until t = t_max OR f_max


Fig. 67.9a,b Definition of the preferred region via the
parameter δ. (a) δ1 = 0.01, (b) δ2 = 0.001


OBTAIN User-Defined Preferences 

The designer stipulates the reference point z and the 
corresponding solution spread $\delta$ to define the location
and extent of the preferred region. For airfoil
design applications, designers can exploit the exist- 
ing domain knowledge to determine the most feasi- 
ble performance compromise for the desired operating 
conditions. 


INITIALIZE the Particles 

A swarm of N particles is required to navigate the design
space S bounded by $x_{min}$ and $x_{max}$. To safeguard
against magnitude and scaling issues, all variables are
normalized into the unit cube, such that $S = [0, 1]^n$. The
i-th particle in the swarm is characterized by the
n-dimensional vectors $x_i$ and $v_i$, which are the particle
position and velocity, respectively. These vectors are
randomly initialized within the unit cube at time t = 0.
The particle personal best position is recorded as
the particle position, such that $p_i = x_i$. The particles
are then evaluated with the objective functions and
fitness is assigned. The reference point distance metric is
computed for each particle to measure the individual
preference value.


UPDATE Archive and SELECT Global Leaders 

A secondary population of non-dominated solutions 
in the form of an elitist archive is maintained at 
time, t. The non-dominated solutions identified by 
the particles are appended to the archive. A non- 
dominated sorting procedure is applied, where all 
members pertaining to local inferior fronts are omit- 
ted. The archive serves as a mutually accessible 
memory bank for the particles of the swarm. Each 
member is a potential candidate for global leader- 
ship of the particles during the subsequent velocity 
update. 


Defining the global leaders ultimately governs the 
direction of the search. The swarm should efficiently 
navigate the design space such that the search effort is 
locally focused within the preferred region and provides 
a uniform spread of solutions. Since all members of the 
archive are mutually non-dominated, a ranking proce- 
dure is necessary to distinguish the most appropriate 
candidates for leadership from the remaining members. 
At each time step t, the most preferred solution $z'(t)$
is recorded. The subset of members $X_g(t)$ selected for
global leadership satisfy the condition of (67.6), such
that

$$x \in X_g(t) \iff d_z(x) < d_z(z'(t)) + \delta. \tag{67.7}$$


Since not every member will initially satisfy this condi- 
tion, the number of candidate leaders may fluctuate over 
time. This condition provides the necessary selection 
pressure for particles to locally focus the search effort 
within the preferred region, avoiding the unnecessary 
computational effort of exploring undesired regions of 
the design space. Each swarm particle is randomly as- 
signed a leader to promote diversity in the search. In the 
case where all non-dominated solutions satisfy the con- 
dition of (67.7), additional guidance through a crowding 
distance metric (as described in [67.27]) is provided to 
promote a uniform spread. 

As the particles are guided to converge to the pre- 
ferred region, the number of identified non-dominated 
solutions will steadily increase. To avoid this number 
escalating unnecessarily and to maintain high com- 
petitiveness within the archive, there is a restriction 
(denoted by Kmax) on the number of solutions permit- 
ted for entry. If the number of members K > Kmax, 
the newest solution is permitted entry, and the exist- 
ing least preferred member is removed. If all archive 
members exist within the preferred region, the most 
crowded solutions are removed. This ensures that solu- 
tions in densely populated regions are removed in favor 
of solutions which exploit sparsely populated regions, 
to further promote a uniform spread. 
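
The archive and leader-selection steps can be sketched as follows. This is a simplified illustration under stated assumptions: objective vectors are stored as tuples, the archive is truncated by keeping the `k_max` most preferred members, and the crowding distance tie-break of [67.27] is omitted. The function names are hypothetical.

```python
import random


def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def distance(f, z):
    """Reference point distance (67.5)."""
    return max(fi - zi for fi, zi in zip(f, z))


def update_archive(archive, candidates, z, k_max):
    """Merge candidates, keep the non-dominated set, drop least preferred members."""
    pool = archive + [c for c in candidates if c not in archive]
    front = [p for p in pool if not any(dominates(q, p) for q in pool)]
    front.sort(key=lambda p: distance(p, z))
    return front[:k_max]


def select_leaders(archive, z, delta):
    """Archive members inside the preferred region, as in condition (67.7)."""
    d_best = min(distance(p, z) for p in archive)
    return [p for p in archive if distance(p, z) < d_best + delta]


def assign_leaders(n_particles, leaders, rng):
    """Each particle receives a randomly chosen leader to promote diversity."""
    return [rng.choice(leaders) for _ in range(n_particles)]


z = (0.3, 0.3)
archive = update_archive([(0.2, 0.8), (0.5, 0.5), (0.8, 0.2)],
                         [(0.4, 0.55), (0.6, 0.6)], z, k_max=3)
leaders = select_leaders(archive, z, delta=0.1)
assigned = assign_leaders(5, leaders, random.Random(0))
```

In this example the candidate (0.6, 0.6) is rejected because it is dominated, and only the two archive members closest to the reference point in the sense of (67.5) qualify as leaders.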


UPDATE Particle Position 
The update equations of PSO adjust the position of
the i-th particle from time t to t + 1. In this algorithm,
the constriction type 1 framework of Clerc and
Kennedy [67.50] is adopted. In their studies, the authors
examined particle behavior from an eigenvalue analysis of
swarm dynamics. The velocity update of the i-th particle
is a function of acceleration components to both
the personal best position, $p_i$, and the global best
position, $p_g$. The updated velocity vector is given by the
expression

$$v_i(t+1) = \chi \left\{ v_i(t) + R_1[0, \varphi_1] \otimes (p_i(t) - x_i(t)) + R_2[0, \varphi_2] \otimes (p_g(t) - x_i(t)) \right\}. \tag{67.8}$$


The velocity update of (67.8) comprises three components,
each of which affects certain search characteristics.
The previous velocity $v_i(t)$ serves as
a memory of the previous flight direction and prevents
the particle from drastically changing direction; it is
referred to as the inertia component. The cognitive
component of the update equation, $(p_i(t) - x_i(t))$, quantifies
the performance of the i-th particle relative to past
performances. The effect of this term is that particles are
drawn back to their own best positions, which resembles
the tendency of individuals to return to situations
where they experienced most success. The social
component, $(p_g(t) - x_i(t))$, quantifies the performance of the
i-th particle relative to the global (or neighborhood) best
position. This resembles the tendency of individuals to
emulate the success of others.

The two functions $R_1[0, \varphi_1]$ and $R_2[0, \varphi_2]$ each return
a vector of uniform random numbers in the ranges $[0, \varphi_1]$
and $[0, \varphi_2]$, respectively. The constants $\varphi_1$ and $\varphi_2$ are
equal to $\varphi/2$, where $\varphi = 4.1$. This randomly affects the
magnitude of both the social and cognitive components.
The constriction factor $\chi$ applies a dampening effect on
how far the particle explores within the search space
and is given by

$$\chi = \frac{2}{\left| 2 - \varphi - \sqrt{\varphi^2 - 4\varphi} \right|}. \tag{67.9}$$

Once the particle velocity is calculated, the particle is
displaced by adding the velocity vector (over the unit
time step) to the current position,

$$x_i(t+1) = x_i(t) + v_i(t+1). \tag{67.10}$$

Particle flight should ideally be confined to the feasible 
design space. However, it may occur during flight that 
a particle involuntarily violates the boundaries of the 
design space. While it has been suggested that particles
which leave the confines of the design space should simply
be ignored [67.51], here the violated dimension is restricted
such that the particle remains within the feasible design
space without affecting the flight trajectory.
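
A minimal sketch of the update step defined by (67.8) to (67.10) is given below. The function name `update_particle` is hypothetical, and the boundary handling shown (clamping the violated coordinate to the unit cube while leaving the velocity untouched) is one possible reading of the restriction described above rather than a confirmed detail of the implementation.

```python
import math
import random

PHI = 4.1
# Constriction factor of (67.9); evaluates to roughly 0.73 for PHI = 4.1.
CHI = 2.0 / abs(2.0 - PHI - math.sqrt(PHI ** 2 - 4.0 * PHI))


def update_particle(x, v, p_best, g_best, rng):
    """Apply (67.8) and (67.10) to one particle in the unit cube."""
    phi1 = phi2 = PHI / 2.0
    new_x, new_v = [], []
    for xi, vi, pi, gi in zip(x, v, p_best, g_best):
        cognitive = rng.uniform(0.0, phi1) * (pi - xi)  # pull towards personal best
        social = rng.uniform(0.0, phi2) * (gi - xi)     # pull towards assigned leader
        vel = CHI * (vi + cognitive + social)
        pos = min(1.0, max(0.0, xi + vel))              # restrict violated dimension
        new_v.append(vel)
        new_x.append(pos)
    return new_x, new_v


rng = random.Random(1)
x_next, v_next = update_particle([0.2, 0.9], [0.5, 0.5],
                                 [0.3, 0.8], [0.6, 1.0], rng)
```

A fresh random factor is drawn for every dimension, which corresponds to the vector-valued functions $R_1$ and $R_2$ in (67.8).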


UPDATE Personal Best 
The ambiguity in updating the personal best using the 
dominance criteria lies in the treatment of the case 


when the personal best solution p;(t) is mutually non- 
dominated with the solution x;(t + 1). The introduction 
of the reference point distance metric elegantly deals 
with this ambiguity. If the particle position x;(t + 1) is 
preferred to the existing personal best p;(t), then the 
personal best is replaced. Otherwise, the personal best
remains unchanged.
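
The rule can be written in a few lines. In the sketch below the function name and argument layout are assumptions; the example uses two mutually non-dominated objective vectors to show that the distance metric still yields a clear decision.

```python
def update_personal_best(p_best, f_best, x_new, f_new, reference):
    """Replace the personal best only if the new position is preferred under (67.5)."""
    d_new = max(f - z for f, z in zip(f_new, reference))
    d_old = max(f - z for f, z in zip(f_best, reference))
    if d_new < d_old:
        return x_new, f_new
    return p_best, f_best


reference = (0.3, 0.3)
# (0.2, 0.8) and (0.5, 0.5) are mutually non-dominated, yet the latter is preferred.
replaced = update_personal_best([0.1], (0.2, 0.8), [0.7], (0.5, 0.5), reference)
retained = update_personal_best([0.7], (0.5, 0.5), [0.1], (0.2, 0.8), reference)
```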


67.3.3 Kriging Modeling 


Airfoil design optimization problems benefit from the 
construction of inexpensive surrogate models that em- 
ulate the response of exact functions. This section 
presents a novel development in the field of preference- 
based optimization. Adaptive Kriging models are in- 
corporated within the swarm framework to efficiently 
navigate design spaces restricted by a computational 
budget. The successful integration of these design tools 
is facilitated through the reference point distance met- 
ric, which provides an intuitive criterion to update the 
Kriging models during the search. 

In most engineering problems, to construct a glob- 
ally accurate surrogate of the original objective land- 
scape is improbable due to the weakly correlated design 
space. It is more common to locally update the predic- 
tion accuracy of the surrogate as the search progresses 
towards promising areas of the design space [67.40]. 
For this purpose, the Kriging method has received 
much interest, because it inherently considers confi- 
dence intervals of the predicted outputs. For a complete 
derivation of the Kriging method, readers are encour- 
aged to follow the work of Jones [67.41] and Forrester 
et al. [67.40]. We provide a very brief introduction to 
the ordinary Kriging method, which expresses the
unknown function $y(\mathbf{x})$ as

$$y(\mathbf{x}) = \beta + z(\mathbf{x}), \tag{67.11}$$

where $\mathbf{x} = [x_1, \dots, x_n]$ is the data location, $\beta$ is
a constant global mean value, and $z(\mathbf{x})$ represents a local
deviation at the data location $\mathbf{x}$ based on a stochastic
process with zero mean and variance $\sigma^2$ following the
Gaussian distribution. The approximation $\hat{y}(\mathbf{x})$ is
obtained from

$$\hat{y}(\mathbf{x}) = \hat{\beta} + \mathbf{r}^T \mathbf{R}^{-1} (\mathbf{Y} - \mathbf{1}\hat{\beta}), \tag{67.12}$$

where $\hat{\beta}$ is the approximation of $\beta$, $\mathbf{R}$ is the
correlation matrix, $\mathbf{r}$ is the correlation vector, $\mathbf{Y}$ is the training
dataset of N observed samples at location $\mathbf{X}$, and $\mathbf{1}$ is
a column vector of N elements of 1. The correlation
matrix is a modification of the Gaussian basis function,

$$R(\mathbf{x}^i, \mathbf{x}^j) = \exp\left( -\sum_{k=1}^{n} \theta_k \left| x_k^i - x_k^j \right|^2 \right), \tag{67.13}$$

where $\theta_k > 0$ is the k-th element of the correlation
parameter $\boldsymbol{\theta}$. Following the work of Jones [67.41], the
correlation parameter $\boldsymbol{\theta}$ (and hence the approximations
$\hat{\beta}$ and $\hat{\sigma}^2$) are estimated by maximizing the
concentrated ln-likelihood of the dataset $\mathbf{Y}$, which is an
n-variable single-objective optimization problem, solved
using a pattern search method. The accuracy of the
prediction $\hat{y}$ at the unobserved location $\mathbf{x}$ depends on the
correlation distance with sample points $\mathbf{X}$. The closer
the location of $\mathbf{x}$ to the sample points, the more
confidence in the prediction $\hat{y}(\mathbf{x})$. The measure of uncertainty
in the prediction is estimated as

$$\hat{s}^2(\mathbf{x}) = \hat{\sigma}^2 \left[ 1 - \mathbf{r}^T \mathbf{R}^{-1} \mathbf{r} + \frac{\left( 1 - \mathbf{1}^T \mathbf{R}^{-1} \mathbf{r} \right)^2}{\mathbf{1}^T \mathbf{R}^{-1} \mathbf{1}} \right]. \tag{67.14}$$

If $\mathbf{x} \in \mathbf{X}$, it is observed from (67.14) that $\hat{s}(\mathbf{x})$ reduces
to zero.
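
The predictor (67.12) and its uncertainty estimate (67.14) can be sketched in plain Python as follows. To keep the example short, the correlation parameter is fixed by hand rather than estimated by maximizing the concentrated ln-likelihood, and the closed-form estimates used for the mean and variance are the standard ordinary Kriging expressions; the helper names, the one-dimensional test data, and the value of `theta` are all assumptions.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def corr(a, b, theta):
    """Gaussian correlation of (67.13)."""
    return math.exp(-sum(t * (ai - bi) ** 2 for t, ai, bi in zip(theta, a, b)))


def fit_kriging(X, Y, theta):
    """Build an ordinary Kriging model for a fixed correlation parameter theta."""
    n = len(X)
    R = [[corr(X[i], X[j], theta) for j in range(n)] for i in range(n)]
    Ri_1 = solve(R, [1.0] * n)               # R^-1 1
    beta = sum(solve(R, Y)) / sum(Ri_1)      # (1^T R^-1 Y) / (1^T R^-1 1)
    resid = [y - beta for y in Y]
    Ri_res = solve(R, resid)                 # R^-1 (Y - 1 beta)
    sigma2 = sum(r * w for r, w in zip(resid, Ri_res)) / n
    return {"X": X, "theta": theta, "R": R, "beta": beta,
            "sigma2": sigma2, "Ri_res": Ri_res, "Ri_1": Ri_1}


def predict(model, x):
    """Return the prediction (67.12) and the variance estimate (67.14) at x."""
    r = [corr(x, xi, model["theta"]) for xi in model["X"]]
    y_hat = model["beta"] + sum(ri * wi for ri, wi in zip(r, model["Ri_res"]))
    Ri_r = solve(model["R"], r)              # R^-1 r
    r_R_r = sum(a * b for a, b in zip(r, Ri_r))
    s2 = model["sigma2"] * (1.0 - r_R_r
                            + (1.0 - sum(Ri_r)) ** 2 / sum(model["Ri_1"]))
    return y_hat, max(s2, 0.0)               # guard against round-off below zero


X = [[0.0], [0.3], [0.6], [1.0]]
Y = [math.sin(3.0 * x[0]) + 0.5 * x[0] for x in X]
model = fit_kriging(X, Y, theta=[10.0])
y_mid, s2_mid = predict(model, [0.45])
```

At the sample locations the predictor returns the observed values and the variance estimate collapses to zero, whereas between samples the variance is positive.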


67.3.4 Reference Point Screening Criterion 


Training a Kriging model from a training dataset is time
consuming and is of the order $O(N^3)$. Stratified
sampling using a maximin Latin hypercube (LHS) [67.52]
is used to construct a global Kriging approximation
from the dataset [X, Y]. The non-dominated subset of Y is then stored
within the elitist archive. This ensures that candidates 
for global leadership have been precisely evaluated (or 
with negligible prediction error) and, therefore, offer no 
false guidance to other particles. Adopting the concept 
of individual-based control [67.42], Kriging predictions 
are then used to pre-screen each candidate particle after
the population update (or after mutation) and
subsequently flag them for precise evaluation or rejection.
The Kriging model estimates a lower-confidence bound
for the objective array as

$$[\hat{f}_1(x), \dots, \hat{f}_m(x)] = [\{\hat{y}_1(x) - \omega \hat{s}_1(x)\}, \dots, \{\hat{y}_m(x) - \omega \hat{s}_m(x)\}], \tag{67.15}$$

where $\omega = 2$ provides a 97% probability for $\hat{f}(x)$ to
be a lower bound of $f(x)$. An approximation to
the reference point distance, $\hat{d}_z(x)$, can thus be obtained
using (67.5). This value, whilst providing a means of
ranking each solution as a single scalar, also gives an
estimate of the improvement that is expected from the
solution. At time t, the distance of the archive member
with the highest ranking according to (67.5) is recorded
as $d_{min}$. The candidate x may then be accepted for precise
evaluation, and subsequent admission into the archive, if
$\hat{d}_z(x) < d_{min}$. Particles will thus be attracted towards the
areas of the design space that provide the greatest
resemblance to z, and the direction of the search will remain
consistent.
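
The screening rule can be sketched as follows. The function names and the numerical values are illustrative assumptions; the coverage figure is computed from the standard normal distribution, under which a bound two standard deviations below the mean is exceeded with a one-sided probability of roughly 0.977.

```python
import math

OMEGA = 2.0

# One-sided Gaussian coverage associated with OMEGA standard deviations.
coverage = 0.5 * (1.0 + math.erf(OMEGA / math.sqrt(2.0)))


def lower_confidence_bound(y_hat, s_hat, omega=OMEGA):
    """Objective-wise lower bound of (67.15)."""
    return [y - omega * s for y, s in zip(y_hat, s_hat)]


def flag_for_evaluation(y_hat, s_hat, reference, d_min, omega=OMEGA):
    """Return (flag, d_hat): flag is True if the candidate merits precise evaluation."""
    f_lb = lower_confidence_bound(y_hat, s_hat, omega)
    d_hat = max(f - z for f, z in zip(f_lb, reference))  # (67.5) on the bound
    return d_hat < d_min, d_hat


reference = (0.3, 0.3)
d_min = 0.15  # distance of the best archive member (hypothetical value)

# Uncertain prediction: the optimistic bound suggests a possible improvement.
uncertain = flag_for_evaluation([0.5, 0.6], [0.05, 0.1], reference, d_min)
# Same prediction with no uncertainty: no improvement is expected.
certain = flag_for_evaluation([0.5, 0.6], [0.0, 0.0], reference, d_min)
```

The two calls illustrate why many particles are flagged early in the search, when the predicted uncertainty is large, and progressively fewer as the surrogate becomes more accurate.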

As the search begins in the explorative phase and the 
prediction accuracy of the surrogate model(s) is low, de- 
pending on the deceptivity of the objective landscape(s) 
there will initially be a large percentage of the swarm 
that is flagged for precise evaluation. Subsequently, 
as the particles begin to identify the preferred region 
and the prediction accuracy of the surrogate model(s) 
gradually increases, the screening criterion becomes 
increasingly difficult to satisfy, thereby reducing the 
number of flagged particles at each time step. To
restrict saturation of the dataset used to train the Kriging
models, a limit of N = 200 sample points is imposed,
beyond which the lowest ranked solutions according to
(67.5) are removed.


67.4 Case Study: Airfoil Shape Optimization 


The parameterization method and transonic flow solver 
described in the preceding section are now integrated 
within the Kriging-assisted UPMOPSO algorithm for 
an efficient airfoil design framework. The framework is 
applied to the re-design of the NASA-SC(2)0410 airfoil 
for robust aerodynamic performance. A three-objective 
constrained optimization problem is formulated, with 
$f_1 = C_d$ and $f_2 = -C_m$ for $M = 0.79$, $C_l = 0.4$, and
$f_3 = \partial C_d / \partial M$ for the design range $M = [0.79, 0.82]$,
$C_l = 0.4$. The lift constraint is satisfied internally within
the solver, by allowing Fluent to determine the angle
of incidence required. A constraint is imposed on
the allowable thickness, which is defined through the
parameter ranges (see Table 67.2) as approximately
9.75% chord. The reference point is logically selected
as the NASA-SC(2)0410, in an attempt to improve on 
the performance characteristics of the airfoil, whilst still 
maintaining a similar level of compromise between the 
design objectives. The solution variance is controlled 
by $\delta = 5 \times 10^{-3}$.

The design application is segregated into three
phases: pre-optimization and variable screening;
optimization; and post-optimization and trade-off
screening.


67.4.1 Pre-Optimization 
and Variable Screening 


Global Kriging models are constructed for the aerodynamic
coefficients from a stratified sample of N = 100 design points
based on a Latin hypercube design. This sampling plan size is
considered sufficient to give confidence in the results of the
subsequent design variable screening analysis. Whilst a larger
sampling plan would be needed to obtain accurate correlations,
the interest here is only to quantify the elementary effect of
each variable on the objective landscapes. The global Kriging
models are initially trained via cross-validation. The
cross-validation curves for the Kriging models are illustrated
in Fig. 67.10. The subscripts to the aerodynamic coefficients
refer to the respective operating Mach number.
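The stratified sampling plan mentioned above can be illustrated with a minimal Latin hypercube sketch. This is not the authors' implementation: the function name `latin_hypercube`, the `bounds` argument and the fixed seed are assumptions made for illustration, and no space-filling optimization of the plan is attempted.

```python
import random


def latin_hypercube(n_points, bounds, seed=0):
    """Return n_points samples inside the box given by bounds.

    Each variable range is split into n_points equal strata and every
    stratum is used exactly once per variable (hypothetical helper).
    """
    rng = random.Random(seed)
    columns = []
    for lo, hi in bounds:
        strata = list(range(n_points))
        rng.shuffle(strata)  # random pairing of strata across variables
        width = (hi - lo) / n_points
        # one uniformly random point inside each stratum
        columns.append([lo + (s + rng.random()) * width for s in strata])
    return [list(point) for point in zip(*columns)]


# Example: 100 design points for two normalized design variables.
plan = latin_hypercube(100, [(0.0, 1.0), (0.0, 1.0)])
```

Each row of `plan` is one design point that would then be passed to the precise solver to build the initial Kriging dataset.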

It is observed in Fig. 67.10 that the Kriging mod- 
els constructed for the aerodynamic coefficients are able 
to reproduce the training samples with sufficient confi-
dence, recording error margin values between 2 and 4%.
It is hence concluded that the Kriging method is very 
adept at modeling complex landscapes represented by 
a limited number of precise observations. 

To investigate the elementary effect of each de- 
sign variable on the metamodeled objective landscapes, 
we present a quantitative design space visualization 
technique. A popular method for designing prelimi- 
nary experiments for design space visualization is the 
screening method developed by Morris [67.53]. This algorithm
calculates the elementary effect of a variable $x_i$ and
classifies its influence on the objective space $f$ as:


a) Negligible

b) Linear and additive

c) Nonlinear

d) Nonlinear and/or involved in interactions with other variables $x_j$.


Table 67.2 NASA-SC(2)0410 airfoil results for the formulated objectives

Airfoil           Mach number, M   $f_1$      $f_2$    $f_3$
NASA-SC(2)0410    0.79             0.008708   0.1024   0.189625


Fig. 67.10a-c Cross-validation curves for the constructed
Kriging models. (a) Training sample for $C_d$ at $M = 0.79$.
(b) Training sample for $C_m$ at $M = 0.79$. (c) Training
sample for $C_d$ at $M = 0.82$


In plain terminology, the Morris algorithm measures the
sensitivity of the objective landscape $f$ to the $i$-th variable.
For a detailed discussion of the Morris algorithm the reader is
referred to Forrester et al. [67.40] and Campolongo et al. [67.54].
Presented here are the results of the variable screening analysis
using the Morris algorithm for the proposed design application.
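To make the idea of an elementary effect concrete, the following sketch estimates it by one-at-a-time perturbations from random base points. It is a simplified stand-in for the Morris design, not the procedure of [67.53]: the function name, the step `delta`, the unit-cube scaling and the toy objective are all assumptions for illustration. A small mean absolute effect suggests a negligible variable, a large mean with near-zero spread suggests a linear and additive one, and a large spread points to nonlinearity and/or interactions.

```python
import random
import statistics


def elementary_effects(f, n_vars, n_repeats=20, delta=0.25, seed=0):
    """Return (mean |effect|, std of effect) per variable on the unit cube."""
    rng = random.Random(seed)
    effects = [[] for _ in range(n_vars)]
    for _ in range(n_repeats):
        # keep base + delta inside [0, 1]
        base = [rng.uniform(0.0, 1.0 - delta) for _ in range(n_vars)]
        f_base = f(base)
        for i in range(n_vars):
            stepped = list(base)
            stepped[i] += delta  # perturb one variable at a time
            effects[i].append((f(stepped) - f_base) / delta)
    summary = []
    for ee in effects:
        mean_abs = statistics.fmean([abs(e) for e in ee])
        spread = statistics.pstdev(ee)
        summary.append((mean_abs, spread))
    return summary


# Toy objective (assumption): x0 linear, x1 nonlinear, x2 inactive.
def toy(x):
    return 2.0 * x[0] + x[1] ** 2 + 0.0 * x[2]


screening = elementary_effects(toy, n_vars=3)
```

In the chapter's setting `f` would be a Kriging prediction of an aerodynamic coefficient rather than an analytic toy function.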

Figure 67.11 shows the results obtained from the design variable
screening study. It is immediately observed that the upper
thickness coordinates have a relatively large influence on the
drag coefficient for both design conditions. At higher Mach
numbers the effect of the lower surface curvature $z_{xx,lo}$ is
also significant. It is demonstrated, however, that the variables
$z_{xx,lo}$ and $\alpha_{TE}$ have the largest effect on the moment
coefficient; these variables directly influence the aft curvature
(and hence the aft camber) of the airfoil. They shift the loading
on the airfoil forward and aft, resulting in highly fluctuating
moment values.

Similar deductions can be made by examining the variable
influence on $d_z$, shown in Fig. 67.11d. The variable influence
on $d_z$ is case specific and entirely dependent on the reference
point chosen for the proposed optimization study. Since the value
of $d_z$ is a means of ranking the success of a multiobjective
solution as a single scalar, variables may be ranked by influence,
which is otherwise not possible when considering a multiobjective
array. Preliminary conclusions on the priority weighting of the
objectives in the reference point compromise can also be made. It
is observed that the variable influence on $d_z$ most closely
resembles the plots of the drag coefficients $C_{d_{79}}$ and
$C_{d_{82}}$, suggesting that the moment coefficient is of least
priority for the preferred compromise. It is interesting to see
that the trailing edge modification variable $\Delta\alpha_{TE}$ is
of particular importance for all design coefficients, which
validates its inclusion in the subsequent optimization study.

Fig. 67.11a-d Variable influence on aerodynamic coefficients
(subscripts refer to the operating Mach number). (a) Drag
$C_{d_{79}}$. (b) Moment $C_{m_{79}}$. (c) Drag $C_{d_{82}}$. (d) $d_z$


67.4.2 Optimization Results 


A swarm population of $N_s = 100$ particles is flown to solve the
optimization problem. The objective space is normalized for the
computation of the reference point distance by $f_{max} - f_{min}$.
Instead of specifying a maximum number of time steps,
a computational budget of 250 evaluations is imposed. A stratified
sample of N = 100 design points using an LHS methodology was used
to construct the initial global Kriging approximations for each
objective. A further 150 precise updates were performed over
$t \approx 100$ time steps until the computational budget was
exhausted. As is shown in Fig. 67.12a, the largest number of
update points was recorded during the initial explorative phase.
As the preferred region becomes populated and $s \to 0$, the
algorithm triggers exploitation, and the number of update points
steadily reduces.

The UPMOPSO algorithm has proven to be very capable for this
specific problem. Figure 67.12b features the progress of the
highest ranked solution (i.e., $d_{min}$) as the number of precise
evaluations increases. The reference point criterion is shown to
be proficient in filtering out poorer solutions during
exploration, since only 50 update evaluations are required to come
within 15% of $d_{min}$, and a further 50 evaluations to come within
3%. Furthermore, no needless evaluations as a result of the
lower-bound prediction are performed during the exploitation
phase. This conclusion is further supported by Fig. 67.13a, as
a distinct attraction to the preferred region is clearly visible.
A total of 30 non-dominated solutions were identified in the
preferred region, which are shown in Fig. 67.13b.

Fig. 67.12a,b UPMOPSO performance for transonic airfoil shape
optimization. (a) History of precise updates. (b) Progress of
most preferred solution


67.4.3 Post-Optimization 
and Trade-Off Visualization 


The reference point distance also provides a feasible means of
selecting the most appropriate solutions. For example, solutions
may be ranked according to how well they represent the reference
point compromise. To illustrate this concept, self-organizing maps
(SOMs) [67.44] are utilized to visualize the interaction of the
objectives with the reference point compromise. SOM clustering is
based on unsupervised artificial neural networks that can
classify, organize, and visualize large sets of data by mapping
them from a high-dimensional to a low-dimensional space [67.45].
A neuron used in this SOM analysis is associated with the weighted
vector of m inputs. Each neuron is connected to its adjacent
neurons by a neighborhood relation and forms a two-dimensional
hexagonal topology [67.45]. The SOM learning algorithm will
attempt to increase the correlation between neighboring neurons to
provide a global representation of all solutions and their
corresponding resemblance to the reference point compromise.

Fig. 67.13a,b Precise evaluations performed and the resulting
non-dominated solutions. (a) Scatter plot of all precise
evaluations. (b) Preferred region after 250 evaluations

Each input objective acts as a neuron to the SOM. The
corresponding output measures the reference point distance (i.e.,
the resemblance to the reference point compromise).
A two-dimensional representation of the data is presented in
Fig. 67.14, organized by six SOM-Ward clusters. Solutions that
yield negative $d_z$ values indicate success in the improvement
over each aspiration value. Solutions with positive $d_z$ values do
not surpass each aspiration value but provide significant
improvement in at least one other objective. Each of the node
values represents one possible Pareto-optimal solution that the
designer may select. The SOM chart colored by $d_z$ is a measure of
how far a solution deviates from the preferred compromise.
However, the concept of the preferred region ensures that only
solutions that slightly deviate from the compromise dictated by z
are identified. Following the SOM charts, it is possible to
visualize the preferred compromise between the design objectives
that is obtained. The chart of $d_z$ closely follows the $f_1$
chart, which suggests that this objective has the highest
priority. If the designer were slightly more inclined towards
another specific design objective, then solutions that place more
emphasis on the other objectives should be considered. In this
study, the most preferred solution is selected as the highest
ranked solution according to (67.5).

Fig. 67.14 SOM charts to visualize optimal trade-offs between the
design objectives. (a) $f_1$, (b) $f_2$, (c) $f_3$, (d) $d_z$

Table 67.3 Preferred airfoil objective values with measure of
improvement

Airfoil            $f_1$      $f_2$    $f_3$
NASA-SC(2)0410     0.008708   0.1024   0.189625
Preferred design   0.008106   0.0933   0.168809
% Improvement      6.9        8.8      10.9


67.4.4 Final Designs 


Table 67.3 shows the objective comparisons with the
NASA-SC(2)0410. Of interest to note is that the most active
objective is $f_1$, since the solution which provides the minimum
$d_z$ value also provides the minimum $f_1$ value. This implies that
the reference point was situated near the $f_1$ Pareto boundary. Of
the identified set of Pareto-optimal solutions, the largest
improvements obtained in objectives $f_2$ and $f_3$ are 36.4 and
91.6%, respectively, over the reference point. The preferred
airfoil geometry is shown in Fig. 67.15a in comparison with the
NASA-SC(2)0410. The preferred airfoil has a thickness of 9.76%
chord and maintains a moderate curvature over the upper surface.
A relatively small aft curvature is used to generate the required
lift, whilst reducing the pitching moment.

Fig. 67.15a,b The most preferred solution observed by the UPMOPSO
algorithm. (a) Preferred airfoil geometry. (b) $C_p$ distributions
for $M = 0.79$, $C_l = 0.4$

Fig. 67.16a,b Pressure contours for design condition of
$M = 0.79$, $C_l = 0.4$. (a) NASA-SC(2)0410, (b) Preferred airfoil

Fig. 67.17 Drag rise curves for $C_l = 0.4$

Performance comparisons between the NASA-SC(2)0410 and the
preferred airfoil at the design condition of $M = 0.79$ can be made
from the static pressure contour output in Fig. 67.16, and the
surface pressure distribution of Fig. 67.15b. The reduction in
$C_d$ is attributed to the significantly weaker shock that appears
slightly upstream of the supercritical shock position. The
reduction in the pitching moment is clearly visible from the
reduced aft loading. Along with the improvement at the required
design condition, the preferred airfoil exhibits a lower drag rise
by comparison, as is shown in Fig. 67.17. There is a notable
decline in the drag rise at the design condition of $M = 0.79$, and
the drag is recorded as lower than that of the NASA-SC(2)0410,
even beyond the design range. Also visible is the solution that
provides the most robust design (i.e., min $f_3$). The most robust
design is clearly not obtained at the expense of poor performance
at the design condition, due to the compromising influence of z.
If the designer were interested in obtaining further alternative
solutions which provide greater improvement in either objective,
it would be sufficient to re-commence (at the current time step)
the optimization process with a larger value of $\delta$, or by
relaxing one or more of the aspiration values, $z_i$.


67.5 Conclusion


In this chapter, an optimization framework has been 
introduced and applied to the aerodynamic design of 
transonic airfoils. A  surrogate-driven multiobjective 
particle swarm optimization algorithm is applied to nav- 
igate the design space to identify and exploit preferred 
regions of the Pareto frontier. The integration of all 
components of the optimization framework is entirely 
achieved through the use of a reference point distance 
metric which provides a scalar measure of the preferred 
interests of the designer. This effectively allows for the 
scale of the design space to be reduced, confining it to 
the interests reflected by the designer. 

The aim of the developmental effort reported here is to reduce
the often prohibitive computational cost of multiobjective
optimization to the level of practical affordability in
computational aerodynamic design. A concise parameterization model
was considered to perform the necessary shape modifications in
conjunction with a Reynolds-averaged Navier-Stokes flow solver.
Kriging models were constructed based on a stratified sample of
the design space. A pre-optimization visualization tool was then
applied to


68. Ant Colony Optimization for the Minimum-Weight
Rooted Arborescence Problem


Christian Blum, Sergi Mateo Bellido


68.1 Introductory Remarks


Solving combinatorial optimization problems with approaches from
the swarm intelligence field already has a considerably long
tradition. Examples of such approaches include particle swarm
optimization (PSO) [68.1] and artificial bee colony (ABC)
optimization [68.2]. The oldest and most widely used algorithm
from this field, however, is ant colony optimization (ACO) [68.3].
In general, the ACO metaheuristic attempts to solve
a combinatorial optimization problem by iterating the following
steps: (1) solutions to the problem at hand are constructed using
a pheromone model, that is, a parameterized probability
distribution over the space of all valid solutions, and (2) (some
of) these solutions are used to change the pheromone values in
a way that aims at biasing subsequent sampling toward areas of the
search space containing high-quality solutions. In particular, the
reinforcement of solution components depending on the quality of
the solutions in which they appear is an important aspect of ACO
algorithms. It is implicitly assumed that good solutions consist
of good solution components. Learning which components contribute
to good solutions most often helps in assembling them into better
solutions.

In this chapter, ACO is applied to solve the minimum-weight
rooted arborescence (MWRA) problem, which has applications in
computer vision such as, for example, the automated reconstruction
of consistent tree structures from noisy images [68.4]. The
structure of this chapter is as follows. Section 68.2 provides
a detailed description of the problem to be tackled. Then, in
Sect. 68.3 a new heuristic for the MWRA problem is presented which
is based on the deterministic construction of an arborescence of
maximal size, and the subsequent application of dynamic
programming (DP) for finding the best solution within this
constructed arborescence. The second contribution is to be found
in the application of ACO [68.3] to the MWRA problem. This
algorithm is described in Sect. 68.4. Finally, in Sect. 68.5 an
exhaustive experimental evaluation of both algorithms in
comparison with an existing heuristic from the literature [68.5]
is presented. The chapter is concluded in Sect. 68.6.


68.2 The Minimum-Weight Rooted Arborescence Problem


As mentioned before, in this work we consider the MWRA problem,
which is a generalization of the problem proposed by Venkata Rao
and Sridharan in [68.5, 6]. The MWRA problem can technically be
described as follows. Given is a directed acyclic graph
$G = (V, A)$ with integer weights on the arcs, that is, for each
$a \in A$ there exists a corresponding weight $w(a) \in \mathbb{Z}$.
Moreover, a vertex $v_r \in V$ is designated as the root vertex.
Let $\mathcal{A}$ be the set of all arborescences in $G$ that are
rooted in $v_r$. In this context, note that an arborescence is
a directed, rooted tree in which all arcs point away from the root
vertex (see also [68.7]). Moreover, note that $\mathcal{A}$
contains all arborescences, not only those with maximal size. The
objective function value (that is, the weight) $f(T)$ of an
arborescence $T \in \mathcal{A}$ is defined as follows:

$$f(T) := \sum_{a \in T} w(a) . \qquad (68.1)$$

The goal of the MWRA problem is to find an arborescence
$T^* \in \mathcal{A}$ such that the weight of $T^*$ is smaller than
or equal to that of all other arborescences in $\mathcal{A}$. In
other words, the goal is to minimize objective function $f(\cdot)$.
An example of the MWRA problem is shown in Fig. 68.1.

Fig. 68.1a,b (a) An input DAG with eight vertices and 14 arcs. The
uppermost vertex is the root vertex $v_r$. (b) The optimal
solution (value: -19), that is, the arborescence rooted in $v_r$
which has the minimum weight among all arborescences rooted in
$v_r$ that can be found in the input graph

Fig. 68.2a,b (a) A 2D image of the retina of a human eye. The
problem consists in the automatic reconstruction (or delineation)
of the vascular structure. (b) The reconstruction of the vascular
structure as produced by the algorithm proposed in [68.4]
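The definition of (68.1) can be checked on a small hand-made instance. The sketch below is purely illustrative: the graph, the helper names `weight`, `is_arborescence` and `brute_force_optimum`, and the arc-dictionary representation are assumptions, and the exhaustive enumeration is only viable for a handful of arcs. Note that the arborescence consisting of the root alone has weight 0, so the optimum is never positive.

```python
from itertools import combinations


def weight(tree_arcs, w):
    """Objective (68.1): sum of the weights of the arcs in the tree."""
    return sum(w[a] for a in tree_arcs)


def is_arborescence(tree_arcs, root):
    """True if the arcs form a tree rooted in root, all arcs pointing away."""
    heads = [v for _, v in tree_arcs]
    if root in heads or len(heads) != len(set(heads)):
        return False  # root has an incoming arc, or a vertex has two
    reached, remaining = {root}, set(tree_arcs)
    progress = True
    while remaining and progress:
        progress = False
        for u, v in list(remaining):
            if u in reached:
                reached.add(v)
                remaining.discard((u, v))
                progress = True
    return not remaining  # every arc must hang off the root


def brute_force_optimum(w, root):
    """Enumerate all arc subsets; only sensible for tiny graphs."""
    best = (0, ())  # the root alone is a valid arborescence of weight 0
    arcs = list(w)
    for k in range(1, len(arcs) + 1):
        for subset in combinations(arcs, k):
            if is_arborescence(subset, root):
                best = min(best, (weight(subset, w), subset))
    return best


# Hypothetical toy DAG with root "r".
w = {("r", "a"): 3, ("a", "b"): -5, ("r", "b"): 1,
     ("b", "c"): -2, ("a", "c"): 4}
best_value, best_arcs = brute_force_optimum(w, "r")
```

On this toy graph the best arborescence accepts the positive arc from `r` to `a` in order to reach the two negative arcs behind it.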

The differences to the problem proposed in [68.5] are as follows.
The authors of [68.5] require the root vertex $v_r$ to have only
one single outgoing arc. Moreover, numbering the vertices from
1 to $|V|$, the given acyclic graph $G$ is restricted to contain
only arcs $a_{i,j}$ such that $i < j$. These restrictions do not
apply to the MWRA problem. Nevertheless, as a generalization of
the problem proposed in [68.5], the MWRA problem is NP-hard.
Concerning the existing work, the literature only offers the
heuristic proposed in [68.5], which can also be applied to the
more general MWRA problem.

The definition of the MWRA problem as previously outlined is
inspired by a novel method which was recently proposed in [68.4]
for the automated reconstruction of consistent tree structures
from noisy images, which is an important problem, for example, in
neuroscience. Tree-like structures, such as dendritic, vascular,
or bronchial networks, are pervasive in biological systems.
Examples are 2D retinal fundus images and 3D optical micrographs
of neurons. The approach proposed in [68.4] builds a set of
candidate arborescences over many different subsets of points
likely to belong to the optimal delineation and then chooses the
best one according to a global objective function that combines
image evidence with geometric priors (Fig. 68.2, for example). The
solution of the MWRA problem (with additional hard and soft
constraints) plays an important role in this process. Therefore,
developing better algorithms for the MWRA problem may help in
composing better techniques for the problem of the automated
reconstruction of consistent tree structures from noisy images.


68.3 DP-HEUR: A Heuristic Approach to the MWRA Problem


In this section, we propose a new heuristic approach for solving
the MWRA problem. First, starting from the root vertex $v_r$,
a spanning arborescence $T'$ in $G$ is constructed as outlined in
lines 2-9 of Algorithm 68.1. Second, a DP algorithm is applied to
$T'$ in order to obtain the minimum-weight arborescence $T$ that is
contained in $T'$ and rooted in $v_r$. The DP algorithm
from [68.8] is used for this purpose. Given an undirected tree
$T = (V_T, E_T)$ with vertex and/or edge weights, and any integer
number $k \in [0, |V_T| - 1]$, this DP algorithm provides, among
all trees with exactly $k$ edges in $T$, the minimum-weight tree
$T^*$. The first step of the DP algorithm consists in artificially
converting the input tree $T$ into a rooted arborescence.
Therefore, the DP algorithm can directly be applied to
arborescences. Moreover, as a side product, the DP algorithm also
provides the minimum-weight arborescences for all $l$ with
$0 \le l \le k$, as well as the minimum-weight arborescences rooted
in $v_r$ for all $l$ with $0 \le l \le k$. Therefore, given an
arborescence of maximal size $T'$, which has $|V| - 1$ arcs (where
$V$ is the vertex set of the input graph $G$), the DP algorithm is
applied with $k = |V| - 1$. Then, among all the minimum-weight
arborescences rooted in $v_r$ for $l \le |V| - 1$, the one with
minimum weight is chosen as the output of the DP algorithm. In
this way, the DP algorithm is able to generate the minimum-weight
arborescence $T$ (rooted in $v_r$) which can be found in
arborescence $T'$. The heuristic described above is henceforth
labeled DP-HEUR. As a final remark, let us mention that for the
description of this heuristic, it was assumed that the input graph
is connected. Appropriate changes have to be applied to the
description of the heuristic if this is not the case.


Algorithm 68.1 Heuristic DP-HEUR for the MWRA problem

 1: input: a DAG $G = (V, A)$, and a root node $v_r$
 2: $T'_0 := (V'_0 = \{v_r\}, A'_0 = \emptyset)$
 3: $A_{pos} := \{a = (v_q, v_l) \in A \mid v_q \in V'_0, v_l \notin V'_0\}$
 4: for $i = 1, \ldots, |V| - 1$ do
 5:    $a^* = (v_q, v_l) := \operatorname{argmin}\{w(a) \mid a \in A_{pos}\}$
 6:    $A'_i := A'_{i-1} \cup \{a^*\}$
 7:    $V'_i := V'_{i-1} \cup \{v_l\}$
 8:    $T'_i := (V'_i, A'_i)$
 9:    $A_{pos} := \{a = (v_q, v_l) \in A \mid v_q \in V'_i, v_l \notin V'_i\}$
10: end for
11: $T :=$ Dynamic_Programming($T'_{|V|-1}$, $k = |V| - 1$)
12: output: arborescence $T$
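The two stages of DP-HEUR can be sketched in a few lines. The greedy stage follows lines 2-9 of Algorithm 68.1. For the second stage, the sketch deliberately swaps in a simpler bottom-up recursion instead of the cardinality-indexed DP of [68.8]: because any number of arcs is allowed, it is enough to keep a child branch only when doing so lowers the weight. Function names and the toy graph are assumptions for illustration.

```python
def greedy_spanning_arborescence(vertices, w, root):
    """Lines 2-9 of Algorithm 68.1: repeatedly add the cheapest arc
    that leads from the current tree to a vertex outside of it."""
    in_tree = {root}
    children = {v: [] for v in vertices}
    while len(in_tree) < len(vertices):
        candidates = [a for a in w
                      if a[0] in in_tree and a[1] not in in_tree]
        if not candidates:
            break  # some vertices are unreachable from the root
        u, v = min(candidates, key=lambda a: w[a])
        children[u].append(v)
        in_tree.add(v)
    return children


def best_rooted_subtree(children, w, root):
    """Return (weight, arcs) of the minimum-weight subtree of the
    spanning arborescence that contains the root."""
    def solve(u):
        total, arcs = 0, []
        for v in children[u]:
            sub_weight, sub_arcs = solve(v)
            gain = w[(u, v)] + sub_weight
            if gain < 0:  # keep the branch only if it lowers the weight
                total += gain
                arcs += [(u, v)] + sub_arcs
        return total, arcs
    return solve(root)


# Hypothetical toy DAG with root "r".
w = {("r", "a"): 3, ("a", "b"): -5, ("r", "b"): 1,
     ("b", "c"): -2, ("a", "c"): 4}
tree = greedy_spanning_arborescence(["r", "a", "b", "c"], w, "r")
value, arcs = best_rooted_subtree(tree, w, "r")
```

On this toy graph the greedy stage picks the cheap arc from `r` to `b` first, which locks out the arc of weight -5 from `a` to `b`; the arborescence r, a, b, c of weight -4 is therefore missed, which shows why the procedure is a heuristic and not an exact method. The recursion is also depth-limited by the Python call stack, so an iterative post-order traversal would be needed for large instances.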


68.4 Ant Colony Optimization for the MWRA Problem 


The ACO approach for the MWRA problem which is described in the
following is a MAX-MIN Ant System (MMAS) [68.9] implemented in the
hyper-cube framework (HCF) [68.10]. The algorithm, whose
pseudocode can be found in Algorithm 68.2, works roughly as
follows. At each iteration, a number of $n_a$ solutions to the
problem is probabilistically constructed based on both pheromone
and heuristic information. The second algorithmic component which
is executed at each iteration is the pheromone update. Hereby,
some of the constructed solutions (that is, the iteration-best
solution $T^{ib}$, the restart-best solution $T^{rb}$, and the
best-so-far solution $T^{bs}$) are used for a modification of the
pheromone values. This is done with the goal of focusing the
search over time on high-quality areas of the search space. Just
like any other MMAS algorithm, our approach employs restarts
consisting of a re-initialization of the pheromone values.
Restarts are controlled by the so-called convergence factor (cf)
and a Boolean control variable called bs_update. The main
functions of our approach are outlined in detail in the following.


Algorithm 68.2 Ant Colony Optimization for the MWRA Problem

 1: input: a DAG $G = (V, A)$, and a root node $v_r$
 2: $T^{bs} := (\{v_r\}, \emptyset)$, $T^{rb} := (\{v_r\}, \emptyset)$, cf := 0, bs_update := false
 3: $\tau_a := 0.5$ for all $a \in A$
 4: while termination conditions not met do
 5:    $S := \emptyset$
 6:    for $i = 1, \ldots, n_a$ do
 7:       $T_i :=$ Construct_Solution($G$, $v_r$)
 8:       $S := S \cup \{T_i\}$
 9:    end for
10:    $T^{ib} := \operatorname{argmin}\{f(T) \mid T \in S\}$
11:    if $f(T^{ib}) < f(T^{rb})$ then $T^{rb} := T^{ib}$
12:    if $f(T^{ib}) < f(T^{bs})$ then $T^{bs} := T^{ib}$
13:    ApplyPheromoneUpdate(cf, bs_update, $\mathcal{T}$, $T^{ib}$, $T^{rb}$, $T^{bs}$)
14:    cf := ComputeConvergenceFactor($\mathcal{T}$)
15:    if cf > 0.99 then
16:       if bs_update = true then
17:          $\tau_a := 0.5$ for all $a \in A$
18:          $T^{rb} := (\{v_r\}, \emptyset)$
19:          bs_update := false
20:       else
21:          bs_update := true
22:       end if
23:    end if
24: end while
25: output: $T^{bs}$, the best solution found by the algorithm


Construct_Solution($G$, $v_r$): This function, first, constructs
a spanning arborescence $T'$ in the way which is shown in lines
2-9 of Algorithm 68.1. However, the choice of the next arc to be
added to the current arborescence at each step (see line 5 of
Algorithm 68.1) is done in a different way. Instead of
deterministically choosing from $A_{pos}$ the arc which has the
smallest weight value, the choice is done probabilistically, based
on pheromone and heuristic information. The pheromone model
$\mathcal{T}$ that is used for this purpose contains a pheromone
value $\tau_a$ for each arc $a \in A$. The heuristic information
$\eta(a)$ of an arc $a$ is computed as follows. First, let

$$w_{max} := \max\{w(a) \mid a \in A\} . \qquad (68.2)$$

Based on this maximal weight of all arcs in $G$, the heuristic
information is defined as follows:

$$\eta(a) := w_{max} + 1 - w(a) . \qquad (68.3)$$

In this way, the heuristic information of all arcs is a positive
number. Moreover, the arc with minimal weight will have the
highest value concerning the heuristic information. Given an
arborescence $T'_i$ (obtained after the $i$th construction step),
and the nonempty set of arcs $A_{pos}$ that may be used for
extending $T'_i$, the probability for choosing arc $a \in A_{pos}$
is defined as follows:

$$p(a \mid T'_i) := \frac{\tau_a \cdot \eta(a)}{\sum_{\hat{a} \in A_{pos}} \tau_{\hat{a}} \cdot \eta(\hat{a})} . \qquad (68.4)$$

However, instead of choosing an arc from $A_{pos}$ always in
a probabilistic way, the following scheme is applied at each
construction step. First, a value $r \in [0, 1]$ is chosen
uniformly at random. Second, $r$ is compared to a so-called
determinism rate $\delta \in [0, 1]$, which is a fixed parameter of
the algorithm. If $r \le \delta$, arc $a^* \in A_{pos}$ is chosen to
be the one with the maximum probability, that is

$$a^* := \operatorname{argmax}\{p(a \mid T'_i) \mid a \in A_{pos}\} . \qquad (68.5)$$

Otherwise, that is, when $r > \delta$, arc $a^* \in A_{pos}$ is
chosen probabilistically according to the probability values.

The output $T$ of the function Construct_Solution($G$, $v_r$) is
chosen to be the minimum-weight arborescence which is encountered
during the process of constructing $T'$, that is,

$$T := \operatorname{argmin}\{f(T'_i) \mid i = 0, \ldots, |V| - 1\} .$$
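A single construction step, as described by (68.2)-(68.5), can be sketched as follows. This is an illustrative reading of the text and not the authors' C++ code; the function names, the dictionary-based pheromone model and the toy data are assumptions.

```python
import random


def heuristic_info(w):
    """Equations (68.2) and (68.3): eta(a) = w_max + 1 - w(a) > 0."""
    w_max = max(w.values())
    return {a: w_max + 1 - w[a] for a in w}


def choice_probabilities(candidates, tau, eta):
    """Equation (68.4): normalized product of pheromone and heuristic."""
    scores = {a: tau[a] * eta[a] for a in candidates}
    total = sum(scores.values())
    return {a: s / total for a, s in scores.items()}


def choose_arc(candidates, tau, eta, delta, rng):
    """With probability delta take the most probable arc (68.5),
    otherwise sample an arc according to the probabilities."""
    p = choice_probabilities(candidates, tau, eta)
    if rng.random() <= delta:
        return max(p, key=p.get)
    arcs = list(p)
    return rng.choices(arcs, weights=[p[a] for a in arcs], k=1)[0]


# Hypothetical step: two candidate arcs leave the root "r".
w = {("r", "a"): 3, ("r", "b"): 1}
tau = {a: 0.5 for a in w}          # initial pheromone values
eta = heuristic_info(w)            # eta = 1 for (r, a), 3 for (r, b)
arc = choose_arc(list(w), tau, eta, delta=0.7, rng=random.Random(1))
```

With equal pheromone values the cheaper arc from `r` to `b` receives three quarters of the probability mass, so the heuristic information alone already biases construction toward the greedy choice of Algorithm 68.1.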


ApplyPheromoneUpdate(cf, bs_update, $\mathcal{T}$, $T^{ib}$,
$T^{rb}$, $T^{bs}$): The pheromone update is performed in the same
way as in all MMAS algorithms implemented in the HCF. The three
solutions $T^{ib}$, $T^{rb}$, and $T^{bs}$ (as described at the
beginning of this section) are used for the pheromone update. The
influence of these three solutions on the pheromone update is
determined by the current value of the convergence factor cf,
which is defined later. Each pheromone value
$\tau_a \in \mathcal{T}$ is updated as follows:

$$\tau_a := \tau_a + \rho \cdot (\xi_a - \tau_a) , \qquad (68.6)$$

where

$$\xi_a := \kappa_{ib} \cdot \Delta(T^{ib}, a) + \kappa_{rb} \cdot \Delta(T^{rb}, a) + \kappa_{bs} \cdot \Delta(T^{bs}, a) , \qquad (68.7)$$


where $\kappa_{ib}$ is the weight of solution $T^{ib}$,
$\kappa_{rb}$ the one of solution $T^{rb}$, and $\kappa_{bs}$ the
one of solution $T^{bs}$. Moreover, $\Delta(T, a)$ evaluates to 1 if
and only if arc $a$ is a component of arborescence $T$. Otherwise,
the function evaluates to 0. Note also that the three weights must
be chosen such that $\kappa_{ib} + \kappa_{rb} + \kappa_{bs} = 1$.
After the application of (68.6), pheromone values that exceed
$\tau_{max} = 0.99$ are set back to $\tau_{max}$, and pheromone
values that have fallen below $\tau_{min} = 0.01$ are set back to
$\tau_{min}$. This prevents the algorithm from reaching a state of
complete convergence. Finally, note that the exact values of the
weights depend on the convergence factor cf and on the value of
the Boolean control variable bs_update. The standard schedule as
shown in Table 68.1 has been adopted for our algorithm.

Table 68.1 Setting of $\kappa_{ib}$, $\kappa_{rb}$, and
$\kappa_{bs}$ depending on the convergence factor cf and the
Boolean control variable bs_update

                bs_update = FALSE                            bs_update = TRUE
                cf < 0.7   cf in [0.7, 0.9)   cf >= 0.9
$\kappa_{ib}$   2/3        1/3                0              0
$\kappa_{rb}$   1/3        2/3                1              0
$\kappa_{bs}$   0          0                  0              1
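The update rule (68.6)-(68.7), the bounds on the pheromone values and the schedule of Table 68.1 fit into a short sketch. The convergence factor used to pick the weights is the one the text defines in its description of ComputeConvergenceFactor; it is included here so that the example is self-contained. Names such as `solution_weights` and the representation of solutions as sets of arcs are assumptions, and the last threshold of the schedule is read as cf >= 0.9.

```python
TAU_MIN, TAU_MAX = 0.01, 0.99


def solution_weights(cf, bs_update):
    """Return (k_ib, k_rb, k_bs) following the schedule of Table 68.1."""
    if bs_update:
        return 0.0, 0.0, 1.0
    if cf < 0.7:
        return 2 / 3, 1 / 3, 0.0
    if cf < 0.9:
        return 1 / 3, 2 / 3, 0.0
    return 0.0, 1.0, 0.0


def apply_pheromone_update(tau, cf, bs_update, t_ib, t_rb, t_bs, rho):
    """Equations (68.6) and (68.7), followed by clamping to the bounds.
    t_ib, t_rb and t_bs are the arc sets of the three solutions."""
    k_ib, k_rb, k_bs = solution_weights(cf, bs_update)
    for a in tau:
        xi = k_ib * (a in t_ib) + k_rb * (a in t_rb) + k_bs * (a in t_bs)
        tau[a] = min(TAU_MAX, max(TAU_MIN, tau[a] + rho * (xi - tau[a])))


def convergence_factor(tau):
    """0 when all values equal 0.5, 1 when all values sit on a bound."""
    spread = sum(max(TAU_MAX - t, t - TAU_MIN) for t in tau.values())
    return 2.0 * (spread / (len(tau) * (TAU_MAX - TAU_MIN)) - 0.5)


# Hypothetical example: arc "a1" is in all three solutions, "a2" in none.
tau = {"a1": 0.5, "a2": 0.5}
for _ in range(200):
    cf = convergence_factor(tau)
    apply_pheromone_update(tau, cf, False, {"a1"}, {"a1"}, {"a1"}, rho=0.1)
```

Because the weights sum to 1 and the indicator values are 0 or 1, every pheromone value stays inside the unit interval before clamping, which is the defining property of the hyper-cube framework.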
ComputeConvergenceFactor($\mathcal{T}$): The convergence factor
(cf) is computed on the basis of the pheromone values

$$cf := 2 \left( \left( \frac{\sum_{\tau_a \in \mathcal{T}} \max\{\tau_{max} - \tau_a, \tau_a - \tau_{min}\}}{|\mathcal{T}| \cdot (\tau_{max} - \tau_{min})} \right) - 0.5 \right) .$$

This results in cf = 0 when all pheromone values are set to 0.5.
On the other side, when all pheromone values have either value
$\tau_{min}$ or $\tau_{max}$, then cf = 1. In all other cases, cf has
a value in (0, 1). This completes the description of all
components of the proposed algorithm, which is henceforth labeled
ACO.


68.5 Experimental Evaluation


The algorithms proposed in this chapter (that is, DP-HEUR and
ACO) were implemented in ANSI C++ using GCC 4.4 for compiling the
software. Moreover, the heuristic proposed in [68.5] was
reimplemented. As mentioned before, this heuristic, henceforth
labeled VENSRI, is the only existing algorithm which can directly
be applied to the MWRA problem. All three algorithms were
experimentally evaluated on a cluster of PCs equipped with Intel
Xeon X3350 processors with 2667 MHz and 8 GB of memory. In the
following, we first describe the set of benchmark instances that
have been used to test the three algorithms. Afterward, the
algorithm tuning and the experimental results are described in
detail.


68.5.1 Benchmark Instances


A diverse set of benchmark instances was generated in the
following way. Three parameters are necessary for the generation
of a benchmark instance $G = (V, A)$. Hereby, $n$ and $m$
indicate, respectively, the number of vertices and the number of
arcs of $G$, while $q \in [0, 1]$ indicates the probability for the
weight of any arc to be positive (rather than negative). The
process of the generation of an instance starts by constructing
a random arborescence $T$ with $n$ vertices. The root vertex of
$T$ is called $v_r$. Each of the remaining $m - n + 1$ arcs was
generated by randomly choosing two vertices $v_i$ and $v_j$, and
adding the corresponding arc $a = (v_i, v_j)$ to $T$. In this
context, $a = (v_i, v_j)$ may be added to $T$ if and only if by its
addition no directed cycle is produced, and neither $(v_i, v_j)$
nor $(v_j, v_i)$ already forms part of the graph. The weight of
each arc was chosen by, first, deciding with probability $q$ if
the weight is to be positive (or nonpositive). In the case of
a positive weight, the weight value was chosen uniformly at random
from [1, 100], while in the case of a nonpositive weight, the
weight value was chosen uniformly at random from [-100, 0].

In order to generate a diverse set of benchmark 
instances, the following values for n, m, and q were con- 
sidered: 


• $n \in \{20, 50, 100, 500, 1000, 5000\}$;
• $m \in \{2n, 4n, 6n\}$;
• $q \in \{0.25, 0.5, 0.75\}$.


For each combination of n, m, and q, a total of 10 
problem instances were generated. This resulted in a to- 
tal of 540 problem instances, that is, 180 instances for 
each value of q. 
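The generation procedure can be sketched as below. Details that the text leaves open are filled with assumptions that are marked in the comments: the random arborescence is grown by attaching each new vertex to a uniformly chosen earlier vertex, vertex 0 plays the role of the root, and weights are drawn as integers. The function names are hypothetical.

```python
import random


def reaches(adj, src, dst):
    """Depth-first search: is dst reachable from src along arcs?"""
    stack, seen = [src], {src}
    while stack:
        u = stack.pop()
        if u == dst:
            return True
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return False


def generate_instance(n, m, q, seed=0):
    """Return {(u, v): weight} for a random DAG with n vertices, m arcs."""
    if not n - 1 <= m <= n * (n - 1) // 2:
        raise ValueError("m must lie between n - 1 and n(n - 1)/2")
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    weights = {}

    def add_arc(u, v):
        adj[u].add(v)
        if rng.random() < q:
            weights[(u, v)] = rng.randint(1, 100)    # positive weight
        else:
            weights[(u, v)] = rng.randint(-100, 0)   # nonpositive weight

    # Assumption: random arborescence rooted in vertex 0.
    for v in range(1, n):
        add_arc(rng.randrange(v), v)
    # Remaining m - n + 1 arcs: reject duplicates, reversed duplicates
    # and arcs that would close a directed cycle.
    while len(weights) < m:
        u, v = rng.sample(range(n), 2)
        if (u, v) in weights or (v, u) in weights:
            continue
        if reaches(adj, v, u):
            continue
        add_arc(u, v)
    return weights


instance = generate_instance(n=20, m=40, q=0.5)
```

The reachability test makes each insertion linear in the size of the graph, which is acceptable for a sketch but would be slow for the largest instances with 5000 vertices and 30000 arcs.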


68.5.2 Algorithm Tuning 


The proposed ACO algorithm has several parameters 
that require appropriate values. The following parame- 
ters, which are crucial for the working of the algorithm, 
were chosen for tuning: 


• $n_a \in \{3, 5, 10, 20\}$: the number of ants (solution
  constructions) per iteration;
• $\rho \in \{0.05, 0.1, 0.2\}$: the learning rate;
• $\delta \in \{0.0, 0.4, 0.7, 0.9\}$: the determinism rate.


We chose the first problem instance (out of 10 problem
instances) for each combination of n, m, and q for
tuning. A full factorial design was utilized. This means
that ACO was applied exactly once, with each of the
4 × 3 × 4 = 48 parameter combinations, to each of the
problem instances chosen for tuning. The stopping criterion
was fixed to 20 000 solution evaluations for each
application of ACO. For analyzing the results, we used
a rank-based analysis. However, as the set of problem
instances is quite diverse, this rank-based analysis was 
performed separately for six subsets of instances. For 
defining these subsets, we refer to the instances with 
$n \in \{20, 50, 100\}$ as small instances, and the remaining
ones as large instances. With this definition, each of the
three subsets of instances concerning the three different
values of q was further separated into two subsets
concerning the instance size. For each of these six sub- 
sets, we used the parameter setting with which ACO 
achieved the best average rank for the corresponding 
tuning instances. These parameter settings are given in 
Table 68.2. 
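A rank-based comparison of parameter settings can be sketched as follows. The exact ranking procedure is not specified in the text, so this is an assumption: on each tuning instance the settings are ranked by objective value (lower is better, since weights are minimized), ties share the mean rank, and ranks are averaged over the instances of a subset. The function name `average_ranks` is hypothetical.

```python
from itertools import product

def average_ranks(results):
    """results[config][instance] -> objective value (lower is better).

    Returns {config: average rank over all instances}.
    """
    configs = list(results)
    instances = list(next(iter(results.values())))
    totals = {c: 0.0 for c in configs}
    for inst in instances:
        ordered = sorted(configs, key=lambda c: results[c][inst])
        pos = 0
        while pos < len(ordered):
            end = pos
            # Extend the block while the next config has the same value.
            while (end + 1 < len(ordered)
                   and results[ordered[end + 1]][inst] == results[ordered[pos]][inst]):
                end += 1
            shared = (pos + end) / 2 + 1  # tied configs share the mean rank
            for c in ordered[pos:end + 1]:
                totals[c] += shared
            pos = end + 1
    return {c: totals[c] / len(instances) for c in configs}

# The full factorial design over (n_a, rho, delta) listed in the text.
CONFIGS = list(product((3, 5, 10, 20), (0.05, 0.1, 0.2), (0.0, 0.4, 0.7, 0.9)))
```

The setting with the smallest average rank in a subset would then be selected for that subset.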


68.5.3 Results 


The three algorithms considered for the comparison 
were applied exactly once to each of the 540 prob- 
lem instances of the benchmark set. Although ACO is 
a stochastic search algorithm, this is a valid choice, be- 
cause results are averaged over groups of instances that 
were generated with the same parameters. As in the 
case of the tuning experiments, the stopping criterion 
for ACO was fixed to 20 000 solution evaluations.
Tables 68.3-68.5 present the results averaged, for each
algorithm, over the 10 instances for each combination
of n and m (as indicated in the first two table columns). 
Four table columns are used for presenting the results 
of each algorithm. The column with heading value pro- 
vides the average of the objective function values of 
the best solutions found by the respective algorithm for 
the 10 instances of each combination of n and m. The 
second column (with heading std) contains the corre- 
sponding standard deviation. The third column (with 
heading size) indicates the average size (in terms of the 
number of arcs) of the best solutions found by the respective
algorithm (remember that solutions, that is,
arborescences, may have any number of arcs between
0 and |V| - 1, where |V| is the number of nodes of the input DAG
G = (V, A)). Finally, the fourth column (with heading
time (s)) contains the average computation time (in seconds).
For all three algorithms, the computation time
indicates the time of the algorithm termination. In the
case of ACO, an additional table column (with heading
evals) indicates at which solution evaluation, on average,
the best solution of a run was found. Finally, for
each combination of n and m, the result of the best-performing
algorithm is indicated in bold font.

Table 68.2 Parameter setting (concerning ACO) used for
the final experiments

                  q = 0.25    q = 0.5     q = 0.75
Small instances   n_a = 20    n_a = 20    n_a = …
                  ρ = 0.2     ρ = 0.2     ρ = 0.05
                  δ = 0.7     δ = 0.7     δ = 0.4
Large instances   n_a = 20    n_a = 20    n_a = 20
                  ρ = 0.2     ρ = 0.2     ρ = 0.2
                  δ = 0.9     δ = 0.9     δ = 0.9
Fig. 68.3a-c Average improvement (in %) of ACO and
DP-HEUR over VENSRI. Positive values correspond to 
an improvement, while negative values indicate that the 
respective algorithm is inferior to VENSRI. The improve- 
ment is shown for the three different arc-densities that are 
considered in the benchmark set, that is, m = 2n, m = 4n, 
and m = 6n 

Fig. 68.4 These graphics show, for each combination of n and m,
information about the average size (in terms of the number of
arcs) of the solutions produced by DP-HEUR, ACO, and VENSRI

Concerning the 180 instances with q = 0.25, the
results allow us to make the following observations.
First, for all combinations of n and m, ACO is the
best-performing algorithm. Averaged over all problem
instances, ACO obtains an improvement of 29.8%
over VENSRI. Figure 68.3a shows the average im- 
provement of ACO over VENSRI for three groups 
of input instances concerning the different arc den- 
sities. It is interesting to observe that the advantage 
of ACO over VENSRI seems to grow when the arc 
density increases. On the downside, these improve- 
ments are obtained at the cost of a significantly in- 
creased computation time. Concerning heuristic DP- 
HEUR, we can observe that it improves over VEN- 
SRI for all combinations of n and m, apart from 
(n = 100, m = 2n) and (n = 500, m = 2n). This seems 
to indicate that, also for DP-HEUR, the sparse in- 
stances pose more of a challenge than the dense 
instances. Averaged over all problem instances, DP- 
HEUR obtains an improvement of 18.6% over VENSRI. 
The average improvement of DP-HEUR over VEN- 
SRI is shown for the three groups of input instances 
concerning the different arc-densities in Fig. 68.3a. 
Concerning a comparison of the computation times, 
we can state that DP-HEUR has a clear advan- 
tage over VENSRI especially for large-size problem 
instances. 

Concerning the remaining 360 instances (q = 0.5 
and q = 0.75), we can make the following additional 
observations. First, both ACO and DP-HEUR seem to 
experience a downgrade in performance (in compari- 
son to the performance of VENSRI) when q increases. 
This holds especially for rather large and rather sparse 
graphs. While both algorithms still obtain an aver- 
age improvement over VENSRI in the case of q = 
0.5 — that is, 19.9% improvement in the case of ACO 
and 7.3% in the case of DP-HEUR — both algorithms 
are on average inferior to VENSRI in the case of 
q = 0.75. 

Finally, Fig. 68.4 presents the information which 
is contained in column size of Tables 68.3-68.5 in 
graphical form. It is interesting to observe that the 
solutions produced by DP-HEUR consistently seem 
to be the smallest ones, while the solutions pro- 
duced by VENSRI seem generally to be the largest 
ones. The size of the solutions produced by ACO 
is generally in between these two extremes. More- 
over, with growing q the difference in solution size 
as produced by the three algorithms seems to be 
more pronounced. We currently have no explana- 
tion for this aspect, which certainly deserves further 
examination. 
68.6 Conclusions and Future Work


In this work, we have proposed a heuristic and an ACO 
approach for the minimum-weight rooted arborescence
problem. The heuristic makes use of dynamic program- 
ming as a subordinate procedure. Therefore, it may be 
regarded as a hybrid algorithm. In contrast, the pro- 
posed ACO algorithm is a pure metaheuristic approach. 
The experimental results show that both approaches are 
superior to an existing heuristic from the literature in 
those cases in which the number of arcs with positive 
weights is not too high and in the case of rather dense 
graphs. However, as far as sparse graphs with a rather 
69. An Intelligent Swarm of Markovian Agents


Dario Bruneo, Marco Scarpa, Andrea Bobbio, Davide Cerotti, Marco Gribaudo 


We define a Markovian agent model (MAM) as an 
analytical model formed by a spatial collection of 
interacting Markovian agents (MAs), whose prop- 
erties and behavior can be evaluated by numerical 
techniques. MAMs have been introduced with the 
aim of providing a flexible and scalable frame- 
work for distributed systems of interacting objects, 
where both the local properties and the interac- 
tions may depend on the geographical position. 
MAMs can be proposed to model biologically in- 
spired systems since they are suited to cope with 
the four common principles that govern swarm 
intelligence: positive feedback, negative feedback, 
randomness, and multiple interactions. In the 
present work, we report some results of a MAM for 
a wireless sensor network (WSN) routing protocol 
based on swarm intelligence, and some prelim- 
inary results in utilizing MAs for very basic ant 
colony optimization (ACO) benchmarks. 


69.1 Swarm Intelligence: A Modeling Perspective


Swarm intelligence (SI) algorithms are inspired in various ways
by the manner in which colonies of biological organisms
self-organize to produce a wide diversity of functions
[69.1, 2]. Individuals of the colony have a limited
knowledge of the overall behavior of the system and 
follow a small set of rules that may be influenced by the 
interaction with other individuals or by modifications 
produced in the environment. The collective behavior of 
large groups of relatively simple individuals, interacting 
only locally with few neighboring elements, produces 
global patterns. Although the many approaches that have been
proposed differ in many respects, four basic
common principles that govern SI have been identified:

• Positive feedback
• Negative feedback
• Randomness
• Multiple interactions.


The same four principles also govern a class of al- 
gorithms inspired by the expansion dynamics of slime 
molds in the search for food [69.3, 4], that have been 
utilized as the base for the generation of routing proto- 
cols in wireless sensor networks (WSNs). 

Through the adoption of the above four principles, 
it is possible to design distributed, self-organizing, and 
fault tolerant algorithms able to self-adapt to the en- 
vironmental changes, that present the following main 
properties [69.1]: 


i) Single individuals are assumed to be simple with 
low computational intelligence and communication 
capabilities. 
ii) Individuals communicate indirectly, through modification
of the environment (this property is known
as stigmergy [69.2]).

iii) The range of the interaction may be very short; nev- 
ertheless, a robust global behavior emerges from the 
interaction of the nodes. 

iv) The global behavior adapts to topological and envi- 
ronmental changes. 


The usual way to study such systems is through
simulation, because the large number of individuals
involved leads to the well-known state explosion
problem. Analytical techniques are preferable if, starting
from the peculiarities of SI systems, they make it possible to
build effective and scalable models. Along this line,
new stochastic entities, called Markovian agents (MAs) 
[69.5,6] have been introduced with the aim of pro- 
viding a flexible, powerful, and scalable technique for 
modeling complex systems of distributed interacting 
objects, for which feasible analytical and numerical so- 
lution algorithms can be implemented. Each object has 
its own local behavior that can be modified by the mu- 
tual interdependences with the other objects. MAs are 
scattered over a geographical area and retain their spa- 
tial position so that the local behavior and the mutual 
interdependencies may be related to their geographical 
positions and other features like the transmittance char- 
acteristics of the interposed medium. MAs are modeled 
by a discrete-state continuous-time finite Markov chain 
(CTMC) whose infinitesimal generator is influenced by 
the interaction with other MAs. The interaction among 
agents is represented by a message passing model com- 
bined with a perception function. When residing in 
a state or during a transition, an MA is allowed to 
send messages that are perceived by the other MAs, 
according to a spatial-dependent perception function, 
modifying their behavior. Messages may model real 
physical messages (as in WSNs) or simply the mutual
influences of an MA over the other ones.

The flexibility of the MA representation, the spatial
dependency, and the mutual interaction through message
passing and perception function make MA models
suited to cope with various biologically inspired mechanisms
governed by the four aforementioned principles.
In fact, the MAM, whose constituent elements are the
MAs, was specifically studied to cope with the following
needs [69.6]:

i) Provide analytical models that can be solved by numerical
techniques, thus avoiding the need for long
and expensive simulation runs.
ii) Provide a flexible and scalable modeling framework
for distributed systems of interacting objects.
iii) Provide a framework in which local properties can
be coupled with global properties.
iv) Allow local and global properties and interactions to
depend on the position of the objects in the space
(space-sensitive models).
v) Provide a solution algorithm that self-adapts to variations in
the system topology and in the interaction mechanisms.

Interactive Markovian agents were first introduced
in [69.5, 7] for single-class MAs and then
extended to multiclass multimessage Markovian agent
models in subsequent works [69.8-10]. In [69.9, 11, 12],
MAs have been applied to routing algorithms in WSNs,
adopting SI principles [69.13].

This work describes the structure of MAMs and
the numerical solution algorithms in Sect. 69.2. Then,
applications derived from biological models are presented:
a swarm intelligent algorithm for routing protocols
in WSNs (Sect. 69.3) and a simple ant colony
optimization (ACO) example (Sect. 69.4).

The structure of a single MA is represented in Fig. 69.1.
States i, j, ..., k are the states of the CTMC representing
the MA. The transitions among the states are of two
possible types and are drawn differently:

• Solid lines (like the transition from i to j or the
self-loops in i or in j) indicate the fixed component
of the infinitesimal generator and represent the
local or autonomous behavior of the object that is
independent of the interaction with the other MAs
(like, for instance, the time-to-failure distribution,
or the reaction to an external stimulus). Note that
we also include self-loop transitions in the representation;
these require a particular notation since they
are not visible in the infinitesimal generator of the
CTMC [69.14].
• Dashed lines (like the transition from i to k or the
transitions entering i or j from other states
not shown in the figure) represent the transitions
induced by the interaction with the other MAs.
The way in which the rates of the induced transitions
are computed is explained in the following
section.


During a local transition (or a self-loop) an MA can
emit a message of any type with an assigned probability,
as represented by the dotted arrows in Fig. 69.1
emerging from the solid transitions. The pair $(g_{ij}, m)$
denotes both the message generation probability and
the message type. Messages generated by an MA may
be perceived by other MAs with a given probability,
according to a suitable perception function, and the
interaction mechanism between emitted messages and
perceived messages generates the induced transitions
(dashed lines). The pair $(m, a_{ik})$ denotes both the type
of the perceived message and the corresponding acceptance
probability.

An MAM is a collection of interacting MAs defined
over a geographical space V. Given a position v inside
V, $\rho(v)$ denotes the density of MAs in v. According
to the definition of the density $\rho(v)$, we can classify
an MAM with the following taxonomy:

• An MAM is static if $\rho(v)$ does not depend on time,
and dynamic if it does depend on time.
• An MAM is discrete if the geographical area on
which the MAs are deployed is discretized and $\rho(v)$
is a discrete function of the space, and continuous
if $\rho(v)$ is a continuous function of the space.


Further, MAs may belong to a single class or to 
different classes with different local behaviors and in- 
teraction capabilities, and messages may belong to dif- 
ferent types where each type induces a different effect 
on the interaction mechanism. The perception function 
describes how a message of a given type emitted by an 
MA of a given class in a given position in the space
is perceived by an MA of a given class in a different
position.

Fig. 69.1 Schematic structure of a Markovian agent


69.2.1 Mathematical Formulation 


A multiple agent class, multiple message type MAM is
defined by the tuple [69.12]

$$\mathrm{MAM} = \{\mathcal{C}, \mathcal{M}, V, \mathcal{U}, \mathcal{R}\}, \tag{69.1}$$

where $\mathcal{C} = \{1, \ldots, C\}$ is the set of agent classes. We denote
with $MA^c$ an agent of class $c \in \mathcal{C}$. $\mathcal{M} = \{1, \ldots, M\}$
is the set of message types. Each agent (independently
of its class) can send or receive messages of type $m \in \mathcal{M}$.
V is the finite space over which Markovian agents
are spread. $\mathcal{U} = \{u^1(\cdot), \ldots, u^M(\cdot)\}$ is a set of M perception
functions (one for each message type).
$\mathcal{R} = \{\rho^1(\cdot), \ldots, \rho^C(\cdot)\}$ is a set of C agent density functions
(one for each agent class). Each agent $MA^c$ of class c
is characterized by a state space with $n_c$ states, and it is
defined by the tuple

$$MA^c = \{Q^c(v), \Lambda^c(v), G^c(v, m), A^c(v, m), \pi_0^c(v)\}, \tag{69.2}$$

where $Q^c(v)$ is the local component of the infinitesimal
generator; $\Lambda^c(v)$ is the vector of the self-jump transition
rates; $G^c(v, m)$ is the matrix containing the probabilities
of generating a message of type m; $A^c(v, m)$ is the matrix
containing the probabilities of accepting a message
of type m; $\pi_0^c(v)$ is the initial probability vector.

Note that even if the structure of the CTMC associ- 
ated to each MA of a given class is the same for all the 
objects, the values of the parameters may depend on position
v and, therefore, may vary among MAs belonging
to the same class.

An MAM can be analyzed by solving a set of coupled
differential equations. Let us call $\rho_i^c(t, v)$ the density of
agents of class c, in state i, located in position v at time
t. In the following, we will focus on static MAMs, thus
assuming that the total density of agents in position v
remains constant over time; we have that

$$\sum_{i=1}^{n_c} \rho_i^c(t, v) = \rho^c(v), \quad \forall v,\ \forall t \ge 0. \tag{69.3}$$

We collect the state densities into a vector $\rho^c(t, v) = [\rho_i^c(t, v)]$,
and we are interested in computing the transient
evolution of $\rho^c(t, v)$.

From the above definitions, we can compute the
total rate $\beta_j^c(v, m)$ at which messages of type m are generated
by an agent of class c in state j in position v

$$\beta_j^c(v, m) = \lambda_j^c(v)\, g_{jj}^c(v, m) + \sum_{k \ne j} q_{jk}^c(v)\, g_{jk}^c(v, m), \tag{69.4}$$

where the first term on the right-hand side is the contribution
of the messages of type m emitted during
a self-loop from j and the second term is the contribution
of messages of type m emitted during a transition
from j to any k ($\ne j$).
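The message generation rate of (69.4) is a simple per-state sum. The sketch below evaluates it for one class, one position, and one message type, using plain nested lists; the function name `message_rates` and the argument layout are assumptions made for illustration.

```python
def message_rates(self_loop_rates, q, g):
    """Return beta[j] = lambda_j * g[j][j] + sum_{k != j} q[j][k] * g[j][k].

    self_loop_rates: list of self-loop rates lambda_j
    q: local generator matrix (only off-diagonal entries are used)
    g: matrix of message generation probabilities for one message type
    """
    n = len(self_loop_rates)
    beta = []
    for j in range(n):
        rate = self_loop_rates[j] * g[j][j]  # messages emitted in self-loops
        rate += sum(q[j][k] * g[j][k] for k in range(n) if k != j)
        beta.append(rate)
    return beta
```

For example, if messages are generated only during self-loops (g is the identity matrix), every state emits at exactly its self-loop rate.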

The interdependences among MAs are ruled by a set
of perception functions whose general form is

$$u^m(c, v, i, c', v', j). \tag{69.5}$$

The perception function $u^m(\cdot)$ in (69.5) represents how
an MA of class c in position v in state i perceives the
messages of type m emitted by an MA of class c' in
position v' in state j. The functional form of $u^m(\cdot)$ identifies
the perception mechanisms and must be specified
for any given application since it determines how an
MA is influenced by the messages emitted by the other
MAs. The transition rates of the induced transitions are
primarily determined by the structure of the perception
function.

A pictorial and intuitive representation of how the
perception function $u^m(c, v, i, c', v', j)$ acts is given in
Fig. 69.2. The MA in the top right portion of the figure
in position v' broadcasts a message of type m from state
j that propagates in the geographical area until it reaches
the MA in the bottom left portion of the figure in position
v and in state i. Upon acceptance of the message
according to the acceptance probability $a_{ik}^c(v, m)$, an induced
transition from state i to state k (represented by
a dashed line) is triggered in the model.


With the above definitions we are now in the po- 
sition to compute the components of the infinitesimal 
generator of an MA that depends on the interaction with 
the other MAs and that constitutes the original and in- 
novative part of the approach. 

We define $\gamma_{ii}^c(t, v, m)$ as the total rate at which messages
of type m coming from the whole volume V are
perceived by an MA of class c in state i in location v:

$$\gamma_{ii}^c(t, v, m) = \int_V \sum_{c'=1}^{C} \sum_{j=1}^{n_{c'}} u^m(c, v, i, c', v', j)\, \beta_j^{c'}(v', m)\, \rho_j^{c'}(t, v')\, dv', \tag{69.6}$$

where $\gamma_{ii}^c(t, v, m)$ is computed by taking into account
the total rate of messages of type m emitted by all
the MAs in state j and in a given position v' (the
term $\beta_j^{c'}(v', m)$) times the density of MAs in v' (the
term $\rho_j^{c'}(t, v')$) times the perception function (the term
$u^m(c, v, i, c', v', j)$), summed over all the possible states
j and classes c' of each MA and integrated over the whole
space V. From an MA of class c in position v and in
state i, an induced transition to state k (drawn as a dashed
line) is triggered with rate $\gamma_{ii}^c(t, v, m)\, a_{ik}^c(v, m)$, where
$a_{ik}^c(v, m)$ is the appropriate entry of the acceptance matrix
$A^c(v, m)$.

We collect the rates (69.6) in a diagonal matrix
$\Gamma^c(t, v, m) = \mathrm{diag}(\gamma_{ii}^c(t, v, m))$. This matrix can be
used to compute $K^c(t, v)$, the infinitesimal generator of
a class c agent at position v at time t:

$$K^c(t, v) = Q^c(v) + \sum_{m} \Gamma^c(t, v, m)\,[A^c(v, m) - I]. \tag{69.7}$$


The first term on the right-hand side is the local transi- 
tion rate matrix and the second term contains the rates 
induced by the interactions. 
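The assembly of the generator from perceived message rates can be sketched for a discretized space. The sketch makes several simplifying assumptions that are not from the source: a single agent class, a finite list of positions with unit cell volume (so the integral becomes a sum), and acceptance matrices whose rows sum to one, with the diagonal entry acting as the probability of ignoring a message. The function names and data layout are hypothetical.

```python
def perceived_rate(v, i, m, positions, u, beta, rho):
    """Discrete analogue of the perceived rate for one agent class.

    u(m, v, i, w, j) -> perception value for receiver (v, i), emitter (w, j)
    beta[w][m][j]    -> emission rate of type m in state j at position w
    rho[w][j]        -> density of agents in state j at position w
    """
    n_states = len(rho[v])
    return sum(
        u(m, v, i, w, j) * beta[w][m][j] * rho[w][j]
        for w in positions
        for j in range(n_states)
    )

def generator(v, Q, A, messages, positions, u, beta, rho):
    """Return K = Q + sum_m Gamma(m) (A[m] - I) at position v."""
    n = len(Q)
    K = [row[:] for row in Q]
    for m in messages:
        for i in range(n):
            gamma = perceived_rate(v, i, m, positions, u, beta, rho)
            for k in range(n):
                identity = 1.0 if i == k else 0.0
                K[i][k] += gamma * (A[m][i][k] - identity)
    return K
```

Under the row-sum assumption on the acceptance matrices, each row of the resulting matrix sums to zero, as required of an infinitesimal generator.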


Fig. 69.2 Message passing mecha- 
nism ruled by a perception function 
The evolution of the entire model can be studied by
solving, $\forall v, c$, the following differential equations:

$$\rho^c(0, v) = \rho^c(v)\, \pi_0^c(v), \tag{69.8}$$
$$\frac{d\rho^c(t, v)}{dt} = \rho^c(t, v)\, K^c(t, v). \tag{69.9}$$

From the density of agents in each state, we can compute
the probability of finding a class c agent at time t
in position v in state i as

$$\pi_i^c(t, v) = \frac{\rho_i^c(t, v)}{\rho^c(v)}. \tag{69.10}$$

We collect all the terms in a vector $\pi^c(t, v) = [\pi_i^c(t, v)]$.
Note that the definition (69.10) together with (69.3)
ensures that $\sum_i \pi_i^c(t, v) = 1$, $\forall t, \forall v$.

Note that each equation in (69.9) has the dimension
$n_c$ of the CTMC of a single MA of class c. In this
way, a problem defined over the product state space of
all the MAs is decomposed into several subproblems,
one for each MA, having decoupled the interaction
by means of (69.6). Equation (69.9) provides the basic
time-dependent measures to evaluate more complex
performance indices associated with the system. Equation
(69.9) is discretized both in time and space and is
solved by resorting to standard numerical techniques
for differential equations.
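One standard technique for the time discretization is the explicit Euler method, sketched below for a single position and class. This is an illustrative choice, not necessarily the scheme used by the authors; the callback `generator_at(t, rho)`, which returns the generator for the current densities, is a hypothetical interface.

```python
def euler_step(rho, K, dt):
    """One explicit Euler step of d(rho)/dt = rho K, with rho a row vector."""
    n = len(rho)
    return [
        rho[k] + dt * sum(rho[i] * K[i][k] for i in range(n))
        for k in range(n)
    ]

def integrate(rho0, generator_at, t_end, dt):
    """Integrate from t = 0 to t_end with fixed step dt."""
    rho, t = list(rho0), 0.0
    for _ in range(int(round(t_end / dt))):
        rho = euler_step(rho, generator_at(t, rho), dt)
        t += dt
    return rho
```

Because the rows of a generator sum to zero, each step preserves the total density, in line with (69.3). In the full model the generator at one position depends on the densities at all other positions, so all positions would be advanced together in each step.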


69.3 A Consolidated Example: WSN Routing 


In this section, we present our first attempt to model 
swarm intelligence inspired mechanisms through the 
MAM formalism. This application describes an MAM 
model for the analysis of a swarm intelligence rout- 
ing protocol in WSNs and was first proposed in [69.9] 
and then enriched in [69.12]. In this work, we show 
new experiments to illustrate the self-adaptability of the 
MAM model to the changing of environmental condi- 
tions. 

WSNs are large networks of tiny sensor nodes that 
are usually randomly distributed over a geographical 
region. The network topology may vary in time in an 
unpredictable manner due to many different causes. 
For example, in order to reduce power consumption, 
battery-operated sensors undergo cycles of sleeping and
active periods; additionally, sensors may be located in
hostile environments increasing their likelihood of fail- 
ure; furthermore, data might also be collected from dif- 
ferent sources at different times and directed to different 
sinks. For this reason, multihop routing algorithms used 
to route messages from a sensor node to a sink should 
be rapidly adaptable to the changing topology. Swarm 
intelligence has been successfully used to face these 
problems thanks to its ability to converge to a single
global behavior starting from the interaction of many 
simple local agents. 


69.3.1 A Swarm Intelligence Based Routing 


In [69.15], a new routing algorithm, inspired by the
biological process of pheromone emission (a chemical
substance produced and laid down by ants and
other biological entities), has been proposed. The routing
table in each node stores the pheromone level owned
by each neighbor, coded as a natural integer quantity
[69.15]; when a data packet has to be sent, it is
forwarded to the neighbor with the highest pheromone
level. This approach works correctly only if a sequence
of increasing pheromone levels toward the
sinks exists; in other words, the sinks must have the
maximum pheromone level in the WSN and a decreasing
pheromone gradient must be established around the
sinks, covering the whole network.
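The forwarding rule can be sketched as follows. The routing table is assumed to be a dictionary mapping neighbor identifiers to pheromone levels; the names `next_hop` and `route`, the tie-breaking behavior, and the hop limit are illustrative assumptions rather than details of [69.15].

```python
def next_hop(routing_table):
    """Return the neighbor with the highest pheromone level, or None."""
    if not routing_table:
        return None
    # Ties are broken by insertion order of the dictionary.
    return max(routing_table, key=routing_table.get)

def route(source, tables, sinks, max_hops=100):
    """Follow the pheromone gradient from source until a sink is reached."""
    path, node = [source], source
    while node not in sinks and len(path) <= max_hops:
        node = next_hop(tables[node])
        if node is None:
            break
        path.append(node)
    return path
```

If the gradient does not increase monotonically toward a sink, the packet can bounce between two nodes; the hop limit in the sketch only guards against looping forever and does not repair the gradient.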

To build the pheromone gradient, the initial setting 
of the WSN is as follows: the sinks are set to a fixed 
maximum pheromone level, whereas the sensor nodes’ 
pheromone levels are set to 0. When the WSN is oper- 
ating, each node periodically sends a signaling packet 
with its pheromone level and updates its value based on 
the level of its neighbors. 

More specifically, the algorithm for establishing the 
pheromone gradient is based on two types of nodes in 
the WSN, called sinks and sensors, respectively, and the 
pheromone is assumed to be discretized into P different levels,
ranging from 0 to P - 1. In this way, routing paths
toward the sink are established through the exchange
of pheromone packets containing the pheromone level
p ($0 \le p \le P - 1$) of each node.

Sink nodes, once activated, set their internal
pheromone level to the highest value p = P - 1. Then,
at fixed time intervals, they broadcast a pheromone message
to their neighbors with the value p. We denote by T1
the time interval between two consecutive
transmissions of a pheromone message.

Instead, the pheromone level of a sensor node is
initially set to 0 and then it is periodically updated according
to two distinct actions, the excitation action (the
positive feedback) and the evaporation action (the negative
feedback):

• Excitation action: Sensor nodes periodically broadcast
to the neighbors a pheromone message containing
their internal pheromone level p. Like the
sink node, sensor nodes perform the sending at regular
time intervals T1. When a sensor node receives
a pheromone level $p_n$ sent by a neighbor, it compares
$p_n$ with its own level p and updates the latter
if $p_n > p$. The new value is computed as a function
$\mathrm{update}(p, p_n)$ of the current and the received pheromone
level. In this context, we use the average
of the sender and the receiver level as the new
value; thus the function is assumed to be
$\mathrm{update}(p, p_n) = \mathrm{round}((p + p_n)/2)$.
• Evaporation action: it is triggered at regular time
intervals T2 and it simply decreases the current value
of p by one unit, ensuring that it remains greater than
or equal to 0.
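The two actions translate directly into code. One detail is an assumption: the text does not say how round() treats halves, and the sketch rounds halves up, so that a node at level p always moves up when it hears from a neighbor at level p + 1. Python's built-in `round` rounds halves to the nearest even integer and would behave differently.

```python
import math

def round_half_up(x):
    """Round to the nearest integer, with halves rounded up (assumption)."""
    return math.floor(x + 0.5)

def excite(p, p_n):
    """Excitation: move toward a higher level received from a neighbor."""
    if p_n > p:
        return round_half_up((p + p_n) / 2)
    return p

def evaporate(p):
    """Evaporation: lose one unit, never dropping below 0."""
    return max(p - 1, 0)
```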


We note that, although all nodes perform their excitation
action with the same mean time interval T1, no
synchronization activity is required among the nodes; 
all of them act asynchronously in accordance with the 
principles of biological systems where each entity acts 
autonomously with respect to the others. 

The excitation—evaporation process, like in biologi- 
cal systems, assures the stability of the system and the 
adaptability to possible changes in the environment or 
in some nodes. Any change in the network condition 
is captured by an update of the pheromone level of 
the involved nodes that modifies the pheromone gradi- 
ent automatically driving the routing decisions toward 
the new optimal solution. In this way, the network can 
self-organize its topology and adapt to environmen- 
tal changes. Moreover, when link failures occur, the 
network reorganization task is accomplished by those 
nodes near the broken links. This results in a robust and 
self-organized architecture. 

The major drawback of this algorithm is the difficulty
in appropriately setting the parameters T1 and
T2: as shown in [69.12, 15], the stability of the system
and the quality of the produced pheromone gradient
are strictly dependent on the ratio of these parameters. When
T1 decreases and T2 is fixed, pheromone messages
are exchanged more rapidly among the nodes and
their pheromone level tends to the maximum level because
the sink node always sends the same maximum
value. Without an appropriate balancing action, the
pheromone level saturates in all the nodes of the WSN.
Conversely, let us suppose T1 is fixed and T2 decreases;
in this case the pheromone level in each sensor
node decreases more quickly than it is updated according
to the value of the neighbors. As a result, all the
levels will be close to zero. From this behavior, we note
that: (1) both timers are necessary to ensure that the algorithm
works properly, and (2) a smart setting of
both timers is necessary in order to have the best gradient
shape all over the network. The MAM model we
are going to describe in the next section helps us to determine
the best parameter values.


69.3.2 The MAM Model 


The MAM model used to study the gradient forma- 
tion is based on two agent classes: the class sink node 
denoted by a superscript s and the class sensor node 
denoted by a superscript n. The message exchange is 
modeled by using M different message types. As we 
will explain later, since each message is used to send 
a pheromone level, we set M = P, where P is the num- 
ber of different pheromone intensities considered in the 
model. 


Geographical Space 

The geographical space V where the N agents are located
is modeled as an $n_h \times n_w$ rectangular grid, and each
cell has a square shape with side $d_s$. Sensors can only be
located in the center of each cell and we allow at most
one node per cell: i.e., some cells might be empty, and
$N \le n_h \times n_w$. Moreover, sink nodes are very few with
respect to the number of sensor nodes.


Agent's Structure and Behavior 
Irrespective of the MA class considered, we model the 
pheromone level of a node with a state and this choice 
determines two different MA structures. 

The sink class (Fig. 69.3a) is very simple and is
characterized by a single state labeled P - 1 with a self-loop
of rate $\lambda = 1/T1$. The sink always has the same
maximum pheromone level, and emits a single message
of type P - 1 with rate $\lambda$.

Instead, the sensor class (Fig. 69.3b) has P states
identifying the range of all the possible pheromone
levels. Each state is labeled with the pheromone intensity
i (i = 0, ..., P - 1) in the corresponding node
and has a self-loop of rate $\lambda = 1/T1$ that represents the
firing of timer T1 at regular intervals. This
event causes the sending of a message (Sect. 69.3.2).
The evaporation phenomenon is modeled by the solid
arcs (local transitions) connecting state i with state i - 1
($0 < i \le P - 1$). The evaporation rate is set to $\mu = 1/T2$,
so that it represents the firing of timer T2.
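The local part of the sensor agent can be written down as a small matrix. The sketch below builds the local generator containing only the evaporation transitions, and the vector of self-loop rates, under the assumption that the two rates are the reciprocals of the timers T1 and T2. The function names are hypothetical.

```python
def sensor_local_generator(P, T2):
    """Local generator of the sensor class: state i -> i - 1 at rate 1/T2."""
    mu = 1.0 / T2
    Q = [[0.0] * P for _ in range(P)]
    for i in range(1, P):
        Q[i][i - 1] = mu   # evaporation lowers the pheromone level by one
        Q[i][i] = -mu      # diagonal entry keeps the row sum at zero
    return Q

def self_loop_rates(P, T1):
    """Every state fires its message-sending self-loop at rate 1/T1."""
    return [1.0 / T1] * P
```

State 0 has no outgoing local transition, which reflects the fact that evaporation cannot push the level below zero.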


Message Types 
The types of messages in the model correspond to the 
different levels of pheromone a node can store; thus we
define $\mathcal{M} = \{0, 1, \ldots, P - 1\}$. Any self-loop transition
in state i emits a message of the corresponding type i at
a constant rate $\lambda$, both in sink and in sensor nodes. The
sink message is always of type P - 1, representing the
maximum pheromone intensity, whereas the messages
emitted by a sensor node correspond to the state in which
it currently is.

When a message of type m is emitted, neighboring
nodes are able to receive it, changing their state
accordingly. This behavior is implemented through the
dashed arcs (whose labels are defined through (69.11))
that model the transitions induced by the reception of
a message. In particular, when a node in state i receives
a message of type m, it immediately jumps to state j if
$m \in M(i, j)$, with

$$M(i, j) = \{m \in [0, \ldots, P - 1] : \mathrm{round}((m + i)/2) = j\}, \quad \forall i, j \in [0, \ldots, P - 1] : j > i. \tag{69.11}$$


In other words, an MA in state i jumps to the state j 
that represents the pheromone level equal to the mean 
between the current level i and the level m encoded in 
the perceived message. 
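The jump rule of (69.11) can be sketched as follows. The function name `next_state` is hypothetical, and the tie-breaking of `round` is an assumption: the text does not say how a mean ending in .5 is rounded, so this sketch rounds halves up.

```python
def next_state(i, m, P):
    """State reached by an MA in state i after perceiving a message of type m.

    Follows (69.11): the target is j = round((m + i) / 2), and the jump is
    taken only if j > i. Halves are rounded up here (an assumption).
    """
    if not (0 <= i < P and 0 <= m < P):
        raise ValueError("states and message types must lie in 0..P-1")
    j = (i + m + 1) // 2  # integer form of round-half-up for (m + i) / 2
    return j if j > i else i
```

For instance, with $P = 20$ a node at level 4 that perceives a message of type 10 moves to level 7, while a message of type 2 leaves it at level 4.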


Perception Function 
Messages of any type sent by a node are characterized by the same transmission range $t_r$, which defines the radius of the area in which an MA can perceive a message produced by another MA. This property is reflected in the perception function $u^m(\cdot)$ that, $\forall m \in [1, \ldots, M]$, is defined as

$$u^m(v, c, i, v', c', i') = \begin{cases} 0 & \operatorname{dist}(v, v') > t_r \\ 1 & \operatorname{dist}(v, v') \le t_r , \end{cases} \tag{69.12}$$

where $\operatorname{dist}(v, v')$ represents the Euclidean distance between the two nodes at positions $v$ and $v'$.

As can be observed, the perception function in (69.12) is defined irrespective of the message type, because in this kind of application the reception of a message of any type depends only on the distance between the emitting and the perceiving node. The transmission range $t_r$ depends on the properties of the sensor and it influences the number $n$ of neighbors perceiving the message. In the numerical experimentation, we consider $d < t_r < \sqrt{2}\,d$, with $d$ the distance between adjacent grid nodes, corresponding to $n = 4$.

Fig. 69.3a,b Markovian agent models. (a) Agent class = sink, (b) Agent class = sensor
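A minimal sketch of the distance-based perception function, together with a check of how the transmission range fixes the number of neighbors on a square grid. The helper names `perceives` and `neighbor_count` are hypothetical, and a unit grid spacing is assumed.

```python
import math


def perceives(v, v_prime, t_r):
    """Perception function of (69.12): 1 if the two nodes are within range."""
    return 1 if math.dist(v, v_prime) <= t_r else 0


def neighbor_count(center, side, t_r):
    """Number of other nodes of a side x side unit grid that perceive a
    message emitted at `center`."""
    return sum(
        perceives((x, y), center, t_r)
        for x in range(side)
        for y in range(side)
        if (x, y) != center
    )
```

With unit spacing, any range between 1 and $\sqrt{2}$ reaches exactly the four nearest neighbors of an interior node; a node in a corner of the grid has only two.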

Generation and Acceptance Probabilities 
In this application, messages are only generated during self-loop transitions with probability 1, so that $\forall i, j$, $g_{ii}(m) = 1$ and $g_{ij}(m) = 0$ $(i \neq j)$. Similarly, we assume either $a_{ij}(m) = 0$ or $a_{ij}(m) = 1$, that is, incoming messages are always accepted or always ignored.


69.3.3 Numerical Results 


In order to analyze the behavior of the WSN model, the main measure of interest is the evolution of $\pi^s(t, v)$, i.e., the distribution of the pheromone intensity of a sensor node over the entire area $V$ as a function of time. The value of $\pi_i^s(t, v)$ can be computed from (69.10) and allows us to obtain several performance indices, such as the average pheromone intensity $\varphi(t, v)$ at time $t$ for each cell $v \in V$:

$$\varphi(t, v) = \sum_{i=0}^{P-1} i \cdot \pi_i^s(t, v) . \tag{69.13}$$

The distribution of the pheromone intensity over the entire area $V$ depends both on the pheromone emission rate $\lambda$ and on the pheromone evaporation rate $\mu$; furthermore, the excitation-evaporation process depends on the transmission range $t_r$, which determines the number $n$ of neighboring cells perceived by an MA in a given position. To take into account this physical mechanism, we define the following quantity,

$$r = \frac{\lambda \cdot n}{\mu} , \tag{69.14}$$

which regulates the balance between pheromone emission and evaporation in the SI routing algorithm. For a complete discussion of the performance indices that can be derived and analyzed using the described MAM, refer to [69.12].

Fig. 69.4a-c Distribution of the pheromone intensity varying r. (a) r = 1.2, (b) r = 1.8, (c) r = 2.4

Fig. 69.5a-f Distribution of the pheromone intensity with respect to t when two sinks are alternately activated. The change is applied at time t = 17.5 s. (a) t = 0 s, (b) t = 17 s, (c) t = 17.5 s, (d) t = 19 s, (e) t = 24 s, (f) t = 29 s
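The two indices above, (69.13) and (69.14), reduce to one-line computations once the state distribution of a cell is available. The sketch below assumes that distribution is given as a plain list `pi` with `pi[i]` the probability of level `i`; all function names are hypothetical.

```python
def average_pheromone(pi):
    """Average intensity of (69.13) for one cell; pi[i] = P(level == i)."""
    return sum(i * p for i, p in enumerate(pi))


def balance_ratio(lam, n, mu):
    """Quantity r of (69.14): emission rate times neighbors over evaporation."""
    return lam * n / mu


def evaporation_rate_for(r, lam, n):
    """Evaporation rate mu that gives a target r when lam and n are fixed."""
    return lam * n / r
```

Since $r$ is inversely proportional to $\mu$, sweeping $r$ with $\lambda$ and $n$ fixed amounts to sweeping the evaporation rate through `evaporation_rate_for`.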

The numerical results have been obtained with the following experimental setting. The geographical space is a square grid of size $n_h = n_w = 31$, where $N = 961$ sensors are uniformly distributed with a spatial density equal to 1 (one sensor per cell). Further, we set $\lambda = 4.0$, $P = 20$, and $n = 4$. The first experiment investigates the formation of the pheromone gradient around the sink as a function of the model parameters. To this end, a single sink node is placed at the center of the area and the pheromone intensity distribution is evaluated as a function of the parameter $r$, by varying $\mu$ while $\lambda$ and $n$ are kept fixed.

Figure 69.4 shows the distribution of the pheromone intensity $\varphi(t, v)$ measured in the stable state for three different values of $r$. If the value of $r$ is small ($r = 1.2$) or high ($r = 2.4$), the quality of the gradient is poor. This is due to the prevalence of one of the two feedbacks: negative (with $r = 1.2$ evaporation prevails) or positive (with $r = 2.4$ excitation prevails and all sensors saturate). By contrast, an intermediate value ($r = 1.8$) generates a well-formed pheromone gradient able to cover the whole area, thanks to the correct balance between the two feedbacks. The value of $r$ therefore has to be tuned appropriately in order to generate a pheromone gradient that fits the topological specification of the WSN under study.

Fig. 69.6 The 100 × 100 grid with 10000 cells and 50 randomly scattered sinks

In order to understand the dynamic behavior of the SI algorithm, we carried out a transient analysis that highlights the different phases of the gradient construction process when the position of the sink changes in time. In particular, in the following experiment (Fig. 69.5) we analyzed how the algorithm self-adapts to topological modifications by recalculating the pheromone gradient when two different sinks are present in the network and are alternately activated. Figure 69.5a,b shows how the pheromone signal spreads over the space $V$ until the stable state is reached. At this point (t = 17.5 s), we deactivated the old sink and activated a new one in a different position (Fig. 69.5c). Figure 69.5d,e describes the evolution of the gradient modification. Thanks to the properties of the SI algorithm, the WSN is able to rapidly discover the new sink and to change the pheromone gradient by forgetting the old information until a new stable state is reached (Fig. 69.5f).


Fig. 69.7 Distribution of the pheromone intensity when the network is composed of a grid of 10000 sensor nodes with 50 sinks


Finally, in order to test the scalability of the MAM in more complex scenarios, we have assumed a rectangular grid with $n_h = n_w = 100$, hence with $N = 100 \times 100 = 10000$ sensors, and we have randomly scattered 50 sinks in the grid. The grid is represented in Fig. 69.6, where the sinks are drawn as black spots. Since each sensor is represented by an MA with $P = 20$ states (Fig. 69.3b), the product state space of the overall system has $20^{10000}$ states!
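To give a feeling for the size of this product state space, the following sketch compares it with the number of quantities needed when one state distribution is kept per agent. That per-agent bookkeeping is an assumption suggested by the use of per-cell distributions in this section, and both function names are hypothetical.

```python
import math


def product_state_space_digits(num_agents, states_per_agent):
    """Number of decimal digits of states_per_agent ** num_agents."""
    return math.floor(num_agents * math.log10(states_per_agent)) + 1


def per_agent_unknowns(num_agents, states_per_agent):
    """State probabilities tracked when each agent keeps its own distribution."""
    return num_agents * states_per_agent
```

For 10000 agents with 20 states each, the product space size is a number with 13011 decimal digits, whereas a per-agent description involves 200000 probabilities.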

The steady-state pheromone intensity distribution for the geographical space represented in Fig. 69.6 is reported in Fig. 69.7. This experiment shows that the pheromone gradient is also reached when no symmetries are present in the network, and that the proposed model is able to capture the behavior of the protocol in generating a correct pheromone gradient also in the presence of multiple maxima. Using the same protocol configuration found for the simple scenario, the SI algorithm is able to create a well-formed pheromone gradient in a completely different situation, which makes this routing technique suitable for unpredictable scenarios. This scenario also demonstrates the scalability of the proposed analytical technique, which can be easily adopted in the analysis of very large networks.


69.4 Ant Colony Optimization


The aim of this section is to show how MAMs can be adopted to represent one of the most classical swarm intelligence algorithms, ACO [69.2], which was inspired by the foraging behavior of ant colonies that, during food search, exhibit the ability to solve simple shortest path problems. To this end, we show how to build a MAM that solves the famous double bridge experiment, which was first proposed by Deneubourg et al. in the early 1990s [69.16, 17] and has since been used as an entry benchmark for ACO models.

In the experiment, a nest of Argentine ants is connected to a food source using a double bridge as shown in Fig. 69.8. Two scenarios are considered: in the first one the bridges have equal length (Fig. 69.8a), in the second one the lengths of the bridges are different (Fig. 69.8b). The collective behavior can be explained by the way in which ants communicate indirectly with each other (stigmergy). During the journey from the nest to the food source and vice versa, ants release an amount of pheromone on the ground. Moreover, ants can perceive pheromone and they choose with greater probability a path marked by a stronger concentration of pheromone. As a result, ants releasing pheromone on a branch increase the probability that other ants choose it. This phenomenon is the realization of the positive feedback process described in Sect. 69.1 and it is the reason for the convergence of ants to the same branch in the equal length bridge case. When the lengths are different, the ants choosing the shorter path reach the food source more quickly than those choosing the longer path. Therefore, the pheromone trail grows faster on the shorter bridge and more ants choose it to reach food. As a result, eventually all ants converge on the shortest path.


Fig. 69.8a,b Experiment scenarios. Modified from Goss et al. [69.17]. (a) Equal branches, (b) Different branches


69.4.1 The MAM Model 


We represent the double bridge experiment through 
a multiple agent class and multiple message type MAM. 
We model ants by messages, and locations that ants 
traverse by MAs. Three different MA classes are intro- 
duced: the class Nest denoted by superscript n, the class 
Terrain denoted by superscript t, and the class Food de- 
noted by superscript f. Two types of messages are used: 
ants walking from the nest to the food source corre- 
spond to messages of type fw (forward), whereas ants 
coming back to the nest correspond to messages of type 
bw (backward). 


Geographical Space 

Agents (either nest, terrain, or food source) are de- 
ployed on a discrete geographical space V represented 
as an undirected graph G = (V, E), where the elements 
in the set V are the vertices and the elements in the 
set E are the edges of the graph. In Fig. 69.9a,b, we 
show the locations of agents for the equal and the dif- 
ferent length bridge scenarios, respectively. The squares 
are the vertices of the graph and the labels inside them 
indicate the class of the agent residing on the vertex. 
In this model, we assume that only a single agent re- 
sides on each vertex. Message passing from a node to 
another is depicted as little arrows labeled by the mes- 
sage type. As shown in Fig. 69.9, different lengths of the 
branches are represented by a different number of hops 
needed by a message to reach the food source starting 
from the nest. Figure 69.9c represents a three-branch
bridge with branches of different lengths.


Agent's Structure and Behavior 
The structure of the three MA classes is described in the 
following: 


• MA Nest: The nest is represented by a single MA of class $n$, shown in Fig. 69.10a. The nest $\mathrm{MA}^n$ is composed of a single state that emits messages of type fw at a constant rate $\lambda$, modeling ants leaving the nest in search of food.

• MA Terrain: An MA of class $t$ (Fig. 69.10c) represents a portion of terrain on which an ant walks and encodes in its state space the concentration of the pheromone trail on that portion of the ground. We assume that the intensity of the pheromone trail is discretized in $P$ levels numbered $0, 1, \ldots, P-1$. With reference to Fig. 69.10c, the meaning of the states is the following:

  • $t_0$ denotes no pheromone on the ground and no ant walking on it;
  • $t_i$ denotes a concentration of pheromone of level $i$ and no ant on the ground;
  • $t_{if}$ denotes an ant of forward type residing on the terrain while the pheromone concentration is at level $i$;
  • $t_{ib}$ denotes an ant of backward type residing on the terrain while the pheromone concentration is at level $i$.

  The behavior of the $\mathrm{MA}^t$ agent at the reception of the messages is the following:

  • fw (forward ant): A message of type fw perceived by an $\mathrm{MA}^t$ in state $t_i$ induces a transition to state $t_{(i+1)f}$, meaning that the arrival of a forward ant increases the pheromone concentration by one level (positive feedback).
  • bw (backward ant): A message of type bw perceived by an $\mathrm{MA}^t$ in state $t_i$ induces a transition to state $t_{(i+1)b}$, meaning that the arrival of a backward ant increases the pheromone concentration by one level (positive feedback).

  Ants remain on a single terrain portion for a mean time of $1/\eta$, then they leave toward another destination. The local transitions from states $t_{if}$ to states $t_i$ and the generation of message fw model this behavior for forward ants. An analogous behavior is represented for backward ants by local transitions from states $t_{ib}$ to states $t_i$. The local transitions at constant rate $\mu$ from states $t_i$ to states $t_{i-1}$ indicate the decrease by one unit of the concentration of pheromone due to evaporation (negative feedback).

• MA Food source: An MA of class $f$ represents the food source (Fig. 69.10b). The reception of a message of type fw in state $f_0$ indicates that a forward ant has reached the food source. After a mean time of $1/\eta$, such an ant leaves the food and starts the way back to the nest, becoming a backward ant (emission of a message of type bw).

In order to keep the model complexity low, and thus increase the model readability, we have chosen to limit to 1 the number of ants that can reside at the same time on a portion of terrain (or in the food source). For this reason, message reception is not enabled in states $t_{if}$ or $t_{ib}$ for MAs of class $t$ and in state $f_1$ for MAs of class $f$. In future works, we will study effective techniques (e.g., intervening on MA density) in order to relax this assumption.

Fig. 69.9a-c Graph used to model the experiment scenarios. (a) Equal branches, (b) two different branches, (c) three different branches

Fig. 69.10a-c Markovian agent models for the ACO experiment. (a) $\mathrm{MA}^n$: Agent of class nest. (b) $\mathrm{MA}^f$: Agent of class food. (c) $\mathrm{MA}^t$: Agent of class terrain


Perception Function
The perception function rules the interactions among agents and, in this particular example, defines the probability that a message (ant) follows a specific path in both the forward and backward directions. The definition of the perception function takes inspiration from the stochastic model proposed in [69.16, 17] to describe the dynamics of the ant colony. In such a model the probability of choosing the shorter branch is given by


$$p_{is}(\tau) = \frac{(k + \varphi_{is}(\tau))^{\alpha}}{(k + \varphi_{is}(\tau))^{\alpha} + (k + \varphi_{il}(\tau))^{\alpha}} , \tag{69.15}$$

where $p_{is}(\tau)$ (respectively $p_{il}(\tau)$) is the probability of choosing the shorter (longer) branch, and $\varphi_{is}(\tau)$ ($\varphi_{il}(\tau)$) is the total amount of pheromone on the shorter (longer) branch at time $\tau$. The parameter $k$ is the degree of attraction attributed to an unmarked branch. It is needed to provide a non-null probability of choosing a path not yet marked by pheromone. The exponent $\alpha$ provides a nonlinear behavior.
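As a quick numerical illustration of the choice probability, assume the purely illustrative values $k = 20$, $\alpha = 2$, a pheromone amount of 10 on the shorter branch, and 5 on the longer one (none of these values is taken from the experiments of this chapter):

```latex
\begin{align*}
p_{is} &= \frac{(20 + 10)^{2}}{(20 + 10)^{2} + (20 + 5)^{2}}
        = \frac{900}{900 + 625}
        = \frac{900}{1525} \approx 0.59 , \\
p_{il} &= \frac{625}{1525} \approx 0.41 .
\end{align*}
```

With no pheromone on either branch, both probabilities equal $k^{\alpha} / (2 k^{\alpha}) = 0.5$, which is the role of $k$ described above.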

In our MA model, the perception function $u^m(\cdot)$ is defined, $\forall m \in \{fw, bw\}$, as

$$u^m(v, c, i, v', c', j, \tau) = \frac{(k + E[\pi^c(\tau, v)])^{\alpha}}{\sum_{(c'', v'') \in \mathrm{Next}^m(v', c')} (k + E[\pi^{c''}(\tau, v'')])^{\alpha}} , \tag{69.16}$$

where $k$ and $\alpha$ have the same meaning as in (69.15), and $E[\pi^c(\tau, v)]$ gives the mean value of the concentration of pheromone at time $\tau$ in position $v$ on the ground, and corresponds to $\varphi(\tau)$. The computation of $E[\pi^c(\tau, v)]$ will be addressed in (69.18). The function $\mathrm{Next}^m(v', c')$ gives the set of pairs $\{(c'', v'')\}$ such that the agent of class $c''$ in position $v''$ perceives a message of type $m$ emitted by the agent of class $c'$ in position $v'$. Figure 69.11a helps to interpret (69.16). The multiple box stands for all the agents receiving a message $m$ sent by the agent of class $c'$ in position $v'$. The value of $u^m(v, c, i, v', c', j, \tau)$ is proportional to the mean pheromone concentration of the agent in class $c$ at position $v$ with respect to the sum of the mean concentrations of all the agents that receive message $m$ from the agent in class $c'$ and position $v'$. For instance, we consider the scenario depicted in Fig. 69.11b, where a class $n$ agent in position $b_0$ sends messages of type fw to two class $t$ agents at positions $b_1$ and $b_2$, and we compute $u^{fw}(b_2, t, i, b_0, n, j, \tau)$. In such a case, the evaluation of the function $\mathrm{Next}^{fw}(b_0, n)$ gives the set of pairs $\{(t, b_1), (t, b_2)\}$ and the value of the function is

$$u^{fw}(b_2, t, i, b_0, n, j, \tau) = \frac{(k + E[\pi^t(\tau, b_2)])^{\alpha}}{(k + E[\pi^t(\tau, b_1)])^{\alpha} + (k + E[\pi^t(\tau, b_2)])^{\alpha}} . \tag{69.17}$$

As a final remark, we highlight that $u^m(\cdot)$ does not depend on the state variables $i$ and $j$ of the sender and receiver agents, even though these variables appear in the definition of $u^m(\cdot)$ in (69.16). Instead, $u^m(\cdot)$ depends on the whole probability distribution $\pi^c(\tau, v)$ needed to compute the mean value $E[\pi^c(\tau, v)]$.

Fig. 69.11a,b Perception function description. (a) General case, (b) example of scenario in Fig. 69.9b
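The normalization in (69.16) can be sketched as below. The function name `perception` and the dictionary layout are hypothetical: `means` maps each (class, position) pair returned by $\mathrm{Next}^m(v', c')$ to its mean pheromone concentration, and the parameter values in the usage note are illustrative only.

```python
def perception(means, receiver, k, alpha):
    """Probability that a message is routed to `receiver`, as in (69.16).

    means    : dict mapping (class, position) -> mean pheromone concentration
               for every agent that can perceive the message
    receiver : the (class, position) key whose probability is requested
    """
    weights = {key: (k + mean) ** alpha for key, mean in means.items()}
    return weights[receiver] / sum(weights.values())
```

For the two-receiver scenario of Fig. 69.11b with `means = {("t", "b1"): 0.0, ("t", "b2"): 1.0}`, `k = 1`, and `alpha = 2`, the weights are 1 and 4, so the message reaches `b2` with probability 0.8 and `b1` with probability 0.2; the probabilities over all receivers always sum to 1.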


Generation and Acceptance Probabilities 

As in Sect. 69.3, in this ACO-MAM model we only allow $g_{ij}^c(m) = 0$ or $g_{ij}^c(m) = 1$ and $a_{ij}^c(m) = 0$ or $a_{ij}^c(m) = 1$, $\forall c, m$. In particular, for the terrain agent $\mathrm{MA}^t$, messages of type fw are sent with probability $g^t_{t_{if}, t_i}(fw) = 1$, and are accepted with probability $a^t_{t_i, t_{(i+1)f}}(fw) = 1$ only in a $t_i$ state, inducing a transition to a $t_{(i+1)f}$ state. An analogous behavior is followed during emission and reception of messages of type bw.

Fig. 69.12 Mean pheromone concentration with $\lambda = 1.0$, $\mu = 1$ and $\eta = 1$ for the equal branches experiment

Fig. 69.13a-f Mean pheromone concentration for the case with two different branches. (a) $\lambda = 1.0$, $\mu = 0$ and $\eta = 1$, (b) $\lambda = 1.0$, $\mu = 0$ and $\eta = 10$, (c) $\lambda = 1.0$, $\mu = 0.5$ and $\eta = 1$, (d) $\lambda = 1.0$, $\mu = 0.5$ and $\eta = 10$, (e) $\lambda = 1.0$, $\mu = 2$ and $\eta = 1$, (f) $\lambda = 1.0$, $\mu = 2$ and $\eta = 10$
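The acceptance and emission rules of the terrain agent can be sketched as a small state machine. The representation of a state as a `(level, ant)` pair and the function names are hypothetical. The handling of the top level is also an assumption: the text does not say what happens when an ant arrives at level $P-1$, so this sketch accepts the ant and keeps the level capped.

```python
def on_message(state, msg, P):
    """Terrain state after perceiving `msg` ('fw' or 'bw').

    state = (level, ant) with ant in {None, 'f', 'b'}.
    """
    if msg not in ("fw", "bw"):
        raise ValueError("unknown message type")
    level, ant = state
    if ant is not None:
        return state  # reception disabled: at most one ant per terrain portion
    new_level = min(level + 1, P - 1)  # assumption: level saturates at P-1
    return (new_level, "f" if msg == "fw" else "b")


def on_departure(state):
    """The resident ant leaves: returns the new state and the emitted message."""
    level, ant = state
    if ant is None:
        raise ValueError("no ant on this terrain portion")
    return (level, None), ("fw" if ant == "f" else "bw")


def on_evaporation(state):
    """Evaporation arc t_i -> t_(i-1); only ant-free states evaporate."""
    level, ant = state
    if ant is None and level > 0:
        return (level - 1, None)
    return state
```

A forward ant arriving on an empty, unmarked portion moves it from `(0, None)` to `(1, "f")`; when the ant departs, the portion is left in `(1, None)` and a message of type fw is emitted toward the next location.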


69.4.2 Numerical Results for ACO Double Bridge Experiment


We have performed several experiments on the ACO model. In particular, we study the mean value of the concentration of pheromone at time $\tau$ in position $v$ for a class $c$ agent, $E[\pi^c(\tau, v)]$, defined as

$$E[\pi^c(\tau, v)] = \sum_{s \in S^c} \pi_s^c(\tau, v)\, I(s) , \tag{69.18}$$

where $S^c$ denotes the state space of a class $c$ agent, and $I(s)$ represents the pheromone level in state $s$, which corresponds to

$$I(s) = i , \quad \forall s \in \{t_i\} \cup \{t_{if}\} \cup \{t_{ib}\} . \tag{69.19}$$

This value is used in (69.16) to compute $u^m(\cdot)$ which, as previously said, rules the ant's probability to follow a specific path; therefore, this performance index provides useful insight into the modeled ants' behavior.

Fig. 69.14a,b Mean pheromone concentration for the case with three different branches. (a) $\eta = 10$, (b) $\eta = 10$
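The mean pheromone concentration of (69.18) and (69.19) can be sketched as a weighted sum over the terrain states. The encoding of a state as a `(level, ant)` pair and the function names are hypothetical; the state distribution is assumed to be given as a dictionary.

```python
def pheromone_level(state):
    """I(s) of (69.19): the level i of states t_i, t_if, and t_ib.

    A state is encoded as (level, ant) with ant in {None, 'f', 'b'}.
    """
    level, _ant = state
    return level


def mean_pheromone(dist):
    """E[pi] of (69.18): sum over states of probability times level."""
    return sum(p * pheromone_level(s) for s, p in dist.items())
```

For example, a distribution with probability 0.5 on `(0, None)`, 0.25 on `(1, None)`, 0.125 on `(1, "f")`, and 0.125 on `(2, "b")` has mean level 0.625; whether an ant is present does not affect $I(s)$.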


We consider the three scenarios depicted in Fig. 69.9, where the labels $b_j$ denote the positions at which we compute the mean value of the concentration of pheromone. In all the experiments, the intensity of the pheromone trail is discretized in $P = 8$ levels.

In Fig. 69.12, the mean pheromone concentration $E[\pi^c(\tau, b_j)]$ over time for the equal branches experiment is plotted. As can be seen, both mean pheromone concentrations have exactly the same evolution, showing that ants do not prefer either of the routes.

The case with two different branches is considered in Fig. 69.13. The speed of the ants (i.e., parameter $\eta$) varies across the columns (the left column corresponds to $\eta = 1.0$ and the right column to $\eta = 10$), while the evaporation of the pheromone varies across the rows (with $\mu = 0$, $\mu = 0.5$, and $\mu = 2$, respectively). When no evaporation is considered (Fig. 69.13a,b), both paths are equally chosen due to the finite maximum pheromone level considered in this work. However, the shorter path reaches its maximum level earlier than the longer route. In all the other cases, it can be seen that the longer path is abandoned after a while in favor of the shorter one. The evaporation of the pheromone and the speed of the ants both play a role in the time required to drop the longer path. Increasing either of the two reduces the time required to discover the shorter route.

Finally, Fig. 69.14 considers a case with three branches of different length and different evaporation levels ($\eta = 1$ and $\eta = 10$). Also in this case the model is able to predict that ants will choose the shortest route. It also shows that longer paths are dropped in order of decreasing length: the longest route is dropped first, and the intermediate route is discarded second. Also in this case, the evaporation rates determine the speed at which paths are chosen and discarded.


69.5 Conclusions


In this work, we have presented how the Markovian agents performance evaluation formalism can be used to study swarm intelligence algorithms. Although the formalism was developed to study largely distributed systems like sensor networks, or physical propagation phenomena like fire or earthquakes, it has proven to be very efficient in capturing the main features of swarm intelligence.

Besides the two cases presented in this chapter, routing in WSNs and ant colony optimization, the formalism is capable of considering other cases such as slime mold models. Future research lines will try to emphasize the relations between Markovian agents and swarm intelligence, trying to integrate both approaches: using Markovian agents to formally study new swarm intelligence algorithms, and using swarm intelligence techniques to study complex Markovian agent models in order to find optimal operating points and best connection strategies.


70. Honey Bee Social Foraging Algorithm for Resource Allocation

Jairo Alonso Giraldo, Nicanor Quijano, Kevin M. Passino 


Bioinspired mechanisms are an emerging area in the field of optimization, and various algorithms have been developed in the last decade. We introduce a novel bioinspired model based on the [...] voltage allocation problem to achieve a maximum uniform temperature in a multizone temperature [...]

Over several decades, researchers' interest in understanding the patterns and collective behaviors of some organisms has increased because of the possibility of generating mathematical models that can be used for solving problems [70.1]. These bioinspired models have been used to develop robust technological solutions in different research fields [70.2]. One of the first models based on natural behaviors is the genetic algorithm (GA), proposed by Holland in [70.3]. This method reproduces the concepts of evolution, considering natural selection, reproduction, and mutation in organisms. Many variations have been developed since then [70.4], and a wide variety of applications have been implemented [70.5, 6]. A subfield of bioinspired algorithms is the so-called swarm intelligence [70.1, 7], which is inspired by the collective behavior of social animals that are able to solve distributed and complex problems by following simple individual rules and producing emergent behaviors. Swarm intelligence mainly refers to those techniques inspired by the social behavior of insects, such as ants [70.8] and bees [70.9-11], or the social interaction of different animal societies (e.g., flocks of birds) [70.12]. Ant colony optimization (ACO), as introduced by Dorigo et al. [70.13], mimics the foraging behavior of a colony of ants, based on pheromone proliferation, and it has been used in the solution of optimization problems [70.1, 14] and in some engineering applications [70.15-17]. Another common approach is particle swarm optimization (PSO), which mimics the behavior of social organisms that move according to the knowledge of their neighbors' goodness, and it is able to solve continuous optimization problems [70.18]. This technique has been widely implemented in a variety of applications, such as economic dispatch [70.19, 20], feature selection [70.21], and some resource allocation problems [70.22, 23], to name just a few.

There are also several bioinspired techniques based on the collective behavior of foraging bees, and each one has different characteristics and applications. In [70.24], a decentralized honey bee algorithm is presented, which is based on the distribution of forager bees among flower patches in such a way that the nectar intake is maximized. This technique has been applied to Internet server hosting distribution. Tereshko in [70.25] also developed a model of the foraging behavior of a honey bee colony based only on the recruitment and abandonment process, taking into account just the local information of a food source. In [70.26], the algorithm was improved by considering both local and global information of food sources. Another honey bee foraging algorithm was developed by Karaboga [70.27]; it is called the artificial bee colony (ABC) and can be used to solve unconstrained optimization problems. In [70.28], the performance of the ABC algorithm was compared with that of other common heuristic algorithms, such as genetic algorithms and particle swarm optimization. The authors conclude that ABC can be used for multivariable and multimodal function optimization. Several applications have been developed [70.29-31], and some improvements have been made in order to solve constrained optimization problems [70.32]. Teodorović and Dell'Orco in [70.33] introduced another algorithm based on honey bee foraging called bee colony optimization (BCO). This technique is very similar to ABC and follows almost the same steps of exploring, foraging, and recruitment based on the waggle dance. However, in ABC, the initial population is distributed in such a way that scouts and foragers are in equal proportion, while in BCO the initial population distribution is not fixed. Some applications have been developed using BCO to solve difficult optimization problems, such as combined heat and power dispatch [70.34] and job scheduling [70.35]. There are many other applications, and we refer the reader to [70.36] for an extensive literature review of the field.

In general, none of the previous optimization methods based on honey bee social foraging attempts to mimic the whole behavior of the foraging process. They mainly concentrate on the communication between the agents (bees), which is achieved through the waggle dance. One of the goals of this chapter is to show another swarm intelligence method (i.e., honey bee social foraging) that mimics very closely the real behavior of a hive (or even multiple hives) of bees, in order to solve dynamic resource allocation problems. This method is based on the models obtained by Seeley and Passino in [70.37, 38], where each bee can be an explorer, an employed forager, an observer, or a rester. The foraging process consists of exploring a landscape with sites of different profitability. If a site is good enough, the explorer will try to recruit other bees in the hive using the waggle dance, which varies its intensity according to the quality of the site and the nectar unloading time. The observers will tend to follow the bees with the higher dance intensity and they may become employed foragers. If a site is no longer good enough, the bees may tend to become observers and will try to follow another waggle dance. One of the advantages of this method is that each bee only considers local information about its position and the profitability of the forage site. Besides, communication is only considered in the waggle dance process, which depends on the nectar unloading wait time. Hence, we do not need to have full information about each agent. Even so, with only the unloading wait time information, an emergent behavior is produced, and complex resource allocation problems can be solved. This method is based on experimental results and imitates almost the whole behavior of honey bees during the foraging process, which is not the case with the approaches presented before, which only consider a few actions of the foraging activity.

The utility of the theoretical concepts introduced in this chapter is illustrated in an engineering application, which consists of a multizone temperature control grid. These kinds of problems are very important in commercial and industrial applications, including the distributed control of thermal processes, semiconductor processing, and smart building temperature control [70.39-42]. Here, we use a multizone grid similar to the one in [70.43], with four zones, each one with a temperature sensor and a lamp that varies its temperature. The complexity of these kinds of problems arises mainly from the interzone effects (e.g., lamps affecting the temperature in neighboring zones), ambient temperature and external wind currents, zone component differences, and sensor noise. This is why common control strategies cannot be applied. For this reason, different experiments are implemented in order to observe the performance of the algorithm under different conditions. Besides, we compare its behavior with two common evolutionary algorithms, i.e., genetic algorithm and PSO, which have been selected because of their low computational cost and their high capability to solve optimization problems. These algorithms have been modified in order to solve dynamic resource allocation problems, so that their behavior can be compared with the honey bee social foraging algorithm.


This chapter is organized as follows. First, in Sect. 70.1, we introduce the honey bee social foraging algorithm. Then, in Sect. 70.2 the multizone temperature problem is presented and the other two evolutionary strategies, genetic algorithm and PSO for resource allocation, are introduced. The results and comparisons are presented in Sects. 70.3 and 70.4, and in Sect. 70.5 some conclusions are drawn.


70.1 Honey Bee Foraging Algorithm


The honey bee social foraging algorithm models the behavior of social honey bees during nectar foraging, based on experimental studies summarized in [70.37] and some ideas from other mathematical models. This algorithm models activities such as exploration and foraging, nectar unload, dance strength decisions, explorer allocation, recruitment on the dance floor, and interactions with other hives. The theory and the experiments are based on the work developed by Quijano and Passino in [70.43].


70.1.1 Landscape of Foraging Profitability


The landscape is assumed to be a spatial distribution of forage sites with encoded information of the foraging profitability that quantifies the distance from the hive, nectar sugar content, nectar abundance, and any other relevant site variables. There are $B$ bees, each represented by a two-dimensional position $\theta^i \in \mathbb{R}^2$, for $i = 1, 2, \ldots, B$. During foraging, bees sample a foraging profitability landscape denoted by $J_f(\theta) \in [0, 1]$, which is proportional to the profitability of nectar at location $\theta$. Hence, $J_f(\theta) = 1$ represents a location with the highest possible profitability and $J_f(\theta) = 0$ represents a location with no profitability.

As an example, assume the foraging landscape $J_f(\theta)$ is zero everywhere except at forage sites. We could have four forage sites, indexed by $j = 1, 2, 3, 4$, centered at various positions that are initially unknown to the bees. Each site can be represented as a cylinder with radius $\epsilon_f^j$ and height $N_f^j \in [0, 1]$ that is proportional to nectar profitability. We may also assume that the profitability of a bee being at site $j$ decreases as the number of bees visiting that site increases. This can be denoted by $s^j$, which in behavioral ecology theory is called the suitability function [70.44].


70.1.2 Roles and Expedition of Bees


There are several kinds of bees involved in the foraging process during an expedition, and each kind has a different function. An expedition can be considered as a time instant where each bee executes a function according to its role. There are $B_f(k)$ employed foragers that actively bring nectar back from some site, and some of them dance to recruit new bees if the site is good. $B_e(k)$ explorer foragers go to random positions in the environment, bring their nectar back if they find any, dance to recruit, and can become foragers if they find a relatively good site. There are $B_u(k) = B_o(k) + B_r(k)$ unemployed foragers, with $B_r(k)$ bees that rest (or are involved in some other activity), and $B_o(k)$ that observe the dances of employed and explorer foragers on the dance floor. Some of the observers will follow the dances.

We ignore the specific path used by the foragers on expeditions and
we assume that a bee samples the foraging profitability landscape
once on its expedition, and that this value is held when the bee
returns to the hive. Let the foraging profitability assessment by
the employed forager or explorer i be

$$
F^i(k) =
\begin{cases}
1 & \text{if } J_f(\theta^i(k)) + w_f^i(k) \ge 1, \\
J_f(\theta^i(k)) + w_f^i(k) & \text{if } 1 > J_f(\theta^i(k)) + w_f^i(k) > \epsilon_n, \\
0 & \text{if } J_f(\theta^i(k)) + w_f^i(k) \le \epsilon_n,
\end{cases}
$$

where $\theta^i(k)$ represents the position of the i-th bee at the
k-th expedition, and $w_f^i(k)$ is the profitability assessment
noise, which can be considered uniformly distributed on
$(-0.1, 0.1)$. The value $\epsilon_n$ sets a lower threshold on
site profitability, and here we use $\epsilon_n = 0.1$.
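The landscape and the noisy assessment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the site centers, radii, and heights in `SITES` are hypothetical values, the function names are ours, and the crowding effect described by the suitability function is left out.

```python
import math
import random

# Hypothetical forage sites: (center_x, center_y, radius, height).
# Radius and height play the roles of the site radius and the
# nectar profitability of the site; the numbers are made up.
SITES = [
    (0.25, 0.25, 0.1, 1.0),
    (0.75, 0.25, 0.1, 0.8),
    (0.25, 0.75, 0.1, 0.6),
    (0.75, 0.75, 0.1, 0.4),
]
EPS_N = 0.1  # lower profitability threshold used in the text


def landscape(theta, sites=SITES):
    """Profitability landscape: site height inside a cylinder, else 0."""
    x, y = theta
    for cx, cy, radius, height in sites:
        if math.hypot(x - cx, y - cy) <= radius:
            return height
    return 0.0


def assess(theta, rng=random, noise=0.1, eps_n=EPS_N):
    """Noisy assessment: saturate at 1, zero out values at or below eps_n."""
    value = landscape(theta) + rng.uniform(-noise, noise)
    if value >= 1.0:
        return 1.0
    if value > eps_n:
        return value
    return 0.0
```

A bee sampling outside every site always gets an assessment of zero, because the noise alone cannot exceed the threshold of 0.1.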


70.1.3 Dance Strength Determination 


The number of waggle runs of bee i at the k-th expedition is called
the dance strength and is denoted by $L_f^i(k)$. The unemployed
foragers have $L_f^i(k) = 0$, and the employed foragers that have
$F^i(k) = 0$ will also have $L_f^i(k) = 0$, since they did not find
a location above the profitability threshold $\epsilon_n$; for this
reason, they will become unemployed foragers.


Unloading Waiting Time 
Now, we will explain dance strength decisions for the employed
foragers and explorers that find a site of sufficiently good
profitability and have $F^i(k) > \epsilon_n$. First, we have to
model the unloading wait time in order to relate it to the dance
strength. Let $F_t(k) = \sum_{i=1}^{B_f(k)} F^i(k)$ be the total
nectar profitability assessment at time k for the hive, and let
$F_q^i(k)$ be the quantity of nectar gathered for a profitability
assessment $F^i(k)$. We assume that $F_q^i(k) = \alpha F^i(k)$,
where $\alpha > 0$ is a proportionality constant. We may choose
$\alpha = 1$, such that the total hive nectar influx $F_{tq}(k)$ is
equal to the total nectar profitability assessment. Suppose that
the number of food-storer bees is sufficiently large, so that the
wait time $W^i(k)$ that bee i experiences is given by

$$
W^i(k) = \psi \max\{F_{tq}(k) + w_w^i(k),\, 0\}, \tag{70.1}
$$

where $\psi$ is a scale factor and $w_w^i(k)$ is a random variable
uniformly distributed on $(-w_w, w_w)$ that represents variations
in the wait time a bee experiences. When the total nectar influx is
maximum, the wait time is approximately 30 s, based on the
experiments in [70.45]. With this assumption we can obtain the
values of $\psi$ and $w_w$ from the fact that the maximum wait time
from (70.1) is given by $\psi (B + w_w) = 30$. Note that
$\psi w_w$ is the variation in the number of seconds in wait time
due to the noise, and $w_w$ has to be set adequately. If we let
$\psi w_w = 5$ and assume that $B = 200$, we obtain two equations
in two unknowns, which give $\psi = 25/200$ and $w_w = 40$.
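The calibration of the wait-time model can be checked numerically. The sketch below solves the two conditions (a maximum wait of 30 s and a noise contribution of 5 s with 200 bees) for the scale factor and the noise bound, and then evaluates (70.1). The names `psi`, `w_w`, and `wait_time` are our own labels for the quantities in the text.

```python
import random

B = 200              # number of bees, as assumed in the text
MAX_WAIT = 30.0      # seconds, wait time at maximum nectar influx
NOISE_SECONDS = 5.0  # scale factor times noise bound, in seconds

# Two conditions: psi * (B + w_w) = MAX_WAIT and psi * w_w = NOISE_SECONDS.
# Subtracting the second from the first gives psi * B = 25.
psi = (MAX_WAIT - NOISE_SECONDS) / B
w_w = NOISE_SECONDS / psi


def wait_time(total_influx, rng=random):
    """Equation (70.1): scaled, noisy, nonnegative unloading wait time."""
    noise = rng.uniform(-w_w, w_w)
    return psi * max(total_influx + noise, 0.0)
```

With these numbers the scale factor is 25/200 = 0.125 and the noise bound is 40, so a full influx of 200 gives a wait time between 20 s and 30 s.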


Dance Decision Function 
Now, we assume that each successful forager converts the wait time
it experienced into a scaled version of an estimate of the total
nectar influx, which we define as

$$
\hat{F}_{tq}^i(k) = \delta W^i(k). \tag{70.2}
$$

The value $\hat{F}_{tq}^i(k)$ provides bee i with a noisy estimate
of the whole colony's foraging performance, since it indicates how
many successful foragers are waiting to be unloaded [70.37]. The
proportionality constant is $\delta > 0$, and since
$W^i(k) \in [0, \psi (B + w_w)] = [0, 30]$ s, it follows that
$\hat{F}_{tq}^i(k) \in [0, 30\delta]$. In order to ensure that
$\hat{F}_{tq}^i(k) \in [0, 1]$, we consider
$0 < \delta \le 1/30$.


With this estimation, each bee has to decide how 
long to dance according to some forage site variables 
that determine the energetic profitability (e.g., distance 
from hive, sugar content of nectar, nectar abundance), 
and some conditions that determine the threshold of 
the dance response (e.g., weather, time of day, colony’s 
nectar influx). The decision function is 


$$
L_f^i(k) = \max\{\beta\,(F^i(k) - \hat{F}_{tq}^i(k)),\, 0\}, \tag{70.3}
$$

which indicates the number of waggle runs of bee i at expedition k.
The parameter $\beta > 0$ has the effect of a gain on the rate of
recruitment for sites above the dance threshold, and experimentally
[70.37] we can set $\beta = 100$.

When a bee has $L_f^i(k) > 0$, it may consider dancing for its
forage site. The probability that bee i will choose to dance for
the site it is dedicated to is given by

$$
p_r(i, k) = \frac{\phi}{\beta}\, L_f^i(k),
$$

where $\phi \in [0, 1]$; matching the behavior found in
experiments, we choose $\phi = 1$.
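A minimal sketch of the dance decision follows. It assumes the scaling constant is set to 1/30 (so that a 30 s wait maps to an influx estimate of 1) and that the dance probability is the dance strength scaled by the dance tendency over the gain. Both choices, and the function names, are assumptions made for illustration.

```python
BETA = 100.0        # recruitment gain from the text
PHI = 1.0           # dance tendency chosen in the text
DELTA = 1.0 / 30.0  # assumed scaling: a 30 s wait maps to an estimate of 1


def dance_strength(profitability, wait_seconds, beta=BETA, delta=DELTA):
    """Waggle runs: gain times (site profitability minus influx estimate)."""
    influx_estimate = delta * wait_seconds  # scaled estimate, as in (70.2)
    return max(beta * (profitability - influx_estimate), 0.0)  # as in (70.3)


def dance_probability(strength, beta=BETA, phi=PHI):
    """Probability of dancing, proportional to the dance strength."""
    return phi * strength / beta
```

For example, a bee that assessed its site at 0.8 and waited 15 s gets an influx estimate of 0.5, hence 30 waggle runs and a dance probability of 0.3. A long wait signals a high colony influx and raises the bar a site must clear to be advertised.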


70.1.4 Explorer Allocation 
and Forager Recruitment 


Bees that are not successful on an expedition, or those that do not
consider dancing, become unemployed foragers. Some of these bees
will start to rest, and the others will become observers and start
seeking dancing bees in order to get recruited. The probability
that an unemployed forager or current rester bee will become an
observer is $p_o \in [0, 1]$. Based on the results in [70.46], we
choose $p_o = 0.35$, so that at times when no forage sites are
being harvested, about 35% of the bees act as forage explorers.

If an observer bee does not find any dance to follow, it will go
exploring. So we take the $B_o(k)$ observer bees, and each one can
become an explorer with probability $p_e(k)$ or can follow a dance
and become an employed forager with probability $1 - p_e(k)$. We
choose

$$
p_e(k) = \exp\left(-\frac{1}{2}\,\frac{L_t^2(k)}{\sigma^2}\right), \tag{70.4}
$$

where $L_t(k) = \sum_{i=1}^{B_f(k)} L_f^i(k)$ is the total number
of waggle runs on the dance floor at step k. Notice that if
$L_t(k) = 0$, there are no dancing bees on the dance floor, so
$p_e(k) = 1$ and all the observers will explore (i.e., 35% of the
unemployed foragers). Here, we choose $\sigma = 1000$, since it
produces patterns of foraging behavior in simulations that
correspond to experiments.

Now, we take the observer bees that did not go to explore, and some
of them will be recruited to follow the dance of bee i with
probability

$$
p_i(k) = \frac{L_f^i(k)}{\sum_{j=1}^{B_f(k)} L_f^j(k)}. \tag{70.5}
$$

In this way, bees that dance more strongly will tend to recruit
more foragers for their site.


Algorithm 70.1 summarizes the pseudocode of the honey bee social
foraging algorithm described above.


Algorithm 70.1 Honey Bee Social Foraging Algorithm

Set the parameter values
while stopping criterion is not reached do
    Determine the number of bees at each forage site, and compute
      the suitability of each site
    for each employed forager and explorer do
        Define a noisy profitability assessment $F^i(k)$ according
          to the location
        if $F^i(k) > \epsilon_n$ then
            if bee is an employed forager then
                Bee stays that way
            else
                Bee becomes an employed forager
            end if
        else
            Bee becomes an observer or rester
        end if
    end for
    Compute the total nectar profitability and the total nectar
      influx
    for all employed foragers do
        Compute the wait time $W^i$ and the noise $w_w^i$ for the
          unload wait time
        Compute the estimate of the scaled total nectar influx
          $\hat{F}_{tq}^i$
        Compute the dance decision function $L_f^i$
        if $L_f^i = 0$ then
            Bee becomes unemployed
        end if
        if employed forager should not recruit then
            Set $L_f^i = 0$; bee i is removed from those that dance
        end if
    end for
    Determine $L_t$. Employed foragers and successful forager
      explorers may dance based on sampling of profitability
    Send all employed foragers back to their previous site after
      recruitment for the next expedition
    for unemployed foragers do
        Set $W^i = L_f^i = 0$
        Unemployed foragers become observers with probability
          $p_o$. The remaining unemployed foragers become resters
    end for
    Set $p_e$
    for unemployed foragers do
        if rand < $p_e$ then
            Bee becomes explorer. Set explorer location for the
              next expedition
        end if
        for unemployed observers do
            Unemployed observer will be recruited by bee i with
              probability $p_i$
        end for
    end for
end while
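The last part of the loop, in which observers either explore or get recruited, can be sketched as follows. The sketch assumes a Gaussian-shaped exploration probability in the total number of waggle runs (it equals 1 when nobody dances, as the text requires) and recruitment proportional to each dancer's share of the waggle runs. Function names and the return format are hypothetical.

```python
import math
import random

SIGMA = 1000.0  # value chosen in the text


def explore_probability(dance_strengths, sigma=SIGMA):
    """Assumed Gaussian-shaped form; equals 1 when there are no dances."""
    total = sum(dance_strengths)
    return math.exp(-0.5 * (total / sigma) ** 2)


def allocate_observers(n_observers, dance_strengths, rng=random):
    """Return (number of explorers, recruits per dancing bee index)."""
    p_e = explore_probability(dance_strengths)
    explorers = 0
    recruits = [0] * len(dance_strengths)
    dancers = [i for i, s in enumerate(dance_strengths) if s > 0]
    for _ in range(n_observers):
        if not dancers or rng.random() < p_e:
            explorers += 1  # no dance followed: the observer explores
            continue
        # Recruitment proportional to each dancer's share of waggle runs.
        weights = [dance_strengths[i] for i in dancers]
        chosen = rng.choices(dancers, weights=weights, k=1)[0]
        recruits[chosen] += 1
    return explorers, recruits
```

A recruited observer would then be sent to the site of the bee whose dance it followed, while explorers receive a random location for the next expedition.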


70.2 Application in a Multizone Temperature Control Grid 


In order to apply the proposed algorithm in the context 
of a physical resource allocation problem, we imple- 
ment the multi-zone temperature control grid intro- 
duced in [70.43], with four zones as shown in Fig. 70.1. 
The relations between our problem and the proposed al- 
gorithm can be summarized as follows: 


1. We assume that there is a population of B individuals (i.e.,
   bees, chromosomes, or particles), each of which contains the
   information of a position $\theta_i$.

2. The search space is composed of allocation sites, each denoted
   by a position $R_j$ and a width $\epsilon_f^j$.

3. Each allocation site corresponds to zone j in the temperature
   grid, for $j = 1, \ldots, 4$.

4. Let $T_j^d$ and $T_j$ be the temperature reference and the
   temperature for zone j, respectively.

5. We consider the temperature error for zone j as
   $e_j = T_j^d - T_j$, and if an individual is located in a zone,
   its fitness is given by $\gamma e_j$, where $\gamma$ is
   a positive constant that sets the fitness value in the range
   [0, 1].


Fig. 70.1 Layout for the multizone temperature control grid (four
zones, each with a lamp and a temperature sensor)


70.2.1 Hardware Description 


A zone contains an LM35 temperature sensor and a lamp that varies
its intensity in order to increase or decrease the temperature of
the zone. The data acquisition and the lamps' intensity variations
are performed using a PIC18F4550 microcontroller, which receives
the temperature values (voltages between 0 and 5 V) and transmits
them through the USB port to a PC using the USB-bulk communication
class. We cannot guarantee that the four sensors have the same
characteristics (they have ±0.2 °C typical accuracy, and ±0.5 °C
guaranteed). With these temperature values, a Matlab program
executes an iteration of one of the algorithms and sends pulse
width modulation (PWM) width information back to the peripheral
interface controller (PIC) (Fig. 70.2). PWM signals are generated
using four digital outputs of the PIC and a couple of transistors
that supply the current and voltage necessary to control the lamps.
The width of the PWM signal depends on the number of individuals
(i.e., bees, chromosomes, or particles) allocated on a site. Each
individual is equivalent to a portion of the PWM, and a 100% duty
cycle corresponds to 12 V of direct current (DC).


Fig. 70.2 Layout of data acquisition and temperature control

We assume that there is a total amount of voltage that can be
distributed among the four zones. The goal is to allocate that
voltage in such a way that the reference temperature for each zone
can be achieved. However, achieving this goal is complicated by
external effects, such as ambient temperature and wind currents,
interzone effects, differences between the components of a zone,
and sensor noise. For this reason, the total voltage has to be
allocated dynamically, despite the external and internal effects.


70.2.2 Other Algorithms 
for Resource Allocation 


To compare the behavior of the honey bee social foraging algorithm
in solving dynamic resource allocation problems, two evolutionary
algorithms were selected as baselines, i.e., the genetic algorithm
(GA) and particle swarm optimization (PSO). These methods have been
implemented in a wide variety of applications because of their low
computational cost and their strong capability for solving
optimization problems. Some implementations for resource allocation
can be found in [70.22, 23, 47]. We will show below that these
algorithms can be adapted to solve these resource allocation
problems.


Genetic Algorithms 
A genetic algorithm (GA) is a random search algorithm based on the
mechanics of natural selection, genetics, and evolution [70.48].
The basic structure of the population is the chromosome. During
each generation, chromosomes are evaluated based on a fitness
function, and some of them are stochastically selected depending on
their fitness values. The population evolves from generation to
generation through the application of genetic operators [70.49].

The simplest GA form uses chromosome information that is encoded
into a binary string. The chromosomes are modified using three
operators: selection, crossover, and mutation [70.3]. Selection is
an artificial version of natural selection, and only the fittest
chromosomes from the population are selected. With crossover, two
parents are chosen for reproduction, and a crossover site (a bit
position) is randomly selected. The subsequences after the
crossover site are exchanged with a probability $p_c$, producing
two offspring with information from both parents. Then, mutation
randomly inverts a bit of a string with a very low probability, and
introduces new information into the population at the bit level.
Figure 70.3 illustrates the selection and crossover process for the
simplest GA.


Fig. 70.3 Selection and crossover in GA

Fig. 70.4a-d Average temperature (solid lines) and number of bees
per zone (stem plots) using the honey bee foraging algorithm

Fig. 70.5a-d Average temperature (solid lines) and number of
chromosomes per zone (stem plots) using the genetic algorithm

Fig. 70.6a-d Average temperature (solid lines) and number of
particles per zone (stem plots) using the PSO algorithm

Fig. 70.7a-d Average temperature (solid lines) and number of bees
per zone (stem plots) using the honey bee foraging algorithm.
Dashed lines indicate the beginning and end of the disturbance


To solve a resource allocation problem, we adjust this algorithm as
follows. We set a population of B individuals, each one with
binary-encoded information about its position $\theta_i$, for
$i = 1, \ldots, B$. There is a landscape that contains four
different resource sites, each one located at a position $R_j$ with
a width $\epsilon_f^j$, for $j = 1, \ldots, 4$. Each resource site
corresponds to a temperature zone in the multizone control grid.
When an individual is located in resource site j, the fitness of
that individual is given by the error between the current
temperature $T_j$ and the reference temperature $T_j^d$; otherwise,
the fitness is 0. As was pointed out before, each individual
corresponds to a portion of the total amount of voltage. For that
reason, if the fitness of a site is good, then as the population
evolves, most of the individuals will have the same genetic
information (position), and they will be allocated at that site.
Then, the voltage applied to that zone increases, and so does the
temperature, which causes the fitness associated with that site to
decrease, so the site becomes less profitable. In the next
generation, the individuals tend to be reallocated to another, more
profitable zone, and after some generations, the population is
distributed in such a way that a uniform temperature is achieved
for all sites.
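The adapted GA can be sketched as below. Several simplifications are assumptions of this sketch rather than details from the text: chromosomes are 8-bit lists, the four zones tile the whole encoded range (so every position belongs to some zone), selection is fitness-proportional, and the values of `gamma`, `p_c`, and `p_m` are placeholders. The population size is assumed to be even.

```python
import random

BITS = 8                     # hypothetical chromosome length
ZONE_WIDTH = 2 ** BITS // 4  # four equal-width zones over the encoded range


def zone_of(chromosome):
    """Decode a bit list into a position and map it to a zone index 0..3."""
    position = int("".join(map(str, chromosome)), 2)
    return position // ZONE_WIDTH


def fitness(chromosome, errors, gamma=0.1):
    """Scaled temperature error of the occupied zone, clipped to [0, 1]."""
    return min(max(gamma * errors[zone_of(chromosome)], 0.0), 1.0)


def next_generation(population, errors, p_c=0.8, p_m=0.01, rng=random):
    """One generation: selection, single-point crossover, bit mutation."""
    # Small offset keeps the weights valid when every fitness is zero.
    weights = [fitness(c, errors) + 1e-9 for c in population]
    parents = rng.choices(population, weights=weights, k=len(population))
    children = []
    for a, b in zip(parents[::2], parents[1::2]):
        a, b = a[:], b[:]  # copy so parents are not modified
        if rng.random() < p_c:
            site = rng.randrange(1, BITS)
            a[site:], b[site:] = b[site:], a[site:]
        children += [a, b]
    for child in children:
        for bit in range(BITS):
            if rng.random() < p_m:
                child[bit] ^= 1  # flip the bit
    return children
```

After each generation, counting how many chromosomes decode to each zone gives the number of individuals, and hence the share of the total voltage, assigned to that zone.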


Particle Swarm Optimization 
Particle swarm optimization (PSO) is a population- 
based stochastic optimization technique, inspired by 
the social behavior of animals (e.g., bird flocks, fish 
schools, or even human groups) [70.7]. In PSO, the 
potential solutions, called particles, fly through the 
problem space by following the currently best particles. 
They have two essential reasoning capabilities: mem- 


ory of their own best position and knowledge of the 
global or their neighborhood’s best. Each particle in 
a population has the information about its current po- 
sition that defines a potential solution, and its fitness 
value associated to that position. A change of posi- 
tion of a particle is defined by the velocity, which is 
a vector of numbers that are added to the position co- 
ordinates in order to move the particle from one time 
step to another. At each iteration, a particle’s veloc- 
ity is updated depending on the difference between the 
individual’s previous best and current positions, and 
the difference between the neighborhood’s best and the 
individual’s current position. With these simple rules, 
individuals tend to follow the particles associated to the 
more profitable sites, and optimization problems can 
be solved. The details of this method are summarized 
in [70.7]. 
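The velocity and position update described above can be sketched for one-dimensional particles as follows. The inertia weight and the two acceleration coefficients are common choices in the PSO literature and are assumptions here; the text does not state the values used in the experiments.

```python
import random


def pso_step(positions, velocities, personal_best, global_best,
             inertia=0.7, c1=1.5, c2=1.5, rng=random):
    """One synchronous velocity and position update for 1-D particles."""
    new_positions, new_velocities = [], []
    for x, v, p in zip(positions, velocities, personal_best):
        r1, r2 = rng.random(), rng.random()
        # Pull toward the particle's own best and toward the swarm's best.
        v_new = inertia * v + c1 * r1 * (p - x) + c2 * r2 * (global_best - x)
        new_velocities.append(v_new)
        new_positions.append(x + v_new)
    return new_positions, new_velocities
```

A particle that already sits at both its own best and the global best with zero velocity does not move, while the others drift toward the better positions.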
To solve a dynamic resource allocation problem, this algorithm is
adjusted as follows. First, we assume a population of B particles,
where each particle represents a one-dimensional position
$\theta_i$ in the search space. The fitness of each individual is
defined by the temperature error $\gamma e_j$ introduced above.
When a particle has a good fitness, most of the individuals tend to
fly to the same position, provoking a temperature increment and
a decrease of the error. For that reason, the particles are
reallocated to a more profitable place, and after some generations,
the population is distributed in such a way that a uniform
temperature is achieved for all sites.


70.3 Results


In order to compare how the three algorithms solve the dynamic
resource allocation problem, three experiments are performed using
the multizone temperature control grid described in Sect. 70.2.1.
In the first experiment, we seek the maximum uniform temperature
with a single population of B = 200 individuals. The second
experiment illustrates the response of the individuals when
a disturbance is applied in the fourth zone. Finally, in the third
experiment, multiple temperature set points are assigned to the
zones. Our results show the behavior of the three algorithms
applied in a resource allocation implementation under the same
circumstances.


Fig. 70.8a-d Average temperature (solid lines) and number of
chromosomes per zone (stem plots) using the genetic algorithm.
Dashed lines indicate the beginning and end of the disturbance

Fig. 70.9a-d Average temperature (solid lines) and number of
particles per zone (stem plots) using the PSO algorithm. Dashed
lines indicate the beginning and end of the disturbance

Fig. 70.10a-d Average temperature (solid lines) and number of bees
per zone (stem plots) using the honey bee foraging algorithm. The
dashed lines indicate the different set points for each zone

Fig. 70.11a-d Average temperature (solid lines) and number of
chromosomes per zone (stem plots) using the genetic algorithm. The
dashed lines show the set point for each zone


70.3.1 Experiment I: 
Maximum Uniform Temperature 


In this experiment, we want to achieve a maximum uniform
temperature for all zones. For this reason, we set the reference
temperatures to $T_j^d = 30$ °C, for all $j = 1, \ldots, 4$ (this
value cannot be achieved by the system). We have a population of
200 individuals and a PWM frequency of 70 Hz, where each individual
corresponds to a duty cycle of 0.5% (i.e., each individual
corresponds to 0.06 V). For example, 50 individuals in a zone are
equal to 25% of the duty cycle, which corresponds to 3 V.
Figures 70.4–70.6 show the temperature results and the number of
individuals allocated in each zone.
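The mapping from individuals to lamp drive is simple arithmetic: 200 individuals share a 100% duty cycle, and a 100% duty cycle corresponds to 12 V. A small sketch, with a hypothetical function name, reproduces the figures quoted above.

```python
SUPPLY_VOLTS = 12.0  # voltage at 100% duty cycle, from the hardware description
POPULATION = 200     # individuals shared by the four zones


def zone_drive(individuals_in_zone):
    """Return (duty cycle in percent, equivalent DC voltage) for one zone."""
    duty = 100.0 * individuals_in_zone / POPULATION
    return duty, SUPPLY_VOLTS * duty / 100.0
```

One individual gives a duty cycle of 0.5% and 0.06 V, and 50 individuals give 25% and 3 V, in agreement with the example in the text.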


70.3.2 Experiment II: Disturbance


This experiment is similar to the first one, but now we add
a controlled disturbance. An extra lamp is placed next to zone 4;
it is turned on after 4 min and turned off 2 min later. When we
apply this disturbance, the temperature in zone 4 increases
drastically and site 4 becomes the least profitable. Then, the
individuals in that site are reallocated, provoking a small
increase in the temperatures of the other three zones.
Figures 70.7–70.9 illustrate the behavior of the system under the
three algorithms.


70.3.3 Experiment III: Multiple Set Points


In this experiment we want to achieve multiple set points (26, 24,
27, and 25 °C, respectively, for each of the four zones), which are
lower than the temperatures achieved before. Figure 70.10 presents
the results obtained using the honey bee foraging algorithm. We can
observe that the set points are never achieved exactly, but the
temperatures get very close to them, and the steady states are
reached quickly. This is because the algorithm requires a positive
error $e_j > 0$ for each zone, which implies that the temperatures
are always below the set points.

On the other hand, with the GA and PSO, the behavior is very
different. When one of the set points is achieved, the resources
tend to be reallocated to the other zones. Therefore, when all the
zones reach their set points, all the fitness values (which are
proportional to the error values) are equal to 0, and the remaining
resources should not be allocated. However, for these two methods,
most of the resources need to be allocated to some site, and the
remaining resources stay in the last site, provoking a drastic
increase of the temperature in only that zone. When the temperature
in any other zone decreases, agents tend to be reallocated to that
new site very quickly, until the temperature increases again.
Figures 70.11 and 70.12 show this behavior for the GA and PSO,
respectively, with a fixed set point of 26 °C for all four zones.
The drastic changes produced by the unneeded resources can be
observed, and it can be seen that these methods are not suitable
for these kinds of problems.


Fig. 70.12a-d Average temperature (solid lines) and number of
particles per zone (stem plots) using the PSO. The dashed lines
show the set point for each zone


70.4 Discussion


The experiments performed with the bioinspired algorithms show
behaviors common to the three techniques, and some advantages and
disadvantages can be discussed. In Sect. 70.3.1, we saw how the
maximum uniform temperature can be achieved by the three methods
implemented. We obtained the average values and standard deviations
of the temperatures after ten experiments (Table 70.1). We observe
that the genetic algorithm achieves the highest average
temperatures, but its standard deviations are also the highest,
which means that the GA behavior is less steady. This is because
the GA is very susceptible to changes and, as soon as a zone
becomes more profitable (the temperature decreases), most
chromosomes tend to abandon their current positions and go to the
more profitable one. That is why the number of chromosomes changes
abruptly (Fig. 70.5), and the temperature variation is also high
for each one of the ten experiments. On the other hand, the other
algorithms react more slowly, and the number of individuals remains
almost constant (very low variations). Table 70.1 illustrates that
the honey bee social foraging algorithm has the lowest temperature
variation, which means that for every experiment, the maximum
temperature achieved for each zone is practically the same. This is
an advantage, because repeatability is a very important
characteristic in practical applications. Besides, the low
variations in the number of individuals (i.e., low voltage changes)
imply a low deterioration of the electric elements.


Table 70.1 Average temperatures (°C) and standard deviations for
the last 10 min, using each algorithm

Algorithm    Zone 1          Zone 2          Zone 3          Zone 4
Honey bee    27.28 ± 0.12    27.26 ± 0.12    [illegible]     27.18 ± 0.11
GA           29.78 ± 0.38    29.8 ± 0.3      29.75 ± 0.34    29.76 ± 0.36
PSO          27.84 ± 0.17    27.86 ± 0.18    28.55 ± 0.14    27.81 ± 0.16


Section 70.3.2 shows the results when a disturbance is applied to
the fourth zone. When this disturbance is turned on, the
temperature increases in that zone, and that site becomes less
profitable, provoking a reallocation of the individuals into the
other zones. Figures 70.7–70.9 show how the resources are
reallocated and the temperatures in the other sites increase. It
can be observed that during the disturbance, the number of
individuals in the fourth zone tends to 0 and, as soon as the
disturbance is turned off, the temperature in that zone decreases.
These results illustrate the robustness of the three techniques to
external disturbances, and we can also observe that regardless of
the type of disturbance, the maximum uniform temperature is
achieved.

Experiment 3 illustrates the behavior of the algorithms when low
temperature set points are considered. We observed that, with the
GA and PSO, most of the resources must be allocated to the sites,
and low temperature references could not be achieved. This is
because the individuals move from one place to another, looking for
the more profitable site, i.e., the one with the lowest
temperature. When all the set points are achieved, all
profitability values are 0, and the individuals remain in their
current positions. Hence, temperatures continue to rise, even if
the reference has been achieved. Figures 70.11 and 70.12 illustrate
this behavior for the GA and PSO when a reference temperature of
26 °C is set. However, the honey bee foraging algorithm can solve
this kind of problem because of its capability to allocate only the
necessary bees to the foraging places, so that the unneeded
resources are simply not used. This characteristic of our technique
may be very useful in practical applications, such as smart
building temperature control, where multiple temperature references
for different rooms need to be achieved.


70.5 Conclusions


In this chapter, a novel bioinspired method for dynamic resource
allocation based on the social behavior of honey bees during the
foraging process was presented, and an application that illustrates
the validity of the approach was studied. The application that we
used is a multizone temperature control grid, where the objective
is to achieve some reference temperatures for each one of the
zones, taking into account the complexity induced by the interzone
effects and external or internal noise. Some comparative analyses
have been developed with two evolutionary algorithms, i.e., GA and
PSO. We can see that the proposed method has some advantages
compared to other bioinspired methods due to its capability of
allocating only the necessary resources, and the low variability of
the number of individuals in each one of the four zones. Clearly,
there are other applications for the social foraging method for
resource allocation, for instance, in the areas of task allocation
of agents, formation control, economic dispatch, and smart building
temperature control.
71.1 Designing Swarm Behaviours


Imagine the following scenario: in a large area there 
are multiple items that must be reached, and possibly 
moved elsewhere or processed in some particular way. 
There is no map of the area to be searched, and the area 
is rather unknown, unstructured, and possibly danger- 
ous for the intervention of humans or any valuable asset. 
The items must be reached and processed as quickly as 
possible, as a timely intervention would correspond to 
a higher overall performance. This is the typical sce- 
nario to be tackled with swarm robotics. It contains all 
the properties and complexity issues that make a swarm 
robotics solution particularly appropriate. Parallelism, 
scalability, robustness, flexibility, and adaptability to 
unknown conditions are features that are required from 
a system confronted with such a scenario, and exactly 
those features are sought in swarm robotics research. 

Put in other terms, swarm robotics promises the so- 
lution of complex problems through robotic systems 
made up of multiple cooperating robots. With respect to 
other approaches in which multiple robots are exploited 
at the same time, swarm robotics emphasizes aspects 
like decentralization of control, limited individual abil- 
ities, lack of global knowledge, and scalability to large 
groups. 

One important aspect that characterizes a swarm 
robotics system concerns the robotic units, which are 
unable to solve the given problem individually. The 
limitation is given either by physical constraints that 
would prevent a single robot from tackling the problem
individually (e.g., the robot has to move some items
that are too heavy), or by time constraints that would 
make a solitary action very inefficient (e.g., there are 
too many items to be collected in a limited time). An- 
other source of limitation for the individual robot comes 
from its inability to acquire a global picture of the prob- 
lem, having only access to partial (local) information 
about the environment and about the collective activity. 
These limitations imply the need for cooperation to en- 
sure task achievement and better efficiency. Groups of 
autonomous cooperating robots can be exploited to syn- 
ergistically achieve a complex task, by joining forces 
and sharing information, and to distributedly undertake 
the given task and achieve higher efficiency through 
parallelism. 

The second important aspect in swarm robotics is 
redundancy in the system, which is intimately con- 
nected with robustness and scalability. Swarm robotics 
systems are made up of homogeneous robots (or of relatively
few heterogeneous groups of homogeneous


robots). This means that the failure of a single or a few 
robots is not a relevant fact for the system as a whole, 
because the failing robot can easily be replaced by 
another teammate. Differently from a centralized sys- 
tem, in a swarm robotics system there is no single 
point of failure, and every component is interchange- 
able with other components. Redundancy, distributed 
control, and local interactions also allow for scalabil- 
ity, enabling the robotic system to seamlessly adapt to 
varying group sizes. This is a significant advantage with 
respect to centralized systems, which would present 
an exponential increase in complexity for larger group 
sizes. 

Because all the above features are desiderata, the 
problem remains as to how to design and implement 
such a robotic system. The common starting point in 
swarm robotics is the biological metaphor, according to
which the fundamental mechanisms that govern the organization
of animal societies can be distilled into simple rules
to be implemented in the robotic swarm. This approach
allowed us to extract the basic working principles for 
many types of collective behavior, and several examples 
will be presented in this chapter. However, it is worth 
noting that swarm robotics systems are not constrained 
to mimicking nature. Indeed, in many cases there is 
no biological example to be taken as reference, or the 
mechanisms observed in the natural system are too dif- 
ficult to be implemented in the robotic swarm (e.g., 
odor perception is an open problem in robotics, prevent- 
ing easy exploitation of pheromone-based mechanisms 
by using real chemicals). Still, even in those systems 
that have no natural counterpart, the relevant property 
that should be present is self-organization, whereby
group behavior is the emergent result of the numer- 
ous interactions among different individuals. Thanks to 
self-organization, simple control rules repeatedly exe- 
cuted by the individual robots may result in complex 
group behavior. 

If we consider the scenario presented at the begin- 
ning of this chapter, it is possible to recognize a number 
of problems common to many swarm robotics systems, 
which need to be addressed in order to develop suitable 
controllers. One first problem in swarm robotics is hav- 
ing robots get together in some place, especially when 
the robotic system is composed of potentially many
individuals. Getting together (i.e., aggregation) is the
precondition for many types of collective behavior, and 
needs to be addressed according to the particular characteristics
of the robotic system and of the environment
in which it must take place. The aggregation problem
is discussed in Sect. 71.2. Once groups are formed, 
robots need some mechanism to stay together and to 
keep a coherent organization while performing their 
task. A typical problem is, therefore, how to maintain 
such coherence, which corresponds to ensuring the syn- 
chronization of the group activities (Sect. 71.3), and to 
keeping the group in coordinated motion when the swarm
must move across the environment (Sect. 71.4). An- 
other common problem in swarm robotics corresponds 
to searching together and processing some items in the 
environment. To this aim, different strategies can be 
adopted to cover the available space, and to identify 
relevant navigation routes without resorting to maps
and global knowledge (Sect. 71.5). Finally, to maintain
coherence and efficiency, the swarm robotics system
is often confronted with the necessity to behave as
a single whole. Therefore, it must be endowed with
collective perception and collective decision mechanisms.
Some examples are discussed in Sect. 71.6.
For each of these problems, we describe some seminal
work that produced solutions in a swarm robotics
context. In each section, we describe the problem along
with some possible variants, the biological inspiration
and the theoretical background, the relevant studies in
swarm robotics, and a number of other works that are
relevant for some particular contribution given to the
specific problem.


71.2 Getting Together: Aggregation


Aggregation is a task of fundamental importance in
many biological systems. It is the basic behavior for
the creation of functional groups of individuals, and
therefore, supports the emergence of various forms of
cooperation. Indeed, it can be considered a prerequisite
for the accomplishment of many collective tasks.
In swarm robotics too, aggregation has been widely
studied, either as a stand-alone problem or within
a broader context. Speaking in general terms, aggregation
is a collective behavior that leads a group of agents
to gather in some place. Therefore, from a (more or less)
uniform distribution of agents in the available space, the
system converges to a varied distribution, with the formation
of well recognizable aggregates. In other words,
during aggregation there is a transition from a homogeneous
to a heterogeneous distribution of agents.


71.2.1 Variants of Aggregation Behavior


Aggregation can be achieved in many different ways.
The main issue to be considered is whether or not the
environment contains pre-existing heterogeneities that
can be exploited for aggregation: light or humidity gradients
(think of flies or sow bugs), corners, shelters, and
so forth represent heterogeneities that can be easily exploited.
Their presence can, therefore, be at the basis
of a collective aggregation behavior, which, however,
may not exploit interactions between different agents.
Instead, whenever heterogeneities are not present (or
cannot be exploited for the aggregation behavior), the
problem is more complex. The agents must behave in
order to create the heterogeneities that support the formation
of aggregates. In this case, the basic mechanism
of aggregation relies on a self-organizing process based
on a positive feedback mechanism. Agents are sources 
of some small heterogeneity in the environment (e.g., 
being the source of some signal that can be chemical,
tactile or visual). The more agents are aggregated, the
stronger the signal and the higher the probability of
being attracted by it. This mechanism amplifies small
heterogeneities, leading to the formation of large aggregates.


71.2.2 Self-Organized Aggregation 
in Biological Systems 


Several biological systems present self-organized ag- 
gregation behavior. One of the best studied examples 
is given by the cellular slime mold Dictyostelium 
discoideum, in which aggregation is enabled by self- 
generated biochemical signals that support the migra- 
tion of cells and the formation of a multi-cellular 
body [71.1, 2]. A similar aggregation process can be ob- 
served in many other unicellular organisms [71.3]. So- 
cial and pre-social insects also present multiple forms of 
aggregation [71.4, 5]. In all these systems, it is possible 
to recognize two main variants of the aggregation pro- 
cess. On the one hand, the agents can emit a signal that 
creates an intensity gradient in the surrounding space. 
This gradient enables the aggregation process: agents 
react by moving in the direction of higher intensity, 
therefore aggregating with their neighbors (Fig. 71.1). 
On the other hand, aggregation may result from agents
modulating their stopping time in response to social
cues. Agents have a certain probability to stop and remain
still for some time. The vicinity to other agents
increases the probability of stopping and of remaining
within the aggregate, eventually producing an aggregation
process mediated by social influences (Fig. 71.2).

Fig. 71.1a-d Aggregation process based on a diffusing signal that creates an intensity gradient. (a) Agents individually
emit a signal and move in the direction of higher concentration. (b) The individual signals sum up to form a stronger
intensity gradient in correspondence with forming aggregates. (c) A positive feedback loop amplifies the aggregation
process until all agents are in the same cluster (d)

In both cases, the same general principle is at work.
Aggregation is dependent on two main probabilities: 
the probability to enter an aggregate, which increases 
with the aggregate size, and the probability to leave 
an aggregate, which decreases accordingly. This creates 
a positive feedback loop that makes larger aggregates 
more and more attractive with respect to small ones. 
Some randomness in the system helps in breaking the 
symmetry and reaching a stable configuration. 
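
The interplay of the two probabilities can be illustrated
with a minimal simulation sketch. The functional forms of
p_join and p_leave, the parameter values, and all names are
assumptions chosen for illustration; they are not taken from
the studies cited above.

```python
import random

def p_join(size, n_agents):
    # Assumed form: the larger the aggregate, the likelier a passer-by stops.
    return (1 + size) / (1 + n_agents)

def p_leave(size):
    # Assumed form: the larger the aggregate, the less likely a member leaves.
    return 0.5 / (size ** 2)

def simulate(n_agents=30, n_sites=3, steps=20000, seed=0):
    """Return (agents per aggregate, agents still moving) after `steps` events."""
    rng = random.Random(seed)
    sites = [0] * n_sites   # number of stopped agents in each aggregate
    free = n_agents         # agents still moving randomly
    for _ in range(steps):
        s = rng.randrange(n_sites)  # a moving agent meets a random aggregate site
        if free > 0 and rng.random() < p_join(sites[s], n_agents):
            sites[s] += 1
            free -= 1
        elif sites[s] > 0 and rng.random() < p_leave(sites[s]):
            sites[s] -= 1
            free += 1
    return sites, free

if __name__ == "__main__":
    print(simulate())
```

Because joining becomes more likely and leaving less likely as
an aggregate grows, early random fluctuations tend to be
amplified; the random draws play the symmetry-breaking role
mentioned above.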


71.2.3 Self-Organized Aggregation 
in Swarm Robotics 


On the basis of the studies of aggregation in biologi- 
cal systems, various robotic implementations have been 
presented, based on either of the two behavioral models
described above. Of particular interest is the work
presented in [71.6], in which the robotic system was de- 
veloped to accurately replicate the dynamics observed 
in the cockroach aggregation experiments presented 
in [71.5]. In this work, a group of Alice robots [71.7] 
was used and their controller was implemented by 
closely following the behavioral model derived from 
experiments with cockroaches. The behavioral model 
consists of four main conditions: 


i) Moving in the arena center
ii) Moving in the arena periphery
iii) Stopping in the center
iv) Stopping in the periphery.


Fig. 71.2a-d Aggregation process based on variable probability of stopping within an aggregate. (a) Agents move randomly
and may stop for some time (gray agent). (b) When encountering a stopped agent, other agents stop as well,
therefore increasing the size of the aggregate. (c) The probability of meeting an aggregate increases with the aggregate
size for geometric reasons. Social interactions modulate the probability of leaving the aggregate, which diminishes with
the increasing number of individuals. (d) Eventually, all agents are in the same aggregate

When stopping, the mean waiting time is influenced
by the number of perceived neighbors (for more details,
see [71.6]). The group behavior resulting from the
interaction among Alice robots was analyzed with the
same tools used for cockroaches [71.5, 6]. The compar- 
ison of the robotic system with the biological model 
shows a very good correspondence, demonstrating that 
the mechanisms identified by the behavioral model are 
sufficient to support aggregation in a group of robots, 
with dynamics that are comparable to those observed in
the biological system. Additionally, the robotic model 
constitutes a constructive proof that the identified mech- 
anisms really work as suggested. 

This study demonstrates the approach of distilling, in
terms of simple rules, the relevant mechanisms that
produce a given self-organizing behavior. A different
approach consists in exploiting artificial evolution to 
synthesize the controllers for the robotic swarm. This 
allows the user to simply define some performance met- 
ric for the group and let the evolutionary algorithm 
find the controllers capable of producing the desired 
behavior. This generic approach has been exploited to 
evolve various self-organizing behaviors, including aggregation
[71.8]. In this case, robots were rewarded
for minimizing their distance from the geometric center
of the group and for keeping in motion. The analysis of
the evolved behavior revealed that in all cases robots 
are attracted by teammates and repelled by obstacles. 
When a small aggregate forms, robots keep on moving 
thanks to the delicate balance between attractive and re- 
pulsive forces. This makes the aggregate continuously 
expand and shrink, moving slightly across the arena. 
This slow motion of the aggregate makes it possible to 
attract other robots or other aggregates formed in the 
vicinity, and results in a very good scalability of the
aggregation behavior with respect to the group size. This
experiment revealed a possible alternative mechanism 
for aggregation, which is not dependent on the proba- 
bility of joining or leaving an aggregate. In fact, robots 
here never quit an aggregate to which they are attracted. 
Rather, the aggregates themselves are dynamic struc- 
tures capable of moving within the environment, and in 
doing so they can be attracted by neighboring aggre- 
gates, until all robots belong to the same group. 


71.2.4 Other Studies 


The seminal papers described above are representative 
of other studies, which either exploit a probabilistic ap- 
proach [71.9, 10], or rely on artificial evolution [71.11]. 
Approaches grounded on mathematical models and 
control theory are also worth mentioning [71.12, 13]. 
Other variants of the aggregation behavior can be con- 
sidered. The aggregate may be characterized by an 
internal structure, that is, agents in the aggregate are 
distributed on a regular lattice or form a specific shape. 
In such cases, we talk about pattern/shape forma- 
tion [71.14]. Another possibility is given by the admis- 
sibility of multiple aggregates. In the studies mentioned 
so far, multiple aggregates may form at the beginning 
of the aggregation process, but as time goes by smaller 
aggregates are disbanded in favor of larger ones, eventu- 
ally leading to a single aggregate for the whole swarm. 
However, it could be desirable to obtain multiple ag- 
gregates forming functional groups of a specific size. In 
this case, it is necessary to devise mechanisms for con- 
trolling the group size [71.15]. 


71.3 Acting Together: Synchronization 


Synchronization is a common phenomenon observed 
both in the animate and inanimate world. In a syn- 
chronous system, the various components present 
a strong time coherence between the individual types of 
behavior. In robotics, synchronization can be exploited 
for the coordination of actions, in either a single-robot or
a multi-robot domain. In the latter case, synchronization
may be particularly useful to enhance the system effi- 
ciency and/or to reduce the interferences among robots. 


71.3.1 Variants of Synchronization Behavior 


Synchronization in a multi-agent system can take
two main forms: loose and tight. In the case of loose
synchronization, we observe a generic coordination in
time of the activities brought forth by different agents. 
In this case, single individuals do not present a periodic 
behavior, but as a group it is possible to observe bursts 
of synchronized activities. Often in this case there are 
external cues that influence synchrony, such as the day- 
light rhythm. On the other hand, it is possible to observe 
tight synchronization when the individual actions are 
perfectly coherent. To ensure tight synchronization in 
a group, it is possible to rely on either a centralized or 
a distributed approach. In the former, one agent acts as 
a reference (e.g., a conductor for the orchestra or the 
music theme for a ballet) and drives the behavior of the 
other system components. In the latter, a self-organizing
process is in place, and the system shows the ability
to synchronize without an externally-imposed rhythm. 
It is worth noting that tight synchronization does not 
necessitate individual periodic behavior, neither in the 
centralized nor in the self-organized case. For instance, 
synchronization has also been studied between cou- 
pled chaotic systems [71.16]. In the following, we focus 
on self-organized synchronization of periodic behavior, 
which is the most studied phenomenon as it is com- 
monly observed in many different systems. 


71.3.2 Self-Organized Synchronization 
in Biology 


Although synchronization has always been a well- 
known phenomenon [71.17], its study did not arouse 
much interest until the late 1960s, when Winfree be- 
gan investigating the mechanisms underlying biological 
rhythms [71.18]. He observed that many systems in 
biology present periodic oscillations, which can get 
entrained when there is some coupling between the 
oscillators. A mathematical description of this phe- 
nomenon was first introduced by Kuramoto [71.19], 
who developed a very influential model that was after- 
wards refined and applied to various domains [71.17]. 
Similar mechanisms are at the base of the syn- 
chronous signaling behavior observed in various animal 
species [71.3]. Chorusing is a term commonly used to 
refer to the coordinated emission of acoustic commu- 
nication signals by large groups of animals. To cite 
a few examples, chorusing has been observed in frogs, 
crickets, and spiders. However, probably the most fasci- 
nating synchronous display is the synchronous flashing 
of fireflies from South-East Asia. This phenomenon was
thoroughly studied until a self-organizing explanation
was proposed to account for the emergence of syn- 
chrony [71.20]. 
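
For reference, the Kuramoto model mentioned above is
commonly stated in the following form. The notation is the
standard one from the literature and is not defined
elsewhere in this chapter: \theta_i is the phase of
oscillator i, \omega_i its natural frequency, K the
coupling strength, and N the number of oscillators.

```latex
\frac{d\theta_i}{dt} = \omega_i + \frac{K}{N} \sum_{j=1}^{N} \sin(\theta_j - \theta_i),
\qquad i = 1, \dots, N
```

Each oscillator is pulled toward the phases of the others
with a strength proportional to K, which is the coupling
that allows entrainment.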

A rather simple model describes the behavior of 
fireflies as the interactions between pulse-coupled oscil- 
lators [71.21]. In Fig. 71.3, the activity of two oscilla- 
tors is represented as a function of time. Each oscillator 
is of the integrate-and-fire type, which well represents 
a biological oscillator such as the one of fireflies. The 
oscillator is characterized by a voltage-like variable that 
is integrated over time until a threshold is reached. At 
this point, a pulse is fired and the variable is reset to the 
base level (Fig. 71.3). Interactions between oscillators 
take the form of constant phase shifts induced by in- 
coming pulses, which bring other oscillators close to the 
firing state, or make them directly fire. These simple in- 
teractions are sufficient for synchronization; in a group 
of similarly pulse-coupled oscillators, constant adjust- 
ments of the phase made by all the individuals lead to 
a global synchronization of pulses (for a detailed de- 
scription of this model, see [71.21]). 
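
The mechanism described above can be sketched in a few
lines. This is a minimal illustration, not the model of
[71.21] itself: the leaky-integration charging law
(dx/dt = drive - leak * x), the parameter values, and the
function name step are assumptions made for the example.

```python
def step(states, dt=0.01, drive=1.2, leak=1.0, eps=0.05, threshold=1.0):
    """Advance all oscillators by one time step.

    Returns (new_states, indices_of_oscillators_that_fired).
    """
    # Integrate the voltage-like variable (assumed law: dx/dt = drive - leak * x).
    x = [s + dt * (drive - leak * s) for s in states]
    fired = set(i for i, s in enumerate(x) if s >= threshold)
    # Every pulse advances the other oscillators by the constant amount eps;
    # oscillators pushed over the threshold fire in turn.
    newly = set(fired)
    while newly:
        kick = eps * len(newly)
        newly = set()
        for i in range(len(x)):
            if i not in fired:
                x[i] += kick
                if x[i] >= threshold:
                    newly.add(i)
        fired |= newly
    # Firing oscillators are reset to the base level.
    for i in fired:
        x[i] = 0.0
    return x, sorted(fired)

if __name__ == "__main__":
    states = [0.1, 0.4, 0.7]
    for _ in range(5000):
        states, fired = step(states)
    print(states)
```

An oscillator that is close to the threshold when a pulse
arrives fires immediately and is reset together with the
sender; from then on the two evolve identically, which is how
groups of synchronized oscillators can grow.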


71.3.3 Self-Organized Synchronization 
in Swarm Robotics 


Fig. 71.3 Synchronization between pulse-coupled oscillators. The oscillator emits pulses each time its state variable reaches the threshold level (corresponding to 1 in the plot). When one oscillator emits a pulse, its state is reset while the state of the other oscillator is advanced by a constant amount, which corresponds to a phase shift, or to the oscillator firing if it overcomes the threshold. Horizontal axis: time (s)

The main purpose of synchronization in swarm robotics
is the coordination of the activities in a group. This can
be achieved in different ways, and mechanisms inspired
by the behavior of pulse-coupled oscillators have been
developed. In [71.15], synchronization is exploited to
regulate the size of traveling robotic aggregates. Robots
can emit a short sound signal (a chirp), and enter a refractory
state for a short time after signaling. Then,
robots enter an active state in which they may signal
at any time, on the basis of a constant probability per
time-step. Therefore, the chirping period is not constant 
and depends on the chirping probability. In this state, 
robots also listen to external signals and react by im- 
mediately emitting a chirp. This mechanism, similar to 
chorusing in frogs and crickets, leads to synchronized 
emission of signals. Thanks to this simple synchronization
mechanism, the size of an aggregate can be approximately
estimated. Given the probabilistic nature of chirping,
the probability that a given robot is the one that
independently initiates signaling depends on the number of
individuals in the group; estimating this probability by
listening to its own and others’ chirps allows a robot to
approximate the group size. Synchronization, therefore,
provides a mechanism to keep coherence in the group, which
is the precondition for group size estimation.
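
The estimation idea can be sketched as follows. This is
a simplified reading of the mechanism, not the algorithm
of [71.15]: it assumes that all robots share the same
chirping probability, that every chirping round has exactly
one initiator (simultaneous chirps are resolved at random),
and that a robot can tell whether it started the round
itself. The function name and parameters are hypothetical.

```python
import random

def estimate_group_size(n_robots, rounds=5000, p_chirp=0.05, seed=0):
    """Estimate the group size as seen by robot 0."""
    rng = random.Random(seed)
    initiated_by_me = 0  # rounds that robot 0 started on its own
    for _ in range(rounds):
        # Active state: at every time step each robot may chirp with
        # probability p_chirp; the first chirp triggers all the others.
        while True:
            chirpers = [i for i in range(n_robots) if rng.random() < p_chirp]
            if chirpers:
                break
        initiator = rng.choice(chirpers)  # assumption: ties broken at random
        if initiator == 0:
            initiated_by_me += 1
    # By symmetry, a robot initiates about 1/N of the rounds.
    return rounds / initiated_by_me if initiated_by_me else float("inf")

if __name__ == "__main__":
    print(estimate_group_size(10))
```

The estimate is the inverse of the fraction of rounds that the
observing robot initiated, so it becomes more reliable as more
synchronized rounds are observed.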

In [71.22], synchronization is instead necessary to 
reduce the interferences between robots, which pe- 
riodically perform foraging and homing movements 
in a cluttered environment. Without coordination, the 
physical interferences between robots going toward and 
away from the home location lead to a reduced overall 
performance. Therefore, a synchronization mechanism 
based on the firefly behavior was devised. Robots emit 
a signal in correspondence to the switch from foraging 
to homing. This signal can be perceived by neighboring 
robots within a limited radius and induces a reset of the 
internal rhythm that corresponds to a behavioral shift 
to homing. Despite the limited range of communica- 
tion among robots, a global synchronization is quickly 
achieved, which leads the group to reduce interferences 
and increase the system performance [71.22]. 

A different approach to the study of synchronization 
is described in [71.23]. Here, artificial evolution is ex- 
ploited to synthesize the behavior of a group of robots, 
with the objective of obtaining minimal communication 
strategies for synchronization. Robots were rewarded for
displaying an individual periodic movement and for signaling
in order to synchronize the individual oscillations. The
results obtained through artificial evolution are then analyzed
to understand the mechanisms that can support
synchronization, showing that two types of strategies
are evolved: one is based on a modulation of the oscil- 
lation frequency, the other relies on a phase reset. These 
two strategies are also observed in biological oscilla- 
tors: for instance, different species of fireflies present 
different synchronization mechanisms, based on de- 
layed or advanced phase responses [71.20]. 


71.3.4 Other Studies 


While self-organized synchronization is a well-known 
phenomenon, its application in collective and swarm 
robotics has not been widely exploited. The coupled-oscillator
synchronization mechanism was applied to
a cleaning task to be performed by a swarm of micro 
robots [71.24]. Another interesting implementation of 
the basic model can be found in [71.25]. Here, syn- 
chronization is exploited to detect and correct faults in 
a swarm robotics system. It is assumed that robots can 
synchronize a periodic flashing behavior while moving 
in the arena and accomplishing their task. If a robot 
incurs some fault, it will necessarily stop synchronizing.
This fault can be detected and corrected by neighboring
robots. Similar to the heartbeat in distributed com- 
puting, correct synchronization corresponds to a well- 
functioning system, while the lack of synchronization 
corresponds to a faulty condition. 

Finally, synchronization behavior may emerge 
spontaneously in an evolutionary robotics setup, even 
if it is not explicitly rewarded. In [71.26], synchronization
of group activities evolved spontaneously as
a result of the need to limit the interferences among
robots in a foraging task. In [71.27], robots were
rewarded for maximizing the mean mutual information
between their motor actions. Mutual information is 
a statistical measure derived in information theory, and 
roughly corresponds to the correlation between the out- 
put of two stochastic processes. Evolution, therefore, 
produced synchronous movements among the robots, 
which could actually maximize the mutual information 
while maintaining a varied behavior. 
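
For discrete signals, mutual information can be computed
directly from observed frequencies. The sketch below uses
the standard definition
I(X;Y) = sum over (x, y) of p(x,y) * log2( p(x,y) / (p(x) p(y)) );
the function name and the toy sequences are assumptions for
illustration and do not reproduce the setup of [71.27].

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Mutual information (in bits) between two equally long discrete sequences."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))  # joint counts
    px = Counter(xs)            # marginal counts of the first sequence
    py = Counter(ys)            # marginal counts of the second sequence
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y)
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi

if __name__ == "__main__":
    print(mutual_information([0, 1, 0, 1], [0, 1, 0, 1]))
    print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))
```

Two identical, varied action sequences give a high value,
while a constant sequence gives zero; this is why maximizing
the measure favors movements that are both synchronous and
varied.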


71.4 Staying Together: Coordinated Motion 


Another fundamental problem for a swarm is ensuring 
coherence in space. This means that the individuals in 
the swarm must display coordinated movement in order 
to maintain a consistent spatial structure. Coordinated 
motion is often observed in groups of animals. Flocks of
birds or schools of fish are fascinating examples of
self-organized behavior producing a collective motion of the
group. Similar problems need to be tackled in robotics,
for instance for moving in formation or for deciding
a common direction of motion in a distributed way.


71.4.1 Variants of the Coordinated Motion
Behavior


The coordinated motion of a group of agents can be 
achieved in different ways. Also in this case, we can 
distinguish mainly between a centralized and a dis- 
tributed approach. In a centralized approach, one agent 
can be considered the leader and the other agents fol- 
low (e.g., the mother duck with her ducklings). In the 
distributed approach, instead, there is no single leader 
and some coordination mechanism must be found to 
let the group move in a common direction. Of par- 
ticular interest for swarm robotics are the coordinated 
motion models based on self-organization. Such models 
consider multi-agent systems that are normally homo- 
geneous and characterized by a uniform distribution of 
information: no agent is more informed than the others, 
and there exists no a priori preference for any direction 
of motion (i.e., agents start being uniformly distributed
in space). However, through self-organization and am- 
plification of shared information, the system can break 
the symmetry and converge to a common direction 
of motion. A possible variant of the self-organized
coordinated motion consists in having a non-uniform
distribution of information, which corresponds to having
some agents that are more informed than the others
on a preferred direction of motion. In this case, a few
informed agents may influence the motion of the entire
group.

Fig. 71.4a-c Self-organized coordinated motion in a group of agents. In the bottom part, a group of agents is moving in roughly
the same direction. According to the model presented in [71.28], agents react to the closest neighbor within their perception range
and follow three main rules: (a) agents move toward a neighbor when it is too far; (b) agents move away from a neighbor when it
is too close; (c) agents rotate and align with a neighbor situated at intermediate distances. The iterated application of these rules leads
the group to move in the same direction


71.4.2 Coordinated Motion in Biology


Many animal species present coordinated motion be- 
havior, ranging from bacteria to fish and birds. Not all 
animal species employ the same mechanisms, but in 
general it is possible to recognize various types of in- 
teractions among individuals that have a bearing on the 
choice of the motion direction. Coordinated motion has 
mainly been studied in various species of fish, in birds, 
and in insect swarms [71.29, 30]. The most influential 
model was introduced by Huth and Wissel to describe
the observed behavior of various species of fish [71.28].
In this model, it is assumed that each fish is influenced 
solely by its nearest neighbor. Also, the movement of 
each fish is based on the same behavioral model, which 
also includes some inherent random fluctuation. Ac- 
cording to the proposed behavioral model, each fish 
follows essentially three rules: 


i) Approach a far away individual 
ii) Get away from individuals that are too close 
iii) Align with the neighbor direction (Fig. 71.4). 


When the nearest neighbor is within the closest re- 
gion, the fish reacts by moving away. When the nearest 
neighbor is in the farthest region, the fish reacts by 
approaching. Otherwise, if the neighbor is within the 
intermediate region, the fish reacts by aligning. These 
simple rules are sufficient to produce collective group 
motion, and the final direction emerges from the inter- 
actions among the individuals. 
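
The three rules can be sketched as a function of the
distance to the nearest neighbor. This is an illustration
under stated assumptions, not the model of [71.28]: the
zone radii, the two-dimensional setting, the omission of
the random fluctuation, and the names nearest and react are
all hypothetical.

```python
import math

def wrap(angle):
    # Map an angle to the interval (-pi, pi].
    return math.atan2(math.sin(angle), math.cos(angle))

def nearest(me, others):
    """others: list of ((x, y), heading); return the closest entry."""
    return min(others, key=lambda o: math.hypot(o[0][0] - me[0], o[0][1] - me[1]))

def react(me, heading, neighbor, neighbor_heading,
          r_repel=1.0, r_align=3.0, r_percept=6.0):
    """Return (rule, desired_heading) for a fish reacting to its nearest neighbor."""
    dx, dy = neighbor[0] - me[0], neighbor[1] - me[1]
    dist = math.hypot(dx, dy)
    bearing = math.atan2(dy, dx)
    if dist < r_repel:        # closest region: move away
        return "repel", wrap(bearing + math.pi)
    if dist <= r_align:       # intermediate region: align
        return "align", wrap(neighbor_heading)
    if dist <= r_percept:     # farthest region: approach
        return "approach", wrap(bearing)
    return "none", heading    # neighbor outside the perception range

if __name__ == "__main__":
    pos, hdg = nearest((0.0, 0.0), [((5.0, 0.0), 0.1), ((2.0, 0.0), 0.3)])
    print(react((0.0, 0.0), 0.0, pos, hdg))
```

Applying react to every fish at every time step, and moving
each fish along its desired heading, is the kind of iteration
from which a common direction can emerge.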

Starting from the above model, a number of variants 
have been proposed, which take into account differ- 
ent parameters and different numbers of individuals. 
In [71.31], a model including all individuals in the 
perceptual range was introduced, and a broad analysis 
of the parameters was performed, showing how minor 
differences at the individual level correspond to large 
differences at the group level. In [71.32], an experimen- 
tal study on bird flocks in the field was performed, and 
position and velocity data were obtained for each bird in 
a real flock through stereo-photography and 3-D mapping.
The data obtained were used to verify the
assumption about the number of individuals that each
bird monitors during flocking, showing that this number
is constant (and corresponds to about 7 individuals)
notwithstanding the varying density of the flock. Finally,
in [71.33], a model was developed in which some
of the group members have individual knowledge on 
a preferential direction. The model describes the out- 
come of a consensus decision in the flock as a result 
of the interaction between informed and uninformed 
individuals. 


71.4.3 Coordinated Motion 
in Swarm Robotics 


The models introduced for characterizing the self- 
organized behavior of fish schools or bird flocks have 
also inspired a number of interesting studies. The most 
influential work is definitely that of Reynolds, who de- 
veloped virtual creatures called boids [71.34]. In this 
work, each creature executes three simple types of be- 
havior: 


i) Collision avoidance, to avoid colliding with nearby
flockmates
ii) Velocity matching, to move in the same way as
nearby flockmates
iii) Flock centering, to stay close to nearby flockmates.
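
A minimal sketch of how the three types of behavior can be
combined into one steering command is given below. The
weighted sum, the weights, the minimum distance, and the
function name are assumptions chosen for simplicity; they
are not claimed to reproduce the implementation in [71.34].

```python
import math

def boid_steering(pos, vel, neighbors, min_dist=1.0,
                  w_avoid=1.5, w_match=1.0, w_center=1.0):
    """Return a 2-D steering vector for one boid.

    neighbors: list of ((x, y), (vx, vy)) tuples for nearby flockmates.
    """
    if not neighbors:
        return (0.0, 0.0)
    n = len(neighbors)
    # i) Collision avoidance: push away from flockmates closer than min_dist.
    avoid_x = avoid_y = 0.0
    for (px, py), _ in neighbors:
        dx, dy = pos[0] - px, pos[1] - py
        d = math.hypot(dx, dy)
        if 0.0 < d < min_dist:
            avoid_x += dx / d
            avoid_y += dy / d
    # ii) Velocity matching: steer toward the mean velocity of flockmates.
    match_x = sum(v[0] for _, v in neighbors) / n - vel[0]
    match_y = sum(v[1] for _, v in neighbors) / n - vel[1]
    # iii) Flock centering: steer toward the mean position of flockmates.
    center_x = sum(p[0] for p, _ in neighbors) / n - pos[0]
    center_y = sum(p[1] for p, _ in neighbors) / n - pos[1]
    return (w_avoid * avoid_x + w_match * match_x + w_center * center_x,
            w_avoid * avoid_y + w_match * match_y + w_center * center_y)

if __name__ == "__main__":
    print(boid_steering((0, 0), (1, 0), [((2, 0), (1, 0))]))
```

With a distant flockmate only the centering term is active;
when the flockmate is closer than min_dist, the avoidance term
dominates and the steering vector points away from it.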


Notice that the behavioral model corresponds to the 
models proposed in biological studies. The merit of this 
work is that it is the first implementation of the rules 
studied for real flocks in a virtual 3-D world, showing 
a close correspondence of the behavior of boids with 
that of flocks, herds, and schools. Reynolds’ research 
has been taken as inspiration by many other studies on 
coordinated motion, mainly in simulation. In [71.35], 
an implementation of the flocking behavioral model 
was proposed and tested on real robots. Robots use 
infrared proximity sensors to recognize the presence 
of other robots and their distance, which is necessary 
for collision avoidance and flock centering behavior. 
Additionally, a dedicated sensor to perceive the head- 
ing of neighbors was developed to support aligning 
behavior. This system, called the virtual heading sys- 
tem (VHS), is based on a digital compass and wireless 
communication. Despite the fact that a digital compass 
cannot reliably work in an indoor environment, it is 
assumed that neighboring robots have similar percep- 
tions. The heading perceived with respect to the local 
north is communicated over the wireless channel, and 
it is exploited for alignment behavior. This system allowed
testing the flocking behavior of small robotic
groups in a physical setting and studying the dynam- 
ics of flocking with up to 1000 simulated robots. This 
work was later extended in [71.36], by having a sub- 
group of informed individuals which could steer the 
whole flock, following the model presented in [71.33]. 
The dynamics of steered flocking have been studied by 
varying the percentage of informed robots in simula- 
tion, and tests with real robots have been performed as 
well. 


71.4.4 Other Studies 


As mentioned above, there exist numerous studies that 
were inspired by the schooling/flocking models. All 
these studies adopt some variants of the behavioral rules 
described above, or analyze the group dynamics under 
some particular perspective. A different approach to co- 
ordinated motion can be found in [71.37]. In this work, 
robots have to transport a heavy object and have imper- 
fect knowledge of the direction of motion. They can, 
however, negotiate the goal direction by displaying their 
own preferred direction of motion and by adjusting it 
on the basis of the direction displayed by others. On 
the whole, this mechanism implements similar dynam- 
ics to the alignment behavior of the classical flocking 
model. Here, however, robots are connected together to 
the object to be transported, adding a further constraint 
to the system that requires a good negotiation to allow
motion. A similar constraint characterizes the coordi- 
nated motion studies with physically assembled robots 
presented in [71.38, 39]. Here, robots form a physical 
structure of varying shape and can rotate their chas- 
sis in order to match the direction of motion of the 
other robots. In this case, there is no direct detection 
of the motion direction of neighbors. Instead, robots 
can sense the pulling and pushing forces that are ex- 
erted by the other connected robots through the physical 
connections. These pulling/pushing forces are naturally 
averaged by the force sensor, which returns their resul- 
tant. Artificial evolution was exploited to synthesize an 
artificial neural network that could transform the forces 
sensed to motor commands. The results obtained show 
the impressive capability of self-organized coordination 
between the robots, as well as scalability and generalization
to different sizes and shapes [71.38], and the
ability to cope with obstacles and to avoid falling out- 
side the borders of the arena [71.39]. 
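
The direction negotiation of [71.37] and the alignment rule of flocking both come down to repeatedly averaging headings. The following is a minimal Python sketch of that idea, not the controller used in the cited work: it assumes every robot can read the headings displayed by all the others, and the names `circular_mean`, `angle_diff`, `negotiate`, and the `gain` parameter are illustrative. Headings are averaged as unit vectors so that 350° and 10° average to 0° rather than 180°.

```python
import math

def circular_mean(angles):
    """Mean direction of a list of angles (radians), computed on unit vectors."""
    s = sum(math.sin(a) for a in angles)
    c = sum(math.cos(a) for a in angles)
    return math.atan2(s, c)

def angle_diff(a, b):
    """Signed smallest difference a - b, wrapped to (-pi, pi]."""
    return math.atan2(math.sin(a - b), math.cos(a - b))

def negotiate(headings, gain=0.5, steps=50):
    """Each robot repeatedly turns part of the way toward the mean of the
    headings displayed by the other robots (hypothetical update rule)."""
    h = list(headings)
    for _ in range(steps):
        new_h = []
        for i, own in enumerate(h):
            others = h[:i] + h[i + 1:]
            target = circular_mean(others)
            new_h.append(own + gain * angle_diff(target, own))
        h = new_h
    return h

# Three robots that initially prefer 350, 10, and 30 degrees.
final = negotiate([math.radians(d) for d in (350, 10, 30)])
spread = max(abs(angle_diff(a, final[0])) for a in final)
```

In this small example the residual `spread` shrinks toward zero and the agreed heading settles near 10°, the middle of the three initial preferences. With noisy or partial perception the agreed direction would drift, which is one reason the physical coupling described above matters.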


71.5 Searching Together: Collective Exploration


Exploring and searching the environment is an impor- 
tant behavior for robot swarms. In many tasks, the 
swarm must interact with the environment, sometimes 
only to monitor it, but sometimes also to process ma- 
terials or other kinds of resources. Usually, the swarm 
cannot completely perceive the environment, and the 
environment may also change during the operation of 
the robots. Hence, robots need to explore and search 
the environment to monitor for changes or in order to 
detect new resources. 

To cope with its partial perception of the environ- 
ment, a swarm can move, for instance using flocking, in 
order to explore new places (some locations may be un- 
available, though). Hence, most of the environment can 
be perceived, but not at the same time. As in many other 
artificial systems, a tradeoff between exploration and 
exploitation exists and requires careful design choices. 


71.5.1 Variants of Collective Exploration 
Behavior 


There is no perfect exploration and search strategy be- 
cause the structure of the environment in which the 
swarm is placed can take many different shapes. Strate- 
gies only perform more or less well as a function of 
the situation with which they are faced [71.40]. For
instance, the swarm could be in a maze, in an open
environment with few obstacles, or in an environment with
many obstacles.

We identified a restricted number of environmen- 
tal characteristics that play an important role in the 
choice of searching behavior in swarm robotics. These 
characteristics are commonly found in swarm robotics
scenarios: the presence of a central place, the
size of the environment, and the presence of obstacles.

The central place is a specific location to which robots
must return regularly, for instance for maintenance
or to deposit foraged items. A scenario that involves 
a central place requires a swarm able to either remem- 
ber or keep track of that location. 

If the environment is closed (finite area) and not too 
large, the swarm may use random motion to explore, 
with fair chances to rapidly locate resources (or even 
the central place). In an open environment, robots can 
get lost very quickly. In this type of environment, it is 
necessary to use a behavior that allows robots to stay 
together and maintain connectivity. 

Obstacles are environmental elements that constrain
the motion of the swarm. If the configuration of the
obstacles is known in advance, the swarm can move in
the environment following appropriate patterns. In most 
cases, however, obstacles are unexpected or might be 
dynamic and may prevent the swarm from exploring 
parts of the environment. 


71.5.2 Collective Exploration in Biology 


In nature, animals are constantly looking for resources 
such as food, sexual partners, or nesting sites. Animals 
living in groups may use several types of behavior to 
explore their environment and locate these resources. 

For instance, fish can take advantage of the number 
of individuals in a shoal to improve their capabilities to 
find food [71.41-43]. To do so, they move while maintaining
large distances between individuals. In this way,
fish increase their perceptual coverage as well as their 
chances to find new resources. 

Animals also heavily rely on random motion to 
explore their environment [71.44-46]. Usually the exploratory
pattern is not fully random (that is, isotropic),
because animals use all possible environmental cues at 
hand to guide themselves. Random motion can be bi- 
ased towards a given direction, or it can be constrained 
in a specific area, for instance around a previously 
memorized location [71.47]. Some desert ants achieve
high localization performance by combining odometry (counting
their footsteps) with cues from gravity and the polarization
of natural light. They may move randomly to look
for resources but they are able to quickly return to their 
nest and also to return to an interesting location previ- 
ously identified. 


71.5.3 Collective Exploration 
in Swarm Robotics 


One of the most common exploration strategies used 
in robotics is random exploration. In a typical imple- 
mentation, robots wander in the environment until they 
perceive a feature of interest [71.48-50]. By doing this,
robots may lose contact with each other and, therefore,
their ability to work together. Hence this strategy
is not suited for large or open environments. Due to the 
stochastic nature of the strategy, its performance can 
only be evaluated statistically. On average, the time to
locate a feature is proportional to the square of the
distance between the feature and the robots [71.44].
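
The quadratic scaling can be checked on the simplest possible case. The sketch below is an illustration only, not the model of [71.44]: it assumes a symmetric one-dimensional random walk on a lattice and computes the exact expected number of steps needed to first reach a point at distance d on either side of the start. The function name `expected_exit_time` is hypothetical.

```python
from fractions import Fraction

def expected_exit_time(d):
    """Exact expected number of steps for a symmetric 1-D random walk that
    starts at 0 to first reach +d or -d.

    Solves h(x) = 1 + (h(x-1) + h(x+1)) / 2 with h(-d) = h(d) = 0
    using the Thomas algorithm for tridiagonal systems.
    """
    n = 2 * d - 1                      # interior positions -d+1 .. d-1
    half = Fraction(1, 2)
    c_prime = [Fraction(0)] * n
    d_prime = [Fraction(0)] * n
    c_prime[0] = -half
    d_prime[0] = Fraction(1)
    # Forward sweep: rows are  -1/2 h[i-1] + h[i] - 1/2 h[i+1] = 1
    for i in range(1, n):
        denom = 1 + half * c_prime[i - 1]
        c_prime[i] = -half / denom
        d_prime[i] = (1 + half * d_prime[i - 1]) / denom
    # Back substitution
    h = [Fraction(0)] * n
    h[n - 1] = d_prime[n - 1]
    for i in range(n - 2, -1, -1):
        h[i] = d_prime[i] - c_prime[i] * h[i + 1]
    return h[d - 1]                    # index of position 0

times = {d: expected_exit_time(d) for d in (1, 2, 4, 8)}
```

For this walk the expected time equals d squared, so doubling the distance to the feature quadruples the expected search time.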

Systematic exploration strategies are very different.
Robots use some a priori knowledge about the structure
of the environment in order to methodically sweep it
and find features. To ensure that robots do not repeatedly
cover the same places, they may need to memorize
which places have already been explored. This is often
implemented with localization techniques and mapping
of the environment [71.51]. The advantage of this technique
is that a feature, if present, will be found with certainty, and
the exploration time has lower and upper bounds if
the environment is not open. However, memory requirements
may be excessive, and the strategy is not suited
for open environments.

Fig. 71.5a-d Gas expansion behavior to monitor the surroundings of a central place. (a) The swarm starts aggregated
around a central place (represented by a black spot). (b,c) Robots try to move as far as possible from their neighbors,
while maintaining some visual or radio connection. (d) As a result, the whole swarm expands in the environment, like
a gas, covering part of the environment

Between the two extreme strategies described above
lie a number of more specialized
strategies that present advantages and drawbacks de- 
pending on the structure of the environment and the 
distribution of the resources. 


Collective motion (which has already been detailed 
in Sect. 71.4) allows swarms to maintain their cohesion 
while moving through the environment. Flocking be- 
havior can be employed in an open environment with 
a limited risk of losing contact between robots. The 
swarm behaves like a sort of physical mesh that covers 
part of the environment; to maximize the area covered
during exploration, robots can increase the distance
between them as much as possible.

Gas expansion behavior (Fig. 71.5) allows robots to
quickly and exhaustively explore the surroundings of
a central place [71.52-55]. While one or several robots
keep track of a central place, other robots try to move
as far as possible from their neighbors, while still maintaining
direct line of sight with at least one of them.
The swarm behaves like a fluid or gas that penetrates
the asperities of the environment. The exploration is
very effective and any change or new resource within
the perception range of the swarm is immediately perceived.
However, since robots are bound to the central
place, the area that they can explore is limited by the
number of robots in the swarm. If robots do not stick to
a central place, the resulting behavior shifts to a type of
flocking or moving formation.

Fig. 71.6a-d Chaining behavior in action with a central place represented by a large black dot, bottom left. (a) Robots
start aggregated around the central place. (b) While maintaining visual or radio contact with neighbors, some robots
change role and become part of a chain (grayed out). (c) Other robots move around the central place and encounter the
early chain of robots. With some probability, they also turn into new parts of the chain. (d) At the end of the iterative
process, robots form a long chain that spans through the environment and maintains a physical link to the central place

With chaining behavior, swarms can form a chain
with one end that sticks to a central place and the
other end that freely moves through the environment
(Fig. 71.6). In [71.56], minimalistic behavior
produces a static chain, but different types of chain 
motions can be imagined. In [71.57], for instance, 
chains can build up, move, and disaggregate until 
a resource is found. Contrary to gas expansion behav- 
ior, a chaining swarm may not immediately perceive 
changes in the environment because it has to constantly
sweep the space. Chaining allows robots to
cover a larger area than gas expansion behavior,
ideally a disc of radius proportional to the number
of robots.
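
A back-of-the-envelope comparison makes the difference in reach concrete. The sketch below is an idealization and rests on assumptions that are not in the text: robots keep a fixed `spacing` from their neighbors, a gas-like swarm fills a disc around the central place with one robot per small disc of diameter `spacing`, and a chain is perfectly straight. Both function names are hypothetical.

```python
import math

def gas_expansion_radius(n_robots, spacing):
    """Radius of the disc covered when n_robots spread out around the
    central place, each one occupying roughly a disc of diameter `spacing`
    (idealized: the covered area grows linearly with n_robots)."""
    area = n_robots * math.pi * (spacing / 2) ** 2
    return math.sqrt(area / math.pi)

def chain_reach(n_robots, spacing):
    """Distance from the central place to the tip of a straight chain of
    n_robots separated by `spacing` (grows linearly with n_robots)."""
    return n_robots * spacing

gas = gas_expansion_radius(100, 1.0)   # radius grows like sqrt(N)
chain = chain_reach(100, 1.0)          # radius grows like N
```

Under these assumptions, 100 robots reach a radius of about 5 spacings with gas expansion and 100 spacings with a chain. The chain pays for this reach by observing any given location only when it sweeps past it.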


71.6 Deciding Together: Collective Decision Making 


Decision making is a behavior used by any artificial 
system that must produce an adapted response when 
facing new or unexpected situations. Because the best 
action depends on the situation encountered, a swarm 
cannot rely on a pre-programmed and systematic re- 
action. Monolithic artificial systems make decisions all 
the time, by gathering information and then evaluating 
the different options at hand. However, when it comes to 
swarms, each group member might have its own opin- 
ion about the correct decision. If all individuals perceive 
the same information and process it in the same way, 
then they might independently make the same decision. 
However, in practice, the more common case is that in- 
dividuals perceive partial and noisy information about 
the situation. Thus, if no coordination among group 
members occurs, a segregation based on differing opin- 
ions might take place, thereby removing the advantages 
of being a swarm. Therefore, the challenge is to have 
the whole group collaborate to make a collective deci- 
sion and take action accordingly. 


71.6.1 Variants of Collective Decision Making 
Behavior 


Three main mechanisms reported in the
literature allow swarms to make collective decisions.
The first and simplest mechanism is based
on opinion propagation. As soon as a group member 
has enough information about a situation to make up its 
mind, it propagates its opinion through the whole group. 

The second mechanism is based on opinion averaging.
All individuals constantly share their opinion with
their neighbors and adjust their own opinion accordingly.
This iterative process leads to the emergence
of a collective decision. The adjustment of the opinion
is typically achieved with an averaging function, especially
if opinions concern quantitative values such as
a location, a distance, or a weight.
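
Opinion averaging can be sketched in a few lines. The example below assumes a fixed ring topology in which every robot hears exactly its two neighbors and all robots update in lockstep; both are simplifying assumptions, and the function names are illustrative.

```python
def average_step(opinions):
    """One synchronous round on a ring: every robot replaces its opinion by
    the mean of its own value and those of its two ring neighbors."""
    n = len(opinions)
    return [
        (opinions[(i - 1) % n] + opinions[i] + opinions[(i + 1) % n]) / 3.0
        for i in range(n)
    ]

def run_averaging(opinions, rounds):
    """Apply `rounds` averaging rounds and return the final opinions."""
    for _ in range(rounds):
        opinions = average_step(opinions)
    return opinions

initial = [2.0, 9.0, 4.0, 7.0, 1.0, 7.0]   # e.g., noisy distance estimates
final = run_averaging(initial, rounds=100)
```

Because every robot has the same number of neighbors, each round preserves the sum of the opinions, so the swarm converges to the mean of the initial values (5.0 here) although no robot ever sees more than two others.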

The third and last mechanism relies on amplifi- 
cation to produce a collective decision. In a nutshell, 
all individuals start with an opinion, and may decide 
to change their opinion to another one. The switch to 
a new opinion happens with a probability calculated on 
the basis of the frequency of this opinion in the swarm. 
Practically, this means that the more often an opinion is
represented in the group, the more likely it is to be
adopted by an individual, which is why the term amplification
is used.
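
The switching rule can be written down directly. The sketch below is one possible reading of the mechanism, under assumptions of our own: robots update one at a time, the updating robot considers a single alternative opinion chosen at random, and it knows the true frequency of that opinion in the swarm (a real robot would estimate it from local encounters). All names are hypothetical.

```python
import random

def switch_probability(opinions, candidate):
    """Probability of adopting `candidate`: its frequency in the swarm."""
    return opinions.count(candidate) / len(opinions)

def amplification_step(opinions, rng):
    """One robot, picked at random, considers one randomly chosen
    alternative opinion and adopts it with a probability equal to that
    opinion's current frequency."""
    i = rng.randrange(len(opinions))
    alternatives = sorted(set(opinions) - {opinions[i]})
    if not alternatives:               # consensus reached: nothing to adopt
        return opinions
    candidate = rng.choice(alternatives)
    if rng.random() < switch_probability(opinions, candidate):
        opinions = opinions[:i] + [candidate] + opinions[i + 1:]
    return opinions

rng = random.Random(0)                 # fixed seed for repeatability
swarm = ["A"] * 6 + ["B"] * 4
for _ in range(2000):
    swarm = amplification_step(swarm, rng)
```

Consensus is an absorbing state of this process: once all robots share an opinion, no alternative remains to be adopted. Which opinion wins depends on the initial frequencies and on random fluctuations.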

Each of the three aforementioned mechanisms has 
some advantages over the others and may be preferred, 
depending on the situation faced. The factors that play 
an important role in collective decision processes in- 
clude the speed and the accuracy needed to make the 
choice, the robustness of communication, and the relia- 
bility of individual information. 

In terms of speed, opinion propagation allows 
fast collective decisions, in contrast with the two 
other mechanisms, which require numerous interac- 
tions among individuals. However, this speed generally 
comes at the cost of robustness or accuracy [71.58-60].
If communication is not robust enough, messages can 
be corrupted. The mechanism of opinion propagation 
is particularly sensitive to such effects, and a wrong or 
random collective decision might be made by a swarm 
in that case. 

The averaging mechanism produces a more
robust decision because wrong information from erroneous
messages is diluted in the larger amount of information
present in the swarm [71.61]. However, opinion
averaging works best if all individuals have roughly 
identical knowledge. If a small proportion of individuals
have excellent knowledge to make the decision,
while the remaining individuals have poor information, 
opinion propagation may produce better results than 
opinion averaging [71.33]. 

Lastly, the amplification mechanism is the main 
choice for a gradually emerging collective decision if 
opinions cannot be merged with some averaging func- 
tion. Instead of adjusting opinions, individuals simply 
adopt new opinions with some probability. It is worth 
noting that this mechanism can produce good decisions 
even if individuals have poor information. 


71.6.2 Collective Decision Making in Biology 


The powerful possibilities of decision making in groups
were already suggested by Galton back in 1907 [71.62].
In that paper, Galton reports the results of a weight- 
judging competition in which competitors had to esti- 
mate the weight of a fat ox. With slightly less than 800 
independent estimates, Galton observed that the aver- 
age estimate was accurate to 1% of the real weight of 
the ox. This early observation opened interesting per- 
spectives about the accuracy of collective estimations, 
but it did not describe a collective decision mechanism, 
since Galton himself had to gather the estimates and 
apply some calculation to evaluate the estimate of the 
crowd. 

More recent studies about group navigation have 
shown that groups of animals cohesively moving to- 
gether towards a goal direction reach their objective 
faster than independent individuals [71.63, 64]. The
mechanism of collective navigation not only allows the 
individuals to move and stay together, but it also acts 
as a distributed averaging function that locally fuses the 
opinions of individuals about the direction of motion, 
allowing them to improve their navigation performance. 

In recent decades, the amplification mechanism
has been identified as a source of collective decision in
a broad range of animal species such as ants [71.65,
66], honeybees [71.67, 68], spiders [71.69], cockroaches
[71.70], monkeys [71.71], and sheep [71.72].

Ants that choose one route to a resource probably
constitute the best-known example of the amplification
mechanism. In [71.66], an ant colony is offered
two paths to two identical resource sites. Initially, the 
two resources are exploited equally, but after a short 
time ants focus on a single resource. This collective 
choice happens because ants that have found the re- 
source come back to the nest, marking the ground with 
a pheromone trail. The next ants that try to reach the
resource are sensitive to this odor and have higher
chances of following the path with higher pheromone 
concentration. As a result of this amplified response, 
a collective decision rapidly emerges. In addition, it
was shown in [71.73] that when ants are presented with two
paths of different lengths to the same resource, the same
pheromone-based mechanism allows them to choose
the shorter path. This can be explained by the fact
that ants using the shorter path need less time to make
round trips, making the pheromone concentration on
this path grow faster.
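
The round-trip argument can be illustrated with a deterministic mean-field sketch. It is not the model of [71.73]: the choice function has the form commonly used to describe such binary-bridge experiments, the constants `k` and `n` are illustrative values, evaporation is ignored, and the deposit on a path is simply taken as inversely proportional to its length.

```python
def choice_probability(ph_a, ph_b, k=20.0, n=2.0):
    """Probability of taking path A given the pheromone on both paths."""
    wa = (k + ph_a) ** n
    wb = (k + ph_b) ** n
    return wa / (wa + wb)

def simulate_bridge(len_short, len_long, steps=500, ants_per_step=1.0):
    """Mean-field pheromone dynamics: per time step, the pheromone laid on a
    path is the share of ants choosing it divided by its length (shorter
    round trips mean more deposits per unit time)."""
    ph_short = ph_long = 0.0
    for _ in range(steps):
        p_short = choice_probability(ph_short, ph_long)
        ph_short += ants_per_step * p_short / len_short
        ph_long += ants_per_step * (1.0 - p_short) / len_long
    return ph_short, ph_long, choice_probability(ph_short, ph_long)

ph_short, ph_long, p_short = simulate_bridge(len_short=1.0, len_long=2.0)
```

Both paths start out equally attractive, but the shorter one collects pheromone faster from the first step on, and the nonlinear choice function amplifies that initial advantage.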

Quorum sensing is a special case of the amplification
mechanism which has been notably used to
explain nest site selection in ants and bees [71.74-76].
The most basic example of quorum sensing uses
a threshold to dictate if individuals should change their 
opinion. If an individual perceives enough neighbors 
(above the threshold) that already share the opposite 
opinion, then it will in turn adopt this opinion. It has 
been shown that this threshold makes quorum sensing 
more robust to the propagation of erroneous infor- 
mation during the decision process. In addition, the 
accuracy of collective decisions made with quorum 
sensing may improve with group size, and cognitive ca- 
pabilities of groups may outperform the ones of single 
individuals [71.77, 78]. In the case of nest site selection, 
cohesion is mandatory for the group. A cross inhibition 
mechanism complementing amplification was identi- 
fied as a key feature to ensure that groups do not 
split [71.79]. 
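
The basic threshold rule is short enough to state as code. The sketch below is an assumption-laden illustration: the function name is hypothetical, "enough neighbors" is read as "at least `threshold`", and ties between competing opinions are broken alphabetically.

```python
def quorum_update(own_opinion, neighbor_opinions, threshold):
    """Return the robot's opinion after one quorum check.

    The robot switches only if at least `threshold` neighbors hold one and
    the same opinion different from its own; otherwise it keeps its opinion.
    """
    counts = {}
    for op in neighbor_opinions:
        if op != own_opinion:
            counts[op] = counts.get(op, 0) + 1
    if not counts:
        return own_opinion
    # Most frequent differing opinion (alphabetical order breaks ties).
    best = max(sorted(counts), key=lambda op: counts[op])
    return best if counts[best] >= threshold else own_opinion

switched = quorum_update("A", ["B", "B", "B", "A"], threshold=3)   # "B"
kept = quorum_update("A", ["B", "A", "A"], threshold=3)            # "A"
```

A single erroneous neighbor cannot flip the robot on its own, which is the robustness property attributed to the threshold above.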


71.6.3 Collective Decision Making 
in Swarm Robotics 


In swarm robotics, opinion averaging has been used 
to improve the localization capabilities of robots. 
In [71.50], a swarm of robots carries out a foraging task 
between a central place and a resource site. The robots 
have to navigate back and forth between the two places 
and use odometry to estimate their location. As odom- 
etry provides noisy estimates, robots using solely this 
technique may quickly get lost. Here, robots can share 
and merge their localization opinions when they meet, 
by means of local infrared communication. By doing 
so, robots manage better localization and improve their 
performance in the foraging task. Moreover, robots as- 
sociate a confidence level to their estimates, which is 
used to decide how information is merged. If a robot 
advertises an opinion with very high confidence, then 
the mechanism produces opinion propagation. Hence
the two mechanisms of averaging and propagation are
blended in a single behavior, and the balance between
them is tuned by the user with a control parameter.

Fig. 71.7a-d A swarm of robots is presented with two resource sites in its environment and must collectively choose
one. (a) Initially, robots are randomly scattered. (b) Using a random walk, they move until a resource site is found. On
average, the swarm is split in the two sites. (c) The more neighbors they perceive, the longer the robots stay. A competition
between the two sites takes place and any random event may change the situation. Here a robot just left the top right site,
further reducing the chances that other robots stay there. (d) The swarm has made a choice in favor of the bottom left
site. The choice is stable, although some robots may frequently leave the site for exploration
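
How confidence can blend averaging and propagation is easy to see in a confidence-weighted merge. The sketch below is not the algorithm of [71.50]; it assumes positions are (x, y) tuples and confidences are non-negative numbers, and the function name and the rule for the merged confidence are placeholders.

```python
def merge_estimates(own, own_conf, other, other_conf):
    """Fuse two (x, y) position estimates, weighting each by its confidence.

    Returns the merged estimate and a merged confidence (here simply the
    larger of the two, a placeholder choice).
    """
    total = own_conf + other_conf
    if total == 0:
        return own, own_conf
    w = other_conf / total             # weight given to the other robot
    merged = tuple(o + w * (t - o) for o, t in zip(own, other))
    return merged, max(own_conf, other_conf)

# Equal confidence: plain averaging.
avg, _ = merge_estimates((0.0, 0.0), 1.0, (2.0, 4.0), 1.0)
# Overwhelming confidence on one side: effectively propagation.
prop, _ = merge_estimates((0.0, 0.0), 1.0, (2.0, 4.0), 1e9)
```

With equal confidences the result is the midpoint of the two estimates; as one confidence dominates, the merged estimate approaches that robot's opinion, which is the propagation limit described above.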

The aggregation behavior previously mentioned in
Sect. 71.2.3 can be exploited to trigger collective decision
making in situations where there are several environmental
heterogeneities. In [71.80], the robots are
presented with two shelters and they choose one of them as
a resting site by aggregating there. The behavior of the
robots closely follows the one observed in cockroaches 
(Fig. 71.7). In [71.81], both robots and cockroaches 
are introduced in an arena with two shelters, demon- 
strating the influence of the two groups on each other 
when making the collective decision. The collective 
decision is the result of an amplification mechanism, 
implemented via the probability of a robot leaving an 
aggregate. This probability diminishes with the number 
of perceived neighbors, allowing larger aggregates to 
attract more robots. 
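
The leaving rule can be sketched as follows. The functional form and the parameters `p_alone` and `sensitivity` are hypothetical and serve only to illustrate a probability that decreases with the number of perceived neighbors; they are not the values fitted in [71.80] or [71.81].

```python
def leave_probability(n_neighbors, p_alone=0.5, sensitivity=1.0):
    """Per-time-step probability that a robot leaves an aggregate.
    Decreases as the number of perceived neighbors grows (hypothetical form)."""
    return p_alone / (1.0 + sensitivity * n_neighbors) ** 2

def expected_stay(n_neighbors):
    """Mean number of time steps a robot stays (geometric distribution)."""
    return 1.0 / leave_probability(n_neighbors)

stay_small = expected_stay(1)   # robot with one neighbor
stay_large = expected_stay(5)   # robot in a larger aggregate
```

Because robots stay longer in the larger aggregate, it tends to keep growing while the smaller one loses members, which is the amplification at work.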


71.6.4 Other Studies 


The opinion averaging mechanism was investigated in depth
with a general mathematical approach in [71.82,
83]. These studies demonstrate convergence of the
mechanism and emphasize the importance of the topology
of the communication network through which interactions
take place.

Another amplification mechanism inspired by the
behavior of honeybees was implemented in [71.84].
With this mechanism, it was shown that robots are able
to make a collective decision and reliably choose, between
two sites, the one offering the best illumination
conditions.

The amplification mechanism based on pheromone
trails, which is used by ants, has also inspired
several swarm robotics studies. In [71.85, 86], the
pheromone is replaced by light cast by a video projector.
This implementation is limited to laboratory studies,
but it allowed demonstrating path selection with
robot swarms. In [71.87], the process is abstracted
inside a network of robots that are deployed in
the environment. Virtual ants hop from robot to
robot and deposit pheromone inside them. Eventually,
the shortest path to a resource is marked
out by robots with high and sustained levels of
pheromone.


71.7 Conclusions


In this chapter, we have presented a broad overview
of the common problems faced in a swarm robotics
context, and we pointed to possible approaches to
obtain solutions based on a self-organizing process.
We have discussed aggregation, synchronization, coordinated
motion, collective exploration, and decision
making, and we argued that many application scenarios
could be solved by a mix of the above solutions. So,
are we done with swarm robotics research? Definitely
not.

First of all, the fact that possible solutions exist does
not mean that they are the most suitable for any possible 
application scenario. Hardware constraints, miniatur- 
ization, environmental contingencies, and performance 
issues may require the design of different solutions, 
which may strongly depart from the examples given 
above. Still, the approaches we presented constitute 
a logical starting point, as well as a valid benchmark 
against which novel approaches can be compared. 

Another important research direction consists in 
characterizing the self-organizing behavior we pre- 
sented in terms of abstract properties, such as the time 
of convergence toward a stable state, sensitivity to pa- 
rameter changes, robustness to failures, and so forth. 
From this perspective, the main problem is to ensure 
a certain functionality of the system with respect to 
the needs of the application and to predict the sys- 
tem features before actual development and testing. In 
many cases, a precise characterization of the system is 
not possible, and only a statistical description can be 
achieved. Still, such an enterprise would bring swarm 
robotics closer to an engineering practice, eventually 
allowing us to guarantee a certain performance of the 
developed system, as well as other properties that engi- 
neering commonly deals with. 

The examples we presented all refer to homoge- 
neous systems, in which all individuals are physically 


72. Collective Manipulation and Construction


Lynne Parker 


Many practical applications can make use of robot
collectives that can manipulate objects and construct
structures. Examples include applications in
warehousing, truck loading and unloading, transporting
large objects in industrial environments,
and assembly of large-scale structures. Creating
such systems, however, can be challenging. When
collective robots work together to manipulate
physical objects in the environment, their interactions
necessarily become more tightly coupled.
This need for tight coupling can lead to important
control challenges, since actions by some robots
can directly interfere with those of other robots.
This chapter explores techniques that have been
developed to enable robot swarms to effectively
manipulate and construct objects in the environment.
The focus in this chapter is on decentralized
manipulation and construction techniques that
would likely scale to large robot swarms (at least 10
robots), rather than approaches aimed primarily
at smaller teams that attempt the same objectives.
This chapter first discusses the swarm task of object
transportation; in this domain, the objective is for
robots to collectively move objects through the
environment to a goal destination. The chapter
then discusses object clustering and sorting, which
requires objects in the environment to be aggregated
at one or more locations in the environment.
The final task discussed is that of collective construction
and wall building, in which robots work
together to build a prespecified structure. While
these different tasks vary in their specific objectives
for collective manipulation, they also have
several commonalities. This chapter explores the
state of the art in this area.


72.1 Object Transportation


Some of the earliest work in swarm robotics was aimed
at the object transportation task [72.1-6], which requires
a swarm of robots to move an object from its
current position in the environment to some goal destination.
The primary benefit of using collective robots
for this task is that the individual robots can combine
forces to move objects that are too heavy for individual
robots working alone or in small teams. However,
the task is not without its challenges; it is nontrivial to
design decentralized robot control algorithms that can
effectively coordinate robot team members during object
transportation. A further complication is that the
interaction dynamics of the robots with the object can
be sensitive to certain object geometries [72.7, 8] and
object rotations during transportation [72.8], thus exacerbating
the control problem.

There are many ways to compare and contrast 
alternative distributed techniques to collective object 
transport. The most common distinctions are: 




• Local knowledge only versus some required global
knowledge (e.g., of team size, state, position).

• Homogeneous swarms versus heterogeneous
swarms (e.g., teams with leaders and followers).

• Manual controller design versus autonomously
learned control.

• 2-D (two-dimensional) versus 3-D (three-dimensional)
environments.

• Obstacle-free environments versus cluttered environments.

• Static environments versus dynamic environments.

• Dependent on fully functioning robots versus systems
robust to error.


Alternatively, we can compare transportation tech- 
niques by focusing on the specific manipulation tech- 
nique employed. The manipulation techniques used for 
collective object transportation can be grouped into 
three primary methods [72.9]: pushing, grasping, and 
caging. The pushing approach requires contact between 
each robot and the object, in order to impart force in the 
goal direction; however, the robots are not physically 
connected with the object. In the grasping approach, 
each robot in the swarm is physically attached to the 
object being transported. Finally, the caging approach 
involves robots encircling the object so that the object 
moves in the desired direction, even without the con- 
stant contact of all the robots with the object. 

This section outlines some of the key techniques 
developed to address this object transportation task, or- 
ganized according to these three main techniques. 


72.1.1 Transport by Pushing 


A canonical task often used as a testbed in distributed 
robotics is the box pushing task. The number, size, 
or weight of the boxes can be varied to explore dif- 
ferent types of multirobot cooperation. This task typ- 
ically involves robots first locating a box, positioning 
themselves at the box, and then moving the box co- 
operatively toward a goal position. Typically, this task 
is explored in 2-D. The domain of box pushing is 
also popular because it has relevance to several real- 
world applications [72.10], including warehouse stock- 
ing, truck loading and unloading, transporting large 
objects in industrial environments, and assembly of
large-scale structures.

The pushing technique was first demonstrated in 
the early work of Kube and Zhang [72.1], inspired 
by the cooperative transport behavior in ants [72.7]. 
They proposed a behavior-based approach that com- 
bined behaviors for seeking out the object (illuminated 
by a light), avoiding collisions, following other robots, 
and motion control. An additional behavior to detect 
stagnation was used to ensure that the robots in the collective
did not work consistently against each other. In this
approach, all robots acted similarly; there was no concept
of a leader and followers. While some of the robots in
the swarm might not contribute to the pushing task due 
to poor alignment or positioning along the nondominant 
pushing direction, Kube and Zhang showed that care- 
ful design of these behaviors enabled the robot swarm 
to distribute along the boundary of the object and push 
it. Figure 72.1 shows five robots cooperatively pushing 
a lighted box. 
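
Why alignment matters, and why stagnation can occur, follows from simple vector addition. The sketch below is not Kube and Zhang's controller: it assumes every robot pushes with the same force and ignores friction and torque. The function name `net_push` is hypothetical.

```python
import math

def net_push(goal_angle, push_angles, force=1.0):
    """Component of the summed push forces along the goal direction.

    `push_angles` are the directions (radians) in which each robot pushes;
    every robot is assumed to push with the same `force`.
    """
    return sum(force * math.cos(a - goal_angle) for a in push_angles)

aligned = net_push(0.0, [0.0, 0.0])          # both robots push toward the goal
opposed = net_push(0.0, [0.0, math.pi])      # robots push against each other
sideways = net_push(0.0, [math.pi / 2])      # robot on a nondominant side
```

Two aligned robots add their forces, two opposed robots cancel out (a stagnation case), and a robot pushing perpendicular to the goal direction contributes nothing to progress.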

Other researchers have explored different aspects of 
box pushing in multirobot systems. While much of this 
early work involved demonstrations of smaller robot 
teams, many of these techniques could theoretically 
scale to larger numbers of robots. Task allocation and 
action selection are often demonstrated using collec- 
tive box pushing experiments; examples of this work 
include that of Parker [72.11, 12], who illustrated aspects
of adaptive task allocation and learning; Gerkey
and Mataric [72.13], who presented a publish/subscribe
dynamic task allocation method; and Yamada and
Saito [72.14], who developed a behavior-based action
selection technique that does not require any communication.

Other work using box pushing as an implementation
domain for multirobot studies includes Donald
et al. [72.15], who illustrate concepts of information
invariance and the interchangeability of sensing,
communication, and control; Simmons et al. [72.16],
who demonstrate the feasibility of cooperative control
for building planetary habitats; Brown and Jennings
[72.17] and Böhringer et al. [72.18], who explored
notions of strong cooperation without communication
in pusher/steerer models; Rus et al. [72.19], who
studied different cooperative manipulation protocols in
robot teams that make use of different combinations
of state, sensing, and communication; and Jones and
Mataric [72.20], who developed general methods for
automatically synthesizing controllers for multirobot
systems.

Fig. 72.1 Demonstration of five robots collectively pushing
a lighted box (after [72.7])

Most of this existing work in box pushing has fo- 
cused, not on box pushing as the end objective, but 
rather on using box pushing for demonstrating various 
techniques for multirobot control. However, for studies 
whose primary objective is to generate robust cooper- 
ative transport techniques, work has more commonly 
focused on manipulation techniques involving grasp- 
ing and caging, rather than pushing, since grasping and 
caging provide more controllability by the robot team. 


72.1.2 Transport by Grasping 


Grasping approaches for object transportation in swarm 
robotics typically make use of form closure and force 
closure properties [72.21]. In form closure, the ob- 
ject motion is constrained via frictionless contact con- 
straints; in force closure, frictional contact forces ex- 
erted by the robots prevent unwanted motions of the 
manipulated object. The earliest work representing the 
grasping technique is that of Wang et al. [72.4]. This ap- 
proach uses form closure, along with a behavior-based 
control approach that is similar to the early swarm 
robot pushing technique of Kube and Zhang [72.1]. 
The technique of Wang et al., called BeRoSH (for
Behavior-based Multiple Robot System with Host for 
Object Manipulation), incorporates behaviors for push- 
ing, maintaining contact, moving, and avoiding objects. 
In this approach, the goal pose of the object is provided 
directly to each robot from an external source (i.e., 
the Host); otherwise, the robots work independently ac- 
cording to their designed behaviors. As a collective, the 
swarm exhibits form closure. Wang et al. showed that 
this form closure technique can successfully transport 
an object to its desired goal pose from a variety of dif- 
ferent starting locations. 

Another early work using the force closure grasping 
technique is that of Stilwell and Bay [72.2] and Johnson 
and Bay [72.3]. They developed distributed leader- 
follower techniques that enable swarms of tank-like 
robots to transport pallets collectively while maintain- 
ing a level height of the pallet during transportation 
(Fig. 72.2). In their approaches, a pallet sits atop sev- 
eral tank-like robots; the weight of the pallet creates 
a coupling with the robots that can be viewed as similar
to a grasp. To transport the pallet, one vehicle is
designated as the leader. This leader then perturbs the
dynamics of the system to move the swarm in the de- 
sired direction, and with the desired pallet height. The 
remaining robots in the swarm react to the perturbations 
to stabilize the forces in the system. The system is fully 
distributed, and requires robots to only use local force 
information to achieve the collective motion. The in- 
dividual robots do not require knowledge of the pallet 
mass or inertia, the size of the collective, or the robot 
positions relative to the pallet’s center of gravity. They 
showed the control stability of their approach for this 
application, even in the presence of inaccurate sensor 
data. 

A related approach is that of Kosuge and 
Oosumi [72.5], who also used a decentralized leader-follower
approach for multiple holonomic robots grasping
and moving an object, in a manner similar to that
of [72.2]. Their approach defines a compliant motion 
control algorithm for each velocity-controlled robot. 
The main difference of this work compared to [72.2] 
is that the control algorithm specifies the desired in- 
ternal force as part of the coordination algorithm. This 
approach was validated in simulation for robots carry- 
ing an aluminum steel pipe. 

Another related approach is that of Miyata 
et al. [72.6], who addressed the need for nonholonomic 
vehicles to regrasp the object during transport. Their ap- 
proach includes a hybrid system that makes use of both 
centralized and decentralized planners. The centralized 
planner develops an approximate motion plan for the 
object, along with a regrasping plan at low resolution; 
the decentralized planner precisely estimates object mo- 
tion and robot control at a much higher resolution. 


Fig. 72.2 Cooperative transport of a pallet using tank-like 
robots (after [72.2]) 
They demonstrated the effectiveness of this approach
in simulation. 

Sugar and Kumar [72.22] developed distributed 
control algorithms enabling robots with manipulators to 
grasp and cooperatively transport a box. In this work, 
a novel manipulator design enables the locomotion 
control to be decoupled from the manipulation con- 
trol. Only a small number of the team members need 
to be equipped with actively controlled end effectors. 
This approach was shown to be robust to position- 
ing errors related to the misalignment between the two 
platforms and errors in the measurement of the box 
size. 

Cooperative stick pulling [72.23, 24] was explored 
by Ijspeert et al.; this task requires robots to pull sticks
out of the ground (Fig. 72.3). The robot controllers 
are behavior-based, and include actions such as look- 
ing for sticks, detecting sticks, gripping sticks, obstacle 
avoidance, and stick release. Experiments show that the 
dynamics are dependent on the ratio between the num- 
ber of robots and sticks; that collaboration can increase 
superlinearly with certain team sizes; that heterogene- 
ity in the robots can increase the collaboration rate 
in certain circumstances; and that a simple signalling 
scheme can increase the effectiveness of the collabo- 
ration for certain team sizes. A main objective of this 
research was to explore the effectiveness of various 
modeling techniques for group behavior. These model- 
ing techniques are discussed in more detail in a separate 
chapter. 

The SWARM-BOTS project is a more recent ex- 
ample of the use of grasping for collective transport; 
it also makes use of self-assembly as a novel approach 
for achieving distributed transport. In this work [72.25], 
s-bot robots are developed that have grippers en- 
abling the robots to create physical links with other 
s-bots or objects, thus creating assemblies of robots. 
These assemblies can then work together for naviga- 
tion across rough terrain, or to collectively transport 
objects. The s-bots are cylindrical, with a flexible arm
and toothed gripper that can connect one s-bot to another
(Fig. 72.4).

Fig. 72.3 Stick pulling experiment using robot collectives
(after [72.23])

The decentralized control of the SWARM-BOT 
robots is learned using evolutionary techniques in sim- 
ulation, then ported to the physical robots. The learned 
s-bot control [72.26] consists of an assembly module, 
which is a neural network that controls the robot prior 
to connection, and a transport module, which is a neu- 
ral network that enables the s-bot to move the object 
toward the goal after a grasp connection is made. The 
self-assembly process involves the use of a red-colored 
seed object, to which other s-bots are attracted. S-bots 
initially light themselves with a blue ring, and then are 
attracted to the red color, while being repulsed by the 
blue color. Once robots make a connection, they color 
themselves red. 
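
As a rough illustration of the color-based rule described above, the following Python sketch computes a heading for an unconnected s-bot by summing attraction toward red sources and repulsion from blue ones. It is a hand-written stand-in and not the evolved neural-network controller of [72.26]; the function names, the inverse-distance weighting, and the gain parameters are assumptions made purely for illustration.

```python
import math

def assembly_heading(robot_xy, red_xy, blue_xy, k_attract=1.0, k_repel=0.5):
    """Return a heading (radians) from attraction to red and repulsion from blue.

    Hypothetical rule: each perceived LED contributes a unit vector scaled by
    gain / distance; red sources pull, blue sources push.
    """
    vx = vy = 0.0
    for sources, gain in ((red_xy, k_attract), (blue_xy, -k_repel)):
        for sx, sy in sources:
            dx, dy = sx - robot_xy[0], sy - robot_xy[1]
            dist = math.hypot(dx, dy)
            if dist < 1e-9:
                continue  # ignore a source at the robot's own position
            # unit vector (dx/dist, dy/dist) scaled by gain/dist
            vx += gain * dx / dist**2
            vy += gain * dy / dist**2
    return math.atan2(vy, vx)

def led_color(connected):
    """An s-bot shows blue until it has made a connection, then red."""
    return "red" if connected else "blue"
```

Because a newly connected s-bot switches to red, it becomes an attachment point for the remaining blue s-bots, which is how the assembly grows outward from the seed in this simplified picture.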

The interaction of these attractive and repulsive 
forces across the s-bots enables the robots to self- 
assemble into various connection patterns. Once the s- 
bots have self-assembled, they use the transport module 
to align toward a light source, which indicates the tar- 
get position. The s-bots then apply pushing and pulling 
motions to transport the object to the destination. Simi- 
lar to the approach of Kube and Zhang [72.1], the s-bots 
also check for stagnation and execute a recovery move 
when needed. The authors demonstrate [72.8] how the
evolutionary learning approach allows the collective
to successfully deal with different object geometries,
adapt to changes in target location, and scale to larger
team sizes.

Fig. 72.4 An s-bot, developed as part of the SWARM-BOTS
project (after [72.25])

This technique for collective transport using self- 
assembly was demonstrated [72.25] in an interesting 
application of object transport, in which 20 s-bots self- 
assembled into four chains in order to pull a child across 
the floor (Fig. 72.5). In this experiment, the user spec- 
ifies the number of assembled chains, the distribution 
of the s-bots into the chains, the global localization of 
the child, and the global action timing. The s-bots then 
autonomously form the chains using self-assembly and 
execute the pull. 

Several additional interesting phenomena regarding 
collective transport were discovered in related studies 
with the SWARM-BOTS. Nouyan et al. [72.27] showed 
that the different collective tasks of path formation, 
self-assembly, and group transport can be solved in 
a single system using a homogeneous robot team. They 
further introduce the notion of chains with cyclic directional
patterns, which facilitate swarm exploration
in unknown environments, and assist in establishing 
paths between the object and goal. The paths estab- 
lished by the robot-generated chains mimic pheromone 
trails in ants. In [72.28], Groß and Dorigo determined
that, while robots that behave as if they are solitary 
robots can still collectively move objects, robots that 
learn transport behaviors in a group can achieve a better 
performance. In [72.29], Campo et al. showed that the
SWARM-BOTS robots could effectively transport objects
even with only partial knowledge of the direction
of the goal. They investigated four alternative control
strategies, which vary in the degree to which the robots
negotiate regarding the goal position during transport.
Their results showed that negotiating throughout object
transport can improve motion coordination. All of
these works are based on inspiration from biological
systems.

Fig. 72.5 SWARM-BOTS experiment in which s-bots
self-assemble to pull a child across the floor (after [72.25])

The work of Berman et al. [72.31] is not only bio- 
inspired, but also seeks to directly model the group 
retrieval behavior in ants. Their studies examined the 
ants’ roles during transport in order to define rules that 
govern the ants’ actions. They further explored mea- 
surements of individual forces used by the ants to guide 
food to their nest. They found that the distributed ant 
transport behavior exhibits an initial disordered phase, 
which then transitions to a more highly coordinated 
phase of increased load speed. From these studies, 
a computational dynamic model of the ant behavior was 
designed and implemented in simulations, showing that 
the derived model matches ant behavior. Ultimately, 
this approach could be adapted for use on physical robot 
teams. 

Once a robot collective has begun transporting an 
object, the question arises as to how new robots can 
join the group to help with the transport task. Es- 
posito [72.30] addresses this challenge by adapting 
a grasp quality function from the multifingered hand 
literature. This approach assumes that robots know the 
object geometry, the total number of robots in the 
swarm, and the actuator limitations. Individual robot
contact configurations are defined relative to the ob- 
ject center and object boundary. The objective is to 
find an optimal position for a new robot by opti- 
mizing across the grasping wrench space. A numeri- 
cal algorithm was developed to address this problem, 
which incorporates the force closure criteria. This approach
was demonstrated on unmanned tugboats collectively
moving a barge, as illustrated in Fig. 72.6.
In this demonstration, the robots are equipped with
articulated magnetic attachments that allow them to
grasp the barge. This approach is scalable to larger
numbers of robots, with constant best case runtime,
and median runtimes polynomial in the number of
robots.

Fig. 72.6 Illustration of unmanned tugboats autonomously
transporting a barge (after [72.30])

Fig. 72.7a-e Illustration in simulation of object closure
by 20 robots (after [72.32])

Fig. 72.8 Demonstration of the use of vector fields for collective
transport via caging (after [72.33])


72.1.3 Transport by Caging 


The caging approach simplifies the object manipulation 
task, compared to the grasping approach, by making 
use of the concept of object closure [72.34]. In ob- 
ject closure, a bounded movable area is defined for 
the object by the robots surrounding it. The benefit 
of this approach is that continuous contact between 
the object and the robots is not needed, thus making 
for simpler motion planning and control techniques, 
compared to grasping techniques based on form or
force closure. Wang and Kumar [72.32] developed this 
object-closure technique under the assumptions that the 
robots are circular and holonomic, the object is star- 
shaped, the robots know the number of robots in the 
collective, and can estimate the geometric properties 
of the object, along with the distance and orientation 
to other robots and the object. Their approach causes 
the robots to first approach the object independently, 
and then search for an inescapable formation, which 
is a configuration of the robots from which the object 
cannot escape. Finally, the robots execute a formation 
control strategy to guide the object to the goal des- 
tination. The object approach technique is based on 
potential fields, in which force vectors attract the robot
toward the object and generally repel it from other robots.
Song and Kumar [72.35] proved the stability of this po- 
tential field approach for collective transport. Robots 
search for proper configurations around the object by 
representing the problem as a path finding problem in 
configuration space. This work describes a necessary 
and sufficient condition for testing for object closure. 
Later work [72.36] presents a fast algorithm to test for 
object closure. Experiments with 20 robots validate the 
proposed approach (Fig. 72.7). 
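
The potential-field approach phase described above can be sketched in a few lines of Python. This is a generic attractive/repulsive field of the textbook kind, not the specific controller of [72.32] or [72.35]; the function name, the gains `k_att` and `k_rep`, and the influence `radius` are assumptions chosen for illustration.

```python
import math

def approach_velocity(robot, obj, neighbors, k_att=1.0, k_rep=0.2, radius=1.0):
    """Velocity command for one holonomic robot approaching the object.

    robot, obj: (x, y) positions; neighbors: list of (x, y) of other robots.
    Attraction grows linearly with distance to the object; repulsion acts
    only on neighbors closer than `radius` (a hypothetical influence range).
    """
    vx = k_att * (obj[0] - robot[0])
    vy = k_att * (obj[1] - robot[1])
    for nx, ny in neighbors:
        dx, dy = robot[0] - nx, robot[1] - ny
        d = math.hypot(dx, dy)
        if 1e-9 < d < radius:
            # Repulsive term: zero at the edge of the influence range,
            # growing as the neighbor gets closer.
            mag = k_rep * (1.0 / d - 1.0 / radius) / d**2
            vx += mag * dx / d
            vy += mag * dy / d
    return vx, vy
```

Each robot needs only its own position relative to the object and to nearby robots, which matches the local sensing assumptions listed for the caging approach; the later search for an inescapable formation is not covered by this sketch.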

A further enhancement of this vector-based control 
strategy was developed in [72.33], which can account 
for inter-robot collisions. This latter strategy imple- 
ments three primary behaviors — approach, surround, 
and transport. In this variant of the work, robots con- 
verge to a smooth boundary using control-theoretic 
techniques. This work was implemented on a collective 
of physical robots, as illustrated in Fig. 72.8. 
72.2 Object Sorting and Clustering


Collective object sorting and clustering requires robot 
teams to sort objects from multiple classes, typically 
into separate physical clusters. There are different types 
of related tasks in this domain [72.37], including clus- 
tering, segregation, patch sorting, and annular sorting. 
Early discussions of this task in robot swarms were 
given by Deneubourg et al. [72.38], with the ideas
inspired by similar behaviors in ant colonies. The ob- 
jective is to achieve clustering and sorting behaviors 
without any need for hierarchical decision making, 
inter-robot communication, or global representations of 
the environment. Deneubourg et al. showed that stig- 
mergy could be used to cluster scattered objects of 
a single type, or to sort objects of two different types. To 
achieve the sorting behavior, the robots sensed the local 
densities of the objects, as well as the type of object 
they were carrying. Clustering resulted from a simi- 
lar mechanism operating on a single type of object. 
Beckers et al. [72.39] achieved clustering from even
simpler robots and behaviors, via stigmergic threshold 
mechanisms. 
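
The density-based pick-up and drop decisions described above are often written as a pair of probability functions. The Python sketch below uses the form in which the model of [72.38] is commonly cited, with `f` the locally perceived fraction of objects of the relevant type; the constants `k1` and `k2`, the function names, and the single-step wrapper are assumptions for illustration rather than values taken from this chapter.

```python
import random

def pick_probability(f, k1=0.1):
    """Probability that an unladen robot picks up an object; high when f is small."""
    return (k1 / (k1 + f)) ** 2

def drop_probability(f, k2=0.3):
    """Probability that a laden robot drops its object; high when f is large."""
    return (f / (k2 + f)) ** 2

def step(carrying, f, rng=random):
    """One decision: return the new carrying state given local density f in [0, 1]."""
    if carrying:
        return not (rng.random() < drop_probability(f))
    return rng.random() < pick_probability(f)
```

Isolated objects (small `f`) are likely to be picked up and dense neighborhoods (large `f`) are likely to receive drops, so clusters grow without any communication. For sorting two types, `f` would be computed separately for the type of object encountered or carried.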

Holland and Melhuish [72.37] explored the ef- 
fect of stigmergy and self-organization in swarms of 
homogeneous physical robots. The robots are pro- 
grammed with simple rule sets with no ability for 
spatial orientation or memory. The experiments show 
the ability of the robots to achieve effective sort- 
ing and clustering, as illustrated in Fig. 72.9. In this 
work, a variety of influences were explored, includ- 
ing boundary effects and the distance between ob- 
jects when deposited. The authors concluded that the 
effectiveness of the developed sorting behaviors is 
critically dependent on the exploitation of real-world 
physics. An implication of this finding is that simu- 
lators must be used with care when exploring these 
behaviors. 

Wang and Zhang [72.40, 41] explored similar aims, 
but focused on discovering a general approach to the 
sorting problem. They conjecture that the outcome of 
the sorting task is dependent primarily on the capabil- 
ities of the robots, rather than the initial configuration. 
This conjecture is validated in simulation experiments, 
as illustrated in Fig. 72.10. 

Other work on this topic includes that of Yang 
and Kamel [72.42], who present research using three 
colonies of ants having different speed models. The 
approach is a two-step process. The first step is for clusterings
to be visually formed on the plane by agents
walking, picking up, or setting down objects according
to a probabilistic model, which is based on Deneubourg
et al. [72.38]. The second step is for clusters to be com- 
bined using a hypergraph model. Experiments were 
conducted in simulation to show the viability of the 
approach. The authors also discovered that having too 
many agents can lead to a deterioration in the swarm 
performance. 

Martinoli and Mondada [72.43] implemented an- 
other bio-inspired approach to clustering, in which the 
robot behavior is similar to that of a Braitenberg vehi- 
cle. They also discovered that large numbers of robots 
can cause interference in this task, concluding that noncooperative
tasks cannot always be improved with more
robots.




Fig. 72.9 Results of physical robot experiments in sorting. Panel 
(a) shows the starting configuration, while (b) shows the sorting 
results after 1.75 h (after [72.37]) 


Fig. 72.10 Results of simulations of sorting tasks, with 8 
robots and 40 objects of two types (after [72.40]) 
72.3 Collective Construction and Wall Building


The objective of the collective construction and wall
building task is for robots to build structures of a specified
form, in either 2-D or 3-D. This task is distinguished
from self-reconfigurable robots, whose bodies
themselves serve as the dynamic structure. This section
is focused on the former situation, in which manipulation
is required to create the desired structure.
The argument in favor of this separation of mobility
and structure is that, once formed, the structure
does not need to move again, and thus the ability
to move could serve as a liability [72.44]. Furthermore,
robotic units that serve both as mobility and
structure might not be effective as passive structural
elements.

Werfel et al. have extensively explored this topic,
developing distributed algorithms that enable simplified 
robots to build structures based on provided blueprints, 
both in 2-D [72.45-47] and in 3-D [72.44]. In their 
3-D approach, the system consists of idealized mobile 
robots that perform the construction, and smart blocks 
that serve as the passive structure. The robots’ job is to 
provide the mobility, while the blocks’ role is to identify 
places in the growing structure at which an additional 
block can be placed that is on the path toward obtain- 
ing the desired final structure. The goal of their work is 
to be able to deploy some number of robots and free 
blocks into a construction zone, along with a single 
block that serves as a seed for the structure, and then 
have the construction proceed autonomously according
to the provided blueprint of the desired structure.
Several simplifying assumptions are made in this
work [72.44], such as the environment being weightless
and the robots being free to move in any direction
in 3-D, including along the surface of the structure under
construction. This work does not address physical
robot navigation and locomotion challenges, grasping
challenges, etc.


Fig. 72.11 Experiments for a variety of 3-D structures, built autonomously
by a system of simple robots and blocks; panel labels: 512 blocks, 451 blocks,
220 blocks, 258 blocks, 465 blocks (after [72.44])


Fig. 72.12a-f Proof-of-principle experiments for 2-D con- 
struction, using a single robot (after [72.45]) 


In this approach, blocks are smart cubes; they 
can communicate with attached neighbors, they share 
a global coordinate system, and they can communi- 
cate with passing robots regarding the validity of block 
attachments to exposed faces. Once robots have trans- 
ported a free block to the structure, they locate attach- 
ment points in one of three ways: random movement, 
systematic search, or gradient following. A significant
contribution of this work is the development of
the block algorithm that enables the blocks to specify
how to grow the developing structure with guarantees,
and with only limited required communication. More
specifically, the communication requirement between
blocks scales linearly in the size of the structure, while
no explicit communication between the mobile robots
is needed.

Fig. 72.13 Geometric structures built by a team of 30
robots, in simulation (after [72.48])

Fig. 72.14 Experiments with prototype hardware designed for
multirobot construction tasks; panels show Step 1-1, Step 1-2,
Step 2, Step 3-1, and Step 3-2 (after [72.49])

Experiments using this approach have shown the 
ability of the system to build a variety of structures 
in simulation, as illustrated in Fig. 72.11. A proof-of- 
principle physical robot experiment using a single robot 
in the 2-D case [72.45] is illustrated in Fig. 72.12. 

Werfel [72.48] also describes a system for arranging 
inert blocks into arbitrary shapes. The input to the robot 
system is a high-level geometric program, which is then 
translated by the robots into an appropriate arrange- 
ment of blocks using their programmed behaviors. The 
desired structure is communicated to the robots as a list 
of corners, the angles between corners, and whether 
the connection between corners is to be straight or 
curved. Robots are provided with behaviors such as 
clearing, doneClearing, beCorner, collect, seal, and 
off. Figure 72.13 shows some example structures built 
using this system in simulation. 


Fig. 72.15 Experimental trial demonstrating a swarm 
building a loose wall via a spatiotemporal varying template 
(after [72.50]) 


Hardware challenges of collective robot construc- 
tion are addressed by Terada and Murata [72.49]. In 
this work, a hardware design is proposed that defines 
passive building blocks, along with an assembler robot 
that constructs structures with the blocks. Figure 72.14
shows the prototype hardware completing an assem- 
bly task. In principle, multiple assembler robots could 
work together to create larger construction teams more 
closely aligned with the concept of swarm construc- 
tion. 

Other related work on the topic of collective con- 
struction includes the work of Wawerla et al. [72.51], 
in which robots use a behavior-based approach to build 
a linear wall using blocks equipped with either posi- 
tive or negative Velcro, distinguished by block color. 
Their results show that adding 1 bit of state information
to communicate the color of the last attached block
provides a significant improvement in the collective 
performance. The work by Stewart and Russell [72.50, 
52] proposes a distributed approach to building a loose 
wall structure with a robot swarm. The approach makes 
use of a spatiotemporal varying light-field template, 
which is generated by an organizer robot to help di- 
rect the actions of the builder robots. Builder robots 
deposit objects in locations indicated by the template. 


Fig. 72.16 Experiments of blind bulldozing for site clearing using 
physical robots (after [72.53]) 
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Figure 72.15 shows the results from one of the experi- 
ments using this approach on physical robots. 

Another type of construction is called blind bulldoz- 
ing, which is inspired by a behavior observed in certain 
ant colonies. Rather than constructing by accumulat- 
ing materials, this approach achieves construction by 
removing materials. This task has practical application 
in site clearing, such as would be needed for planetary 
exploration [72.54]. Early ideas of this concept were 
discussed by Brooks et al. [72.55], who argue for
large numbers of small robots to be delivered to the lunar
surface for site preparation. Parker et al. [72.53]
further developed this idea by proposing robots using
force sensors to clear an area by pushing material to
the edges of the work site. In this approach, the robot
system collective behavior is modeled in terms of how
the nest grows over time. Stigmergy is used to control
the construction process, in that the work achieved by
each robot affects the other robots’ behaviors through
the environment. Figure 72.16 shows some experiments
using this approach on physical robots. The authors argue
that blind bulldozing is appropriate in applications
where the cost, complexity, and reliability of the robots
are a concern.


72.4 Conclusions


This chapter has surveyed some of the important techniques
that have been developed for collective object
transport and manipulation. While many advances have
been made, many open challenges remain.
Some open problems include: how to deal with
faults in the robot team members during task execution;
how to address construction in dynamic and cluttered
environments; how to enable humans to interact with
the robot swarms; how to extend more of the existing
techniques to 3-D applications; how to design formal
techniques for predicting and guaranteeing swarm behavior;
how to realize larger scale systems on physical
robots; and how to apply swarm techniques for manipulation
and construction to practical applications.
Kasper Støy


Reconfigurable robots are robots built from mecha- 
tronics modules that can be connected in different 
ways to create task-specific robot morphologies. 
In this chapter we introduce reconfigurable robots 
and provide a brief taxonomy of this type of robot. 
However, the main focus of this chapter is on the 
four most important challenges in realizing re- 
configurable robots. The first two are mechatronics 
challenges, namely the challenge of connector de- 
sign and energy. Connectors are the most important 
design element of any reconfigurable robot be- 
cause they provide it with much of its functionality, 
but also many of its limitations. Supplying energy 
to a connected, distributed multi-robot system 
such as a reconfigurable robot is an important, but 
often underestimated problem. The third challenge 
is distributed control of reconfigurable robots. We
examine both how reconfigurable robots can be
controlled in static configurations to produce loco- 
motion and manipulation and how configurations 
can be transformed through a self-reconfiguration 
process. The fourth challenge that we will discuss 
is programmability and debugging of reconfigurable
robot systems. The chapter is concluded with a brief
perspective. Overall, the chapter provides a general
overview of the field of reconfigurable robots and
is a perfect starting point for anyone interested in
this exciting field.


Reconfigurable robots are a kind of robot built from
modules that can be connected in different ways to
form different morphologies for different purposes. The
motivation for this is that conventional robots are limited
by their morphology. For example, the size of the wheels
of a wheeled robot determines what terrain it can traverse.
If it has small wheels it can operate in confined
spaces, but not traverse rugged terrain, and vice versa if
it has large wheels. A robot arm may be limited in terms
of reach or inability to move around in the environment.
Reconfigurable robots aim to solve this problem
by providing a robotic system that can be manually reconfigured
to be physically suited for the specific task
at hand. It is conceivable that a reconfigurable robot on-site
can be fitted with several types of wheels, and different
segments can be added to form the body or arms. In this
way, the reconfigurable robot can be a perfect physical
fit for its task despite the task not being known or defined
in advance.

While this practically oriented motivation is suffi- 
cient for developing and working with reconfigurable 
robots, the vision for the field is much deeper. The un- 
derlying, long-term vision is to develop robots that are 
robust, self-healing, versatile, cheap, and autonomous. 
Hence most research in the field of reconfigurable 
robots has been on self-reconfigurable robots. These 
robots consist of modules just like reconfigurable 
robots, but in addition the modules can automatically
connect, disconnect, and move with respect to neighboring
modules, and thus the robot as a whole can
change shape, not unlike the robot systems envisioned
in science fiction movies such as Terminator and Transformers
or the children’s cartoon Barbapapa. Another
way to view self-reconfigurable robots is as multi-cellular
robots [73.1]. In this case, each module is comparable
to a cell in an organism. In fact, the small animal
hydra is able to self-reconfigure in the sense that if it is
cut in half, the two sections each form a smaller, but
complete hydra. While not strictly a multi-cellular organism,
the slime mold Dictyostelium has also been of
significant inspiration to the self-reconfigurable robot
community. In this slime mold individual cells search
for food in a local area, but once the food sources are
used up the cells aggregate, as is shown in Fig. 73.1, to
form slugs that number in the hundreds of thousands
of cells. Each slug, however, is still able to function as one
organism whose mission is to find a suitable
place to disperse spores for the next generation of slime
molds.

Fig. 73.1 A gathering of Dictyostelium discoideum amoebae
cells can be seen migrating (some individually, and
some in streams) toward a central point. Aggregation
territories can be as much as a centimeter in diameter (after [73.2])

The modular design gives reconfigurable robots 
a number of useful features. First of all the robots are 
robust since if modules fail, the rest of the modules will 
still continue working and thus the robot can maintain
a level of functionality, although it is slightly reduced.
Self-reconfigurable robots extend this by being able
to eject failed modules from the robot and replace them
with modules from other parts of their bodies and effectively
achieve self-healing. A powerful demonstration
of this is the CKBot shown in Fig. 73.2, which after
a kick has broken into several clusters of modules.
These clusters are able to locate each other again and
recreate the original structure [73.3]. Another vision of
reconfigurable robots is that they are versatile due to
their modular structure. Reconfigurable robots are not
limited by a fixed shape, but only by the number and capabilities
of the modules available. A final feature is that the
individual modules can be produced relatively cheaply
due to production at scale. The implication is that even
though the individual modules can be quite complex, they
can be mass produced and thus their cost can be relatively
low compared to their complexity.

Fig. 73.2 The CKBot in a chain configuration with wheels
attached (courtesy of Modlab at University of Pennsylvania)

It is also important to point out that reconfigurable 
robots can never be a universal robot that can take on the 
functionality of any robot. It is, of course, the ultimate 
dream, but given a known task-environment a robot can 
be custom-designed for this and thus will be simpler 
and better performing than a comparable reconfigurable 
robot. Thus reconfigurable robots are best suited for sit- 
uations where the task or environment is not known in 
advance or in locations where many different types of 
robots are needed, but it is not possible to bring them 
all. Optimal applications for reconfigurable robots are 
thus in extra-planetary missions, disaster areas, and so 
on. However, they may also find their use in more down 
to earth applications such as educational robotics [73.4] 
or as robot construction kits [73.5]. 

It is important to note that these are all visions that 
the reconfigurable robot community is striving towards, 
but has only realized in limited ways. However, it is the 
vision of these truly amazing reconfigurable robots that 
drives us forward. 


73.1 Mechatronics System Integration


In mechatronics system integration the main challenge 
is how to make a trade-off between different potential 
features of a module and the need to fit everything in 
a mechatronics module, which typically has a radius on 
the order of centimeters. The different classes of recon- 
figurable robots reflect different trade-offs. 

The oldest class is the mobile reconfigurable robot,
which can be traced back to the cellular robot (CEBOT)
developed in the late 1980s [73.6], but new instantiations
of this class, which are an order of magnitude
smaller, have also recently been published [73.7–9].
This class of reconfigurable robots is characterized by
the modules having a high degree of self-mobility, typically
obtained by providing each module with a set of
wheels. Modules of other classes also have a limited degree
of self-mobility, e.g., a module can perform inchworm-like
gaits, but it is so limited that we do not consider
it mobile.

Chain-type reconfigurable robots (Fig. 73.3a) were 
the first of the modern reconfigurable robots in the 
sense that they successfully demonstrated versatility: 
given the same reconfigurable robot several locomotion 
gaits could be implemented, including inch-worming, 
rolling, and even walking, as was demonstrated using 
the PolyPod robot [73.10, 11] and later its descendent 
PolyBot [73.12] and CONRO [73.13]. The chain-type 
modules are characterized by having a high degree of
internal actuation, which typically allows a module to
bend or twist. However, there are examples of modules
which provide for both bending and twisting [73.14].
These modules are also typically elongated, which 
makes them suitable for forming chains of modules 
with many degrees of freedom, making them appropri- 
ate for making limbs. 

Fig. 73.3a-c Examples of the three main types of reconfigurable
robots: (a) CONRO chain-type, (b) Molecule
lattice-type, (c) M-TRAN hybrid, courtesy of USC’s Information
Sciences Institute (a); Distributed Robotics Laboratory,
MIT (b); AIST, Japan (c)


Another class of reconfigurable robots is the lattice-type
robots (Fig. 73.3b). This class of robot addresses
one of the shortcomings of chain-type robots, namely
that it is difficult for chains to align and connect because
this requires precision that they often do not have.
This makes it difficult to achieve self-reconfiguration
with chain-type modular robots. The solution that lattice-type
reconfigurable robots represent is a geometric
design that allows them to fit in a lattice just
like atoms in a crystal. The movement of the modules
is then limited to moving from lattice position to lattice
position, a task that only requires limited precision, sensing,
and actuation. However, this often means that modules
have limited functionality outside of the lattice. Early 
two-dimensional lattice-type robots include the Fracta 
and Metamorphic robots [73.15, 16]; the first three-di- 
mensional lattice-type robots were the Molecule and the 
3-D-Unit [73.17, 18]. 

Chain-type and lattice-type robots have
largely been superseded by the hybrid reconfigurable
robots (Fig. 73.3c). These robots combine the characteristics of
chain-type and lattice-type robots in one system. That 
is, they can both fit in a lattice structure, which al- 
lows for relatively easy self-reconfiguration, and out- 
of-lattice, which allows for efficient locomotion or 
manipulation. The recent generation of self-reconfig- 
urable robots, including M-TRAN II, ATRON shown 
in Fig. 73.8, and SuperBot, are all of the hybrid 
type [73.14, 19, 20]. 

A final type of reconfigurable robot is the actuation-less
robot. These robots depend on external forces
to provide reconfiguration capabilities and are not able 
to move once they are in a lattice. There are stochastic 
versions of actuation-less modules that are suspended in 
a fluid [73.21] or float on an air-hockey table [73.22], 
and the random movements of the modules in the 
medium allow them by chance to get close enough to 
form connections. A slight twist of this approach is 
that modules only use the external forces to swing from 
lattice position to lattice position, but maintain control 
of when to disconnect and connect themselves [73.23]. 
There are also deterministic versions, which employ an
assembly-by-disassembly process where modules start
connected in a lattice and modules that are not needed
in the specific configuration can then deterministically
decide to disconnect from the structure (typically based
on electro-magnetic forces) [73.24]. Finally, there are
the manually reconfigurable robots that depend on the
human user (or another robot) to perform the reconfiguration
[73.25, 26].

An orthogonal classification of reconfigurable
robots is according to whether they are homogeneous
or heterogeneous. Homogeneous modular robots consist
of identical modules and have been favored in the
community because they lend themselves to self-reconfiguration.
However, it is becoming clear that if we want
to keep modules simple and provide a certain level of
functionality, we need to focus more on heterogeneous
systems. It is not cost effective to provide all modules
with the same level of functionality, and more importantly,
the modules become too complex, heavy, and
large if they are to contain all the functionality needed,
which in practice makes them unsuited for practical
applications [73.27].

Another emerging idea is that of soft modules. In
fact, quite a number of rigid modules that have been
built come out of projects that aimed to build soft
reconfigurable robots. The motivation for soft modules is
that they provide a certain level of compliance in the
interaction with the environment and also within the
robot. However, a good way to realize soft reconfigurable
robots has not yet been discovered.


73.2 Connection Mechanisms


An element of the mechatronic design of reconfigurable
modular robots that has turned out to be a significant
challenge is the mechanism that connects
modules to one another. This may appear puzzling
at first, but individual modules are functionally limited
and, hence, reconfigurable robots perform most
tasks using groups of modules. This means that everything
has to be passed across connectors from
module to module, including forces, communication,
and energy. For self-reconfigurable robots connector
design is even more difficult because the connector
also has to be able to actively connect and disconnect.
The optimal connector would have the following
features:


• Small size
• Fast
• Strong
• Robust to wear and tear
• High tolerance to alignment errors
• Energy use only in the transition phases
• Transferal of electrical and/or communication signals between modules
• Genderless
• Allows connection with different orientations
• Disconnects from both sides
• Dirt resistant.


While there are a few connectors that incorporate
most of these features, none has implemented
them all. It is easy to imagine that a solution
would be something along the lines of a self-cleaning,
conducting, active Velcro or gecko skin. Unfortunately,
this does not exist (yet) and connector designs
are, therefore, based on conventional electro-magnetic,
mechanical, or electro-static principles of
connection.

Magnetic connectors and combinations with elec- 
tro-magnets for active connection and disconnection 
have been quite successful given that they meet most 
of the above requirements except for being genderless
and strong. The gender issue is normally solved
by laying out magnets in a geometrical pattern on the
surface of the connector that allows male connectors
and female connectors to be connected at a discrete
number of angles. However, the main shortcoming of
the magnetic solution is the strength it provides. It 
is clear that if the magnetic force is too weak mod- 
ules will fall apart easily. However, it is, in fact, also 
a problem if the magnetic force is too strong, because 
for active mechanisms the modules have to overcome 
the magnetic force to disconnect (or for manually re- 
configurable robots, the human user has to overcome 
the magnetic force). Therefore, a compromise has to 
be found, which is not optimal for any of the situa- 
tions. A recent solution to this problem is the use of 
switchable magnets. Switchable magnets come in sim- 
ple mechanical forms where magnets are physically 
turned to change the direction of the magnetic flux and 
thus the connection strength. The more advanced type
uses electro-magnets to change the magnetic polarization
of a permanent magnet, achieving the same but in
a much smaller form factor. These developments have
opened up the possibility of using magnetic connectors 
again, but it remains largely unexplored in the com- 
munity except for the robot pebbles system (where the 
technology originated) [73.24]. 


The most recent generations of self-reconfigurable
robots have all favored mechanical solutions. A mechanical
solution is based on hooks coming out of one
connector surface attaching to holes in the opposing
connector surface. A mechanical solution immediately
solves the problem of having strong connectors, but
unfortunately introduces others. The most important
problems are that they are large and slow, e.g., in the ATRON
self-reconfigurable robots the connector mechanism
and associated actuators and electronics account for up
to 60% of the modules’ size and weight and it takes 2 s
to connect. However, the Terada connector [73.28] used
in M-TRAN III and the SINGO connector [73.29] used
in SuperBot appear to have provided potential solutions
to the problem of size, but the time issue still remains.

A last class of connectors is based on electro-static
forces [73.30]. The idea is to charge two opposing metal
surfaces causing the two surfaces to connect strongly.
While being an interesting option, the realized systems
are impractical, because they are large and sensitive to
the distance between the connection surfaces. This approach
makes most sense at smaller scales, but despite
some effort this has not been realized. In this
category of non-standard connectors, unisex Velcro connectors
[73.31] should also be mentioned, but here, of
course, the main problem is to obtain enough connection
strength.

Overall, connector technology is fairly advanced,
but there is certainly room for improvement. However,
at this point for significant progress to be made, new
results probably have to emerge from material science
and not from the reconfigurable robotic community
itself.


73.3 Energy


Reconfigurable robots are typically designed for autonomy
and hence rely on on-board batteries for power.
The challenges here are to enable the modules to share
the available energy and to allow the robot to recharge
once batteries are depleted.

It is important for the modules of a reconfigurable
robot to be able to share energy because modules may
have very different activity levels and hence very different
levels of energy consumption. Therefore, the life
of the robot can be extended significantly by allowing
inactive modules to donate their energy to more
active ones. The issue is largely unexplored, but there
have been attempts at passing energy across connectors
[73.32] through physical connections, and it has
been discussed in the context of electro-static connectors
[73.30]. For self-reconfigurable robots there is also
likely to be an algorithmic solution where modules
change roles over time and, hence, distribute the energy
consumption equally among modules over time.

For recharging, a solution is to run an energy 
bus through the robot that both allows modules to 
recharge their batteries and run off an external power 
supply [73.33]. It has also been proposed for more sta- 
tionary applications that modules do not have onboard 
batteries, but are powered by the external power sup- 
ply. A way to achieve this while still giving the modules 
some autonomy of movement is to charge them through 
the floor; this has been investigated both mechatronically
[73.15, 34] and algorithmically [73.35]. A more
flexible, but challenging approach is to allow a subset
of robot modules to return to the charger and then return
to charge the remaining modules [73.36].

At the more explorative end of the spectrum there
may be interesting possibilities in wireless energy transfer
[73.37], solar energy, or other alternative forms of
obtaining or harvesting energy.


73.4 Distributed Control


Reconfigurable robots were born out of the distributed
autonomous robot systems community and there has,
therefore, been a focus on distributed control algorithms
from the very beginning. The reason why distributed
control is such a good match for reconfigurable robots
is that, if designed well, distributed controllers have
characteristics such as robustness and scalability. Robustness
in this context refers to the ability of the control system
and, hence, the robot to continue to function despite
module failures and communication errors. This may seem
a relatively modest advantage; however, it turns out to be
crucial.

Communication systems on reconfigurable robots
tend to be unreliable. Wired communication has to be
passed across the physical connector between modules,
and if the connection is not perfect, the electrical
terminals for passing the communication signals may
not make a completely stable physical connection.
It may also be that dust has temporarily
ruined the physical interface between the two
modules, not allowing electrical communication signals
to pass through. For these reasons, reconfigurable
robots often rely on wireless communication in the
form of either infrared communication or more global
forms of wireless communication such as Bluetooth.
However, this does not solve the problem, it just
changes it. For infrared communication the transmitter
and receiver on modules that are to communicate
may not be aligned perfectly. There may be crosstalk
caused by reflections of signals that cause interference
between signals and even cause modules to receive
messages that were not for them in the first
place [73.38]. Communication relying on electro-magnetic
waves does not have these problems, but then
interference between modules and even background
wireless signals can often cause communication to
fail. The point here is that communication errors cannot
be attributed solely to poor design or the immature
nature of module prototypes, but are fundamental problems
that the algorithms have to be able to handle
robustly.

Scalability is the other advantage of distributed 
control algorithms. The motivation here is that while 
current reconfigurable robots consist of tens of mod- 
ules, the ambition is that eventually we will have robots 
consisting of hundreds, thousands, or maybe even mil- 
lions of modules. It is, therefore, important that the 
controller does not rely on a central module for con- 
trol, since this module would be both a bottleneck for 
the responsiveness of the system and also be a single 
point of failure. Therefore, scalability of control al- 
gorithms is crucial, and distributed control algorithms 
have the potential to provide just that. It is also important
here to understand that algorithms have to
be able to deal with module failure, since as the number
of modules increases the chance of module failures
and communication errors increases. In fact, given a high
enough number of modules, modules will fail. Assume
that the probability that a single module fails is $p_1$
and that modules fail independently; the probability
that at least one module out of $n$ fails, $p_n$, is then given
by


$$p_n = 1 - (1 - p_1)^n . \qquad (73.1)$$


This is a very basic consideration, but it is important 
to understand that the probabilities are working against 
a controller that is not fault tolerant. For example, given
ten modules with just a 1% probability of failing, the
probability of at least one of them failing is 9.6%. Given this
background we come to realize that distributed control 
algorithms are not just a luxury, but absolutely required 
if we are to realize reconfigurable robots. 
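
As a minimal sketch of how quickly Eq. (73.1) grows with the number of modules, the following snippet evaluates it for a few robot sizes. The function name prob_any_failure and the chosen module counts are illustrative assumptions, and the calculation assumes that modules fail independently.

```python
def prob_any_failure(p1: float, n: int) -> float:
    """Probability that at least one of n modules fails, following Eq. (73.1).

    Assumes independent failures with the same per-module probability p1.
    """
    return 1.0 - (1.0 - p1) ** n


if __name__ == "__main__":
    # Hypothetical robot sizes; 1% per-module failure probability as in the text.
    for n in (10, 100, 1000):
        print(f"n = {n:5d}: p_n = {prob_any_failure(0.01, n):.3f}")
```

For ten modules this reproduces the 9.6% quoted above; for larger module counts the probability approaches one, which is the argument for fault-tolerant control.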


73.4.1 Communication 


A fundamental basis for all distributed control systems
is the supporting communication system. In reconfigurable
robots there are local, global, and hybrid communication
systems, see Fig. 73.4a-c, respectively.
Local communication systems rely on module-to-module
communication using, for instance,
infrared transceivers. While fairly primitive, this form
of communication is essential in reconfigurable robots
because it allows two modules to determine their relative
positions and orientations. This is not possible
using global communication systems, regardless of
whether they are wired or wireless. Local communication
also scales since there is no common communication
medium that becomes saturated.


Fig. 73.4a-c The underlying communication models used in reconfigurable robots are (a) local, (b) global, or (c) hybrid
(after [73.39])

Global communication systems using wired bus
systems or wireless communication are also useful for
real-time control of the reconfigurable robot, because
in a purely local communication system there may be
significant lag even if the communication volume is low,
since messages may have to travel across many links to
arrive. Hence, an optimal design is one that has both
a local communication system to support topology discovery
and a global communication system to perform
high-speed global coordination.

An alternative is hybrid communication [73.39]. 
The idea behind this is to make a bus system whose 
topology can be changed dynamically. Initially, mod- 
ules can connect to neighbors to discover the local 
topology. As the need arises modules may connect 
or disconnect, i.e., reconfigure, their busses to match 
a given communication load and distribution across the 
system. While the idea seems to hold potential, it has 
not been thoroughly investigated. 


73.4.2 Locomotion 


One of the basic tasks of a reconfigurable robot is
locomotion, hence let us take a look at some of the
algorithms which have been proposed for controlling
locomotion. One of the first distributed control algorithms
was gait control tables [73.11, 40] (Fig. 73.5).
Despite being very simple, the algorithm is a pow- 
erful demonstration that often practical and robust 
algorithms are more useful than theoretically sound 
algorithms. 


Each cell in a control table corresponds to the position
of one actuator of one module in a specific time
interval. The column identifies the actuator and the
row identifies the time interval. The algorithm is based
on the assumption that all modules are synchronized.
When the algorithm is activated, each module moves its
actuators to the position identified in the first row of
the gait control table. It then waits until the time interval
has passed and then moves its actuators to the position
identified by the second row, and so on. When the end
of the gait control table is reached, the controller loops
back to the first row.
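
The procedure above can be sketched as follows. This is only an illustration under stated assumptions: the table values, the Module class, and the run helper are hypothetical, each module is assumed to own exactly one actuator (one column), and the shared clock is simulated by calling tick on every module once per time interval.

```python
from typing import List

# Hypothetical gait control table: rows are time intervals, columns are
# actuators, entries are target joint angles in degrees.
GAIT_TABLE: List[List[float]] = [
    [30.0, -30.0, 30.0],
    [0.0, 0.0, 0.0],
    [-30.0, 30.0, -30.0],
    [0.0, 0.0, 0.0],
]


class Module:
    """One module that reads only its own column of the shared table."""

    def __init__(self, column: int, table: List[List[float]]):
        self.column = column
        self.table = table
        self.row = 0           # current row, advanced by the module's own clock
        self.position = None   # last commanded actuator position

    def tick(self) -> float:
        """Called once per time interval; no communication with other modules."""
        self.position = self.table[self.row][self.column]
        self.row = (self.row + 1) % len(self.table)  # loop back to the first row
        return self.position


def run(modules: List[Module], intervals: int) -> List[List[float]]:
    """Simulate synchronized clocks by ticking every module once per interval."""
    return [[m.tick() for m in modules] for _ in range(intervals)]


if __name__ == "__main__":
    robot = [Module(c, GAIT_TABLE) for c in range(3)]
    for step, positions in enumerate(run(robot, 6)):
        print(step, positions)
```

Because each module only indexes its own column and keeps its own row counter, removing one module from the list does not change what the others do, which mirrors the robustness argument made for this controller.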

Gait control tables are a simple, and limited, form of distributed
control since they only work with the specific number
of modules for which they were designed and they make
the relatively strong assumption that all modules’ clocks
are synchronized. There is no way around the first limitation;
however, the second one in practice often holds
long enough to make successful experiments relying on
modules being initialized at the same time. While being
fairly primitive, the algorithm has been successful,
and given our motivation above it is fairly clear why:
it is very robust. It does not rely on communication
and, in fact, there is no communication between modules,
so the failure of one module does not influence
the control of other modules, but may of course reduce
the performance of the robot. However, this is often what
we refer to as graceful degradation of performance, because
the degradation is proportional to the number of
failed modules.


Fig. 73.5 The PolyBot robot in a loop configuration controlled
by a gait control table (after [73.41])

Above we have described gait control tables in
their most basic form; however, in reality each cell
of a control table may refer to a general behavior in- 
stead of a specific actuator position. For example, it is 
possible that an actuator should implement a spring- 
like function, be turned off completely, or be com- 
pletely stiff. In this way, the behavior of individual 
modules can be influenced by the behavior of other 
modules and the environment through which the robot 
is navigating. 

Theoretically, the main problem of gait control
tables is that modules have no mechanism to stay synchronized
and, hence, over time modules will lose synchronization.
Another, more serious consequence is
that the robot as a whole cannot react to the environment
and, for instance, change locomotion
pattern or shape. One solution to this problem is represented
by hormone-based control algorithms [73.42].
Slightly simplified, the idea is that before each module 
executes a row of the gait control table a synchroniz- 
ing hormone is passed through the robot. This ensures 
that modules stay synchronized and, in addition, it is 
also possible to pass different hormones to reflect different
desired locomotion patterns. While theoretically
well-developed, hormone-based control slows down the
robot due to the overhead of synchronization and, even
worse, if synchronization hormones are lost the robot
may stop for a while until a new hormone is generated.
Role-based control [73.43] is a compromise
between the two. The main idea of role-based control 
is to have a looser coupling between action and syn- 
chronization. The modules have autonomy like in gait 
control tables and achieve synchronization over time. 
However, the robot is not able to react globally as 
fast as hormone-based controllers because synchroniza- 
tion signals are slow compared to the movement of the 
robot. 

These algorithms are mainly suited for open-loop
control. However, an important challenge is to understand
how to adapt locomotion patterns and configuration
to the environment. This challenge has been explored
less [73.44, 45], but it is very important
for the future of reconfigurable robots.


73.4.3 Self-Reconfiguration 


A challenge that has received significant attention is 
that of self-reconfiguration control. This challenge is, 
of course, tied to self-reconfigurable robots and not 
to reconfigurable robots in general. It turns out that 
the general problem of reconfiguring one configuration 
into another is computationally intractable. In fact, it
is currently believed that finding the optimal solution
is NP-hard [73.46]. However, it is not entirely clear if
there exists a subspace of the problem where it is com- 
putationally more tractable. There may be special cases 
where this is the case, but we currently do not know. 
The current status is that self-reconfiguration control re- 
mains difficult, in particular because we also aim for 
distributed, not just centralized, algorithms for solving
this believed-to-be NP-hard problem. Fig. 73.6
shows an example of a short self-reconfiguration
sequence.


Definition 
One way to define the distributed self-reconfiguration 
problem is: 


Given a start configuration A and a final configura- 
tion B, distributedly find and execute a sequence of 
disconnections, moves, and connections that trans- 
forms A into B. 


This formulation shies away from the optimality 
criteria because, in practice, good-enough solutions 
are what we are after. Of course, optimal solutions 
would be better, but given that the problem is NP- 
hard we cannot find them efficiently. A likely false 
assumption this formulation implies is that configu- 
rations A and B are known. While this is true for 
very simple cases where the self-reconfigurable robot 
is to transform itself into a pre-specified object like 
a chair, in general the robot should be able to dis- 
tributedly discover suitable configurations. That is, the 
final configuration B is often not known beforehand. 
This leads to the flip-side of the self-reconfiguration 
problem, which has only been addressed to a limited 
degree: 


Given a start configuration A, a task T, and an en- 
vironment E, distributedly find a configuration B 
better suited for task T in environment E. 


It is likely that the split between configuration discovery
and self-reconfiguration control is not entirely
productive and thus maybe a formulation like this is bet- 
ter: 


Given a start configuration A, a task T, and an en- 
vironment E, distributedly find an action that makes 
configuration A better suited for task T in environ- 
ment E. 


A final comment on the problem formulation is that 
we may need all three variations of the problem. The 
last formulation is useful for incremental improvement, 
but occasionally the robot has to go through a paradigm 
shift, e.g., from a snake to a walking robot, and to 
achieve this we need the first two formulations of the 
problem. 


Fig. 73.6 A self-reconfiguration sequence that transforms 
M-TRAN from a walker to a snake configuration (af- 
ter [73.47]) 


Algorithms 
A self-reconfiguration algorithm consists of a repre- 
sentation of the final configuration and a movement 
strategy. 

The movement strategies that have been researched
so far are random movement [73.15], local
rules [73.48], coordinate attractors [73.49], gradients
[73.50], and recruitment [73.51]. The conceptually
simplest algorithm is one where modules know the
global coordinates of the positions contained in the goal 
configuration. In this case, modules can move around 
randomly and stop when they are at a coordinate which 
is contained in the goal configuration. While this strat- 
egy is attractive due to its simplicity, it has a number of 
drawbacks. In a three-dimensional self-reconfigurable 
robot random movement is dangerous because mod- 
ules may by accident disconnect from the structure and 
fall down. While this problem may be solved by building
sturdy, soft modules that are able to reattach to the
structure after a fall, it is a difficult solution (at least
nobody has attempted it so far). Another problem is
that when modules settle randomly, hollow subspaces or
sealed-off caves that modules cannot enter may be
created. Again, in practice this may not be a problem since
these subspaces are likely to be relatively small. Finally,
and this is probably the least problematic, random walk 
is inefficient for a large number of modules and self- 
reconfiguration sequences consisting of a large number 
of actions. In other words, a movement strategy based 
on random walk has scalability issues. Coordinate attractors,
gradients, and recruitment are all designed
to improve scalability. The idea behind coordinate attractors
is that a module that has reached a coordinate
contained in the goal configuration, and that discovers
through local sensing that an adjacent position in the goal
configuration is unfilled, can broadcast this coordinate
to all the modules to attract free modules to this
location.

Gradients are again an improvement over this strat- 
egy because coordinate attractors are prone to local 
minima. That is, there may be no direct path from a free 
module to the free goal position. A gradient-based strat- 
egy does not broadcast a coordinate, but communicates 
an integer to neighbor modules; these communicate 
this integer minus 1 to their neighbors, and so on. 
Modules listen for integers and pass the highest one 
they have received minus 1 on. Once this process is 
complete, free modules can climb the gradient to find 
the location of the available goal position and, impor- 
tantly, they can do so by following the structure of 
the robot and avoid local minima (this is the strategy
used in the self-reconfiguration sequence shown in
Fig. 73.7).
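
The gradient mechanism described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the implementation used in the cited work: the robot is abstracted as a graph of neighboring modules, the names propagate_gradient and climb are hypothetical, message passing is simulated with a queue rather than real asynchronous communication, and the gradient is assumed to stop once the forwarded integer reaches zero.

```python
from collections import deque
from typing import Dict, Hashable, List

Graph = Dict[Hashable, List[Hashable]]


def propagate_gradient(neighbors: Graph, source: Hashable,
                       start_value: int) -> Dict[Hashable, int]:
    """Spread an integer gradient from the module next to an unfilled goal position.

    Each module keeps the highest integer it has received and forwards that
    integer minus 1 to its neighbors.
    """
    value = {source: start_value}
    queue = deque([source])
    while queue:
        module = queue.popleft()
        outgoing = value[module] - 1
        if outgoing <= 0:          # assumption: the gradient has a limited range
            continue
        for nb in neighbors[module]:
            if outgoing > value.get(nb, 0):
                value[nb] = outgoing
                queue.append(nb)
    return value


def climb(neighbors: Graph, value: Dict[Hashable, int],
          start: Hashable) -> List[Hashable]:
    """Follow strictly increasing gradient values along the robot's structure."""
    path = [start]
    current = start
    while True:
        best = max(neighbors[current], key=lambda nb: value.get(nb, 0), default=None)
        if best is None or value.get(best, 0) <= value.get(current, 0):
            return path
        current = best
        path.append(current)


if __name__ == "__main__":
    # Hypothetical chain of five modules; "a" sits next to the unfilled position.
    chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"],
             "d": ["c", "e"], "e": ["d"]}
    gradient = propagate_gradient(chain, "a", 10)
    print(gradient)
    print(climb(chain, gradient, "e"))
```

Because the values decrease by one per hop along connected modules, a free module that always moves to the neighbor with the highest value follows the structure itself rather than a straight line in coordinate space.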

Finally, the recruitment strategy is a more conserva- 
tive version of gradients because here the module next 
to the goal position sends out a single message whose 
purpose it is to recruit a single free module for the un- 
filled goal position that it knows about. Whether it pays 
off to be conservative or not is a matter of priorities. 
However, it may often be a good strategy to attract 
more modules because where one module is needed 
probably more are needed later, and then accept the 
movement overhead when this is not the case. The fi- 
nal movement strategy is based on local rules. These 
rules only allow a module to move if the local configu- 
ration satisfies a rule in a rule set. In this case, the rule 
will fire and an action will be executed. By defining 
the local rules cleverly the resulting configurations can 
be constrained. The movement strategy of local rules 
is typically used to control cluster-flow or water-flow 
locomotion where modules from the back of the robot 
move towards the front, resulting in a forward locomo- 
tion of the robot [73.52]. 

All these movement strategies, except for local 
rules, rely on a representation of the goal configura- 
tion. A simple representation is to represent all the goal 
coordinates in the final configuration. However, given 
that all modules need a copy of this representation it 
is important that it is space efficient and, therefore, 
a direct representation like this is only suited for small 
configurations. Another representation used is one of 
overlapping cubes [73.50], but other representations 
could also be used. Typically, which representation to 
use depends on a trade-off between space and computational
complexity. It is also possible to have indirect
representations that code for growth patterns instead,
but these are explored to a lesser degree.


Fig. 73.7a-f Simulated, large-scale self-reconfiguration

The standing challenge for self-reconfiguration is 
to make algorithms that are practical to use on physi- 
cal self-reconfigurable robots. The range of algorithms 
covered here is sufficient for robots consisting of tens 
of modules and thus the self-reconfiguration challenge
is currently more of a mechatronics problem than an al- 
gorithmic problem [73.53]; however there is certainly 
room for algorithmic improvements as well. 


73.4.4 Manipulation 


Manipulation is another task that is suitable for recon- 
figurable robots. If modules are connected in a chain 
configuration, they can form a serial manipulator with 
properties not unlike those of traditional robot manip- 
ulators. However, two important problems have to be 
solved before they can work as a traditional robot ma- 
nipulator. 

One problem is how to calculate the inverse kine- 
matics for a chain of modules. The other is how to 
increase the strength of a modular manipulator since 
it is relatively weak because of the limited actuator 
strength of the individual module. Inverse kinematics
provides a way to calculate the internal
joint angles of the modules needed for the outermost
module to reach a given position and orientation in
space [73.54]. Several answers to the question of how
to calculate inverse kinematics have been explored. 
One option is to fit the serial chain of modules to 
a curve [73.55]. Another is to use constrained opti- 
mization techniques [73.56]. A third option is to use 
a fast method based on what is defined as dexterous 
workspaces of subchains [73.57]. An important aspect 
of this work is also the potential for the robot to discover
its own kinematics [73.58]. However, maintaining
a correspondence between the kinematic model and
the physical robot remains a significant challenge for
these approaches, given that a chain of modules is often
not rigid enough, in part due to the connectors.

Fig. 73.8 The ATRON hybrid self-reconfigurable robot

The other problem that needs to be addressed is 
the relative weakness of the modules and, as a con- 
sequence, the modular manipulator as a whole. We 
need to find a way to make modules work together 
to produce cooperative actuation that allows them to 
produce larger forces than those that the modules can 
produce individually. This question is a little harder 
to answer, but one option is to exploit the large me- 
chanical advantages formed near singularities of the 
joints [73.59]. The idea is that a closed loop of mod- 
ules forms a manipulator. If specific sets of modules 
are alternately moved and locked, the chain of modules 
can continually maintain a large mechanical advan- 
tage and then generate a much larger force than the 
one an individual module can provide. However, problems
remain with internal forces of the chain and the
weight of the modules involved. Alternative, mostly 
theoretical approaches include a biologically inspired 
approach to mechanical design where modules form 
the equivalent of bones, muscles, and tendons [73.60] 
and use these construction elements to propagate forces 
to where they are needed in the configuration. It may 
also be possible to coordinate the movement of all mod- 
ules so that the movements add up to produce a larger 
movement on the global level, i.e., perform collective 
actuation [73.61]. 

Besides the traditional arm-based approach to ma- 
nipulation, it is also possible to use reconfigurable 
robots as a distributed actuator array [73.62]. In this 
approach, modules are spread over a surface. When an 
object is placed on the surface, many modules can work 
together to manipulate it. The array can handle heavy 
objects as long as their surface area is relatively large. 
In combination with self-reconfiguration, it is also pos- 
sible to use this approach in three dimensions [73.63]. 


73.5 Programmability and Debugging 


An area that is receiving more and more focus in the 
community is the challenge of how to efficiently program
and debug reconfigurable robots. In the previous
section, we discussed how to control reconfigurable
robots in a distributed fashion. While these distributed control
algorithms are desired for their robustness and scalability,
they are notoriously difficult to implement and debug
in general, and even more so on modular robots, which are
resource-constrained, embedded platforms that often
only allow for debugging output in the form of blinking
LEDs.


73.5.1 Iterative, Incremental Programming 
and Debugging 


Conventionally, the challenge is met by developing ap- 
plications for reconfigurable robots using an iterative, 
incremental approach to programming and debugging. 
This ensures that we can locate errors relatively quickly 
and correct them before we introduce additional func- 
tionality and hence complexity. A good programming 
practice is also to develop an application programming 
interface that hides the low-level hardware interface from
the programmer, who is more interested in the higher-level control
algorithms. This conventional approach is suitable
for small demonstration programs, but as the complexity
of the task and thus the program increases, this
approach becomes intractable. The main reason is that 
it becomes increasingly difficult to obtain reliable de- 
bugging information due to the distributed and dynamic 
nature of reconfigurable robots and also more often than 
not due to the immaturity of the physical platforms. 


73.5.2 Simulation 


Given the shortcomings of iterative, incremental de- 
velopment, researchers use simulations, e.g., [73.64], 
to ensure that the distributed and dynamic aspects of 
the controller are thoroughly debugged before being 
deployed on the physical platform. However, building
a reliable simulator is a feat in its own right. The
simplest simulators are logic-based: events are
discrete and instantaneous in time, e.g., the
transmission of a message or the movement of a mod- 
ule from one lattice position to another. These logic- 
based simulators are useful for experimentally convinc- 
ing oneself that the logic of the distributed algorithm 
under development is correct. If this algorithm is then 
transferred to the physical, reconfigurable robot and 
allowed to run on a carefully debugged application pro- 
gramming interface, it is only real-world issues and 
hardware limitations that stand in the way of success.
However, these issues and limitations should not be
underestimated and include, but are not limited to, unreliable
communication, limited communication bandwidth,
time, parallelism, hardware failures, and differences
between modules in terms of actuation, sensing, and
communication performance. These issues and limitations
can to some degree be handled by proper simulation,
but typically are not considered and thus leave
algorithms stranded in simulation because they are unable
to deal with the real-world constraints of a physical
reconfigurable robot. The gap between the simulated
world and the real world is often referred to as the reality
gap [73.65]. This gap is widened even more by
using simulations based on simplified physics engines,
because precise modeling of the physics of a reconfigurable
modular robot is almost impossible due to
the complex interactions among the modules themselves
and between the modules and the environment. While
physics-based simulations increase the reality gap,
they do allow for a wider area of study, e.g., the study of
locomotion algorithms, although often at the cost of reduced
transfer of results to the physical reconfigurable
robot.
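
A logic-based simulator of the kind described above can be sketched with a simple event queue. The class `LatticeSimulator`, its event kinds, and the module names are hypothetical and serve only as an illustration; the sketch deliberately ignores unreliable communication, timing, and physics, which is exactly what places it on the far side of the reality gap.

```python
import heapq

class LatticeSimulator:
    """Minimal discrete-event simulator: events are instantaneous and ordered by time."""

    def __init__(self, positions):
        self.positions = dict(positions)             # module id -> lattice cell
        self.inbox = {m: [] for m in self.positions}  # delivered messages per module
        self.queue = []                               # heap of (time, sequence, kind, payload)
        self.seq = 0                                  # tie-breaker for simultaneous events
        self.now = 0

    def schedule(self, time, kind, payload):
        heapq.heappush(self.queue, (time, self.seq, kind, payload))
        self.seq += 1

    def run(self):
        while self.queue:
            self.now, _, kind, payload = heapq.heappop(self.queue)
            if kind == "move":
                module, target = payload
                # A move succeeds only if the target lattice cell is free.
                if target not in self.positions.values():
                    self.positions[module] = target
            elif kind == "message":
                sender, receiver, text = payload
                self.inbox[receiver].append((self.now, sender, text))

sim = LatticeSimulator({"a": (0, 0, 0), "b": (1, 0, 0)})
sim.schedule(1, "message", ("a", "b", "advance"))
sim.schedule(2, "move", ("b", (2, 0, 0)))
sim.schedule(3, "move", ("a", (2, 0, 0)))  # blocked: the cell is occupied by "b"
sim.run()
```

A simulator of this form is sufficient to check the logic of a distributed algorithm, but it says nothing about how the algorithm copes with message loss or actuation differences between modules.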


73.5.3 Emerging Solutions 


Above we have presented current approaches and their 
advantages and in particular their disadvantages. It is 
clear that we are far from meeting the challenge of 
efficient programming and debugging of physical re- 
configurable robots. However, there are currently two 
approaches under investigation that may provide some 
solutions. 

One is the use of domain-specific programming
languages [73.66, 67]. The fundamental idea is to expose
programming primitives at the level of abstraction
preferred by the programmer and hide the implementation
of the primitives. For instance, communication
is not necessarily central to the programmer and can,
therefore, be implicitly handled by the programming
language. The advantage of this approach is that it
frees the programmer from error-prone, repetitive programming
and allows them to focus more energy on the
programming challenges related to the task at hand. The
language can also to some degree help deal with hardware
limitations, allowing the programmer to build on
reliable programming primitives. However, there may
be a problem of leaky abstractions where it can be hard
for a programmer to discover hardware problems because
the software that interfaces with the hardware is
hidden from the programmer.

Another development is that reconfigurable robots
increasingly get more and more powerful processors
and communication hardware, a development which is
primarily driven and made possible by the cell-phone
industry. This development opens an opportunity for efficient
programming and debugging of reconfigurable
robots. The reason is that up until now reconfigurable
robots have been resource constrained to the degree that
it was not possible to run programs targeted at debugging
in parallel to the executing program, because either
there was simply not enough available memory and processing
power or the debugging tool would interfere
with the executing program to the degree that its behavior
would be completely altered or it would simply not work at
all. However, with the increase in processing and communication
power this problem may be reduced, and
this will open the door to new forms of programming
middleware of the kind that has been instrumental to the
success of other areas of robotics, such as Player/Stage [73.68]
and the Robot Operating System (ROS) [73.69].

In the broader context, breakthroughs in programming
and debugging of reconfigurable robots will be
a significant contribution that can help us develop solutions
to complex tasks that are currently beyond our
reach.


73.6 Perspective


This concludes the overview of reconfigurable robots.
The question is: where does this leave us? What does
the future of reconfigurable robots and reconfigurable
robot research look like?

First of all, the field of reconfigurable robots has
matured significantly over the last decade, leading to
the first applications of modular robot technology:
Cubelets [73.4], a construction kit teaching children
about emergent behavior in complex systems. Another
example is LocoKit, which has been developed to efficiently
explore morphology-related questions in the
context of robot locomotion [73.5, 70]. The community,
in general, is very engaged in understanding how
reconfigurable robots can be adapted to specific
applications, which in time is likely to lead to more
applications.

From a research point of view, the vision of an
autonomously distributed reconfigurable robot still re- 
mains to be realized. This requires advances in all 
areas covered in this chapter. Like other fields of 
robotics, reconfigurable robotics is nurtured by the 
progressive development of rapid prototyping tech- 
nology and smartphone technology, including wire- 
less charging, which may open the path to novel 
mechatronic designs. Also, the emerging field of 
soft robotics may hold potential for radical new de- 
signs of reconfigurable robots. Overall, there seems 
to be a growing opportunity to exploit these ad- 
vances to design a new generation of reconfigurable 
robots. 

In the area of distributed control, which is probably 
the best understood area of reconfigurable robots, there 


73.7 Further Reading 


This chapter has provided a high-level overview of
the field of reconfigurable robots. Those readers interested
in a complementary introduction should con-


74. Probabilistic Modeling of Swarming Systems


Nikolaus Correll, Heiko Hamann 


This chapter provides an overview of probabilistic
modeling of swarming systems. We first show
how population dynamics models can be derived
from the master equation in physics. We then
present models with increasing complexity and
with varying degrees of spatial dynamics. We will
first introduce a model for collaboration and show
how macroscopic models can be used to derive
optimal policies for the individual robot analytically.
We then introduce two models for collective
decisions; first modeling spatiality implicitly by
tracking the number of robots at specific sites and
then explicitly using a Fokker-Planck equation.
The chapter is concluded with open challenges in
combining non-spatial with spatial probabilistic
modeling techniques.


74.1 From Biological to Artificial Swarms


The swarming behavior of ants, wasps, and bees 
demonstrates the emergence of stupendously complex 
spatio-temporal patterns ranging from a swarm finding 
the shortest paths to the assembly of three-dimensional 
structures with intricate architecture and well-regulated 
thermodynamics [74.1,2]. In the bigger scheme of 
things, these systems represent just the tip of the ice- 
berg; their behavior is considerably less complex than 
that of the brain, cities, or galaxies, all of which are es- 
sentially swarming systems (and all of which can be 
reduced to first principles and interactions on atomic 
scale). Yet, social insects make the world of self- 
organization accessible to us as they are comparably 
easy to observe. Studying these systems is interesting 
from an engineering perspective as they demonstrate 
how collectives can transcend the abilities of the indi- 
vidual member and let the organism as a whole exhibit 
cognitive behavior. 

Cognition is derived from the Latin word cognoscere
and means to know, to recognize, and also to
conceptualize. In the human brain, cognition emerges,
to the best of our knowledge, from the complex interactions
of highly connected, large-scale distributed
neural activity. We argue that cognition can manifest it- 
self at multiple different levels of complexity, ranging 
from conceptualizing collective decisions such as as- 
suming a certain shape or deciding between different 
abstract choices in social insects to reasoning on com- 
plex problems and expressing emotions in humans, the 
combination of the latter two often framed as the Tur- 
ing test in artificial intelligence. This chapter aims at 
developing formal models to capture the characteris- 
tic properties of the most simple cognitive primitives in 
swarming systems. In particular, we wish to understand 
the relationship between the activities of the individ- 
ual member of the swarm and the dynamics that arise at 
collective level. The resulting models can be matched to 
data recorded from physical systems, be used to predict 
the outcome of a robot’s individual behavior on a larger 
swarm, and used in an optimization framework to de- 
termine the best parameters that help improve a certain 
metric [74.3]. 

This chapter reviews probabilistic models of three 
swarming primitives that are examples of conceptualizations
that are exclusively represented at the collective
level: collaboration, collective decision making,
and collective optimization. Guided by examples from 
social insects, we present models that generalize to arbi- 
trary agent systems and can serve as building blocks for 
more complex systems. The probabilistic component of 
the models arises from: 


1. The agent’s motion, which often has a random com- 
ponent 

2. Explicit random decisions made by individual 
agents 

3. Random encounters between agents. 


Randomness in an agent’s motion can be intro- 
duced, for example, by physical properties such as 
slip, by deficits in robot hardware, or by explicitly ex- 
plorative behavior, e.g., based on random turns. It is, 
therefore, reasonable to model at least the single-agent 
behavior with probabilistic methods. Yet, it is possible 
to model the expected swarm-level behavior using deterministic
models. In such a swarm-level model the
underlying stochastic motion of agents is summarized
in macroscopic properties, which are averages such as
the expected swarm fraction in a certain state or at a certain
position [74.4-7].

Such probabilistic models are in contrast with deterministic
models of swarming systems, which explicitly
model the positions of individual robots. Representative
examples include controllers for flocking [74.8],
consensus [74.9], and optimal sensor distribution for
sampling a given probability density function [74.10].
While the robots' spatial distribution is explicitly modeled,
those models have difficulties dealing with randomness
or robot populations in which robots can be in
different states at the same time.

After providing a brief background on phenomenological
probabilistic models based on the master equation,
this chapter will first review population dynamic
models that ignore the spatial distribution of the individual
robots and the swarm and then present models
that explicitly model the spatial distribution of the robot
swarm using time-dependent, spatial probability density
functions.


74.2 The Master Equation


Let a robot be in a discrete set of states with probability
$p_i \in P$, with $P$ a vector maintaining the probabilities
of all possible states and $\sum_i p_i = 1$. These states model
internal states of the robot, determined by its program,
or external states, determined by the state of the robot
within its environment. Actions of the robot and environmental
effects will change these probabilities. This
is captured by a phenomenological set of first-order
differential equations, also known as the master equation
[74.11],

$$\frac{dP}{dt} = A(t)\,P , \tag{74.1}$$

where $A(t)$ is the transition matrix consisting of entries
$p_{ij}(t)$ that correspond to the probability of a transition
from state $i$ to state $j$ at time $t$. Multiplying both sides
with the total number of robots $N_0$ allows us to calculate
the average number of robots in each state. For brevity,
we write

$$N_i(t) = N_0\,p_i(t) .$$

Similarly, when expanding the master equation for
a continuous space variable, one finds the Fokker-Planck
equation, also known as the Kolmogorov forward
equation or the Smoluchowski equation [74.12,
13].
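
As a minimal numerical sketch, the master equation can be integrated with forward Euler steps for a robot with two states. The two states (searching and waiting), the rate constants `K_SW` and `K_WS`, and the helper names are assumptions made for this illustration only. The sketch also assumes a constant matrix $A$ whose off-diagonal entries are transition rates and whose columns sum to zero, so that the total probability is conserved.

```python
def master_equation_step(p, A, dt):
    """One forward-Euler step of dP/dt = A P, with A given as a list of rows."""
    n = len(p)
    return [p[i] + dt * sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]

def integrate(p0, A, dt=0.01, steps=2000):
    """Integrate the master equation from the initial distribution p0."""
    p = list(p0)
    for _ in range(steps):
        p = master_equation_step(p, A, dt)
    return p

# Assumed rates: searching -> waiting (K_SW) and waiting -> searching (K_WS).
K_SW, K_WS = 0.3, 0.1
A = [[-K_SW, K_WS],
     [K_SW, -K_WS]]       # each column sums to zero

p_final = integrate([1.0, 0.0], A)   # all robots start in the searching state
N0 = 20
mean_counts = [N0 * p_i for p_i in p_final]  # average number of robots per state
```

For these assumed rates the distribution approaches $K_{WS}/(K_{SW}+K_{WS}) = 0.25$ for the searching state, so on average about 5 of the 20 robots are searching and 15 are waiting.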


74.3 Non-Spatial Probabilistic Models 


We will first consider two models that assume the spa- 
tial distribution of the agents in the environment to be 
uniform: collaboration and collective decision. 


74.3.1 Collaboration 


An important swarming primitive is collaboration,
which requires a number of agents ($n$) to get together at
a site. Collaboration is different from the more general
task allocation problem, in which the number of agents 
is not explicitly specified. In swarm robotics, a site 
can have spatial meaning, but can also be understood 
in an abstract way as means to form teams. Although 
there are many different algorithms for team formation, 
we focus on a collaboration mechanism that was introduced
in the stick-pulling experiment [74.14] and turned
out to be a recurrent primitive in swarm robotic systems,
e.g., in swarm robotic inspection, where robots can 
serve as temporary markers in the environment [74.15]. 
Here, collaboration happens when an inspecting robot 
encounters a marker, which informs it that this spe- 
cific area has already been inspected. The collaboration 
model, therefore, finds application in studying trade- 
offs between serving as memory to the swarm and 
actively contributing to the swarming behavior’s met- 
ric. 

In the stick-pulling experiment, $N_0$ robots are concerned
with pulling $M_0$ sticks out of the ground in
a bounded environment. This task requires exactly $n = 2$
robots. Physically, this can be understood as a stick
that is too long to be extracted from the ground by 
a single robot. Rather, every robot that grabs the stick 
can pull it out a little further and keeps it there un- 
til the next robot arrives. In this work, we abstract the 
classical stick-pulling experiment to a generic collab- 
oration model in which robots are simply required to 
meet, see also Fig. 74.1. Intuitively, the amount of time 
spent waiting for collaboration to happen is a trade- 
off between (1) waiting at a site to find a collaborator 
and (2) having the chance to find a collaborator oneself 
by actively browsing the environment. Finding the col- 
laboration rate, and the individual parameters that lead 
to it, that is optimal for a given environment, i.e., the 
number of collaboration sites and the number of agents, 
illustrates how probabilistic models can be employed to 
design this process and find optimal collaboration poli- 
cies. 

Fig. 74.1 A collaboration example. $N_0 = 5$ robots (black)
in a bounded environment with $M_0 = 3$ collaboration sites,
each requiring $n = 2$ robots to be present simultaneously
for collaboration to happen

The following model is loosely based on the development
in [74.16], which applies discrete-time
difference equations. For simplicity, we assume that
collaboration happens instantaneously and focus on
a continuous-time representation and stochastic waiting
times. The reader is referred to [74.16] for an extensive
treatment of deterministic waiting times and [74.17] for
an extension to $n > 2$ agents. Variables used in the equations
that follow are summarized in Table 74.1.

Let $n_s(t)$ with $n_s(0) = N_0$ be the number of searching
agents at time $t \in \mathbb{R}^+$ and $N_0$ the total number of
agents. Let $n_w(t) = N_0 - n_s(t)$ be the number of waiting
agents at time $t$. With $p$ the probability to encounter or
match a waiting agent and $T_r$ the average time an agent
will wait for collaboration, we can write

$$\dot{n}_s(t) = -p\,\bigl(M_0 - n_w(t)\bigr)\,n_s(t) \tag{74.2}$$
$$\qquad\qquad + \frac{1}{T_r}\,n_w(t) + p\,n_s(t)\,n_w(t) . \tag{74.3}$$

Thus, $n_s(t)$ decreases by the rate at which searching
agents encounter empty collaboration sites (of which
there exist $M_0 - n_w(t)$ at time $t$), and it increases by
those agents that return either from unsuccessful (at rate
$1/T_r$) or successful collaboration, i.e., find any of the
$n_w(t)$ waiting agents.
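
The dynamics can be explored numerically with a short sketch. It reads (74.2) and (74.3) as $\dot{n}_s = -p\,(M_0 - n_w)\,n_s + n_w/T_r + p\,n_s n_w$ with $n_w = N_0 - n_s$, and integrates this equation with forward Euler steps. The function names, the step size, and the parameter values ($N_0 = 5$, $M_0 = 10$, $p = 0.05$, $T_r = 4$) are assumptions chosen for illustration.

```python
def searching_rate(n_s, N0, M0, p, T_r):
    """Right-hand side of the rate equation for the number of searching agents."""
    n_w = N0 - n_s                       # waiting agents
    to_empty_sites = p * (M0 - n_w) * n_s  # searchers that start waiting at an empty site
    timeouts = n_w / T_r                 # waiting agents that give up
    matches = p * n_s * n_w              # waiting agents released by a successful match
    return -to_empty_sites + timeouts + matches

def simulate_searching(N0, M0, p, T_r, dt=0.01, t_end=200.0):
    """Forward-Euler integration starting with all agents searching."""
    n_s = float(N0)
    for _ in range(int(t_end / dt)):
        n_s += dt * searching_rate(n_s, N0, M0, p, T_r)
    return n_s

n_s_final = simulate_searching(N0=5, M0=10, p=0.05, T_r=4.0)
```

For these assumed values the right-hand side vanishes at $n_s = 2.5$, i.e., on average half of the agents search while the other half wait once the system has settled.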

In order to maximize the collaboration rate in the
system we are interested in maximizing the rate at
which robots return from successful collaboration, i.e.,
$c(t) = p\,n_s(t)\,n_w(t)$.

Solving for $\dot{n}_s(t) = 0$ and substituting $n_w(t) = N_0 - n_s(t)$
allows us to calculate the number of searching robots at
steady-state $n_s^*$

$$n_s^* = \frac{(2N_0 - M_0)\,pT_r - 1}{4pT_r} + \frac{\sqrt{8N_0\,pT_r + \bigl(1 + (M_0 - 2N_0)\,pT_r\bigr)^2}}{4pT_r} . \tag{74.4}$$

As $n_w^* = N_0 - n_s^*$ by definition, we can write

$$c^* = p\,\bigl(n_s^* N_0 - {n_s^*}^2\bigr) . \tag{74.5}$$

Table 74.1 Notation used in the collaboration model

Symbol      Meaning
$n_s(t)$    Average number of searching agents
$n_w(t)$    Average number of waiting agents
$p$         Probability to encounter/match a waiting agent
$c(t)$      Average rate of collaboration matches
$N_0$       Total number of agents
$M_0$       Total number of collaboration sites
$T_r$       Waiting time

The collaboration rate as a function of $T_r$ and $N_0$ is
shown in Fig. 74.2. By solving $dc^*/dn_s^* = 0$, we can
calculate $n_{s,\mathrm{opt}}^* = \frac{1}{2}N_0$ that maximizes $c^*$. Substituting
$n_{s,\mathrm{opt}}^*$ into (74.4) and solving for $T_r$, we can calculate the
optimal waiting time $T_{r,\mathrm{opt}}$ as

$$T_{r,\mathrm{opt}} = \frac{1}{(M_0 - N_0)\,p} . \tag{74.6}$$

As $T_{r,\mathrm{opt}}$ cannot be negative, an optimal waiting time
can only exist if $M_0 > N_0$. This intuitively makes sense,
because if there are fewer agents than collaboration sites,
waiting too long might consume all agents in waiting
states. We can also see that the more collaboration sites
there are, the less an agent should wait. There are two
interesting special cases. First, if $N_0 = M_0$,
$T_{r,\mathrm{opt}}$ is undefined. Second, if collaboration sites
exceed agents by exactly one, $T_{r,\mathrm{opt}}$ is fully defined by
$1/p$. Thus, the higher the likelihood is that agents find
a collaboration site, the lower the waiting time should
be. In this case, it makes sense to release
agents from wait states to search for another agent to collaborate
with. If this likelihood is low, however, agents are better
off waiting to serve as collaborators for the few searching
agents.
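
The steady state and the optimal waiting time can be checked numerically. The sketch below evaluates the positive root of the steady-state condition (the form taken for (74.4)), the resulting collaboration rate $c^* = p\,n_s^*(N_0 - n_s^*)$, and the optimal waiting time (74.6). The function names and the parameter values are assumptions made for illustration.

```python
import math

def steady_state_searching(N0, M0, p, T_r):
    """Positive root of the steady-state condition for the searching agents."""
    a = (2 * N0 - M0) * p * T_r - 1.0
    return (a + math.sqrt(8.0 * N0 * p * T_r + a * a)) / (4.0 * p * T_r)

def collaboration_rate(N0, M0, p, T_r):
    """Steady-state rate of successful collaborations, c* = p n_s* n_w*."""
    n_s = steady_state_searching(N0, M0, p, T_r)
    return p * n_s * (N0 - n_s)

def optimal_wait(N0, M0, p):
    """Optimal waiting time; None if there are not more sites than agents."""
    if M0 <= N0:
        return None
    return 1.0 / ((M0 - N0) * p)

# Assumed example: 5 agents, 10 collaboration sites, encounter probability 0.05.
T_opt = optimal_wait(5, 10, 0.05)                      # 1 / (5 * 0.05) = 4
n_s_opt = steady_state_searching(5, 10, 0.05, T_opt)   # N0 / 2 = 2.5
c_opt = collaboration_rate(5, 10, 0.05, T_opt)
```

In this example, halving or doubling the waiting time relative to `T_opt` lowers the collaboration rate, which agrees with the analytical optimum $n_{s,\mathrm{opt}}^* = N_0/2$.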

With $T_{r,\mathrm{opt}}$ given by (74.6) we can derive the
following guidelines for agent behavior. First, an optimal
wait time exists only if there are fewer agents
than collaboration sites. Otherwise, longer waits improve
the chance of collaboration. Second, if the number
of agents, the number of collaboration sites, and
the likelihood to encounter a collaboration site are
known to each agent at all times, e.g., due to global
communication or shared memory, agents could calculate
$T_{r,\mathrm{opt}}$ at all times. If these quantities are not
known, however, agents can estimate them
based on their interactions in the environment
by observing the rates at which they encounter collaboration
and empty sites. Individual agent learning
algorithms that accomplish this goal are discussed in
detail in [74.18].

Fig. 74.2 The collaboration rate as a function of $T_r$ and $N_0$
for $M_0 = 10$ collaboration sites. There exists an optimal
$T_r$ for $N_0 < M_0$, whereas the collaboration rate increases
steadily otherwise for increasing values of $T_r$

Table 74.2 Notation used in the collective decision model

Symbol      Meaning
$n_s(t)$    Average number of searching agents
$n_i(t)$    Average number of agents committed to choice $i$
$p_i$       Unbiased probability to select choice $i$
$T_i$       Unbiased time to stay with choice $i$
$N_0$       Total number of agents
$M_0$       Total number of choices
$T_i$       Waiting time


74.3.2 Collective Decisions 


Another swarming primitive that exhibits collective
intelligence is collective decision making. Collective decisions can be observed in the
path selection of ants [74.19] or shelter selection of
cockroaches [74.20] or robots [74.21], but can also
have non-spatial meaning, for example when a consensus
on $M_0$ different discrete values is needed. An
example of such a situation is shown in Fig. 74.3.
While the above references provide models that 
are specific to their application, this chapter pro- 
vides a generalized model for collective decisions 
that rely on different ways of social amplification, 
i.e., a change of the behavior based on the ac- 
tivities of other swarm members, or the absence 
thereof. 


Fig. 74.3 Collective decision example. $N_0 = 6$ robots decide
between $M_0 = 2$ choices. Three plus one robot
have already made decisions, two robots remain
undecided

Model parameters for the collective decision model
are summarized in Table 74.2. Let $n_s(t)$ with $n_s(0) = N_0$
be the number of searching/undecided agents at time
$t \in \mathbb{R}^+$ and $N_0$ the total number of agents. Let $p_i$,
$0 < i \le M_0$, be the unbiased probability for an agent to select
value $i$ from $M_0$ different values. This probability is
unbiased as it does not depend on social amplification.
We can then write the following differential equations
for the number of agents $n_i(t)$ that have selected
value $i$


$$\dot{n}_i(t) = R_i(t)\,n_s(t) - Q_i(t)\,n_i(t) , \qquad n_i(0) = 0 , \tag{74.7}$$

$$n_s(t) = N_0 - \sum_{i=1}^{M_0} n_i(t) , \tag{74.8}$$


where $T_i$ is the average time spent on solution $i$ before
resuming search, and $R_i(t), Q_i(t) : n_i(t), n_s(t) \to \mathbb{R}^+$
are functions that might or might not depend on the
number of agents in other states, therefore making
the differential equation for $n_i(t)$ non-linear or linear,
respectively. There are four interesting cases: both $R_i(t)$
and $Q_i(t)$ being constant, both being functions of one or
more states of the system, e.g., $n_i(t)$ or $n_s(t)$, and combinations
thereof.

If both $R_i(t)$ and $Q_i(t)$ are constants, one can show
that the number of agents selecting choice $i$ at steady-state
$n_i^*$ is given by

$$n_i^* = \frac{R_i}{Q_i}\,n_s^* , \tag{74.9}$$

with $n_s^*$ the number of agents that remain undecided
at steady-state. (This results from agents discarding
choices at rate $1/T_i$.) For example, for a two-choice system,
using $n_s^* = N_0 - n_1^* - n_2^*$ yields the steady states

$$n_1^* = \frac{N_0\,Q_2 R_1}{Q_1 Q_2 + Q_2 R_1 + Q_1 R_2} , \tag{74.10}$$

$$n_2^* = \frac{N_0\,Q_1 R_2}{Q_1 Q_2 + Q_2 R_1 + Q_1 R_2} . \tag{74.11}$$


A solution for $R_1 = 0.01$, $R_2 = 0.04$, and $Q_1 = Q_2 = 1/10$
is depicted in Fig. 74.4 and leads to ≈ 7% and
≈ 27% of agents in states one and two, respectively,
while most agents remain undecided. In this system,
the speed at which the steady-state is reached depends
on the values of $R_i$, with higher values of $R_i$ leading
to faster decisions, whereas the steady-state of undecided
agents is determined by $Q_i$, with lower values of
$Q_i$ corresponding to lower values of $n_s^*$. In particular,
values for $Q_i = 1/100$ or $Q_i = 1/1000$ will drastically
increase convergence, in this example to 67% and 78%
for the majority choice, respectively.
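
The percentages quoted above follow directly from the steady-state relation $n_i^* = (R_i/Q_i)\,n_s^*$ together with $n_s^* + \sum_i n_i^* = N_0$. The following sketch evaluates them for constant $R_i$ and $Q_i$; the function name `linear_steady_state` is a hypothetical helper introduced for this illustration.

```python
def linear_steady_state(R, Q, N0=1.0):
    """Steady state for constant rates: returns (n_s*, [n_1*, n_2*, ...])."""
    ratios = [r / q for r, q in zip(R, Q)]
    n_s = N0 / (1.0 + sum(ratios))       # from n_s* + sum_i (R_i / Q_i) n_s* = N0
    return n_s, [ratio * n_s for ratio in ratios]

R = [0.01, 0.04]
undecided_a, committed_a = linear_steady_state(R, [1 / 10, 1 / 10])
undecided_b, committed_b = linear_steady_state(R, [1 / 100, 1 / 100])
undecided_c, committed_c = linear_steady_state(R, [1 / 1000, 1 / 1000])
```

With $N_0 = 1$ the returned values are fractions of the swarm: $1/15$ and $4/15$ for $Q_i = 1/10$, and $4/6$ and $40/51$ for the majority choice with $Q_i = 1/100$ and $Q_i = 1/1000$, respectively.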

If $Q_i(t)$ is constant, but $R_i(t)$ is a non-linear function
of the form $R_i(t) = f[n_i(t)]^{\alpha_i}$ with $\alpha_i > 1$ a constant,
we observe $n_i(t)$ to grow faster due to social amplification
of attraction; the larger $n_i(t)$, the larger the positive
influx into $n_i(t)$. Systems with this property usually converge
much faster than linear systems. For example,
a system with 


(74.12) 


shows faster convergence than a linear system for
$\alpha_i > 1$. Here, normalizing social attraction with $N_0$ provides
independence of the dynamics of the number of
agents. An example with $\alpha_i = 5$ is shown in Fig. 74.4b.

Fig. 74.4a-d Time evolution of a collective decision where solution
two is picked four times as likely as solution one, and both
solutions are re-evaluated after an average of 100 s, for different
non-linear dynamics. Graphs show the fraction of agents picking
solutions one and two. (a) Linear system achieving steady-states of
≈ 20% and ≈ 80%, matching analytical results. (b) Time evolution
of a system with social amplification of attraction using $R_i$ given
by (74.12) and $\alpha_i = 5$ for both choices. (c) Social amplification of
rest using (74.13) and $\beta_i = 5$. (d) Social amplification of both attraction
and rest with $\alpha_i = \beta_i = 5$

Similarly, if $R_i(t)$ is constant, but $Q_i(t)$ is a non-linear
function of the form $Q_i(t) = f[n_i(t)]^{\beta_i}$ with $\beta_i < 0$
a constant, we observe $n_i(t)$ to grow faster due to social
amplification of rest; the larger $n_i(t)$, the smaller
the out-flux from $n_i(t)$. For example, a system with


(74.13) 


nj(t) 
oxy = (1+) 
also shows faster convergence than a linear system. Notice
that we do not consider positive exponents for $\beta_i$,
as this will drive agents away from decisions exponentially
fast and will simply increase $n_s(t)$, i.e., the
number of undecided agents. Results for a two-choice
system with $\beta_i = 5$ are shown in Fig. 74.4c.

Finally, systems that rely both on social amplifica- 
tion of attraction and rest exhibit the best convergence, 
when compared with a purely linear system as well 
as systems that rely only on either social amplifica- 
tion mechanism. Results for a two-choice system with
$\alpha_i = \beta_i = 5$ are shown in Fig. 74.4d.
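
The qualitative effect of social amplification can be reproduced with a small simulation. The sketch integrates $\dot{n}_i = R_i n_s - Q_i n_i$ with forward Euler steps. The amplification terms used here, $R_i = p_i\,(1 + n_i/N_0)^{\alpha}$ and $Q_i = (1/T_i)\,(1 + n_i/N_0)^{-\beta}$, are an assumption made for this illustration and are not claimed to be the exact expressions (74.12) and (74.13); the rates $p_i$ (per second), the function name, and the step size are likewise illustrative.

```python
def simulate_decision(p, T, alpha=0.0, beta=0.0, N0=1.0, dt=0.1, t_end=2000.0):
    """Forward-Euler integration of the collective decision model.

    alpha = beta = 0 gives the linear system; the amplification forms
    below are assumed for illustration only.
    """
    n = [0.0 for _ in p]                       # all agents start undecided
    for _ in range(int(t_end / dt)):
        n_s = N0 - sum(n)                      # undecided agents
        rates = []
        for i in range(len(p)):
            crowd = 1.0 + n[i] / N0            # normalized by the swarm size
            R_i = p[i] * crowd ** alpha              # amplification of attraction
            Q_i = (1.0 / T[i]) * crowd ** (-beta)    # amplification of rest
            rates.append(R_i * n_s - Q_i * n[i])
        n = [n_i + dt * r for n_i, r in zip(n, rates)]
    return n

p = [0.01, 0.04]        # choice two is picked four times as likely as choice one
T = [100.0, 100.0]      # both choices are re-evaluated after 100 s on average
linear = simulate_decision(p, T)
attraction = simulate_decision(p, T, alpha=5.0)
both = simulate_decision(p, T, alpha=5.0, beta=5.0)
```

Under these assumptions the linear system settles at a fraction of $4/6$ for the majority choice, while both amplified variants end with a larger majority fraction.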

Similar models, i.e., models that rely on non-linear
amplification of either attraction, rest, or both, have been
proposed for a series of social insect experiments. For
example, in [74.19] an ant colony is presented with a bi- 
nary choice to select the shortest of two branches of 
a bridge that connect their nest to a food source. Here, 
a model with social amplification of attraction — by 
means of an exponentially higher likelihood to choose 
a branch with higher pheromone concentration — is cho- 
sen and successfully models the dynamics observed 
experimentally. In [74.20] a model that uses social am- 
plification of rest is chosen to model the behavior of 
a swarm of cockroaches deciding between two shelters 
of equal size but different brightness. The preference 


of cockroaches for darker shelters is expressed with 
a higher p; for this shelter. Convergence to the dark 
shelter is then achieved by social amplification of rest, 
increasing the time cockroaches remain in a shelter 
exponentially with the number of individuals that are 
already in the shelter. Here, all cockroaches converge 
to a single shelter, even though the model proposed 
in [74.20] employs negative social amplification of at- 
traction by introducing a notion of shelter capacity, 
which cancels the positive term in 7;(t) when the shel- 
ter reaches a constant carrying capacity. Finally, [74.22] 
presents a model for cockroach aggregation in which 
the likelihood to join an aggregate of cockroaches in- 
creases with the size of the aggregate, whereas the 
likelihood to leave a cluster exponentially decreases 
with its size. 

The examples from the social insect domain are 
trade-offs between the expressiveness of the model and 
its complexity. As the true parameter values of $\alpha_i$ and
$\beta_i$ are unknown, the same experimental data can be
accurately matched by models with different dynam- 
ics. For example, social amplification of attraction as 
observed on larvae of German cockroaches in [74.22] 
was deemed to have negligible influence on mature 
American cockroaches selecting shelters with limiting 
carrying capacity in [74.20]. 

With respect to artificial agent and robotic systems, 
the models presented can instead provide design guide- 
lines for achieving a desired convergence rate. At the 
same time, the models are able to support decisions 
on sensing and communication sub-systems that are 
required to implement one or the other social amplifi- 
cation mechanism. 


74.4 Spatial Models: Collective Optimization 


The concept of optimization in collective systems is 
difficult to separate from the concept of collective 
decisions. Rather, there seems to be a continuous tran- 
sition. Collective decisions are made between several 
distinct alternatives, implying a discrete world of op- 
tions (e.g., left and right branches in path selection, 
two shelters, etc.). Typically, one refers to the term op- 
timization in collective systems in the case of tasks 
that allow for a vast (possibly even infinite) num- 
ber of alternatives implying a continuous world of 
options. 

For this optimization scenario we apply the probabilistic
model reported in [74.4, 5, 23-25]. It is based
on a stochastic differential equation (SDE, the Langevin
equation) and a partial differential equation (PDE, the
Fokker-Planck equation), which can be derived from
the former. While the Langevin equation is a stochastic 
description of the trajectory in space over time of a sin- 
gle robot, the Fokker—Planck equation describes the 
temporal evolution of the probability density in space 
for these trajectories. Hence, it can be interpreted as the 
average over many samples of robot trajectories (i. e., 
ensembles of trajectories). Even a second, more daring 
interpretation arises. We can interpret this probability 
density directly as a swarm density, that is, the expected 
fraction of the robot swarm for a given area and time. 
The deterministic PDE describes the mean swarm fraction
in space and time. Interactions between robots can
be modeled via dependence on the swarm density it- 
self [74.4]. 

We introduce our formalism (see Table 74.3 for 
a summary of all variables used). The Langevin equation
that gives the position of a robot $R$ at time $t$ is

$$\dot{R}(t) = -A(R(t), t) + B(R(t), t)\,F(t) ,$$ (74.14)

where $A$ defines directed motion via drift depending
on the current position $R$ and $B(R(t), t)F$ defines random
motion based on $F$, which is a stochastic process
(e.g., white noise). Based on the Langevin equation the
Fokker-Planck equation can be derived [74.4, 11, 12, 26]

$$\frac{\partial \rho(r,t)}{\partial t} = -\nabla\big(A(r,t)\rho(r,t)\big) + \frac{1}{2} Q \nabla^2\big(B^2(r,t)\rho(r,t)\big) ,$$ (74.15)

for a swarm density $\rho(r,t)$ (according to the above
interpretation) at position $r$ and time $t$, a drift term
$-\nabla(A(r,t)\rho(r,t))$ due to directed motion, and a diffusion
term $\frac{1}{2} Q \nabla^2(B^2(r,t)\rho(r,t))$ due to random motion,
where we typically set $Q = 2$ for simplicity.
According to our general approach [74.4] we introduce 
a Fokker-Planck equation for each robot state and man- 
age the transitions between states by rates similar to the 
rate equation approach of the above sections. 
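
As a rough illustration of the Langevin description (74.14), a single robot trajectory can be sampled with an Euler-Maruyama step. The sketch below is one-dimensional and rests on several assumptions that are not part of the model above: the function name `simulate_langevin`, the step size, the choice of Gaussian white noise for $F$, and the reflecting walls are all illustrative.

```python
import math
import random


def simulate_langevin(drift, noise, r0, dt, steps, rng, lo=0.0, hi=1.0):
    """Sample one 1-D trajectory of dR/dt = -A(R, t) + B(R, t) F(t).

    drift, noise: callables (r, t) -> float standing in for A and B.
    F(t) is taken to be Gaussian white noise, so each Euler-Maruyama
    step adds noise * sqrt(dt) * N(0, 1).
    """
    r, t = r0, 0.0
    path = [r]
    for _ in range(steps):
        r += -drift(r, t) * dt + noise(r, t) * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        # Assumed boundary handling: reflect once at a wall, then clamp.
        if r < lo:
            r = 2.0 * lo - r
        elif r > hi:
            r = 2.0 * hi - r
        r = min(max(r, lo), hi)
        t += dt
        path.append(r)
    return path


if __name__ == "__main__":
    rng = random.Random(0)
    # Purely random motion (A = 0) with a constant noise intensity B.
    path = simulate_langevin(lambda r, t: 0.0, lambda r, t: 0.3,
                             r0=0.5, dt=0.01, steps=1000, rng=rng)
    print(min(path), max(path))
```

Averaging a histogram of many such trajectories at a fixed time gives an empirical estimate of the density that the Fokker-Planck equation (74.15) describes.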

Table 74.3 Notation used in the optimization model

Symbol       Meaning
$R$          Robot position
$A$          Direction and intensity of robots' directed motion
$B$          Intensity of robots' random motion
$F$          Stochastic process (fluctuating directions)
$r$          Point in space
$Q$          Theoretic term describing intensity of collisions
$\rho_s$     Expected density of robots in state stopped
$\rho_m$     Expected density of robots in state moving
$w$          Waiting time
$\varphi$    Rate of stopping robots

The optimization scenario considered here was inspired
by the behavior of young honeybees. The algorithm,
which defines the robots' behavior, is derived
from a behavioral model of honeybees [74.27, 28].
Honeybees of an age of less than 24 h stay in the hive,
cannot yet fly, navigate towards spots of a preferred
warmth of 36 °C, and stay mostly inactive. An interesting
example of swarm intelligent behavior is how they
search and find the right temperature that their bodies
need. It turns out that they do not seem to do a gradient
ascent in the temperature field but rather a correlated
random walk with inactive periods triggered by social
interaction. Both the above-mentioned behavioral
model and the robot controller, called BEECLUST,
are defined by the following:


1. Each robot moves straight until it perceives an obstacle
   $\Omega$ within sensor range.

2. If $\Omega$ is a wall, the robot turns away and continues
   with step 1.

3. If $\Omega$ is another robot, the robot measures the local
   temperature. The higher the temperature is, the
   longer the robot stays stopped. When the waiting
   time elapses, the robot turns away from the other robot
   and continues with step 1.
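
The three controller steps above can be sketched as a small agent-based simulation. This is a minimal illustration and not the simulation used in the chapter: the arena size, the shape of the temperature field, the mapping from temperature to waiting time, the forward-facing sensor, and all function names are assumptions chosen only to make the example runnable.

```python
import math
import random

WIDTH, HEIGHT = 100.0, 20.0  # arena size (assumed)


def temperature(x, width=WIDTH):
    """Assumed field: 32 degC peak at the left end, 36 degC at the right end."""
    left = 12.0 * math.exp(-((x / width) / 0.2) ** 2)
    right = 16.0 * math.exp(-(((width - x) / width) / 0.2) ** 2)
    return 20.0 + max(left, right)


def waiting_time(temp, w_max=60):
    """Assumed monotone map from temperature to waiting time (in steps)."""
    return max(1, int(w_max * ((temp - 20.0) / 16.0) ** 2))


def step(robots, rng, speed=1.0, sensor_range=2.0):
    for i, bot in enumerate(robots):
        if bot["wait"] > 0:
            bot["wait"] -= 1
            if bot["wait"] == 0:
                # Waiting elapsed: head away from the robot that was met.
                bot["heading"] = bot["away"] + rng.uniform(-0.5, 0.5)
            continue
        # Step 1: move straight.
        nx = bot["x"] + speed * math.cos(bot["heading"])
        ny = bot["y"] + speed * math.sin(bot["heading"])
        # Step 2: a wall blocks the move, so turn (simplified to a random turn).
        if not (0.0 <= nx <= WIDTH and 0.0 <= ny <= HEIGHT):
            bot["heading"] = rng.uniform(0.0, 2.0 * math.pi)
            continue
        bot["x"], bot["y"] = nx, ny
        # Step 3: another robot ahead within sensor range, so stop and wait.
        for j, other in enumerate(robots):
            if j == i:
                continue
            dx, dy = other["x"] - bot["x"], other["y"] - bot["y"]
            ahead = dx * math.cos(bot["heading"]) + dy * math.sin(bot["heading"]) > 0.0
            if ahead and math.hypot(dx, dy) <= sensor_range:
                bot["wait"] = waiting_time(temperature(bot["x"]))
                bot["away"] = math.atan2(-dy, -dx)
                break


def run_beeclust(n_robots=25, steps=500, seed=0):
    rng = random.Random(seed)
    robots = [{"x": rng.uniform(0.0, WIDTH), "y": rng.uniform(0.0, HEIGHT),
               "heading": rng.uniform(0.0, 2.0 * math.pi), "wait": 0, "away": 0.0}
              for _ in range(n_robots)]
    for _ in range(steps):
        step(robots, rng)
    return robots
```

Note that the robots never sense the temperature gradient; the field only enters through the waiting time, which is the point of the controller.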


The temperature field that we investigate in the sce- 
nario here has one global optimum (36 °C) at the right 
end of the arena and one local optimum (32 °C) at the 
left end of the arena. In analogy to the behavior ob- 
served in young honeybees, the swarm is desired to 
aggregate fully at the global optimum but, at the same 
time, should also stay flexible within a possibly dy- 
namic environment. The latter is implemented by robots 
(bees) that leave the cluster from time to time and ex- 
plore the remaining arena. If a more preferable spot 
were to emerge elsewhere they would start to aggregate 
there, and the former cluster might shrink in size and 
finally vanish fully. 

Now we apply the above modeling approach to 
this scenario. We have two states: moving and stopped. 
It turns out that in the moving state we do not have 
any directed motion, hence, we will turn off the bias 
in the Langevin equation (74.14, A = 0). Without any 
directed motion in BEECLUST (no gradient ascent, 
actually movement fully independent from the temper- 
ature field) the Fokker—Planck equation can be reduced 
to a mere diffusion equation in order to model the moving
robots

$$\frac{\partial \rho_m(r,t)}{\partial t} = \nabla^2\big(B^2(r,t)\rho_m(r,t)\big) .$$ (74.16)

This equation is our approach for state moving yet without
addressing state transition rates.

The state stopped is even easier to model as it naturally
lacks motion. That way it can be viewed as
a reduction to a mere rate equation defined in each
position $r$. The state transition rates are defined by
a stopping rate $\varphi$, which can be determined, for example,
empirically or by geometrical investigations (e.g.,
calculation of collision probabilities) [74.4]. For the
stopped state we obtain

$$\frac{\partial \rho_s(r,t)}{\partial t} = \rho_m(r,t)\varphi - \rho_m(r,t-w(r))\varphi ,$$ (74.17)

for a stopping swarm fraction $\rho_m(r,t)\varphi$ at spot $r$ and
time $t$, and an awakening swarm fraction
$\rho_m(r,t-w(r))\varphi$. The robots stop and wait for a time period $w(r)$,
which depends on the temperature at spot $r$.

Here, we choose to approximate the robots' correlated
random walk as mere diffusion in a rough estimation.
The function $B$ in (74.16) is reduced to a diffusion
constant $D$. We add the rates of stopping/awakening and
obtain the equation for state moving

$$\frac{\partial \rho_m(r,t)}{\partial t} = D\nabla^2\rho_m(r,t) - \rho_m(r,t)\varphi + \rho_m(r,t-w(r))\varphi .$$ (74.18)

If we ignore diffusion and focus on one point in space
we would have a mere rate equation similar to the above
sections (except for the time delay)

$$\dot{\rho}_m(t) = -\rho_m(t)\varphi + \rho_m(t-w(r))\varphi .$$ (74.19)


Using (74.18) ((74.17) is mathematically not necessary) 
we can model the BEECLUST behavior. For a provided 
initial distribution of the robots we end up with an initial 
value problem for a PDE that we can solve numerically. 
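
One simple way to solve such an initial value problem is an explicit finite-difference scheme with a history buffer for the delayed term. The following is a one-dimensional sketch under stated assumptions: a constant stopping rate, waiting times given as whole numbers of time steps, no-flux walls, and no robots stopped before $t = 0$. The function name and all parameter values are hypothetical, and the stopped density is tracked only to check that the total swarm fraction is conserved.

```python
def solve_moving_density(rho0, wait_steps, diff, phi, dx, dt, steps):
    """Explicit Euler scheme for the delay equation of the moving state.

    rho0       : initial moving density per grid cell
    wait_steps : waiting time w per cell, in units of dt (integers >= 1)
    diff, phi  : diffusion constant D and stopping rate
    Stability of the explicit scheme needs diff * dt / dx**2 <= 0.5.
    Returns the moving and stopped densities after `steps` steps.
    """
    n = len(rho0)
    rho_m = list(rho0)
    rho_s = [0.0] * n
    history = [list(rho_m)]  # history[k][i] is rho_m in cell i at step k
    for k in range(steps):
        new = [0.0] * n
        for i in range(n):
            # No-flux walls: mirror the boundary cell.
            left = rho_m[i - 1] if i > 0 else rho_m[i]
            right = rho_m[i + 1] if i < n - 1 else rho_m[i]
            laplace = (left - 2.0 * rho_m[i] + right) / dx ** 2
            stopping = phi * rho_m[i]
            k_delayed = k - wait_steps[i]
            # Robots that stopped w steps ago wake up now.
            waking = phi * history[k_delayed][i] if k_delayed >= 0 else 0.0
            new[i] = rho_m[i] + dt * (diff * laplace - stopping + waking)
            rho_s[i] += dt * (stopping - waking)
        rho_m = new
        history.append(list(rho_m))
    return rho_m, rho_s


if __name__ == "__main__":
    cells = 20
    waits = [5 + 2 * i for i in range(cells)]  # longer waits towards the right
    moving, stopped = solve_moving_density([1.0] * cells, waits,
                                           diff=1.0, phi=0.5, dx=1.0, dt=0.1, steps=400)
    print(stopped[0], stopped[-1])
```

In this sketch the stopped density accumulates where the waiting time is long, which mirrors the cluster formation at warm spots described in the text.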


Fig. 74.5a-f Comparison of histograms of 
swarm density obtained by an agent-based 
simulation and the corresponding model 
based on (74.18) for different times and an 
initial state with equal distribution of robots. 
An optimal temperature peak of 36 °C is at 
the right end of the arena, at the left end there 
is a suboptimal peak in temperature of 32°C, 
the middle part is cooler. We observe that 

on average at first clusters form at both ends 
of the arena, but later those on the left van- 
ish. Swarm size is N = 25. The histograms 
obtained by simulation are based on 10° sam- 
ples. (a) Simulation, t = 30; (b) model, t = 
30; (c) simulation, t = 130; (d) model, t = 
130 (e) simulation, t = 200; (f) model, t = 
200 


The solution of this initial value problem is the tem- 
poral evolution of the swarm density. In Fig. 74.5 we 
compare the model to the results obtained by a simple 
agent-based simulation of BEECLUST. This compari- 
son is meant to be qualitative only. The model catches 
most of the qualitative features that occur in simulation, 
although we approximate the robots’ motion in a rough 
estimation by diffusion. 

Our approach shows how borders between the fields 
of engineering and biology vanish in swarm robotics. 
The BEECLUST algorithm is at the same time a con- 
troller for robots but also a behavioral model of an 
animal. The same Fokker—Planck model is used to 
model the macroscopic behavior of honeybees and 
robot swarms. 

The Fokker—Planck model gives good estimates 
for expected swarm densities in space, the tran- 
sient/asymptotic behavior of the swarm, and density 
flows. Modeling space explicitly allows for specific in- 
vestigations such as objective areas and obstacles of 
certain shapes. Other case studies included an emer- 
gent taxis task which relies on one group of robots that 
is pushing another group by collision avoidance [74.4], 
a collective perception task in which robots have to dis- 
criminate aggregation areas of different sizes [74.29], 
and a foraging task [74.30]. This model is mostly rele- 
vant to scenarios with spatially inhomogeneous swarm 
densities, that is, swarms forming particular spatial 
structures that cannot be averaged over several runs. 
74.5 Conclusion


We presented mathematical models for three distributed 
swarming behaviors: collaboration, deciding between 
different choices, and optimization. Each of these processes
is a collective decision, and the three are ordered by increasing complexity.

While the behaviors and trajectories of individual 
robots might be erratic and probabilistic, the aver- 
age swarm behavior might be considered deterministic. 
This holds for both the models and the observed real- 
ity in robotic and biological experiments. An analogy 
is the distinction between the complex, microscopic 
dynamics of multi-particle systems and the much sim- 
pler properties of the corresponding ensembles of such 
systems in thermodynamics. This insight is important 
as it allows us to design the individual behavior so 
75. A Robust Evolving Cloud-Based Controller


Plamen P. Angelov, Igor Škrjanc, Sašo Blažič 


In this chapter a novel online self-evolving cloud- 
based controller, called Robust Evolving Cloud- 
based Controller (RECCo) is introduced. This type 
of controller has a parameter-free antecedent 
(IF) part, a locally valid PID consequent part, and 
a center-of-gravity based defuzzification. A first- 
order learning method is applied to consequent 
parameters and reference model adaptive con- 
trol is used locally in the ANYA type fuzzy rule- 
based system. An illustrative example is provided 
mainly for a proof of concept. The proposed con- 
troller can start with no pre-defined fuzzy rules 
and does not need to pre-define the range of the 
output, number of rules, membership functions, 
or connectives such as AND, OR. This RECCo con- 
troller learns autonomously from its own actions 
while controlling the plant. It does not use any 
off-line pre-training or explicit models (e.g. in 
the form of differential equations) of the plant. 
It has been demonstrated that it is possible to 
fully autonomously and in an unsupervised man- 
ner (based only on the data density and selecting 
representative prototypes/focal points from the 
control hypersurface acting as a data space) gen- 
erate and self-tune/learn a non-linear controller 
structure and evolve it in online mode. Moreover, 
the results demonstrate that this autonomous
controller has no parameters in the antecedent
part and surpasses traditional PID controllers,
because it is a non-linear, fuzzy combination of locally
valid PID controllers. It also surpasses traditional fuzzy
(Mamdani and Takagi-Sugeno) type controllers
through its lean structure, higher performance,
and lack of membership functions and antecedent
parameters, and because it does not need off-line
tuning.


75.1 Overview of Some Adaptive and Evolving Control Approaches 


Fuzzy logic controllers were proposed some four
decades ago by Mamdani and Assilian [75.1]. Their 
main advantage is that they do not require the model 
of the plant to be known and their linguistic form is 
closer to the way human reasoning is expressed and 
formalized. It is difficult to identify all possible events 
or the frequency of their occurrences while modeling 
a system. The lack of this knowledge requires use of an 
approximate model of a system. 


Due to the fact that a fuzzy logic algorithm has the 
characteristic of a universal approximator, it is possible 
to model systems containing unknown nonlinearities 
using a set of IF-THEN fuzzy rules. 

The main challenge in designing conventional
fuzzy controllers is that they are sometimes designed
to work only in certain modeling conditions [75.2].
Moreover, fuzzy controllers include at least two parameters
per fuzzy set, which are usually predefined
and tuned off-line [75.3]. Many techniques have been
presented for auto-tuning of the parameters of con- 
trollers in batch mode [75.4], mostly using genetic 
algorithms [75.5, 6] or neural networks [75.7, 8] off-line.
From a practical point of view, however, there
is no guarantee that pre-trained parameters will perform
satisfactorily in online applications when the
environment or the object of the controller changes. To
tackle this problem several approaches have been proposed
for online adaptation of fuzzy parameters [75.9-15].

Nevertheless, only a few approaches have been 
introduced for online adaptation of fuzzy controller 
structures when no prior knowledge of the system is 
available. Evolving fuzzy rule-based controllers were 
introduced in 2001 by Angelov et al. [75.16]. They al- 
low the controller structure (fuzzy rules, fuzzy sets, 
membership functions, etc.) to be created based on 
data collected online. This is based on a combination 
of inverse plant dynamic modeling [75.17] using self- 
evolving fuzzy rule-based systems [75.18]. The pro- 
posed approach is applied to autonomously learning 
controllers that are self-designed online. 

The advantage of this method is that there is no 
need for pre-tuning of the control parameters. More- 
over, the proposed method can start with an empty 
topology, and the structure of the controller is modified 
online based on the data obtained during the opera- 
tion of the closed loop system. Two main phases were 
introduced for parameter learning of the controller’s 
consequents and modifying the structure of the con- 
troller. The proposed approach was successfully applied 
to a nonlinear servo system consisting of a DC motor 
and showed satisfactory performance [75.19], as well 
as to control of mobile robots [75.20]. The drawback of 
this approach is that the addition of new membership 
functions increases the number of rules exponentially, 
and each membership requires at least two parameters 
in the antecedent part to be specified, plus connec- 
tives such as AND, OR, NOT. Many of these problems 
have been overcome with the latest version of the ap- 
proach [75.21], which combines the Angelov—Yager 
(ANYA) type fuzzy rule-based system (FRB) [75.22] 
with the inverse plant dynamics model. ANYA can 
be seen as the next form of FRB system types after 
the two well-known Mamdani and Takagi—Sugeno type 
FRBs. It does not require the membership functions 
to be defined for the antecedent part, nor the connec- 
tives such as AND, OR, NOR. It still has a linguistic 
form and is non-linear. It is fuzzy in terms of the
defuzzification. In order to clarify what that means let us
recall that all three types of FRB (Mamdani, Takagi-Sugeno,
and ANYA) can be represented as a set of
fuzzy rules of the form IF (antecedent) THEN
(consequent). While in the Mamdani type FRB both 
antecedent and consequent parts are fuzzy, in the so- 
called Takagi-Sugeno type FRB the consequent part 
is a functional, f(x) (most often linear) with the an- 
tecedent part being fuzzy. In both types of FRB the 
defuzzification can be either of so-called center-of- 
gravity (COG) or winner takes all (WTA) type. There 
are variations such as few winners take all, etc., but usu- 
ally COG is applied unless a classification problem is 
considered, where WTA performs better. In ANYA type 
FRBs the antecedent part is defined using an alterna- 
tive, density-based representation which is parameter- 
free and reduces the problems of definition and tuning 
of membership functions (one of the stumbling blocks 
in the application of the fuzzy set theory overall). The 
consequent part of the ANYA type FRB can still be the
same as in Takagi-Sugeno type FRBs. For more detail,
the reader is referred to [75.21, 22]. The so-called
SPARC self-evolving controller, however, has poorer
performance in the first moments when applied from
scratch (with no pre-trained model and no rules). In this
chapter, model reference control and gradient-based
learning of the consequents of the individual locally
valid rules are used. In the proposed method the antecedent part
is determined using focal points/prototypes (selected 
descriptive actual data points) instead of pre-defining 
the membership functions in an explicit manner. The 
fuzzy rules are formed around selected representative 
points from the control surface; thus, there is no need 
to define the membership functions per variable. It has 
a much simplified antecedent part which is formed us- 
ing so-called data clouds. Fuzzy data clouds are fuzzy 
sets of data samples which have no specific shape, pa- 
rameters, or boundaries. With ANYA type FRBs the 
relative density is used to define the relative member- 
ship to a particular cloud. It takes into account the 
distance to all previous data samples and can be cal- 
culated recursively. 

In order to show the effective performance of the 
proposed controller, it is applied to a simulated problem 
of temperature control in a water bath [75.21]. 

The remainder of this chapter is organized as fol- 
lows. In Sect. 75.2 the new simplified FRB system 
is introduced, including rule representation and the 
associated inference process. The evolving methodol- 
ogy used for the online learning of both the struc- 
ture and the parameters of RECCo is described in 
Sect. 75.3. First, we present the mechanism for the
online adaptation of the consequents in Sect. 75.3.1
and then, we illustrate the structure evolution process
in Sect. 75.3.2. In Sect. 75.4, the simulation example
is presented as a proof of concept for the proposed
methodology. Finally, conclusions are drawn in
Sect. 75.5.


75.2 Structure of the Cloud-Based Controller 


ANYA [75.22] is a recently proposed type of FRB 
system characterized by the use of non-parametric 
antecedents. Unlike traditional Mamdani and Takagi- 
Sugeno FRB systems, ANYA does not require an ex- 
plicit definition of fuzzy sets (and their corresponding 
membership functions) for each input variable. On the 
contrary, ANYA applies the concepts of fuzzy data 
clouds and relative data density to define antecedents 
that represent exactly the real data density and distri- 
bution and that can be obtained recursively from the 
streaming data online. 

Data clouds are subsets of previous data samples 
with common properties (closeness in the data space). 
Contrary to traditional membership functions, they rep- 
resent directly and exactly all the previous data samples. 
Some given data can belong to all the data clouds with 
a different degree $\gamma \in [0, 1]$, thus the fuzziness in the
model is preserved and used in the defuzzification, as 
will be shown later. It is important to stress that clouds 
are different from traditional clusters in that they do not 
have specific shapes and, thereby, do not require the 
definition of boundaries. 

The use of ANYA to design fuzzy controllers for
situations in which the lack of knowledge about the
plant makes it difficult to define the rule antecedents
was first proposed in [75.21]. SPARC autonomous controllers
have a rule base with $N$ rules of the following form

$$\mathcal{R}^i : \text{IF } (x \sim X^i) \text{ THEN } (u^i) ,$$ (75.1)

where $\sim$ denotes the fuzzy membership expressed
linguistically as "is associated with", $X^i \in \mathbb{R}^n$ is the $i$-th data
cloud defined in the input space, $x = [x_1, x_2, \ldots, x_n]^T$ is
the controller's input vector, and $u^i$ is the control action
defined by the $i$-th rule.

It is to be noted that no aggregation operator is re- 
quired to combine premises of the form IF x; is x, as 
in traditional fuzzy systems. All the remaining compo- 
nents of the FRB system (e.g., the consequents and the 
defuzzification method) can be selected as in any of the 
traditional fuzzy systems. 

A rule base of the form (75.1) can describe complex,
generally non-linear, non-stationary, non-deterministic
systems that can be only observed through their inputs
and outputs. Hence, autonomous controllers based on
ANYA type FRB systems are suitable to describe
dependence of the type IF X THEN U based on the history
of pairs of data observations of the form $z_j = [x_j^T; u_j]^T$
(with $j = 1, \ldots, k-1$ and $z \in \mathbb{R}^{n+1}$), and the current
$k$-th input $x_k$.

The degree of membership of the data sample $x_k$
to the cloud $X^i$ is measured by the normalized relative
density as follows

$$\lambda_k^i = \frac{\gamma_k^i}{\sum_{j=1}^{N} \gamma_k^j} , \quad i = 1, \ldots, N ,$$ (75.2)

where $\gamma_k^i$ is the local density of the $i$-th cloud for that
data sample.

This local density is defined by a suitable kernel
over the distance between $x_k$ and all the other samples
in the cloud, i.e.,

$$\gamma_k^i = K\Big(\sum_{j=1}^{M^i} d_{kj}^i\Big) , \quad i = 1, \ldots, N ,$$ (75.3)

where $d_{kj}^i$ denotes the distance between the data samples
$x_k$ and $x_j$, and $M^i$ is the number of input data samples
associated with the cloud $X^i$.

In a similar manner, we consider that a sample is
associated with the cloud with the highest local density.
In addition, we use the Euclidean distance, i.e.,
$d_{kj}^i = \|x_k - x_j\|^2$. Nonetheless, any other type of distance could
also be used [75.22].

In this study we used a Cauchy kernel. Thereby,
(75.3) can be recursively determined as follows [75.23]

$$\gamma_k^i = \frac{1}{1 + \|x_k - \mu_k^i\|^2 + \Sigma_k^i - \|\mu_k^i\|^2} ,$$ (75.4)

where $\gamma_k^i$ denotes the relative density to the $i$-th data
cloud calculated in the $k$-th time instant, and $\Sigma_k^i$ denotes
the scalar product of the data $x_k$

$$\Sigma_k^i = \frac{k-1}{k} \Sigma_{k-1}^i + \frac{1}{k} \|x_k\|^2 ,$$ (75.5)

with starting condition $\Sigma_1^i = \|x_1\|^2$. The update of the
mean value $\mu_k^i$ is straightforward

$$\mu_k^i = \frac{k-1}{k} \mu_{k-1}^i + \frac{1}{k} x_k ,$$ (75.6)

with the starting condition $\mu_1^i = x_1$.
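
The recursive density computation can be sketched as follows. This is a minimal illustration that assumes the mean and the mean squared norm are updated with weights $(k-1)/k$ and $1/k$, where $k$ counts the samples seen by the cloud; the class and method names are hypothetical. The useful property to check is that the recursive value equals the Cauchy kernel evaluated on the mean squared distance to all stored samples, so no sample history has to be kept.

```python
class DataCloud:
    """Running statistics of one data cloud (illustrative names)."""

    def __init__(self, x):
        self.k = 1
        self.mean = list(x)                    # mu_1 = x_1
        self.sq_norm = sum(v * v for v in x)   # Sigma_1 = ||x_1||^2

    def add(self, x):
        """Fold a new sample into the running mean and mean squared norm."""
        self.k += 1
        k = self.k
        self.mean = [(k - 1) / k * m + v / k for m, v in zip(self.mean, x)]
        self.sq_norm = (k - 1) / k * self.sq_norm + sum(v * v for v in x) / k

    def density(self, x):
        """Cauchy-type local density of x with respect to this cloud."""
        dist2 = sum((v - m) ** 2 for v, m in zip(x, self.mean))
        mean2 = sum(m * m for m in self.mean)
        return 1.0 / (1.0 + dist2 + self.sq_norm - mean2)


if __name__ == "__main__":
    points = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
    cloud = DataCloud(points[0])
    for p in points[1:]:
        cloud.add(p)
    print(cloud.density([1.0, 1.0]))
```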

As for the defuzzification, it is to be noted that
ANYA can work with both Mamdani and
Takagi-Sugeno-Kang (TSK) consequents. In this case, we use
the latter type, as is usual in control applications [75.19,
24-26]. Hence, if we consider the weighted average for
the defuzzification, the output of the ANYA controller is

$$u_k = \sum_{i=1}^{N} \lambda_k^i u_k^i = \frac{\sum_{i=1}^{N} \gamma_k^i u_k^i}{\sum_{i=1}^{N} \gamma_k^i} ,$$ (75.7)

where $u_k^i$ denotes the $i$-th rule consequent.
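
The weighted-average defuzzification of (75.7) amounts to a few lines of code. The sketch below assumes that the local densities and the rule consequents are already available as plain lists; the function name is hypothetical.

```python
def defuzzify(densities, consequents):
    """Weighted average of rule consequents, weighted by normalized density.

    densities   : local density of the current sample for each cloud
    consequents : control action proposed by each rule
    Returns the controller output and the normalized memberships.
    """
    total = sum(densities)
    if total <= 0.0:
        raise ValueError("at least one density must be positive")
    memberships = [g / total for g in densities]
    output = sum(lam * u for lam, u in zip(memberships, consequents))
    return output, memberships


if __name__ == "__main__":
    u, lam = defuzzify([0.5, 0.25, 0.25], [1.0, 2.0, 4.0])
    print(u, lam)
```

With a Cauchy kernel the densities are strictly positive, so the guard only protects against malformed input.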

From a local point of view, the goal of the controller
is to bring the plant's output from its current
value to the desired reference value as soon as possible,
i.e., ideally $y_{k+1} = r_k$, where $k$ and $k+1$ represent
consecutive control steps. It is well known that it is not
possible to do this immediately due to several limitations
in the system (most notably the actuator ones).
A useful practice is to introduce a reference model that
represents the desired closed-loop dynamics of the
controlled systems [75.27]. The simplest choice is to use
a linear reference model of the first order. Then the
prediction of the reference output $y^r$ can be obtained using
the following equation

$$y_{k+1}^r = a_r y_k^r + (1 - a_r) r_k , \quad 0 < a_r < 1 ,$$ (75.8)

where $a_r$ is the pole of the first-order filter.
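
A first-order reference model of this kind is a simple recursive filter. The sketch below assumes a zero initial state unless stated otherwise; the function name is hypothetical.

```python
def reference_model(reference, a_r, y0=0.0):
    """Filter a reference sequence with a first-order model with pole a_r.

    Returns the list [y_0, y_1, ..., y_n] of reference-model outputs.
    """
    if not 0.0 < a_r < 1.0:
        raise ValueError("the pole a_r must lie strictly between 0 and 1")
    y = y0
    outputs = [y]
    for r in reference:
        y = a_r * y + (1.0 - a_r) * r
        outputs.append(y)
    return outputs


if __name__ == "__main__":
    # Unit step in the reference: the output approaches 1 geometrically.
    print(reference_model([1.0] * 5, a_r=0.5))
```

A pole closer to 1 gives a slower, smoother reference trajectory, while a pole closer to 0 demands a faster closed-loop response from the actuator.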

It can be tuned according to the desired speed of the
closed-loop system. Comparing the output of the plant
$y_k$ to the output of the reference model $y_k^r$, the tracking
error $\varepsilon_k$ is obtained

$$\varepsilon_k = y_k^r - y_k .$$ (75.9)

The goal of the controller in terms of the tracking error
$\varepsilon_k$ is to keep it as low as possible. Since the reference
model output $y_k^r$ is a filtered version of the reference
signal $r_k$, the tracking error has no step
changes due to reference signal changes. It also has
to be noted that the tracking error is used as a driving
error during parameter adaptation, as we shall see in
Sect. 75.3.1.

As noted above, the proposed approach is compatible
with a wide spectrum of control laws in rule
consequents. Here, the PID-based rule consequents are
proposed

$$u_k^i = P_k^i \varepsilon_k + I_k^i \Sigma_k^{\varepsilon} + D_k^i \Delta_k^{\varepsilon} , \quad i = 1, \ldots, N ,$$ (75.10)

where $\Sigma_k^{\varepsilon}$ and $\Delta_k^{\varepsilon}$ denote the discrete-time integral and
derivative of the tracking error, respectively,


(75.11) 


while $P_k^i$, $I_k^i$, and $D_k^i$ are parameters that will be tuned
by means of adaptation of rule consequents.

The approach offers the possibility of implement- 
ing several subsets of PID-based controllers such as P, 
PI, PD, etc. For simplicity, only proportional controllers 
will be used in the rule consequent for the rest of this 
chapter 

$$u_k^i = P_k^i \varepsilon_k , \quad i = 1, \ldots, N .$$ (75.12)

It has to be kept in mind that most real-life con- 
trollers are limited in their operation and can only 
provide control actions within a specific range, namely, 
the actuator's interval $[u_{\min}, u_{\max}]$. If the computed
control signal $u_k$ is outside the actuator's interval,
it can simply be projected onto the interval if P or
PD type controllers are used. If integral controllers
are used as well, some classical approaches to avoid
integral windup should be implemented. When the
violations of the actuator's constraints are more drastic,
because of the chosen dynamics of the reference
model and a narrow actuator interval, the adaptation
of the control parameters can also be interrupted
to make the adaptation more robust.
This modification will also be explained in detail
later.
In this section, we present the methodology applied
for evolving the structure and parameters of the con- 
sequents of RECCo online. Initially, the controller is 
empty, so it has to be initialized from the first data sam- 
ple received. After this, the same steps are repeated for 
all incoming data. First, the consequents of the current 
rules are updated according to the error at the plant out- 
put. Then, a new control action is generated by applying 
the inference process described in Sect. 75.2. Finally, 
the structure of the controller is updated. If the appro- 
priate conditions are satisfied, a new cloud (and hence, 
a rule) is created; otherwise, the new sample is used to 
update the information about the data density and the 
consequent parameters of the current configuration of 
the controller. 

In the following sections, the entire process is de- 
scribed in more detail. Section 75.3.1 is devoted to 
the mechanism for online adaptation of the rule conse- 
quents. In Sect. 75.3.2 the process of adding new clouds 
is described. 


75.3.1 Online Adaptation 
of the Rule Consequents 


Assuming that the plant is monotonic with respect to the 
control signal, the partial derivative of the plant’s output 
with respect to the control signal has a definite constant 
sign $G_{\text{sign}} = \pm 1$, which has to be known in advance.
Therefore, the combination of the error at the plant’s 
output and the sign of the monotonicity of the plant 
with respect to the control signal provides information 
about the right direction in which to move the rule con- 
sequents to achieve the local control objective [75.10]. 

As mentioned above, the parameters of the rule
consequents are obtained by means of adaptation. In
normal circumstances the parameter changes are calculated
as follows

$$\Delta P_k^i = \gamma_P G_{\text{sign}} \lambda_k^i(x_k) \varepsilon_k , \quad i = 1, \ldots, N ,$$ (75.13)

where $\gamma_P$ is an adaptive gain for the proportional
controller gains.

Equation (75.13) is obtained by using gradient descent
and having the square of the tracking error as
a cost function. The controller gains are obtained by
summing up the terms obtained in (75.13)

$$P_k^i = P_{k-1}^i + \Delta P_k^i , \quad i = 1, \ldots, N .$$ (75.14)


Note that parameters keep changing until the tracking 
error is driven towards 0. Note also that only parameters 
corresponding to the active clouds are adapted, while 
the others are kept constant. 
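
The basic update of (75.13) and (75.14) can be sketched for a list of proportional gains, one per cloud. The function name and the numerical values are illustrative assumptions; the increment is written as the product of adaptive gain, plant sign, normalized membership, and tracking error, following the description above.

```python
def adapt_gains(gains, memberships, error, gamma_p, g_sign):
    """One adaptation step for the proportional gains of all clouds.

    gains       : current gain of each rule
    memberships : normalized membership of the current sample in each cloud
    error       : current tracking error
    gamma_p     : adaptive gain
    g_sign      : +1 or -1, the known sign of the plant's monotonicity
    """
    return [p + gamma_p * g_sign * lam * error
            for p, lam in zip(gains, memberships)]


if __name__ == "__main__":
    print(adapt_gains([1.0, 2.0], [0.75, 0.25], error=2.0, gamma_p=0.5, g_sign=1))
```

Because each increment is scaled by the membership, a cloud whose membership is close to zero for the current sample is left almost unchanged.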

Systems with parameter adaptation are subject to
parameter drift, which can lead to performance
degradation and, eventually, to system instability [75.28].
There exist many known approaches to make adaptive
laws more robust [75.29-31]. We will employ parameter
projection and parameter leakage, introduce a dead
zone into the adaptive laws, and interrupt the adaptation
of the parameters when the actuator is in
saturation.


Dead Zone in the Adaptive Law 

As has already been said, adaptation of the
parameters in the closed loop always presents potential
danger to the system's stability. The adaptation
is driven by an error signal that is always composed
of the useful component and the harmful one. The
latter is due to disturbances and parasitic dynamics,
and is usually bounded. Large errors are usually
mostly composed of useful components, while a small
tracking error is very often due to harmful signals.
Having the adaptation active during the time that the
error is small results in a false adaptation. The idea
behind the dead zone in the adaptive law is that
the adaptation is simply switched off if the absolute
value of the error that governs the adaptation is
small [75.32]

$$\Delta P_k^i = \begin{cases} \gamma_P G_{\text{sign}} \lambda_k^i(x_k) \varepsilon_k , & |\varepsilon_k| \ge d_{\text{dead}} \\ 0 , & |\varepsilon_k| < d_{\text{dead}} \end{cases} \quad i = 1, \ldots, N .$$ (75.15)


Parameter Projection 
Parameter projection is a natural way to prevent param- 
eter drift. The idea is to project the parameters onto 
a compact set [75.27]. In our case, each individual pa- 
rameter is projected on a certain interval or a ray. When 
projecting the parameters some prior knowledge must 
always be available. Since in our case proportional con- 
troller gains are adapted, their sign is always known 
and is equal to Gsign. In the case of positive plant gain 
all the consequent parameters should be bounded by 0 
from below, while an upper bound may or may not (if 
not enough prior knowledge is available) be provided. 
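The adaptive law given by (75.14) is generalized as
follows if the controller gains are projected onto the
interval $[\underline{P}, \overline{P}]$


$$P_k^i = \begin{cases} P_{k-1}^i + \Delta P_k^i , & \underline{P} \le P_{k-1}^i + \Delta P_k^i \le \overline{P} \\ \underline{P} , & P_{k-1}^i + \Delta P_k^i < \underline{P} \\ \overline{P} , & P_{k-1}^i + \Delta P_k^i > \overline{P} \end{cases} \qquad i = 1, \dots, N , \tag{75.16}$$


where $\underline{P}$ and $\overline{P}$ are two design parameters. In our
approach, $\underline{P} = 0$ and $\overline{P} = \infty$ will be used.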


Leakage in the Adaptive Law 
The idea of leakage is that the discrete integration in 
the adaptive law in (75.14) presents a potential danger 
to the adaptive system, and the pole due to the integra- 
tor should be pushed inside the unit disc [75.33], which 
results in the adaptive law 


$$P_k^i = (1 - \sigma_P) P_{k-1}^i + \Delta P_k^i , \qquad i = 1, \dots, N , \tag{75.17}$$


where $\sigma_P$ defines the extent of the leakage. The
introduction of the leakage results in boundedness of
the adaptive parameters. This is why leakage is sometimes
referred to as soft projection.
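
A short worked bound may help to show why the leakage term keeps the
parameters bounded. It assumes, purely for illustration, that
$0 < \sigma_P < 1$ and that the increments satisfy
$|\Delta P_j^i| \le \Delta_{\max}$ for some constant $\Delta_{\max}$ (a symbol
introduced here, not taken from the chapter). Unrolling (75.17) gives

```latex
P_k^i = (1-\sigma_P)^k P_0^i + \sum_{j=1}^{k} (1-\sigma_P)^{k-j} \, \Delta P_j^i ,
\qquad
|P_k^i| \le (1-\sigma_P)^k |P_0^i|
          + \Delta_{\max} \, \frac{1-(1-\sigma_P)^k}{\sigma_P}
        \le |P_0^i| + \frac{\Delta_{\max}}{\sigma_P} .
```

Without leakage ($\sigma_P = 0$) the sum is a pure integrator and no such
bound follows. The bound grows as $1/\sigma_P$, so a very small leakage
leaves a correspondingly large admissible region for the parameters.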


Interruption of Adaptation 

When the chosen dynamics and the actuator constraints
are in conflict, such that tracking of the reference model
cannot be achieved in a sufficiently short interval, then
drift of the control parameters often occurs, because the
adaptive law is driven by a tracking error $\varepsilon_k$ that
cannot be reduced, due to the control signal constraints.
The interruption of the adaptation results in the following
modification


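$$\Delta P_k^i = \begin{cases} \gamma_P \, G_{\mathrm{sign}} \, \lambda_k^i(x_k) \, \varepsilon_k , & u_{\min} \le u_k \le u_{\max} \\ 0 , & \text{else} \end{cases} \qquad i = 1, \dots, N . \tag{75.18}$$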
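
The four modifications can be combined in a single update step. The
following is a minimal sketch, not the authors' implementation: the
function name `robust_gain_update`, the argument names, and the reading
of the weight `lam` as the per-cloud activation are assumptions made for
illustration, and `u` is assumed to be the control signal before it is
limited by the actuator. The default values are those quoted in the
simulation study of this chapter.

```python
import numpy as np

def robust_gain_update(P_prev, lam, eps, u, *, gamma_P=0.1, G_sign=1.0,
                       d_dead=0.25, sigma_P=1e-5,
                       P_min=0.0, P_max=np.inf, u_min=-6.0, u_max=6.0):
    """One adaptation step for the gains of all clouds (hypothetical helper).

    P_prev : gains P_{k-1}^i, one per cloud
    lam    : per-cloud weights (assumed activation of each cloud)
    eps    : scalar tracking error at step k
    u      : scalar control signal at step k (assumed unconstrained value)
    """
    P_prev = np.asarray(P_prev, dtype=float)
    lam = np.asarray(lam, dtype=float)

    # Dead zone (75.15) and interruption of adaptation (75.18):
    # adapt only for a large enough error and an unsaturated actuator.
    adapt = abs(eps) >= d_dead and u_min <= u <= u_max
    dP = gamma_P * G_sign * lam * eps if adapt else np.zeros_like(P_prev)

    # Leakage (75.17): the integrator pole moves from 1 to 1 - sigma_P.
    P_new = (1.0 - sigma_P) * P_prev + dP

    # Projection (75.16): clip every gain to [P_min, P_max].
    return np.clip(P_new, P_min, P_max)
```

Setting `sigma_P=0`, `d_dead=0`, and infinite limits reduces the sketch
to the plain law (75.14), which makes it easy to compare the robust and
non-robust variants on the same data.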


75.3.2 Evolution of the Structure: 
Adding New Clouds 


The adaptation of the parameters of the consequents
is performed online in a closed-loop manner (while
the controller operates on the real plant). The control
is applied from the first moment (no a-priori information
or controller structure is needed). Adaptive systems
traditionally [75.34] concern the tuning of parameters of
controllers whose structure has been pre-selected
by the designer. Self-evolving controllers [75.16, 24]
offer the possibility of evolving the structure of the
controller as well as adapting its parameters. This makes it
possible to design, on the fly, controllers that are nonlinear
and require no pre-defined structure or knowledge about the
plant model. It requires a mechanism for the
online evolution of the controller’s structure, i.e., for
adding new antecedents and fuzzy rules.

The local density was defined earlier. Now,
the global density $\Gamma_k$ will be defined. Its definition is
analogous to the one given for the local density, except
that it takes into account the distance to all the previously
observed samples $z_j$ ($j = 1, \dots, k-1$). It has to be
noted that the global density is computed for the points
$z_k = [x_k^T; u_k]^T$, whilst the local density is defined only
for the input vectors $x_k$. Using again the Cauchy kernel,
this density can be defined as


$$\Gamma_k = \frac{1}{1 + \dfrac{1}{k-1} \sum_{j=1}^{k-1} \| z_k - z_j \|^2} , \tag{75.19}$$


and can be computed recursively by [75.23]


$$\Gamma_k = \frac{1}{1 + \| z_k - \mu_k \|^2 + \Sigma_k - \| \mu_k \|^2} , \tag{75.20}$$


where $\Gamma_k$ denotes the global density with respect to all the
data, calculated at the k-th time instant, and $\Sigma_k$ denotes the
scalar product of the data z


$$\Sigma_k = \frac{k-1}{k} \Sigma_{k-1} + \frac{1}{k} \| z_k \|^2 , \tag{75.21}$$


with starting condition $\Sigma_1 = \| z_1 \|^2$. The update of the
mean value $\mu_k$ is straightforward


$$\mu_k = \frac{k-1}{k} \mu_{k-1} + \frac{1}{k} z_k , \tag{75.22}$$


with starting condition $\mu_1 = z_1$.

Since this measure considers all existing samples,
it provides an indication of how representative a given
point $z_k$ is with respect to the entire data distribution.
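
The recursion (75.20)-(75.22) needs only the running mean and the running
mean squared norm, so no past samples have to be stored. The following is
a minimal sketch; the class name `GlobalDensity` and its attributes are
hypothetical. Note that with the mean and the squared norm updated before
the density is evaluated, the recursion equals the Cauchy kernel of the
mean squared distance from $z_k$ to all k samples seen so far, the current
one included (its own distance is zero).

```python
import numpy as np

class GlobalDensity:
    """Recursive global density of a data stream (illustrative sketch)."""

    def __init__(self):
        self.k = 0         # number of samples seen so far
        self.mu = None     # running mean of z, as in (75.22)
        self.Sigma = 0.0   # running mean of ||z||^2, as in (75.21)

    def update(self, z):
        z = np.asarray(z, dtype=float)
        self.k += 1
        if self.k == 1:
            # Starting conditions: mu_1 = z_1, Sigma_1 = ||z_1||^2
            self.mu = z.copy()
            self.Sigma = float(z @ z)
        else:
            w = (self.k - 1) / self.k
            self.mu = w * self.mu + z / self.k
            self.Sigma = w * self.Sigma + float(z @ z) / self.k
        # Density of the current sample, as in (75.20)
        d2 = float((z - self.mu) @ (z - self.mu))
        return 1.0 / (1.0 + d2 + self.Sigma - float(self.mu @ self.mu))
```

The first sample always receives density 1, and later samples receive
lower values the farther they lie from the bulk of the data.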

Additionally, and only for learning purposes, a focal
point $x_f^i$ and a radius $r_{jk}^i$ are defined for each cloud.
The focal point is a real data sample that has highly
representative qualities. The fact that the focal point is
always a real sample is important, as it avoids problems
that may appear when only a descriptive measure is
used instead (as in the case of the average of the points
in the cloud). In order to follow the philosophy of the
proposed methodology (i.e., avoiding the need to
pre-define the parameters of the rules), the focal point is
updated online. Thus, for each new sample $z_k$, the
following process is applied:


1. Find the associated cloud $C^i$, according to

$$C^i = \arg\max_{i} \left( \gamma_k^i \right) . \tag{75.23}$$

2. Check the representative qualities of the new data
point using the following conditions,

$$\gamma_k^i > \gamma_f^i , \tag{75.24a}$$

$$\Gamma_k > \Gamma_f^i , \tag{75.24b}$$

where $\gamma_f^i$ and $\Gamma_f^i$ represent the local and global
density of the current focal point, respectively.

3. If both conditions are satisfied, then replace the focal
point by applying $x_f^i \leftarrow x_k$.
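
The three steps above can be sketched as follows. This is an illustrative
outline only: the `Cloud` container, the function name, and the choice to
store the densities of the focal point at the moment it is replaced are
assumptions, and the local and global densities are taken as already
computed inputs.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Cloud:
    focal_point: np.ndarray        # x_f^i, always a real data sample
    focal_local_density: float     # gamma_f^i
    focal_global_density: float    # Gamma_f^i

def update_focal_point(clouds, x, local_densities, global_density):
    """Apply steps 1-3 for one new sample; return the associated cloud index."""
    # Step 1, (75.23): the associated cloud has the highest local density.
    i = int(np.argmax(local_densities))
    cloud = clouds[i]
    # Step 2, (75.24a) and (75.24b): both densities must exceed those
    # of the current focal point.
    if (local_densities[i] > cloud.focal_local_density
            and global_density > cloud.focal_global_density):
        # Step 3: the new sample becomes the focal point.
        cloud.focal_point = np.asarray(x, dtype=float).copy()
        cloud.focal_local_density = float(local_densities[i])
        cloud.focal_global_density = float(global_density)
    return i
```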


The radius provides an idea of the spread of the
cloud. Since the cloud does not have a definite shape or
boundary, the radius represents only an approximation
of the spread of the data in the different dimensions. It
is also recursively updated as follows [75.35]

$$r_{jk}^i = \rho \, r_{j(k-1)}^i + (1 - \rho) \, \sigma_{jk}^i ; \qquad r_{j1}^i = 1 , \tag{75.25}$$

where $\rho$ is a constant that regulates the compatibility
of the new information with the old one and is usually
set to $\rho = 0.5$ [75.35]. The value $\sigma_{jk}^i$ denotes the cloud’s
local scatter over the input data space and is given by


$$\sigma_{jk}^i = \sqrt{ \frac{1}{M^i} \sum_{l=1}^{M^i} \left( x_{fj}^i - x_{lj} \right)^2 } ; \qquad \sigma_{j1}^i = 1 , \tag{75.26}$$

where $M^i$ is the number of samples associated with the i-th cloud.

It is important to note that the radii and focal points 
are only used to provide an idea of the location and dis- 
tribution of the data in the clouds during the structure 
evolution process. However, they are not actually used 
to represent the clouds or the fuzzy rules and do not 
affect the inference process at any point. 
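
The radius update is an exponential smoothing of the scatter, which a few
lines make concrete. This is a sketch under stated assumptions: the
function names are hypothetical, and the scatter is computed per input
dimension as the root mean square deviation of the cloud's samples from
its focal point, which is one reading of (75.26).

```python
import numpy as np

def local_scatter(focal_point, samples):
    """Per-dimension RMS deviation of a cloud's samples from its focal point."""
    samples = np.atleast_2d(np.asarray(samples, dtype=float))
    diff = samples - np.asarray(focal_point, dtype=float)
    return np.sqrt(np.mean(diff ** 2, axis=0))

def update_radius(r_prev, sigma, rho=0.5):
    """Recursive radius update (75.25); rho weighs old against new information."""
    r_prev = np.asarray(r_prev, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    return rho * r_prev + (1.0 - rho) * sigma
```

With $\rho = 0.5$ the radius moves halfway towards the current scatter at
every step, so the initial value of 1 is forgotten quickly.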

The structure-learning mechanism applied for the
proposed RECCo is based on the following principles
[75.36, 37]:


a) Good generalization and summarization are
achieved by forming new clouds from data samples
with high global density $\Gamma$.

b) Excessive overlap between clouds is avoided by
controlling the minimum distance between them.


Hence, the evolution of the structure is based on the
addition of new clouds and the associated rules. First,
RECCo is initialized by creating a cloud $C^1$ from the
first data sample $z_1 = [x_1^T; u_1]^T$. The antecedent of the
first rule is then defined by this cloud and its consequent
equals the value $u_1$. Next, for all the further incoming
data samples $z_k$, the following steps are applied:


1. The sample $z_k = [x_k^T; u_k]^T$ is considered to have
good generalization and summarization capabilities
if its global density is higher than the global density
of all the existing clouds. Thus, the following
condition is defined

$$\Gamma_k > \Gamma_f^i , \quad \forall i = 1, \dots, N . \tag{75.27}$$

Note that this is a very restrictive condition that
requires the inequality to be satisfied for all the
existing clouds, which does not happen very often for real data.

2. Check if the existing clouds are sufficiently far
from $z_k$ with the following condition


$$d_k^i > \frac{r_{jk}^i}{2} , \quad \forall i = 1, \dots, N , \tag{75.28}$$

where $d_k^i$ represents the distance from the current
sample to the focal point of the associated cloud, $x_f^i$.

3. According to the result of the previous steps, take 
one of the following actions: 

a) If conditions (75.27) and (75.28) are both satisfied,
then create a new cloud $C^{N+1}$. The focal
point of the new cloud is $x_f^{N+1} = x_k$. Its local
scatter is initialized based on the average of the
local scatters of the existing clouds [75.35]


$$\sigma_{jk}^{N+1} = \frac{1}{N} \sum_{i=1}^{N} \sigma_{jk}^i . \tag{75.29}$$


Additionally, the corresponding rule has to be
added to the rule base. The antecedent of the
new rule is defined by the newly created cloud
$C^{N+1}$. For the consequent, we provide an initial
value that guarantees that the output of the
new controller when the input vector is equal
to the focal point (i.e., $x = x_f^{N+1}$) equals the
controller’s output under its previous configuration
for that same input [75.21]. The rationale
behind this initialization is to provide
a smooth transition from the old configuration
of the controller to the new one. This
avoids sudden changes in the output surface
that could damage the controller’s performance
in the first time instants after the rule has
been added (and before the consequents are
adapted) [75.25].


b) If the conditions are not satisfied, update the
parameters of the cloud $C^i$ associated with $z_k$, as
previously explained.


It is important to stress that the methodology presented
for the evolution of the controller’s structure
starts from an empty controller. However, if an initial
set of rules is known beforehand (e.g., provided by an
expert or obtained from any other training method), it
can be used for the initial controller. In this case, the
algorithm’s initialization step can be omitted.


75.4 Simulation Study


A simulation study of the proposed self-organizing
controller is presented in this section. The main attention
is given to the study of different modifications of
self-organizing controllers that make the adaptive laws
more robust. In the study, we show the implementation
of parameter projection, parameter leakage, and
the introduction of a dead zone into the adaptive laws.
The study was carried out with the assumption of no
prior knowledge about the plant dynamics. The mathematical
model was only used to simulate the plant
dynamics.

The plant for the simulation study is the thermal
process of a water bath. The main goal is the control of
the temperature in the water bath. The plant is described
by the following mathematical model in discrete form


$$y(k+1) = a \, y(k) + \frac{b}{1 + e^{0.5 y(k) - \gamma}} \, u(k) + (1 - a) \, y_0 , \tag{75.30}$$

where $a = e^{-\alpha T_s}$ and $b = \frac{\beta}{\alpha} \left( 1 - e^{-\alpha T_s} \right)$. The parameters
of the plant are estimated as $\alpha = 10^{-4}$, $\beta = 8.7 \times 10^{-3}$,
$\gamma = 40$, and $y_0 = 20\,^{\circ}\mathrm{C}$. The sampling period, $T_s$, is
equal to 25 s.

Fig. 75.1 Open-loop response of the plant: output temperature
and input variable
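
The static nonlinearity of the plant can be checked numerically. The
sketch below is an illustration, not the authors' simulation code. It
assumes the form of the widely used water-bath benchmark, in which the
input gain `b` is attenuated by a sigmoid-type factor
`1 / (1 + exp(0.5 * y - gamma))`; if the model differs, only `plant_step`
and `static_gain` need to change. The function names are hypothetical,
and the parameter values are those quoted above.

```python
import math

ALPHA, BETA, GAMMA = 1e-4, 8.7e-3, 40.0
Y0, TS = 20.0, 25.0                   # ambient temperature, sampling period

A = math.exp(-ALPHA * TS)             # a = exp(-alpha * Ts)
B = BETA / ALPHA * (1.0 - A)          # b = (beta / alpha) * (1 - exp(-alpha * Ts))

def plant_step(y, u):
    """One discrete step of the assumed water-bath model."""
    gain = B / (1.0 + math.exp(0.5 * y - GAMMA))
    return A * y + gain * u + (1.0 - A) * Y0

def static_gain(y):
    """Steady-state gain from u to (y - y0) at the operating point y."""
    return B / ((1.0 - A) * (1.0 + math.exp(0.5 * y - GAMMA)))

def open_loop(u_sequence, y_init=Y0):
    """Simulate the open-loop response to a sequence of inputs."""
    y, outputs = y_init, []
    for u in u_sequence:
        y = plant_step(y, u)
        outputs.append(y)
    return outputs
```

Under these assumptions the static gain is close to $\beta / \alpha = 87$
near the ambient temperature and halves at an output of 80, where the
exponent vanishes. A change of that size is why a single fixed controller
gain cannot serve the whole operating range.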

The open-loop response of the plant is shown in
Fig. 75.1. The plant exhibits a strong nonlinearity in
the static gain of the process. The reference signal is
chosen to show the ability of self-learning and of
dealing with nonlinearity, which is the main advantage
of the proposed algorithm. The clouds are defined in
such a way as to enable dealing with the nonlinearity,
and for that reason the input variables for
the controller are the reference value $r_k$ and the tracking
error $\varepsilon_k$.

Fig. 75.2 Reference, model reference, output signal tracking,
and control signal in the case of no robust adaptive
laws

All the simulations started from zero fuzzy rules and
membership functions, and new rules were generated
during the process. The first simulation was done for
the case without any robust adaptive laws. The parameters
of the control law do not converge but drift, which
leads to performance degradation and, after some time,
also to instability. In this example, the actuator interval
was given as [-6, 6], the reference model parameter was
defined as $a_r = 0.925$, and the adaptive gain as $\gamma_P = 0.1$.
The upper plot of Fig. 75.2 shows the reference $r$, the
model reference $y_r$, and the output signal $y$, and the lower
plot shows the control signal. The seven clouds generated
during the self-learning procedure are given in
Fig. 75.3.

The drifts of the adaptive parameters $P_k^i$ are shown
in Fig. 75.4, where the parameters for all clouds are
shown, and in Fig. 75.5, where a single adaptive parameter
is shown.

Fig. 75.3 Clouds in the case of no robust adaptive laws

Fig. 75.4 Adaptive parameters $P_k^i$ in the case of no robust
laws

The second simulation is done with the dead-zone
modification to make the adaptation robust. The dead zone
was chosen as $d_{\mathrm{dead}} = 2$. All the other parameters of
the algorithm were the same as in the first simulation:
the actuator’s interval was given as [-6, 6], the reference
model parameter was defined as $a_r = 0.925$, and
the adaptive gain as $\gamma_P = 0.1$. In the upper plot, Fig. 75.6
shows the reference, the model reference, and the output
signal, and in the lower plot the control signal. The
model reference tracking is suitable and the parameters
of the control law converge and enable a reasonable
performance.

Fig. 75.5 Drift of a single adaptive parameter in the case of
no robust laws

Fig. 75.6 Reference, model reference, output signal tracking,
and control signal in the case of a dead zone

The clouds generated during the self-learning procedure
are the same as those obtained in the first example
in Fig. 75.3. Seven clouds were again generated
during the procedure. The adaptive parameters $P_k^i$ are
shown in Fig. 75.7, where the parameters for all clouds
are shown, and in Fig. 75.8, where a single adaptive
parameter is shown.

The tracking error $\varepsilon_k$ in the case of the dead-zone
modification is shown in Fig. 75.9.

Fig. 75.7 Adaptive parameters $P_k^i$ in the case of a dead
zone

Fig. 75.8 Drift of a single adaptive parameter in the case of
a dead zone

The results of the last 500 samples are shown in
detail in Fig. 75.10. It can be seen that tracking with
the proposed modification of the adaptive laws gives
very good control performance.

The relatively big dead zone stops the adaptation of
the control parameters and results in a bigger tracking
error. On the other hand, a smaller dead zone would
result in longer settling of the adaptive parameters and
also in possible drifting. For this reason, the combination of
a dead zone and a leakage adaptive law is proposed in
the third simulation, where the rest of the parameters
are the same as in the previous simulation, except the
dead zone, which is now chosen to be $d_{\mathrm{dead}} = 0.25$, and
the leakage term, which is defined as $\sigma_P = 10^{-5}$.

Fig. 75.9 Tracking error

Fig. 75.10 Reference, model reference, output signal
tracking, and control signal in the case of a dead zone, in
detail

Figure 75.11 shows the reference, the model reference,
the output signal, and the control signal. The
model reference tracking is satisfactory and the parameters
of the control law converge and enable a reasonable
performance.

Fig. 75.11 Reference, model reference, output signal tracking,
and control signal in the case of leakage in the adaptive
law

Fig. 75.12 Adaptive parameters $P_k^i$ in the case of leakage
in the adaptive law


The clouds generated during the self-learning procedure
with leakage in the adaptive law are the same as
those obtained in the previous two examples, and are
given in Fig. 75.3. The positions of the clouds remain
the same in all three approaches using different robust
modifications of the adaptive laws. The adaptive parameters
$P_k^i$ are shown in Fig. 75.12, where the parameters
for all clouds are shown, and in Fig. 75.13, where
a single adaptive parameter is shown.

The tracking error $\varepsilon_k$ in the case of leakage in the
adaptive law is shown in Fig. 75.14. Due to the use of
a smaller dead zone and leakage in the adaptive law, the
tracking is better and the parameter convergence is
also good.

The results of the last 500 samples are shown in detail
in Fig. 75.15. They show that tracking with
the proposed leakage modification of the adaptive
laws gives a high control performance.

Fig. 75.13 Drift of a single adaptive parameter in the case of
leakage in the adaptive law

Fig. 75.14 Tracking error

In the fourth simulation study we would like to
show an example of drastic constraints in the process
actuator. In this case, the actuator constraints are given
by the interval [-1, 2]. The dead zone is now chosen
to be $d_{\mathrm{dead}} = 0.5$. Figure 75.16 shows the reference, the
model reference, the output signal, and the control signal
in the case of adaptation interruption. The model
reference tracking is satisfactory and the parameters of
the control law converge and enable a reasonable
performance.

Fig. 75.15 Reference, model reference, output signal
tracking, and control signal in the case of leakage in the
adaptive law, in detail

Fig. 75.16 Reference, model reference, output signal
tracking, and control signal in the case of adaptation
interruption

The clouds generated during the self-learning procedure
are shown in Fig. 75.17. The positions of the
clouds are now different because of the constraints and
the different tracking errors. The adaptive parameters $P_k^i$
are shown in Fig. 75.18, where the parameters for all
clouds are shown, and in Fig. 75.19, where a single
adaptive parameter is shown.

The tracking error $\varepsilon_k$ in the case of adaptation
interruption is shown in Fig. 75.20.

The results of the last 500 samples are shown in detail
in Fig. 75.21. It is shown that perfect tracking in
the case of a highly constrained control signal can be
achieved by using the proposed modification based on
adaptation interruption.

Fig. 75.17 Clouds in the case of adaptation interruption

Fig. 75.18 The adaptive parameters $P_k^i$ in the case of
adaptation interruption

Fig. 75.19 The drift of a single adaptive parameter in the
case of adaptation interruption

Fig. 75.20 Tracking error

Fig. 75.21 Reference, model reference, output signal
tracking, and control signal in the case of adaptation
interruption


75.5 Conclusions


In this chapter, a new approach for an online self-evolving
cloud-based fuzzy rule-based controller (RECCo),
which has no antecedent parameters, was proposed.
One illustrative example was provided to support the
concept. It has been shown that the proposed controller
can start with no a-priori knowledge. All the fuzzy
rules are defined during the self-evolving phase. The
controller performs the self-evolving algorithm simultaneously
with the control of the plant. The advantage of
the proposed controller is the self-evolving procedure,
which enables a working algorithm that starts from no
a-priori knowledge; it can cope with nonlinearity
because of the use of fuzzy data clouds, which
divide the input space and enable the use of
different control parameters in each cloud, and it also
enables adaptation to changes of the process parameters
during the control. No explicit membership function,
pre-training, or explicit model in any form is needed.
The proposed algorithm combines the well-known
concept of model-reference adaptive control
algorithms with the concepts of evolving fuzzy systems
of ANYA type (no antecedent parameters and density-based
fuzzy aggregation of the linguistic rules). In this
work, we analyzed problems related to the adaptive
approach. Different modifications of the adaptive laws were
studied. Those modifications make the adaptive laws
more robust to parameter drift, which often leads to
performance degradation and instability.
76. Evolving Embedded Fuzzy Controllers


Oscar H. Montiel Ross, Roberto Sepulveda Cruz 


The interest in research on and implementations of
type-2 fuzzy controllers (T2FCs) is increasing. It has
been demonstrated that these controllers provide
more advantages in handling uncertainties
than type-1 FCs (T1FCs). This characteristic is very
appealing because real-world problems are full
of inaccurate information from diverse sources.
Nowadays, it is no problem to implement an
intelligent controller (IC) on microcomputers since
they offer powerful operating systems, high-level
languages, microprocessors with several cores, and
co-processing capacities on graphic processing
units (GPUs), which are interesting characteristics
for the implementation of fast type-2 ICs (T2ICs).
However, the above benefits are not directly available
for the design of embedded ICs for consumer
electronics that need to be implemented in devices
such as an application-specific integrated circuit
(ASIC), a field-programmable gate array (FPGA),
etc. Fortunately, for T1FCs there are platforms that
generate code in VHSIC hardware description language
(VHDL; VHSIC: very high speed integrated
circuit), C++, and Java. This is not true for the design
of T2ICs, since there are no specialized tools
to develop the inference system as well as to optimize it.

The aim of this chapter is to present different
ways of achieving high-performance computing
for evolving T1 and T2 ICs embedded into FPGAs.
Therefore, we provide a compiled introduction
to T1 and T2 FCs, with emphasis on the well-known
bottleneck of the interval T2FC (IT2FC), and on
software and hardware proposals to minimize its
effect regarding computational cost. An overview
of learning systems and hosting technology for
their implementation is given. We explain different
ways to achieve such implementations: at the
circuit level using a hardware description language,
using a multiprocessor system and a high-level
language, and combining both methods. We
explain how to use the IT2FC developed in VHDL as
a standalone system, and as a coprocessor for the
FPGA Fusion of Actel, Spartan 6, and Virtex 5. We
present the methodology and two new proposals
to achieve evolution of the IT2FC for FPGA, one for
the static region of the FPGA, and the other one
for the reconfigurable region using the dynamic
partial reconfiguration methodology.
76.1 Overview


An intelligent system and evolution are intrinsically re- 
lated since it is difficult to conceive intelligence without 
evolution because intelligence cannot be static. Hu- 
man beings create, adapt, and replace their own rules 
throughout their whole lives. The idea to apply evolu- 
tion to a fuzzy system is an attempt to construct a math- 
ematical assembly that can approximate human-like 
reasoning and learning mechanisms [76.1]. A mathe- 
matical tool that has been successfully applied to better 
represent different forms of knowledge is fuzzy logic 
(FL); also if-then rules are a good way to express hu- 
man knowledge, so the application of FL to a rule-based 
system leads to a Fuzzy Rule-Based System (FRBS). 
Unfortunately, an FRBS is not able to learn by itself;
the knowledge needs to be derived from the expert or
generated automatically with an evolutionary algorithm
(EA) such as a genetic algorithm (GA) [76.2].

The use of GAs to design machine learning systems
constitutes the soft computing paradigm known as the
genetic fuzzy system, where the goal is to incorporate
learning into the system or to tune different components
of the FRBS. Other proposals in the same line of work
are: genetic fuzzy neural networks, genetic fuzzy clus- 
tering, and fuzzy decision trees. A system with the 
capacity to evolve can be defined as a self-developing, 
self-learning, fuzzy rule-based or neuro-fuzzy system 
with the ability to self-adapt its parameters and struc- 
ture online [76.3]. 

Figure 76.1 shows the general structure of an evolutionary
FRBS (EFRBS) that can be used for tuning
or learning purposes. Although it is difficult to make
a clear distinction between tuning and learning, the
particular aspects of each process can be summarized as
follows. The tuning process is assumed to work on
a predefined rule base, with the target of finding the
optimal set of parameters for the membership functions
and/or scaling functions. On the other hand, the
learning process requires a more elaborate search
in the space of possible rule bases, or in the whole
knowledge base, as well as for the scaling functions.
Since the learning approach does not
depend on a predefined set of rules and knowledge,
the system can change its fundamental structure with
the aim of improving its performance according to
some criteria. The idea of using scaling functions for
input and output variables is to normalize the universe
of discourse in which the membership functions were
defined.

According to De Jong [76.4]:


the common denominator in most learning systems
is their capability of making structural changes to
themselves over time with the intent of improving
performance on tasks defined by the environment,
discovering and subsequently exploiting interesting
concepts, or improving the consistency and generality
of internal knowledge structures.


Hence, it is important to have a clear understanding
of the strengths and limitations of a particular learning
system, to achieve a precise characterization of all the
permitted structural changes and how they are going to
be made.

Fig. 76.1 General structure of an evolutionary fuzzy
rule-based system (block labels: learning or tuning process;
scaling functions; optimization method: evolutionary algorithm)

De Jong defines three levels of complexity
at which the GA can perform legal structural changes in
pursuit of a goal [76.4]:


1. By changing critical parameters’ values 

2. By changing key data structures 

3. By changing the program itself with the idea of 
achieving effective behavioral changes in a task 
subsystem where a prominent representative of this 
branch is the learning production-systems program. 


One reason for the success of production systems
in machine learning is that they
have a representation of knowledge that can simultaneously
support two kinds of activities: (1) the knowledge
can be treated as data that can be manipulated according
to some criteria; (2) for a particular task, the knowledge
can be used as an executable entity.

The two classical approaches for working with
EFRBSs for a learning system are
the Pittsburgh and Michigan approaches. Historically,
in 1975 Holland [76.5] affirmed that a natural way to
represent an entire rule set is to use a string, i.e., an
individual; so, the population is formed by candidate
rule sets, and to achieve evolution it is necessary to use
selection and genetic operators to produce new generations
of rule sets. This was the approach taken by De
Jong at the University of Pittsburgh, hence the name
Pittsburgh approach. During the same period, Holland
developed a model of cognition in which the members
of the population are individual rules, and the entire
population constitutes the rule set; this quickly became
the Michigan approach [76.6, 7].
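
The difference between the two encodings is easiest to see in the shape
of the population. The sketch below is purely illustrative: the `Rule`
type, the integer coding of linguistic labels, and the function names
are hypothetical and are not taken from the cited works.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    antecedent: tuple   # one linguistic-label index per input variable
    consequent: int     # linguistic-label index of the output

def random_rule(rng, n_inputs=2, n_labels=3):
    """Draw one random if-then rule over integer-coded linguistic labels."""
    antecedent = tuple(rng.randrange(n_labels) for _ in range(n_inputs))
    return Rule(antecedent, rng.randrange(n_labels))

def pittsburgh_population(rng, pop_size, rules_per_set):
    """Pittsburgh: every individual is a complete candidate rule set."""
    return [[random_rule(rng) for _ in range(rules_per_set)]
            for _ in range(pop_size)]

def michigan_population(rng, pop_size):
    """Michigan: every individual is one rule; the population is the rule set."""
    return [random_rule(rng) for _ in range(pop_size)]
```

In the Pittsburgh encoding fitness is assigned to whole rule sets, whereas
in the Michigan encoding it has to be apportioned among the individual
rules that cooperate within a single rule base.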

There is extensive pioneering and recent work
on tuning and learning using FRBSs, most of which
falls in some way within the Michigan or the Pittsburgh
approach, for example, the supervised inductive
algorithm [76.8, 9], the iterative rule learning
approach [76.10], coverage-based genetic induction
(COGIN) [76.11, 12], the relational genetic algorithm
learner (REGAL) system [76.13], and the compact fuzzy
classification system [76.14], with applications to fuzzy
control [76.15, 16] and to the tuning of type-2 fuzzy
controllers [76.17-20].

The focus of this chapter is on evolving embedded
fuzzy controllers; this subclassification reduces the
number of related works; however, they are still
numerous, since by an embedded system (ES) we
understand a combination of computer hardware (HW)
and software (SW) devoted to a specific control function
within a larger system. Typically, the HW of an ES
can be a dedicated computer system, a microcontroller,
a digital signal processor, or an FPGA-based system. If
the SW of the ES is fixed, it is called firmware; because
there are no strict boundaries between firmware and
software, and the ES has the capability of being
reprogrammed, the firmware can be low level or high level.
Low-level firmware tells the hardware how to work and
typically resides in a read-only memory (ROM) or in
a programmable logic array (PLA); high-level firmware
can be updated, hence it is usually stored in flash memory,
and it is often considered software.

In the literature, there is extensive work on successful
applications of type-1 and type-2 fuzzy systems;
with regard to evolving embedded fuzzy systems,
they were applied in a control mechanism for
autonomous mobile robot navigation in real environments
in [76.21]. To further limit the
content of this chapter, we have focused on EFRBSs
to be implemented in an FPGA HW platform, with
special emphasis on type-2 FRBSs. With respect to
type-1 FRBSs, the following proposals drew our
attention: the development of
an FPGA-based proportional-differential (PD) fuzzy
look-up table controller [76.22]; an FPGA implementation
of embedded fuzzy controllers for robotic
applications [76.23]; a non-fixed structure fuzzy logic
controller [76.24]; a flexible architecture
to implement a fuzzy controller in an FPGA [76.25];
a very simple method for tuning the input membership
function (MF) to modify the response of the implemented
FPGA controller [76.26]; and explanations of how to
test and simulate the different stages of an FRBS for
future implementation in an FPGA [76.27-29]. On
type-1 EFRBSs there are works such as a reconfigurable
hardware platform for evolving a fuzzy system by using
a cooperative coevolutionary methodology [76.30] and
the tuning of input MFs for an incremental fuzzy PD
controller using a GA [76.31]. In the type-2 FRBS
category, less work has been reported; representative
work can be listed as follows: an architectural proposal
of a hardware-based interval type-2 fuzzy inference
engine for FPGA is presented in [76.32]; the use of
a parallel HW implementation of an interval type-2
fuzzy logic controller, using bespoke coprocessors
handled by a soft-core processor, is explored in [76.33];
and a high-performance interval type-2 fuzzy inference
system (IT2-FIS) that can achieve the four stages of
fuzzification, inference, KM-type reduction, and
defuzzification in four clock cycles is shown in [76.34];
the same system is suitable for implementation in
pipelines, providing the complete IT2-FIS process in
just one clock cycle.

This work deals with the development of evolving
embedded type-1 and type-2 fuzzy controllers. In
the chapter, a broad exploration of several ways to
implement evolving embedded fuzzy controllers is
presented. We chose to work with the Mamdani fuzzy
controller proposal since it provides a highly flexible
means to formulate knowledge.

The organization of this chapter is as follows. In 
Sect. 76.2 we present the basis of T1 and T2 FL to 
explain how to achieve the HW implementation of an 
FRBS. In Sect. 76.3 a brief description of the state of 
the art in hosting technology for high-performance em- 
bedded systems is given. 


76.2 Type-1 and Type-2 Fuzzy Controllers 


Type-2 fuzzy sets (T2FSs) were developed with the
aim of handling uncertainty in a better way than T1FSs
do, since a T1FS has crisp grades of membership,
whereas a T2FS has fuzzy grades of membership. An
important point to note is that if all uncertainty
disappears, a T2FS reduces to a T1FS. A type-2
membership function (T2MF) is an FS that has primary
and secondary membership values; the primary MF is
a representation of an FS, and serves to create a
linguistic representation of some concept with linguistic
and random uncertainties, with limited capabilities; the
secondary MF allows capturing more about linguistic
uncertainty than a T1MF.

There are two common ways to use a T2FS: the
generalized T2FS (GT2) and the interval T2FS (IT2FS).
The former has secondary membership grades of
different values to represent the existing uncertainty
more accurately; in an IT2FS, on the other hand, the
secondary membership value always takes the value 1.
Unfortunately, no one yet knows how to choose the best
secondary MFs for a GT2; moreover, this method
introduces a lot of computation, making it
inappropriate for current application in real-time (RT)
systems, even those with small time constraints; in
contrast, the calculations are easy to perform in an IT2FS.

Fig. 76.2 Type-2 membership function. For the triangular MF the
FOU is shown. The FOU is bounded by the upper part UMF(A) and
the lower part LMF(A). A vertical slice at x' is illustrated. Right,
top: secondary MF values for a generalized T2MF; bottom:
secondary MF values of an IT2MF

A T2MF can be represented using a 3-D figure that
is not as easy to sketch as a T1MF. A more common
way to visualize a T2MF is to sketch its footprint of
uncertainty (FOU) on the 2-D domain of the T2FS. We
illustrate this concept in Fig. 76.2, where we show a
vertical slice sketch of the FOU at the primary MF value
x'; in the case of a GT2, in the upper right part of the
figure, the secondary MF shows different height values;
in the case of an IT2FS, just below, the secondary MF
has uniform values. Note that the secondary values sit
on top of its FOU.

Figure 76.3 shows the main components of a fuzzy
logic system and the differences between the T1
and T2 FC. For T1 systems, there are three components:
fuzzifier, inference engine, and defuzzifier, which is
the only output processing unit; whereas for a T2 system
there are four components, since the output processing
unit is formed by the interconnected type reducer (TR)
block and the defuzzifier.

Fig. 76.3 Type-1 and type-2 FC. The T2FC at the output
processing has the type reducer block

Ordinary fuzzy sets were developed by Zadeh in 
1965 [76.35]; they are an extension of classical set the- 
ory where the concept of membership was extended to 
have various grades of membership on the real con- 
tinuous interval [0,1]. The original idea was to use 
a fuzzy set (FS); i. e., a linguistic term to model a word; 
however, after almost 10 years, Zadeh introduced the 
concept of type-n FS as an extension of an ordinary FS 
(T1FS) with the idea of blurring the degrees of mem- 
bership values [76.36]. 

T1FSs have been demonstrated to work efficiently
in many applications; most of them use the mathematics
of fuzzy sets but lose the focus on words, since the sets
are mainly used to represent a function that is more
mathematical than linguistic [76.37].

A T1FS is a set of ordered pairs represented
by (76.1) [76.38],


$$A = \{(x, \mu_A(x)) \mid x \in X\} , \qquad (76.1)$$


where each element is mapped to [0, 1] by its MF $\mu_A$,
where [0, 1] means real numbers between 0 and 1,
including the values 0 and 1,


$$\mu_A(x) : X \to [0, 1] . \qquad (76.2)$$
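
As a minimal illustration of (76.1) and (76.2), the sketch below builds a discrete T1FS as a list of ordered pairs. The triangular shape, its parameters (0, 2, 4) and the sample universe `X` are assumptions chosen only for the example; the chapter does not prescribe them.

```python
def triangular_mf(x, a, b, c):
    """Membership grade of x in a triangular T1 MF (feet a and c, peak b, a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical discrete universe of discourse X.
X = [0, 1, 2, 3, 4]

# The T1FS A as ordered pairs (x, mu_A(x)), following (76.1).
A = [(x, triangular_mf(x, 0.0, 2.0, 4.0)) for x in X]
```

Every grade in `A` lies in [0, 1], which is the mapping stated in (76.2).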


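A pointwise definition of a T2FS is given as follows:
$\tilde{A}$ is characterized by a T2MF $\mu_{\tilde{A}}(x, u)$, where $x \in X$ and
$u \in J_x \subseteq [0, 1]$, i.e. [76.39],


$$\tilde{A} = \{((x, u), \mu_{\tilde{A}}(x, u)) \mid \forall x \in X, \forall u \in J_x \subseteq [0, 1]\} , \qquad (76.3)$$


where $0 \le \mu_{\tilde{A}}(x, u) \le 1$.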
Another way to express $\tilde{A}$ is


$$\tilde{A} = \int_{x \in X} \int_{u \in J_x} \mu_{\tilde{A}}(x, u) / (x, u) , \quad J_x \subseteq [0, 1] , \qquad (76.4)$$


where $\int\int$ denotes the union over all admissible input
variables x and u. For discrete universes of discourse
$\int$ is replaced by $\sum$ [76.39]. In fact, $J_x \subseteq [0, 1]$
represents the primary membership of $x \in X$ and $\mu_{\tilde{A}}(x, u)$ is
a T1FS known as the secondary set. Hence, a T2MF
can be any subset in [0,1], the primary membership,
and corresponding to each primary membership, there
is a secondary membership (which can also be in [0,1])
that defines the uncertainty for the primary membership.


When $\mu_{\tilde{A}}(x, u) = 1$, where $x \in X$ and $u \in J_x \subseteq [0, 1]$,
we have the IT2MF shown in Fig. 76.2. The uniform
shading for the FOU represents the entire IT2FS
and it can be described in terms of an upper
membership function and a lower membership function


$$\bar{\mu}_{\tilde{A}}(x) = \overline{\mathrm{FOU}(\tilde{A})} \quad \forall x \in X , \qquad (76.5)$$
$$\underline{\mu}_{\tilde{A}}(x) = \underline{\mathrm{FOU}(\tilde{A})} \quad \forall x \in X . \qquad (76.6)$$


Figure 76.2 shows an IT2MF; the shaded region is the
FOU. At the points $x_1$ and $x_2$ the primary MFs $J_{x_1}$
and $J_{x_2}$, and the corresponding secondary MFs $\mu_{\tilde{A}}(x_1)$
and $\mu_{\tilde{A}}(x_2)$ are also shown.
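
The upper and lower membership functions in (76.5) and (76.6) can be illustrated with a short sketch. It assumes a triangular FOU whose UMF and LMF share a peak; all numeric parameters are hypothetical and only chosen so that the LMF lies inside the UMF.

```python
def triangular_mf(x, a, b, c, height=1.0):
    """Triangular MF with feet a and c, peak b, and maximum grade `height`."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return height * (x - a) / (b - a)
    return height * (c - x) / (c - b)


def it2_membership(x):
    """Return (lower, upper) grades of x, i.e. the end points of J_x."""
    upper = triangular_mf(x, 0.0, 5.0, 10.0)              # UMF (assumed shape)
    lower = triangular_mf(x, 2.0, 5.0, 8.0, height=0.8)   # LMF (assumed shape)
    return lower, upper
```

For an IT2FS the secondary grade is 1 everywhere on the interval returned by `it2_membership`, so the pair of bounds fully describes the vertical slice at `x`.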

The basics and principles of fuzzy logic do not
change from T1FSs to T2FSs [76.37, 40, 41]; they are
independent of the nature of the membership functions
and, in general, will not change for any type-n. When
a FIS uses at least one type-2 fuzzy set, it is a type-2
FIS.

In this chapter we base our study on IT2FSs, so the
IT2 FIS can be seen as a mapping from the inputs to the
output and it can be interpreted quantitatively as $Y = f(X)$,
where $X = \{x_1, x_2, \ldots, x_n\}$ are the inputs to the
IT2 FIS $f$, and $Y = \{y_1, y_2, \ldots, y_n\}$ are the defuzzified
outputs. These concepts can be represented by rules of
the form


$$\text{If } x_1 \text{ is } \tilde{F}_1 \text{ and } \ldots \text{ and } x_n \text{ is } \tilde{F}_n , \text{ then } y \text{ is } \tilde{G} . \qquad (76.7)$$


In a T1FC, where the output sets are T1FSs, the
defuzzification produces a number, which is in some sense
a crisp representation of the combined output sets. In
the T2 case, the output sets are T2, so an extended
defuzzification operation is necessary to get a T1FS at
the output. Since this operation converts T2 output sets
to a T1FS, it is called type reduction, and the T1FS is
called a type-reduced set, which may then be
defuzzified to obtain a single crisp number.

The TR stage is the most computationally expen- 
sive stage of the T2FC; therefore, several proposals to 
improve this stage have been developed. One of the 
first proposals was the iterative procedure known as the
Karnik-Mendel (KM) algorithm.
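
To make the iterative nature of type reduction concrete, here is a minimal software sketch of the commonly described KM iteration for the two end points of the type-reduced set. It is not the chapter's hardware design: the function names are hypothetical, the consequents are taken as crisp points `y`, and all firing strengths are assumed strictly positive so that no weight sum is zero.

```python
def weighted_average(y, f):
    """Centroid of the points y under the weights f."""
    return sum(yi * fi for yi, fi in zip(y, f)) / sum(f)


def km_endpoint(y, f_lower, f_upper, left, max_iter=100):
    """Iterate to the left (left=True) or right end point of the type-reduced set."""
    # KM assumes the consequent points are in ascending order.
    rows = sorted(zip(y, f_lower, f_upper))
    y = [r[0] for r in rows]
    f_lower = [r[1] for r in rows]
    f_upper = [r[2] for r in rows]
    n = len(y)

    # Initialise with the mid-point of every firing interval.
    f = [(lo + up) / 2 for lo, up in zip(f_lower, f_upper)]
    y_prev = weighted_average(y, f)
    y_new = y_prev
    for _ in range(max_iter):
        # Switch point: last index k with y[k] <= y_prev, kept below the last point.
        k = max((i for i, yi in enumerate(y) if yi <= y_prev), default=0)
        k = min(k, n - 2)
        if left:   # upper strengths up to k, lower strengths after k
            f = [f_upper[i] if i <= k else f_lower[i] for i in range(n)]
        else:      # lower strengths up to k, upper strengths after k
            f = [f_lower[i] if i <= k else f_upper[i] for i in range(n)]
        y_new = weighted_average(y, f)
        if abs(y_new - y_prev) < 1e-12:
            break
        y_prev = y_new
    return y_new


def km_defuzzify(y, f_lower, f_upper):
    """Crisp output as the mid-point of the type-reduced interval [y_l, y_r]."""
    y_l = km_endpoint(y, f_lower, f_upper, left=True)
    y_r = km_endpoint(y, f_lower, f_upper, left=False)
    return (y_l + y_r) / 2
```

The loop count depends on the data, which is why the TR stage dominates the cost and why the enhancements listed next target initialization, termination, and per-iteration work.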

In general, all the proposals can be classified into 
two big groups. Group I embraces all the algorithmic 
improvements and Group II all the hardware improve- 
ments, as follows [76.42]: 


1. Improvements to software algorithms, where the
dominant idea is to reduce the computational cost of
IT2-FIS based on algorithmic improvements. This
group can be subdivided into three subgroups.

(a) Enhancements to the KM TR algorithm. As the
classification's name indicates, the aim is to
improve the original KM TR algorithm directly, to
speed it up. The best known algorithms in this
classification are:

i. Enhanced KM (EKM) algorithms. They 
have three improvements over the original 
KM algorithm. First, a better initialization 
is used to reduce the number of iterations. 
Second, the termination condition of the it- 
erations is changed to remove unnecessary 
iterations (one). Finally, a subtle computing 
technique is used to reduce the computa- 
tional cost of each iteration. 

ii. The enhanced Karnik-Mendel algorithm 
with new initialization (EKMANI) [76.43]. 
It computes the generalized centroid of
general T2FSs. It is based on the observation that
for two alpha-planes close to each other, the
centroids of the two resulting IT2FSs are
also close to each other. So, it may be
advantageous to use the switch points obtained
from the previous alpha-plane to initialize 
the switch points in the current alpha-plane. 
Although EKMANI was primarily intended 
for computing the generalized centroid, it 
may also be used in the TR of IT2-FIS, 
because usually the output of an IT2-FIS 
changes only a small amount at each step. 

iii. The iterative algorithm with stop condition 
(IASC). This was proposed by Melgarejo 
et al. [76.44] and is based on the analysis of 
behavior of the firing strengths. 

iv. The enhanced IASC [76.45] is an
improvement of the IASC.

v. Enhanced opposite directions searching 
(EODS), which is a proposal to speed up 
KM algorithms. The aim is to search in both 
directions simultaneously, and in each iter- 
ation the points L and R are the switching 
points. 


(b) Alternative TR algorithms. Unlike iterative KM
algorithms, most alternative TR algorithms have
a closed-form representation. Usually, they are
faster than KM algorithms. Two representative
examples are:

i. The Gorzalczany method. A polygon is built
using the firing strengths $[\underline{f}^n, \bar{f}^n]$ and the
consequents $[\underline{y}^n, \bar{y}^n]$, which can be viewed
as an IT2FS. It computes an approximate membership
value for each point. Here, $y^n = \underline{y}^n = \bar{y}^n$,
for $n = 1, 2, 3, \ldots, N$,


$$\mu(y) = \frac{\underline{f}(y) + \bar{f}(y)}{2} \left[ 1 - \left( \bar{f}(y) - \underline{f}(y) \right) \right] , \qquad (76.8)$$


where $\bar{f}(y) - \underline{f}(y)$ is called the bandwidth. Then the
defuzzified output can be computed as


$$y_G = \arg\max_y \mu(y) . \qquad (76.9)$$


ii. The Wu-Tan (WT) method. It searches for an
equivalent T1FS. The centroid method is
applied to obtain the defuzzification. This is
the fastest method in this category.

2. Hardware implementation. The main idea is to take 
advantage of the intrinsic parallelism of the hard- 
ware and/or combinations of hardware and parallel 
programming. Here, we divided this group into four 
main approaches that embrace the existing propos- 
als of reducing the computational time of the type 
reduction stage by the use of parallelism at different 
levels. 

(a) The use of multiprocessor systems, including 
multicore systems that enable the same benefits 
at a reduced cost. In this category are personal 
and industrial computers with processors such 
as the Intel Pentium Core Processor family, 
which includes the Intel Core i3, i5 and i7; the
AMD Quad-Core Opteron, the AMD Phenom
X4 Quad-Core processors, multicore microcon- 
trollers such as the Propeller P8X32A from 
Parallax, or the F28M35Hx of the Concerto 
Microcontrollers family of Texas Instruments. 
Multicore processors also can be implemented 
into FPGAs. 

(b) The use of a general-purpose GPU (GPGPU), 
and compute unified device architecture 
(CUDA). In general, a GPU provides a new
way to perform high-performance computing
on hardware. In particular, IT2FCs can take
great advantage of this technology because
of their complexity. Traditionally, before
the development of the CUDA technology, 
the programming was achieved by translating 
a computational procedure into a graphic format 
with the idea to execute it using the standard 
graphic pipeline; a process known as encoding 
data into a texture format. The CUDA technol- 
ogy of NVIDIA offers a parallel programming 
model for GPUs that does not require the use 
of a graphic application programming interface 
(API), such as OpenGL [76.46].

(c) The use of FPGAs. This approach offers the
best processing speed and flexibility. One of the
main advantages is that the developer can
determine the desired degree of parallelism by a
trade-off analysis. Moreover, this technology allows
us to use the strengths of all platforms in tight
integration to provide the best performance
available at the present time. It is possible to have
a standalone T1/IT2FC, or to integrate the same
T1/T2FC as a coprocessor as part of a
high-performance computing system.

(d) The use of ASICs. The T1/T2FC is factory
integrated using complementary metal-oxide-
semiconductor (CMOS) technology. The main
advantage is that ASICs are cheaper than FPGAs
when produced in high volumes. Unlike FPGA
technology, ASIC solutions are not field
reprogrammable.


A system based on an FPGA platform allows us
to program all the Group I algorithms since modern
FPGAs have embedded hard and/or soft processors; this
kind of system can be programmed using high-level
languages such as C/C++, and it can also incorporate
operating systems such as Linux. On the other
hand, T1/T2 FC hardware implementations have the
advantage of providing speeds competitive with
ASIC systems together with in-field reconfigurability.


76.3 Host Technology


Until the beginning of this century, general-purpose
computers with a single-core processor were the
systems of choice for high-performance computing (HPC)
for many applications; they replaced existing big and
expensive computer architectures [76.47]. In 2001,
IBM introduced a reduced instruction set computer
(RISC) microarchitecture named POWER4 (performance
optimization with enhanced RISC) [76.48].
This was the first dual-core processor embedded into
a single die, and subsequently other companies
introduced different multicore microprocessor architectures
to the market, such as the Arm Cortex A9 [76.49],
Sparc64 [76.50], Intel and AMD Quad Core processors,
Intel i7 processors, and others [76.51]. These
developments, together with the rapid development of GPUs
that offer massively parallel architectures to develop
high-performance software, are an attractive choice
for professionals, scientists, and researchers interested
in speeding up applications. Undoubtedly, the use of
a generic computer with GPU technology has many
advantages for implementing an embedded learning fuzzy
system [76.46]; its disadvantages are mainly related to
size and power consumption. A solution to the
aforementioned problems is the use of application-specific
integrated circuit (ASIC) fuzzy processors [76.52–54],
or reprogrammable hardware based on
microcontrollers and/or FPGAs.

The orientation of this chapter is towards tuning and
learning using FRBS for embedded applications; for
now, we are going to focus on FPGA and ASIC
technology [76.55], since they provide the best level of
parallelization. Both families of devices provide
characteristics for HPC that the other options cannot. Each
technology has its own advantages and disadvantages,
and the differences between them are narrowing due to
recent developments.
In general, ASICs are integrated circuits that are
designed to implement a single application directly in
fixed hardware; therefore, they are very specialized
for solving a particular problem. The costs of ASIC
implementations are reduced for high volumes; they
are faster and consume less power; it is possible to
implement analog circuitry, as well as mixed-signal
designs, but the time to market can be a year or more.
Several design steps are required that are not needed
with FPGAs, and the development tools are very
expensive. On the other hand, FPGAs can be introduced
to the market very fast, since the user only needs a
personal computer and low-cost hardware to download
the hardware description language (HDL) design to the
FPGA before it is ready to work. They can be remotely
updated with new software since they are field
reprogrammable. They have specific dedicated
hardware such as blocks of random access memory
(RAM); they also provide high-speed programmable
I/O, hardware multipliers for digital signal processing
(DSP), intellectual property (IP) cores, and
microprocessors in the form of hard cores (factory
implemented) such as PowerPC and ARM for Xilinx, or
soft cores (user implemented) such as Microblaze and
Nios for Xilinx and Altera, respectively. They can have
built-in analog-to-digital converters (ADCs). The
synthesis process is easier. A significant point is that
the tested HDL code developed for FPGAs may be used
in the design process of an ASIC.

There are three main disadvantages of FPGAs
versus ASICs: FPGA devices consume more
power than ASICs, the design is limited to the resources
available in the FPGA, and they are cost-effective only
for low-quantity production. To mitigate these
disadvantages it is very important to achieve optimized
designs, which can only be attained by coding
efficient algorithms.

During the last decade, there has been an increasing
interest in evolving hardware by the use of evolutionary
computation applied to an embedded digital system.
Although different custom chips have been proposed for
this purpose, the most popular device is the FPGA because
its architecture is designed for general-purpose
commercial applications. New FPGAs allow part of the
programmed logic to be modified, or new logic to be
added, at run time. This feature is known as dynamic or
active reconfiguration, and because in an FPGA we can
combine a multiprocessor system and coprocessors, FPGAs
are very attractive for implementing evolvable hardware
algorithms. Therefore, in the next sections, we shall put
special emphasis on multiprocessor systems and FPGAs.


76.4 Hardware Implementation Approaches 


In this section, an overview of the three main lines of 
attack to do a hardware implementation of an intelligent 
system is given. 


76.4.1 Multiprocessor Systems 


Multiprocessor systems consist of multiple processors
residing within one system; they have been available
for many years. Multicore processors offer benefits
equivalent to multiprocessors at a lower cost; their cores
are integrated in the same electronic component. At the
present time, most modern computer systems have many
processors that can be single-core or multicore
processors; therefore, we can have three different layouts
for multiprocessing: a multicore system, a
multiprocessor system, and a multiprocessor/multicore system.


[Fig. 76.4 block diagram labels: Power PC 440 system master;
Microblaze processors running software programs 1 and 2;
XPS_mailbox cores; peripherals (serial port, IIC, PWM, etc.)]

Fig. 76.5 The whole embedded evolutionary IT2FC
implemented in the program memory of the multiprocessor
system, similarly as in a desktop computer


Multi-core system 


Figure 76.4 shows a multicore system embedded into
a Virtex 5 FPGA XC5VFX70; it has the capacity to
integrate a distributed multicore system with a
hard-processor PowerPC 440 as the master, five Microblaze
32-bit soft-processor slaves, coprocessors, and
peripherals. The FPGA capacity to integrate devices is, of
course, limited by the size of the FPGA. Figure 76.5
shows the full implementation in the program memory
of the multiprocessor system.


76.4.2 Implementations into FPGAs


The architecture of FPGAs offers massive parallelism
because they are composed of a large array of
configurable logic blocks (CLBs), digital signal processing
blocks (DSPs), block RAM, and input/output blocks
(IOBs). Similarly to a processor's arithmetic logic unit
(ALU), CLBs and DSPs can be programmed to
perform arithmetic and logic operations like compare,
add/subtract, multiply, divide, etc. In a processor, ALU
architectures are fixed because they have been
designed in a general-purpose manner to execute various
operations. CLBs can be programmed using just the
operations that are needed by the application, which
results in increased computation efficiency. An FPGA
therefore consists of a set of programmable logic cells
manufactured into the device according to a
connection paradigm to build an array of computing resources;
the resulting arrangement can be classified into four
categories: symmetrical array, row-based, hierarchy-
based, and sea of gates [76.56]. Figure 76.6 shows
a symmetrical array-based FPGA that consists of a
two-dimensional array of logic blocks immersed in a set
of vertical and horizontal lines; examples of FPGAs in
this category are Spartan and Virtex from Xilinx, and
Atmel AT40K. In Fig. 76.6 three main parts can be
identified: a set of programmable logic cells also called
logic blocks (LBs) or configurable logic blocks (CLBs),
a programmable interconnection network, and a set of
input/output blocks (IOBs).

Fig. 76.6 Symmetric array-based FPGA architecture, island style
(diagram labels: interconnect, switch matrix)

Embedded programmable logic devices usually
integrate one or several processor cores, programmable
logic, and memory on the same chip (an FPGA) [76.56].
Developments in the field of FPGAs have been
remarkable in the last two decades; FPGAs started as
tiny devices with a few thousand gates that were used
in small applications such as finite state machines,
glue logic for complex devices, and very limited CPUs.
In a 10-year period, a 200% growth rate in the capacity
of Xilinx FPGA devices was observed, together with
a 50% reduction rate in power consumption, and prices
also show a significant decrease rate. Other FPGA
vendors, such as ACTEL and ALTERA, show similar
developments, and this trend still continues. These
developments, together with the progress in development
tools that include software and low-cost evaluation
boards, have boosted the acceptance of FPGAs for
different technological applications.

Fig. 76.7 IT2FC design entity (FT2KM). This top-level module
contains instances of the four fuzzy controller submodules


Development Flow 
The development flow of an FPGA-based system con- 
sists of the following major steps: 


1. Write in VHDL the code that describes the system's
logic; usually a top-down and bottom-up
methodology is used. For example, to design an IT2FC, we
need to carry out the following procedure:

(a) Describe the design entity, where the designer
defines the inputs and outputs of the top VHDL
module. The idea is to present the complex
object in different hierarchical levels of
abstraction. For our example, the top design entity is
FT2KM.

(b) Once the design entity has been defined, it is
required to define its architecture, where the
description of the design entity is given; in this
step, we define its behavior, its structure, or
a mixture of both. For the case of the IT2 FLS,
we define the system's internal behavior, so we
determined the necessity to achieve a logic
design formed by four interconnected modules:
fuzzification, inference engine, type reduction,
and defuzzification. The VHDL circuits
(submodules) are described using a register transfer
level (RTL) sequence, since we can divide the
functionality into a sequence of steps. At each
step, the circuit performs a task consisting of
data transfers between registers and the
evaluation of some conditions in order to go to the next
step; in other words, each VHDL module
(design entity) can be divided into two areas: data
and control. Each of the four modules needs
to be conceptualized, so we need to define its
own design entity and, therefore, its particular
architecture as well as its interconnections with
internal modules. This process is complete when we
have reached the last system component.

(c) Integrate the system. It is necessary to create
a main design entity (top level) that integrates
the submodules, defining their interconnections.
In Fig. 76.7 the integration of the four modules
is shown.

2. Develop the test bench in VHDL and perform RTL
simulations for each submodule of the main design
entity. It is necessary to perform timing and
functional simulations to create reliable internal design
entities.

3. Perform synthesis and implementation. In the
synthesis process, the software transforms the VHDL
constructs to generic gate-level components, such
as simple logic gates and flip-flops. The
implementation process is composed of three smaller
subprocesses: translate, map, and place and route. In
the translate process the multiple design files of
a project are merged into a single netlist. The map
process maps the generic gates in the netlist to the
FPGA's logic cells and IOBs; this process is also
known as technology mapping. In the place and
route process, using the physical layout inside the
FPGA chip, the tool places the cells in physical
locations and determines the routes to connect
diverse signals. In the Xilinx flow, the static timing
analysis performed at the end of the implementation
process determines various timing parameters such
as maximal clock frequency and maximal
propagation delay [76.57].

4. Generate the programming file and download it to 
the FPGA. According to the final netlist a configu- 
ration file is generated, which is downloaded to the 
FPGA serially. 

5. Test the design entity using a simulation program
such as Simulink of Matlab and the Xilinx system
generator (XSG) for Xilinx devices. The idea here
is first to plot the control surface in order to analyze
the general behavior of the design (a controller in
our example), and second to integrate the design
entity as a block of the desired system to be controlled.
Although this fifth step is not in the current
literature of logic design for FPGA implementation, it is
the authors' recommendation since we have
experienced good results following this practice.


Using the design entity FT2KM.vhd, which was
created and tested using the aforementioned
development flow, we can integrate it into an FPGA in two ways:


1. As a standalone system. Here, we mean an
independent system that does not require the support
of any microprocessor to work; the system itself
is a specialized circuit that can produce the
desired output. The IT2FC is implemented using the
FPGA design flow; therefore, it is programmed
using the complete development flow for a specific
application.

2. As a coprocessor. The coprocessor performs
specialized functions better and faster than the main
system processor can.
For IT2FCs, given an input, the time to produce an
output is often too long to achieve adequate
control of many plants when the IT2FC is programmed
in a high-level language, even when a parallel
programming paradigm is used. Since a coprocessor is
a dedicated circuit designed to offload the main
processor, and the FPGA can offer parallelism on the
circuit level, the designer of the IT2FC
coprocessor can have control of the controller performance.
The coprocessor can be physically separated, i.e.,
in a different FPGA circuit (or module), or it can be
part of the system, in the same FPGA circuit. In this
work, we show two methods to develop a system
with an IT2FC as a coprocessor. In both methods,
we consider that we have a tested IT2FC design
entity. In the first case, we shall use the FT2KM design
entity to incorporate the fuzzy controller as a
coprocessor of an ARM processor into a Fusion FPGA.
In the second case, we are going to create the IT2FC
IP core using the Xilinx Platform Studio; the core
will serve as a coprocessor of the MicroBlaze
processor embedded into a Spartan 6 FPGA.


76.5 Development of a Standalone IT2FC 


Figure 76.7 shows the top-level design entity (FT2KM)
of the IT2FC and its components (submodules) for
FPGA implementation. The entity codification of the
top-level entity and its components is given in
Sect. 76.5.1. All stages include the clock (clk) and
reset (rst) signals. In the defuzzifier, we have included
these two signals to illustrate that a full process takes
only four clock cycles, one for each stage. In
practice, we did not add these two signals to the
defuzzifier, since when the controller is used as a
coprocessor an 8-bit data latch is added at its output
to incorporate it into the system.
For a detailed description of the IT2FC stages
consult [76.34].

The fuzzification stage has two input variables, x1
and x2. This module contains a fuzzifier for the
upper MFs, and another for the lower MFs of the IT2FC.
For the upper part, for the first input x1, considering
that a crisp value can be fuzzified by two MFs
because it may have membership values in two contiguous
T2MFs, the linguistic terms are assigned to the VHDL
variables e1up and e2up, and their upper membership
values are ge1up and ge2up. For the second input x2,
the linguistic terms are assigned to the VHDL
variables de1up and de2up, and gde1up and gde2up are the
upper membership values. The lower part of the
fuzzifier is similar; for example, for the input variable x1 the
VHDL assigned variables are e1low and e2low, and their
lower MF values are ge1low and ge2low, etc. The
fuzzification stage entity only needs one clock cycle to perform
the fuzzification. These eight variables are the inputs of
the inference engine stage [76.58].

Fig. 76.8 A standalone IT2FC is embedded into an FPGA. The
fuzzifier reads the inputs directly from the FPGA terminals. The
defuzzifier sends the crisp output to the FPGA terminals. The
system may be embedded in the static region or in the reprogrammable
region
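
The fuzzification stage described above can be modelled in software as follows. This is only a behavioural sketch, not the VHDL entity: the five evenly spaced triangular terms, their half-widths, and the use of floating-point grades are assumptions made for illustration, while the output names follow the variables mentioned in the text.

```python
CENTERS = [0, 64, 128, 192, 256]   # assumed peaks of five linguistic terms
UPPER_HALF_WIDTH = 64              # assumed UMF support: peak +/- 64
LOWER_HALF_WIDTH = 48              # assumed LMF support: peak +/- 48


def tri(x, center, half_width):
    """Symmetric triangular membership grade in [0, 1]."""
    return max(0.0, 1.0 - abs(x - center) / half_width)


def fuzzify(x):
    """Map a crisp 8-bit input to two contiguous terms and their grades."""
    if not 0 <= x <= 255:
        raise ValueError("input must fit in 8 bits")
    e1 = min(x // 64, len(CENTERS) - 2)   # term whose peak is at or left of x
    e2 = e1 + 1                           # its right-hand neighbour
    return {
        "e1": e1,
        "e2": e2,
        "ge1up": tri(x, CENTERS[e1], UPPER_HALF_WIDTH),
        "ge2up": tri(x, CENTERS[e2], UPPER_HALF_WIDTH),
        "ge1low": tri(x, CENTERS[e1], LOWER_HALF_WIDTH),
        "ge2low": tri(x, CENTERS[e2], LOWER_HALF_WIDTH),
    }
```

Calling `fuzzify` once per input (x1 and x2) yields the two linguistic terms and the upper and lower grades that the inference stage consumes; in hardware all of these values are produced in the same clock cycle.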


The inference engine is divided into two parallel
inference engine entities: IEEup is used to manage the
upper bound of the IT2FC, and IEElow the lower
bound. Each entity has eight inputs from
the corresponding fuzzifier stage, and eight outputs;
four belong to the output linguistic terms, the rest
correspond to their firing strengths. All the inputs enter
a parallel selection VHDL process, whose circuits are
placed in parallel; the degree of parallelism
can be tailored by an adequate coding style. In our
case, all the rules are processed in parallel and the eight
outputs of each inference engine section (upper bound
and lower bound) are obtained at the same time
because the clk signal synchronizes the process; hence this
stage needs only one clock cycle to perform a whole
inference and provide the output to the next stage. In
the upper bound, the four antecedents are formed at
the same time; for example, for the first rule, the
antecedent is formed using the concatenation operator &,
so it looks like ante := e1 & de1. Each antecedent can
address up to four rules and, depending on the
combination, one of the four rules is chosen; the upper inference
engine output provides the active consequents and their
firing strengths. The lower bound of the inference
engine is treated in the same way [76.59].
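
One bound of the inference engine can be sketched in software as a table lookup addressed by the concatenated antecedent. The sketch below is an assumption-laden model, not the authors' VHDL: the 3-bit term width, the `RULES` table contents, and the use of the minimum as the firing-strength operator are all hypothetical choices made for illustration.

```python
TERM_BITS = 3  # assumed bit width of a linguistic-term index

# Hypothetical rule base: the consequent index grows with e + de, clipped to 4.
RULES = {(e << TERM_BITS) | de: min(e + de, 4)
         for e in range(5) for de in range(5)}


def infer(e1, e2, ge1, ge2, de1, de2, gde1, gde2):
    """Return the four active consequents and their firing strengths."""
    combos = [
        (e1, ge1, de1, gde1),
        (e1, ge1, de2, gde2),
        (e2, ge2, de1, gde1),
        (e2, ge2, de2, gde2),
    ]
    consequents, strengths = [], []
    for e, ge, de, gde in combos:
        ante = (e << TERM_BITS) | de      # software analogue of e & de
        consequents.append(RULES[ante])   # rule selected by the antecedent
        strengths.append(min(ge, gde))    # assumed minimum t-norm
    return consequents, strengths
```

In the FPGA the four lookups happen concurrently; the Python loop only models the result. The same function would be called twice, once with the upper grades and once with the lower grades.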

At the input of the TR, we have the equivalent values of
the pre-computed yl, i.e., the linguistic terms of the
active consequents (C1left, C2left, C3left, and C4left) and
the upper firing strengths (gC1up, gC2up, gC3up, and
gC4up), in addition to the equivalent values of the
pre-computed yr (C1right, C2right, C3right, and C4right)
and the lower firing strengths (gC1low, gC2low, gC3low, and
gC4low) [76.60]. All the above-mentioned signals go to
a parallel selection process to perform the KM algorithm
[76.39]. There are parallel blocks to obtain the average of
the upper and lower firing strengths for the active
consequents, required to obtain the average of yl and yr;
a block to obtain the different defuzzified values of yl
and yr; and parallel comparator blocks to obtain the final
result of yl and yr [76.61].

The final result of the IT2FC is obtained using the
defuzzification block, which computes the average of
yl and yr and produces the single output y.
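
To make the type-reduction and defuzzification steps concrete, the following sketch uses the standard iterative Karnik-Mendel procedure in floating point. It is not the parallel hardware version described above; the function names, the example consequent values, and the firing strengths are our own assumptions.

```python
def km_endpoint(y, f_low, f_up, left, tol=1e-9, max_iter=100):
    """Iterative Karnik-Mendel search for one end point (yl if left, else yr).

    Assumes at least one firing strength is positive.
    """
    order = sorted(range(len(y)), key=lambda i: y[i])  # KM needs ascending y
    y = [y[i] for i in order]
    f_low = [f_low[i] for i in order]
    f_up = [f_up[i] for i in order]

    def centroid(f):
        return sum(fi * yi for fi, yi in zip(f, y)) / sum(f)

    # Start from the average of the upper and lower firing strengths.
    c = centroid([(a + b) / 2 for a, b in zip(f_low, f_up)])
    for _ in range(max_iter):
        k = sum(1 for yi in y if yi <= c)  # switch point
        # Left end point: upper strengths below the switch point, lower above.
        # Right end point: the opposite assignment.
        f = f_up[:k] + f_low[k:] if left else f_low[:k] + f_up[k:]
        c_new = centroid(f)
        if abs(c_new - c) < tol:
            return c_new
        c = c_new
    return c


def defuzzify(yl, yr):
    """Crisp output: the average of the two type-reduced end points."""
    return (yl + yr) / 2


# Hypothetical pre-computed consequent values and firing strengths.
y_left = [1.0, 2.0, 3.0, 4.0]
y_right = [1.0, 2.0, 3.0, 4.0]
g_low = [0.1, 0.3, 0.3, 0.1]
g_up = [0.5, 0.8, 0.8, 0.5]

yl = km_endpoint(y_left, g_low, g_up, left=True)
yr = km_endpoint(y_right, g_low, g_up, left=False)
y_out = defuzzify(yl, yr)
```

When the lower and upper firing strengths coincide, both end points collapse to the ordinary weighted average, which is a convenient sanity check.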


76.5.1 Development of the IT2 FT2KM 
Design Entity 


Figure 76.8 shows the implementation of a static IT2FC
that can work as a standalone system. By static, we
mean that the only way to reconfigure (modify) the
FC is to stop the application and upload the whole
configuration bit file (bitstream). In this system, the
inputs of the fuzzifier and the defuzzifier output are
connected directly to the FPGA terminals. The assignment
of the terminals is made in accordance with the internal
architecture of the chosen FPGA. Hence, it is necessary
to provide the Xilinx Integrated Synthesis Environment
(ISE) program with special instructions (constraints) to
carry through the synthesis process. They are generally
placed in the user constraint file (UCF), although they
may exist in the HDL code. In general, constraints are
instructions given to the FPGA implementation tools to
direct the mapping, placement, timing, or other guidelines
that the tools follow while processing an FPGA design.
In Fig. 76.7 the overall design entity of the IT2FC
(FT2KM) was defined as follows,


entity FT2KM is
  Port(clk, reset : in std_logic;
       x1, x2 : in std_logic_vector(8 downto 1);
       y : out std_logic_vector(8 downto 1)
  );
end FT2KM;


The architecture of FT2KM has four components, and
all of them have two common input ports: clock (clk)
and reset (rst). All ports in an entity are signals by
default. This is important since a signal serves to pass
values in and out of the circuit; a signal represents
circuit interconnects (wires). A component is a simple
piece of customized code formed by entities and their
corresponding architectures, as well as library
declarations. To allow a hierarchical design, each
component must be declared before being used by another
circuit, and to use a component it is necessary to
instantiate it first. In this approach the components are:


1. The component labeled as fuzzyUpLw. It is the T2
fuzzifier, which consists of one fuzzifier for the upper
MF of the FOU and one for the lower MF. It has two
input ports, x1 and x2, and 16 outputs: e1Up to de2Low
and ge1Up to gde2Low.


component fuzzyUpLw is
  port(clk, reset : in std_logic;
       x1, x2 : in std_logic_vector(n downto 1);
       e1Up, e2Up, de1Up, de2Up, e1Low,
       e2Low, de1Low,
       de2Low : out std_logic_vector(3 downto 1);
       ge1Up, ge2Up, gde1Up, gde2Up, ge1Low,
       ge2Low, gde1Low,
       gde2Low : out std_logic_vector(n downto 1)
  );
end component;
The instantiation of this component is achieved using
nominal mapping, and the name of this instance is fuzt2.
Note that the ports clk, reset, x1, and x2 are mapped
(connected) directly to the design entity FT2KM since,
as we explained before, all ports are signals by default,
which represent wires. The piece of code that defines the
instantiation of the fuzzyUpLw component is as follows,


fuzt2 : fuzzyUpLw port map(
  clk => clk, reset => reset, x1 => x1, x2 => x2,
  e1Up => e1upsig, e2Up => e2upsig, de1Up => de1upsig,
  de2Up => de2upsig, ge1Up => ge1upsig, ge2Up => ge2upsig,
  gde1Up => gde1upsig, gde2Up => gde2upsig, e1Low => e1lowsig,
  e2Low => e2lowsig, de1Low => de1lowsig, de2Low => de2lowsig,
  ge1Low => ge1lowsig, ge2Low => ge2lowsig, gde1Low => gde1lowsig,
  gde2Low => gde2lowsig
);


2. The component Infer_type_2 corresponds to the T2
inference engine of the controller. It has 16 inputs that
match the 16 outputs of the fuzzification stage. This
component has 16 outputs to be connected to the
type reduction stage. The piece of code to include
this component is:


component Infer_type_2 is
  port(rst, clk : in std_logic;
       e1, e2, de1, de2, e1_2, e2_2, de1_2, de2_2 : in STD_LOGIC_VECTOR (m downto 1);
       g_e1, g_e2, g_de1, g_de2, g_e1_2, g_e2_2,
       g_de1_2, g_de2_2 : in STD_LOGIC_VECTOR (n downto 1);
       c1, c2, c3, c4, c1_2, c2_2, c3_2, c4_2 : out STD_LOGIC_VECTOR (m downto 1);
       gc1_2, gc2_2, gc3_2, gc4_2, gc1, gc2, gc3, gc4 : out STD_LOGIC_VECTOR (n downto 1)
  );
end component;


This component is instantiated with the name inft2
as follows,


inft2 : Infer_type_2 port map(
  rst => reset, clk => clk, e1 => e1upsig, e2 => e2upsig, de1 => de1upsig,
  de2 => de2upsig, g_e1 => ge1upsig, g_e2 => ge2upsig, g_de1 => gde1upsig,
  g_de2 => gde2upsig, e1_2 => e1lowsig, e2_2 => e2lowsig, de1_2 => de1lowsig,
  de2_2 => de2lowsig, g_e1_2 => ge1lowsig, g_e2_2 => ge2lowsig, g_de1_2 => gde1lowsig,
  g_de2_2 => gde2lowsig, c1 => c1sig, c2 => c2sig, c3 => c3sig, c4 => c4sig,
  gc1 => gc1sig, gc2 => gc2sig, gc3 => gc3sig, gc4 => gc4sig, c1_2 => c12sig,
  c2_2 => c22sig, c3_2 => c32sig, c4_2 => c42sig, gc1_2 => gc12sig,
  gc2_2 => gc22sig, gc3_2 => gc32sig, gc4_2 => gc42sig
);


To connect the instances fuzt2 and inft2 it is
necessary to define some signals (wires),


signal e1upsig, e2upsig, de1upsig, de2upsig : std_logic_vector (m-1 downto 0);
signal ge1upsig, ge2upsig, gde1upsig, gde2upsig : std_logic_vector (7 downto 0);
signal e1lowsig, e2lowsig, de1lowsig, de2lowsig : std_logic_vector (m-1 downto 0);
signal ge1lowsig, ge2lowsig, gde1lowsig, gde2lowsig : std_logic_vector (7 downto 0);

3. The component TypeRed corresponds to the type
reduction stage of the T2FC. It has 16 inputs that
should be connected to the inference engine’s outputs,
and it has two outputs, yr and yl, that should be
connected to the defuzzifier through signals once both
have been instantiated. The piece of code to include
this component is:


component TypeRed is
  Port (clk, rst : in std_logic;
        c1, c2, c3, c4, c1_2, c2_2, c3_2, c4_2 : in STD_LOGIC_VECTOR (3 downto 1);
        gc1, gc2, gc3, gc4, gc1_2, gc2_2, gc3_2, gc4_2 : in STD_LOGIC_VECTOR (7 downto 0);
        yl, yr : out std_logic_vector (8 downto 1));
end component;


This component is instantiated with the name trkm 
as follows, 


trkm : TypeRed port map(
  clk => clk, rst => reset, c1 => c1sig, c2 => c2sig, c3 => c3sig, c4 => c4sig,
  c1_2 => c12sig, c2_2 => c22sig, c3_2 => c32sig, c4_2 => c42sig,
  gc1 => gc1sig, gc2 => gc2sig, gc3 => gc3sig, gc4 => gc4sig,
  gc1_2 => gc12sig, gc2_2 => gc22sig, gc3_2 => gc32sig, gc4_2 => gc42sig,
  yl => ylsig, yr => yrsig
);


The signals that connect the instance inft2
to the instance trkm are


signal c1sig, c2sig, c3sig, c4sig : std_logic_vector (m-1 downto 0);
signal gc1sig, gc2sig, gc3sig, gc4sig : std_logic_vector (7 downto 0);
signal c12sig, c22sig, c32sig, c42sig : std_logic_vector (m-1 downto 0);
signal gc12sig, gc22sig, gc32sig, gc42sig : std_logic_vector (7 downto 0);


4. The last component, defit2, corresponds to the
defuzzifier stage of the T2FLC. It has two inputs and
one output.


component defit2 is
  Port ( yl, yr : in std_logic_vector (n-1 downto 0);
         y : out std_logic_vector (n-1 downto 0));
end component;


This component is instantiated with the name dfit2 
as follows, 


dfit2 : defit2 port map(yl => ylsig, yr => yrsig, y => y); 


We did not define any signal for the port y since 
it can be connected directly to the entity of design 
FT2KM. The instances trkm and dfit2 are connected 
using the following signals, 


signal ylsig, yrsig : std_logic_vector (n-1 downto 0);

This approach to implementing an IT2FC provides the
fastest response. The whole process, consisting of
fuzzification, inference, type reduction, and
defuzzification, is achieved in four clock cycles, which
for a Spartan family implementation using a 50 MHz clock
represents 80 × 10⁻⁹ s, and for a Virtex 5 FPGA-based
system is 40 × 10⁻⁹ s.
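
The latency figures follow directly from the cycle count and the clock frequency, as the short check below illustrates. The function name is ours, and the 100 MHz clock for the Virtex 5 is an assumption inferred from the quoted 40 ns figure; the text does not state it.

```python
def pipeline_latency_s(clock_hz: float, cycles: int = 4) -> float:
    """Latency of a pipeline that needs `cycles` clock periods."""
    return cycles / clock_hz


spartan_latency = pipeline_latency_s(50e6)    # four cycles at 50 MHz
virtex5_latency = pipeline_latency_s(100e6)   # assumed 100 MHz clock
```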


76.6 Developing of IT2FC Coprocessors 


An IT2FC embedded into an FPGA can certainly be the
option that offers the best performance and flexibility.
As we shall see, the best performance is obtained when
the embedded FC is used as a standalone system.
Unfortunately, this gain in performance has some
drawbacks: the controller is hard to use for people who
were not involved in its design process or who are not
familiar with VHDL coding, and the code owners may simply
want to keep the code secret. All these obstacles can be
overcome by the use of IP cores. Next, we shall explain
two methods of implementing an IT2FC as a coprocessor.


76.6.1 Integrating the IT2FC Through 
Internal Ports 


In Fig. 76.9, we show a control system that integrates
the FT2KM design entity, embedded into the Actel
Fusion FPGA [76.62], as a coprocessor of an ARM
processor. This FPGA allows incorporating the ARM Cortex
soft processor, as well as other IP cores, to make
a custom configuration. The embedded system contains the
ARM processor, two memory blocks, timers, an interrupt
controller (IC), a Universal Asynchronous
Receiver/Transmitter (UART) serial port, IIC, a pulse width
modulator/tachometer block, and a general-purpose
input/output (GPIO) interface to the FT2KM block. All the
factory embedded components are soft IP cores. The FT2KM
is a VHDL module that, together with the GPIO, forms the
Ft2km_core soft coprocessor, which is handled as an IP
core; however, in this case, it is necessary to have the
VHDL code. The system also includes a DC motor with
a high-resolution quadrature optical encoder, the system’s
power supply, an H-bridge for power direction, a personal
computer, and a digital display.

The Ft2km_core has six inputs and two outputs. The
inputs are error, c.error, ce, rst, w, and clk. The 8-bit
inputs error and c.error are the controller inputs for the
error and change of error values. The ce input is used to
enable/disable the fuzzy controller, the rst input resets
all the internal registers of the IT2FC, and the w input
starts a fuzzy inference cycle. The outputs are out and
IRQ/RDY; the first one is the crisp output value, which is
8 bits wide. IRQ/RDY is produced when the output data
corresponding to the respective input is ready to be read.
IRQ is a pulse used to request an interrupt, whereas RDY
is a signal that can be programmed to be active at the high
or low logic level, indicating that a valid output was
produced; this last signal can be used in polling mode. In
Fig. 76.9 we used only 1 bit for the IRQ/RDY signal; when
designing the system, the designer has to decide on one
method. It is possible to use both by modifying the logic,
or by separating the signal and adding an extra 1-bit
output.

The GPIO IP has two 32-bit wide ports, one for input
(read bus) and one for output (write bus). The output
bus connects the GPIO IP to the ARM Cortex using the
32-bit APB bus. The input bus connects the IT2FC IP to
the GPIO IP. The ARM Cortex uses the Ft2km_core as
a coprocessor.


76.6.2 Development of IP Cores 


In Sect. 76.6.1, we showed how to integrate the fuzzy 
coprocessor through an input/output port, i.e., the IP 
GPIO. We also commented on the existence of IP cores 
such as the UART and the timers that are connected 
directly to the system bus as in any microcontroller 
system with integrated peripherals. In this section, we 
shall show how to implement an IT2FC connected to 
the system bus to obtain an IT2FC IP core integrated to 
the system architecture. The procedure is basically the 
same for any FPGA of the Xilinx family. We worked 
with the Spartan 6 and Virtex 5, so the Xilinx ISE De- 
sign Suite was used. 

The whole process to start an application that includes
a microprocessor and a coprocessor can be broadly divided
into five steps:


1. Design and implement the design entity that will be
integrated as an IP core in further steps, following
the development flow explained in Sect. 76.4.2. In
our case, the design entity is FT2KM.

2. Create the basic embedded microcontroller system
tailored for our application. We already know the
kind and amount of memory that we will need, as
well as the peripherals. This step is achieved as follows:
we create the microprocessor system using the
base system builder (BSB) of the Xilinx Platform
Studio (XPS) software. The system contains a MicroBlaze
softcore, 16 KB of local memory, the data controller bus
(dlmb_cntlr), and the instruction controller bus
(ilmb_cntlr).

3. Create the IP core, which should contain the desired
design entity, in our case the FT2KM. This step is
achieved using the Import Peripheral Wizard found in the
Hardware option in the XPS. The idea is to connect the
FT2KM design entity to the processor local bus (PLB V4.6)
through three registers, one for each input (two
registers) and one for the output. Upon completion, this
tool will create a synthesizable HDL file (ft2km_core)
that implements the intellectual property interface (IPIF)
services required and a stub user_logic_module.
These two modules are shown in Fig. 76.10. The
IPIF connects the user logic module to the system bus,
which is either the PLB or the on-chip peripheral bus
(OPB). At this stage, we will need to use the ISE Project
Navigator software to integrate into the user_logic_module
all the required files that implement the FT2KM design
entity. Edit the User_Logic_I.vhd file to define the FT2KM
component and signals. Open the ftk2_core.vhd file and
create the ftk2_core entity and user logic. Synthesize the
HDL code and exit ISE. Return to the XPS and add the
FTK2_core IP to the embedded system, connect the new IP
core to the mb_plb bus system, and generate the addresses.
Figure 76.10 shows the IT2FC IP core; the IPIF consists of
the PLB V4.6 bus controller that provides the necessary
signals to interface the IP core to the embedded soft core
bus system.

4. Design the drivers (software) to handle this design
entity as a peripheral.

5. Design the application software to use the design
entity.


Fig. 76.9 A coprocessor implemented into the Actel Fusion FPGA. The system has an ARM processor, the IT2FC
coprocessor implemented through the general-purpose input/output port, and some peripherals


Fig. 76.10 IP Core implementation of a user defined
peripheral. The IT2FC coprocessor is implemented into the
user logic module. This module achieves communication
with the rest of the system through the PLB or the on-chip
peripheral bus OPB. For a static coprocessor, use the PLB.
For an implementation in the reconfigurable region, use the
OPB


76.7 Implementing a GA in an FPGA


In essence, evolution is a two-step process of random
variation and selection of a population of individuals
that responds with a collection of behaviors to the
environment. Selection tends to eliminate those individuals
that do not demonstrate an appropriate behavior. The
survivors reproduce and combine their features to obtain
better offspring. In replication, random mutation
always occurs, which introduces novel behavioral
characteristics. The evolution process optimizes behavior,
and this is a desirable characteristic for a learning
system. Although the term evolutionary computation dates
back to 1991, the field has decades of history, genetic
algorithms being one avenue of investigation in
simulated evolution [76.63]. GAs are a family of
computational models that imitate the principles of natural
evolution. For consistency they adopt biological
terminology to describe operations. There are six main
steps of a GA: population initialization, evaluation of
candidates using a fitness function, selection, crossover,
mutation, and termination judgment, as shown in
Algorithm 76.1. The first step is to decide how to code
a solution to the problem that we want to optimize; hence,
each individual is represented using a chromosome that
contains the parameters. Common encodings of solutions are
binary, integer, and real value. In binary encoding, every
chromosome is a string of bits. In real-value encoding,
every chromosome is a string that can contain one or
several parameters encoded as real numbers. Algorithm 76.1
starts by initializing a population with random solutions,
and then each individual of the population is evaluated
using a fitness function, which is selected according to
the optimization goals. For example, for tuning
a controller it may be enough to check whether the actual
controller output minimizes the error between the target
and the reference. However, one or more complex fitness
functions can be designed in order to carry out the
control goal. In steps 3 to 5 the genetic operations are
applied, i.e., selection, crossover (recombination), and
mutation. In step 6, the termination criteria are checked,
stopping the procedure if such criteria have been
fulfilled.


Algorithm 76.1 General scheme of a GA
initialize population with random candidate solutions
evaluate each candidate
repeat
    select parents
    recombine pairs of parents
    mutate the resulting offspring
    evaluate new candidates
    select individuals for the next generation
until termination condition is satisfied
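

The general scheme of Algorithm 76.1 can be sketched in a few lines of software. The version below is a minimal illustration under our own assumptions, not the implementation used in this chapter: it uses binary encoding, binary tournament selection, one-point crossover, bit-flip mutation, a fixed generation budget as the termination condition, and a single elite individual; all names and parameter values are hypothetical.

```python
import random


def run_ga(fitness, n_bits=16, pop_size=20, generations=50,
           p_cross=0.9, p_mut=0.02, seed=0):
    rng = random.Random(seed)
    # initialize population with random candidate solutions
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    # evaluate each candidate
    scores = [fitness(ind) for ind in pop]
    for _ in range(generations):  # termination: fixed generation budget
        offspring = []
        while len(offspring) < pop_size:
            # select parents (binary tournament)
            p1 = max(rng.sample(range(pop_size), 2), key=lambda i: scores[i])
            p2 = max(rng.sample(range(pop_size), 2), key=lambda i: scores[i])
            a, b = pop[p1][:], pop[p2][:]
            # recombine pairs of parents (one-point crossover)
            if rng.random() < p_cross:
                cut = rng.randrange(1, n_bits)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            # mutate the resulting offspring (bit flip)
            for child in (a, b):
                for j in range(n_bits):
                    if rng.random() < p_mut:
                        child[j] ^= 1
                offspring.append(child)
        offspring = offspring[:pop_size]
        # evaluate new candidates
        new_scores = [fitness(ind) for ind in offspring]
        # select individuals for the next generation: the offspring replace
        # the parents, except that the best parent overwrites the worst child
        best = max(range(pop_size), key=lambda i: scores[i])
        worst = min(range(pop_size), key=lambda i: new_scores[i])
        offspring[worst], new_scores[worst] = pop[best], scores[best]
        pop, scores = offspring, new_scores
    best = max(range(pop_size), key=lambda i: scores[i])
    return pop[best], scores[best]


# Toy fitness: the number of ones in the chromosome.
best_individual, best_score = run_ga(fitness=sum)
```

For controller tuning, the toy fitness would be replaced by a function that decodes the chromosome into controller parameters and scores the resulting closed-loop error.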


In this work, we have chosen a GA to evolve
the IT2FC. However, the ideas presented here are valid
for most evolutionary and natural computing methods.
There are two methods to implement any evolutionary
algorithm. One is based on executing software written in
a computer language such as C/C++, as is done on
a desktop computer. The second method is based on
designing specialized hardware using an HDL. Both have
advantages and disadvantages. The first method is the
easier one, since there is much information about coding
different EAs in a high-level language. However, this
solution may have limitations for real-time systems, since
software implementations are slower than hardware
implementations by at least a factor of magnitude of five.
On the other hand, state machine hardware-based designs
are more complex to implement and use. In this section we
shall present a small overview of both methods.


Fig. 76.11 High-level view of the structure of a GA for
FPGA implementation


76.7.1 GA Software Based Implementations 


It is well known that a GA can run in parallel, taking ad- 
vantage of the two types of known parallelism: data and 
control parallelism. Data parallelism refers to executing 
one process over several instances of the EA, while con- 
trol parallelism works with separate instances. 
Coarse-grained parallelism and fine-grained parallelism
are two methods often associated with the use of EAs in
parallel. The use of both methods is called a hybrid
approach. In coarse-grained parallelism, the EA cores work
in conjunction to solve a problem: each node swaps
individuals of its population with another node running
the same problem, which improves diversity. The amount of
information, frequency of exchange, direction, data
pattern, etc., are factors that can affect the efficiency
of this approach.

In fine-grained parallelism, the approach is to share
mating partners instead of populations. The populations
across the parallel cores mate their fittest members with
the fittest found in a neighboring node’s population.
Then, the offspring of the selected individuals are
distributed. This next generation can go to one of the
parents’ populations, to both parents’ populations, or to
all cores’ populations, depending on the means of
distribution.

Figure 76.4 shows a six-core architecture design 
for the Virtex 5. Here, we can make fine or coarse- 
grained implementations of an EA. For example, for 
coarse-grained implementation, the island model with 
one processor per island can be used. 
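
A minimal sketch of one migration step in such an island model is given below. The ring topology, the replace-the-worst policy, and all names are our own assumptions for illustration; the text does not prescribe a particular migration scheme.

```python
def migrate_ring(islands, fitness, n_migrants=1):
    """One migration step of a coarse-grained (island) model.

    Each island sends copies of its best individuals to the next island
    in a ring, where they replace that island's worst individuals.
    """
    # Choose all emigrants first, so that migration within a step does not
    # depend on the order in which the islands are visited.
    emigrants = [sorted(isl, key=fitness, reverse=True)[:n_migrants]
                 for isl in islands]
    for src, migrants in enumerate(emigrants):
        dst = (src + 1) % len(islands)
        islands[dst].sort(key=fitness)  # worst individuals first
        islands[dst][:n_migrants] = [m[:] for m in migrants]
    return islands


# Two tiny islands of 2-bit chromosomes; fitness is the number of ones.
demo = migrate_ring([[[0, 0], [1, 1]], [[0, 0], [0, 1]]], fitness=sum)
```

The migration interval and the number of migrants are the tuning knobs that correspond to the frequency and amount of exchanged information mentioned above.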


76.7.2 GA Hardware Implementations 


Figure 76.11 shows a high-level view of the architecture
of a GA for hardware implementation. The system has eight
basic kinds of modules: a selection module, a crossover
module, a mutation module, a fitness evaluation module,
a control module, an observer module, four random number
generator (RNG) modules, and two random access memory
(RAM) modules.

The control module is a Mealy state machine designed
to feed all other modules with the necessary control
signals to synchronize the algorithm execution. The
selection module can implement any existing selection
method, for example the Roulette Wheel Selection
Algorithm. This method picks the parents from the current
population, and the parents are processed to create new
individuals. At the current generation, the crossover and
mutation modules perform the corresponding genetic
operations on the selected parents. The fitness evaluation
module computes the fitness of each offspring and applies
elitism to the population. The observer module determines
the stopping criterion and observes its fulfilment. RNGs
are indispensable to provide the randomness that EAs
require. Additionally, RAM 1 is necessary to store the
current population and RAM 2 to store the selected parents
of each generation.
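
For reference, roulette wheel selection can be sketched in software as follows. This is a generic fitness-proportionate selection routine written for illustration, not the hardware selection module of Fig. 76.11; the function name and its interface are our own.

```python
import random
from itertools import accumulate


def roulette_wheel(fitnesses, rng=random):
    """Return the index of one parent, chosen with probability
    proportional to its (non-negative) fitness."""
    cumulative = list(accumulate(fitnesses))
    total = cumulative[-1]
    if total <= 0:
        raise ValueError("at least one fitness value must be positive")
    spin = rng.random() * total  # position of the pointer on the wheel
    for index, edge in enumerate(cumulative):
        if spin < edge:
            return index
    return len(fitnesses) - 1  # guard against rounding at the upper edge
```

In a hardware version the call to `rng.random()` would be served by one of the RNG modules, and the cumulative sums could be stored alongside the population in RAM.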


76.8 Evolving Fuzzy Controllers


In Sect. 76.1 the general structure of an EFRBS was
presented. It was mentioned that the common denominator
in most learning systems is their capability of
making structural changes to themselves over time to
improve their performance for defined tasks. It was also
mentioned that the two classical approaches for fuzzy
learning systems are the Michigan and Pittsburgh
approaches, and that there exist newer proposals with the
same target. Although programming a learning system on
a computer using a high-level language, such as C/C++,
requires some skill, system knowledge, and experimentation,
there are no technical problems with achieving
a system with such characteristics. This can also be
true for a hardware implementation: if the EFRBS is
developed in C/C++ and executed by a hard or soft
processor such as PowerPC or MicroBlaze, it is done in much
the same way as on a computer. How to develop a coprocessor
was explained in Sect. 76.6. The coprocessor was
developed in the FPGA’s static (base) region, which
cannot be changed during a partial reconfiguration
process. Therefore, such coprocessors cannot undergo any
structural change. Achieving an EFRBS in hardware is
quite different from achieving it using a high-level
language, because it is more difficult to change the
circuitry than to modify lines of a program.


Fig. 76.12 The FPGA is divided into two regions: static and
reconfigurable. The soft processor and peripherals are in the static region.
Different fuzzy controller architectures are in the reconfigurable
region. The bus macros are fixed data paths for signals going between
a reconfigurable module and another module

FPGAs are reprogrammable devices that need a design
methodology to be successfully used as reconfigurable
devices. Since there are several vendors with different
architectures, the methodology usually changes
from vendor to vendor and from device to device. For the
Xilinx FPGAs the configuration memory is volatile, so it
needs to be configured every time that it is powered up by
uploading the configuration data, known as the bitstream.
Configuring the FPGA this way is not useful for many
applications that need to change their behavior while they
are still working online. A solution to overcome this
limitation is to use partial reconfiguration, which splits
the FPGA into two kinds of regions. The static (base)
region is the portion of the design that does not change
during partial reconfiguration; it may include logic that
controls the partial reconfiguration process. In other
words, partial reconfiguration (PR) is the ability to
reconfigure selected areas of an FPGA any time after its
initial configuration [76.64]. It can be divided into two
groups: dynamic partial reconfiguration (DPR) and static
partial reconfiguration (SPR). DPR is also known as active
partial reconfiguration. It allows changing a part of the
device while the rest of the FPGA is still running. DPR
is used to allow the FPGA to adapt to changing
algorithms and enhance performance, or for critical
missions that cannot be disrupted while some subsystems
are being redefined. On the other hand, in SPR the
static section of the FPGA needs to be stopped, so
autoreconfiguration is impossible (Fig. 76.12).

For Xilinx FPGAs, there are basically three ways 
to achieve DPR for devices that support this feature. 
The two basic styles are difference-based partial re- 
configuration and module-based partial reconfiguration. 
The first one can be used to make small changes to
the design; the partial bitstream only contains
information about differences between the current design
structure that resides in the FPGA and the new content
of the FPGA. Since the bitstream differences are
usually small, the changes can be made very quickly.
Module-based partial reconfiguration is useful for
reconfiguring large blocks of logic using modular design
concepts. The third style is also based on modular
design but is more flexible and less restrictive. This new
style was introduced by Xilinx in 2006, and it is known
as early access partial reconfiguration (EAPR) [76.65,
66]. There are two key differences between the EAPR
design flow and the module-based one. (1) In the
EAPR flow the shape and size of partially reconfig- 
urable regions (PRRs) can be defined by the user. Each 
PRR has at least one, and usually multiple, partially re- 
configurable modules (PRMs) that can be loaded into 
the PRR. (2) For modules that communicate with each 
other, a special bus macro allows signals to cross over 
a partial reconfiguration boundary. This is an important 
consideration, since without this feature intermodule 
communication would not be feasible, as it is impos- 
sible to guarantee routing between modules. The bus 
macro provides a fixed bus of inter-design communi- 
cation. Each time partial reconfiguration is performed, 
the bus macro is used to establish unchanging routing 
channels between modules, guaranteeing correct con- 
nections [76.65]. 

An important core that enables embedded micropro- 
cessors such as MicroBlaze and PowerPC to achieve 
reconfiguration at run time is HWICAP (hardware in- 
ternal configuration access point) for the OPB. The 
HWICAP allows the processors to read and write the 
FPGA configuration memory through the ICAP (in- 
ternal configuration access point). Basically it allows 
writing and reading the configurable logic block (CLB) 
look-up table (LUT) of the FPGA. 

The process to achieve reconfigurable computing
with application to IT2FC will be explained in more
detail in Sect. 76.8.2. Moreover, how to evolve an
IT2FC embedded into an FPGA, whether it resides in
the static or in the reconfigurable region, will also be
explained there.


76.8.1 EAPR Flow for Changing 
the Controller Structure 


Figure 76.12 shows the basic idea of using the EAPR flow
for reconfigurable computing to change from one IT2FC
structure to a different one. In this figure the MicroBlaze
soft processor can evaluate each controller structure
according to single or multiobjective criteria. The
processor communicates with a PR region using the bus
macro, which provides a means of locking the routing
between the PRM and the base design. The system can
achieve fast reconfiguration operations since partial
bitstreams are transferred between the FPGA and the
compact flash memory (CF) where the bitstreams are stored.


In general, the EAPR design flow is as fol- 
lows [76.64, 67, 68]: 


1. Hardware description language design and synthe- 
sis. The first steps in the EAPR design flow are very 
similar to the standard modular design flow. We can 
summarize this in three steps: 

(a) Top-level design. In this step, the design de- 
scription must only contain black-box instanti- 
ations of lower-level modules. Top-level design 
must contain: I/O instantiations, clock primi- 
tives instantiations, static module instantiations, 
PR module instantiations, signal declarations, 
and bus macro instantiations, since all non- 
global signals between the static design and the 
PRMs must pass through a bus macro. 

(b) Base design. Here, the static modules of the 
system contain logic that will remain constant 
during reconfiguration. This step is very simi- 
lar to the design flow explained in Sect. 76.4.2. 
However, the designer must consider input and 
output assignment rules for PR. 

(c) PRM design. Similarly to static modules, PR 
modules must not include global clock sig- 
nals either, but may use those from top-level 
modules. When designing multiple PRMs to 
take advantage of the same reconfigurable area, 
for each module, the component name and 
port configuration must match the reconfig- 
urable module instantiation of the top-level 
module. 

2. Set design constraints. In this step, we need to 
place constraints in the design for place and route 
(PAR). The constraints included are: area group, 
reconfiguration mode, timing constraint, and loca- 
tion constraints. The area group constraint specifies 
which modules in the top-level module are static 
and which are reconfigurable. Each module instanti- 
ated by the top-level module is assigned to a group. 
The reconfiguration mode constraint is only applied 
to the reconfigurable group, which specifies that the 
group is reconfigurable. Location constraints must 
be set for all pins, clocking primitives, and bus 
macros in top-level design. Bus macros must be lo- 
cated so that they straddle the boundary between the 
PR region and the base design. 

3. Implement base design. Before the implementation
of the static modules, the top level is translated
to ensure that the constraints file has been created
properly. The information generated by implementing
the base design is used for the PRM implementation
step. Base design implementation follows
three steps: translate, map, and PAR.

4. Implement PRMs. Each of the PRMs must be
implemented separately within its own directory,
and follows the base design implementation steps, i.e.,
translate, map, and PAR.

5. Merge. The final step in the partial reconfiguration 
flow is to merge the top level, base, and PRMs. Dur- 
ing the merge step, a complete design is built from 
the base design and each PRM. In this step, many 
partial bitstreams for each PRM and initial full bit- 
streams are created to configure the FPGA. 


Partial dynamic reconfigurable computing allows us
to achieve online reconfiguration. By selecting a certain
bitstream, it is possible to change the full controller
structure, or any of the stages (fuzzification, inference
engine, type reduction, and defuzzification), as well
as any individual section of each stage, for example,
different membership functions for the fuzzification
stage. However, all the reconfigurable modules must be
synthesized beforehand because they are loaded using
partial bitstreams. Therefore, to have the capability to
evolve reconfigurable modules, we need to provide them
with a control register (CR) to change the desired
parameters.

Next, a flexible coprocessor (FlexCo) prototype of
an IT2FC (FlexCo IT2FC) is presented that can be
implemented either in the static region or in the
PR region.


76.8.2 Flexible Coprocessor Prototype of an IT2FC


Figure 76.13 illustrates the FlexCo IT2FC, which contains
the four stages (fuzzification, inference engine,
type reduction, and defuzzification). Depending on the
target region, they are connected to the PLB or to the
OPB through a 32-bit command register (CR), which is
formed by four 8-bit registers named R1 to R4
(Fig. 76.14). The parameters of each stage can be changed
by the programmer since they are no longer static, as
they were when defined previously for the FT2KM
(Sect. 76.5). Now, they are volatile registers connected
through signals to save parameter values. The processor
(MicroBlaze) can send two kinds of commands to the CR
through the PLB or the OPB: control words (CWs)
and data words (DWs). The state machine of the FlexCo
IT2FC interprets the command.

Figure 76.14 illustrates the CR coding for static
and reconfigurable FC. This register is used to perform
parameter modification in both modes, static and
reconfigurable. In general, bit 7 of R4 is used to
differentiate between a CW and a DW: 1 means a CW,
whereas 0 means a DW. The StaGe bits (SG-bits) serve to
identify the IT2FC stage that is to be modified.

Fig. 76.13 Flexible coprocessor proposal of an IT2FC for
the static region

Fig. 76.14 The control register is used for both styles of
implementation, in the static region or in the
reconfigurable region


Fig. 76.15 A multiprocessor system (MPS) with an operating
system resides in the static region of the FPGA. The GA
resides in the program memory and is executed by the MPS.
The IT2FC may be implemented in the reconfigurable region
(Fig. 76.16) or in the static region (Fig. 76.13)


• SG-bits = 00: The fuzzification stage has been chosen;
  it is then necessary to set the Ant/Con bit to 1 to
  indicate that the antecedent MFs are going to be
  modified. With the section bit (S-bit) we indicate which
  part of the FOU (upper or lower) will be modified. The
  linguistic-variable-term/active bit (LVT/Active)
  indicates whether we want to modify a linguistic
  variable (LV) or a linguistic term (LT); the Act option
  is for the inference engine (IE). In accordance with the
  LV/LT bit value, in register R3 we set the number of the
  LV or the LT that will be changed. Finally, registers R1
  and R2 give the parameter value of the LV or the LT, with
  R1 being the least significant byte.

• SG-bits = 01: With this setting, the state machine
  identifies that the IE will be modified. It works
  in conjunction with Ant/Con, the S-bit, and the
  registers R1, R2, and R3. Set a 0 value in the
  Ant/Con bit to change the consequent parameters
  of a Mamdani inference system, choose the upper or
  lower MF with the S-bit, indicate the number of the MF
  using R3, and set the corresponding value with R1 and R2
  for a static implementation. It is possible to activate
  and deactivate rules using the LVT/Active bit. With the
  dynamic change/activate-deactivate bit (DC/AD), it is
  possible to change the combination of antecedents and
  consequents of a specific rule, provided that we have
  made this part flexible by using registers. For an
  implementation in the reconfigurable region, it is
  possible to add or remove rules. These two features need
  to work in conjunction with registers R1, R2, and R3.

• SG-bits = 10: This selection is to modify the type
  reduction stage. It is possible to have more than
  one type reducer. By setting the DC/AD-bit to 1,
  we indicate that we wish to change the method
  at running time without the need for a
  reconfiguration process that implies uploading
  partial bitstreams. The methods can be selected using
  register R3. By using a DC/AD-bit equal to 0
  and LVT/Act equal to 0, in combination with registers
  R1 to R3, we can indicate that we wish to change
  the preloaded values that the KM algorithm needs
  to achieve the TR.

• SG-bits = 11: Similarly to the type reduction stage,
  we can change the defuzzifier at running time.

Fig. 76.16 Flexible coprocessor proposal of an IT2FC for
the reconfigurable region
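The command coding described in the list above can be sketched in software. The text fixes only a few facts: the CR consists of four 8-bit registers R1 to R4, bit 7 of R4 separates control words from data words, R3 carries the number of the LV, LT, or MF, and R1 and R2 carry the parameter value with R1 as the least significant byte. Everything else in the following sketch is an assumption made for illustration: that R4 is the most significant byte of the 32-bit word, and the positions chosen for the SG, Ant/Con, S, LVT/Act, and DC/AD bits inside R4. The function names `pack_command` and `unpack_command` are hypothetical.

```python
# Sketch of packing a FlexCo IT2FC command into the 32-bit command register.
# From the text: CR = four 8-bit registers R1..R4; bit 7 of R4 marks a control
# word; R3 = LV/LT/MF number; R1 (low byte) and R2 (high byte) = value.
# ASSUMED: R4 is the top byte, and all bit positions below except CW_BIT.

CW_BIT = 7        # from the text: bit 7 of R4 flags a control word
SG_SHIFT = 5      # assumed: SG-bits occupy R4 bits 6..5
ANT_CON_BIT = 4   # assumed position
S_BIT = 3         # assumed position
LVT_ACT_BIT = 2   # assumed position
DC_AD_BIT = 1     # assumed position

STAGES = {
    "fuzzification": 0b00,
    "inference": 0b01,
    "type_reduction": 0b10,
    "defuzzification": 0b11,
}


def pack_command(stage, value, index, *, is_control=True, ant_con=0,
                 section=0, lvt_act=0, dc_ad=0):
    """Return the 32-bit CR word for one command."""
    if not 0 <= value <= 0xFFFF or not 0 <= index <= 0xFF:
        raise ValueError("value must fit in 16 bits and index in 8 bits")
    r4 = (int(is_control) << CW_BIT) | (STAGES[stage] << SG_SHIFT)
    r4 |= (ant_con << ANT_CON_BIT) | (section << S_BIT)
    r4 |= (lvt_act << LVT_ACT_BIT) | (dc_ad << DC_AD_BIT)
    r1, r2 = value & 0xFF, (value >> 8) & 0xFF
    return (r4 << 24) | (index << 16) | (r2 << 8) | r1


def unpack_command(word):
    """Split a 32-bit CR word back into its fields."""
    r1, r2 = word & 0xFF, (word >> 8) & 0xFF
    r3, r4 = (word >> 16) & 0xFF, (word >> 24) & 0xFF
    return {
        "is_control": bool((r4 >> CW_BIT) & 1),
        "sg_bits": (r4 >> SG_SHIFT) & 0b11,
        "ant_con": (r4 >> ANT_CON_BIT) & 1,
        "section": (r4 >> S_BIT) & 1,
        "lvt_act": (r4 >> LVT_ACT_BIT) & 1,
        "dc_ad": (r4 >> DC_AD_BIT) & 1,
        "index": r3,
        "value": (r2 << 8) | r1,
    }
```

A state machine on the coprocessor side would perform the equivalent of `unpack_command` in hardware; the actual field layout must be taken from Fig. 76.14 rather than from this sketch.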


With respect to the type reducer and defuzzification
stages, we give the option to have more than one
module, which has the advantage of making the process
easier and possible for static designs, but the
disadvantage is that the design will consume more
macrocells, increasing the cost of the required FPGAs,
boards, and power consumption. Next, we will explain the
implementation of the FlexCo IT2FC for the static region
and the reconfigurable region.

Fig. 76.17 This design may be implemented in both regions to have
a dynamic reconfigurable system. For a static implementation, the
system must have registers for all the variable parameters to make
it possible to change their values (Fig. 76.13)


Implementing the FlexCo IT2FC on the Static Region
The IT2FC is connected to the PLB. Although the
controller structure is static, this system can be evolved
for tuning and learning because it is possible to achieve
parametric modifications to all the IT2FC stages.
Figure 76.13 shows the architecture of this system and
Fig. 76.15 a conceptual model of the possible
implementation.


Implementing the FlexCo IT2FC on the PR 
Figure 76.16 illustrates a more flexible architecture for
the FlexCo IT2FC. The IT2FC is implemented in the
reconfigurable region, using a partially reconfigurable
region (PRR) for each stage. This is convenient since each
region can have multiple modules that can be swapped
in and out of the device on the fly. This is the most
recommended method to achieve the evolving IT2FC since
it is more flexible. One disadvantage is that at running
time it is slower than the static implementation because
more logic circuits are incorporated.

Figure 76.17 shows an evolutive standalone system; as
mentioned above, the IT2FC and the GA can be in the
static or in the reconfigurable region.


76.8.3 Conclusion and Further Reading 


FPGAs combine the best parts of ASICs and processor-based
systems, since they do not require high volumes
to justify making a custom design. Moreover, they
also provide the flexibility of software running on
a processor-based system, without being limited by the
number of cores available. They are one of the best
options to parallelize a system since they are parallel
in nature. A typical whole T2 inference of an IT2FC,
computed using an industrial computer equipped with
a quad-core processor, lasts about $18 \times 10^{-3}$ s. A whole
IT2FC (fuzzification, inference, KM type reducer, and
defuzzification) lasts only four clock cycles, which for
a Spartan implementation using a 50 MHz clock represents
$80 \times 10^{-9}$ s, and for a Virtex 5 FPGA-based system
represents $40 \times 10^{-9}$ s. For the Spartan family the
typical implementation speedup is 225 000, whereas for the
Virtex 5 it is 450 000. Using a pipeline architecture, the
result of the whole IT2 process can be obtained in
just one clock cycle, so using the same criteria to
compare, the speedup for Spartan is 900 000 and 2 400 000
for Virtex. GAs implemented in an FPGA are reported to be
at least 5 times faster than in a computer
system. For all these reasons, FPGAs are suitable
devices for embedding evolving fuzzy logic controllers,
especially the IT2FC, since such controllers are
computationally expensive. There are some drawbacks with
the use of this technology, mostly with respect to the
need to have a highly experienced development team because
of its implementation complexity. Achieving an evolving
intelligent system using reconfigurable computing is not
as direct as it is using a computer system. It requires
knowledge of FPGA architectures, VHDL coding,
soft processor implementation, the development
of coprocessors, high-level languages, and the basics of
reconfigurable computing. Therefore, people interested in
achieving such implementations require expertise in the
above fields, and further reading should focus on these
topics: FPGA vendor manuals and white papers, as well
as papers and books on reconfigurable computing.
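The four-cycle speedup figures quoted above follow from simple arithmetic, which the following sketch reproduces. It takes the reference time on the quad-core computer as 18 ms and four clock cycles per inference. The 50 MHz Spartan clock is stated in the text; the 100 MHz Virtex 5 clock is an inference from the quoted 40 ns for four cycles, not a figure given by the authors. The function name `speedup` is hypothetical.

```python
# Speedup of an FPGA inference over a software reference time.
# Assumption: speedup is defined as reference_time / fpga_time.

PC_INFERENCE_S = 18e-3   # software T2 inference time on the quad-core PC
CYCLES_PER_INFERENCE = 4  # whole IT2FC in four clock cycles (from the text)


def speedup(reference_s, cycles, clock_hz):
    """Ratio of the reference time to `cycles` clock periods at `clock_hz`."""
    fpga_time_s = cycles / clock_hz
    return reference_s / fpga_time_s


# 50 MHz is stated in the text; 100 MHz is inferred from 40 ns / 4 cycles.
spartan_speedup = speedup(PC_INFERENCE_S, CYCLES_PER_INFERENCE, 50e6)
virtex5_speedup = speedup(PC_INFERENCE_S, CYCLES_PER_INFERENCE, 100e6)

if __name__ == "__main__":
    print(f"Spartan:  {spartan_speedup:,.0f}")
    print(f"Virtex 5: {virtex5_speedup:,.0f}")
```

Rounded to integers, the two values are 225 000 and 450 000, matching the figures in the text.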
77. Multiobjective Genetic Fuzzy Systems


Hisao Ishibuchi, Yusuke Nojima 


This chapter explains evolutionary multiobjective 
design of fuzzy rule-based systems in comparison 
with single-objective design. Evolutionary algo- 
rithms have been used in many studies on fuzzy 
system design for rule generation, rule selection, 
input selection, fuzzy partition, and membership 
function tuning. Those studies are referred to as 
genetic fuzzy systems because genetic algorithms 
have been mainly used as evolutionary algorithms. 
In many studies on genetic fuzzy systems, the ac- 
curacy of fuzzy rule-based systems is maximized. 
However, accuracy maximization often leads to the 
deterioration in the interpretability of fuzzy rule- 
based systems due to the increase in their com- 
plexity. Thus, multiobjective genetic algorithms 
were used in some studies to maximize not only 
the accuracy of fuzzy rule-based systems but also 
their interpretability. Those studies, which can be 
viewed as a subset of genetic fuzzy system stud- 
ies, are referred to as multiobjective genetic fuzzy 
systems (MoGFS). A number of fuzzy rule-based 
systems with different complexities are obtained 
along the interpretability—accuracy tradeoff curve. 
One extreme of the tradeoff curve is a simple highly 
interpretable fuzzy rule-based system with low 
accuracy while the other extreme is a complicated 
highly accurate one with low interpretability. In 
MoGFS, multiple accuracy measures such as a true 
positive rate and a true negative rate can be si- 
multaneously used as separate objectives. Multiple 
interpretability measures can also be simultane- 
ously used in MoGFS. 


77.1 Fuzzy System Design 


A fuzzy rule-based system is a set of fuzzy rules, which
has been successfully used as a nonlinear controller in
various real-world applications. The basic structure of
fuzzy rules for multi-input and single-output fuzzy
control can be written as follows [77.1–3]

$$\text{Rule } R_q : \text{If } x_1 \text{ is } A_{q1} \text{ and } \ldots \text{ and } x_n \text{ is } A_{qn} \text{ then } y \text{ is } B_q , \tag{77.1}$$

where $q$ is a rule index, $R_q$ is the label of the $q$th rule, $n$
is the number of input variables, $x_i$ is the $i$th input
variable ($i = 1, 2, \ldots, n$), $A_{qi}$ is an antecedent fuzzy set for
the $i$th input variable $x_i$, $y$ is an output variable, and $B_q$
is a consequent fuzzy set for the output variable $y$.
The antecedent and consequent fuzzy sets $A_{qi}$ and $B_q$
are specified by their membership functions $\mu_{A_{qi}}(x_i)$
and $\mu_{B_q}(y)$, respectively. Examples of antecedent fuzzy
sets are shown in Fig. 77.1 where the domain interval
[0, 1] of the input variable $x_i$ is partitioned into
three fuzzy sets small, medium, and large with
triangular membership functions.
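The partition into small, medium, and large can be sketched as follows. The exact shapes in Fig. 77.1 are not recoverable from the text, so the sketch assumes a uniform partition with peaks at 0, 0.5, and 1 and with neighboring sets crossing at grade 0.5; the names `triangular`, `FUZZY_SETS`, and `memberships` are hypothetical.

```python
# Triangular membership functions for three linguistic values on [0, 1].
# Assumption: uniform partition with peaks at 0.0, 0.5, and 1.0.

def triangular(x, left, peak, right):
    """Membership grade of x in the triangular fuzzy set (left, peak, right)."""
    if x == peak:
        return 1.0
    if x <= left or x >= right:
        return 0.0
    if x < peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# The feet at -0.5 and 1.5 lie outside the domain so that small and large
# reach grade 1.0 at the ends of the interval.
FUZZY_SETS = {
    "small": (-0.5, 0.0, 0.5),
    "medium": (0.0, 0.5, 1.0),
    "large": (0.5, 1.0, 1.5),
}


def memberships(x):
    """Grades of x in all three linguistic values."""
    return {name: triangular(x, *abc) for name, abc in FUZZY_SETS.items()}


if __name__ == "__main__":
    print(memberships(0.25))
```

With this partition, an input of 0.25 belongs to small and medium with grade 0.5 each and to large with grade 0.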

Fuzzy rules of the form in (77.1) are based on
the concept of linguistic variables by Zadeh [77.4–6].
According to Zadeh [77.4–6], a fuzzy set with
a linguistic meaning such as small and large is
referred to as a linguistic value while a variable with
linguistic values is called a linguistic variable. For
example, in Fig. 77.1, the three fuzzy sets are linguistic
values while $x_i$ is a linguistic variable. In our daily
life, we almost always use linguistic variables and lin- 
guistic values. When we say your car is fast but my 
car is slow, the speed of cars is a linguistic variable 
while fast and slow are linguistic values. When we 
say it is hot today, the temperature is a linguistic vari- 
able while hot is a linguistic value. Of course, we 
use those linguistic values without explicitly specify- 
ing their meanings by membership functions. However, 
we have our own vague definitions of those linguis- 
tic values, which may be approximately represented by 
membership functions. 

The main advantage of fuzzy rule-based systems 
over other nonlinear models such as multilayer feedfor- 
ward neural networks is their linguistic interpretability. 
In Fig. 77.2, we show a two-input and single-output
fuzzy rule-based system with the following nine fuzzy
rules

Rule $R_1$: If $x_1$ is small and $x_2$ is small then $y$ is medium,
Rule $R_2$: If $x_1$ is small and $x_2$ is medium then $y$ is small,
Rule $R_3$: If $x_1$ is small and $x_2$ is large then $y$ is medium,
Rule $R_4$: If $x_1$ is medium and $x_2$ is small then $y$ is small,
Rule $R_5$: If $x_1$ is medium and $x_2$ is medium then $y$ is medium,
Rule $R_6$: If $x_1$ is medium and $x_2$ is large then $y$ is large,
Rule $R_7$: If $x_1$ is large and $x_2$ is small then $y$ is medium,
Rule $R_8$: If $x_1$ is large and $x_2$ is medium then $y$ is large,
Rule $R_9$: If $x_1$ is large and $x_2$ is large then $y$ is medium.

A linguistic value in each cell in Fig. 77.2 shows the
consequent fuzzy set of the corresponding fuzzy rule.
For example, medium in the bottom-right cell shows
the consequent fuzzy set of the fuzzy rule $R_7$ with the
antecedent fuzzy set large for $x_1$ and small for $x_2$. Let
us assume that the consequent fuzzy sets in Fig. 77.2
are defined by the triangular membership functions in
Fig. 77.1. Then, we can roughly understand the shape
of the two-input and single-output nonlinear function
represented by the fuzzy rule-based system in Fig. 77.2
(even when we do not know anything about fuzzy
reasoning).

Fig. 77.1 Three antecedent fuzzy sets small, medium, and
large (horizontal axis: input variable $x_i$ on [0, 1];
vertical axis: membership value)

Fig. 77.2 A two-input and single-output fuzzy rule-based
system with nine fuzzy rules (axes: input variables $x_1$
and $x_2$ on [0, 1])
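The nine rules can be held in a small lookup table, which makes the grid reading of Fig. 77.2 explicit. The sketch below stores the consequent of each rule under the pair of antecedent linguistic values and lays the table out with $x_1$ increasing to the right and $x_2$ increasing upward, so that the bottom-right cell is the cell of $R_7$. The names `RULES` and `grid_rows` are hypothetical.

```python
# The nine fuzzy rules as a lookup table: (x1 term, x2 term) -> consequent.

TERMS = ("small", "medium", "large")

RULES = {
    ("small", "small"): "medium",    # R1
    ("small", "medium"): "small",    # R2
    ("small", "large"): "medium",    # R3
    ("medium", "small"): "small",    # R4
    ("medium", "medium"): "medium",  # R5
    ("medium", "large"): "large",    # R6
    ("large", "small"): "medium",    # R7
    ("large", "medium"): "large",    # R8
    ("large", "large"): "medium",    # R9
}


def grid_rows():
    """Rows of the rule table, top row = x2 large, left column = x1 small."""
    return [[RULES[(x1, x2)] for x1 in TERMS] for x2 in reversed(TERMS)]


if __name__ == "__main__":
    for row in grid_rows():
        print(" | ".join(f"{cell:<6}" for cell in row))
```

Reading the printed grid row by row gives the rough shape of the input-output function without any fuzzy reasoning, which is the point made in the paragraph above.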

It is easy to linguistically understand the input-output
relation of the fuzzy rule-based system in
Fig. 77.2. That is, the fuzzy rule-based system in
Fig. 77.2 has high interpretability. However, it is
difficult to approximate a complicated highly nonlinear
function by such a simple 3 × 3 fuzzy rule-based system.
More membership functions for the input and output
variables may be needed for improving the accuracy of
fuzzy rule-based systems. The tuning of each membership
function may also be needed. Theoretically, fuzzy
rule-based systems are universal approximators of
nonlinear functions. This property has been shown for
fuzzy rule-based systems [77.7–9] and multilayer
feedforward neural networks [77.10–12]. This means that
fuzzy rule-based systems as well as neural networks
have high approximation ability of nonlinear functions.

In Fig. 77.3, we show an example of a tuned
7 × 7 fuzzy partition of the two-dimensional input
space $[0, 1] \times [0, 1]$. We can design a much more
accurate fuzzy rule-based system by using such a tuned
fuzzy partition than the simple 3 × 3 fuzzy partition in
Fig. 77.2. That is, we can say that Fig. 77.3 is a better
fuzzy partition than Fig. 77.2 with respect to the
accuracy of fuzzy rule-based systems. However, it is very
difficult to linguistically interpret each antecedent fuzzy
set in Fig. 77.3. In other words, it is very difficult to
assign an appropriate linguistic value such as small and
large to each antecedent fuzzy set in Fig. 77.3. Thus,
we can say that the fuzzy partition in Fig. 77.3 does not
have high linguistic interpretability. That is, Fig. 77.2 is
a better fuzzy partition than Fig. 77.3 with respect to the
linguistic interpretability of fuzzy rule-based systems.
As shown by the comparison between the two fuzzy
partitions in Figs. 77.2 and 77.3, accuracy maximization
usually conflicts with interpretability maximization
in the design of fuzzy rule-based systems.

Fig. 77.3 A tuned 7 × 7 fuzzy partition (axes: input
variables $x_1$ and $x_2$ on [0, 1])

Let us denote a fuzzy rule-based system by $S$. The
fuzzy rule-based system $S$ is a set of fuzzy rules. In
fuzzy system design, the accuracy of $S$ is maximized.
The accuracy maximization of $S$ is usually formulated
as the following error minimization

$$\text{Minimize } f(S) = \text{Error}(S) , \tag{77.2}$$

where $f(S)$ is an objective function to be minimized,
and $\text{Error}(S)$ is an error measure.

As shown in Fig. 77.3, the accuracy maximization
often leads to a complicated fuzzy rule-based system
with low interpretability. Thus, a complexity measure
is combined into the objective function in (77.2) as
follows [77.13, 14]

$$\text{Minimize } f(S) = w_1 \, \text{Complexity}(S) + w_2 \, \text{Error}(S) , \tag{77.3}$$

where $w_1$ and $w_2$ are nonnegative weights, and
$\text{Complexity}(S)$ is a complexity measure.

In the late 1990s, the idea of multiobjective fuzzy
system design [77.15] was proposed where the accuracy
maximization and the complexity minimization were
handled as separate objectives

$$\text{Minimize } f_1(S) = \text{Complexity}(S) \text{ and } f_2(S) = \text{Error}(S) , \tag{77.4}$$

where $f_1(S)$ and $f_2(S)$ are separate objectives to be
minimized.
The two-objective optimization problem in (77.4)
does not have a single optimal solution that
simultaneously optimizes the two objectives $f_1(S)$ and $f_2(S)$.
This is because the error minimization increases the
complexity of fuzzy rule-based systems (i.e., the
optimization of $f_2(S)$ deteriorates $f_1(S)$). That is, the two
objectives $f_1(S)$ and $f_2(S)$ in (77.4) are conflicting with
each other. In general, a multiobjective optimization
problem has a number of nondominated solutions with
different tradeoffs among the conflicting objectives.
Those solutions are referred to as Pareto optimal
solutions. The two-objective optimization problem in (77.4)
has a number of nondominated fuzzy rule-based
systems with different complexities (Figs. 77.2 and 77.3).
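The difference between the weighted-sum formulation (77.3) and the two-objective formulation (77.4) can be illustrated with a few candidate systems, each summarized by a pair (Complexity(S), Error(S)). The candidate values below are invented for illustration only and do not come from the chapter; the function names are hypothetical.

```python
# Nondominated filtering for two minimized objectives (complexity, error),
# compared with scalarization by a weighted sum as in (77.3).

def dominates(a, b):
    """True if a is no worse than b in every objective and better in one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def nondominated(points):
    """Points not dominated by any other point (the Pareto optimal ones)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def weighted_sum_best(points, w1, w2):
    """Single solution minimizing w1 * complexity + w2 * error."""
    return min(points, key=lambda p: w1 * p[0] + w2 * p[1])


# Illustrative (number of rules, error rate) pairs; not data from the text.
CANDIDATES = [(9, 0.30), (20, 0.18), (25, 0.20), (49, 0.05), (49, 0.08)]

if __name__ == "__main__":
    print(nondominated(CANDIDATES))
    print(weighted_sum_best(CANDIDATES, w1=0.01, w2=1.0))
```

The weighted sum returns one system whose identity depends on the chosen weights, whereas the nondominated filter returns the whole tradeoff set and leaves the choice to the designer.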

In Fig. 77.4, we illustrate the concept of
complexity-accuracy tradeoff in the design of fuzzy rule-based
systems. The horizontal axis of Fig. 77.4 shows the
values of the complexity measure (i.e., Complexity(S))
while the vertical axis shows the values of the error
measure (i.e., Error(S)). Around the top-left corner
of Fig. 77.4, we have simple fuzzy rule-based
systems with high interpretability and low accuracy (e.g.,
a simple 3 × 3 fuzzy rule-based system in Fig. 77.2).
The improvement in their accuracy increases their
complexity. By minimizing the error measure Error(S), we
have complicated fuzzy rule-based systems with high
accuracy and low interpretability around the
bottom-right corner of Fig. 77.4 (e.g., a tuned 7 × 7 fuzzy
rule-based system in Fig. 77.3). In Fig. 77.4, we have
many nondominated fuzzy rule-based systems along the
complexity-accuracy tradeoff curve. It should be noted
that there exist no fuzzy rule-based systems around the
bottom-left corner (i.e., no ideal fuzzy rule-based
systems with high accuracy and high interpretability). This
is because the two objectives in Fig. 77.4 are conflicting
with each other.

Fig. 77.4 Nondominated fuzzy rule-based systems with
different complexity-accuracy tradeoffs

Since the late 1990s, a number of multiobjective
approaches have been proposed for fuzzy system
design [77.16–19]. In this chapter, we explain the basic
idea of multiobjective fuzzy system design using
multiobjective evolutionary algorithms [77.20–22]. Whereas
we started with fuzzy rules for fuzzy control in (77.1),
our explanations in this chapter are mainly about
multiobjective design of fuzzy rule-based systems for
pattern classification. This is because early multiobjective
approaches were mainly proposed for pattern
classification problems.


77.2 Accuracy Maximization


In this section, we briefly explain various approaches
proposed for improving the accuracy of fuzzy rule-based
systems. Those approaches often deteriorate the
interpretability.


77.2.1 Types of Fuzzy Rules


Fuzzy rules of the form in (77.1) have been successfully
used in fuzzy controllers since Mamdani's pioneering
work in the 1970s [77.23, 24]. Those fuzzy rules have
often been called Mamdani-type fuzzy rules or Mamdani
fuzzy rules. A heuristic method for generating such
fuzzy rules from numerical data was proposed by Wang
and Mendel [77.25] and has been used for function
approximation.



A well-known idea for improving the approximation
ability of fuzzy rules in (77.1) is the use of a linear
function instead of a linguistic value in the consequent
part

$$\text{Rule } R_q : \text{If } x_1 \text{ is } A_{q1} \text{ and } \ldots \text{ and } x_n \text{ is } A_{qn} \text{ then } y = b_{q0} + b_{q1} x_1 + b_{q2} x_2 + \cdots + b_{qn} x_n , \tag{77.5}$$

where $b_{qi}$ is a real number coefficient ($i = 0, 1, \ldots, n$).
Fuzzy rules of this type were proposed by Takagi and
Sugeno [77.26]. A fuzzy rule-based system with fuzzy
rules in (77.5) is referred to as a Takagi-Sugeno model.
The use of a linear function instead of a linguistic value
in the consequent part of fuzzy rules clearly increases
the accuracy of fuzzy rule-based systems. However, it
degrades their interpretability.
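A minimal sketch of how rules of the form (77.5) produce an output is given below. The chapter does not state the reasoning formula at this point, so the sketch assumes the commonly used weighted average of the rule outputs, with the product operator for the rule firing strength and triangular antecedent fuzzy sets. The names `triangular` and `ts_output` and the two example rules are hypothetical.

```python
# Takagi-Sugeno style inference: each rule has a linear consequent.
# Assumption: output = sum(w_q * y_q) / sum(w_q), w_q = product of grades.

def triangular(x, left, peak, right):
    """Membership grade of x in the triangular fuzzy set (left, peak, right)."""
    if x == peak:
        return 1.0
    if x <= left or x >= right:
        return 0.0
    if x < peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def ts_output(x, rules):
    """x: input vector; rules: list of (antecedent sets, (b0, b1, ..., bn))."""
    numerator = denominator = 0.0
    for antecedents, coeffs in rules:
        strength = 1.0
        for xi, abc in zip(x, antecedents):
            strength *= triangular(xi, *abc)
        y_q = coeffs[0] + sum(b * xi for b, xi in zip(coeffs[1:], x))
        numerator += strength * y_q
        denominator += strength
    if denominator == 0.0:
        raise ValueError("no rule is compatible with the input")
    return numerator / denominator


# Two illustrative single-input rules:
#   If x1 is around 0 then y = x1;  if x1 is around 1 then y = 1.
EXAMPLE_RULES = [
    ([(-1.0, 0.0, 1.0)], (0.0, 1.0)),
    ([(0.0, 1.0, 2.0)], (1.0, 0.0)),
]

if __name__ == "__main__":
    print(ts_output([0.5], EXAMPLE_RULES))
```

At $x_1 = 0.5$ both rules fire with strength 0.5, so the output is the average of the two consequent values 0.5 and 1.0.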

The following simplified version of fuzzy rules in
Takagi-Sugeno models has also been used

$$\text{Rule } R_q : \text{If } x_1 \text{ is } A_{q1} \text{ and } \ldots \text{ and } x_n \text{ is } A_{qn} \text{ then } y = b_q , \tag{77.6}$$

where $b_q$ is a consequent real number. It is easy to
tune the consequent real number of each fuzzy rule.
This is the main advantage of simplified fuzzy rules
in (77.6). Thus, simplified fuzzy rules have often been
used in trainable fuzzy rule-based systems called
neuro-fuzzy systems [77.27–29]. In those studies, antecedent
fuzzy sets as well as consequent real numbers are
adjusted in the same manner as the learning of neural
networks.

Due to their simple structure, simplified fuzzy 
rules in (77.6) may have higher interpretability than 
Takagi—Sugeno fuzzy rules in (77.5). However, it is 
usually difficult to linguistically interpret a conse- 
quent real number. Thus, the linguistic interpretability 
of simplified fuzzy rules in (77.6) is usually viewed 
as being limited if compared with Mamdani fuzzy 
rules with a linguistic value in their consequent part 
in (77.1). 

For pattern classification problems, three types of
fuzzy rules have been used in the literature [77.30]. The
simplest structure of fuzzy rules for pattern
classification problems is as follows

$$\text{Rule } R_q : \text{If } x_1 \text{ is } A_{q1} \text{ and } \ldots \text{ and } x_n \text{ is } A_{qn} \text{ then Class } C_q , \tag{77.7}$$

where $C_q$ is a consequent class.

The compatibility grade of an input pattern
$\mathbf{x}_p = (x_{p1}, x_{p2}, \ldots, x_{pn})$ with the antecedent part of the fuzzy
rule $R_q$ in (77.7) is usually calculated by the minimum
or product operator. In this chapter, we use the
following product operator

$$\mu_{\mathbf{A}_q}(\mathbf{x}_p) = \mu_{A_{q1}}(x_{p1}) \cdot \mu_{A_{q2}}(x_{p2}) \cdots \mu_{A_{qn}}(x_{pn}) , \tag{77.8}$$

where $\mathbf{A}_q = (A_{q1}, A_{q2}, \ldots, A_{qn})$ is an antecedent fuzzy
set vector, and $\mu_{\mathbf{A}_q}(\mathbf{x}_p)$ shows the compatibility of $\mathbf{x}_p$
with the antecedent fuzzy set vector $\mathbf{A}_q$.

Let $S$ be a set of fuzzy rules of the form in
(77.7). The rule set $S$ can be viewed as a fuzzy
rule-based classifier. When an input pattern
$\mathbf{x}_p = (x_{p1}, x_{p2}, \ldots, x_{pn})$ is presented to $S$, $\mathbf{x}_p$ is classified
by a single winner rule with the maximum
compatibility. Such a single winner-based fuzzy reasoning
method has been frequently used in fuzzy rule-based
classifiers.
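The single winner-based reasoning with the product operator (77.8) can be sketched as follows. The sketch assumes a uniform 3 × 3 partition with triangular sets peaking at 0, 0.5, and 1, and assigns an illustrative class label 1 to 9 to the nine grid cells; neither the partition parameters nor the labels are taken from the chapter. The names `compatibility` and `classify` are hypothetical.

```python
# Single winner-based fuzzy classification with product compatibility.
# Assumptions: uniform triangular partition; class labels 1..9 per grid cell.
from itertools import product
from math import prod


def triangular(x, left, peak, right):
    """Membership grade of x in the triangular fuzzy set (left, peak, right)."""
    if x == peak:
        return 1.0
    if x <= left or x >= right:
        return 0.0
    if x < peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


TERMS = {
    "small": (-0.5, 0.0, 0.5),
    "medium": (0.0, 0.5, 1.0),
    "large": (0.5, 1.0, 1.5),
}

# Rule q: antecedent (A_q1, A_q2) and consequent class q, for q = 1..9.
RULES = [((TERMS[a], TERMS[b]), q)
         for q, (a, b) in enumerate(product(TERMS, repeat=2), start=1)]


def compatibility(x, antecedents):
    """Product of the membership grades of x_i in A_qi, as in (77.8)."""
    return prod(triangular(xi, *abc) for xi, abc in zip(x, antecedents))


def classify(x, rules=RULES):
    """Class of the rule with maximum compatibility, or None if none fires."""
    antecedents, label = max(rules, key=lambda r: compatibility(x, r[0]))
    return label if compatibility(x, antecedents) > 0.0 else None


if __name__ == "__main__":
    print(classify((0.1, 0.9)))
```

For the pattern (0.1, 0.9), the rule with small for $x_1$ and large for $x_2$ has the largest compatibility, so its class is returned.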

Let us assume that we have nine fuzzy rules in 
Fig. 77.5 for a pattern classification problem with the 
two-dimensional pattern space [0, 1] x [0, 1]. A different 
consequent class is assigned to each rule in Fig. 77.5 
for explanation purposes. The grid lines in the pattern 
space in Fig. 77.5 show the classification boundary be- 
tween different classes when we use the single winner- 
based fuzzy reasoning method together with the product 
operator-based compatibility calculation. It should be 
noted that the classification boundary by the nine fuzzy
rules in Fig. 77.5 can also be generated by nine
non-fuzzy rules with interval antecedent conditions [77.31,
32].

The second type of fuzzy rules for pattern
classification problems has a rule weight [77.30]

$$\text{Rule } R_q : \text{If } x_1 \text{ is } A_{q1} \text{ and } \ldots \text{ and } x_n \text{ is } A_{qn} \text{ then Class } C_q \text{ with } CF_q , \tag{77.9}$$

where $CF_q$ is a real number in the unit interval [0, 1],
which is called a rule weight or a certainty
factor. This type of fuzzy rules has been used in many
studies on fuzzy rule-based classifiers since the early
1990s [77.33, 34].
Fig. 77.5 A fuzzy rule-based classifier with nine fuzzy
classification rules

Fig. 77.6a-d Classification boundaries generated by assigning a different rule weight to each of the nine fuzzy rules in
Fig. 77.5. In each plot, the default setting of $CF_q$ is 1.0 (e.g., $CF_q = 1.0$ for $q = 4, 5, 6, 7, 8, 9$ in (b)). In panel (d), three of the rule weights are set to 0.25


When an input pattern $\mathbf{x}_p$ is presented to a fuzzy
rule-based classifier with fuzzy rules of the form in
(77.9), a single winner rule is determined using the
product of the compatibility $\mu_{\mathbf{A}_q}(\mathbf{x}_p)$ of $\mathbf{x}_p$ with each
rule $R_q$ and its rule weight $CF_q$: $\mu_{\mathbf{A}_q}(\mathbf{x}_p) \cdot CF_q$.

Fuzzy rules with a rule weight have higher clas- 
sification ability than those with no rule weight. For 
example, the classification boundary in Fig. 77.5 can 
be adjusted by assigning a different rule weight to 
each rule (without changing the shape of each an- 
tecedent fuzzy set). Examples of the adjusted classi- 
fication boundaries are shown in Fig. 77.6. As shown 
in Fig. 77.6, the accuracy of fuzzy rule-based classi- 
fiers can be improved by using fuzzy rules with a rule 
weight. However, the use of a rule weight degrades
the interpretability of fuzzy rule-based classifiers. It is
a controversial issue to compare the interpretability of 
fuzzy rule-based classifiers between the following two 
approaches: One is the use of fuzzy rules with a rule 
weight and the other is the modification of antecedent 
fuzzy sets [77.35, 36]. 
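How a rule weight moves a classification boundary without touching the antecedent fuzzy sets can be seen in a one-dimensional sketch. Two rules with triangular antecedent sets peaking at 0 and 1 are assumed for illustration; with equal weights the boundary lies at 0.5, and lowering the weight of the first rule to 0.5 moves it to 1/3. The rule set and the names `classify` and `with_weight` are hypothetical.

```python
# Effect of a rule weight CF_q on the single winner rule.
# Assumption: two illustrative rules on one input variable.

def triangular(x, left, peak, right):
    """Membership grade of x in the triangular fuzzy set (left, peak, right)."""
    if x == peak:
        return 1.0
    if x <= left or x >= right:
        return 0.0
    if x < peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


RULES = [
    {"antecedent": (-1.0, 0.0, 1.0), "label": 1, "cf": 1.0},
    {"antecedent": (0.0, 1.0, 2.0), "label": 2, "cf": 1.0},
]


def classify(x, rules):
    """Class of the rule maximizing compatibility times rule weight."""
    winner = max(rules,
                 key=lambda r: triangular(x, *r["antecedent"]) * r["cf"])
    return winner["label"]


def with_weight(rules, label, cf):
    """Copy of the rule set with the weight of one class's rule replaced."""
    return [dict(r, cf=cf) if r["label"] == label else r for r in rules]


if __name__ == "__main__":
    weighted = with_weight(RULES, label=1, cf=0.5)
    print(classify(0.4, RULES), classify(0.4, weighted))
```

The pattern 0.4 is assigned to class 1 with equal weights and to class 2 once the weight of the first rule is halved, although the membership functions are unchanged.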

The third type of fuzzy rules has multiple rule
weights as follows [77.30]

$$\text{Rule } R_q : \text{If } x_1 \text{ is } A_{q1} \text{ and } \ldots \text{ and } x_n \text{ is } A_{qn} \text{ then Class } C_1 \text{ with } CF_{q1}, \ldots, \text{Class } C_m \text{ with } CF_{qm} , \tag{77.10}$$

where $m$ is the number of classes and $CF_{qj}$ is a real
number in the unit interval [0, 1], which can be viewed
as a rule weight for the $j$th class $C_j$ ($j = 1, 2, \ldots, m$).
When we use the single winner-based fuzzy reasoning
method, the classification result of each pattern
depends only on the maximum rule weight $CF_q$ of each
rule (i.e., $CF_q = \max\{CF_{q1}, CF_{q2}, \ldots, CF_{qm}\}$). Thus,
the use of multiple rule weights in (77.10) is meaningless
under the single winner rule-based fuzzy reasoning
method. However, they can improve the accuracy of
fuzzy rule-based classifiers when we use a voting-based
fuzzy reasoning method [77.30, 37]. Of course, the use
of multiple rule weights further degrades the
interpretability of fuzzy rule-based classifiers.

Fig. 77.7 Examples of fuzzy rules in an approximative
fuzzy rule-based system
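The following sketch contrasts the two reasoning methods for rules with multiple rule weights. The chapter does not spell out the voting formula here, so the sketch assumes one common form in which each rule votes for every class with its compatibility multiplied by the corresponding weight, and the class with the largest total wins. The compatibility grades and weights are invented for illustration, and the function names are hypothetical.

```python
# Single winner versus voting for rules with one weight per class.
# Assumption: vote for class j = sum over rules of compat_q * CF_qj.

def single_winner(compat, weights):
    """compat[q]: compatibility of rule q; weights[q][j]: CF_qj."""
    scores = [c * max(w) for c, w in zip(compat, weights)]
    q = max(range(len(scores)), key=scores.__getitem__)
    return max(range(len(weights[q])), key=weights[q].__getitem__)


def voting(compat, weights):
    """Index of the class with the largest total weighted vote."""
    num_classes = len(weights[0])
    totals = [sum(c * w[j] for c, w in zip(compat, weights))
              for j in range(num_classes)]
    return max(range(num_classes), key=totals.__getitem__)


# Three rules, two classes (indices 0 and 1); illustrative numbers only.
COMPAT = [0.6, 0.5, 0.5]
WEIGHTS = [[0.9, 0.0], [0.1, 0.8], [0.1, 0.8]]

if __name__ == "__main__":
    print(single_winner(COMPAT, WEIGHTS), voting(COMPAT, WEIGHTS))
```

Here the first rule wins on its own and selects class 0, while the two weaker rules together outvote it and select class 1, which is why the additional weights only matter under voting.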


77.2.2 Types of Fuzzy Partitions 


Since Mamdani’s pioneering work in the 1970s [77.23, 
24], grid-type fuzzy partitions have frequently been 
used in fuzzy control (e.g., the 3 x 3 fuzzy partition in 
Fig. 77.2). Such a grid-type fuzzy partition has high 
interpretability when it is used for two-dimensional 
problems (i. e., for the design of two-input single-output 
fuzzy rule-based systems). However, grid-type fuzzy 
partitions have the following two difficulties. One diffi- 
culty is the inflexibility of membership function tuning. 
Since each antecedent fuzzy set is used in multiple 
fuzzy rules, membership function tuning for improving 
the accuracy of one fuzzy rule may degrade the
accuracy of some other fuzzy rules. The other difficulty is
the exponential increase in the number of fuzzy rules
with respect to the number of input variables. Let $L$
be the number of antecedent fuzzy sets for each of
the $n$ variables. In this case, the number of cells in
the corresponding $n$-dimensional fuzzy grid is $L^n$ (e.g.,
$5^{10} = 9\,765\,625$ when $L = 5$ and $n = 10$).
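The growth of the grid is easy to check numerically. The short sketch below evaluates the number of cells for a few combinations of the number of fuzzy sets per variable and the number of input variables; the function name `grid_rule_count` is hypothetical.

```python
# Number of cells (candidate rules) in an n-dimensional fuzzy grid
# with the same number of antecedent fuzzy sets on every input variable.

def grid_rule_count(num_sets, num_inputs):
    """Return num_sets ** num_inputs."""
    if num_sets < 1 or num_inputs < 1:
        raise ValueError("both arguments must be positive")
    return num_sets ** num_inputs


if __name__ == "__main__":
    for n in (2, 5, 10):
        print(n, grid_rule_count(5, n))
```

For five fuzzy sets per variable, the count rises from 25 cells for two inputs to 9 765 625 cells for ten inputs.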

These two difficulties can be removed by assign- 
ing different antecedent fuzzy sets to each fuzzy rule 
as shown in Fig. 77.7. Each fuzzy rule has its own an- 
tecedent fuzzy sets. That is, no antecedent fuzzy set is 
shared by multiple fuzzy rules. 

Fuzzy rule-based systems with this type of fuzzy 
rules are referred to as approximative models whereas 
grid-type fuzzy rule-based systems such as Fig. 77.2 are 
called descriptive models [77.38, 39]. If the accuracy 
of fuzzy rule-based systems is much more important 
than their interpretability, approximative models may 
be a better choice than descriptive models. Approx- 
imative models have been used as fuzzy rule-based 
classifiers since the early 1990s [77.40, 41]. 

One limitation of approximative models with re- 
spect to accuracy maximization is that every antecedent 
fuzzy set is defined on a single input variable. As 
a result, the shape of a fuzzy subspace covered by 
the antecedent part of each fuzzy rule is rectangular 
as shown in Fig. 77.7. This means that such a fuzzy 
subspace cannot handle any correlation among input 
variables. One approach to the handling of correlated
subspaces is the use of a single high-dimensional
antecedent fuzzy set in each fuzzy rule

$$\text{Rule } R_q : \text{If } \mathbf{x} \text{ is } A_q \text{ then Class } C_q , \tag{77.11}$$

where $\mathbf{x}$ is an $n$-dimensional input vector (i.e.,
$\mathbf{x} = (x_1, x_2, \ldots, x_n)$) and $A_q$ is an $n$-dimensional antecedent
fuzzy set directly defined in the $n$-dimensional input
space. This type of fuzzy rules has also been used for
pattern classification problems since the 1990s [77.42].
Figure 77.8 illustrates an example of the $n$-dimensional
antecedent fuzzy set $A_q$ in the case of $n = 2$. As we can
see from Fig. 77.8, antecedent fuzzy sets in fuzzy rules 
of the type in (77.11) can cover correlated fuzzy sub- 
spaces of the input space. This characteristic feature is 
an advantage over single-dimensional antecedent fuzzy 
sets with respect to the accuracy of fuzzy rule-based 
systems. However, as we can see from Fig. 77.8, it is 
almost impossible to linguistically interpret a high-di- 
mensional antecedent fuzzy set. That is, the use of high- 
dimensional antecedent fuzzy sets may improve the ac- 
curacy of fuzzy rule-based systems but degrade their 
interpretability. 
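The difference between a product of single-dimensional antecedent fuzzy sets and a single two-dimensional antecedent fuzzy set can be sketched numerically. The shape of the fuzzy set in Fig. 77.8 is not recoverable from the text, so the sketch assumes Gaussian-shaped membership functions: a separable one, whose level curves are axis-parallel, and a correlated one with a correlation parameter `rho`. All parameter values and function names are hypothetical.

```python
# Separable versus correlated two-dimensional antecedent fuzzy sets.
# Assumption: Gaussian-shaped membership functions with a common width.
from math import exp


def separable_grade(x, center, sigma):
    """Product of one-dimensional Gaussian grades (no correlation)."""
    return exp(-0.5 * sum(((xi - ci) / sigma) ** 2
                          for xi, ci in zip(x, center)))


def correlated_grade(x, center, sigma, rho):
    """Two-dimensional Gaussian-shaped grade with correlation rho in (-1, 1)."""
    u = (x[0] - center[0]) / sigma
    v = (x[1] - center[1]) / sigma
    q = (u * u - 2.0 * rho * u * v + v * v) / (1.0 - rho * rho)
    return exp(-0.5 * q)


CENTER, SIGMA, RHO = (0.5, 0.5), 0.2, 0.8

if __name__ == "__main__":
    for point in ((0.7, 0.7), (0.7, 0.3)):
        print(point,
              round(separable_grade(point, CENTER, SIGMA), 3),
              round(correlated_grade(point, CENTER, SIGMA, RHO), 3))
```

The separable set gives the two points the same grade because they are equally far from the center along each axis, whereas the correlated set gives a much higher grade to the point on the diagonal. This is the kind of subspace that single-dimensional antecedent fuzzy sets cannot describe.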
Fig. 77.8 Illustration of an $n$-dimensional antecedent
fuzzy set $A_q$ in the two-dimensional input space

Fig. 77.9 A five-input single-output fuzzy rule-based
system with a hierarchical structure


77.2.3 Handling of High-Dimensional Problems with Many Input Variables


As we have already explained, the number of fuzzy 
rules exponentially increases with the number of in- 
put variables when we use a descriptive model with 
a grid-based fuzzy partition. Thus, it looks impracti- 
cal to design a descriptive model for high-dimensional 
problems. 

Approximative models do not suffer from this difficulty
of grid-based fuzzy partitions, because the number of
fuzzy rules in an approximative model is independent of
the number of input variables. That is, we can design
fuzzy rule-based systems for high-dimensional problems
by using approximative models.

One difficulty in the use of approximative models is 
poor interpretability of fuzzy rule-based systems due to 
the following two reasons: (i) it is difficult to linguis- 
tically interpret antecedent fuzzy sets in approximative 
models as shown in Fig. 77.7, and (ii) it is also diffi- 
cult to understand a fuzzy rule with a large number of 
antecedent conditions. 


77.2.4 Hybrid Approaches with Neural Networks and Genetic Algorithms


In the 1990s, a large number of learning and opti- 
mization methods were proposed for accuracy maxi- 
mization of fuzzy rule-based systems. Almost all of 
those approaches were hybrid approaches with neu- 
ral networks called neuro-fuzzy systems [77.27-29, 43, 
44] and with genetic algorithms called genetic fuzzy 
systems [77.45-48]. In neuro-fuzzy systems, learning
algorithms of neural networks were utilized for param- 
eter tuning (e.g., for membership function tuning). As 
shown in Fig. 77.3, parameter tuning in fuzzy rule- 
based systems usually leads to accuracy improvement 
and interpretability deterioration. 

Genetic fuzzy systems can be used not only for pa- 
rameter tuning but also for structure optimization such 
as rule selection, input selection and fuzzy partition. 
As we will explain in the next section, rule selection
and input selection can improve the interpretability of
fuzzy rule-based systems by decreasing their complex- 
ity whereas parameter tuning almost always deterio- 
rates their interpretability. Genetic fuzzy systems were 
also used for constructing a hierarchical structure of 
fuzzy rule-based systems [77.49]. Figure 77.9 shows 
an example of a fuzzy rule-based system with a hierar- 
chical structure. The use of hierarchical structures can 
prevent the exponential increase in the number of fuzzy 
rules because each subsystem has only a few inputs 
(e.g., in Fig. 77.9, each subsystem has only two inputs). 
However, it significantly degrades the interpretability of 
fuzzy rule-based systems. This is because the interpre- 
tation of intermediate variables between subsystems is 
usually impossible. 
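
The effect of a hierarchical structure on the rule count can be checked with simple arithmetic. The sketch below is illustrative only: it assumes a cascade of two-input subsystems in which each later subsystem takes the previous intermediate output plus one new input, and it assumes intermediate variables use the same number of fuzzy sets as the original inputs. The exact layout of Fig. 77.9 is not recoverable from the text, so these are hypothetical choices.

```python
def grid_rule_count(sets_per_input: int, num_inputs: int) -> int:
    """Rules in a single grid-based system: one rule per grid cell."""
    return sets_per_input ** num_inputs


def cascade_rule_count(sets_per_input: int, num_inputs: int) -> int:
    """Rules in a cascade of two-input subsystems (assumed layout).

    The first subsystem takes two original inputs; every later one
    takes the previous intermediate output plus one new input, so
    num_inputs - 1 subsystems are needed.
    """
    if num_inputs < 2:
        raise ValueError("a two-input cascade needs at least two inputs")
    return (num_inputs - 1) * sets_per_input ** 2


if __name__ == "__main__":
    for n in (2, 5, 10):
        print(n, grid_rule_count(3, n), cascade_rule_count(3, n))
```

Under these assumptions, a five-input system with three fuzzy sets per variable needs 243 rules as a single grid but only 36 rules as a cascade, at the cost of four subsystems whose intermediate variables have no direct linguistic meaning.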
In this section, we briefly explain various approaches
proposed for decreasing the complexity of fuzzy rule- 
based systems. Those approaches improve the inter- 
pretability of fuzzy rule-based systems but often de- 
grade their accuracy. 


77.3.1 Decreasing the Number of Fuzzy Rules


A simple idea for complexity reduction of fuzzy rule- 
based systems is to decrease the number of fuzzy rules. 
Let us consider a three-class pattern classification prob- 
lem in Fig. 77.10. All patterns in Fig. 77.10 can be 
correctly classified by the following nine fuzzy rules 
with the 3 × 3 fuzzy grid in Fig. 77.10

Rule $R_1$: If $x_1$ is small and $x_2$ is small then Class 2,
Rule $R_2$: If $x_1$ is small and $x_2$ is medium then Class 2,
Rule $R_3$: If $x_1$ is small and $x_2$ is large then Class 1,
Rule $R_4$: If $x_1$ is medium and $x_2$ is small then Class 2,
Rule $R_5$: If $x_1$ is medium and $x_2$ is medium then Class 2,
Rule $R_6$: If $x_1$ is medium and $x_2$ is large then Class 1,
Rule $R_7$: If $x_1$ is large and $x_2$ is small then Class 3,
Rule $R_8$: If $x_1$ is large and $x_2$ is medium then Class 3,
Rule $R_9$: If $x_1$ is large and $x_2$ is large then Class 3.


That is, all patterns in Fig. 77.10 can be correctly 
classified by a fuzzy rule-based classifier with these 
nine fuzzy rules. It is also possible to correctly clas- 
sify all patterns in Fig. 77.10 using a simple fuzzy 
rule-based classifier only with the four fuzzy rules 
around the top-right corner (i.e., fuzzy rules $R_5$, $R_6$, $R_8$,
and $R_9$). This example illustrates the simplification of
fuzzy rule-based systems through rule selection. 
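
The reason four rules can stand in for nine becomes visible in a small sketch. The code below encodes the nine rules as a table and classifies a pattern with the single winner rule. It assumes evenly spaced triangular membership functions on [0, 1] and the product operator for compatibility; the text does not fix either choice, and the test patterns are hypothetical rather than taken from Fig. 77.10.

```python
def small(x):
    return max(0.0, 1.0 - 2.0 * x)


def medium(x):
    return max(0.0, 1.0 - 2.0 * abs(x - 0.5))


def large(x):
    return max(0.0, 2.0 * x - 1.0)


FUZZY_SETS = {"small": small, "medium": medium, "large": large}

# rule id -> (antecedent on x1, antecedent on x2, consequent class)
RULES = {
    1: ("small", "small", 2), 2: ("small", "medium", 2), 3: ("small", "large", 1),
    4: ("medium", "small", 2), 5: ("medium", "medium", 2), 6: ("medium", "large", 1),
    7: ("large", "small", 3), 8: ("large", "medium", 3), 9: ("large", "large", 3),
}
ALL_RULES = list(RULES)
SELECTED = [5, 6, 8, 9]  # the four rules around the top-right corner


def classify(pattern, rule_ids):
    """Return the class of the most compatible rule, or None if no rule fires."""
    x1, x2 = pattern
    best_class, best_compat = None, 0.0
    for rule_id in rule_ids:
        set1, set2, rule_class = RULES[rule_id]
        compat = FUZZY_SETS[set1](x1) * FUZZY_SETS[set2](x2)
        if compat > best_compat:
            best_class, best_compat = rule_class, compat
    return best_class


if __name__ == "__main__":
    for p in [(0.1, 0.1), (0.9, 0.2), (0.2, 0.9)]:
        print(p, classify(p, ALL_RULES), classify(p, SELECTED))
```

Because medium overlaps with small and large, a pattern near the bottom-left corner still has nonzero compatibility with $R_5$ and receives the same class from the reduced rule set. A pattern exactly on the boundary, where every selected rule has zero compatibility, is left unclassified in this sketch.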

The use of genetic algorithms for fuzzy rule selection
was proposed by Ishibuchi et al. [77.13, 14] in the
1990s. Let $S_{ALL}$ be the set of all fuzzy rules. Since an
arbitrary subset of $S_{ALL}$ can be represented by a binary
string of length $|S_{ALL}|$, standard genetic algorithms for
binary strings can be directly applied to fuzzy rule
selection [77.13, 14]. The number of fuzzy rules, which
should be minimized, was used as a part of a fitness
function in single-objective approaches [77.13, 14]. It
was also used as a separate objective in multiobjective
approaches [77.15].
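
The binary-string coding can be sketched in a few lines. This is an illustration under stated assumptions: the cost function below is a hypothetical weighted form with made-up default weights `w_error` and `w_rules`, and is not claimed to be the fitness function used in [77.13, 14].

```python
def decode(bits, all_rules):
    """Return the selected rules: bits[k] == 1 keeps all_rules[k]."""
    if len(bits) != len(all_rules):
        raise ValueError("one bit per candidate rule is required")
    return [rule for rule, bit in zip(all_rules, bits) if bit == 1]


def cost(num_misclassified, bits, w_error=10.0, w_rules=1.0):
    """Hypothetical cost to minimize: weighted errors plus weighted rule count."""
    return w_error * num_misclassified + w_rules * sum(bits)


if __name__ == "__main__":
    rule_ids = list(range(1, 10))          # nine candidate rules
    bits = [0, 0, 0, 0, 1, 1, 0, 1, 1]     # one individual of the population
    print(decode(bits, rule_ids))          # selected rule ids
    print(cost(0, bits), cost(0, [1] * 9))
```

With equal error counts, the individual that selects fewer rules gets the lower cost, which is how the number of rules enters a single-objective search.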


77.3.2 Decreasing the Number of Antecedent Conditions


In Fig. 77.10, the rightmost three fuzzy rules (i.e., $R_7$,
$R_8$ and $R_9$ with the same antecedent condition on $x_1$)
can be combined into a single fuzzy rule: If $x_1$ is large
then Class 3. This fuzzy rule has no condition on the
second input variable $x_2$. In this manner, the 3 × 3 fuzzy
rule-based classifier with the nine fuzzy rules can be
simplified to a simpler classifier with seven fuzzy
rules.

The fuzzy rule If $x_1$ is large then Class 3 is viewed
as having a don’t care condition on the second input
variable $x_2$: If $x_1$ is large and $x_2$ is don’t care
then Class 3. In this fuzzy rule, don’t care is a special
antecedent fuzzy set that is fully compatible with


Fig. 77.10 A three-class pattern classification problem and
a 3 × 3 fuzzy grid (axes: input variables $x_1$ and $x_2$; markers
distinguish Class 1, Class 2, and Class 3)

Fig. 77.11a,b Projection of two-dimensional fuzzy sets and the merge
of similar fuzzy sets. (a) Projection onto each input variable,
(b) merging similar antecedent fuzzy sets



any input values. The use of don’t care enables us to
perform rule-level input selection, which significantly
improves the applicability of descriptive fuzzy rule-based
systems to high-dimensional problems [77.50].
When we use don’t care as a special antecedent
fuzzy set, the number of antecedent conditions in
a fuzzy rule excluding don’t care conditions is referred
to as the rule length since don’t care conditions are
usually omitted (e.g., If $x_1$ is large and $x_2$ is don’t care
then Class 3 is usually written as If $x_1$ is large then
Class 3). A short fuzzy rule with a small number of
antecedent conditions covers a large fuzzy subspace of
a high-dimensional pattern space while a long fuzzy
rule covers a small fuzzy subspace. For example, let us
consider a 50-dimensional pattern classification problem
with the pattern space $[0, 1]^{50}$. A fuzzy rule with
the antecedent fuzzy set small on all the 50 input
variables covers less than $1/10^{15}$ of the pattern space $[0, 1]^{50}$.
However, a short fuzzy rule with the antecedent fuzzy
set small on only two input variables (e.g., If $x_1$ is
small and $x_{49}$ is small then Class 3) covers 1/4 of the
pattern space $[0, 1]^{50}$. As a result, almost the entire
high-dimensional pattern space can be covered by
a small number of short fuzzy rules. That is, we can
design a simple fuzzy rule-based classifier with a small
number of short fuzzy rules for a high-dimensional
pattern classification problem. It should be noted that
different fuzzy rules may have antecedent conditions
on different input variables. Moreover, the rule length
of each fuzzy rule may be different (e.g., one fuzzy
rule has an antecedent condition only on $x_1$ while
another fuzzy rule has antecedent conditions on $x_2$, $x_3$
and $x_4$).
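
The coverage figures can be verified with a short calculation. The sketch assumes that each condition "small" covers half of the unit interval, which is the assumption consistent with the 1/4 quoted for a rule of length two; the function name and default are hypothetical.

```python
def covered_fraction(rule_length: int, fraction_per_condition: float = 0.5) -> float:
    """Fraction of the unit hypercube covered by a rule.

    Inputs with a don't care condition are fully covered, so only the
    rule_length real conditions shrink the covered region.
    """
    return fraction_per_condition ** rule_length


if __name__ == "__main__":
    for length in (1, 2, 10, 50):
        print(length, covered_fraction(length))
```

A rule of length 2 covers a quarter of the pattern space regardless of its dimensionality, whereas a rule of length 50 covers about 8.9e-16 of it, which is why short rules are the practical choice in high-dimensional problems.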

The total rule length (i. e., the total number of an- 
tecedent conditions), which should be minimized, was 
used as a part of a fitness function in single-objec- 
tive approaches [77.51]. It was also used as a separate 
objective in multiobjective approaches [77.52,53]. In 
multiobjective approaches, the total rule length instead 
of the average rule length has been used in the literature. 
This is because minimizing the average rule length
does not necessarily minimize the complexity of fuzzy
rule-based systems. In many cases, the average rule
length can be decreased by adding a new fuzzy rule
with a single antecedent condition, which increases
the complexity of the fuzzy rule-based system.
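
A two-line numerical example makes the point concrete. The rule lengths below are hypothetical.

```python
def total_rule_length(lengths):
    """Total number of antecedent conditions over all rules."""
    return sum(lengths)


def average_rule_length(lengths):
    """Average number of antecedent conditions per rule."""
    return sum(lengths) / len(lengths)


if __name__ == "__main__":
    before = [3, 3]          # two rules with three conditions each
    after = before + [1]     # one extra rule with a single condition
    print(average_rule_length(before), total_rule_length(before))
    print(average_rule_length(after), total_rule_length(after))
```

Adding the short rule lowers the average from 3 to about 2.33 although the system now has more rules and more conditions; the total rises from 6 to 7 and so reflects the added complexity.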


77.3.3 Other Interpretability Improvement Approaches


For the design of accurate fuzzy rule-based systems for
high-dimensional problems, clustering techniques such
as fuzzy c-means [77.54-56] have often been used to
generate fuzzy rules [77.42, 57-61]. Fuzzy rules with
ellipsoidal high-dimensional antecedent fuzzy sets are
often obtained from clustering-based fuzzy rule
generation methods. Fuzzy rules of this type have high
accuracy but low interpretability. Their interpretability
is improved by projecting high-dimensional antecedent
fuzzy sets onto each input variable. As a result, we have
approximative fuzzy rule-based systems (Fig. 77.11a).
The interpretability of the obtained fuzzy rule-based
systems can be further improved by merging similar
antecedent fuzzy sets on each input variable into a single
one (Fig. 77.11b). Each of the antecedent fuzzy sets
generated by the merging procedure is replaced with
a linguistic value to further improve the interpretability of
fuzzy rule-based systems.

It should be noted that each of the abovementioned
interpretability improvement steps (i.e., projection
of high-dimensional antecedent fuzzy sets, merging
similar fuzzy sets, and replacement with linguistic
values) deteriorates the accuracy of fuzzy rule-based
systems. Thus, the design of fuzzy rule-based systems
can be viewed as being the search for a good
tradeoff solution between accuracy and interpretability.
From this viewpoint, some sophisticated approaches
were proposed [77.62-68] after a large number of
accuracy improvement algorithms were proposed in
the 1990s. Some of those approaches tried to improve
the accuracy of fuzzy rule-based systems without
severely deteriorating their interpretability. Other
approaches tried to improve the interpretability of fuzzy
rule-based systems without severely deteriorating their
accuracy.


77.4 Single-Objective Approaches


As we have already explained, the simplest multiobjective
formulation of fuzzy system design has two
objectives (i.e., error minimization and complexity
minimization) as follows

Minimize $\mathbf{f}(S) = (f_1(S), f_2(S)) = (\mathrm{Complexity}(S), \mathrm{Error}(S))$, (77.12)

where $\mathbf{f}(S)$ shows an objective vector. In this section,
we explain how the two-objective problem
in (77.12) can be handled by single-objective approaches.
For more general and comprehensive explanations
on the handling of multiobjective problems
through single-objective optimization, see textbooks
on multicriteria decision making such as
Miettinen [77.69].


77.4.1 Use of Scalarizing Functions


One of the most frequently used approaches to multiobjective
optimization is the use of scalarizing functions.
Multiple objective functions are combined into a single
scalarizing function. That is, a multiobjective problem
is handled as a single-objective problem. Our two
objectives in multiobjective fuzzy system design are
combined as follows

Minimize $f(S) = f(f_1(S), f_2(S)) = f(\mathrm{Complexity}(S), \mathrm{Error}(S))$, (77.13)


where $f(S)$ is a scalarizing function to be minimized.
A simple but frequently used scalarizing function is the
weighted sum

Minimize $f(S) = w_1 f_1(S) + w_2 f_2(S) = w_1 \cdot \mathrm{Complexity}(S) + w_2 \cdot \mathrm{Error}(S)$, (77.14)

where $w_1$ and $w_2$ are nonnegative weights ($\mathbf{w}$ is a weight
vector: $\mathbf{w} = (w_1, w_2)$).

Single-objective optimization algorithms such as
genetic algorithms are used to search for the optimal
solution (i.e., optimal fuzzy rule-based system) of the
minimization problem in (77.13). In Fig. 77.12, we
illustrate the search for the optimal fuzzy rule-based
system of the weighted-sum minimization problem in
(77.14) together with the nondominated fuzzy rule-based
systems of the original two-objective problem in
(77.12).

Fig. 77.12 The optimal fuzzy rule-based system of the
weighted-sum minimization problem in (77.14) and the
nondominated fuzzy rule-based systems of the original
two-objective problem

As shown in Fig. 77.12, a single optimal fuzzy rule- 
based system is obtained from a scalarizing function- 
based approach. The main difficulty of this approach is 
the dependency of the obtained fuzzy rule-based sys- 
tem on the choice of a scalarizing function. A different 
fuzzy rule-based system is likely to be obtained from 
a different scalarizing function. For example, a different 
specification of the weight vector in Fig. 77.12 leads to 
a different fuzzy rule-based system. Moreover, an ap- 
propriate choice of a scalarizing function is not easy. 
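
The dependency on the weight vector can be illustrated with a small sketch of the weighted sum in (77.14). The candidate systems below are hypothetical (complexity, error) pairs invented for illustration, and an exhaustive `min` over a fixed list stands in for the genetic search.

```python
# Hypothetical candidates: (Complexity(S), Error(S)) for five systems.
CANDIDATES = [(2, 0.30), (4, 0.18), (6, 0.12), (9, 0.10), (9, 0.15)]


def weighted_sum(objectives, weights):
    """Scalarizing function of (77.14): w1 * f1(S) + w2 * f2(S)."""
    return sum(w * f for w, f in zip(weights, objectives))


def best_by_weighted_sum(candidates, weights):
    """Return the candidate that minimizes the weighted sum."""
    return min(candidates, key=lambda s: weighted_sum(s, weights))


if __name__ == "__main__":
    for weights in [(1, 10), (1, 100), (1, 1000)]:
        print(weights, best_by_weighted_sum(CANDIDATES, weights))
```

Each weight vector selects a different system here, from the simplest to the most accurate, so the outcome of a single run is fixed by a choice the designer must make before seeing any tradeoff.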


77.4.2 Handling of Objectives as Constraint Conditions


If we have a pre-specified requirement about the
complexity or the accuracy, we can use it as a constraint
condition. For example, let us assume that the error
measure Error(S) in our two-objective problem is the
classification error rate. We also assume that the upper
bound of the allowable error rate is given as $\alpha$%. In this
case, our two-objective problem can be reformulated as
the following single-objective problem with a constraint
condition

Minimize $f_1(S) = \mathrm{Complexity}(S)$ subject to $\mathrm{Error}(S) \le \alpha$. (77.15)

This single-objective problem is to find the simplest
fuzzy rule-based system among those with a pre-specified
accuracy (i.e., with error rates smaller than or equal
to $\alpha$%).

It is also possible to use a constraint condition on
the complexity measure Complexity(S). For example,
let us assume that Complexity(S) is the number of fuzzy
rules. We also assume that the upper bound of the
allowable number of fuzzy rules is given as $\beta$. In this case,
the following single-objective problem is formulated

Minimize $f_2(S) = \mathrm{Error}(S)$ subject to $\mathrm{Complexity}(S) \le \beta$. (77.16)

This formulation is illustrated in Fig. 77.13 where the
optimal solution is the most accurate fuzzy rule-based
system under the constraint condition $\mathrm{Complexity}(S) \le \beta$.

When we have more than two objectives, only a single
objective is used as an objective function while all
the others are used as constraint conditions in this
approach. That is, an m-objective problem is reformulated
as a single-objective problem with (m - 1) constraint
conditions. The main difficulty in this constraint
condition-based approach is an appropriate specification of
the upper bound for each objective.
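
Both constrained formulations can be sketched over a fixed candidate set. The candidates are hypothetical (complexity, error) pairs, and the bounds `alpha` and `beta` play the roles of the right-hand side constants in (77.15) and (77.16); exhaustive search replaces the optimizer.

```python
# Hypothetical candidates: (Complexity(S), Error(S)).
CANDIDATES = [(2, 0.30), (4, 0.18), (6, 0.12), (9, 0.10), (9, 0.15)]


def simplest_within_error(candidates, alpha):
    """Minimize complexity subject to error <= alpha, as in (77.15)."""
    feasible = [s for s in candidates if s[1] <= alpha]
    return min(feasible, key=lambda s: s[0]) if feasible else None


def most_accurate_within_complexity(candidates, beta):
    """Minimize error subject to complexity <= beta, as in (77.16)."""
    feasible = [s for s in candidates if s[0] <= beta]
    return min(feasible, key=lambda s: s[1]) if feasible else None


if __name__ == "__main__":
    print(simplest_within_error(CANDIDATES, alpha=0.15))
    print(most_accurate_within_complexity(CANDIDATES, beta=4))
    print(simplest_within_error(CANDIDATES, alpha=0.05))  # no feasible system
```

The last call returns `None` because no candidate meets the bound, which is the practical face of the difficulty noted above: a bound chosen without knowledge of the tradeoff may leave the problem infeasible.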


77.4.3 Minimization of the Distance to the Reference Point


In the abovementioned constraint condition-based
approach, the right-hand side constant for each objective
is the upper bound of the allowable error or complexity
(e.g., the error rate should be smaller than or equal
to $\alpha$%). The right-hand side constant
should be specified so that the formulated constrained
optimization problem has feasible fuzzy rule-based
systems.

A single-objective problem can also be formulated
when an ideal fuzzy rule-based system is given as
a reference point in the objective space. We assume that
the given reference point is outside the feasible region
of the original two-objective problem in (77.12).
That is, the ideal fuzzy rule-based system does not
exist as a feasible solution of the two-objective problem.
Let the reference point in the two-dimensional
objective space be $\mathbf{f}^* = (f_1^*, f_2^*)$. The following
single-objective problem can be formulated to search for
the fuzzy rule-based system closest to the reference
point

Minimize $\mathrm{distance}(\mathbf{f}(S), \mathbf{f}^*)$, (77.17)

where $\mathbf{f}(S)$ is the objective vector (i.e., $\mathbf{f}(S) = (f_1(S), f_2(S))$),
and distance(A, B) is a distance measure
between the two points A and B in the objective space.
Various distance measures can be used in (77.17).
We illustrate the reference point-based approach in
Fig. 77.14 where the Euclidean distance is used. As
shown in Fig. 77.14, the fuzzy rule-based system closest
to the given reference point $(f_1^*, f_2^*)$ is the optimal
solution of the single-objective problem in (77.17).

Fig. 77.13 The optimal fuzzy rule-based system of the
constrained optimization problem in (77.15) and the
nondominated fuzzy rule-based systems of the original
two-objective problem

The main difficulty of the reference point-based
approach is an appropriate specification of the reference
point. When we have no information about the
complexity and the accuracy of fuzzy rule-based systems,
it is very difficult to appropriately specify the reference
point in the reference point-based approach as well
as the right-hand side constant for each objective in
the constraint condition-based approach. However, if
we know the shape of the complexity-accuracy tradeoff
surface in the objective space (i.e., if we know the
nondominated fuzzy rule-based systems in the objective
space), such a parameter specification becomes much
easier.

Fig. 77.14 The optimal fuzzy rule-based system of the
distance minimization problem from the reference point in
(77.17) and the nondominated fuzzy rule-based systems of
the original two-objective problem
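
The reference point formulation in (77.17) can be sketched with the Euclidean distance. The candidates and the reference point are hypothetical, and the optional `scales` argument is an added assumption, not part of the text: it divides each objective by a constant so that an objective with a large numerical range does not dominate the distance.

```python
import math

# Hypothetical candidates: (Complexity(S), Error(S)).
CANDIDATES = [(2, 0.30), (4, 0.18), (6, 0.12), (9, 0.10), (9, 0.15)]


def closest_to_reference(candidates, reference, scales=(1.0, 1.0)):
    """Return the candidate with the smallest (scaled) Euclidean distance."""
    def distance(objectives):
        return math.sqrt(sum(
            ((f - r) / s) ** 2
            for f, r, s in zip(objectives, reference, scales)
        ))
    return min(candidates, key=distance)


if __name__ == "__main__":
    ideal = (0, 0.0)  # an infeasible ideal: no rules and no error
    print(closest_to_reference(CANDIDATES, ideal))
    print(closest_to_reference(CANDIDATES, ideal, scales=(9, 0.30)))
```

Without scaling, the rule count dominates the distance and the simplest system is returned; with scaling, a middle compromise is chosen. The result therefore depends on the distance measure as well as on the reference point.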


77.5 Evolutionary Multiobjective Approaches 


Since an early study in the 1990s [77.15], various mul- 
tiobjective approaches have been proposed to search for 
a large number of nondominated solutions of multiob- 
jective fuzzy system design problems. In this section, 
we explain the basic idea of those multiobjective ap- 
proaches, recent studies on multiobjective fuzzy system 
design, and future research directions. 


77.5.1 Basic Idea of Evolutionary Multiobjective Approaches


Multiobjective fuzzy system design was first formu- 
lated as a two-objective optimization problem to max- 
imize the accuracy of fuzzy rule-based classifiers 
and to minimize the number of fuzzy rules in the 
1990s [77.15]. Then this two-objective optimization 
problem was extended to a three-objective problem by 
including an additional objective to minimize the total
rule length (i.e., the total number of antecedent
conditions) in [77.52].

The main characteristic feature of evolutionary
multiobjective approaches to fuzzy system design is that
a number of nondominated fuzzy rule-based systems
are obtained by a single run of an evolutionary
multiobjective optimization (EMO) algorithm. This is clearly
different from the single-objective approaches where
a single fuzzy rule-based system is obtained by a single
run of a single-objective optimization algorithm.
In Fig. 77.15, we illustrate the search for nondominated
fuzzy rule-based systems in evolutionary
multiobjective approaches. The population of solutions (i.e.,
fuzzy rule-based systems) is pushed toward the Pareto
front and widened along the Pareto front to search
for a variety of nondominated solutions. Well-known
and frequently used EMO algorithms such as
nondominated sorting genetic algorithm II (NSGA-II) [77.70],
strength Pareto evolutionary algorithm (SPEA) [77.71],
multiobjective evolutionary algorithm based on
decomposition (MOEA/D) [77.72], and S metric selection
evolutionary multiobjective optimisation algorithm
(SMS-EMOA) [77.73] have their own mechanisms to
push the population toward the Pareto front and widen
the population along the Pareto front.

Fig. 77.15 Search for a variety of nondominated fuzzy
rule-based systems along the Pareto front by evolutionary
multiobjective approaches

The obtained set of nondominated solutions can be 
used to examine the complexity-accuracy tradeoff
relation in the design of fuzzy rule-based systems [77.53].
A human decision maker is supposed to choose a final 
fuzzy rule-based system from the obtained nondomi- 
nated ones according to his/her preference. It should be 
noted that the decision maker’s preference is needed in 
the problem formulation phase in the single-objective 
approaches in the previous section (i. e., in the form of 
a scalarizing function, the upper bound of the allow- 
able values for each objective, and the reference point 
in the objective space). However, the evolutionary mul- 
tiobjective approaches do not need any information on 
the decision maker’s preference in their search for non- 
dominated fuzzy rule-based systems. That is, a number 
of nondominated fuzzy rule-based systems can be ob- 
tained with no information on the decision maker’s 
preference. A human decision maker is needed only in 
the solution selection phase after a number of nondom- 
inated solutions are obtained. 
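
The notion of a nondominated solution that underlies these approaches can be stated in code. The sketch below filters a hypothetical list of (complexity, error) pairs, both to be minimized; it shows only the dominance relation, not the search mechanism of any particular EMO algorithm.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def nondominated(candidates):
    """Return the candidates that no other candidate dominates."""
    return [s for s in candidates
            if not any(dominates(t, s) for t in candidates)]


if __name__ == "__main__":
    # Hypothetical (Complexity(S), Error(S)) pairs.
    population = [(2, 0.30), (4, 0.18), (6, 0.12),
                  (9, 0.10), (9, 0.15), (5, 0.25)]
    print(nondominated(population))
```

The filter uses no weights, bounds, or reference point, which mirrors the remark above that the decision maker's preference is needed only afterwards, when one system is picked from the nondominated set.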


77.5.2 Various Evolutionary Multiobjective Approaches


We have explained multiobjective fuzzy rule-based 
design using the two-objective formulation with the 
complexity minimization and the error minimization in 
Fig. 77.15. However, various evolutionary multiobjec- 
tive approaches have been proposed for multiobjective 
fuzzy system design (for their review, see [77.19]). In 
this subsection, we briefly explain some of those evolu- 
tionary multiobjective approaches. 

In some real-world applications, the design of 
fuzzy rule-based systems involves multiple perfor- 
mance measures. Especially in multiobjective fuzzy 
controller design, multiple performance measures have 
been frequently used with no complexity measures. For 
example, in Stewart et al. [77.74], multiobjective fuzzy 
controller design was formulated as a three-objective 
problem with three performance measures: a current
tracking error, a velocity tracking error, and power
consumption. In Chen and Chiang [77.75], fuzzy
controller design was formulated using no complexity 
measure and three accuracy measures: the number of 
collisions, the distance between the target and lead 
points of the new path, and the number of explored 
actions. Whereas multiple performance measures have 
been frequently used in multiobjective fuzzy controller 
design, a single performance measure such as the 
error rate has been mainly used in multiobjective 
fuzzy classifier design. However, for the handling of 
classification problems with imbalanced and cost- 
sensitive data sets, multiple performance measures 
were used in some studies on multiobjective fuzzy 
classifier design. For example, a true positive rate and 
a false positive rate were used as separate performance 
measures together with a complexity measure in three- 
objective fuzzy classifier design in [77.76]. 

Multiple complexity measures have been frequently 
used in multiobjective fuzzy classifier design. In 
the first study on multiobjective fuzzy classifier de- 
sign [77.15], the number of fuzzy rules was used 
as a complexity measure. Then the total rule length 
(i.e., the total number of antecedent conditions) was 
added as another complexity measure in three-objec- 
tive fuzzy classifier design [77.52, 53]. The number of 
fuzzy rules and the total rule length have been used 
in many other studies on multiobjective fuzzy classifier
design [77.77-79]. In some studies, the number of
antecedent fuzzy sets was used instead of the total rule 
length [77.80, 81]. 

When membership function tuning is performed together
with fuzzy rule generation in fuzzy classifier
design, complexity measures such as the number of
fuzzy rules and the total rule length are not always
enough to evaluate the interpretability of fuzzy rule-based
systems. Let us compare two fuzzy partitions
in Fig. 77.16 with each other. The 5 × 5 fuzzy partition
in Fig. 77.16a has 25 fuzzy rules while the 4 × 4
fuzzy partition in Fig. 77.16b has 16 fuzzy rules. Thus,
the fuzzy partition in Fig. 77.16a is evaluated as being
more complicated than that of Fig. 77.16b when the
abovementioned simple complexity measures are used.
However, we intuitively feel that the simple 5 × 5 fuzzy
partition in Fig. 77.16a is more interpretable than the
tuned 4 × 4 fuzzy partition in Fig. 77.16b. This is because
the tuned antecedent fuzzy sets in Fig. 77.16b are
not easy to interpret linguistically. These discussions on
the comparison between the two fuzzy partitions show
the necessity of interpretability measures in addition
to the abovementioned simple complexity measures in
fuzzy classifier design when membership function tuning
is performed.

Fig. 77.16a,b Two fuzzy partitions: (a) simple 5 × 5 grid, (b) tuned 4 × 4 grid

Interpretability of fuzzy rule-based systems has 
been a hot topic in the field of fuzzy systems [77.82]. 
Various aspects of fuzzy rule-based systems are related 
to their interpretability [77.83-88]. Some studies focus 
on the explanation ability of fuzzy rule-based classifiers 
to explain why each pattern is classified as a particular 
class in an understandable manner [77.89]. 

Whereas a number of studies have already addressed
the interpretability of fuzzy rule-based
systems [77.82-89], it is still a very difficult open problem
to quantitatively define all aspects of the interpretability 
of fuzzy rule-based systems. This is because the inter- 
pretability is totally subjective. That is, its definition 
totally depends on human users. Each human user may 
have a different idea about the interpretability of fuzzy 
rule-based systems. 

A number of approaches have been proposed to 
incorporate the interpretability into evolutionary mul- 
tiobjective fuzzy system design [77.90-94]. The basic 
idea is to significantly improve the accuracy of fuzzy 
rule-based systems by slightly deteriorating their in- 
terpretability (e.g., by slightly tuning antecedent fuzzy 
sets). Since the interpretability is totally subjective, it is 
not easy to compare those approaches. In this sense, ex- 
perimental studies on the interpretability of fuzzy rule- 
based systems seem to be one of the promising research 
directions [77.83]. 

Whereas multiobjective genetic algorithms have
been mainly used in evolutionary multiobjective fuzzy
system design, the use of other algorithms was also
examined. This is closely related to the increase in the
popularity of not only multiobjective genetic algorithms
but also other multiobjective algorithms. For example,
multiobjective versions of particle swarm optimization
(PSO) have been actively studied in the field of
evolutionary computation [77.95-99]. In response to those
active studies, multiobjective PSO algorithms were used
for multiobjective fuzzy system design [77.100-104].


77.5.3 Future Research Directions 


Formulation of interpretability is still an important
issue to be further studied. As pointed out by many
studies [77.83-88], various aspects are related to the
interpretability of fuzzy rule-based systems. One problem
is to quantitatively formulate those aspects so that they 
can be used as objectives in evolutionary multiobjec- 
tive fuzzy system design. Another problem is how to 
use them. We may have several options: the use of all 
aspects as separate objectives, the choice of only a few 
aspects as separate objectives, and the integration of all 
or some aspects into a few interpretability measures. If 
we use all aspects as separate objectives, multiobjective 
fuzzy system design is formulated as a many-objective 
problem. It is well known that many-objective problems
are usually very difficult for evolutionary multiobjective
optimization algorithms [77.105-107]. However,
both the choice of only a few aspects and the integration 
into a few interpretability measures are also difficult. 
The main advantage of multiobjective approaches to
fuzzy system design over single-objective approaches is
that a number of nondominated fuzzy rule-based systems
are obtained along the interpretability-accuracy
tradeoff surface as we explained in the complexity-accuracy
objective space. One issue, which has not been
discussed in many studies, is how to choose a sin- 
gle fuzzy rule-based system from a large number of 
obtained ones. It is implicitly assumed that a single 
fuzzy rule-based system is to be selected by a hu- 
man decision maker. However, the selection of a single 
fuzzy rule-based system is an important issue espe- 
cially when a large number of fuzzy rule-based systems 
are obtained in a high-dimensional objective space. 
A related research topic is the elicitation of the decision
maker’s preference about interpretability-accuracy
tradeoffs and its utilization in evolutionary multiobjective
fuzzy system design.

Performance improvement of evolutionary multiobjective
approaches is still an important research topic.
Since multiobjective fuzzy system design is often
formulated as complicated multiobjective optimization
problems with many discrete and continuous decision
variables, it is very difficult to search for their true
Pareto optimal solutions. Thus, it is likely that better
fuzzy rule-based systems than reported results in the
literature would be obtained by more efficient
multiobjective algorithms and/or better problem formulations.
Actually, better results are continuously reported in the
literature. A related research topic is parallel
implementation of evolutionary multiobjective approaches.
In general, parallel implementation of evolutionary
algorithms is not difficult due to their population-based
search mechanisms (i.e., because the fitness evaluation
of multiple individuals in the current population can be
easily performed in parallel).

Multiobjective genetic algorithms have been mainly
used for evolutionary multiobjective fuzzy system
design. As we have already mentioned, recently the use
of multiobjective PSO has been examined [77.100-104].
Since other population-based search algorithms
such as ant colony optimization (ACO) have already
been used in single-objective approaches to fuzzy system
design [77.108-112], the use of their multiobjective
versions will be examined for multiobjective fuzzy system
design.

A very important and promising research direction
is multiobjective design of type-2 fuzzy
systems [77.113]. A number of single-objective
approaches have already been proposed for type-2 fuzzy
system design [77.114-116]. However, multiobjective
type-2 fuzzy system design has not been discussed in
many studies.


77.6 Conclusion


We explained the basic idea of evolutionary multiobjective
fuzzy system design using a simple two-objective
formulation for complexity and error minimization in
comparison with single-objective approaches. The main
advantage of multiobjective approaches is that a large
number of fuzzy rule-based systems with different
complexity-accuracy tradeoffs are obtained from a single
run of a multiobjective approach. A human user
can choose a single fuzzy rule-based system based on
his/her preference and the requirement in each application
field. Highly interpretable fuzzy systems may be
needed in some application fields while highly accurate
ones may be preferred in other application fields.
See [77.19] for a more comprehensive review on
evolutionary multiobjective approaches to fuzzy rule-based
system design, [77.88] for single-objective and
multiobjective approaches, and [77.115, 116] for type-2
fuzzy system design.
particle swarm optimization, and ant colony optimization as three different paradigms that help in the design of optimal type-2 fuzzy systems. We also provide a comparison of the different optimization methods for the case of designing type-2 fuzzy systems.


78.1 Related Work in Type-2 Fuzzy Control 


Uncertainty affects decision-making and appears in 
a number of different forms. The concept of infor- 
mation is fully connected with the concept of un- 
certainty [78.1]. The most fundamental aspect of this 
connection is that the uncertainty involved in any 
problem-solving situation is a result of some informa- 
tion deficiency, which may be incomplete, imprecise, 
fragmentary, not fully reliable, vague, contradictory, or 
deficient in some other way. Uncertainty is an attribute 
of information [78.2]. The general framework of fuzzy 
reasoning allows handling much of this uncertainty, and 
fuzzy systems that employ type-1 fuzzy sets represent 
uncertainty by numbers in the range [0, 1]. When something is uncertain, like a measurement, it is difficult to determine its exact value, and of course using type-1 fuzzy sets makes more sense than using crisp sets [78.3]. However, it is not reasonable to use an accurate membership function for something uncertain, so in this case what we need are higher-order fuzzy sets that are able to handle these uncertainties, like the so-called type-2 fuzzy sets [78.3]. So, the degree of uncertainty can be managed by using type-2 fuzzy logic because this offers better capabilities to handle linguistic uncertainties by modeling vagueness and unreliability of information [78.4-6].

Recently, we have seen the use of type-2 fuzzy 
sets in fuzzy logic systems (FLS) in different ar- 
eas of application [78.7-11]. In this paper we deal 
with the application of interval type-2 fuzzy con- 
trol to non-linear dynamic systems [78.4, 12-15]. It 
is a well-known fact that in the control of real systems, the instrumentation elements (instrumentation amplifier, sensors, digital to analog, analog to digital converters, etc.) introduce some sort of unpredictable values in the information that has been collected [78.16]. So, the controllers designed under idealized conditions tend to behave in an inappropriate manner [78.17].


78.2 Fuzzy Logic Systems


In this section, a brief overview of type-1 and type-2 fuzzy systems is presented. This overview is considered to be necessary to understand the basic concepts needed to develop the methods and algorithms presented later in the chapter.


78.2.1 Type-1 Fuzzy Logic Systems


Soft computing techniques have become an important research topic that can be applied in the design of intelligent controllers, which utilize human experience in a more natural form than the conventional mathematical approach [78.18, 19]. An FLS described completely in terms of type-1 fuzzy sets is called a type-1 fuzzy logic system (type-1 FLS). In this paper, the fuzzy controller has two input variables, which are the error e(t) and the error variation Δe(t),

$$e(t) = r(t) - y(t), \tag{78.1}$$

$$\Delta e(t) = e(t) - e(t-1), \tag{78.2}$$

so the control system can be represented as shown in Fig. 78.1.

Fig. 78.1 System used to obtain the experimental results (block diagram: the error drives the type-2 FLC, whose output u feeds the plant y = f(u); noise can be enabled on the output, y = y + 0.05 · randn, to introduce uncertainty to the system)

Fig. 78.2 Type-1 membership function
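As a minimal sketch of (78.1) and (78.2), the two controller inputs can be computed from the reference r(t) and the measured output y(t) as shown here. The function name `controller_inputs` and the sample values are illustrative assumptions and are not taken from the original experimental setup.

```python
def controller_inputs(reference, output, previous_error):
    """Return the error e(t) and the error variation Δe(t) of (78.1)-(78.2)."""
    error = reference - output                 # e(t) = r(t) - y(t)
    error_variation = error - previous_error   # Δe(t) = e(t) - e(t-1)
    return error, error_variation


# Hypothetical samples: a constant reference and an output that approaches it.
previous_error = 0.0
for y in (0.0, 0.4, 0.7):
    e, de = controller_inputs(1.0, y, previous_error)
    previous_error = e
```

The previous error is passed in explicitly so that the function stays free of hidden state; a real control loop would keep it between sampling instants.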


78.2.2 Type-2 Fuzzy Logic Systems 


If for a type-1 membership function, as in Fig. 78.2, 
we blur it to the left and to the right, as illustrated 
in Fig. 78.3, then a type-2 membership function is 
obtained. In this case, for a specific value x', the membership function takes on different values u', which are not all weighted the same, so we can assign an amplitude distribution to all of those points.

A type-2 fuzzy set $\tilde{A}$ is characterized by the membership function [78.1, 3]

$$\tilde{A} = \{((x,u), \mu_{\tilde{A}}(x,u)) \mid \forall x \in X,\ \forall u \in J_x \subseteq [0,1]\}, \tag{78.3}$$

in which $0 \le \mu_{\tilde{A}}(x,u) \le 1$. Another expression for $\tilde{A}$ is

$$\tilde{A} = \int_{x \in X} \int_{u \in J_x} \mu_{\tilde{A}}(x,u)/(x,u), \quad J_x \subseteq [0,1], \tag{78.4}$$

where $\int\int$ denotes the union over all admissible input variables x and u. For discrete universes of discourse $\int$ is replaced by $\sum$. In fact $J_x \subseteq [0,1]$ represents the primary membership of x, and $\mu_{\tilde{A}}(x,u)$ is a type-1 fuzzy set known as the secondary set. Hence, a type-2 membership grade can be any subset in [0, 1], the primary membership, and corresponding to each primary membership, there is a secondary membership (which can also be in [0, 1]) that defines the possibilities for the primary membership. Uncertainty is represented by a region, which is called the footprint of uncertainty (FOU). When $\mu_{\tilde{A}}(x,u) = 1,\ \forall u \in J_x \subseteq [0,1]$, we have an interval type-2 membership function, as shown in Fig. 78.4. The uniform shading for the FOU represents the entire interval type-2 fuzzy set and it can be described in terms of an upper membership function $\bar{\mu}_{\tilde{A}}(x)$ and a lower membership function $\underline{\mu}_{\tilde{A}}(x)$.
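To make the upper and lower membership functions concrete, the following sketch evaluates the two bounds of the FOU for one common choice of interval type-2 set: a Gaussian with a fixed mean and an uncertain standard deviation. This particular shape, the function names, and the parameter names are assumptions chosen for illustration; the chapter does not prescribe them.

```python
import math


def gaussian(x, mean, sigma):
    """Type-1 Gaussian membership grade."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)


def interval_type2_gaussian(x, mean, sigma_lower, sigma_upper):
    """Return (lower, upper) membership grades that bound the FOU at x."""
    if not 0 < sigma_lower <= sigma_upper:
        raise ValueError("require 0 < sigma_lower <= sigma_upper")
    lower = gaussian(x, mean, sigma_lower)   # narrower curve: lower MF
    upper = gaussian(x, mean, sigma_upper)   # wider curve: upper MF
    return lower, upper


lower, upper = interval_type2_gaussian(1.0, mean=0.0, sigma_lower=0.5, sigma_upper=1.0)
```

Because both curves share the same mean, the lower grade never exceeds the upper grade, and the two coincide at the mean.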

An FLS described using at least one type-2 fuzzy set is called a type-2 FLS. Type-1 FLSs are unable to directly handle rule uncertainties, because they use type-1 fuzzy sets that are certain [78.3]. On the other hand, type-2 FLSs are very useful in circumstances where it is difficult to determine an exact membership function and there are measurement uncertainties [78.14, 20, 21].

A type-2 FLS is again characterized by IF-THEN rules, but its antecedent or consequent sets are now of type-2. Similar to a type-1 FLS, a type-2 FLS includes a fuzzifier, a rule base, a fuzzy inference engine, and an output processor, as we can see in Fig. 78.5. The output processor includes a type-reducer and a defuzzifier; it generates a type-1 fuzzy set output (type-reducer) or a crisp number (defuzzifier).


Fuzzifier 

The fuzzifier maps a crisp point $\mathbf{x} = (x_1, \ldots, x_p)^T \in X_1 \times X_2 \times \cdots \times X_p \equiv X$ into a type-2 fuzzy set $\tilde{A}_x$ in X [78.1], interval type-2 fuzzy sets in this case. We will use a type-2 singleton fuzzifier; in singleton fuzzification, the input fuzzy set has only a single point of nonzero membership [78.3]. $\tilde{A}_x$ is a type-2 fuzzy singleton if $\mu_{\tilde{A}_x}(\mathbf{x}) = 1/1$ for $\mathbf{x} = \mathbf{x}'$ and $\mu_{\tilde{A}_x}(\mathbf{x}) = 1/0$ for all other $\mathbf{x} \ne \mathbf{x}'$ [78.1].


Rules 
The structure of rules in a type-1 FLS and a type-2 FLS is the same, but in the latter the antecedents and the consequents will be represented by type-2 fuzzy sets. So for a type-2 FLS with p inputs $x_1 \in X_1, \ldots, x_p \in X_p$ and one output $y \in Y$, we have a multiple input single output (MISO) system. If we assume that there are M rules, the l-th rule in the type-2 FLS can be written as follows [78.3]

$$R^l: \text{IF } x_1 \text{ is } \tilde{F}_1^l \text{ and } \cdots \text{ and } x_p \text{ is } \tilde{F}_p^l, \text{ THEN } y \text{ is } \tilde{G}^l, \quad l = 1, \ldots, M. \tag{78.5}$$

Fig. 78.5 Type-2 FLS (block diagram: the inputs are fuzzified; the inference engine uses the rules to map active antecedents to active consequents; the output processor, formed by the type-reducer and the defuzzifier, delivers the crisp output value)

Inference 
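In the type-2 FLS, the inference engine combines rules and gives a mapping from input type-2 fuzzy sets to output type-2 fuzzy sets. It is necessary to compute the join $\sqcup$ (unions) and the meet $\sqcap$ (intersections), as well as extended sup-star compositions of type-2 relations [78.3]. If

$$\tilde{F}_1^l \times \cdots \times \tilde{F}_p^l = \tilde{A}^l, \quad R^l: \tilde{F}_1^l \times \cdots \times \tilde{F}_p^l \to \tilde{G}^l = \tilde{A}^l \to \tilde{G}^l, \tag{78.6}$$

$R^l$ is described by the membership function $\mu_{R^l}(\mathbf{x}, y) = \mu_{R^l}(x_1, \ldots, x_p, y)$, where

$$\mu_{R^l}(\mathbf{x}, y) = \mu_{\tilde{A}^l \to \tilde{G}^l}(\mathbf{x}, y) \tag{78.7}$$

can be written as [78.3]

$$\mu_{R^l}(\mathbf{x}, y) = \mu_{\tilde{A}^l \to \tilde{G}^l}(\mathbf{x}, y) = \mu_{\tilde{F}_1^l}(x_1) \sqcap \cdots \sqcap \mu_{\tilde{F}_p^l}(x_p) \sqcap \mu_{\tilde{G}^l}(y) = \left[\sqcap_{i=1}^{p} \mu_{\tilde{F}_i^l}(x_i)\right] \sqcap \mu_{\tilde{G}^l}(y). \tag{78.8}$$

In general, the p-dimensional input to $R^l$ is given by the type-2 fuzzy set $\tilde{A}_x$ whose membership function is

$$\mu_{\tilde{A}_x}(\mathbf{x}) = \mu_{\tilde{X}_1}(x_1) \sqcap \cdots \sqcap \mu_{\tilde{X}_p}(x_p) = \sqcap_{i=1}^{p} \mu_{\tilde{X}_i}(x_i), \tag{78.9}$$

where $\tilde{X}_i$ (i = 1, ..., p) are the labels of the fuzzy sets describing the inputs. Each rule $R^l$ determines a type-2 fuzzy set $\tilde{B}^l = \tilde{A}_x \circ R^l$ such that [78.3]

$$\mu_{\tilde{B}^l}(y) = \mu_{\tilde{A}_x \circ R^l}(y) = \sqcup_{\mathbf{x} \in X}\left[\mu_{\tilde{A}_x}(\mathbf{x}) \sqcap \mu_{R^l}(\mathbf{x}, y)\right], \quad y \in Y,\ l = 1, \ldots, M. \tag{78.10}$$

This equation is the input/output relation in Fig. 78.5 between the type-2 fuzzy set that activates one rule in the inference engine and the type-2 fuzzy set at the output of that engine [78.3]. In the FLS we used interval type-2 fuzzy sets and meet under product t-norm, so the result of the input and antecedent operations, which are contained in the firing set $\sqcap_{i=1}^{p} \mu_{\tilde{F}_i^l}(x_i') \equiv F^l(\mathbf{x}')$, is an interval type-1 set [78.3],

$$F^l(\mathbf{x}') = \left[\underline{f}^l(\mathbf{x}'), \bar{f}^l(\mathbf{x}')\right] \equiv \left[\underline{f}^l, \bar{f}^l\right], \tag{78.11}$$

where

$$\underline{f}^l(\mathbf{x}') = \underline{\mu}_{\tilde{F}_1^l}(x_1') * \cdots * \underline{\mu}_{\tilde{F}_p^l}(x_p'), \tag{78.12}$$

$$\bar{f}^l(\mathbf{x}') = \bar{\mu}_{\tilde{F}_1^l}(x_1') * \cdots * \bar{\mu}_{\tilde{F}_p^l}(x_p'), \tag{78.13}$$

where * is the product operation.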


Type-Reducer 
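The type-reducer generates a type-1 fuzzy set output, which is then converted into a crisp output through the defuzzifier. This type-1 fuzzy set is also an interval set; for the case of our FLS we used center of sets (cos) type reduction, $Y_{\cos}$, which is expressed as [78.3]

$$Y_{\cos}(\mathbf{x}) = [y_l, y_r] = \int_{y^1 \in [y_l^1, y_r^1]} \cdots \int_{y^M \in [y_l^M, y_r^M]} \int_{f^1 \in [\underline{f}^1, \bar{f}^1]} \cdots \int_{f^M \in [\underline{f}^M, \bar{f}^M]} 1 \Bigg/ \frac{\sum_{i=1}^{M} f^i y^i}{\sum_{i=1}^{M} f^i}. \tag{78.14}$$

This interval set is determined by its two end points, $y_l$ and $y_r$, which correspond to the centroid of the type-2 interval consequent set $\tilde{G}^i$ [78.3],

$$C_{\tilde{G}^i} = \int_{\theta_1 \in J_{y_1}} \cdots \int_{\theta_N \in J_{y_N}} 1 \Bigg/ \frac{\sum_{i=1}^{N} y_i \theta_i}{\sum_{i=1}^{N} \theta_i} = [y_l^i, y_r^i]. \tag{78.15}$$

Before the computation of $Y_{\cos}(\mathbf{x})$, we must evaluate (78.15) and its two end points, $y_l^i$ and $y_r^i$. If the values of $f^i$ and $y^i$ that are associated with $y_l$ are denoted $f_l^i$ and $y_l^i$, respectively, and the values of $f^i$ and $y^i$ that are associated with $y_r$ are denoted $f_r^i$ and $y_r^i$, respectively, from (78.14), we have [78.3]

$$y_l = \frac{\sum_{i=1}^{M} f_l^i y_l^i}{\sum_{i=1}^{M} f_l^i}, \tag{78.16}$$

$$y_r = \frac{\sum_{i=1}^{M} f_r^i y_r^i}{\sum_{i=1}^{M} f_r^i}. \tag{78.17}$$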




Defuzzifiers 
From the type-reducer we obtain an interval set $Y_{\cos}$; to defuzzify it we use the average of $y_l$ and $y_r$, so the defuzzified output of an interval singleton type-2 FLS is [78.3]

$$y(\mathbf{x}) = \frac{y_l + y_r}{2}. \tag{78.18}$$
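The chain from firing intervals (78.11)-(78.13) to the end points (78.16)-(78.17) and the crisp output (78.18) can be sketched as follows. Note the swapped-in technique: instead of the iterative Karnik-Mendel procedure usually used to find $y_l$ and $y_r$, this sketch simply tries every switch point, which yields the same end points for a small rule base at a higher cost. All function names and the sample numbers are assumptions for illustration.

```python
import math


def firing_interval(lower_grades, upper_grades):
    """Firing interval of one rule under the product t-norm, (78.12)-(78.13)."""
    return math.prod(lower_grades), math.prod(upper_grades)


def _endpoint(values, f_low, f_up, pick):
    # Sort the rules by consequent value, then try every switch point k.
    order = sorted(range(len(values)), key=lambda i: values[i])
    candidates = []
    for k in range(len(order) + 1):
        num = den = 0.0
        for rank, i in enumerate(order):
            first_part = rank < k
            # Left end point: upper firing strengths go on the small values.
            # Right end point: upper firing strengths go on the large values.
            use_upper = first_part if pick is min else not first_part
            f = f_up[i] if use_upper else f_low[i]
            num += f * values[i]
            den += f
        if den > 0.0:
            candidates.append(num / den)
    if not candidates:
        raise ValueError("no rule fired")
    return pick(candidates)


def type_reduce(y_left, y_right, f_low, f_up):
    """End points y_l and y_r of the center-of-sets type-reduced interval."""
    y_l = _endpoint(y_left, f_low, f_up, min)
    y_r = _endpoint(y_right, f_low, f_up, max)
    return y_l, y_r


def defuzzify(y_l, y_r):
    """Crisp output as the average of the two end points, (78.18)."""
    return (y_l + y_r) / 2.0


# Hypothetical two-rule example: consequent centroids [1, 2] and [3, 4].
y_l, y_r = type_reduce([1.0, 3.0], [2.0, 4.0], [0.2, 0.4], [0.6, 0.8])
crisp = defuzzify(y_l, y_r)
```

For the two hypothetical rules above the search gives the interval [1.8, 3.6] and a crisp output of 2.7.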


78.3 Bio-Inspired Optimization Methods 


In this section a brief overview of the basic concepts 
from bio-inspired optimization methods needed for this 
work is presented. 


78.3.1 Particle Swarm Optimization 


Particle swarm optimization (PSO) is a population-based stochastic optimization technique, which was developed by Eberhart and Kennedy in 1995. It was inspired by the social behavior of bird flocking or fish schooling [78.7]. PSO shares many similarities with evolutionary computation techniques such as the genetic algorithm (GA) [78.22].

The system is initialized with a population of ran- 
dom solutions and searches for optima by updating 
generations. However, unlike the GA, PSO has no 
evolution operators such as crossover and mutation. 
In PSO, the potential solutions, called particles, fly 
through the problem space by following the current op- 
timum particles [78.18]. Each particle keeps track of 
its coordinates in the problem space, which are asso- 
ciated with the best solution (fitness) it has achieved 
so far (the fitness value is also stored). This value 
is called pbest. Another best value that is tracked by the particle swarm optimizer is the best value obtained so far by any particle in the neighborhood of the particle. This location is called lbest. When a particle takes all the population as its topological neighbors, the best value is a global best and is called gbest [78.15].

The particle swarm optimization concept consists of, at each time step, changing the velocity of (accelerating) each particle toward its pbest and lbest locations (local version of PSO). Acceleration is weighted by a random term, with separate random numbers being generated for acceleration toward pbest and lbest locations [78.7]. In the past several years, PSO has been successfully applied in many research and application areas. It has been reported that PSO can obtain better results in a faster, cheaper way compared with other methods [78.15]. Another reason that PSO is attractive is that there are few parameters to adjust. One version, with slight variations, works well in a wide variety of applications. Particle swarm optimization has been considered both for approaches that can be used across a wide range of applications and for specific applications focused on a specific requirement.

The basic algorithm of PSO has the following 
nomenclature: 


$x_z^i$: Particle position
$v_z^i$: Particle velocity
$w$: Inertia weight
$p_z^i$: Best remembered individual particle position
$p_z^g$: Best remembered swarm position
$c_1, c_2$: Cognitive and social parameters
$r_1, r_2$: Random numbers between 0 and 1.


The equation to calculate the velocity is

$$v_{z+1}^i = w\, v_z^i + c_1 r_1 \left(p_z^i - x_z^i\right) + c_2 r_2 \left(p_z^g - x_z^i\right), \tag{78.19}$$

and the position of the individual particles is updated as follows

$$x_{z+1}^i = x_z^i + v_{z+1}^i. \tag{78.20}$$


The basic PSO algorithm is defined as follows: 


1) Initialize
a) Set constants $z_{\max}$, $c_1$, $c_2$
b) Randomly initialize particle positions $x_0^i \in D$ in $\mathbb{R}^n$ for i = 1, ..., p
c) Randomly initialize particle velocities $0 \le v_0^i \le v_0^{\max}$ for i = 1, ..., p
d) Set z = 1
2) Optimize
a) Evaluate function value $f_z^i$ using design space coordinates $x_z^i$
b) If $f_z^i \le f_{best}^i$ then $f_{best}^i = f_z^i$, $p_z^i = x_z^i$
c) If $f_z^i \le f_{best}^g$ then $f_{best}^g = f_z^i$, $p_z^g = x_z^i$
d) If stopping condition is satisfied then go to 3.
e) Update all particle velocities $v_z^i$ for i = 1, ..., p
f) Update all particle positions $x_z^i$ for i = 1, ..., p
g) Increment z.
h) Go to 2(a).
3) Terminate.
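The basic PSO loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the cited studies: the stopping condition is simply a fixed number of iterations, the global-best (gbest) variant is used, and the parameter defaults (`w=0.7`, `c1=c2=1.5`, 20 particles) and the sphere test function are arbitrary choices.

```python
import random


def pso(objective, dim, bounds, n_particles=20, z_max=200,
        w=0.7, c1=1.5, c2=1.5, v_max=1.0, seed=0):
    """Minimize `objective` with the global-best PSO of (78.19)-(78.20)."""
    rng = random.Random(seed)
    lo, hi = bounds
    # 1) Initialize positions in the design space and velocities in [0, v_max].
    x = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    v = [[rng.uniform(0.0, v_max) for _ in range(dim)] for _ in range(n_particles)]
    p_best = [xi[:] for xi in x]
    f_best = [float("inf")] * n_particles
    g_best, fg_best = x[0][:], float("inf")
    for _ in range(z_max):
        # 2a-2c) Evaluate every particle and update personal and swarm bests.
        for i in range(n_particles):
            f = objective(x[i])
            if f <= f_best[i]:
                f_best[i], p_best[i] = f, x[i][:]
            if f <= fg_best:
                fg_best, g_best = f, x[i][:]
        # 2e-2f) Velocity update (78.19) followed by position update (78.20).
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                v[i][d] = (w * v[i][d]
                           + c1 * r1 * (p_best[i][d] - x[i][d])
                           + c2 * r2 * (g_best[d] - x[i][d]))
                x[i][d] += v[i][d]
    return g_best, fg_best


def sphere(x):
    """Simple convex test function with its minimum 0 at the origin."""
    return sum(xi * xi for xi in x)


best_x, best_f = pso(sphere, dim=2, bounds=(-5.0, 5.0))
```

To tune a fuzzy controller, `objective` would instead decode a particle into membership function parameters and return a control error measure.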


78.3.2 Genetic Algorithms 


Genetic algorithms (GAs) are adaptive heuristic search 
algorithms based on the evolutionary ideas of natu- 
ral selection and genetic processes [78.21]. The basic 
principles of GAs were first proposed by Holland in 
1975, inspired by the mechanism of natural selection, 
where stronger individuals are likely to be the winners 
in a competing environment [78.22]. GA assumes that 
the potential solution of any problem is an individual 
and can be represented by a set of parameters. These pa- 
rameters are regarded as the genes of a chromosome and 
can be structured by a string of values in binary form. 
A positive value, generally known as a fitness value, is used to reflect the degree of goodness of the chromosome for the problem, which would be closely related to its objective value. The pseudocode of a GA is as follows:


1) Start with a randomly generated population of n chromosomes (candidate solutions to a problem).

2) Calculate the fitness of each chromosome in the population.

3) Repeat the following steps until n offspring have been created:

a) Select a pair of parent chromosomes from the current population, the probability of selection being an increasing function of fitness. Selection is done with replacement, meaning that the same chromosome can be selected more than once to become a parent.

b) With probability $p_c$ (crossover rate), perform crossover on the pair at a randomly chosen point to form two offspring.

c) Mutate the two offspring at each locus with probability $p_m$ (mutation rate), and place the resulting chromosomes in the new population.

4) Replace the current population with the new population.

5) Go to step 2.


The simple procedure just described above is the 
basis for most applications of GAs found in the liter- 
ature [78.23, 24]. 
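The simple GA described above can be sketched as follows. This is a hedged illustration rather than the procedure of any cited paper: chromosomes are plain bit lists, selection is fitness-proportional, the best chromosome seen so far is remembered, and the one-max fitness function (shifted by one so that it stays positive) as well as all parameter defaults are assumptions.

```python
import random


def genetic_algorithm(fitness, n_bits, pop_size=30, generations=60,
                      p_c=0.7, p_m=0.01, seed=0):
    """Simple GA on bit strings; returns the best chromosome found."""
    rng = random.Random(seed)
    # Step 1: random initial population of bit-string chromosomes.
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)[:]
    for _ in range(generations):
        # Step 2: fitness of every chromosome (must be positive).
        weights = [fitness(c) for c in population]
        new_population = []
        # Step 3: create pop_size offspring.
        while len(new_population) < pop_size:
            # 3a: fitness-proportional selection with replacement.
            mother, father = rng.choices(population, weights=weights, k=2)
            child_a, child_b = mother[:], father[:]
            # 3b: single-point crossover with probability p_c.
            if rng.random() < p_c:
                point = rng.randint(1, n_bits - 1)
                child_a = mother[:point] + father[point:]
                child_b = father[:point] + mother[point:]
            # 3c: bit-flip mutation at each locus with probability p_m.
            for child in (child_a, child_b):
                for locus in range(n_bits):
                    if rng.random() < p_m:
                        child[locus] = 1 - child[locus]
                new_population.append(child)
        # Step 4: replace the population (trimmed if pop_size is odd).
        population = new_population[:pop_size]
        generation_best = max(population, key=fitness)
        if fitness(generation_best) > fitness(best):
            best = generation_best[:]
    return best


def one_max(chromosome):
    """Hypothetical fitness: number of ones, shifted to stay positive."""
    return 1 + sum(chromosome)


best = genetic_algorithm(one_max, n_bits=20)
```

Remembering the best chromosome is an addition to the pseudocode; without it, a good solution can be lost when the population is replaced.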


78.3.3 Ant Colony Optimization 


Ant colony optimization (ACO) is a probabilistic technique that can be used for solving problems that can be reduced to finding good paths along graphs. This method was inspired by the behavior exhibited by ants in finding paths from the nest or colony to the food source.

Simple ant colony optimization (S-ACO) is an al- 
gorithmic implementation that adapts the behavior of 
real ants to solutions of minimum cost path problems 
on graphs [78.11]. A number of artificial ants build 
solutions for a certain optimization problem and ex- 
change information about the quality of these solutions 
making allusion to the communication system of real 
ants [78.25]. 

Let us define the graph G = (V, E), where V is the set of nodes and E is the matrix of the links between nodes. G has $n_G = |V|$ nodes. Let us define $L^k$ as the number of hops in the path built by the ant k from the origin node to the destination node. Therefore, it is necessary to find

$$Q = \{q_a, \ldots, q_f \mid q_i \in C\}, \tag{78.21}$$

where Q is the set of nodes representing a continuous path with no obstacles; $q_a, \ldots, q_f$ are former nodes of the path, and C is the set of possible configurations of the free space. If $x^k(t)$ denotes a Q solution in time t, $f(x^k(t))$ expresses the quality of the solution. The S-ACO algorithm is based on (78.22)-(78.24)

$$p_{ij}^k(t) = \begin{cases} \dfrac{\tau_{ij}^{\alpha}(t)}{\sum_{j \in N_i^k} \tau_{ij}^{\alpha}(t)} & \text{if } j \in N_i^k \\ 0 & \text{if } j \notin N_i^k \end{cases} \tag{78.22}$$

$$\tau_{ij}(t) \leftarrow (1 - \rho)\, \tau_{ij}(t), \tag{78.23}$$

$$\tau_{ij}(t+1) = \tau_{ij}(t) + \sum_{k=1}^{n_k} \Delta\tau_{ij}^k(t). \tag{78.24}$$

Equation (78.22) represents the probability that an ant k located on a node i selects the next node, denoted by j, where $N_i^k$ is the set of feasible nodes (in a neighborhood) connected to node i with respect to ant k, $\tau_{ij}$ is the total pheromone concentration of link ij, and $\alpha$ is a positive constant used as a gain for the pheromone influence.

Equation (78.23) represents the evaporation pheromone update, where $\rho \in [0, 1]$ is the evaporation rate value of the pheromone trail. The evaporation is added to the algorithm in order to force the exploration of the ants and avoid premature convergence to sub-optimal solutions. For $\rho = 1$ the search becomes completely random.

Equation (78.24) represents the concentration pheromone update, where $\Delta\tau_{ij}^k$ is the amount of pheromone that an ant k deposits in a link ij in a time t.

The general steps of S-ACO are the following: 


1. Set a pheromone concentration $\tau_{ij}$ to each link (i, j).

2. Place a number $k = 1, 2, \ldots, n_k$ of ants in the nest.

3. Iteratively build a path to the food source (destination node), using (78.22) for every ant.

• Remove cycles and compute each route weight $f(x^k(t))$. A cycle could be generated when there are no feasible candidate nodes, that is, for any i and any k, $N_i^k = \emptyset$; then the predecessor of that node is included as a former node of the path.

4. Apply evaporation using (78.23).

5. Update the pheromone concentration using (78.24).

6. Finally, finish the algorithm in any of three different ways:

• When a maximum number of epochs has been reached.
• When it has found an acceptable solution, with $f(x^k(t)) < \varepsilon$.
• When all ants follow the same path.
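The S-ACO steps above can be sketched on a small graph as follows. Several details are assumptions made for illustration because the text leaves them open: the graph is a dictionary of neighbor lists, pheromone is stored per directed link, each ant deposits $\Delta\tau_{ij}^k = 1/L^k$ on the links of its cycle-free path, and only the maximum-epoch stopping rule is implemented.

```python
import random


def remove_cycles(path):
    """Drop every loop so that each node appears at most once."""
    out, position = [], {}
    for node in path:
        if node in position:
            cut = position[node]
            for dropped in out[cut + 1:]:
                del position[dropped]
            out = out[:cut + 1]
        else:
            position[node] = len(out)
            out.append(node)
    return out


def s_aco(graph, source, target, n_ants=10, epochs=50,
          alpha=1.0, rho=0.1, max_steps=1000, seed=0):
    """Return the shortest cycle-free path found and the pheromone table."""
    if source == target:
        raise ValueError("source and target must differ")
    rng = random.Random(seed)
    # Step 1: initial pheromone on every directed link (i, j).
    tau = {(i, j): 1.0 for i in graph for j in graph[i]}
    best = None
    for _epoch in range(epochs):
        paths = []
        for _ant in range(n_ants):          # steps 2-3: every ant leaves the nest
            node, walk = source, [source]
            while node != target and len(walk) <= max_steps:
                neighbours = graph[node]
                weights = [tau[(node, j)] ** alpha for j in neighbours]  # (78.22)
                node = rng.choices(neighbours, weights=weights, k=1)[0]
                walk.append(node)
            if node == target:
                paths.append(remove_cycles(walk))
        for link in tau:                    # step 4: evaporation (78.23)
            tau[link] *= (1.0 - rho)
        for path in paths:                  # step 5: deposit (78.24)
            deposit = 1.0 / (len(path) - 1)
            for i, j in zip(path, path[1:]):
                tau[(i, j)] += deposit
            if best is None or len(path) < len(best):
                best = path
    return best, tau


# Hypothetical graph with a short route (via "a") and a long one (via "b", "c").
graph = {
    "nest": ["a", "b"],
    "a": ["nest", "food"],
    "b": ["nest", "c"],
    "c": ["b", "food"],
    "food": ["a", "c"],
}
best_path, pheromone = s_aco(graph, "nest", "food")
```

Depositing an amount inversely proportional to the path length is what biases later ants toward shorter routes; a constant deposit would reinforce all completed paths equally.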


78.3.4 General Remarks About 
Optimization of Type-2 Fuzzy 
Systems 


The problem of designing type-2 fuzzy systems can 
be solved with any of the above-mentioned optimiza- 
tion methods. The main issue in any of these methods 
is to decide on the appropriate representation of the 
type-2 fuzzy system in the corresponding optimization 
paradigm. For example, in the case of GAs, the type-2 
fuzzy systems must be represented in the chromosomes. 
On the other hand, in PSO the fuzzy system is repre- 
sented as a particle in the optimization process. In the 
ACO method, the fuzzy system can be represented as 
one of the paths that the ants can follow in a graph. 
Also, the evaluation of the fuzzy system must be rep- 
resented as an objective function in any of the methods. 
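As one possible illustration of such a representation, the sketch below decodes a flat vector (a chromosome in a GA or a particle in PSO) into the parameters of interval type-2 triangular membership functions. The encoding is an assumption made here for illustration and is not taken from the cited works: each fuzzy set uses four genes (a, b, c, d), where (a, b, c) is a triangle and d blurs its feet to form the upper and lower membership functions, with a + d < b < c - d.

```python
def triangle(x, a, b, c):
    """Type-1 triangular membership grade with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


def decode(vector, genes_per_set=4):
    """Split a flat chromosome/particle into one (a, b, c, d) tuple per fuzzy set."""
    if len(vector) % genes_per_set:
        raise ValueError("vector length must be a multiple of genes_per_set")
    return [tuple(vector[i:i + genes_per_set])
            for i in range(0, len(vector), genes_per_set)]


def membership_interval(x, params):
    """Lower and upper grades of an interval type-2 triangle blurred by d."""
    a, b, c, d = params
    upper = triangle(x, a - d, b, c + d)   # wider triangle: upper MF
    lower = triangle(x, a + d, b, c - d)   # narrower triangle: lower MF
    return lower, upper


# Hypothetical particle that encodes two fuzzy sets.
fuzzy_sets = decode([0.0, 1.0, 2.0, 0.2, 1.0, 2.0, 3.0, 0.1])
lower, upper = membership_interval(0.5, fuzzy_sets[0])
```

An objective function would decode the vector in this way, run the resulting controller on the plant, and return an error measure for the optimizer to minimize.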


78.4 General Overview of the Area and Future Trends 


In this section, a general overview of the area of type-2 
fuzzy system optimization is presented. Also, possible 
future trends that we can envision based on the review 
of this area are presented. It has been well known for 
a long time that to design fuzzy systems is a difficult 
task, and this is especially true in the case of type-2 
fuzzy systems [78.4]. The use of GAs, ACO, and PSO 
in designing type-1 fuzzy systems has become stan- 
dard practice for automatically designing this sort of 
system [78.7, 8, 23,25]. This trend has also continued 
to the type-2 fuzzy systems area, which has been ac- 
counted for with the review of papers presented in the 
previous sections. In the case of designing type-2 fuzzy 
systems the problem is more complicated due to the higher number of parameters to consider, making it very important to use bio-inspired optimization techniques to achieve the optimal designs of this sort of system. In this section, a summary of the total number
of papers published in the area of type-2 fuzzy sys- 
tem optimization is presented, so that the increasing 
trend occurring in this area can be better appreciated. 
Also, the distribution of papers according to the opti- 
mization technique used is presented, so that a general 


idea of how these different techniques contribute to the 
automatic design of optimal type-2 fuzzy systems is 
obtained. 

Figure 78.6 shows the distribution of the papers published on the optimization of type-2 fuzzy systems according to the different bio-inspired optimization techniques previously mentioned. From Fig. 78.6 it can be noted that the use of GAs has been decreasing recently. On the other hand, the use of PSO, ACO, and other methods has been increasing. The reason for the increase in use of PSO and ACO may be due to recent work in which either PSO or ACO has been able to outperform GAs for different applications. Regarding the question of which method would be the most appropriate for optimizing type-2 fuzzy systems, there is no easy answer. At the moment, what we can be sure of is that the techniques mentioned in this paper, and probably newer ones that may appear in the future, would certainly be tested in the optimization of type-2 fuzzy systems because the problem of automatically designing these types of systems is complex enough to require their use.

Fig. 78.6 Distribution of publications per area and year


There are other bio-inspired or nature-inspired techniques that at the moment have not been applied to the optimization of type-2 fuzzy systems that may be worth mentioning. For example, membrane computing, harmony computing, electromagnetism-based computing, and other similar approaches have not been applied (to date) in the optimization of type-2 fuzzy systems. It is expected that these approaches and similar ones could be applied in the near future in the area of type-2 fuzzy system optimization. Of course, as new bio-inspired and nature-inspired optimization methods are continuously being proposed in this fruitful area of research, it is expected that newer optimization techniques will also be tried in the near future in the automatic design of optimal type-2 fuzzy systems.


78.5 Conclusions


In this chapter we have presented a representative account of the different optimization methods that have been applied in the optimal design of type-2 fuzzy systems. To date, genetic algorithms have been used more frequently to optimize type-2 fuzzy systems. However, more recently PSO and ACO have attracted more attention and have also been applied with some degree of success to the problem of the optimal design of type-2 fuzzy systems. There have been also other optimization methods applied to the optimization of type-2 fuzzy systems, like artificial immune systems and the chemical optimization paradigm. At this time, it would be very difficult to declare one of these optimization techniques as the best for optimizing type-2 fuzzy systems, because different techniques have had success in different applications of type-2 fuzzy logic. In any case, the need for bio-inspired optimization methods is justified due to the complexity of designing type-2 fuzzy systems.


References


78.1 J.M. Mendel: Uncertainty, fuzzy logic, and signal processing, Signal Process. J. 80, 913-933 (2000)

78.2 L.A. Zadeh: The concept of a linguistic variable and its application to approximate reasoning, Inf. Sci. 8, 43-80 (1975)

78.3 N.N. Karnik, J.M. Mendel: An Introduction to Type-2 Fuzzy Logic Systems, Technical Report (University of Southern California, Los Angeles 1998)

78.4 O. Castillo, P. Melin, A. Alanis, O. Montiel, R. Sepulveda: Optimization of interval type-2 fuzzy logic controllers using evolutionary algorithms, J. Soft Comput. 15(6), 1145-1160 (2011)

78.5 R. Sepulveda, O. Montiel, O. Castillo, P. Melin: Embedding a high speed interval type-2 fuzzy controller for a real plant into an FPGA, Appl. Soft Comput. 12(3), 988-998 (2012)

78.6 R.R. Yager: Fuzzy subsets of type II in decisions, J. Cybern. 10, 137-159 (1980)



78.7 Z. Bingül, O. Karahan: A fuzzy logic controller tuned with PSO for 2 DOF robot trajectory control, Expert Syst. Appl. 38(1), 1017-1031 (2011)

78.8 J. Cao, P. Li, H. Liu, D. Brown: Adaptive fuzzy controller for vehicle active suspensions with particle swarm optimization, Proc. SPIE Int. Soc. Opt. Eng., Vol. 7129 (2008)

78.9 J.R. Castro, O. Castillo, P. Melin: An interval type-2 fuzzy logic toolbox for control applications, Proc. FUZZ-IEEE (2007) pp. 1-6

78.10 T. Dereli, A. Baykasoglu, K. Altun, A. Durmusoglu, I.B. Turksen: Industrial applications of type-2 fuzzy sets and systems: A concise review, Comput. Ind. 62, 125-137 (2011)

78.11 C.-F. Juang, C.-H. Hsu: Reinforcement ant optimized fuzzy controller for mobile-robot wall-following control, IEEE Trans. Ind. Electron. 56(10), 3931-3940 (2009)

78.12 O. Castillo, G. Huesca, F. Valdez: Evolutionary computing for topology optimization of type-2 fuzzy controllers, Stud. Fuzziness Soft Comput. 208, 163-178 (2008)

78.13 O. Castillo, L.T. Aguilar, N.R. Cazarez-Castro, S. Cardenas: Systematic design of a stable type-2 fuzzy logic controller, Appl. Soft Comput. J. 8, 1274-1279 (2008)

78.14 R. Martinez, O. Castillo, L.T. Aguilar: Optimization of interval type-2 fuzzy logic controllers for a perturbed autonomous wheeled mobile robot using genetic algorithms, Inf. Sci. 179(13), 2158-2174 (2009)

78.15 S.-K. Oh, H.-J. Jang, W. Pedrycz: A comparative experimental study of type-1/type-2 fuzzy cascade controller based on genetic algorithms and particle swarm optimization, Expert Syst. Appl. 38(9), 11217-11229 (2011)

78.16 R. Sepulveda, O. Castillo, P. Melin, A. Rodriguez-Diaz, O. Montiel: Experimental study of intelligent controllers under uncertainty using type-1 and type-2 fuzzy logic, Inf. Sci. 177(10), 2023-2048 (2007)

78.17 H. Hagras: Hierarchical type-2 fuzzy logic control architecture for autonomous mobile robots, IEEE Trans. Fuzzy Syst. 12, 524-539 (2004)

78.18 R. Martinez, A. Rodriguez, O. Castillo, L.T. Aguilar: Type-2 fuzzy logic controllers optimization using genetic algorithms and particle swarm optimization, Proc. IEEE Int. Conf. Granul. Comput. (2010) pp. 724-727

78.19 S.M.A. Mohammadi, A.A. Gharaveisi, M. Mashinchi: An evolutionary tuning technique for type-2 fuzzy logic controller in a non-linear system under uncertainty, Proc. 18th Iran. Conf. Electr. Eng. (2010) pp. 610-616

78.20 J.R. Castro, O. Castillo, L.G. Martinez: Interval type-2 fuzzy logic toolbox, Eng. Lett. 15(1), 14 (2007)

78.21 O. Cordon, F. Gomide, F. Herrera, F. Hoffmann, L. Magdalena: Ten years of genetic fuzzy systems: Current framework and new trends, Fuzzy Sets Syst. 141, 5-31 (2004)

78.22 O. Cordon, F. Herrera, P. Villar: Analysis and guidelines to obtain a good uniform fuzzy partition granularity for fuzzy rule-based systems using simulated annealing, Int. J. Approx. Reason. 25, 187-215 (2000)

78.23 C. Wagner, H. Hagras: A genetic algorithm based architecture for evolving type-2 fuzzy logic controllers for real world autonomous mobile robots, Proc. IEEE Conf. Fuzzy Syst. (2007)

78.24 D. Wu, W.-W. Tan: Genetic learning and performance evaluation of interval type-2 fuzzy logic controllers, Eng. Appl. Artif. Intell. 19(8), 829-841 (2006)

78.25 C.-F. Juang, C.-H. Hsu: Reinforcement interval type-2 fuzzy controller design by online rule generation and Q-value-aided ant colony optimization, IEEE Trans. Syst. Man Cybern. B 39(6), 1528-1542 (2009)




79. Pattern Recognition with Modular Neural Networks and Type-2 Fuzzy Logic


Patricia Melin


Interval type-2 fuzzy systems can be of great help in image analysis and pattern recognition applications. In particular, edge detection is a process usually applied to image sets before the training phase in recognition systems. This preprocessing step helps to extract the most important shapes in an image, ignoring the homogeneous regions and remarking the real objective to classify or recognize. Many traditional and fuzzy edge detectors can be used, but it is very difficult to demonstrate which one is better before the recognition results are obtained. In this chapter, we show experimental results where several edge detectors were used to preprocess the same image sets. Each resulting image set was used as training data for a modular neural network recognition system, and the recognition rates were compared. The goal of these experiments is to find the better edge detector that can be used to improve the training data of a modular neural network for an image recognition system.


79.1 Related Work in the Area


In previous work, we have proposed extensions to the traditional edge detectors to improve their performance by using fuzzy systems [79.1-3]. The performed experiments have shown that the resulting images obtained with fuzzy edge detectors were visually better than the ones obtained with the traditional edge detection methods.

There is still work to be done on developing formal validation metrics for fuzzy edge detectors. In the literature, we can find comparisons of edge detectors based on human observations [79.4-8], and some others that found the optimal values for parametric edge detectors [79.9].

Edge detectors can be used in recognition systems for different purposes, but in this work we are particularly interested in knowing which is the best edge detector for a neural recognition system. In this chapter, we present some experiments which show that fuzzy edge detectors are a good method to improve the performance of neural recognition systems, and for this reason we propose that the recognition rate of the neural networks can be used as an edge detection performance index.

The rest of the chapter is organized as follows. Section 79.2 presents an overview of fuzzy edge detectors. Section 79.3 describes the experimental setup used to test the proposed fuzzy edge detectors in a modular neural recognition system. Section 79.4 presents the experimental results achieved with the proposed fuzzy edge detectors. Finally, Sect. 79.5 outlines the conclusions and future work.

79.2 Overview of Fuzzy Edge Detectors


In this section, an overview of the previously proposed
fuzzy edge detectors is presented. First, the Sobel
edge detector improved with fuzzy logic is presented.
Second, the morphological gradient edge detector
enhanced with fuzzy logic is presented.


79.2.1 Sobel Edge Detector Improved
with Fuzzy Logic


In the Sobel fuzzy edge detector we used the individual
operators Sobel_x and Sobel_y, as in the traditional method,
and then we substituted the Euclidean distance of (79.1)
by a fuzzy system, as shown in Fig. 79.1 [79.3].

$$\mathrm{Sobel\_edges} = \sqrt{\mathrm{Sobel}_x^2 + \mathrm{Sobel}_y^2} \tag{79.1}$$

Fig. 79.1 Sobel edge detector enhanced with fuzzy logic

Fig. 79.2 Membership functions of the variables for the
Sobel+FIS1 edge detector

Fig. 79.3 Membership functions of the variables for the
Sobel+FIS2 edge detector
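The traditional Sobel magnitude of (79.1), which the fuzzy system replaces, can be sketched in plain Python as follows. This is only an illustration: the 3x3 kernels are the standard Sobel masks, while the function names and the list-of-lists image layout are assumptions made for this example and are not taken from the chapter.

```python
import math

# Standard 3x3 Sobel masks for the horizontal and vertical derivatives.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def apply_kernel(image, row, col, kernel):
    """Weighted sum of the 3x3 neighborhood centred on (row, col)."""
    return sum(
        kernel[i][j] * image[row + i - 1][col + j - 1]
        for i in range(3)
        for j in range(3)
    )


def sobel_edges(image):
    """Gradient magnitude of (79.1) for interior pixels; borders stay 0."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = apply_kernel(image, r, c, SOBEL_X)
            gy = apply_kernel(image, r, c, SOBEL_Y)
            out[r][c] = math.hypot(gx, gy)  # sqrt(gx**2 + gy**2)
    return out


# Usage: a vertical step edge gives a purely horizontal gradient.
step_image = [[0, 0, 10, 10] for _ in range(4)]
edges = sobel_edges(step_image)
```

In the fuzzy variants, the two directional responses computed here (gx and gy) would be fed to the inference system instead of being combined with `math.hypot`.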


The individual Sobel operators are the main inputs
to the type-1 fuzzy inference system (FIS1) and the type-2
fuzzy inference system (FIS2), and we also considered
adding two more inputs, which are filters that
improve the final edge image. The fuzzy variables used
in the Sobel+FIS1 and Sobel+FIS2 edge detectors
are shown in Fig. 79.2 and Fig. 79.3, respectively.

The use of the FIS2 [79.10, 11] provided images
with better defined edges than the FIS1, which is an
important result because it provides better inputs to the
neural networks that will perform the recognition task.

The fuzzy rules for both the FIS1 and FIS2 are the
same and are shown below:


1. If (dh is LOW) and (dv is LOW) then (y1 is HIGH)
2. If (dh is MIDDLE) and (dv is MIDDLE) then (y1 is LOW)
3. If (dh is HIGH) and (dv is HIGH) then (y1 is LOW)
4. If (dh is MIDDLE) and (hp is LOW) then (y1 is LOW)
5. If (dv is MIDDLE) and (hp is LOW) then (y1 is LOW)
6. If (m is LOW) and (dv is MIDDLE) then (y1 is HIGH)
7. If (m is LOW) and (dh is MIDDLE) then (y1 is HIGH)


The fuzzy rule base shown above infers the gray
tone of each pixel of the edge image with the following
reasoning: when the horizontal gradient dh and the vertical
gradient dv are both LOW, there is not enough difference
between the gray tones of the neighboring pixels, so the
output pixel belongs to a homogeneous (non-edge) region
and is HIGH, or near WHITE. In the opposite case, when
dh and dv are both HIGH, there is enough difference
between the gray tones in the neighborhood, and the
output pixel is an EDGE.


79.2.2 Morphological Gradient Edge 
Detector Improved with Fuzzy Logic 


In the morphological gradient (MG), we calculated 
the four gradients as in the traditional method [79.12, 
13], and substitute the sum of gradients in (79.2) with 
a fuzzy inference system, as shown in Fig. 79.4. 


$$\mathrm{MG\_edges} = D_1 + D_2 + D_3 + D_4 \tag{79.2}$$


The linguistic variables used in the MG+FIS1 and
MG+FIS2 edge detectors are shown in Fig. 79.5 and
Fig. 79.6, respectively.

The rules for both the FIS1 and FIS2 are the same 
and are shown below: 


1. If (D1 is HIGH) or (D2 is HIGH) or (D3 is HIGH) 
or (D4 is HIGH) then (E is BLACK) 

2. If (D1 is MIDDLE) or (D2 is MIDDLE) or (D3 is 
MIDDLE) or (D4 is MIDDLE) then (E is GRAY) 

3. If (D1 is LOW) and (D2 is LOW) and (D3 is LOW) 
and (D4 is LOW) then (E is WHITE) 
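The three rules above can be illustrated with a small type-1 sketch in Python. Everything numeric here is an assumption made for the example: the triangular/ramp membership functions on a 0 to 255 gradient scale, the use of max for "or" and min for "and", and the weighted average over representative gray levels (0, 128, 255) that stands in for a full Mamdani defuzzification. The chapter's actual membership functions are those of Fig. 79.5 and Fig. 79.6.

```python
def low(x):
    """Hypothetical LOW set: 1 at gradient 0, falling to 0 at 128."""
    return max(0.0, 1.0 - x / 128.0)


def middle(x):
    """Hypothetical MIDDLE set: triangle centred on 128."""
    return max(0.0, 1.0 - abs(x - 128.0) / 128.0)


def high(x):
    """Hypothetical HIGH set: 0 up to 128, rising to 1 at 255."""
    return max(0.0, min(1.0, (x - 128.0) / 127.0))


def mg_fuzzy_pixel(d1, d2, d3, d4):
    """Infer the output gray tone E from the four gradients D1..D4."""
    grads = (d1, d2, d3, d4)
    black = max(high(d) for d in grads)    # rule 1: OR of HIGH gradients
    gray = max(middle(d) for d in grads)   # rule 2: OR of MIDDLE gradients
    white = min(low(d) for d in grads)     # rule 3: AND of LOW gradients
    total = black + gray + white
    if total == 0.0:
        return 255.0  # no rule fired: treat as homogeneous region
    # Simplified defuzzification: weighted mean of BLACK=0, GRAY=128, WHITE=255.
    return (black * 0.0 + gray * 128.0 + white * 255.0) / total


# Usage: a flat neighborhood maps to WHITE, a strong gradient to near BLACK.
flat_pixel = mg_fuzzy_pixel(0, 0, 0, 0)
edge_pixel = mg_fuzzy_pixel(255, 0, 0, 0)
```

An interval type-2 version would replace each membership value with a lower and upper bound; that is not attempted here.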


After many experiments, we found that an edge exists
when any gradient D_i is HIGH, which means that
a difference of gray tones in any direction of the image
must produce a pixel with a BLACK value, or EDGE.
The same behavior occurs when any gradient D_i is
MIDDLE, which means that even when the differences
in the gray tones are not maximal, the pixel is an EDGE.
The only rule that finds a non-edge pixel is therefore
rule 3: only when all the gradients are LOW is the
output pixel WHITE, which means a pixel belonging
to a homogeneous region.

Fig. 79.4 Morphological gradient edge detector enhanced
with fuzzy systems

Fig. 79.5 Membership functions of the variables for the
MG+FIS1 edge detector

Fig. 79.6 Membership functions of the variables for the
MG+FIS2 edge detector


79.3 Experimental Setup


The experiment consists of applying a neural recognition
system using each of the previously presented edge
detectors (Sobel, Sobel+FIS1, Sobel+FIS2, morphological
gradient (MG), MG+FIS1, and MG+FIS2) and then
comparing the results.


79.3.1 General Algorithm used
for the Experiments


1. Define the database folder.
2. Define the edge detector.
3. Detect the edges of each image as a vector and store
   it as a column in a matrix.


4. Calculate the recognition rate using the k-fold
   cross-validation method.
   a) Calculate the indices for the training and test
      k-folds.
   b) Train the neural network k times, once on each
      training set of k − 1 folds calculated previously.
   c) Test the neural network k times, once on each
      test fold calculated previously.
5. Calculate the mean rate for all the k folds.
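Steps 4 and 5 can be sketched as follows. The sketch assumes that every fold receives the same number of samples from every person (so that the number of samples per person is divisible by k), and the `train_and_test` callback is a hypothetical placeholder for training the modular neural network on the training indices and returning its recognition rate on the test fold.

```python
def kfold_indices(persons, samples, k):
    """Return k folds of (person, sample) index pairs, samples // k per person."""
    if samples % k != 0:
        raise ValueError("samples per person must be divisible by k")
    per_fold = samples // k
    folds = []
    for f in range(k):
        fold = [
            (p, s)
            for p in range(persons)
            for s in range(f * per_fold, (f + 1) * per_fold)
        ]
        folds.append(fold)
    return folds


def cross_validate(folds, train_and_test):
    """Train on k - 1 folds, test on the remaining one, and average the rates."""
    rates = []
    for i, test_fold in enumerate(folds):
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        rates.append(train_and_test(train, test_fold))
    return sum(rates) / len(rates)


# Usage with 40 persons, 10 samples each and k = 5 (the ORL configuration).
folds = kfold_indices(persons=40, samples=10, k=5)
# Dummy callback that only reports the training set size.
mean_train_size = cross_validate(folds, lambda train, test: len(train))
```

With these values each fold holds 80 images and each training set 320, which can be checked against the fold sizes reported for the ORL database.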


79.3.2 Parameters for the Image Databases


The experiments can be performed with benchmark
image databases used for identification purposes. This
is the case for face recognition applications, so we
used three of the most popular benchmark sets of
images: the ORL face database [79.14], the Cropped
Yale face database [79.15, 16], and the FERET face
database [79.17].

For the three databases, we defined the variable p
as the number of persons and s as the number of samples
for each person. The tests were made with the k-fold
cross-validation method, with k = 5 for the three databases.
We can generalize the calculation of the fold size m, or
number of samples in each fold, by dividing the number
of samples for each person s by the number of folds k and
then multiplying the result by the number of persons p
(79.3). The training set size i (79.4) is then the number of
samples in k − 1 folds, and the test set size t (79.5) is
the number of samples in a single fold.

$$m = (s/k)\,p \tag{79.3}$$
$$i = m(k-1) \tag{79.4}$$
$$t = m \tag{79.5}$$

Ten samples per person were used for the ORL and
Yale databases; with five folds, each fold holds 2 samples
per person, so 8 samples per person are used for training
and 2 for testing. For the experiments with the FERET
face database, we used only the samples of the 74 persons
who have 4 frontal sample images. The particular
information for each database is shown in Table 79.1.

Table 79.1 Particular information for the tested benchmark face databases

Database      Person      Samples     Fold      Training      Test
              number (p)  number (s)  size (m)  set size (i)  set size (t)
ORL           40          10          80        320           80
Cropped Yale  38          10          76        304           76
FERET         74          4           74        222           74


79.3.3 The Modular Neural Network


In previous experiments with neural networks for image
recognition, we found a general structure with acceptable
performance, even if it is not optimized. We
used the same structure for the multinet modular neural
networks, in order to establish a standard of comparison
for all the experiments [79.3, 18–23]. The general
structure for the monolithic neural network is indicated
below:

• Two hidden layers with 200 neurons.
• Learning algorithm: gradient descent with momentum
  and adaptive learning rate backpropagation.
• Error goal: 0.0001.


79.4 Experimental Results


In this section, we show the numerical results of the
experiments. Table 79.2 contains the results for the ORL
face database, Table 79.3 contains the results for the
Cropped Yale database, and Table 79.4 contains the
results for the FERET face database.

For a better appreciation of the results, we made
plots of the values presented in Tables 79.2–79.4. Although
this work does not aim to compare the edge detectors
using training time as a performance index, it is
interesting to note the time needed to reach the error goal
in each experiment.

Table 79.2 Recognition rates for the ORL database of faces

Training set           Mean      Mean      Standard   Max
preprocessing method   time (s)  rate (%)  deviation  rate (%)
MG+FIS1                1.2694    89.25     4.47       95.00
MG+FIS2                1.2694    90.25     5.48       97.50
Sobel+FIS1             1.2694    87.25     3.69       91.25
Sobel+FIS2             1.2694    90.75     4.29       95.00

As we can see in Fig. 79.7, the lowest training
times are for the morphological gradient+FIS2 edge
detector and the Sobel+FIS2 edge detector. That is because
both edge detectors were improved with interval type-2
fuzzy systems and produce images with more homogeneous
areas, which means a high frequency of pixels
near the WHITE linguistic value.

Table 79.3 Recognition rates for the Cropped Yale database of faces

Training set           Mean      Mean      Standard   Max
preprocessing method   time (s)  rate (%)  deviation  rate (%)
MG+FIS1                1.76      68.42     29.11      100
MG+FIS2                1.07      88.16     21.09      100
Sobel+FIS1             1.17      79.47     26.33      100
Sobel+FIS2             1.1321    90        22.36      100

Table 79.4 Recognition rates for the FERET database of faces

Training set           Mean          Mean      Standard   Max
preprocessing method   time (s)      rate (%)  deviation  rate (%)
MG+FIS1                1.17          75.34     5.45       [illegible]
MG+FIS2                1.17          72.30     6.85       82.43
Sobel+FIS1             1.17          82        0.68       83.78
Sobel+FIS2             [illegible]   84.46     3.22       87.84

However, the main advantage of the interval type-2
edge detectors is the recognition rates plotted in
Fig. 79.8, where we can notice that the best mean
performance of the neural network was achieved when
it was trained with the data sets obtained with the
MG+FIS2 and Sobel+FIS2 edge detectors.

Figure 79.9 shows that the maximum recognition rates
are also better for the edge detectors improved with
interval type-2 fuzzy systems. The maximum recognition
rate may not be the best parameter for comparing the
performance of the neural networks depending on the
training set, but it is interesting to note that the
maximum recognition rate of 97.5% was achieved when
the neural network was trained with the ORL data set
preprocessed with MG+FIS2. This is important because
in a real-world system we can use this as the best
configuration for image recognition, expecting to obtain
good results.

Fig. 79.7 Training time for the compared edge detectors tested with
the ORL, Cropped Yale and FERET face databases

Fig. 79.8 Mean recognition rates for the compared edge
detectors with the ORL, Cropped Yale and FERET face databases

Fig. 79.9 Maximum recognition rates for the compared
edge detectors with the ORL, Cropped Yale and FERET face
databases
79.5 Conclusions


This chapter is one of the first efforts to develop a com- 
parison method for edge detectors as a function of their 
performance in different types of recognition systems. 
In this chapter, we show that Sobel and Morphologi- 
80. Fuzzy Controllers for Autonomous Mobile Robots


Patricia Melin, Oscar Castillo 


This chapter addresses the tracking problem for 
the dynamic model of a unicycle mobile robot. 
A novel optimization method inspired from the 
chemical reactions is applied to solve this motion 
problem by integrating a kinematic and a torque 
controller based on fuzzy logic theory. Computer 
simulations are presented confirming that this 
optimization paradigm is able to outperform other 
optimization techniques applied to this particular 
robot application. 
80.1 Fuzzy Control of Mobile Robots


Optimization is an activity carried out in almost every
aspect of our life, from planning the best route
home from work to more sophisticated
approximations on the stock market, or the parameter
optimization for a wave soldering process used in the
manufacture of a printed circuit board assembly.
Optimization theory has gained importance over the last
decades. From science to applied engineering (to name
a few), there is always something to optimize and, of
course, more than one way to do it.

In a generic definition, we may say that optimiza- 
tion aims to find the best available solution among 
a set of potential solutions in a defined search space. 
For almost every problem there exists a solution, not 
necessarily the best one, but we can always find an 
approximation to the ideal solution, and while in 
some cases or processes it is still common to use 
our own experience to qualify a process, a part of 
the research community has dedicated a considerable 
amount of time and effort to help find robust opti- 
mization methods for optima found in a vast range of 
applications. 
It is difficult to solve different problems by applying
the same methodology, and even the most robust
optimization approaches may be outperformed by other
optimization techniques, depending on the problem to
be solved.

When the complexity and the dimension of the 
search space make a problem unsolvable by a deter- 
ministic algorithm, probabilistic algorithms deal with 
this problem by going through a diverse set of possi- 
ble solutions or candidate solutions. Many metaheuris- 
tic algorithms can be considered probabilistic because 
they apply probability tools to solve a problem; meta- 
heuristic algorithms seek good solutions by mimicking 
natural processes or paradigms. Most of these novel op- 
timization paradigms that were inspired by nature were 
conceived by mere observation of an existing process 
and their main characteristics were embodied as com- 
putational algorithms. 

The importance of optimization theory and its
application has grown in the past few decades, from
the well-known genetic algorithm paradigm to particle
swarm optimization (PSO), ant colony optimization
(ACO), harmony search, and deoxyribonucleic acid (DNA)
computing, among others, and they were all introduced
with the expectation that they would improve the
results obtained with existing strategies.

There is no doubt that some optimization strategies
presented at some point were left behind due to
their complexity and poor performance.
Novel optimization paradigms should be able to perform
well in comparison with other optimization
techniques and must be easily adaptable to different
kinds of problems.

Optimization based on chemical processes is 
a growing field that has been satisfactorily applied to 
several problems. In [80.1] a DNA-based algorithm 
was introduced to solve the small hitting set problem. 
A catalytic search algorithm was explored in [80.2], 
where some physical laws such as mass and energy 
conservation were taken into account. In [80.3], the 
potential roles of energy in algorithmic chemistries 
were illustrated. An energy framework was introduced, 
which keeps the molecules within reasonable length 
bounds, allowing the algorithm to behave thermo- 
dynamically and kinetically, similarly to real chem- 
istry. A chemical reaction optimization was applied to 
a grid scheduling problem in [80.4], where molecules 
interact with each other aiming to reach the mini- 
mum state of free potential and kinetic energies. The 
main difference between these metaheuristics is the 
parameter representation, which can be explicit or 
implicit. 

In this paper, we introduce an optimization method 
inspired by chemical reactions and its application for 
the optimization of the tracking controller of the dy- 
namic model of the unicycle mobile robot. 

Applying this chemical optimization algorithm is
relevant because many different methods have been
applied to solve motion control problems. Kanayama
et al. [80.5] proposed a stable tracking control method
for a nonholonomic vehicle using a Lyapunov function.
Lee et al. [80.6] solved tracking control using
backstepping, and in [80.7] saturation constraints were used.
Furthermore, most reported designs rely on intelligent
control approaches such as fuzzy logic control
[80.8–13] and neural networks [80.14, 15].


However, the majority of the publications men- 
tioned above concentrated on kinematic models of 
mobile robots, which are controlled by the veloc- 
ity input, while less attention has been paid to the 
control problems of nonholonomic dynamic systems, 
where forces and torques are the true inputs. Bloch 
and Drakunov [80.16] and Chwa [80.17] used slid- 
ing mode control for the tracking control problem. 
Fierro and Lewis [80.18] proposed a dynamical exten- 
sion that makes the integration of kinematics and torque 
controller possible for a nonholonomic mobile robot. 
Fukao et al. [80.19] introduced an adaptive tracking 
controller for the dynamic model of mobile robots with 
unknown parameters using backstepping methodology, 
which has been recognized as a tool for solving several 
control problems [80.20, 21]. 

Motivated by this, herein a Mamdani fuzzy logic 
controller is introduced in order to drive the kinematic 
model to a desired trajectory in a finite time; consid- 
ering the torque as the real input, a chemical reaction 
optimization paradigm is applied and simulations are 
shown. 

Further publications [80.22–24] applied bio-inspired
optimization techniques to find the parameters
of the membership functions for the fuzzy tracking 
controller that solves the problem for the dynamic 
model of a unicycle mobile robot, using a fuzzy logic 
controller that provides the required torques to reach 
the desired velocity and trajectory inputs. 

In this paper, the main contribution is the represen- 
tation of the fuzzy controller in the chemical paradigm 
to search for the optimal parameters. Simulation results 
show that the proposed approach outperforms other 
nature-inspired computing paradigms, such as genetic 
algorithms, particle swarm, and ant colony optimiza- 
tion. 

The rest of this chapter is organized as follows. 
Section 80.2 illustrates the proposed methodology. Sec- 
tion 80.3 describes the problem formulation and con- 
trol objective. Section 80.4 describes the proposed 
fuzzy logic controller of the robot. Section 80.5 shows 
some experimental results of the tracking controller. 
In Sect. 80.6 some conclusions and future work are 
presented. 


80.2 The Chemical Optimization Paradigm 


The proposed chemical reaction algorithm is a
metaheuristic strategy that performs a stochastic search for
optimal solutions within a defined search space. In this
optimization strategy, every solution is represented as
an element (or compound), and the fitness or performance
of the element is evaluated in accordance with
the objective function. The general flowchart of the
algorithm is shown in Fig. 80.1.

Fig. 80.1 General flowchart of the chemical reaction algorithm

The main difference with other optimization techniques
[80.1–4] is that no external parameters are
taken into account to evaluate the results, while 
other algorithms introduce additional parameters (ki- 
netic/potential energies, mass conservation, thermody- 
namic characteristics, etc.). This is a very straight- 
forward methodology that takes the characteristics of 
the chemical reactions (synthesis, decomposition, sub- 
stitution, and double-substitution) to find the optimal 
solution. 

This approach is a static population-based meta- 
heuristic that applies an abstraction of the chemical re- 
actions as intensifiers (substitution, double substitution 
reactions) and diversifying (synthesis, decomposition 
reactions) mechanisms. The elitist reinsertion strategy 
allows the permanence of the best elements, and thus 
the average fitness of the entire element pool increases 
with every iteration. The algorithm may trigger only 
one reaction or all of them, depending on the nature of 
the problem to solve. For example, we may use only the 
decomposition reaction subroutine to find the minimum 
value of a mathematical function. 

The pseudocode for the chemical reaction algorithm 
is as follows: 


Algorithm 80.1 Chemical_Reaction_Algorithm
Input: problem_definition, objective_function, dimensions
 1: Assign values to variables: pool_size, trials,
    upper_boundary, lower_boundary, synthesis_rate,
    decomposition_rate, singlesubstitution_rate,
    doublesubstitution_rate
 2: Generate randomly Initial_Pool in interval
    [lower_boundary, upper_boundary]
 3: Evaluate Initial_Pool
 4: Identify best_solution
 5: while (stopping criteria not met) do
 6:    Perform Synthesis_Procedure; Get Synthesis_vector
 7:    Perform Decomposition_Procedure; Get Decomposition_vector
 8:    Perform SingleSubstitution_Procedure; Get SingleSubstitution_vector
 9:    Perform DoubleSubstitution_Procedure; Get DoubleSubstitution_vector
10:    Evaluate Synthesis_vector, Decomposition_vector,
       SingleSubstitution_vector, DoubleSubstitution_vector
11:    Apply elitist_reinsertion; Get improved_pool
12:    Update best_solution
13: end while
Output: best_solution
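The main loop of Algorithm 80.1 can be sketched in Python as shown below. This is a minimal illustration under several assumptions that are not in the source: elements are lists of floats, the stopping criterion is a fixed number of trials, the reaction rates are omitted, reactions are passed in as hypothetical callbacks of the form `reaction(pool, rng)`, and candidates are clipped to the boundaries. Following the remark that a single reaction may suffice for minimizing a mathematical function, the usage example triggers only a decomposition-style reaction on the sphere function.

```python
import random


def chemical_reaction_search(objective, dims, lower, upper, reactions,
                             pool_size=20, trials=100, seed=0):
    """Minimize `objective` with an elitist, reaction-driven element pool."""
    rng = random.Random(seed)
    pool = [[rng.uniform(lower, upper) for _ in range(dims)]
            for _ in range(pool_size)]
    pool.sort(key=objective)
    for _ in range(trials):  # assumed stopping criterion: fixed trial count
        candidates = []
        for reaction in reactions:
            candidates.extend(reaction(pool, rng))
        # Keep every new element inside the specified range.
        candidates = [[min(upper, max(lower, x)) for x in c]
                      for c in candidates]
        # Elitist reinsertion: only the best pool_size elements survive.
        pool = sorted(pool + candidates, key=objective)[:pool_size]
    return pool[0]


def decomposition(pool, rng):
    """Split each element into two parts scaled by randval and 1 - randval."""
    randval = rng.random()
    parts = []
    for element in pool:
        parts.append([x * randval for x in element])
        parts.append([x * (1.0 - randval) for x in element])
    return parts


def sphere(x):
    """Simple benchmark with its minimum value 0 at the origin."""
    return sum(v * v for v in x)


best = chemical_reaction_search(sphere, dims=3, lower=-5.0, upper=5.0,
                                reactions=[decomposition])
```

Because the survivors are always the best elements of the old pool plus the new candidates, the best fitness cannot get worse from one iteration to the next.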


All nature-inspired paradigms have their own way 
to encode candidate solutions. When these parameters 
are defined, a set of processes or procedures are applied 
to lead the population to an optimal result. The main 
components of this chemical reaction algorithm are de- 
scribed below. 


80.2.1 Elements/Compounds 


These are the basic components of the algorithm. Each 
element or compound represents a solution within the 
search space. The initial definition of elements and/or 
compounds depends on the problem itself and can be 
represented as binary numbers, integer, floating, etc. 
They interact with each other implicitly; that is, the 
definition of the interaction is independent of the real 
molecular structure. In this approach the potential and 
kinetic energies and other molecular characteristics are 
not taken into account. 
80.2.2 Chemical Reactions


A chemical reaction is a process in which at least 
one substance changes its composition and its sets of 
properties. In this approach, the chemical reactions be- 
have as intensifiers (substitution, double substitution 
reactions) and diversifying (synthesis, decomposition 
reactions) mechanisms. The four chemical reactions 
considered in this approach are the synthesis, decom- 
position, single and double-substitution reactions. The 
objective of these operators is to explore or exploit new 
possible solutions within a slightly larger hypercube 
than the original elements/compounds, but within the 
previously specified range. 

The synthesis and decomposition reactions are used 
to diversify the resulting solutions; these procedures 
were shown to be highly effective and to rapidly lead 
the results to a desired value. They can be described as 
follows. 


80.2.3 Synthesis Reactions 


This is a reaction of two reactants to produce one
product. By combining two (or more) elements, this
procedure allows us to explore higher-valued solutions
within the search space. The result can be described as
a compound (B + C → BC). The pseudocode for the
synthesis reaction procedure is as follows:


Algorithm 80.2 Synthesis_Procedure
Input: selected_elements, synthesis_rate
1: n = size(selected_elements)
2: i = floor(n/2)
3: for j = 1 to i − 1
4:    Synthesis = selected_elements_j + selected_elements_{j+1}
5:    j = j + 2
6: end for
Output: Synthesis_vector


Table 80.1 Main elements of several nature-inspired paradigms

Paradigm  Parameter            Basic operations
          representation
GA        Genes                Crossover, mutation
ACO       Ants                 Pheromone
PSO       Particles            Cognitive, social coefficients
GP        Trees                Crossover, mutation (in some cases)
CRM       Elements, compounds  Reactions (combination, decomposition,
                               substitution, double-substitution)


80.2.4 Decomposition Reactions


In this reaction, typically, only one reactant is given,
which allows a compound to be decomposed into
smaller instances (BC → B + C). The pseudocode for
the decomposition reaction procedure is as follows:


Algorithm 80.3 Decomposition_Procedure
Input: selected_elements, decomposition_rate
1: n = size(selected_elements)
2: Get randval randomly in interval [0, 1]
3: for i = 1 to n
4:    Deco1_i = selected_elements_i × randval
5:    Deco2_i = selected_elements_i × (1 − randval)
6:    i = i + 1
7: end for
Output: Decomposition_vector (Deco1, Deco2)
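The two diversifying reactions can be sketched in Python as follows. The representation of an element as a list of floats is an assumption, and the synthesis function pairs every two consecutive selected elements, which is one reading of the loop in Algorithm 80.2 rather than a literal transcription of its bounds; the rate parameters are omitted.

```python
import random


def synthesis(selected):
    """B + C -> BC: add consecutive pairs of elements component-wise."""
    return [
        [b + c for b, c in zip(selected[j], selected[j + 1])]
        for j in range(0, len(selected) - 1, 2)
    ]


def decomposition(selected, rng=random):
    """BC -> B + C: split each element using one random factor in [0, 1)."""
    randval = rng.random()
    deco1 = [[x * randval for x in element] for element in selected]
    deco2 = [[x * (1.0 - randval) for x in element] for element in selected]
    return deco1, deco2


# Usage: three elements give one synthesized compound (the odd one is left out).
compounds = synthesis([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
part1, part2 = decomposition([[2.0, 4.0]], random.Random(1))
```

Each decomposed pair sums back to the original element, so the decomposition explores points on the segment between the origin and the parent solution.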


The single and double-substitution reactions allow 
the algorithm to search for optima around a previously 
found good solution and they are described below. 


80.2.5 Single-Substitution Reactions 


When a free element reacts with a compound of different
elements, the free element will replace one of
the elements in the compound if the free element is
more reactive than the element it replaces. A new
compound and a new free element are produced. In the
algorithm, a compound and an element are selected and
a decomposition reaction is applied to the compound;
two elements are generated from this operation. Then,
one of the newly generated elements is combined with
the non-decomposed selected element (C + AB → AC
+ B). The pseudocode for the single-substitution
reaction procedure is as follows:


Algorithm 80.4 SingleSubstitution_Procedure
Input: selected_elements, singlesubstitution_rate
1: n = size(selected_elements)
2: i = floor(n/2)
3: a = selected_elements_1, selected_elements_2, ...,
   selected_elements_i
4: b = selected_elements_{i+1}, selected_elements_{i+2},
   ..., selected_elements_{i×2}
5: Apply Decomposition_Procedure to a; Get Deco1, Deco2
6: Apply Synthesis_Procedure (b + Deco1); Get Synthesis_vector
Output: SingleSubstitution_vector (Synthesis_vector, Deco2)
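A compact Python sketch of Algorithm 80.4 is given below. It inlines the decomposition and synthesis steps so that the block is self-contained, represents elements as lists of floats, and omits the rate parameter; these are choices made for the illustration, not details given in the text.

```python
import random


def single_substitution(selected, rng=random):
    """C + AB -> AC + B on the two halves of the selected elements."""
    half = len(selected) // 2
    a = selected[:half]            # compounds to be decomposed
    b = selected[half:2 * half]    # free elements kept intact
    randval = rng.random()
    deco1 = [[x * randval for x in element] for element in a]
    deco2 = [[x * (1.0 - randval) for x in element] for element in a]
    # Combine each intact element with the first part of a decomposed one.
    synthesis_vector = [
        [p + q for p, q in zip(free, part)]
        for free, part in zip(b, deco1)
    ]
    return synthesis_vector, deco2


# Usage: the first two elements are decomposed, the last two stay intact.
new_compounds, released = single_substitution(
    [[2.0], [4.0], [10.0], [20.0]], random.Random(3)
)
```

The double-substitution reaction differs only in that both halves are decomposed before their parts are recombined.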


80.2.6 Double-Substitution Reactions 


Double-substitution or double-replacement reactions,
also called double-decomposition reactions or metathesis
reactions, involve two ionic compounds, most often
in aqueous solution. In this type of reaction, the cations
simply swap anions; in the algorithm, a similar process
to that in the previous reaction happens. The difference
is that in this reaction both of the selected compounds
are decomposed, and the resulting elements are combined
with each other (AB + CD → CB + AD). The
pseudocode for the double-substitution reaction procedure
is as follows:


Algorithm 80.5 DoubleSubstitution_Procedure
Input: selected_elements, doublesubstitution_rate
1: n = size(selected_elements)
2: i = floor(n/2)
3: a = selected_elements_1, selected_elements_2, ...,
   selected_elements_i
4: b = selected_elements_{i+1}, selected_elements_{i+2},
   ..., selected_elements_{i×2}
5: Apply Decomposition_Procedure to a and b; Get
   (Deco1, Deco2), (Deco1', Deco2')
6: Apply Synthesis_Procedure (Deco1 + Deco1'),
   (Deco2 + Deco2'); Get Synthesis_vector, Synthesis_vector'
Output: DoubleSubstitution_vector
   (Synthesis_vector, Synthesis_vector')


In this chemical reaction algorithm we may trigger
only one reaction or all of them, depending on the nature
of the problem to be solved; e.g., we can apply only
the decomposition reaction subroutine to find the
minimum value of a mathematical function.

Throughout the execution of the algorithm, whenever
a new set of elements/compounds is created, an
elitist reinsertion criterion is applied, allowing the
permanence of the best elements, and thus the average
fitness of the entire element pool increases through
iterations.

In order to have a better picture of the general
schema for this proposed chemical reaction algorithm,
a comparison with other nature-inspired paradigms is
shown in Table 80.1.


80.3 The Mobile Robot


Mobile robots are nonholonomic systems due to
the constraints imposed on their kinematics. The
equations describing the constraints cannot be integrated
symbolically to obtain explicit relationships between
robot positions in local and global coordinate frames.
Hence, control problems that involve them have
attracted attention in the control community in recent
years [80.25].

Fig. 80.2 Diagram of a wheeled mobile robot


The model considered is that of a unicycle mobile 
robot (see Fig. 80.2) that has two driving wheels fixed 
to the axis and one passive orientable wheel placed in 
front of the axis and normal to it [80.26]. 

The two fixed wheels are controlled independently 
by the motors, and the passive wheel prevents the robot 
from overturning when moving on a plane. 

It is assumed that the motion of the passive wheel 
can be ignored from the dynamics of the mobile robot, 
which is represented by the following set of equa- 
tions [80.18] 


$$\dot q = \begin{bmatrix} \cos\theta & 0 \\ \sin\theta & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} v \\ w \end{bmatrix}, \qquad M(q)\,\dot{\mathbf v} + V(q,\dot q)\,\mathbf v + G(q) = \tau \tag{80.1}$$

where $q = [x, y, \theta]^T$ is the vector of generalized
coordinates that describes the robot's position, $(x, y)$ are the
Cartesian coordinates, which denote the mobile center
of mass, and $\theta$ is the angle between the heading
direction and the x-axis (taken counterclockwise);
$\mathbf v = [v, w]^T$ is the vector of velocities, where $v$ and $w$
are the linear and angular velocities, respectively;
$\tau \in \mathbb{R}^r$ is the input vector; $M(q) \in \mathbb{R}^{n \times n}$ is a symmetric
and positive-definite inertia matrix;
$V(q, \dot q) \in \mathbb{R}^{n \times n}$ is the centripetal
and Coriolis matrix; and $G(q) \in \mathbb{R}^n$ is the gravitational
vector. Equation (80.1) represents the kinematics or
steering system of a mobile robot.

Notice that the no-slip condition imposes a nonholonomic
constraint described by (80.2), which
means that the mobile robot can only move in the
direction normal to the axis of the driving wheels.

$$\dot y \cos\theta - \dot x \sin\theta = 0 \tag{80.2}$$

Fig. 80.3 Tracking control structure


The control objective is established as follows:
given a desired trajectory $q_d(t)$ and the orientation of
the mobile robot, we must design a controller that
applies an adequate torque $\tau$ such that the measured
positions $q(t)$ reach the desired reference $q_d(t)$,
as expressed by (80.3):

$$\lim_{t \to \infty} \lVert q_d(t) - q(t) \rVert = 0 \tag{80.3}$$

To reach the control objective, the method is based
on the procedure of [80.18]: we derive $\tau(t)$ from
a specific $v_c(t)$ that controls the steering system (80.1)
using a fuzzy logic controller (FLC). The general
structure of the tracking control system is presented in
Fig. 80.3.

The control is based on the procedure proposed by Kanayama et al. [80.5] and Nelson and Cox [80.27] to solve the tracking problem for the kinematic model $v_c(t)$. Suppose that the desired trajectory $q_d$ satisfies (80.4)


$$\dot{q}_d = \begin{bmatrix} \cos\theta_d & 0 \\ \sin\theta_d & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} v_d \\ w_d \end{bmatrix}. \tag{80.4}$$


Using the robot local frame (the moving coordinate system x-y in Fig. 80.1), the error coordinates can be defined as (80.5)


$$e = T_e (q_d - q), \qquad \begin{bmatrix} e_x \\ e_y \\ e_\theta \end{bmatrix} = \begin{bmatrix} \cos\theta & \sin\theta & 0 \\ -\sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_d - x \\ y_d - y \\ \theta_d - \theta \end{bmatrix}. \tag{80.5}$$


Moreover, the auxiliary velocity control input that achieves tracking for (80.1) is given by (80.6)


$$v_c = f_c(e, v_d), \qquad \begin{bmatrix} v_c \\ w_c \end{bmatrix} = \begin{bmatrix} v_d \cos e_\theta + k_1 e_x \\ w_d + v_d k_2 e_y + v_d k_3 \sin e_\theta \end{bmatrix}, \tag{80.6}$$


where $k_1$, $k_2$, and $k_3$ are positive gain constants.

The first part of this work applies the proposed method to obtain the values of $k_i$ ($i = 1, 2, 3$) that achieve the optimal behavior of the controller; the second part optimizes the fuzzy controller.


80.4 Fuzzy Logic Controller


The purpose of the fuzzy logic controller (FLC) is to find a control input $\tau$ such that the current velocity vector $v$ is able to reach the velocity vector $v_c$, and this is denoted as


$$\lim_{t \to \infty} \| v_c - v \| = 0. \tag{80.7}$$
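
The error transformation (80.5) and the auxiliary velocity input (80.6) can be sketched in a few lines. This is a minimal illustration, assuming the linear term has the form $v_d \cos e_\theta + k_1 e_x$ as in the law of Kanayama et al. [80.5]; the function names and the numeric values are hypothetical.

```python
import math

def tracking_error(q, q_d):
    """Error (80.5): rotate the world-frame error into the robot frame."""
    x, y, theta = q
    x_d, y_d, theta_d = q_d
    dx, dy = x_d - x, y_d - y
    e_x = math.cos(theta) * dx + math.sin(theta) * dy
    e_y = -math.sin(theta) * dx + math.cos(theta) * dy
    e_theta = theta_d - theta
    return e_x, e_y, e_theta

def auxiliary_velocity(e, v_d, w_d, k1, k2, k3):
    """Auxiliary velocity input (80.6) with positive gains k1, k2, k3."""
    e_x, e_y, e_theta = e
    v_c = v_d * math.cos(e_theta) + k1 * e_x
    w_c = w_d + v_d * k2 * e_y + v_d * k3 * math.sin(e_theta)
    return v_c, w_c

# Illustrative poses and gains (assumptions, not results from the chapter)
q = (0.0, 0.0, math.pi / 2)      # robot at the origin, heading along +y
q_d = (0.0, 1.0, math.pi / 2)    # reference one unit ahead
e = tracking_error(q, q_d)
v_c, w_c = auxiliary_velocity(e, v_d=1.0, w_d=0.0, k1=2.0, k2=3.0, k3=1.0)
```

With the reference directly ahead of the robot, only the longitudinal error $e_x$ is nonzero, so the law raises the linear velocity above $v_d$ and leaves the angular velocity essentially at $w_d$.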
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Fig. 80.4a,b Membership functions of (a) input e, and ew, 
and (b) output variables F and N 


The input variables of the FLC correspond to the velocity errors obtained in (80.5) using the derivatives of the position and angular errors (denoted as $e_v$ and $e_w$). The initial membership functions (MF) are defined by one triangular and two trapezoidal functions for each variable involved. Figure 80.4 depicts the MFs in which N, Z, P represent the fuzzy sets (negative, zero, and positive, respectively) associated with each input and output variable.

The rule set of the FLC contains nine rules, which govern the input-output relationship of the FLC, and it adopts the Mamdani-style inference engine. We use the center of gravity method to realize the defuzzification procedure. In Table 80.2 we present the rule set, whose format is established as follows:


Rule i: If $e_v$ is G1 and $e_w$ is G2
then F is G3 and N is G4,


where G1, ..., G4 are the fuzzy sets associated with each variable and i = 1, ..., 9. In this case, P denotes positive, N denotes negative, and Z denotes zero.


Table 80.2 Fuzzy rule set


e_v / e_w   N     Z     P

N           N/N   N/Z   N/P
Z           Z/N   Z/Z   Z/P
P           P/N   P/Z   P/P


80.5 Experimental Results


Several tests of the chemical optimization paradigm were made to test the performance of the tracking controller. First, we need to find the values of $k_i$ ($i = 1, 2, 3$) shown in (80.6), which will guarantee convergence of the error $e$ to zero.

To evaluate the constants obtained by the algorithm, the mobile robot tracking system, which consists of (80.5) and (80.6), was modeled using Simulink. Figure 80.5 shows the closed loop for the tracking controller.

The conditions to evaluate each result, which correspond to the final position error, are given by (80.8):


$$\frac{1}{n} \sum_{i=1}^{n} \left( |e_x| + |e_y| + |e_\theta| \right). \tag{80.8}$$


For the first set of experiments only the decomposition reaction mechanism was triggered and the decomposition factor was varied; this factor is the number of elements that result from applying a decomposition reaction to a given compound. The only restriction is the following: let $x$ be the selected compound and $x_i$ ($i = 1, 2, \ldots, n$) the resulting elements; the sum of all values found in the decomposition must be equal to the value of the original compound. This is shown in (80.9)


$$\sum_{i=1}^{n} x_i = x. \tag{80.9}$$


Each experiment was executed 35 times and the test parameters for each set of experiments can be observed in Table 80.3.
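
The restriction (80.9) can be illustrated with a short sketch. The chapter does not state how the value of a compound is distributed among the resulting elements, so the random proportional split below is an assumption, and the function name `decompose` is hypothetical.

```python
import random

def decompose(x, n, rng):
    """Split compound value x into n elements whose sum equals x, as in (80.9).

    The random proportional split is an assumption; the source only
    requires that the elements add up to the original value.
    """
    weights = [rng.random() + 1e-12 for _ in range(n)]
    total = sum(weights)
    parts = [x * w / total for w in weights]
    # Let the last element absorb floating-point rounding
    parts[-1] = x - sum(parts[:-1])
    return parts

rng = random.Random(0)
parts = decompose(10.0, 3, rng)   # decomposition factor of 3
```

Here the decomposition factor is the argument `n`, so a factor of 3 turns one compound into three elements whose values add up to the original one.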




[Fig. 80.5 block labels: ideal linear velocity, ideal angular velocity, position error, tracking error system, error display, desired linear velocity ($v_d$), desired angular velocity ($w_d$), output to workspace]


Fig. 80.5 Closed loop for the tracking controller system 


Table 80.3 Parameters of the chemical reaction optimization

No.  Elements      Iterations  Dec. factor   Dec. rate
1    2             10          [illegible]   0.3
2    5             10          3             0.3
3    [illegible]   10          2             0.4
4    2             10          3             0.4
5    5             10          2             0.4
6    5             10          3             0.4
7    5             10          2             0.5
8    10            10          2             0.5


The decomposition rate (Dec. rate) represents the percentage of the pool that are candidates for decomposition, and the decomposition factor (Dec. factor) is the number of elements into which each candidate is decomposed.

The selection strategy applied was stochastic universal sampling, which uses a single random value to sample all of the solutions by choosing them at evenly spaced intervals. For example, for a pool containing five initial compounds, the vector of decomposed elements when the decomposition factor is 3 and the decomposition rate is 0.4 will have six elements (0.4 × 5 = 2 candidate compounds, each decomposed into 3 elements).

By applying this criterion, the pool of elements increased with every iteration. This is why the initial element pool was limited to a maximum of ten elements. Table 80.4 shows the results after applying the chemical optimization paradigm.


Table 80.4 Experimental results of the proposed method for optimizing the values of the gains k1, k2, and k3

No.  Best error              Mean     k1            k2            k3
1    0.0086                  1.1568   [illegible]   [illegible]   [illegible]
2    4.79 × 10^[illegible]   0.1291   [illegible]   [illegible]   [illegible]
3    0.0025                  0.5809   36            328           88
4    0.0012                  0.5589   [illegible]   [illegible]   [illegible]
5    0.0035                  0.0480   185           29            3
6    [illegible]             0.0299   [illegible]   [illegible]   [illegible]
7    0.0066                  0.1440   29            15            0
8    0.0019                  0.1625   51            3             0


As can be observed in Table 80.4, experiment number 6 seems to have the best result because it reached the smallest final error among all experiments. Figure 80.6 shows the final position errors in x, y, and θ for experiment number 6.


Fig. 80.6 Final position errors in x, y, and θ for experiment number 6

Fig. 80.7 Final position errors in x, y, and θ for experiment number 3

Fig. 80.8 Position errors in x, y, and θ of best result by applying GAs


Table 80.5 Comparison of the best results

Parameters                 Genetic algorithm   Chemical optimization algorithm
Individuals                5                   2
Iterations                 [illegible]         10
Crossover rate             0.8                 N/A
Mutation rate              0.1                 N/A
Synthesis rate             N/A                 0.2
Decomposition rate         N/A                 0.8
Substitution rate          N/A                 0.6
Double substitution rate   N/A                 0.6
k1, k2, k3                 43, 493, 195        36, 328, 88
Final error                0.006734            0.0025


Table 80.6 Parameters of the simulations for Type-1 FLC

Parameters         Value
Elements           10
Trials             15
Selection method   Stochastic universal sampling
k1                 117
k2                 226
k3                 137
Error              0.077178


Table 80.7 Parameters of the simulations for Type-2 FLC

Parameters         Value
Elements           10
Trials             10
Selection method   Stochastic universal sampling
k1                 117
k2                 226
k3                 137
Error              2.7736


By analyzing the graphical results of several sets of experiments, we noticed that the control obtained for some of them was smoother despite the average error value. This was the case for experiment number 3, in which the final error value was significantly higher than that obtained in experiment number 6. Figure 80.7 shows the final position errors in x, y, and θ for experiment number 3.

Comparing both graphics, we can observe that the average error obtained for θ is 0.0338 for experiment number 6 and 0.0315 for experiment number 3. This smoother control of the tracking system could make a big difference in the complete dynamic system of the mobile robot.

In previous work [80.28], the gain constant values 
were found by means of genetic algorithms. Table 80.5 
shows a comparison of the best results obtained with 
both algorithms, and we can observe that the result with 
the chemical optimization outperforms the GA in find- 
ing the best gain values. 

Figure 80.8 shows the result in Simulink for the 
experiment with the best overall result when applying 
GAs as the optimization method. 

Once we have found optimal values for the gain constants, the next step is to find the optimal values for the input/output membership functions of the fuzzy controller. Our goal is that the linear and angular velocities reach zero in the simulations. Table 80.6 shows the parameters of the simulations for the Type-1 FLC.

Figure 80.9 shows the behavior of the chemical op- 
timization algorithm throughout the experiment. 

Figure 80.10 shows the resulting input and output membership functions found by the proposed optimization algorithm.

Figure 80.11 shows the trajectory obtained when 
simulating the mobile control system including the ob- 
tained input and output membership functions. 
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Fig. 80.9 Best simulation of experiments with the chemi- 
cal optimization method 


Fig. 80.10a-d Resulting input membership functions: (a) linear and (b) angular velocities, and output membership functions: (c) right and (d) left torque


Figure 80.12 shows the best trajectory reached by 
the mobile robot when optimizing the input and output 
membership functions using genetic algorithms. 

A Type-2 FLC was developed using the param- 
eters of the membership functions found for Type-1 
FLC. The parameters searched with the chemical re- 
action algorithm were for the footprint of uncertainty 
(FOU). 

Table 80.7 shows the parameters used in the simulations and Fig. 80.13 shows the behavior of the chemical optimization algorithm throughout the experiment.

Fig. 80.11 Trajectory obtained when applying the chemical reaction algorithm

Fig. 80.12 Trajectory obtained using GAs


Figure 80.14 shows the resulting Type-2 input and 
output membership functions found by the proposed 
optimization algorithm and Fig. 80.15 shows the ob- 
tained trajectory reached by the mobile robot. 

As observed in Table 80.7, the final error obtained (2.7736) is not smaller than the final error found for the Type-1 FLC (0.077178, Table 80.6). Despite this, the trajectory obtained, which is shown in Fig. 80.15, is acceptable, taking into account that the reference trajectory is a straight line. In Fig. 80.16 we can observe an unacceptable trajectory, which was found in the early attempts of optimization for the Type-1 FLC applying this chemical reaction algorithm. Here, we can observe that the parameters found were not adequate to make the FLC follow the desired trajectory.

Fig. 80.13 Behavior of the algorithm when optimizing the Type-2 FLC

In order to test the robustness of the Type-1 and Type-2 FLC, we added an external signal given by (80.10).


$$F_{ext}(t) = \varepsilon \times \sin(\omega \times t). \tag{80.10}$$


This represents an external force applied over a period of 10 s to the trajectory obtained, which will make the mobile robot move out of its path. The idea of adding this disturbance is to measure the errors obtained with the FLC and to test the behavior of the mobile robot under perturbed torques. Table 80.8 shows the parameters for the simulations and the errors obtained during the run of the simulation.
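
As a small illustration, the disturbance (80.10) can be sampled over the 10 s window. The frequency $\omega$ and the sampling step are not given in this excerpt, so the values below are assumptions; $\varepsilon = 30$ is one of the amplitudes listed in Table 80.8.

```python
import math

def external_force(t, epsilon, omega):
    """External disturbance (80.10): epsilon * sin(omega * t)."""
    return epsilon * math.sin(omega * t)

# Assumed sampling step of 0.1 s and assumed omega = 1.0 rad/s
samples = [external_force(0.1 * k, epsilon=30.0, omega=1.0) for k in range(101)]
peak = max(abs(s) for s in samples)
```

The amplitude of the perturbation is bounded by $\varepsilon$, which is why larger $\varepsilon$ values push the robot further from its path.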

Figure 80.17 shows the trajectories obtained for the Type-1 FLC optimized with GAs.

Figure 80.18 shows the trajectories obtained for the 
Type-1 FLC optimized with the chemical reaction algo- 
rithm. 

Figure 80.19 shows the trajectories obtained for the 
Type-2 FLC optimized with the CRA method. 

In Table 80.8 and Figs. 80.17 to 80.19 we can observe that the Type-2 FLC was able to maintain a more controlled trajectory despite the large error found by the algorithm (e = 2.7736). For larger epsilon (ε) values, it was difficult for the Type-1 FLCs to stay on the path, and after a certain time the controller was not able to return to the reference trajectory.
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Fig. 80.14a-d Resulting 
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Fig. 80.15 Trajectory obtained for the mobile robot when 
applying the chemical reaction algorithm to the Type-2 
FLC 


Fig. 80.16 Unacceptable trajectory resulting in early opti- 
mization trials 
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Table 80.8 Simulation parameters and errors obtained under disturbed torques 


e Velocity 
errors 
0.05 Final error 
Average error 
5 Final error 
Average error 
10 Final error 
Average error 
30 Final error 
Average error 
32 Final error 
Average error 
34 Final error 
Average error 
40 Final error 
Average error 
41 Final error 
Average error 
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Fig. 80.17a-c From left to right: trajectory obtained with the Type-1 FLC optimized with GAs. (a) ¢ = 30, (b) e = 32, 


(c) £ = 34 
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Fig. 80.18a-c From left to right: trajectory obtained with the Type-1 FLC optimized with CRA. (a) £ = 30, (b) e = 32, 


(c) £ = 34 
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Fig. 80.19a-c From left to right: trajectory obtained with the Type-2 FLC optimized with CRA. (a) ¢ = 30, (b) ¢ = 32, 


(c) £ = 34 


80.6 Conclusions 


In this paper, we presented simulation results from an optimization method that mimics chemical reactions applied to the problem of tracking control. The goal was to find the gain constants involved in the tracking controller for the dynamic model of a unicycle mobile robot. In the figures of the experiments we were able to note the behavior of the algorithm and the solutions found through all the iterations. Simulation results show that the proposed optimization method is able to outperform the results previously obtained by applying a genetic algorithm optimization technique. The optimal fuzzy logic controller obtained with the proposed chemical paradigm is able to reach smaller error values in less time than genetic algorithms. Also, the Type-2 fuzzy controller was able to perform better in the presence of disturbance in this problem despite the large error obtained (e = 2.7736). The design of optimal Type-2 fuzzy controllers is currently under way.


References


80.1 N.-Y. Shi, C.-P. Chu: A molecular solution to the hitting-set problem in DNA-based supercomputing, Inf. Sci. 180, 1010-1019 (2010)

80.2 L. Yamamoto: Evaluation of a catalytic search algorithm, Proc. 4th Int. Workshop Nat. Inspired Coop. Strateg. Optim., NICSO 2010 (2010) pp. 75-87

80.3 T. Meyer, L. Yamamoto, W. Banzhaf, C. Tschudin: Elongation control in an algorithmic chemistry, Lect. Notes Comput. Sci. 5777, 273-280 (2010)

80.4 J. Xu, A.Y.S. Lam, V.O.K. Li: Chemical reaction optimization for the grid scheduling problem, IEEE Commun. Soc., ICC 2010 (2010) pp. 1-5

80.5 Y. Kanayama, Y. Kimura, F. Miyazaki, T. Noguchi: A stable tracking control method for a non-holonomic mobile robot, Proc. IEEE/RSJ Int. Workshop Intell. Robot. Syst., Osaka (1991) pp. 1236-1241

80.6 T.-C. Lee, C.H. Lee, C.-C. Teng: Tracking control of mobile robots using the backstepping technique, Proc. 5th Int. Conf. Contr. Automat. Robot. Vis., Singapore (1998) pp. 1715-1719

80.7 T.-C. Lee, K. Tai: Tracking control of unicycle-modeled mobile robots using a saturation feedback controller, IEEE Trans. Control Syst. Techn. 9(2), 305-318 (2001)

80.8 S. Bentalba, A. El Hajjaji, A. Rachid: Fuzzy control of a mobile robot: A new approach, IEEE Int. Conf. Control Appl., Hartford (1997) pp. 69-72

80.9 S. Ishikawa: A method of indoor mobile robot navigation by fuzzy control, Proc. Int. Conf. Intell. Robot. Syst., Osaka (1991) pp. 1013-1018

80.10 T.H. Lee, F.H.F. Leung, P.K.S. Tam: Position control for wheeled mobile robot using a fuzzy controller, 25th Annu. Conf. IEEE, San Jose (1999) pp. 525-528

80.11 S. Pawlowski, P. Dutkiewicz, K. Kozlowski, W. Wroblewski: Fuzzy logic implementation in mobile robot control, 2nd Workshop Robot Motion Control (2001) pp. 65-70

80.12 C.-C. Tsai, H.-H. Lin, C.-C. Lin: Trajectory tracking control of a laser-guided wheeled mobile robot, Proc. IEEE Int. Conf. Control Appl., Taipei (2004) pp. 1055-1059

80.13 S.V. Ulyanov, S. Watanabe, V.S. Ulyanov, K. Yamafuji, L.V. Litvintseva, G.G. Rizzotto: Soft computing for the intelligent robust control of a robotic unicycle with a new physical measure for mechanical controllability, Soft Comput. 2, 73-88 (1998)

80.14 R. Fierro, F.L. Lewis: Control of a nonholonomic mobile robot using neural networks, IEEE Trans. Neural Netw. 9(4), 589-600 (1998)

80.15 K.T. Song, L.H. Sheen: Heuristic fuzzy-neural network and its application to reactive navigation of a mobile robot, Fuzzy Sets Syst. 110(3), 331-340 (2000)

80.16 A.M. Bloch, S. Drakunov: Tracking in nonholonomic dynamic system via sliding modes, Proc. IEEE Conf. Decis. Control, Brighton (1991) pp. 1127-1132

80.17 D. Chwa: Sliding-mode tracking control of nonholonomic wheeled mobile robots in polar coordinates, IEEE Trans. Control Syst. Tech. 12(4), 633-644 (2004)

80.18 R. Fierro, F.L. Lewis: Control of a nonholonomic mobile robot: Backstepping kinematics into dynamics, Proc. 34th Conf. Decis. Control, New Orleans (1995) pp. 3805-3810

80.19 T. Fukao, H. Nakagawa, N. Adachi: Adaptive tracking control of a non-holonomic mobile robot, IEEE Trans. Robot. Autom. 16(5), 609-615 (2000)

80.20 A.R. Sahab, M.R. Moddabernia: Backstepping method for a single-link flexible-joint manipulator using genetic algorithm, IJICIC 7(7B), 4161-4170 (2011)

80.21 J. Yu, Y. Ma, B. Chen, H. Yu, S. Pan: Adaptive neural position tracking control for induction motors via backstepping, IJICIC 7(7B), 4503-4516 (2011)

80.22 L. Astudillo, O. Castillo, L. Aguilar: Intelligent control for a perturbed autonomous wheeled mobile robot: A type-2 fuzzy logic approach, Nonlinear Stud. 14(1), 37-48 (2007)

80.23 R. Martinez, O. Castillo, L. Aguilar: Optimization of type-2 fuzzy logic controllers for a perturbed autonomous wheeled mobile robot using genetic algorithms, Inf. Sci. 179(13), 2158-2174 (2009)

80.24 O. Castillo, R. Martinez-Marroquin, P. Melin, J. Soria: Comparative study of bio-inspired algorithms applied to the optimization of type-1 and type-2 fuzzy controllers for an autonomous mobile robot, Stud. Comput. Intell. 256, 247-262 (2009)

80.25 I. Kolmanovsky, N.H. McClamroch: Developments in nonholonomic control problems, IEEE Control Syst. Mag. 15, 20-36 (1995)

80.26 G. Campion, G. Bastin, B. D'Andrea-Novel: Structural properties and classification of kinematic and dynamic models of wheeled mobile robots, IEEE Trans. Robot. Autom. 12(1), 47-62 (1996)

80.27 W. Nelson, I. Cox: Local path control for an autonomous vehicle, Proc. IEEE Conf. Robotics Autom. (1988) pp. 1504-1510

80.28 S. Oh, H. Jang, W. Pedrycz: A comparative experimental study of type-1/type-2 fuzzy cascade controller based on genetic algorithms and particle swarm optimization, Expert Syst. Appl. 38(9), 11217-11229 (2011)


81. Bio-Inspired Optimization Methods 


Fevrier Valdez 


Although graphic processing units (GPUs) have 
been traditionally used only for computer graphics, 
a recent technique called general-purpose com- 
puting on graphics processing units allows GPUs to 
perform numerical computations usually handled 
by the CPU (central processing unit). The advantage 
of using GPUs for general purpose computation is 
the performance speedup that can be achieved 
due to the parallel architecture of these devices. 
This chapter describes the use of bio-inspired optimization methods such as particle swarm optimization and genetic algorithms on GPUs to demonstrate the performance that can be achieved using this technology, primarily in comparison with CPUs.


81.1 Bio-Inspired Methods 


In this chapter we describe the optimization of a set of mathematical functions using bio-inspired algorithms. We use genetic algorithms (GAs), particle swarm optimization (PSO), simulated annealing (SA), and pattern search (PS) to optimize the functions. The main idea is to compare these metaheuristic methods using the CPU and GPUs. Several approaches have been taken to optimize mathematical functions, see, for example, [81.1-6]. Our approach, however, differs from these because we compare the execution of the methods on CPUs and GPUs, with the aim of obtaining the results quickly.

The main contribution of this work is the proposed approach for the implementation of bio-inspired optimization techniques on GPUs for optimization applications. The approach is illustrated with mathematical function optimization, but could be applicable to other problems.

This introduction is followed by a description of the bio-inspired methods in Sect. 81.2. In Sect. 81.3, a brief history of GPUs is presented, in Sect. 81.4 the experimental results are shown, and in Sect. 81.5 the conclusions are presented.


81.2 Bio-Inspired Optimization Methods 


To compare the performance on a CPU and a GPU, it is necessary to evaluate the methods on optimization problems. Some basic concepts of bio-inspired optimization are needed to understand the differences in the corresponding algorithms. Therefore, in this section we offer a brief description of the bio-inspired optimization methods used in this work.
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81.2.1 Genetic Algorithms 


Holland, from the University of Michigan, initiated his work on genetic algorithms at the beginning of the 1960s. His first achievement was the publication of Adaptation in Natural and Artificial Systems [81.7] in 1975.

He had two goals in mind: to improve the under- 
standing of the natural adaptation process and to design 
artificial systems having properties similar to natural 
systems [81.8]. 

The basic idea is as follows: the genetic pool of 
a given population potentially contains the solution, or 
a better solution, to a given adaptive problem. This so- 
lution is not active because the genetic combination on 
which it relies is split between several subjects. Only 
the association of different genomes can lead to the so- 
lution. 

Holland’s method is especially effective because it not only considers the role of mutation, but also uses genetic recombination (crossover) [81.9]. The essence of the GA in both theoretical and practical domains has been well demonstrated [81.1, 10]. The concept of applying a GA to solve engineering problems is feasible and sound. However, despite the distinct advantages of a GA for solving complicated, constrained, and multiobjective functions where other techniques may have failed, the full power of the GA application is yet to be exploited [81.11, 12].
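
The interplay of selection, crossover, and mutation described above can be sketched as follows. This is a minimal real-coded GA and not the implementation used in the chapter: the tournament selection, one-point crossover, Gaussian mutation, elitism, and all parameter values are assumptions chosen for illustration.

```python
import random

def sphere(x):
    """De Jong's (sphere) function; global minimum 0 at the origin."""
    return sum(v * v for v in x)

def genetic_algorithm(fitness, dim, pop_size=30, generations=100,
                      crossover_rate=0.8, mutation_rate=0.1, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(-5.12, 5.12) for _ in range(dim)]
           for _ in range(pop_size)]
    best = min(pop, key=fitness)

    def tournament(population):
        # Binary tournament: the fitter of two random individuals wins
        a, b = rng.sample(population, 2)
        return a if fitness(a) < fitness(b) else b

    for _ in range(generations):
        children = [best[:]]  # elitism: keep the best individual
        while len(children) < pop_size:
            p1, p2 = tournament(pop), tournament(pop)
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, dim)  # one-point crossover (dim >= 2)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Gaussian mutation applied gene by gene
            child = [g + rng.gauss(0.0, 0.1) if rng.random() < mutation_rate
                     else g for g in child]
            children.append(child)
        pop = children
        best = min(pop, key=fitness)
    return best, fitness(best)

best, value = genetic_algorithm(sphere, dim=4)
```

Because the best individual is copied unchanged into every new generation, the returned fitness can never be worse than that of the best individual in the initial population.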


81.2.2 Particle Swarm Optimization 


Particle swarm optimization (PSO) is a population- 
based stochastic optimization technique that was devel- 
oped by Eberhart and Kennedy in 1995, inspired by the 
social behavior of bird flocking or fish schooling [81.3]. 

PSO shares many similarities with evolutionary 
computation techniques such as GAs [81.13]. The sys- 
tem is initialized with a population of random solu- 
tions and searches for optima by updating generations. 
However, unlike the GA, the PSO has no evolution 
operators such as crossover and mutation. In PSO, 
the potential solutions, called particles, fly through the 
problem space by following the current optimum parti- 
cles [81.14]. 

Each particle keeps track of its coordinates in the problem space, which are associated with the best solution (fitness) it has achieved so far (the fitness value is also stored). This value is called pbest. Another best value that is tracked by the particle swarm optimizer is the best value obtained so far by any particle in the neighborhood of the particle. This location is called lbest. When a particle takes all the population as its topological neighbors, the best value is a global best and is called gbest.

The particle swarm optimization concept consists of, at each time step, changing the velocity of (accelerating) each particle toward its pbest and lbest locations (the local version of PSO). Acceleration is weighted by a random term, with separate random numbers being generated for acceleration toward the pbest and lbest locations.

In the past several years, PSO has been successfully applied in many research and application areas. It has been reported that PSO can obtain better results in a faster and cheaper way compared with other methods [81.15].
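
The velocity update described above can be sketched as follows. For brevity the sketch uses the global version (every particle takes the whole swarm as its neighborhood, so lbest coincides with gbest); the inertia weight `w`, the acceleration coefficients `c1` and `c2`, and the search range are assumptions and are not taken from the chapter.

```python
import random

def sphere(x):
    """De Jong's (sphere) function; global minimum 0 at the origin."""
    return sum(v * v for v in x)

def pso(fitness, dim, swarm=20, steps=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = random.Random(seed)
    pos = [[rng.uniform(-5.12, 5.12) for _ in range(dim)] for _ in range(swarm)]
    vel = [[0.0] * dim for _ in range(swarm)]
    pbest = [p[:] for p in pos]
    pbest_val = [fitness(p) for p in pos]
    g = min(range(swarm), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(steps):
        for i in range(swarm):
            for d in range(dim):
                # Separate random weights for the pbest and gbest attractions
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = fitness(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

gbest, gbest_val = pso(sphere, dim=4)
```

A local (lbest) variant would replace `gbest[d]` with the best position found among each particle's neighbors, for example its two adjacent particles in a ring.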


81.2.3 Simulated Annealing 


SA is a generic probabilistic metaheuristic for the 
global optimization problem of applied mathematics, 
namely locating a good approximation to the global op- 
timum of a given function in a large search space. It is 
often used when the search space is discrete (e.g., all 
tours that visit a given set of cities). For certain prob- 
lems, simulated annealing may be more effective than 
exhaustive enumeration provided that the goal is merely 
to find an acceptably good solution in a fixed amount of 
time, rather than the best possible solution. 

The name and inspiration come from annealing 
in metallurgy, a technique involving heating and con- 
trolled cooling of a material to increase the size of its 
crystals and reduce their defects. The heat causes the 
atoms to become unstuck from their initial positions 
(a local minimum of the internal energy) and wan- 
der randomly through states of higher energy; the slow 
cooling gives them more chances of finding configu- 
rations with lower internal energy than the initial one. 
By analogy with this physical process, each step of the 
SA algorithm replaces the current solution by a random 
nearby solution, chosen with a probability that depends 
both on the difference between the corresponding func- 
tion values and also on a global parameter T (called the 
temperature), which is gradually decreased during the 
process. The dependency is such that the current so- 
lution changes almost randomly when T is large, but 
increasingly downhill as T goes to zero [81.16]. 
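
The acceptance behavior described above can be sketched as follows. The text only states that the acceptance probability depends on the difference in function values and on T; the Metropolis rule exp(-Δ/T), the geometric cooling schedule, and all parameter values used here are common choices and should be read as assumptions.

```python
import math
import random

def sphere(x):
    """De Jong's (sphere) function; global minimum 0 at the origin."""
    return sum(v * v for v in x)

def simulated_annealing(fitness, x0, t0=1.0, cooling=0.99, steps=2000,
                        step_size=0.5, seed=0):
    rng = random.Random(seed)
    current, current_val = x0[:], fitness(x0)
    best, best_val = current[:], current_val
    t = t0
    for _ in range(steps):
        # Random nearby solution
        candidate = [v + rng.uniform(-step_size, step_size) for v in current]
        cand_val = fitness(candidate)
        delta = cand_val - current_val
        # Always accept improvements; accept worse moves with prob. exp(-delta/T)
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_val = candidate, cand_val
            if current_val < best_val:
                best, best_val = current[:], current_val
        t *= cooling  # gradually lower the temperature
    return best, best_val

start = [3.0, -2.0, 1.5, 4.0]
best, best_val = simulated_annealing(sphere, start)
```

At high temperature almost every candidate is accepted, so the search wanders; as T shrinks, uphill moves become increasingly unlikely and the search behaves like a downhill method.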


81.2.4 Pattern Search 


Pattern search is a family of numerical optimization methods that do not require the gradient of the problem to be optimized, and PS can hence be used on functions that are not continuous or differentiable. Such optimization methods are also known as direct-search, derivative-free, or black-box methods.

The name pattern search was coined by Hooke and Jeeves [81.17]. An early and simple PS variant is attributed to Fermi and Metropolis when they worked at the Los Alamos National Laboratory, as described by Davidon [81.18], who summarized the algorithm as follows:


They varied one theoretical parameter at a time by steps of the same magnitude, and when no such increase or decrease in any one parameter further improved the fit to the experimental data, they halved the step size and repeated the process until the steps were deemed sufficiently small.


81.3 A Brief History of GPUs


We have already looked at how central processors evolved in both clock speeds and core count. In the meantime, the state of graphics processing underwent a dramatic revolution. In the late 1980s and early 1990s, the growth in popularity of graphically driven operating systems such as Microsoft Windows helped create a market for a new type of processor. In the early 1990s, users began purchasing 2-D display accelerators for their personal computers. These display accelerators offered hardware-assisted bitmap operations to assist in the display and usability of graphical operating systems [81.19]. From a parallel-computing standpoint, NVIDIA's release of the GeForce 3 series in 2001 represents arguably the most important breakthrough in GPU technology. The GeForce 3 series was the computing industry’s first chip to implement Microsoft’s then new DirectX 8.0 standard. This standard required that compliant hardware contain both programmable vertex and programmable pixel shading stages. For the first time, developers had some control over the exact computations that would be performed on their GPUs [81.19].


81.3.1 CUDA


In November 2006, NVIDIA unveiled the industry’s first DirectX 10 GPU, the GeForce 8800 GTX. The GeForce 8800 GTX was also the first GPU to be built with NVIDIA’s CUDA architecture. This architecture included several new components designed strictly for GPU computing and aimed to alleviate many of the limitations that prevented previous graphics processors from being legitimately useful for general-purpose computation [81.19].


81.4 Experimental Results


This section presents the experimental results obtained with the optimization methods analyzed in this research. The main contribution of this paper is to demonstrate the advantages of using GPUs to calculate complex processes.

To validate the proposed method we used a set of five benchmark mathematical functions; all functions were evaluated with different numbers of dimensions. In this case, the experimental results were obtained with 32 dimensions.

Table 81.1 shows the definitions of the mathematical functions used in this paper. The global minimum for the test functions is 0.

Tables 81.2 and 81.3 show the experimental results for the benchmark mathematical functions used in this research using the CPU and the GPU to process the GA. The table shows the experimental results of the evaluations for each function with 32 dimensions; the best and worst values obtained with an average of 50 times can


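Table 81.1 Mathematical functions


Function                  Definition

De Jong’s                 $f(x) = \sum_{i=1}^{n} x_i^2$

Rotated hyper-ellipsoid   $f(x) = \sum_{i=1}^{n} \left( \sum_{j=1}^{i} x_j \right)^2$

Rosenbrock’s valley       $f(x) = \sum_{i=1}^{n-1} \left[ 100 (x_{i+1} - x_i^2)^2 + (1 - x_i)^2 \right]$

Rastrigin’s               $f(x) = 10n + \sum_{i=1}^{n} \left( x_i^2 - 10 \cos(2\pi x_i) \right)$

Griewank’s                $f(x) = \sum_{i=1}^{n} \frac{x_i^2}{4000} - \prod_{i=1}^{n} \cos\left( \frac{x_i}{\sqrt{i}} \right) + 1$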
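
For reference, the five benchmark functions can be written down directly. The sketch below uses their standard textbook definitions, each with a global minimum of 0, in plain Python; it is meant only as a compact restatement of Table 81.1 and is not the CPU or GPU code used for the experiments.

```python
import math

def de_jong(x):
    """Sphere function: sum of squares."""
    return sum(v * v for v in x)

def rotated_hyper_ellipsoid(x):
    """Sum over i of the squared partial sums x_1 + ... + x_i."""
    total, partial = 0.0, 0.0
    for v in x:
        partial += v
        total += partial * partial
    return total

def rosenbrock(x):
    """Rosenbrock's valley; minimum at x = (1, ..., 1)."""
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
               for i in range(len(x) - 1))

def rastrigin(x):
    return 10.0 * len(x) + sum(v * v - 10.0 * math.cos(2.0 * math.pi * v)
                               for v in x)

def griewank(x):
    s = sum(v * v for v in x) / 4000.0
    p = math.prod(math.cos(v / math.sqrt(i)) for i, v in enumerate(x, start=1))
    return s - p + 1.0

zeros = [0.0] * 32   # 32 dimensions, as in the experiments
ones = [1.0] * 32
```

Evaluating each function at its minimizer (the origin, or the vector of ones for Rosenbrock's valley) returns 0, which matches the global minimum stated for the test functions.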




Table 81.2 Experimental results with 32 dimensions with GA on a CPU

Function                  Average         Best                    Worst           Time (s)
De Jong’s                 0.00094         1.14 × 10^-6            0.0056          1.883603
Rotated hyper-ellipsoid   0.05371         0.00228                 0.53997         2.015548
Rosenbrock’s valley       3.14677173      3.246497                3.86201         3.001564
Rastrigin’s               82.35724        46.0085042              129.548         1.452212
Griewank’s                0.41019699      0.14192331              0.917367        2.548792


Table 81.3 Experimental results with 32 dimensions with GA on a GPU

Function                  Average         Best                    Worst           Time (s)
De Jong’s                 0.000084        1.14 × 10^-8            0.00040         0.360003
Rotated hyper-ellipsoid   0.005371        0.00228                 0.53997         0.004590
Rosenbrock’s valley       2.325468        1.97548                 3.86201         0.005594
Rastrigin’s               70.35724        41.54879                130.598         0.502254
Griewank’s                0.31019699      0.04192331              0.917367        0.920154


Table 81.4 Experimental results with 32 dimensions with PSO on a CPU

Function                  Average         Best                    Worst           Time (s)
De Jong’s                 [illegible]     3.40 × 10^-12           9.86 × 10^-11   2.5442154
Rotated hyper-ellipsoid   5.42 × 10^-11   1.93 × 10^-12           9.83 × 10^-11   1.2456487
Rosenbrock’s Valley       3.2178138       3.1063                  3.39178762      1.3659478
Rastrigin’s               34.169712       16.14508                56.714207       3.569871
Griewank’s                0.0114768       9.17 × 10^[illegible]   0.09483         5.2654587


Table 81.5 Experimental results with 32 dimensions with PSO on the GPU

Function                  Average         Best                    Worst           Time (s)
De Jong’s                 [illegible]     2.40 × 10^[illegible]   9.86 × 10^-11   0.05040454
Rotated hyper-ellipsoid   4.20 × 10^-11   2.30 × 10^[illegible]   9.83 × 10^-11   0.02045687
Rosenbrock’s Valley       3.1071308       2.16020                 3.39178762      0.03659470
Rastrigin’s               34.199999       15.14508                53.802564       0.05678710
Griewank’s                0.0201564       9.17 × 10^[illegible]   0.094831        0.02654580


Table 81.6 Experimental results with 32 dimensions with SA on a CPU

Function                  Average         Best                    Worst           Time (s)
De Jong’s                 0.1210          0.0400                  1.8926          3.0124
Rotated hyper-ellipsoid   0.9800          0.0990                  7.0104          3.0215
Rosenbrock’s Valley       1.2300          0.4402                  10.790          [illegible]
Rastrigin’s               25.8890         20.101                  33.415          3.2145
Griewank’s                0.9801          0.2045                  5.5678          4.0555


be seen after execution of the method. The processing 
time in seconds is also shown. 

Tables 81.4 and 81.5 show the experimental results for the
benchmark mathematical functions used in this research when the
PSO method is processed on the CPU and on the GPU. Table 81.4
shows the results of the evaluations for each function with 32
dimensions when processing is performed on a CPU: the average,
best, and worst values obtained over 50 executions of the method,
together with the processing time in seconds. Table 81.5 shows
the same information for the PSO executed on the GPU. The
differences between the two tables are easy to see: performance
on the GPU is clearly superior, most visibly in processing time
(for example, about 0.05 s instead of 2.54 s for De Jong’s
function).

Tables 81.6 and 81.7 show the experimental results for the
benchmark mathematical functions used in this research when the
SA is processed on the CPU and on the GPU. Each table reports,
for every function with 32 dimensions, the average, best, and
worst values obtained over 50 executions of the method, together
with the processing time in seconds.

Table 81.7 Experimental results with 32 dimensions with SA on a GPU

Function                 Average  Best     Worst        Time (s)
De Jong’s                0.10100  0.06012  1.2699       1.000124
Rotated hyper-ellipsoid  0.81200  0.0891   7.1003       1.001015
Rosenbrock’s Valley      1.31200  0.40002  10.1290      1.018787
Rastrigin’s              25.3256  21.100   [illegible]  1.010145
Griewank’s               0.99010  0.3050   6.50678      1.000325

Table 81.8 Experimental results with 32 dimensions with PS on the CPU

Function                 Average  Best     Worst    Time (s)
De Jong’s                0.3528   0.2232   2.0779   4.2521
Rotated hyper-ellipsoid  16.2505  3.1667   25.782   6.2154
Rosenbrock’s Valley      4.0568   3.0342   57765    5.2565
Rastrigin’s              31.4203  25.7660  33.9866  3.25654
Griewank’s               0.6897   0.0981   3.5061   2.1548

Table 81.9 Experimental results with 32 dimensions with PS on the GPU

Function                 Average  Best     Worst    Time (s)
De Jong’s                0.5208   0.1232   2579     1.1021
Rotated hyper-ellipsoid  16.5005  3.6197   250182   2.1154
Rosenbrock’s Valley      4.0588   3.00215  4.2565   2.5105
Rastrigin’s              31.5203  25.4530  33.9866  1.6054
Griewank’s               0.14970  0.00980  3.5061   1.4858

Tables 81.8 and 81.9 show the experimental results for the
benchmark mathematical functions used in this research when the
PS is processed on the CPU and on the GPU. Each table reports,
for every function with 32 dimensions, the average, best, and
worst values obtained over 50 executions of the method, together
with the processing time in seconds.
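The reporting protocol used in these tables (average, best, and worst value over 50 executions, plus processing time) can be sketched as follows. This is an illustrative Python sketch, not the authors' code: `random_search` is a hypothetical stand-in for GA, PSO, SA, or PS; `sphere` assumes the usual sum-of-squares form of De Jong's function with the search range [-5.12, 5.12]; and the time is assumed to be the total wall-clock time of all runs, which the text does not specify.

```python
import random
import time
from statistics import mean
from typing import Callable, Dict, Sequence


def sphere(x: Sequence[float]) -> float:
    """De Jong's function, assumed here to be the plain sum of squares."""
    return sum(xi * xi for xi in x)


def random_search(f: Callable[[Sequence[float]], float], dim: int,
                  iters: int, rng: random.Random,
                  lo: float = -5.12, hi: float = 5.12) -> float:
    """Hypothetical placeholder optimizer: best value among random samples."""
    best = float("inf")
    for _ in range(iters):
        candidate = [rng.uniform(lo, hi) for _ in range(dim)]
        best = min(best, f(candidate))
    return best


def summarize_runs(run_once: Callable[[int], float],
                   runs: int = 50) -> Dict[str, float]:
    """Execute the method `runs` times and report the table columns."""
    start = time.perf_counter()
    results = [run_once(seed) for seed in range(runs)]
    elapsed = time.perf_counter() - start
    return {
        "average": mean(results),
        "best": min(results),    # minimization: smallest value is best
        "worst": max(results),
        "time_s": elapsed,       # total time for all runs (assumption)
    }


if __name__ == "__main__":
    stats = summarize_runs(
        lambda seed: random_search(sphere, 32, 200, random.Random(seed))
    )
    print(stats)
```

Seeding each run separately keeps the 50 executions independent and reproducible, so CPU and GPU variants of the same method can be compared on identical starting conditions.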

Figure 81.1 compares the processing times on the GPU and on the
CPU, using the best time obtained in each of the experiments
discussed in this paper. The blue line represents the processing
time on the CPU and the brown line represents the processing time
on the GPU. It is clear that the best times were achieved when
the algorithms were executed on the GPU.

Fig. 81.1 Comparison of results between GPU and CPU
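The time comparison can also be expressed as a speedup ratio, that is, CPU time divided by GPU time. The sketch below uses the GA processing times reported in Tables 81.2 and 81.3; the `speedups` helper and the dictionary names are introduced here for illustration only and are not part of the original study.

```python
from typing import Dict

# Processing times in seconds for the GA, copied from Tables 81.2 (CPU)
# and 81.3 (GPU).
ga_cpu_time: Dict[str, float] = {
    "De Jong's": 1.883603,
    "Rotated hyper-ellipsoid": 2.015548,
    "Rosenbrock's valley": 3.001564,
    "Rastrigin's": 1.452212,
    "Griewank's": 2.548792,
}
ga_gpu_time: Dict[str, float] = {
    "De Jong's": 0.360003,
    "Rotated hyper-ellipsoid": 0.004590,
    "Rosenbrock's valley": 0.005594,
    "Rastrigin's": 0.502254,
    "Griewank's": 0.920154,
}


def speedups(cpu: Dict[str, float], gpu: Dict[str, float]) -> Dict[str, float]:
    """Return CPU time / GPU time per function; values above 1 favor the GPU."""
    return {name: cpu[name] / gpu[name] for name in cpu}


if __name__ == "__main__":
    for name, ratio in speedups(ga_cpu_time, ga_gpu_time).items():
        print(f"{name}: {ratio:.1f}x")
```

A ratio is easier to compare across methods than a raw time difference, because the CPU times of the four methods differ from one another.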
81.5 Conclusions


The analysis of the experimental results of the bio-inspired
methods considered in this paper, the FPSO+FGA (FPSO: fuzzy
particle swarm optimization; FGA: fuzzy genetic algorithm), leads
us to the conclusion that, for the optimization of these benchmark
mathematical functions, execution on the GPU is a good
alternative, because it is easier and much faster to optimize and
achieve good results than to try it with